Building a data lake involves several steps. In this article we will cover the basic process of building a scalable one.
Choose a location for your data lake
Before you start building a data lake, you have to decide where it will actually reside. This usually means choosing between on-premises data storage and cloud data storage. Whichever you choose, your data store needs to be scalable, available, performant, secure, and redundant.
Amazon Web Services, Microsoft Azure, and Google Cloud Platform all offer great platforms for building a cloud-based data lake.
Ingest data into your data lake
Once you have decided where your data lake will be located, it is time to begin “loading” data into it. We prefer the term “ingest” over the term “load” because “loading” often carries connotations of the Extract/Transform/Load (ETL) process familiar from data warehouses. Loading a data warehouse involves a lot of data preparation, whereas a data lake ingests data in its rawest form.
This data ingestion can take many forms:
Directing log files to the data lake. Log file storage and analysis is a very common use case for a data lake. You can configure applications to have their log files either written directly to data lake storage, or cached and bundled at defined intervals to be sent to the data lake.
Pointing application-generated data to the data lake. Many applications take in or generate raw binary data – images, schematics, audio, etc. – and can use the data lake as their back-end storage. This is a particularly good workflow for machine learning, because the incoming data can be processed by machine learning algorithms on the back end for further insight.
Extracting data from other systems. You might have data stores that are already running, such as application databases, data marts, or data warehouses, and want that data in a more “raw” form so that it can be reshaped into different formats.
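As a rough illustration of the "cached and bundled" log ingestion described above, the sketch below writes a batch of raw log lines, unmodified, to a date-partitioned path. It assumes a local directory stands in for data lake storage; the function names `ingest_batch` and `read_batch`, the `raw/<source>/YYYY/MM/DD` layout, and the file naming are our own illustrative choices, not a standard.

```python
import gzip
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def ingest_batch(lake_root: Path, source: str, lines, now: datetime) -> Path:
    """Write one batch of raw lines, untransformed, under raw/<source>/YYYY/MM/DD."""
    partition = lake_root / "raw" / source / now.strftime("%Y/%m/%d")
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / f"{source}-{now.strftime('%H%M%S')}.log.gz"
    with gzip.open(target, "wt", encoding="utf-8") as fh:
        for line in lines:
            # Keep the content as-is; only normalize the trailing newline.
            fh.write(line.rstrip("\n") + "\n")
    return target


def read_batch(path: Path):
    """Read a stored batch back as a list of lines."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return fh.read().splitlines()


if __name__ == "__main__":
    # A temporary directory plays the role of the data lake (an assumption).
    with tempfile.TemporaryDirectory() as tmp:
        stamp = datetime.now(timezone.utc)
        stored = ingest_batch(Path(tmp), "webapp", ["GET /index 200", "GET /missing 404"], stamp)
        print(stored.relative_to(tmp), read_batch(stored))
```

Nothing is parsed or cleaned at this stage. The only structure imposed is the path, which makes it easier to find a given source and day later.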
Transform data in the data lake into analytics formats
Once the data is in the data lake, it is often transformed into other formats and loaded into tools suited to a specific kind of analysis.
There are several common use cases for the data in a data lake; these use cases are outlined below:
Low Latency Search – Often data from data lakes is in document form and an organization needs a way to search this information quickly. These documents are stored in their base form in the data lake and then imported into search tools such as Elasticsearch.
Data Warehousing – Data lake information is often used to feed data warehouses. In our first article we discussed the difference between data lakes and data warehouses. Data lakes are NOT a replacement for data warehouses at most levels. Instead, the data lake sits in front of one or more data warehouses and data marts that draw data from it. Traditional data cleansing procedures are performed on the data as it is copied from the data lake to the target data warehouse.
Data Visualization/Dashboards – Data lake data is often fed into specialized data visualization systems that are then used to build complex dashboards and data visualizations. A number of tools exist to visualize this data.
Machine Learning/AI – In the past decade the cost of computing power has fallen to the point where machine learning/AI algorithms are available to most organizations. Machine learning algorithms are data gluttons – the more diverse data you can feed into the algorithms the better you can hone your machine learning algorithms and the more faithful their generated models will be to real-world conditions. The use of data lakes in feeding machine learning will only increase in the next five years and organizations that do not have a plan to use such technology will face stiff competition from those that do.
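To make the data warehousing case more concrete, here is a minimal sketch of cleansing records as they are copied from the lake to a warehouse. It assumes an in-memory SQLite database stands in for the warehouse, and the `RAW_SALES` records, the `sales` table, and the cleansing rules (trim whitespace, drop duplicates, drop unparseable amounts) are made-up examples rather than a recommended rule set.

```python
import sqlite3

# Hypothetical raw records as they might sit in the lake: strings, duplicates, bad values.
RAW_SALES = [
    {"order_id": "1001", "customer": "  Acme Corp ", "amount": "250.00"},
    {"order_id": "1002", "customer": "Globex", "amount": "n/a"},
    {"order_id": "1001", "customer": "Acme Corp", "amount": "250.00"},
    {"order_id": "1003", "customer": "initech", "amount": "99.5"},
]


def cleanse(records):
    """Yield typed, de-duplicated rows; skip records whose amount cannot be parsed."""
    seen = set()
    for rec in records:
        order_id = rec["order_id"].strip()
        if order_id in seen:
            continue  # duplicate order
        try:
            amount = float(rec["amount"])
        except ValueError:
            continue  # unparseable amount
        seen.add(order_id)
        yield int(order_id), rec["customer"].strip(), amount


def load_warehouse(conn, records):
    """Create the target table if needed and insert the cleansed rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales ("
        "order_id INTEGER PRIMARY KEY, customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    rows = list(cleanse(records))
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(load_warehouse(conn, RAW_SALES), "rows loaded")
```

The raw records stay untouched in the lake. Only the copy headed for the warehouse is typed and filtered, so a different consumer can apply different rules to the same source.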
Once the data is loaded into the data lake, it can be operated upon for further insight. There are several common “use cases” for this data manipulation:
The extraction and manipulation phase of a data lake takes in data from a data lake and changes it into different formats for later analysis. To use the example of logs, the data lake might have file parsers that run over the contents of the raw logs, extract interesting entries, and move that data into a different data analysis tool that will allow end users to track interesting events.
Some extraction/manipulation is used to prepare large amounts of data for machine learning algorithms. The output from this manipulation is fed into frameworks such as TensorFlow or PyTorch, often from within a Jupyter Notebook, and used for further analysis.
Some data is taken from the data lake, which becomes the authoritative raw data source, and repurposed into data marts. For example, sales data might rest in the data lake in raw format and then be extracted into a data mart for executive analysis, as well as into a CRM tool for customer relationship tracking.
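The log example above can be sketched as a small parser that runs over raw lines and keeps only the entries worth tracking. The log format (timestamp, level, message), the `LINE_RE` pattern, and the choice of ERROR and WARN as the "interesting" levels are assumptions made for illustration; real logs will need their own pattern.

```python
import re
from collections import Counter

# Assumed line format: "<timestamp> <LEVEL> <message>"
LINE_RE = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<message>.*)$")


def extract_events(raw_lines, levels=("ERROR", "WARN")):
    """Yield a dict for each line that parses and whose level is of interest."""
    for line in raw_lines:
        match = LINE_RE.match(line.strip())
        if match and match["level"] in levels:
            yield match.groupdict()


def summarize(events):
    """Count extracted events per level, e.g. to feed a dashboard."""
    return Counter(event["level"] for event in events)


if __name__ == "__main__":
    sample = [
        "2024-01-15T09:30:00Z INFO service started",
        "2024-01-15T09:30:05Z ERROR database timeout",
        "not a structured line",
        "2024-01-15T09:31:00Z WARN disk usage at 91%",
    ]
    events = list(extract_events(sample))
    print(events)
    print(summarize(events))
```

Lines that do not match the pattern are skipped rather than raising an error, since raw lake data rarely conforms to one format throughout.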
Recycle analytics data for further insight
The process of ingesting, extracting, and analyzing often generates its own data and forms a circle. We refer to this circle as data composting. Data comes in from applications, is analyzed by machine learning algorithms or other analytics tools, and that process often generates its own new data which is fed back into the data lake and used in future iterations of data analytics.
Data Lake Tools
There are a number of tools available to you as you build your data lake. These tools facilitate the various steps of building the data lake, from storage and ingestion to analysis and search. We will explore several of these toolsets here.
Apache Toolsets for data lakes
Hadoop is an open source framework for handling big data. At its core is the Hadoop Distributed File System (HDFS). What this means is that you can feed data into one point of entry, and then Hadoop divides the storage of that information across many computing nodes. This provides several advantages. First, it ensures data redundancy. Each file is split into blocks, and each block is replicated on several nodes (three by default), so your data lake can remain fully functional even if some of the Hadoop nodes are not operational. This protects the availability of the data lake in the event of hardware or software problems.
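The effect of replication on availability can be illustrated with a toy model. The sketch below is not how HDFS actually chooses nodes (its real placement policy is rack-aware); it simply assigns each block to several distinct nodes in round-robin order, then checks which blocks are still readable after some nodes fail. The names `place_blocks` and `readable_blocks` and the node labels are hypothetical.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (simplified)."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed):
    """Return the blocks that still have at least one replica on a live node."""
    return [
        block
        for block, replicas in placement.items()
        if any(node not in failed for node in replicas)
    ]


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4", "n5"]
    placement = place_blocks(4, nodes)
    print(placement)
    print(readable_blocks(placement, failed={"n1", "n2"}))
```

With three replicas per block, any two nodes can fail in this model without losing access to a block, because at least one copy always sits on a surviving node.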
Hadoop’s distributed nature powers its ability to perform fast querying over large-scale data sets. Hadoop’s core processing model, MapReduce, allows a user to define instructions for operations upon data as a “map” step and a “reduce” step. Once Hadoop has the instructions, it coordinates their execution across its many data nodes. Because the data is already distributed within the Hadoop architecture, Hadoop queries in practice run across many nodes simultaneously. This “divide and conquer” approach allows for fast querying over enormous datasets. A search or calculation that might take a very long time on one computing node can be reduced to a fraction of the time when distributed over many nodes.
Hadoop contains some querying facilities, but Apache Spark is focused almost exclusively on performing intensive queries on large data sets. It allows you to connect to large data sets stored in many places and perform parallel operations on them. It offers APIs for Scala, Java, Python, and R, and supports SQL queries.
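The “divide and conquer” idea behind MapReduce can be shown with a word count written in plain Python. This is a sketch of the programming model only, not of Hadoop's API: each chunk of lines stands in for the data held by one node, and a thread pool stands in for the cluster. The function names are our own.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of lines."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key across every chunk."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values for each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # Each chunk is mapped independently, which is what lets the work spread out.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    chunks = [["the lake stores data"], ["the warehouse stores cleansed data"]]
    print(word_count(chunks))
```

The map step never needs to see more than its own chunk, so in a real cluster it can run on the node that already holds that data. Only the much smaller grouped results have to move between nodes.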
Elastic Tools (ELK stack) for Data Lakes
Elasticsearch is built around JSON documents and allows for fast and easy management of documents within the document store. Fast document search is the primary reason why Elasticsearch is deployed as part of a data lake. Like Hadoop, Elasticsearch can be configured to spread its data over many computing/storage nodes, but it presents a unified interface where data can easily be added, removed, updated, and searched, all through a JSON-based REST API.
Elastic offers Beats and Logstash as ways to move data into Elasticsearch. Beats are lightweight agents that run on your servers and ship high-frequency data such as server logs and network traffic. Logstash is a pipeline service that can collect, transform, and forward that data to Elasticsearch. Both are designed to be easy to configure.
Kibana is the data visualization arm of the Elastic family. It lets users create dashboards based on queries run against the backend Elasticsearch store. Kibana is used frequently in server monitoring and in tracking web hits for various properties, but it can be configured to provide dashboards of all sorts of data.
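To give a feel for the JSON involved, the sketch below builds two request bodies without contacting a server: a newline-delimited body of the kind accepted by the bulk indexing endpoint (`_bulk`), and a simple full-text `match` query for the search endpoint (`_search`). The index name `lake-logs`, the `message` field, and the helper names are hypothetical; sending the requests is left out on purpose.

```python
import json


def bulk_index_body(index, docs):
    """Build a newline-delimited bulk body: one action line, then one document line."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    # The bulk format requires the body to end with a newline.
    return "\n".join(lines) + "\n"


def match_query(field, text, size=10):
    """Build a full-text match query returning at most `size` hits."""
    return {"size": size, "query": {"match": {field: text}}}


if __name__ == "__main__":
    docs = [{"message": "database timeout"}, {"message": "service started"}]
    print(bulk_index_body("lake-logs", docs))
    print(json.dumps(match_query("message", "timeout"), indent=2))
```

Both the documents and the query are ordinary JSON, which is why documents stored in their base form in a data lake are usually straightforward to import.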
TensorFlow and PyTorch for data lakes
TensorFlow and PyTorch are open source machine learning frameworks that data scientists use to build and train machine learning models.