What is Big Data and how to approach it?
Thirty years ago, only a very small sliver of organizations had to worry about “big data”. Data storage was relatively expensive, and developing the processes to capture that data was time-consuming. Most organizations were happy to develop a few procedures to move their data into data warehouses.
Much has changed in the ensuing decades. Regulations have dictated that organizations save much more data than before. The prices of storage and compute power have dropped to the point where it feels like a mistake NOT to save as much data as possible. Cloud platforms have emerged that have enabled a “save everything” approach. Many organizations that would not have called themselves “data driven” a decade ago are struggling with the big data problem.
Traditionally, data warehouses have been used to deal with the increasing volume of data. However, over the past decade, the data landscape has changed and data warehouses are no longer the best “first pass” data storage. Data warehouses are optimized around well-structured data that has been cleansed and formatted for analytics. They do not work well with image data, JSON/XML, or high-frequency data such as server logs and IoT events.
Solution: the Data Lake
The concept of a data lake has arisen in response to several relatively new trends in organizational data storage:
The price of permanent data storage has become incredibly low, making it cost-effective to save data that in previous years might have been discarded in the name of preserving storage space.
The march of data collection and analysis (as well as emerging regulations) dictates that what might not be useful today might have some use tomorrow. Instead of carefully choosing a few fields from a data file, organizations are opting to store ALL the data in the anticipation that they may find some use for it in the future.
Much of the organizational data is in formats that are not “relational database friendly”. This includes image and audio data, office document formats, raw JSON/XML data, and other data that is not easily reducible to rows and columns.
High-frequency data endpoints (server logs, website hits, IoT events) account for a growing share of the data being generated. This data needs to be stored, but the speed at which it piles up dictates that it needs a very simple storage mechanism.
Relational data stores, such as traditional data warehouses, are now being seen as just one of a number of options for data storage and retrieval. Increasingly the same data is being stored in several different applications depending on usage. For example, product data might be fed into a document data store for fast search, a data mart to drive executive dashboards, and a traditional relational data warehouse for longer term analytics.
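To make the “very simple storage mechanism” for high-frequency data a little more concrete, here is a minimal sketch that appends each event as one line of JSON to a file. The function names, the file name, and the event fields are all hypothetical and chosen only for illustration; a real lake would typically write to object storage or a distributed file system rather than a local temp directory.

```python
import json
import tempfile
import time
from pathlib import Path


def append_event(log_path: Path, event: dict) -> None:
    """Append one event as a single JSON line. No schema or index is required."""
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def read_events(log_path: Path) -> list[dict]:
    """Parse the log back into dictionaries once someone needs the data."""
    with log_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    # A temp directory stands in for the lake's storage (an assumption).
    log = Path(tempfile.mkdtemp()) / "events.jsonl"
    # Events from different sources do not need to share the same fields.
    append_event(log, {"source": "web", "path": "/home", "ts": time.time()})
    append_event(log, {"source": "iot", "sensor": "t-17", "temp_c": 21.5, "ts": time.time()})
    print(len(read_events(log)), "events stored")
```

Note that the writer does no validation at all. Deciding which fields matter is deferred until the data is read.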
Data Warehouses vs. Data Lakes
At its root a data lake is nothing more than a highly scalable and performant raw data store where data can be staged. What you do with that data is up to you.
The following are some of the differences between data lakes and more traditional data warehouses:
Data warehouses:
Data is usually highly structured in nature and exists in the form of rows and columns.
Elaborate Extract/Transform/Load (ETL) procedures are developed to move data into the data warehouse.
Data is structured with a particular end application (usually analytics) in mind.
The back-end storage of the data warehouse often necessitates a commitment to a particular database vendor.
Data warehouses often have difficulty scaling beyond a certain size.
Data lakes:
Data does not need to be highly structured and can exist in any raw format.
Data storage is as simple as writing files to disk.
Data can be massaged into other formats and exported for many different kinds of use cases.
The back-end is usually simple disk storage and can be moved from one place to another.
Scale is limited only by available storage space.
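The data lake side of this comparison can be sketched in a few lines. The example below first lands a raw file untouched, then later reshapes it into rows and columns for one particular use. The directory layout (raw/source/date), the function names, and the order fields are assumptions made for this illustration, not a standard.

```python
import csv
import json
import tempfile
from datetime import date
from pathlib import Path


def land_raw(lake_root: Path, source: str, name: str, payload: bytes, day: date) -> Path:
    """Write the payload unchanged under <root>/raw/<source>/<YYYY-MM-DD>/<name>."""
    target_dir = lake_root / "raw" / source / day.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_bytes(payload)  # any format works: JSON, images, logs
    return target


def export_orders_csv(raw_file: Path, out_file: Path) -> int:
    """Later, massage one raw JSON file into a CSV for a specific consumer."""
    orders = json.loads(raw_file.read_text(encoding="utf-8"))
    with out_file.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["order_id", "total"])
        for order in orders:
            writer.writerow([order["id"], order["total"]])
    return len(orders)


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())  # stand-in for the lake's storage
    payload = b'[{"id": 1, "total": 9.5}, {"id": 2, "total": 3.0}]'
    raw = land_raw(root, "shop", "orders.json", payload, date.today())
    rows = export_orders_csv(raw, root / "orders.csv")
    print(f"landed {raw.name}, exported {rows} rows")
```

The raw file stays in place after the export, so a different consumer can later read the same bytes and shape them another way.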
What makes a great Data Lake?
Data lake implementation details can vary. Some lakes are implemented on premises. In recent years organizations have started to shift toward cloud-based data lakes. Regardless of the implementation details, well-architected data lakes share a number of characteristics.
Scalable – There is not much point in creating a data lake if your data storage solution is not highly scalable. Data lakes can grow to petabytes and even exabytes in size depending on your storage needs. Cloud vendors have for the most part solved this problem by offering access to virtually unlimited storage. On-premises data lakes will have to deal with this problem by acquiring extensive network-attached storage (NAS).
Available – This means that the data lake can be accessed by all the people and applications that need to use it, whether for writing data for storage or reading data for analytics or other uses. Cloud vendors solve this problem through partitioning and access roles. On-premises solutions will involve similar provisioning of network/storage access.
Performant – Because of the amount of data being ingested, data lake storage needs to be highly performant, both from the network access perspective and the actual disk storage perspective. This is one area where on-premises storage perhaps has an advantage over cloud storage. Internal network connections are often much faster and offer far lower latency than the external network connections required to move data to the cloud. Some of this difference is mitigated if many of your “data generating applications” already exist in cloud infrastructure. Flexibility in disk storage is an advantage of cloud providers – plans exist where you can direct your “freshest” data to highly responsive data stores, and as data ages, it can be archived to slower, lower-priority (and less expensive) storage.
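The idea of sending fresh data to fast storage and aging data to cheaper storage can be expressed as a simple rule. The sketch below is only an illustration: the tier names and the 30-day and 365-day thresholds are made up, and cloud providers usually let you declare this kind of rule as a storage lifecycle policy rather than writing code for it.

```python
from datetime import date, timedelta

# Hypothetical tiers and age limits, chosen only for this example.
TIERS = [
    (30, "hot"),    # younger than 30 days: fastest, most expensive storage
    (365, "cool"),  # younger than a year: slower, cheaper storage
]
ARCHIVE_TIER = "archive"  # everything older: slowest, least expensive storage


def choose_tier(created: date, today: date) -> str:
    """Pick a storage tier based on how old the data is."""
    age_days = (today - created).days
    for max_age_days, tier in TIERS:
        if age_days < max_age_days:
            return tier
    return ARCHIVE_TIER


if __name__ == "__main__":
    today = date.today()
    for age in (1, 90, 800):
        print(age, "days old ->", choose_tier(today - timedelta(days=age), today))
```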
Secure – With so much critical data in one place, security issues become paramount. Unfortunately, data security often runs at cross-purposes with other elements in this checklist. There are several layers to security. Cloud providers often handle operating system and network security for you, but, in theory, their data stores are accessible to anyone with the right permissions. In contrast, an on-premises solution might allow you to hide the data behind your network firewalls, but at the end of the day you are responsible for patching, monitoring and mitigating all attacks against your network and infrastructure.
Redundant – Because so much organizational data is being stored in the data lake, it is critical to provide some sort of redundancy to the data lake. Cloud vendors handle this process by allowing you to mirror data storage over a number of availability zones. This physical separation of data copies reduces the risk of a catastrophic event wiping out years of organizational data.
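As a rough picture of what mirroring involves, the sketch below copies a file to several replica locations and verifies each copy against a checksum of the original. Plain local directories stand in for availability zones here, which is purely an assumption for illustration. With a cloud vendor, replication is normally a configuration option of the storage service rather than code you write yourself.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mirror(source: Path, replica_roots: list[Path]) -> list[Path]:
    """Copy source into every replica root and verify each copy."""
    expected = sha256_of(source)
    copies = []
    for root in replica_roots:
        root.mkdir(parents=True, exist_ok=True)
        copy = root / source.name
        shutil.copy2(source, copy)
        if sha256_of(copy) != expected:
            raise IOError(f"checksum mismatch for {copy}")
        copies.append(copy)
    return copies


if __name__ == "__main__":
    base = Path(tempfile.mkdtemp())
    original = base / "orders.json"
    original.write_bytes(b'{"id": 1}')
    # Hypothetical "zones": in practice these would be physically separate sites.
    zones = [base / "zone-a", base / "zone-b"]
    print([str(p) for p in mirror(original, zones)])
```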