Data Lakes and Swamps, Oh My
An introduction to the concepts of data lakes and data swamps, and one developer's take on how data lakes can prove useful.
I was lamenting to my friend and fellow MVP Shamir Charania that I didn't have a topic for this week's blog post, so he and his colleague suggested I write about data lakes, and specifically the Azure Data Lake.
What Is a Data Lake?
This is what Wikipedia says:
A data lake is a system or repository of data stored in its natural format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video). A data swamp is a deteriorated data lake either inaccessible to its intended users or providing little value.
In my opinion, the Wikipedia definition has too many words, so let's rewrite it:
A data lake is a repository of enterprise data stored in its original format. This may take the form of one or more of the following:
- structured data from relational databases (rows and columns).
- semi-structured data (CSV, log files, XML, JSON).
- unstructured data (emails, documents, PDFs).
- binary data (images, audio, video).
(I thought the term "data swamp" was a joke, but it's 2018 and nothing shocks me anymore.)
If that definition of a data lake sounds like a file system, I'd agree. If it sounds like SharePoint, I'm not going to argue either.
However, the main premise of a data lake is a single point of access for all of an organization's data, which can be effectively managed and maintained. To differentiate "data lake" from "file system," then, we need to talk about scale. Data lakes are measured in petabytes of data.
Whoa, What's a Petabyte?
For dinosaurs like me who still think in binary, a petabyte (referred to by some as a pebibyte) is 1,024 terabytes (tebibytes), or 1,125,899,906,842,624 bytes (yes, that's 16 digits).
In the metric system, a petabyte is 1,000 terabytes, or 1,000,000,000,000,000 bytes.
No matter which counting system we use, a petabyte is roughly one million billion bytes: exactly that in the metric system, and about 12.6% more in binary. That's a lot of data.
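To make the two counting systems concrete, here is a minimal Python sketch of the arithmetic above. The names PEBIBYTE, PETABYTE, and describe are my own labels for illustration, not anything standard.

```python
# Binary vs. metric interpretations of "petabyte".
PEBIBYTE = 2 ** 50    # 1,024 tebibytes, the binary figure
PETABYTE = 10 ** 15   # 1,000 terabytes, the metric figure


def describe(label, size_bytes):
    """Format a byte count with thousands separators and its digit count."""
    return f"{label}: {size_bytes:,} bytes ({len(str(size_bytes))} digits)"


# How much larger the binary figure is than the metric one.
difference = PEBIBYTE - PETABYTE
percent_larger = difference / PETABYTE * 100

print(describe("pebibyte", PEBIBYTE))
print(describe("petabyte", PETABYTE))
print(f"difference: {difference:,} bytes ({percent_larger:.1f}% larger)")
```

Both figures come out at 16 digits, and the gap between them is itself well over a hundred terabytes.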
Who, What, How?
Internet companies including search engines (Google, Bing), social media companies (Facebook, Twitter), and email providers (Yahoo!, Outlook.com) are managing data stores measured in petabytes. On a daily basis, these organizations handle all sorts of structured and unstructured data.
Assuming they put all their data in one repository, that could technically be thought of as a data lake. These organizations have adapted existing tools and even created new technologies to manage data of this magnitude in a field called big data.
The short version: big data is not a 100 GB SQL Server database or data warehouse. Big data is a relatively new field that came about because traditional data management tools are simply unable to deal with such large volumes of data. Even so, a single SQL Server database can allegedly be more than 500 petabytes in size. Michael J. Swart warns us, though: if you're using over 10% of what SQL Server restricts you to, you're doing it wrong.
Big data is where we hear about processes like Google's MapReduce. The Apache Software Foundation has its own open-source implementation of MapReduce, called Hadoop. Later, Apache Spark was developed to solve some of the limitations inherent in the MapReduce cluster computing paradigm.
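If MapReduce is new to you, the idea can be sketched on a single machine as a word count in plain Python. This is only a toy illustration of the map, shuffle, and reduce steps. It does not use Hadoop or Spark, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical names I picked for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # A real cluster would run the map and reduce steps in parallel
    # across many machines; here everything runs in one process.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


documents = ["data lake data swamp", "big data lake"]
counts = word_count(documents)
print(counts)
```

The point of the paradigm is that each map call and each reduce call is independent of the others, which is what lets a framework spread the work over a cluster.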
What Is the Azure Data Lake?
From a high level of abstraction, we can think of the Azure Data Lake as an infinitely large hard drive. It leverages the resilience, reliability, and security of Azure Storage you already know and love. Then, using Hadoop and other toolsets in the Azure environment, data can be queried, manipulated and analyzed in the same way we might do it on-premises, but leveraging the massive parallel processing of cloud computing combined with virtually limitless storage.
Note: Microsoft is not the only player in this space. Other cloud vendors like Google Cloud Platform (GCP) and Amazon Web Services (AWS) offer roughly equivalent services for roughly equivalent prices.
Our New Definition
With all of that taken into consideration, here is my new definition for "data lake":
A data lake is a single repository for all enterprise data, in its natural format, which can be effectively managed and maintained using a number of big data technologies.
Published at DZone with permission of Randolph West, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.