Getting Started With Data Lakes

Table of Contents

Introduction
What are Data Lakes?
The Data Lake: A Pragmatic Solution for Enterprise Analytics
Building a Data Lake
Conclusion

Section 1

Introduction

As technology continues to evolve and new data sources, such as IoT sensors and social media, churn out large volumes of data, there has never been a better time to discuss the possibilities and challenges of managing such data for varying analytical insights. In this Refcard, we dig deep into how data lakes solve the problem of storing and processing enormous amounts of data. Along the way, we also explore the benefits of data lakes, their use cases, and how they differ from data warehouses (DWHs).

This is a preview of the Getting Started With Data Lakes Refcard. To read the entire Refcard, please download the PDF from the link above.

Section 2

What are Data Lakes?

As data types and the business objectives they serve multiplied, data warehouses evolved as a solution model that removed data silos by integrating isolated data sources. DWHs allowed the data within these sources to be queried and viewed holistically, and they enabled efficient data audit and governance. However, several shortcomings of this model soon became apparent. One primary issue was its inability to store raw, unstructured data. In addition, the hardware and software supporting warehouses were expensive, and because storage was tightly coupled to the compute engine, the model could not achieve economies of scale.

The Emergence of Data Lakes

These challenges called for an evolved architecture that could store large data sets in a distributed environment and process them in parallel. Data lakes emerged as that architecture, providing organizations with a single repository for structured and unstructured data drawn from various relational databases, DWHs, and other data stores, for efficient analysis and operational insights.

Section 3

The Data Lake: A Pragmatic Solution for Enterprise Analytics

The traditional underlying tools for business intelligence and enterprise analytics were the data warehouse and the data mart. Although these structures served their purpose, they had inherent issues when it came to analytics. The data lake model gradually succeeded them and solved several of these problems, as noted below.

Table: Issues in traditional systems that data lakes solve

Problem	Solution
The ingestion process in traditional systems requires substantial effort and cost due to the need to structure data.	Data lakes allow all data to flow into them, so the ingestion process does not require much effort and maintains low costs.
DWHs store only structured data, so it is not possible to further define artificial objects or synthetic data to aggregate valuable information.	Data lakes allow users to distill data on demand based on varying business needs; analysts can further discover new patterns and relationships in the data store.
Defining analysis configurations on data upfront is compulsory with DWHs and data marts, so systems are not flexible enough to meet evolving business needs.	Data lakes enable analytics on an ad-hoc basis.
Integrating DWHs and data marts with decision systems is technically feasible, though the process is considered mostly ineffective due to time delays in receiving data from the source.	Data lakes can integrate with real-time decision systems to support precise analysis for mission-critical processes.
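
The contrast in the table between structuring data at ingestion and distilling it on demand can be illustrated with a small sketch. This is not taken from the Refcard: the `RAW_ZONE` list, the event fields, and the function names are all hypothetical, and an in-memory list stands in for the lake's storage. It only shows the general idea of storing records as they arrive and applying structure at read time.

```python
import json
from collections import defaultdict

# Hypothetical raw events, kept exactly as they arrived (no upfront schema).
RAW_ZONE = [
    '{"source": "sensor", "device": "t-1", "temp_c": 21.5}',
    '{"source": "social", "user": "ana", "text": "great product"}',
    '{"source": "sensor", "device": "t-2", "temp_c": 19.0}',
    '{"source": "sensor", "device": "t-1", "temp_c": 22.5}',
]


def ingest(store, raw_line):
    """Ingestion only appends the raw line; nothing is parsed or validated."""
    store.append(raw_line)


def read_with_schema(store, required_fields):
    """Apply structure at read time: keep records that carry every required field."""
    for line in store:
        record = json.loads(line)
        if all(name in record for name in required_fields):
            yield {name: record[name] for name in required_fields}


def average_temp_by_device(store):
    """An ad-hoc question asked after ingestion, without reshaping the stored data."""
    readings = defaultdict(list)
    for row in read_with_schema(store, ("device", "temp_c")):
        readings[row["device"]].append(row["temp_c"])
    return {device: sum(vals) / len(vals) for device, vals in readings.items()}


if __name__ == "__main__":
    print(average_temp_by_device(RAW_ZONE))
```

The social media record is ignored by this particular query but remains in the store, so a later analysis with different required fields can still use it.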

Section 4

Building a Data Lake

The process of building a data lake requires advanced planning and preparation. Before organizations adopt and implement a data lake, stakeholders should examine their data sources, types, schemas, and total volume, as well as consider the right technology to use. The choice of technology should be based on the following principles:

Elasticity
Independent extensions
Separation of computing and storage

Another vital aspect to account for before standing up a data lake is data governance, which is where the layers of a data lake come into play. Three layers provide shared functionality across all tiers:

Information lifecycle management layer – responsible for determining rules that govern what should or should not go into the data lake.
Data tends to lose value over time and keeping it in the data lake could lead to a data swamp. It is necessary to define strategies for determining how long data should remain in the data lake and to deploy tools that implement these policies by archiving data that has become stale.
Metadata layer – responsible for capturing the metadata of data going into the data lake.
To make data accessible, metadata describes the relationships between stored data sets and records which data can be used by whom.
Data governance and security layer – responsible for authentication, authorization, and access control of the data lake.
This includes regulation of which users can access and/or create what, as well as the types of modifications that can be made to data artifacts. Additionally, this layer is used to create security profiles to protect various data assets.
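
As a rough illustration of how the three layers might fit together, the sketch below models a tiny in-memory catalog. It is an assumption-laden toy, not a description of any particular product: the `DatasetEntry` and `Catalog` classes, the role names, and the 90-day retention period are all hypothetical. Each entry carries metadata (owner, ingestion date), an access rule (allowed roles), and a lifecycle flag (archived).

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class DatasetEntry:
    """Metadata layer: what the data set is, who owns it, when it arrived."""
    name: str
    owner: str
    ingested_on: date
    allowed_roles: set = field(default_factory=set)  # governance/security layer
    archived: bool = False                           # lifecycle layer


class Catalog:
    """Hypothetical catalog combining the three layers in one place."""

    def __init__(self, retention_days):
        self.retention = timedelta(days=retention_days)
        self.entries = {}

    def register(self, entry):
        """Capture metadata when a data set enters the lake."""
        self.entries[entry.name] = entry

    def can_read(self, name, role):
        """Access control: the role must be allowed and the data not archived."""
        entry = self.entries[name]
        return not entry.archived and role in entry.allowed_roles

    def archive_stale(self, today):
        """Lifecycle policy: mark data older than the retention period as archived."""
        archived = []
        for entry in self.entries.values():
            if not entry.archived and today - entry.ingested_on > self.retention:
                entry.archived = True
                archived.append(entry.name)
        return archived


if __name__ == "__main__":
    catalog = Catalog(retention_days=90)
    catalog.register(DatasetEntry("clicks", "web", date(2024, 1, 1), {"analyst"}))
    catalog.register(DatasetEntry("orders", "sales", date(2024, 5, 1), {"analyst", "finance"}))
    print(catalog.archive_stale(date(2024, 6, 1)))
    print(catalog.can_read("orders", "finance"))
```

In a real deployment these responsibilities would typically be handled by dedicated catalog, policy, and identity tooling; the point here is only that stale data is retired by rule rather than left to accumulate into a data swamp.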

Section 5

Conclusion

Big data, characterized by its sheer volume, velocity, variety, veracity, and value, creates the opportunity to discover insights that can improve businesses' operational efficiency while supporting strategic decision-making. Mining this data source using different processing and analytics capabilities can provide tremendous value. However, it remains critical that organizations have the right methodologies, frameworks, and tools in place to process this data: solutions that are cost-effective, scalable, and capable of supporting their specific business needs.

The data lake is one such model, enabling economies of scale when processing vast amounts of varying data. Compared with the traditional data warehouse, data lakes provide substantial performance benefits for iterative processing. As is apparent from their widely successful use cases for big data processing and analytics across diverse sectors, data lake adoption is expected to continue growing in the years to come.