How Data Lakes Work

How data lakes can help eliminate costs and time involved in working with large amounts of data.

By Ben Sharma · Jan. 31, 17 · Opinion

Excerpt from ebook, Architecting Data Lakes: Data Management Architectures for Advanced Business Use Cases, by Ben Sharma and Alice LaPlante.

Many IT organizations are overwhelmed by the sheer volume of data sets (small, medium, and large) stored in Hadoop that, although related, are not integrated. When done right, however, with an integrated data management framework, data lakes allow organizations to gain insights and discover relationships between data sets.

Data lakes created with an integrated data management framework eliminate the costly and cumbersome extract, transform, load (ETL) data preparation process that a traditional enterprise data warehouse (EDW) requires. Data is ingested into the data lake, where it is managed using metadata tags that help locate and connect the information when business users need it. This approach frees analysts for the important task of finding value in the data without involving IT in every step of the process, thus conserving IT resources. Today, IT departments are widely expected to do more with less. In such environments, well-governed and managed data lakes help organizations more effectively leverage all their data to derive business insight and make good decisions.
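
To make the idea of metadata tags concrete, here is a minimal sketch of a tag-based catalog in plain Python. The class name `MetadataCatalog`, its methods, the tags, and the storage paths are all hypothetical and chosen only for illustration; they do not describe Zaloni's product or any particular catalog API.

```python
from collections import defaultdict


class MetadataCatalog:
    """Toy in-memory catalog mapping metadata tags to data set locations."""

    def __init__(self):
        self._locations = {}             # data set name -> storage path
        self._by_tag = defaultdict(set)  # tag -> names of data sets carrying it

    def register(self, name, path, tags):
        """Record where a data set lives and which tags describe it."""
        self._locations[name] = path
        for tag in tags:
            self._by_tag[tag].add(name)

    def find(self, *tags):
        """Return {name: path} for data sets carrying every requested tag."""
        if not tags:
            return {}
        matches = set.intersection(*(self._by_tag.get(t, set()) for t in tags))
        return {name: self._locations[name] for name in sorted(matches)}


# Hypothetical data sets and paths.
catalog = MetadataCatalog()
catalog.register("member_basic", "/lake/raw/member_basic", ["member", "pii"])
catalog.register("claims_2016", "/lake/raw/claims_2016", ["claims", "member"])

member_sets = catalog.find("member")             # both data sets
pii_member_sets = catalog.find("member", "pii")  # only member_basic
```

Looking up by several tags at once is what lets a business user connect related data sets without knowing where they are stored.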

Zaloni has created a data lake reference architecture that incorporates best practices for data lake building and operation under a data governance framework, as shown in Figure 2-1.

Figure 2-1. Zaloni’s data lake architecture

The main advantage of this architecture is that data can come into the data lake from anywhere, including online transaction processing (OLTP) or operational data store (ODS) systems, an EDW, logs or other machine data, or from cloud services. These source systems include many different formats, such as file data, database data, ETL, streaming data, and even data coming in through APIs.

The data is first loaded into a transient loading zone, where basic data quality checks are performed on the Hadoop cluster using MapReduce or Spark. Once the quality checks have been performed, the data is loaded into Hadoop in the raw data zone, and sensitive data can be redacted so it can be accessed without revealing personally identifiable information (PII), protected health information (PHI), payment card industry (PCI) information, or other kinds of sensitive or vulnerable data.
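
The flow from the transient loading zone to the raw data zone can be sketched as follows. This is plain Python rather than MapReduce or Spark, purely to show the logic; the record schema, the specific checks, and the masking rule are assumptions made for the example, not rules taken from the reference architecture.

```python
import re

REQUIRED_FIELDS = ("member_id", "name", "ssn")  # hypothetical schema
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")


def passes_quality_checks(record):
    """Basic checks: required fields present and non-empty, SSN well-formed."""
    if any(not record.get(field) for field in REQUIRED_FIELDS):
        return False
    return bool(SSN_PATTERN.match(record["ssn"]))


def redact(record):
    """Return a copy with the SSN masked except for its last four digits."""
    masked = dict(record)
    masked["ssn"] = "***-**-" + record["ssn"][-4:]
    return masked


def load_into_raw_zone(transient_records):
    """Split transient records into redacted raw-zone records and rejects."""
    raw_zone, rejected = [], []
    for record in transient_records:
        if passes_quality_checks(record):
            raw_zone.append(redact(record))
        else:
            rejected.append(record)
    return raw_zone, rejected


transient = [
    {"member_id": "m1", "name": "Ada", "ssn": "123-45-6789"},
    {"member_id": "m2", "name": "", "ssn": "987-65-4321"},    # missing name
    {"member_id": "m3", "name": "Lin", "ssn": "not-an-ssn"},  # malformed SSN
]
raw_zone, rejected = load_into_raw_zone(transient)
```

Keeping the rejected records instead of dropping them is a design choice of this sketch: it leaves room to report quality problems back to the source system.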

Data scientists and business analysts alike dip into this raw data zone for data sets to explore. An organization can, if desired, apply standard data cleansing and data validation methods and place the resulting data in the trusted zone. This trusted repository contains both master data and reference data.

Master data consists of the basic data sets that have been cleansed and validated. For example, a healthcare organization may have master data sets that contain basic member information (names, addresses) and members’ additional attributes (dates of birth, social security numbers). An organization needs to keep the data in the trusted zone up to date using change data capture (CDC) mechanisms.
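
As a rough illustration of what a CDC mechanism does, the sketch below applies an ordered list of change events to a master data set held in memory. The event shape (`op`, `member_id`, `data`) is a hypothetical format invented for this example; real CDC tools each define their own.

```python
def apply_changes(master, change_events):
    """Apply ordered CDC events to a master data set keyed by member_id."""
    # Copy each row so the caller's data set is left untouched.
    updated = {key: dict(row) for key, row in master.items()}
    for event in change_events:
        op, key = event["op"], event["member_id"]
        if op == "delete":
            updated.pop(key, None)
        elif op in ("insert", "update"):
            # Merge changed columns over the existing row (or start a new one).
            updated[key] = {**updated.get(key, {}), **event["data"]}
        else:
            raise ValueError(f"unknown CDC operation: {op!r}")
    return updated


master = {
    "m1": {"name": "Ada", "address": "1 Elm St"},
    "m2": {"name": "Lin", "address": "9 Oak Ave"},
}
events = [
    {"op": "update", "member_id": "m1", "data": {"address": "4 Pine Rd"}},
    {"op": "delete", "member_id": "m2"},
    {"op": "insert", "member_id": "m3", "data": {"name": "Sam", "address": "7 Ash Ln"}},
]
current = apply_changes(master, events)
```

Only the changed rows travel from the source system, which is what makes this cheaper than reloading the full data set.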

Reference data, on the other hand, is considered the single source of truth for more complex, blended data sets. For example, that healthcare organization might have a reference data set that merges information from multiple source tables in the master data store, such as the member basic information and member additional attributes, to create a single source of truth for member data. Anyone in the organization who needs member data can access this reference data and know they can depend on it.
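
The member example can be sketched as a simple join of two master tables on a shared key. The table contents, field names, and the choice of an inner join are assumptions for illustration only.

```python
def build_member_reference(basic_info, additional_attributes):
    """Join two master tables on member_id into one reference record per member."""
    reference = {}
    for member_id, basic in basic_info.items():
        extra = additional_attributes.get(member_id)
        if extra is None:
            continue  # inner join: skip members missing from either table
        reference[member_id] = {"member_id": member_id, **basic, **extra}
    return reference


# Hypothetical master tables keyed by member_id.
basic_info = {
    "m1": {"name": "Ada", "address": "4 Pine Rd"},
    "m3": {"name": "Sam", "address": "7 Ash Ln"},
}
additional_attributes = {
    "m1": {"date_of_birth": "1980-02-14"},
    "m2": {"date_of_birth": "1975-09-30"},
}
member_reference = build_member_reference(basic_info, additional_attributes)
```

Here only `m1` appears in both tables, so it is the only member in the reference set. Whether incomplete members should be dropped or kept with missing fields is a governance decision, not something this sketch settles.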

From the trusted area, data moves into the discovery sandbox, for wrangling, discovery, and exploratory analysis by users and data scientists.

Finally, the consumption zone exists for business analysts, researchers, and data scientists to dip into the data lake to run reports, do “what if” analytics, and otherwise consume the data to come up with business insights for informed decision-making.
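
One way to picture the zones described above is as an ordered sequence that a data set is promoted through. The sketch below assumes a strictly linear, one-step-at-a-time progression, which is a simplification: the article notes that users also read directly from the raw zone and that the trusted zone is optional. The `Zone` enum and `promote` function are hypothetical names.

```python
from enum import IntEnum


class Zone(IntEnum):
    """Zones in the order the article introduces them."""
    TRANSIENT = 1
    RAW = 2
    TRUSTED = 3
    DISCOVERY_SANDBOX = 4
    CONSUMPTION = 5


def promote(dataset):
    """Return a copy of the data set record moved to the next zone."""
    zone = dataset["zone"]
    if zone is Zone.CONSUMPTION:
        raise ValueError("data set is already in the consumption zone")
    moved = dict(dataset)
    moved["zone"] = Zone(zone + 1)
    # Keep a trail of the zones the data set has passed through.
    moved["history"] = dataset["history"] + [zone.name]
    return moved


dataset = {"name": "member_basic", "zone": Zone.TRANSIENT, "history": []}
while dataset["zone"] is not Zone.TRUSTED:
    dataset = promote(dataset)
```

Recording the history alongside the current zone gives the governance layer a simple lineage trail for each data set.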

Most importantly, underlying all of this must be an integration platform that manages, monitors, and governs the metadata, the data quality, the data catalog, and security. Although companies can vary in how they structure the integration platform, in general, governance must be a part of the solution.

Published at DZone with permission of Ben Sharma. See the original article here.