Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Data Lake vs. Data Warehouse

DZone's Guide to

Data Lake vs. Data Warehouse

While we say data lake vs. data warehouse, these two technologies are actually quite complimentary. Read on to see why.

· Big Data Zone ·
Free Resource

The open source HPCC Systems platform is a proven, easy to use solution for managing data at scale. Visit our Easy Guide to learn more about this completely free platform, test drive some code in the online Playground, and get started today.

Data lakes and data warehouses are critical technologies for business analysis, but the differences between the two can be confusing. How are they different? Is one more stable than the other? Which one is going to help your business the most? This article seeks to demystify these two systems for handling your data.

What Is a Data Lake?

A data lake is a centralized repository designed to store all your structured and unstructured data. Further, a data lake can store any type of data using its native format, without size limits. Data lakes were developed primarily to handle the volumes of big data, and thus they excel at handling unstructured data. You typically move all the data into a data lake without transforming it. Each data element in a lake is assigned a unique identifier, and it is extensively tagged so that you can later find the element via a query. The benefit of this is that you never lose data, it can be available for extensive periods of time, and your data is very flexible because it does not need to adhere to a particular schema before it is stored.

What Is a Data Warehouse?

A data warehouse is a large-capacity repository that sits on top of multiple databases. It is designed to store medium to large amounts of structured data for frequent and repeatable analysis. Typically, a data warehouse is used to bring together data from various structured sources for analysis, usually for business purposes. Some data warehouses can handle unstructured data, but this is not common. Work is involved to ensure that the data types are compatible before you can integrate the data. Because the data stored in a warehouse is structured, the size of the data is constrained, and the schema is determined before data can be added to the warehouse.

Data Lakes vs. Data Warehouses

Picture a warehouse: there’s a limited amount of space, and the boxes must fit into a particular slot on the shelf. Each box needs to be stored in order so that you can later find it, and you will likely need to design the warehouse so that old inventory is purged periodically. Most of these same constraints apply to a data warehouse: the size is fixed, and each piece of data must be stored according to a schema that is carefully designed before you can add the data to the warehouse. Data warehouses are optimized for structured data.

By contrast, a data lake is amorphous, the boundaries can grow or shrink based on the contents. Like a lake, if more data is poured in, the data lake expands, and when data is removed it shrinks. The data does not need to be structured because you use extensive tagging to find the data when you need it. Data lakes are optimized for unstructured data.

The following table shows some of the key differences between data lakes and data warehouses.


Data Lake Data Warehouse
Storage Data is unstructured, and all data is kept in its raw form. All data is stored, and it is only transformed when it is analyzed. Data is typically extracted from transactional systems. The data is cleaned and transformed before you load it to the data warehouse.
Data capture Captures semi-structured and unstructured data. Captures structured data and organizes it in schemas.
Purpose A data lake is ideal for deep analysis of unstructured data. For example, a data scientist might use advanced analytical tools with capabilities such as predictive modeling and statistical analysis. A data warehouse is ideal for operational uses such as monthly reports because it is highly structured.
Schema Typically, the schema is defined after you store the data. This requires less initial work and provides more flexibility. Typically, the schema is defined before you store data. This requires you to cleanse and normalize the data, and it means that the schema is far less flexible.
Better for… Unstructured data, explorations, innovation, flexibility. Structured data, high performance, repeatability, constant use.


What’s Right for You, a Data Lake or a Data Warehouse?

The simple answer is that you probably need both.

Data warehouses are ideal for the kind of repeatable reporting that is common in business practices, such as monthly sales reports, tracking of sales per region, or website traffic. A data lake is useful when you have a less straightforward analysis to perform. For example, maybe you want to perform behavior analysis of the traffic on your website. These are complementary rather than competing tools.

Managing data at scale doesn’t have to be hard. Find out how the completely free, open source HPCC Systems platform makes it easier to update, easier to program, easier to integrate data, and easier to manage clusters. Download and get started today.

Topics:
big data ,data warehouse ,data lake ,data management

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}