Demystifying Data Warehouses, Data Lakes, and Data Marts
Demystifying Data Warehouses, Data Lakes, and Data Marts
The What, When, Why and How of Data Warehouses, Lakes, and Marts and how they are differentiated.
Join the DZone community and get the full member experience.Join For Free
Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.
As data and analytics become a more integral part of business processes in an organization, so the non-DBAs among us might start to feel lost in a sea of technical terms which are frequently thrown around by technical teams. The disproportionately loud vendor noise that exists in this space further generates jargon, hype, and confusion (just try to get a straight answer to “what is big data”).
This post is meant to be used by business users as a (very) abridged guide to the various types of repositories your data might reside in: databases, data marts, data warehouses and data lakes, so that you have a basic understanding of each of these concepts and the role they play in what you’re actually after – real, up-to-date insight from your data.
Ready to get educated? Let’s get started.
Databases Are Basic Data Storage Mechanisms
In one form or another, the database is at the heart of most data storage and management systems. The relational database used with many applications and systems holds data in tables of rows and columns. In a table, a row corresponds to a record with a set sequence of data fields, while a column lists one given data field for all the records. The data is structured in that only the “right” kind of data can be used in a given field: for example, in a customer relational database, a shipping date cannot be used in a field for a delivery address, and so on.
The structure or “schema” of a relational database is defined before starting to record data. It is often left unchanged afterward. However, by organizing the database as separate tables with defined relationships between them, structured data can be accessed or reassembled in many different ways. By comparison, the need to handle unstructured data has led to the creation of other types of databases. To efficiently handle free-form text in emails or for variable-length video clips, for example, such a non-relational database may have very few fields or different numbers of fields for different records, and may allow fields to be changed “on the fly” after data storage operations begin.
Transactional Databases (Write-oriented)
For key business systems like sales, accounting, and production, it is critical that input of transactional data to the database is quick and reliable, without disruption to the flow of business. The database can be optimized for these “write” operations, by minimizing the duplication of data fields (normalizing the data) among the database tables.
A transactional system can also be efficient in retrieving specific information about an individual transaction, like the date of a shipment to a customer. On the other hand, the “write-oriented” design makes it less well-suited for collating data to provide information such as overall shipment figures over the last two years. This kind of “read-oriented” operation may require massive manipulation of data records or recombination of large tables when querying production databases, either of which could then have a significant negative impact on a transactional system performance.
Analytical Databases (Read-oriented)
Analytical databases systems are optimized for “read” operations and often run separately from transactional-operational systems. From time to time, they ingest data from the transactional systems and possibly other data sources, but otherwise, they perform relatively few “writes”. Analytical systems are used to consolidate data (do roll-ups), slice data (for example, all shipments over one year of a given product), dice data (for example, shipments of a given product to a specific set of customers for a given quarter), and drill down to reveal successive layers of detail from a higher-level statistic.
Data Warehouses for One Version of the Truth
Data in an enterprise exists in different formats in various sources and is not necessarily consistent from one source to another. To resolve differences and potential conflicts, a data warehouse consolidates data from the different sources and makes the data available in one unified, harmonized form. The process of getting to this “one version of the truth” for an enterprise or organization is divided into three main steps.
- Extract. Data is uploaded (perhaps daily, weekly, or even monthly) from different data sources. It may first go to a staging server, before going to the data warehouse.
- >Transform. The different datasets are made compatible with one another by adjusting formats and resolving conflicts.
- Load. The transformed data is loaded into the analytical repository for subsequent calculations, trend analyses, reporting, and other business intelligence functions.
A data warehouse is also known as a “schema on write” system because the data written into it by the ETL (extract-transform-load) process has also been given a certain structure or schema. It is often in effect another database, but one that is now optimized for analytic processing, instead of transaction handling.
Data Marts Serve Insights to Individual Departments
A data mart offers analytical capability for a restricted area of data, for example, for just one functional domain or department in an enterprise. Data marts can help avoid one department interfering with another department’s data. They can also simplify data analytics or meet a smaller, more specific requirement, before trying to tackle a data warehouse for the whole organization. Consequently, there are two points of view about how to implement data warehouses and data marts. One is to start with the data warehouse as an overarching construction. The dependent data marts are then restrictions or subsets of the data warehouse. The other is to make independent data marts from source data, then bring them together afterward to form an overall or larger data warehouse.
Data Lakes for Massive Storage that Changes the Rules
The realization that unstructured data and big data can also be analyzed for business insights have led to the concept of the data lake. In relation to a data warehouse, a data lake offers massive storage with high availability of data at lower costs, together with increased agility and flexibility of use. A data lake can help increase data democracy, allowing users more possibilities to ask new questions as they go. It supports all data from all sources, including relational databases, Hadoop file systems, and social media data. It also retains the original data, as well as transformed versions, for data traceability. However, there is still a need for data governance and understanding of the practical business uses to be made of it. Otherwise, indiscriminate dumping of data may turn a potentially useful data lake into an unhealthy data swamp.
Compared to a data warehouse, which uses a “schema on write” approach to hold structured, processed data, a data lake uses a “schema on read” approach in which structure and schema are only applied when the data is being read from the data lake to answer a business intelligence question or for other analytics purposes. In other words, whereas a data warehouse functions via an ETL process (extract, transform, load), a data lake uses ELT (extract, load, and then transform) instead.
Does Your Organization Need a Data Warehouse? Data Lake? Or Both?
The growth in data complexity and the high number of disparate datasets accessed or generated by organizations today are both factors to consider when choosing between solutions for gathering, storing, and analyzing data. Data lakes can be used for massive quantities of unstructured data, but also require tools that can easily interface with them. Data warehouses let organizations create a single source of truth, but come at the cost of agile analysis when they require the involvement of IT professionals to set up new queries or reports.
However, it is also possible to use both. A data lake can be used for sandboxing, allowing users to experiment with different data models and transformations, before setting up new schema in a data warehouse. It can also serve as a staging area, from which to supply data to a data warehouse to then produce cleansed data with known value. By using each possibility suitably, enterprises and organizations can get the best of both solutions.
Published at DZone with permission of Eran Levy , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.