
Data Lake: The Central Data Store

With the increased use of tools for different functionalities, generating meaningful reports for different stakeholders has become challenging. Data lakes can help.

We live in the age of data, and according to Gartner, the volume of worldwide information is growing at a minimum rate of 59% annually. Volume alone is a significant challenge to manage, and variety and velocity make it even more difficult. It is also evident that data volumes will keep growing, especially given the rapid proliferation of handheld and Internet-connected devices.

This is certainly true for organizations with systems of engagement, but for others, data volume growth is more modest; data volume differs from organization to organization. What all of them share is the need for meaningful, useful analytics for different stakeholders. As organizations adopt more and more tools for different functions, generating meaningful and useful reports for those stakeholders becomes increasingly challenging.

What Is a Data Lake?

Nick Heudecker, research director at Gartner, has described the data lake this way:

“In broad terms, data lakes are marketed as enterprise-wide data management platforms for analyzing disparate sources of data in its native format. The idea is simple: instead of placing data in a purpose-built data store, you move it into a data lake in its original format. This eliminates the upfront costs of data ingestion, like transformation. Once data is placed into the lake, it’s available for analysis by everyone in the organization.”

Thus, a data lake helps organizations gain insight into their data by breaking down data silos. The term “data lake” was first used in 2010, and its definition and characteristics are still evolving. In general, “data lake” refers to a central repository capable of storing zettabytes of data drawn from various internal and external sources, in a format as close as possible to the raw data.

Data Lake Challenges

A data lake is usually thought of as the collection and collation of all enterprise data from legacy systems and sources, data warehouses and analytics systems, third-party data, social media data, clickstream data, and anything else that might be considered useful information for the enterprise. Although this definition is appealing, is such a lake actually feasible, or even necessary, for every organization?

Different organizations face different challenges and patterns of distributed data, so each has its own needs for a data lake. Though the needs, patterns, sources, and architecture of the data differ, the challenges of building a central store, or lake, of data are common:

  • Bringing data from different sources into a common, central pool.
  • Handling low-volume but highly diversified data.
  • Storing the data on lower-cost infrastructure than a data warehouse or big data platform.
  • Near real-time synchronization of data with the central data store (a minimal sync sketch follows this list).
  • Traceability and governance of the central data.
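
To make the near-real-time synchronization challenge concrete, here is a minimal polling-based sketch in Python. The source_events table, the fetch_changes helper, and the in-memory lake dictionary are hypothetical stand-ins for illustration only; a production setup would more likely rely on change data capture or a message queue.

```python
# A minimal polling-based sync sketch. The source table and the
# in-memory "lake" are hypothetical stand-ins, not a real product API.
import sqlite3
import time

def fetch_changes(conn, since_id):
    """Return source rows inserted after the last synced id."""
    cur = conn.execute(
        "SELECT id, payload FROM source_events WHERE id > ? ORDER BY id",
        (since_id,),
    )
    return cur.fetchall()

def sync_loop(conn, lake_store, interval_seconds=5, max_cycles=3):
    """Poll the source and copy new rows into the central store."""
    last_id = 0
    for _ in range(max_cycles):           # bounded here so the sketch terminates
        for row_id, payload in fetch_changes(conn, last_id):
            lake_store[row_id] = payload  # stand-in for a write to the lake
            last_id = row_id
        time.sleep(interval_seconds)

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE source_events (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany(
        "INSERT INTO source_events (payload) VALUES (?)",
        [("order:1001",), ("order:1002",)],
    )
    lake = {}
    sync_loop(conn, lake, interval_seconds=0, max_cycles=1)
    print(lake)  # {1: 'order:1001', 2: 'order:1002'}
```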

Data Lake Implementation Considerations

In most cases, data lakes are deployed in the spirit of a data-as-a-service model, where the lake is treated as a centralized system of record serving other systems at enterprise scale. A localized data lake not only expands to support multiple teams but can also spawn multiple data lake instances to meet larger needs. This centralized data can then be used by different teams for their analytical needs.

With this understanding, it’s time to discuss what data lakes require in terms of integration and governance.

Integration Challenges

To be deployed at the enterprise level, a data lake needs certain capabilities that allow it to be integrated into the organization’s overall data management strategy, IT applications, and data flow landscape.

  • For the data in a lake to be useful at a later point in time, it is very important to ensure that the lake is getting the right data at the right time. For example, a data lake may ingest monthly sales data from enterprise financial software. If the data lake takes in that data too early, it may get only a partial dataset or no data at all. This could result in inaccurate reporting down the line, leading the company in the wrong direction. Thus, the integration platform operating in the background to populate the data lake should be capable of pushing data from various tools both in real time and on demand, based on the business case.
  • Though the main purpose of the data lake is to store data, at times (based on different business cases, and so that other departments can use the data in the future) some data needs to be distilled or processed before being inserted into the lake. The integration platform should not only support this but also ensure that the processing happens accurately and in the correct order.
  • Centralized data storage is useful only when every department can extract the stored data for its own use. The data lake should integrate with other applications and downstream reporting/analytics systems, and should also support REST APIs through which different applications can pull or push their own piece of data. A minimal ingestion sketch follows this list.
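
As a concrete illustration of on-demand ingestion with a light processing step before landing in the lake, here is a hedged Python sketch. The normalize step, the endpoint URL, the API shape, and the field names are all assumptions made for illustration; they do not describe any particular product’s API.

```python
# A sketch of pushing a lightly processed record into a data lake over
# REST. The https://datalake.example.com/ingest endpoint and the field
# names are illustrative assumptions, not a real product API.
import json
from urllib import request

def normalize(record):
    """Minimal distillation before the record lands in the lake:
    standardize keys and tag the record with its source."""
    return {
        "source": "crm",
        "customer_id": record["CustID"],
        "amount_usd": float(record["Amount"]),
    }

def push_to_lake(record, endpoint="https://datalake.example.com/ingest"):
    """POST one normalized record to the (hypothetical) ingestion API."""
    payload = json.dumps(normalize(record)).encode("utf-8")
    req = request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:  # raises on HTTP errors
        return resp.status

if __name__ == "__main__":
    print(normalize({"CustID": "C-42", "Amount": "19.99"}))
    # push_to_lake(...) is left commented out: the endpoint is fictional.
    # push_to_lake({"CustID": "C-42", "Amount": "19.99"})
```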

Data Lake Governance Challenges

The data lake is not only about storing data centrally and furnishing it to different departments on demand. As more and more users work with data lakes, directly or through downstream applications and analytical tools, governance becomes increasingly important. By bringing diversified datasets from various repositories into one single repository, data lakes create a new level of challenges and opportunities.

The major challenge is ensuring that data governance policies and procedures exist and are enforced in the data lake. Each dataset should have a clearly defined owner from the moment it enters the lake, along with a well-documented policy or guideline covering its required accessibility, completeness, consistency, and update frequency.
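
One lightweight way to make such ownership and policy definitions explicit is to attach a small governance record to every dataset as it enters the lake. The field names below are assumptions about what such a record might hold, not a prescribed schema.

```python
# A sketch of a per-dataset governance record. The fields are
# illustrative assumptions, not a standard.
from dataclasses import dataclass, field

@dataclass
class DatasetPolicy:
    name: str                  # dataset identifier in the lake
    owner: str                 # accountable team or person
    accessibility: str         # e.g. "restricted" or "org-wide"
    update_frequency: str      # e.g. "monthly", "near-real-time"
    quality_checks: list = field(default_factory=list)

sales_policy = DatasetPolicy(
    name="monthly_sales",
    owner="finance-team@example.com",
    accessibility="org-wide",
    update_frequency="monthly",
    quality_checks=["row_count > 0", "no null customer_id"],
)
print(sales_policy.owner)
```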

To address this, the data lake should have built-in mechanisms to track and record any manipulation of the data assets it holds.
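
A minimal sketch of such a tracking mechanism follows, assuming an append-only audit trail; a real lake would persist these entries in a durable store rather than an in-memory list.

```python
# A minimal audit-trail sketch for tracking manipulations of lake
# assets. The in-memory list stands in for a durable audit store.
from datetime import datetime, timezone

AUDIT_LOG = []

def record_manipulation(dataset, actor, action, detail=""):
    """Append one traceability entry for a change to a lake asset."""
    AUDIT_LOG.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "dataset": dataset,
        "actor": actor,
        "action": action,  # e.g. "ingest", "update", "delete"
        "detail": detail,
    })

record_manipulation("monthly_sales", "etl-service", "ingest", "March load")
record_manipulation("monthly_sales", "analyst-jane", "update", "fixed region codes")
print(len(AUDIT_LOG))  # 2
```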

Is the Data Lake the Same for Everyone?

The implementation of a data lake is not the same for every organization, as data volumes and data collection requirements vary from organization to organization. The common perception is that a data lake must hold petabytes or even zettabytes of data and be implemented on a NoSQL database. In reality, that data volume, together with a NoSQL implementation, may be neither necessary nor feasible for every organization. The end goal of a central data store catering to all of the organization’s analytical needs can begin with a SQL database and a modest data volume.
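
As a sketch of that modest starting point, the following uses SQLite (standing in for any SQL database) with a single table that keeps each record close to its raw form; the schema is an illustrative assumption, not a recommendation.

```python
# Starting a central data store on a plain SQL database: one table
# holds raw records from any source as JSON text. Illustrative only.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE lake_records (
        id       INTEGER PRIMARY KEY,
        source   TEXT NOT NULL,   -- originating system
        ingested TEXT NOT NULL,   -- ISO-8601 timestamp
        raw      TEXT NOT NULL    -- payload, as close to raw as practical
    )
""")

conn.execute(
    "INSERT INTO lake_records (source, ingested, raw) VALUES (?, ?, ?)",
    ("crm", "2018-03-01T00:00:00Z", json.dumps({"CustID": "C-42", "Amount": 19.99})),
)

# Any department can pull its slice back out with ordinary SQL.
for source, raw in conn.execute("SELECT source, raw FROM lake_records"):
    print(source, json.loads(raw))
```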


Topics: big data, data lake, data integration, data governance
