Data Lakes: Underpinning the Journey of Digital Transformation

Data is as good as gold, so the way your organization uses its data matters. Read on to see why data lakes are so important for digital transformation.

We would all agree that the buzz around “big data” is fading, but why? Data is everywhere, all around us. The proliferation of data is taking place at a rapid pace, and organizations consider data an important asset to their business. It is an intangible asset that any organization can harness for meaningful insights.

As companies are propelled towards digital transformation and reinvent their business models, those new models will require all sorts of new data. The way data is used today will change rapidly as well. In addition, as data science enters the mainstream, commercial and code-sharing analytics platforms are providing algorithms and libraries that simplify its practical application. Most of the time is spent getting the data; once good data is available, modeling becomes quick to implement.

Data science is the easy part; getting the right data, and getting that data ready for analytics, is much more difficult.

The sea of data is vast and growing exponentially. To avoid drowning, it is time to expand existing data management infrastructures massively and quickly. Data lakes, an emerging class of data management technology, hold significant promise in this regard.

Data lake storage platforms are designed to hold, process, and analyze both structured and unstructured data. Data lakes are typically used in conjunction with traditional enterprise data warehouses (EDWs). In a data lake, data is held in its native format and processed only when needed.

With the advent of data lakes, fundamental questions have been raised about the future of data warehouses. Do we need data warehouses anymore? We should see data lakes as a complement to data warehouses, not as a replacement. Data warehouses are, and will remain, relevant for any organization. Here are some of their advantages:

  • A good fit for relational, structured data.
  • Easy to use.
  • Easy to access.
  • Strong security and data governance.
  • Useful for master data management.
  • A single version of the truth.
  • Integration with reporting components.
  • Improved data quality.

On the other hand, data lakes bring advantages that traditional data warehouses are unable to provide. Close integration between data lakes and data warehouses can help organizations overcome known pitfalls in managing data effectively, and it provides a solid data platform for the analytical and data science initiatives they would like to embark upon.
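One way to picture this integration is a small curation step that reads raw records from the lake and loads only the fields a report needs into a warehouse table. The sketch below is illustrative only: the event fields, the `curate` helper, and the `orders` table are hypothetical, and an in-memory SQLite database stands in for a real EDW.

```python
import json
import sqlite3

# Hypothetical raw events as they might sit in a lake: one JSON document
# per line, with varying fields and the occasional unparseable line.
raw_events = [
    '{"order_id": 1, "customer": "acme", "amount": 120.5, "clickstream": ["home", "cart"]}',
    '{"order_id": 2, "customer": "globex", "amount": 80.0}',
    "not valid json",
]


def curate(lines):
    """Keep only the fields the warehouse table needs; skip unparseable lines."""
    rows = []
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # the raw line stays in the lake for later inspection
        rows.append((event["order_id"], event["customer"], event["amount"]))
    return rows


# In-memory SQLite is only a stand-in for an enterprise data warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer TEXT NOT NULL, amount REAL NOT NULL)"
)
warehouse.executemany("INSERT INTO orders VALUES (?, ?, ?)", curate(raw_events))
warehouse.commit()

# Reporting runs against the structured, governed table.
report = warehouse.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(report)
```

The lake keeps everything, including the clickstream field and the malformed line, while the warehouse receives only clean, typed rows suited to reporting.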

Advantages of data lakes:

  • Ability to harness structured, unstructured, and streaming data.
  • Scalability.
  • Storage of data in its native format (no predefined schema).
  • Efficient storage and processing of big data.
  • Getting the right information to the right people at the right time.

Data lakes use a bottom-up approach: any type of data is stored in its native format in a big data repository (predominantly Hadoop- or cloud-based ecosystems), and further transformations are applied later based on business needs. The underlying premise is that all data has potential value that can be leveraged for downstream applications such as machine learning, AI, and data analytics.
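The bottom-up idea can be sketched in a few lines of Python. This is a minimal illustration, not a real lake implementation: the `raw_zone` list, the sample payloads, and both helper functions are hypothetical. Payloads are kept exactly as they arrived, and a schema is applied only when a specific business question is asked.

```python
import csv
import io
import json

# Hypothetical "raw zone": payloads are stored exactly as they arrived,
# tagged only with their source format. No schema is enforced on write.
raw_zone = [
    ("json", '{"customer": "acme", "amount": "120.50", "channel": "web"}'),
    ("csv", "customer,amount\nglobex,80.00\nacme,19.50"),
]


def read_records(fmt, payload):
    """Parse one raw payload into dictionaries at read time."""
    if fmt == "json":
        return [json.loads(payload)]
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(payload)))
    raise ValueError(f"unsupported format: {fmt}")


def revenue_by_customer(zone):
    """Apply a schema (customer, amount) only when this question is asked."""
    totals = {}
    for fmt, payload in zone:
        for record in read_records(fmt, payload):
            customer = record["customer"]
            totals[customer] = totals.get(customer, 0.0) + float(record["amount"])
    return totals


totals = revenue_by_customer(raw_zone)
print(totals)
```

A different question, such as revenue by channel, would read the same untouched payloads with a different projection, which is the flexibility the bottom-up approach is meant to preserve.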

How to Start the Journey of Data Lake Implementation

As organizations embark on the journey of implementing data lakes, IT and business leaders should be ready to face challenges and tough questions around security protocols, enterprise architecture, technology stacks, and more. This is the time to reimagine, rethink, and change our mindsets about systems, processes, and governance models, as they may not work the way they did in the traditional world.

An agile approach to implementing data lakes can help companies foster a truly data-driven culture and accelerate the journey of digital transformation and data analytics. Rather than building a data lake for all legacy data, which can take years, we should identify upcoming priority use cases, start putting their data into the lake, and keep moving gradually in this direction.
