
Data Lake Archiving: Hadoop or the Cloud?


When it comes to data lake storage, are you using Hadoop or the cloud? Which is really the best option? Let's find out!


The storage layer of the data lake is evolving. A few years ago, when we talked about the data lake, it was generally understood that Hadoop was the underlying platform for everything related to the data lake. Today, thanks to the cloud, that is no longer necessarily the case. Why? Deploying a data lake in the cloud enables you to decouple storage and compute functions and use the storage platform that is best suited and most cost-effective for your needs – which may not be Hadoop.

Recently, I participated in a webinar with O’Reilly Media. During the webinar we received some excellent questions. One of the questions led to a specific discussion about big data archiving solutions within a data lake architecture.

Archival storage is a key area to target for lowering costs, but it’s important to do it in an automated and process-driven way that allows for transparency and scalability – not to mention access, as more and more enterprises become interested in using historical data for data analytics. What are your options today to integrate a more cost-effective archival solution within the data lake architecture? How do you structure your data lifecycle management policies for the data lake to accommodate archiving?

Integrating a Non-HDFS Archival Solution

You don't have to ingest data into HDFS to make it available to a Hadoop application, as long as the storage exposes a Hadoop-compatible file system interface. Therefore, to help save on costs, we typically advocate moving archival storage out of Hadoop and into the cloud. In the cloud, you can take advantage of a non-block-based file system, such as an object store like Amazon S3, or lower-cost, long-term storage for cold data like AWS Glacier, when speedy retrieval isn’t a requirement.
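To make that idea concrete, here is a minimal PySpark sketch that reads archived data directly from S3 through the s3a connector alongside data still in HDFS. The bucket, paths, and column name are placeholders, not references to any specific deployment, and the hadoop-aws/s3a libraries are assumed to be on the classpath.

```python
# Minimal sketch: querying archived data in S3 and current data in HDFS
# with the same DataFrame API -- only the URI scheme changes.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-archived-data")
    # The s3a connector (hadoop-aws) must be available; credentials usually
    # come from instance roles or environment variables.
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

archived = spark.read.parquet("s3a://my-archive-bucket/events/year=2016/")
current = spark.read.parquet("hdfs:///data/events/current/")

# Combine historical and current data in one query (schemas assumed to match).
archived.union(current).groupBy("event_type").count().show()
```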

Block vs. Object Storage

Block storage has long been the norm, as traditional file systems break files down into blocks before storing them. Individual blocks don’t have metadata associated with them; it’s only when they’re accessed and combined with other blocks that they create a file that is defined in a way we can understand.

Object storage is a bit different. Each file is bundled with its metadata as a single “object,” named with an object ID and stored in a flat structure. You retrieve the whole object via its object ID. This enables fast access at scale. Notably, for data analytics, the metadata associated with the object is unlimited in terms of type and amount, and can be customized by users. Object storage provides a simple way to manage archival storage across locations, while providing an array of rich metadata.
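As a concrete example, here is a minimal boto3 sketch of writing and reading an S3 object with user-defined metadata; the bucket, key, file name, and metadata values are placeholders.

```python
# Minimal sketch: storing and retrieving an object plus custom metadata on S3.
import boto3

s3 = boto3.client("s3")

# Each object is stored whole, addressed by its key, with metadata attached.
with open("clickstream-0001.parquet", "rb") as f:
    s3.put_object(
        Bucket="my-archive-bucket",
        Key="archive/2016/clickstream-0001.parquet",
        Body=f,
        Metadata={  # custom key/value metadata, defined by the user
            "source-system": "web-clickstream",
            "ingest-date": "2016-11-01",
            "retention-class": "cold",
        },
    )

# Retrieval returns the whole object and its metadata in one call.
response = s3.get_object(
    Bucket="my-archive-bucket",
    Key="archive/2016/clickstream-0001.parquet",
)
print(response["Metadata"])     # the custom metadata set above
data = response["Body"].read()  # the object's contents
```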

Big Data Lifecycle Management

Once you decide on the best platform to meet your archival storage needs, consider how you will implement a data lifecycle management strategy with a transparent, policy-based system. What are the rules, based on the age and/or usage of the data, that define when data moves from a block-based file system like HDFS into an archival platform like S3 or Glacier? How will you maintain the metadata so that you can still run queries on the archived data? What is a reasonable timeframe for accessing archived data? Retrieval may take longer, but that may be acceptable based on the storage tiers you have defined.
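As a sketch of what an age-based rule can look like in practice, assuming S3 and Glacier as the archival tiers, the following boto3 snippet defines a lifecycle policy on a bucket; the bucket name, prefix, and day thresholds are placeholders to adapt to your own policies.

```python
# Illustrative lifecycle rule: move archived objects to Glacier after 90 days
# and delete them after roughly seven years.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-archive-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-cold-data",
                "Filter": {"Prefix": "archive/"},
                "Status": "Enabled",
                # After 90 days, transition objects to lower-cost Glacier storage.
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                # After ~7 years, expire (delete) them entirely.
                "Expiration": {"Days": 2555},
            }
        ]
    },
)
```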

Managing the lifecycle of data at the scale of “big” can be challenging. That’s why Zaloni offers Bedrock DLM, which gives enterprises the ability to create and automate global and specific data retention policies for data in the data lake based on whatever makes sense for the business, including age and relevancy. You can use Bedrock DLM to apply metadata, define storage tiers in Hadoop, delete old data and export data from HDFS to more cost-effective storage in the cloud.
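Bedrock DLM's own interface is product-specific, but as a generic illustration of the final step such a policy automates, moving aged-out data from HDFS to cheaper cloud storage, here is a sketch that drives Hadoop's built-in DistCp tool from Python. The paths and bucket name are placeholders.

```python
# Generic illustration (not Bedrock-specific): export an aged-out HDFS
# directory to S3 with DistCp, then reclaim the HDFS capacity.
import subprocess

SOURCE = "hdfs:///data/events/year=2015/"
TARGET = "s3a://my-archive-bucket/archive/events/year=2015/"

# Copy the data out of HDFS into lower-cost object storage.
subprocess.run(["hadoop", "distcp", SOURCE, TARGET], check=True)

# Once the copy succeeds, remove the HDFS copy.
subprocess.run(["hdfs", "dfs", "-rm", "-r", "-skipTrash", SOURCE], check=True)
```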

Ready to consider an integrated, hybrid approach to storage for data in your data lake? Archival storage can be low-hanging fruit when it comes to cutting costs, so it’s worthwhile to explore your options. And the good news is that there are tools to help you implement a sound, policy-based data lifecycle management strategy that is customized to meet your business needs.

If you’d like to listen to the O’Reilly Webinar where this topic was discussed, you can access the replay here.


Topics:
big data, data lake, hadoop, cloud
