How to Access Data From Both HDFS and Cloud Storage

An engineer briefly discusses the problems you can run into when merging data from two different sources and a free solution that can help.

Problem

Big data analytics jobs sometimes need to process input data from two different storage systems at the same time. For instance, a data scientist may need to join two tables, one from an HDFS cluster and one from S3.

Existing Solutions

Some computation frameworks can connect to several storage systems, including HDFS and popular cloud stores such as S3 or Azure Blob Storage. In Hive, for example, each external table can point to a different storage system in the Hive metastore. While straightforward to set up, this approach may have performance issues. For example, running the computation in an on-premise data warehouse while reading data from cloud storage can be slow, even when the compute is colocated with HDFS. Similarly, a workload running in a cloud environment may read cloud storage efficiently, but fetching data from the on-premise HDFS can be slow. The problem gets worse when the same input data is read repeatedly, as in machine-learning or scientific computing algorithms.
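To make the Hive approach concrete, here is a minimal sketch that generates the DDL for two external tables, one located on HDFS and one on S3, plus a query joining them. The table names, columns, host name, bucket, and Parquet format are all hypothetical, chosen only for illustration; the helper `external_table_ddl` is not part of Hive.

```python
def external_table_ddl(table, columns, location):
    """Build a Hive CREATE EXTERNAL TABLE statement for a storage location."""
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} ({cols})\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table living on the on-premise HDFS cluster.
orders_ddl = external_table_ddl(
    "orders",
    [("order_id", "BIGINT"), ("customer_id", "BIGINT")],
    "hdfs://namenode:8020/warehouse/orders",
)

# Hypothetical table living in an S3 bucket (s3a is the Hadoop S3 connector scheme).
customers_ddl = external_table_ddl(
    "customers",
    [("customer_id", "BIGINT"), ("name", "STRING")],
    "s3a://example-bucket/customers",
)

# Once both tables are registered, the join does not mention storage at all.
join_query = (
    "SELECT o.order_id, c.name "
    "FROM orders o JOIN customers c ON o.customer_id = c.customer_id"
)

if __name__ == "__main__":
    print(orders_ddl, customers_ddl, join_query, sep=";\n\n")
```

The query itself is storage-agnostic, which is why this setup is easy; the cost shows up at read time, when one side of the join has to be pulled over a slow link.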

Another common solution is to prepare the input data in advance by moving it from one storage system (e.g., cloud storage) to HDFS. This typically requires data engineers to build ETL pipelines beforehand and adds complexity in data management and synchronization.

How Alluxio Helps

Alluxio can serve as a data abstraction layer on top of both HDFS and S3 or other cloud storage systems, providing a unified file system namespace. Data applications can then assume one “logical file system” whose directories link to different external storage systems. Reading data from remote storage is transparent to the user application. Alluxio also provides benefits such as data caching, which speeds up repeated reads of the same set of files.
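The idea of one logical file system with directories linked to different storage systems can be illustrated with a toy mount table. This is a conceptual sketch only, not Alluxio's API: the class `UnifiedNamespace`, its methods, and all paths and URIs are hypothetical.

```python
class UnifiedNamespace:
    """Toy model of a unified namespace: logical directories map to storage URIs."""

    def __init__(self):
        self._mounts = {}  # logical directory -> underlying storage URI

    def mount(self, logical_dir, storage_uri):
        """Link a logical directory to a directory in an external storage system."""
        self._mounts[logical_dir.rstrip("/")] = storage_uri.rstrip("/")

    def resolve(self, logical_path):
        """Translate a logical path into the URI of the storage that backs it."""
        # Check longer mount points first so nested mounts take precedence.
        for mount_point in sorted(self._mounts, key=len, reverse=True):
            if logical_path == mount_point or logical_path.startswith(mount_point + "/"):
                return self._mounts[mount_point] + logical_path[len(mount_point):]
        raise KeyError(f"no mount covers {logical_path}")


# Hypothetical mounts: one directory backed by HDFS, one backed by S3.
ns = UnifiedNamespace()
ns.mount("/warehouse", "hdfs://namenode:8020/warehouse")
ns.mount("/cloud", "s3a://example-bucket/data")

if __name__ == "__main__":
    print(ns.resolve("/warehouse/orders/part-0.parquet"))
    print(ns.resolve("/cloud/customers/part-0.parquet"))
```

The application only ever sees paths such as `/warehouse/...` and `/cloud/...`; which storage system serves each path is decided by the mount table, so a backing store could be swapped without changing application code.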

Topics:
big data ,hdfs ,data integration ,data access

Published at DZone with permission of
