HDFS Concurrent Access


In this post, a developer discusses a project in which he had to implement a data lake, the problems his team faced, and how they resolved them.


Last year, I implemented a data lake. As is standard, we had to ingest data into the data lake, followed by basic processing and advanced processing.

We were using bash scripts for some portions of the data processing pipeline, where we had to copy data from Linux folders into HDFS, followed by a few transformations in Hive.

To reduce data-load time, we planned to execute these two operations in parallel - copying files into HDFS and performing the Hive transformations - ensuring that the two operations worked on separate data sets, identified by a unique key.
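The intended layout looked roughly like the sketch below. The function names are illustrative stand-ins, not the actual scripts: in the real pipeline the copy step ran `hdfs dfs -put` and the transform step invoked Hive.

```shell
#!/usr/bin/env bash
# Sketch of the parallel plan: while batch N is being copied into
# HDFS, batch N-1 (a disjoint data set, keyed separately) is being
# transformed in Hive. Both stand-in functions just log their work.
set -euo pipefail

log=$(mktemp)

copy_to_hdfs() {      # stand-in: the real step ran `hdfs dfs -put ...`
  echo "copied $1" >> "$log"
}
transform_in_hive() { # stand-in: the real step ran the Hive HQL
  echo "transformed $1" >> "$log"
}

# The two operations run concurrently on separate batches:
copy_to_hdfs "batch-002" &
transform_in_hive "batch-001" &
wait
cat "$log"
```

On paper this is safe, because the two background jobs never touch the same batch.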

But it was not to be. Though both operations executed without errors, Hive threw errors once we started querying the transformed data.

Upon investigation, we found that the errors were due to the parallel execution. While data is being copied into HDFS (from the Linux folders), Hadoop writes it under a temporary file name until the copy operation is complete. Once the copy finishes, the temporary name is removed and the actual file becomes available in HDFS.
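This behavior can be reproduced on the local filesystem. For `hdfs dfs -put`, the in-progress file carries a `._COPYING_` suffix and is renamed to its final name only when the copy completes; the sketch below imitates that with `cp` and `mv` (paths are illustrative).

```shell
#!/usr/bin/env bash
# Local-filesystem imitation of how an HDFS copy lands: the data is
# written under a temporary name (._COPYING_ suffix) and renamed to
# the final name only once the copy is complete.
set -euo pipefail

landing=$(mktemp -d)              # stand-in for the HDFS landing dir
src=$(mktemp); echo "some data" > "$src"
dest="$landing/part-0000"

cp "$src" "${dest}._COPYING_"     # in-progress copy, visible to readers
# A reader listing the directory at this point (as Hive did for us)
# picks up the ._COPYING_ file - this is exactly the race we hit.
mv "${dest}._COPYING_" "$dest"    # copy complete: temporary name disappears
ls "$landing"
```

The rename is atomic, but anything that scanned the directory before the rename is left holding the temporary name.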

When Hive runs in parallel with an in-progress copy, it picks up references to these temporary files. Even after the temporary names are removed from HDFS, Hive still holds references to them, causing the errors mentioned above.

Once we ensured that the HDFS copy operation and Hive transformation were not performed in parallel, our problem was solved.
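The fix amounts to strict sequencing: the Hive step starts only after the HDFS copy has exited successfully, so it can never see an in-flight temporary file. A minimal sketch, again with illustrative stand-in functions in place of the real `hdfs`/Hive invocations:

```shell
#!/usr/bin/env bash
# Sketch of the fix: the copy and the transformation run strictly in
# sequence for each batch. `&&` ensures the Hive step only starts if
# the copy step exited successfully.
set -euo pipefail

log=$(mktemp)
copy_to_hdfs()      { echo "copied $1"      >> "$log"; }  # stand-in for `hdfs dfs -put`
transform_in_hive() { echo "transformed $1" >> "$log"; }  # stand-in for the Hive HQL

batch="batch-001"
copy_to_hdfs "$batch" && transform_in_hive "$batch"
cat "$log"
```

Serializing the steps costs some wall-clock time per batch, but removes the race entirely.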


