
A Smattering of HDFS


HDFS has become a key tool for managing pools of Big Data and supporting Big Data analytics applications. Read on to learn more about it.



Hadoop is an open-source framework for storing and processing Big Data in a distributed environment across clusters of commodity computers. Its storage layer, the Hadoop Distributed File System (HDFS), has many similarities with existing distributed file systems, but the differences are significant: it is designed for high-performance access to data across Hadoop clusters. Like other Hadoop-related technologies, HDFS has become a key tool for managing pools of Big Data and supporting Big Data analytics applications, and it is the primary storage system used by Hadoop applications.

HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications with large data sets. HDFS uses a master/slave architecture: the master is a single NameNode that manages the file system metadata, and the slaves are one or more DataNodes that store the actual data.
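To make that division of labor concrete, here is a minimal sketch of writing a file through Hadoop's Java FileSystem API. The NameNode address (hdfs://namenode:8020) and the file path are hypothetical placeholders; real clients normally pick up the cluster address from core-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; usually supplied by core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            FileSystem fs = FileSystem.get(conf);

            // The client asks the NameNode where to place each block,
            // then streams the bytes directly to the chosen DataNodes.
            try (FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"))) {
                out.writeUTF("Hello, HDFS!");
            }
            fs.close();
        }
    }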

What Are NameNodes and DataNodes?

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the data for each file is kept, but it does not store that data itself. In a classic deployment (without a standby NameNode configured for high availability), the NameNode is a single point of failure: when it goes down, the file system goes offline.
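The metadata the NameNode holds can be queried directly. The sketch below, which assumes the hypothetical file from the previous example exists, asks where each block of the file is stored; the answer comes entirely from the NameNode's metadata, without contacting any DataNode.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/user/demo/hello.txt"));

            // Answered from the NameNode's metadata; no DataNode is contacted.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("Block at offset " + block.getOffset()
                        + " stored on " + String.join(", ", block.getHosts()));
            }
        }
    }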

The DataNode is responsible for storing the actual file data in HDFS. It manages the file blocks within its node, reports to the NameNode which blocks it holds, and carries out the NameNode's instructions for block operations. A functional filesystem has more than one DataNode, with data replicated across them.
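That replication is controlled per file. A short sketch, again using the hypothetical path from above: the call itself changes only NameNode metadata, after which the NameNode instructs DataNodes to copy or drop block replicas in the background.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Request three replicas of every block of this file; the
            // NameNode schedules the actual copying across DataNodes.
            fs.setReplication(new Path("/user/demo/hello.txt"), (short) 3);
        }
    }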

Within HDFS, the NameNode manages file system namespace operations like opening, closing, and renaming files and directories. It also maps data blocks to DataNodes, which handle read and write requests from HDFS clients. DataNodes also create, delete, and replicate data blocks according to instructions from the governing NameNode.
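Those namespace operations map directly onto the FileSystem API. A minimal sketch with hypothetical paths: each call below is a metadata operation served by the NameNode, and only the delete eventually touches DataNodes, when the freed blocks are reclaimed.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NamespaceOpsExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Namespace operations handled by the NameNode alone.
            fs.mkdirs(new Path("/user/demo/staging"));
            fs.rename(new Path("/user/demo/hello.txt"),
                      new Path("/user/demo/staging/hello.txt"));

            // Removes the file from the namespace; the NameNode later
            // tells DataNodes to reclaim the file's blocks.
            fs.delete(new Path("/user/demo/staging/hello.txt"), false);
        }
    }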

HDFS Architecture

An HDFS deployment is composed of interconnected clusters of nodes where files and directories reside. An HDFS cluster has a single NameNode that manages the file system namespace and regulates client access to files, and DataNodes that store the file data as blocks.

[Figure: HDFS architecture]

Goals of HDFS

  • Fault detection and recovery. HDFS runs on large numbers of commodity machines, so component failure is the norm rather than the exception; detecting faults and recovering from them quickly and automatically is the core architectural goal.
  • Huge datasets. A single HDFS cluster should scale to hundreds of nodes to serve applications with very large datasets.
  • Simple coherency model. HDFS applications need a write-once-read-many access model for files: a file, once created, written, and closed, need not be changed except for appends (see the sketch after this list).
  • Large files. Because HDFS is tuned for large files, a single instance should support tens of millions of files.
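As an illustration of the coherency model, the snippet below appends to an existing file, which is the one mutation HDFS permits on a closed file. The path is hypothetical, and append support can vary by release and configuration, so treat this as a sketch rather than a definitive recipe.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Under write-once-read-many, appending is the only supported
            // way to change the contents of a closed file.
            try (FSDataOutputStream out = fs.append(new Path("/user/demo/log.txt"))) {
                out.writeBytes("another log line\n");
            }
        }
    }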



