A Smattering of HDFS
HDFS has become a key tool for managing pools of Big Data and supporting Big Data analytics applications. Read on to learn more about it.
Hadoop is an open-source framework that allows us to store and process Big Data in a distributed environment across clusters of computers. Its storage layer, the Hadoop Distributed File System (HDFS), has many similarities with existing distributed file systems. However, its differences from other distributed file systems are significant, as it is designed to provide high-performance access to data across Hadoop clusters. Like other Hadoop-related technologies, HDFS has become a key tool for managing pools of Big Data and supporting Big Data analytics applications. It is the primary storage system used by Hadoop applications.
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets. HDFS uses a master/slave architecture in which a single NameNode acts as the master and manages the file system metadata, while one or more slave DataNodes store the actual data.
What Are NameNodes and DataNodes?
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data of these files itself. Unless the cluster is configured for high availability with a standby NameNode (an option since Hadoop 2), the NameNode is a single point of failure for the HDFS cluster: when the NameNode goes down, the file system goes offline.
The DataNode is responsible for storing the files in HDFS. It manages the file blocks within the node. It sends information to the NameNode about the files and blocks stored in that node and responds to the NameNode for all filesystem operations. A functional filesystem has more than one DataNode, with data replicated across them.
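To make the idea of blocks and replication concrete, here is a minimal Python sketch of how a file might be cut into blocks and how replicas might be spread over DataNodes. It is a toy model, not HDFS code. The 128 MiB block size and the replication factor of 3 are assumptions that match the usual defaults of the `dfs.blocksize` and `dfs.replication` settings in recent Hadoop releases, and the round-robin placement is a simplification (real HDFS placement is rack-aware). The function names and the node names `dn1` to `dn4` are hypothetical.

```python
BLOCK_SIZE = 128 * 1024 * 1024   # assumed block size (128 MiB)
REPLICATION = 3                  # assumed replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes, in bytes, of the blocks a file of file_size bytes occupies."""
    if file_size <= 0:
        return []
    full, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover bytes
    return sizes


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes, round-robin."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            datanodes[(block + i) % len(datanodes)] for i in range(replication)
        ]
    return placement


# A 300 MiB file on a hypothetical four-node cluster.
sizes = split_into_blocks(300 * 1024 * 1024)
placement = place_replicas(len(sizes), ["dn1", "dn2", "dn3", "dn4"])
for block, nodes in placement.items():
    print(block, sizes[block], nodes)
```

With these assumptions, the 300 MiB file occupies two full blocks plus one 44 MiB block, and every block lives on three different DataNodes, so losing any single node leaves two copies of each block.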
Within HDFS, the NameNode manages file system namespace operations like opening, closing, and renaming files and directories. The NameNode also maps data blocks to DataNodes, which handle read and write requests from HDFS clients. DataNodes also create, delete, and replicate data blocks according to instructions from the governing NameNode.
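The division of labor described above can be sketched as a small in-memory model. This is an illustration only, not the real NameNode implementation: the class `ToyNameNode`, its method names, and the block IDs are hypothetical. It shows that the NameNode holds metadata (the namespace and the block-to-DataNode map), that a rename changes metadata without moving any block, and that a client asks the NameNode where blocks live before reading them from DataNodes.

```python
class ToyNameNode:
    """Holds metadata only: the namespace and the block-to-DataNode map."""

    def __init__(self):
        self.namespace = {}        # file path -> ordered list of block IDs
        self.block_locations = {}  # block ID -> set of DataNode names

    def create(self, path, block_ids):
        if path in self.namespace:
            raise FileExistsError(path)
        self.namespace[path] = list(block_ids)

    def rename(self, old_path, new_path):
        # A rename touches metadata only; no block moves between DataNodes.
        if new_path in self.namespace:
            raise FileExistsError(new_path)
        self.namespace[new_path] = self.namespace.pop(old_path)

    def block_report(self, datanode, block_ids):
        # A DataNode tells the NameNode which blocks it currently holds.
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def get_block_locations(self, path):
        # A client asks where each block lives, then reads from DataNodes directly.
        return [
            (block_id, sorted(self.block_locations.get(block_id, ())))
            for block_id in self.namespace[path]
        ]


nn = ToyNameNode()
nn.create("/logs/a.txt", ["blk_1", "blk_2"])
nn.block_report("dn1", ["blk_1", "blk_2"])
nn.block_report("dn2", ["blk_1"])
nn.rename("/logs/a.txt", "/logs/b.txt")
print(nn.get_block_locations("/logs/b.txt"))
```

After the rename, the block locations reported for `/logs/b.txt` are the same ones that were registered for `/logs/a.txt`, because only the namespace entry changed.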
An HDFS deployment consists of a cluster of interconnected nodes on which files and directories reside. The cluster has a NameNode that manages the file system namespace and regulates client access to files, and DataNodes that store the file data as blocks.
Goals of HDFS
- Fault detection and recovery. Detection of faults and quick, automatic recovery from them is the core architectural goal.
- Huge datasets. An HDFS cluster should scale to hundreds of nodes in order to serve applications that have huge datasets.
- Simple coherency model. HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed except for appends.
- Large data sets. HDFS is tuned to support large files, and a single instance should support tens of millions of files.
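The first goal, fault detection and automatic recovery, can be illustrated with a small sketch. It assumes that DataNodes send periodic heartbeats to the NameNode, and that a node silent for too long is treated as failed and its blocks are copied to other nodes. The function names, the 30-second timeout, and the choice of new targets (simply the first live nodes in sorted order) are hypothetical simplifications, not HDFS defaults or the actual HDFS algorithm.

```python
HEARTBEAT_TIMEOUT = 30.0  # seconds; an illustrative value, not an HDFS default
TARGET_REPLICAS = 3       # assumed replication factor


def find_dead_nodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return DataNodes whose most recent heartbeat is older than `timeout`."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout}


def plan_rereplication(block_locations, dead, live, target=TARGET_REPLICAS):
    """Return {block_id: [new DataNodes]} needed to restore `target` replicas."""
    plan = {}
    for block_id, holders in block_locations.items():
        survivors = set(holders) - dead
        missing = target - len(survivors)
        if missing <= 0 or not survivors:
            continue  # fully replicated, or no surviving copy to clone from
        candidates = sorted(set(live) - survivors)
        if candidates:
            plan[block_id] = candidates[:missing]
    return plan


# Timestamps in seconds; dn3 has been silent for 61 seconds.
last_heartbeat = {"dn1": 100.0, "dn2": 95.0, "dn3": 40.0, "dn4": 99.0}
dead = find_dead_nodes(last_heartbeat, now=101.0)
live = set(last_heartbeat) - dead
blocks = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn1", "dn2", "dn4"}}
print(dead, plan_rereplication(blocks, dead, live))
```

In this run only `dn3` misses the timeout, so `blk_1` drops to two replicas and is scheduled for a new copy on `dn4`, while `blk_2` is untouched.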
Published at DZone with permission of Manish Mishra, DZone MVB. See the original article here.