Resolving the Failure Issue of NameNode

NameNode is a single point of failure for the HDFS cluster. In this article, learn how to resolve the failure issue of NameNode.

By Shubhra Sharma · Jun. 06, 17 · Tutorial

In the article Smattering of HDFS, we learned that NameNode is a single point of failure for the HDFS cluster. Each cluster has a single NameNode, and if that machine becomes unavailable, the whole cluster is unavailable until the NameNode is restarted or brought up on a different machine. In this article, we will learn how to resolve this failure issue.

Issues That Arise When NameNode Fails

While HDFS is running, its metadata (namespace information, block information, and so on) is kept in main memory for fast access. For persistence, it must also be stored on disk. The NameNode stores two types of information:

  1. In-memory fsimage: The latest, most updated snapshot of the HDFS namespace.

  2. editLogs: The sequence of changes made to the filesystem after the NameNode starts.

With a single NameNode, the overall availability of the HDFS cluster is reduced in two major ways:

  1. If the NameNode machine crashes, the cluster is unavailable until the machine is restarted.

  2. If maintenance has to be carried out on the NameNode machine, the cluster incurs downtime for the duration of the task.

Standby NameNode: The Solution to NameNode Failure

The HDFS high availability feature makes it possible to run two NameNodes in the same cluster in an active-passive configuration: if the Active NameNode goes down, the passive NameNode (also known as the Standby NameNode) takes over within a few seconds. At any point in time, one of the NameNodes is in the Active state and the other is in the Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby NameNode acts as a hot standby, maintaining enough state to provide fast failover if necessary.
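The active-passive arrangement can be pictured with a small toy model. This is only a sketch for illustration, not Hadoop code: the `NameNode` class, the `fail_over` function, and the names `nn1` and `nn2` are hypothetical, and a real cluster would also fence the failed node before promoting the standby.

```python
from dataclasses import dataclass


@dataclass
class NameNode:
    """Toy stand-in for a NameNode: a name, an HA state, and a health flag."""
    name: str
    state: str  # "active" or "standby"
    healthy: bool = True


def fail_over(active: NameNode, standby: NameNode) -> NameNode:
    """Promote the standby if the active node is unhealthy.

    Returns whichever node is active afterwards.
    """
    if active.healthy:
        return active  # nothing to do, the active node keeps serving clients
    if not standby.healthy:
        raise RuntimeError("no healthy NameNode available")
    active.state = "standby"  # demote the failed node (simplification)
    standby.state = "active"  # the standby takes over client operations
    return standby


nn1 = NameNode("nn1", "active")
nn2 = NameNode("nn2", "standby")

nn1.healthy = False  # simulate a crash of the active NameNode
current = fail_over(nn1, nn2)
print(current.name, current.state)
```

The point of the model is that exactly one node is active before and after the failover, which is the invariant the article describes.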

For namespace backup, the fsImage is stored along with the editLog. The editLog is like the journal ledger of the NameNode: by replaying it, the in-memory fsImage can be reconstructed. The editLog itself therefore needs to be backed up.
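Reconstructing the namespace from a snapshot plus a log of changes can be sketched as follows. This is a deliberately simplified model, not the real fsImage or editLog format: the operation names (`create`, `add_block`, `rename`, `delete`), the paths, and the block IDs are all made up for the example.

```python
import copy


def replay(fsimage: dict, edit_log: list) -> dict:
    """Apply each logged operation, in order, to a copy of the snapshot."""
    namespace = copy.deepcopy(fsimage)  # leave the stored snapshot untouched
    for op, *args in edit_log:
        if op == "create":
            (path,) = args
            namespace[path] = []  # new file with no blocks yet
        elif op == "add_block":
            path, block_id = args
            namespace[path].append(block_id)
        elif op == "rename":
            src, dst = args
            namespace[dst] = namespace.pop(src)
        elif op == "delete":
            (path,) = args
            del namespace[path]
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace


# Hypothetical snapshot: file path -> list of block IDs
fsimage = {"/data/a.txt": ["blk_1"]}

# Hypothetical changes recorded after the snapshot was taken
edit_log = [
    ("create", "/data/b.txt"),
    ("add_block", "/data/b.txt", "blk_2"),
    ("rename", "/data/a.txt", "/data/old.txt"),
    ("delete", "/data/b.txt"),
]

namespace = replay(fsimage, edit_log)
print(namespace)
```

Because the result depends on every logged operation being present and in order, losing the log means losing all changes made since the snapshot, which is why the log needs its own backup.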

The Gen2 Hadoop architecture provides the Quorum Journal Manager (QJM), a set of at least three machines known as journal nodes, where editLogs are stored for backup. To minimize the time needed to start the passive NameNode when the Active NameNode crashes, the standby machine is pre-configured and ready to take over the NameNode role.
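The "at least three machines" requirement follows from majority-quorum arithmetic: an edit counts as durable once a majority of journal nodes has acknowledged it, so the remaining minority can fail without losing edits. The helper names below are hypothetical and the sketch only shows the arithmetic, not the QJM protocol itself.

```python
def quorum_size(journal_nodes: int) -> int:
    """Smallest number of acknowledgements that forms a majority."""
    return journal_nodes // 2 + 1


def tolerated_failures(journal_nodes: int) -> int:
    """How many journal nodes can be down while a majority is still reachable."""
    return (journal_nodes - 1) // 2


def is_committed(acks: int, journal_nodes: int) -> bool:
    """An edit is treated as committed once a majority has acknowledged it."""
    return acks >= quorum_size(journal_nodes)


for n in (3, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

With three journal nodes, two acknowledgements are enough and one node may fail. With fewer than three nodes, no failure can be tolerated at all.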

[Figure: Standby NameNode]

The Standby NameNode continuously reads the editLogs from the journal nodes and applies them to keep itself up to date. This makes the Standby ready to take over the Active NameNode role in case of failure. All DataNodes are configured to send block reports to both NameNodes, so the Standby also has current block information. As a result, the Standby NameNode can become active within a short time when the Active NameNode fails.
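The effect of sending block reports to both NameNodes can be illustrated with another toy model. The classes, method names, DataNode names, and block IDs here are hypothetical and do not mirror the real DataNode protocol. The sketch only shows that both NameNodes end up with the same view of which DataNode holds which block.

```python
class NameNode:
    """Toy NameNode that only tracks block locations."""

    def __init__(self, name: str):
        self.name = name
        self.block_locations = {}  # block ID -> set of DataNode names

    def receive_block_report(self, datanode: str, block_ids: list) -> None:
        # A full report replaces whatever this DataNode reported earlier.
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)


class DataNode:
    """Toy DataNode that reports its blocks to every configured NameNode."""

    def __init__(self, name: str, block_ids: list, namenodes: list):
        self.name = name
        self.block_ids = block_ids
        self.namenodes = namenodes

    def send_block_report(self) -> None:
        for namenode in self.namenodes:  # active and standby alike
            namenode.receive_block_report(self.name, self.block_ids)


active = NameNode("nn1")
standby = NameNode("nn2")

dn1 = DataNode("dn1", ["blk_1", "blk_2"], [active, standby])
dn2 = DataNode("dn2", ["blk_2"], [active, standby])

for datanode in (dn1, dn2):
    datanode.send_block_report()

print(standby.block_locations == active.block_locations)
```

Since the standby already knows the block locations, it does not have to wait for a fresh round of reports before serving clients after a failover.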

[Figure: cluster]

Published at DZone with permission of Shubhra Sharma. See the original article here.
