Backup, Restore, and Disaster Recovery in Hadoop

Test your backup and restore procedures right after you install your cluster. Backups are a waste of time and space if they don't work and you can't get your data back!

By Tim Spann
Mar. 03, 17 · Opinion

Many people don't consider backups because Hadoop has 3X replication by default. Also, Hadoop is often a repository for data that resides in existing data warehouses or transactional systems, so the data can be reloaded. That is not the only case anymore! Social media data, ML models, logs, third-party feeds, open APIs, IoT data, and other sources of data may not be reloadable, easily available, or in the enterprise at all. So, this is now critical single-source data that must be backed up and stored forever.

There are a lot of tools in the open-source space that allow you to handle most of your backup, recovery, replication, and disaster recovery needs. There are also some other enterprise hardware and software options.

Some Options

  • Replication and mirroring with Apache Falcon.

  • Dual ingest or replication via HDF.

  • WANdisco.

  • DistCP.

  • In-memory WAN replication via memory grids (Gemfire, GridGain, Redis, etc.).

  • HBase Replication.

  • Hive Replication.

  • Apache Storm, Spark, and Flink custom jobs to keep clusters in sync.

Disaster Recovery

First, see Part 1 and Part 2.

HDFS Snapshots and Distributed Copies

HDFS snapshots and distributed copies should be part of your backup policies. Make sure you leave 10-25% space free to make several snapshots of key directories. See the following resources:

  • HDFS architecture.

  • Rolling upgrades.

  • Archival.

  • Hadoop archives and Hadoop archival storage.
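As a minimal sketch of the snapshot workflow, assuming a hypothetical key directory `/data/important` and snapshot name `before-upgrade`: the `run` helper and the `DRY_RUN` variable are illustrative conveniences, not part of Hadoop, and by default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: check free space, then enable and take an HDFS snapshot.
# DRY_RUN=1 (the default here) prints each command instead of running it.
set -eu

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# take_snapshot DIR NAME (both arguments are hypothetical examples below)
take_snapshot() {
  # An administrator must mark the directory as snapshottable first.
  run hdfs dfsadmin -allowSnapshot "$1"
  # Create a named, read-only, point-in-time snapshot.
  run hdfs dfs -createSnapshot "$1" "$2"
  # Snapshot contents appear under DIR/.snapshot/NAME.
  run hdfs dfs -ls "$1/.snapshot/$2"
}

# Report capacity and usage so you can confirm there is headroom left.
run hdfs dfs -df -h /
take_snapshot /data/important before-upgrade
```

Set `DRY_RUN=0` to execute the commands on a host where the `hdfs` client is configured.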

Creating a Hadoop archive is pretty straightforward. See here.
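For illustration, creating and listing an archive might look like the sketch below. The archive name, parent directory, source directory, and destination are hypothetical, and the `run` helper with its `DRY_RUN` switch is an assumption of this example that prints commands by default rather than executing them.

```shell
#!/bin/sh
# Sketch: pack a directory into a Hadoop archive (HAR) and list it.
# DRY_RUN=1 (the default here) prints each command instead of running it.
set -eu

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# make_archive NAME PARENT SRC DEST
make_archive() {
  # Packs PARENT/SRC into DEST/NAME; this launches a MapReduce job.
  run hadoop archive -archiveName "$1" -p "$2" "$3" "$4"
  # Archived files are readable through the har:// scheme.
  run hdfs dfs -ls "har://$4/$1"
}

make_archive logs-2017.har /user/hadoop logs /archives
```

Creating the archive does not delete the source files, so remove them separately once you have verified the archive.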

Distributed Copy (DistCP)

This process is well documented by Hortonworks here. DistCp (version 2) is a simple command-line tool that copies data between clusters or within a cluster:

hadoop distcp hdfs://nn1:8020/source hdfs://nn2:8020/destination 
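Beyond a one-off copy, repeated backup runs usually want an incremental sync. The sketch below shows two common variants, assuming the same hypothetical `nn1` and `nn2` NameNodes as above; the `run` helper and `DRY_RUN` variable are illustrative and print commands by default. The snapshot-based variant assumes both snapshots exist on the source and that the target still matches the older one.

```shell
#!/bin/sh
# Sketch: incremental DistCp runs for recurring backups.
# DRY_RUN=1 (the default here) prints each command instead of running it.
set -eu

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# sync_dirs SRC DST
sync_dirs() {
  # -update copies only files that are missing or differ at the target;
  # -p preserves file status attributes on the copied files.
  run hadoop distcp -update -p "$1" "$2"
}

# sync_snapshot_diff OLD_SNAPSHOT NEW_SNAPSHOT SRC DST
sync_snapshot_diff() {
  # -diff applies only the changes between two source snapshots;
  # it must be combined with -update.
  run hadoop distcp -update -diff "$1" "$2" "$3" "$4"
}

sync_dirs hdfs://nn1:8020/source hdfs://nn2:8020/destination
sync_snapshot_diff s1 s2 hdfs://nn1:8020/source hdfs://nn2:8020/destination
```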

Mirroring Data Sets

You can mirror data sets with Falcon. Mirroring is a very useful option for enterprises and is well documented. You may want to have your mirroring setup validated by a third party. See the following resources:

  • Hive DR with Falcon.

  • Data movement and integration (this overview from Hortonworks is very useful for practical data movement between and within the cluster).

  • Falcon details (in-depth presentation).

Storage Policies

You must define a storage policy that covers how many copies of the data you keep, what you do with each copy, how data ages, and which data counts as hot, warm, or cold. Management, administrators, and users need to discuss these issues.
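Once those decisions are made, HDFS can enforce part of them per directory. The following is a rough sketch, assuming a cluster with heterogeneous storage types configured; the paths and the replication factor are hypothetical, and the `run` helper with `DRY_RUN` is an assumption of this example that prints commands by default.

```shell
#!/bin/sh
# Sketch: apply a storage tier and a copy count to specific directories.
# DRY_RUN=1 (the default here) prints each command instead of running it.
set -eu

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# set_tier PATH POLICY (for example HOT, WARM, or COLD)
set_tier() {
  run hdfs storagepolicies -setStoragePolicy -path "$1" -policy "$2"
  run hdfs storagepolicies -getStoragePolicy -path "$1"
  # Existing blocks only move to the new tier when the mover runs.
  run hdfs mover -p "$1"
}

# set_copies PATH N
set_copies() {
  # -w waits until the new replication factor has been reached.
  run hdfs dfs -setrep -w "$2" "$1"
}

run hdfs storagepolicies -listPolicies
set_tier /data/archive/2015 COLD
set_copies /data/scratch 2
```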

I like the idea of making backups, disaster recovery copies, and active-active replication, where all important incoming data lands in multiple places or in a write-ahead log. I also like having enough space in in-memory data storage (Hot HDFS, Alluxio, Ignite, SnappyData, Redis, Geode, GemfireXD, etc.). When that data ages, it can be written in parallel to many permanent HDFS stores and potentially to cold storage such as Amazon Glacier or something else that is off-site but available.

Test your backup and restore procedures right after you install your cluster. Backups are a waste of time and space if they don't work and you can't get your data back!
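A restore drill can be as small as pulling one file back out of a snapshot and comparing checksums. This is only a sketch: the directory, snapshot name, file, and scratch location are hypothetical, the `run` helper with `DRY_RUN` prints commands by default, and the checksum comparison assumes both copies use the same block size and checksum settings.

```shell
#!/bin/sh
# Sketch: restore one file from a snapshot and verify it.
# DRY_RUN=1 (the default here) prints each command instead of running it.
set -eu

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# restore_drill DIR SNAPSHOT FILE SCRATCH_DIR
restore_drill() {
  run hdfs dfs -mkdir -p "$4"
  # Snapshots are read-only, so restoring is an ordinary copy out of them.
  run hdfs dfs -cp "$1/.snapshot/$2/$3" "$4/$3"
  # Compare the two checksums by eye or in a wrapper script.
  run hdfs dfs -checksum "$1/.snapshot/$2/$3"
  run hdfs dfs -checksum "$4/$3"
}

restore_drill /data/important before-upgrade events.csv /tmp/restore-test
```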

Reference

  • Disaster Recovery and Backup Best Practices.
