Backup, Restore, and Disaster Recovery in Hadoop


Many people don't consider backups because Hadoop replicates data three times by default, and because Hadoop is often a repository for data that already resides in existing data warehouses or transactional systems, so the data can simply be reloaded. That is no longer the only case! Social media data, ML models, logs, third-party feeds, open APIs, IoT data, and other sources may not be reloadable, easily available, or even inside the enterprise at all. This is critical, single-source data that must be backed up and retained.

There are a lot of tools in the open-source space that allow you to handle most of your backup, recovery, replication, and disaster recovery needs. There are also some other enterprise hardware and software options.

Some Options

  • Replication and mirroring with Apache Falcon.

  • Dual ingest or replication via HDF.

  • WANdisco.

  • DistCP.

  • In-memory WAN replication via memory grids (Gemfire, GridGain, Redis, etc.).

  • HBase Replication.

  • Hive Replication.

  • Apache Storm, Spark, and Flink custom jobs to keep clusters in sync.
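Of the options above, HBase replication is one of the simpler ones to stand up, since it is driven from the HBase shell. A minimal sketch, assuming an HBase 1.x-style shell; the peer id, the destination cluster's ZooKeeper quorum, and the table and column family names are all placeholders:

```shell
# On the source cluster: register the DR cluster as a replication peer,
# then mark the column family for replication (scope 1 = replicate).
hbase shell <<'EOF'
add_peer '1', CLUSTER_KEY => "dr-zk1,dr-zk2,dr-zk3:2181:/hbase"
alter 'my_table', {NAME => 'cf', REPLICATION_SCOPE => '1'}
EOF
```

Once the peer is added, edits to the flagged column family ship asynchronously to the peer cluster via the WAL.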

Disaster Recovery

First, see Part 1 and Part 2.

HDFS Snapshots and Distributed Copies

HDFS snapshots and distributed copies should be part of your backup policy. Make sure you leave 10-25% of your space free so that you can keep several snapshots of key directories.
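Taking and restoring a snapshot is a two-step operation with the standard HDFS CLI. A minimal sketch; the directory path, snapshot name, and file name are placeholders:

```shell
# Mark a key directory as snapshottable (requires HDFS admin), then snapshot it.
hdfs dfsadmin -allowSnapshot /data/critical
hdfs dfs -createSnapshot /data/critical backup-2018-01-01

# List snapshottable directories, and restore a file by copying it back out
# of the read-only .snapshot subdirectory.
hdfs lsSnapshottableDir
hdfs dfs -cp /data/critical/.snapshot/backup-2018-01-01/file.csv /data/critical/
```

Snapshots are cheap because they only record deltas, but they live on the same cluster, so pair them with DistCp or mirroring for true disaster recovery.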

Creating a Hadoop archive is pretty straightforward; see the Hadoop Archives documentation.
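A quick sketch of archiving a directory of small files; the source path, archive name, and destination are placeholders:

```shell
# Pack everything under /data/logs into a single logs.har in /archives.
# -p sets the parent path that the archived files are relative to.
hadoop archive -archiveName logs.har -p /data/logs /archives

# The archive is read back through the har:// filesystem scheme.
hdfs dfs -ls har:///archives/logs.har
```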

Distributed Copy (DistCP)

This process is well documented by Hortonworks. DistCp is a simple command-line tool.

hadoop distcp hdfs://nn1:8020/source hdfs://nn2:8020/destination 
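In practice you will usually want an incremental, attribute-preserving copy rather than a full one. A sketch with commonly used DistCp flags; the NameNode hosts and paths are the same placeholders as above:

```shell
# -update copies only files that changed since the last run,
# -p preserves permissions and ownership,
# -m caps the number of parallel map tasks so the copy doesn't
#    saturate the network link between clusters.
hadoop distcp -update -p -m 20 hdfs://nn1:8020/source hdfs://nn2:8020/destination
```

Scheduling this from cron or Oozie against snapshotted source directories gives you a simple, repeatable cluster-to-cluster backup.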

Mirroring Data Sets

You can mirror datasets between clusters with Falcon. Mirroring is a very useful option for enterprises and is well documented; it is also something you may want to have validated by a third party.

Storage Policies

You must define a storage policy covering how many copies of your data you keep, where those copies live, how data ages out, and hot-warm-cold tiering. Management, administrators, and users all need to be part of that discussion.

I like the idea of making backups, disaster recovery copies, and active-active replication, where all important data lands in multiple places or in a write-ahead log. I also like having enough space in an in-memory data store (hot HDFS, Alluxio, Ignite, SnappyData, Redis, Geode, GemFire XD, etc.). As that data ages, it can be written in parallel to several permanent HDFS stores and potentially to cold, off-site storage such as Amazon Glacier that remains available when you need it.
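HDFS itself can enforce part of the hot-warm-cold policy through its built-in storage policies (available since Hadoop 2.6). A sketch, assuming the cluster's DataNodes have ARCHIVE-tagged volumes configured; the path is a placeholder:

```shell
# Assign the built-in COLD policy to aged data and verify it.
hdfs storagepolicies -setStoragePolicy -path /data/archive -policy COLD
hdfs storagepolicies -getStoragePolicy -path /data/archive

# Run the mover so existing blocks are physically relocated to match the policy.
hdfs mover -p /data/archive
```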

Test your backup and restore procedures right after you install your cluster. Backups are a waste of time and space if they don't work and you can't get your data back!

