The need to address Business Continuity and Disaster Recovery (BCDR) concerns is well known to anyone who runs production systems. This blog introduces HBase’s new backup and restore capabilities, which give HBase the ability to perform full and incremental backups across clusters and into the cloud. When combined with real-time replication, this new incremental backup capability gives HBase the ability to satisfy the most demanding availability needs, even at large data scales.
The State of HBase Data Recovery to Date
For years HBase has provided options that satisfy various disaster recovery (DR) needs, including:
- Table exports within a cluster, which could be copied remotely.
- Table-level snapshots.
- Real-time cross-cluster replication.
Is this enough to satisfy enterprise-grade DR needs at scale? Let’s review the most basic requirements. At a minimum, we must:
- Protect against hardware failure. HDFS gives us good protection by replicating data 3 times, but some risk remains unless we replicate data off the cluster.
- Protect against user mistakes and application errors. Replication technologies will replicate bad data just as quickly as good data, so we need some ability to “rewind”, like a snapshot gives us.
- Protect against complete site outages. This forces some kind of replication strategy.
Replication or table copies alone can’t protect us against the most common failure scenarios. Can we do better with these capabilities? What if we set up a rolling window of snapshots that we export to a remote cluster? While this satisfies many of the needs, it is impractical for large databases because it requires full data copies for each backup rather than allowing copies of just incremental updates.
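To make the cost of that snapshot-based approach concrete, here is a sketch of a rolling "snapshot and export" cycle. The archive cluster URL and table names are hypothetical, and run() echoes each command instead of executing it so the loop can be reviewed without a live cluster:

```shell
#!/bin/sh
# Sketch of the "rolling window of exported snapshots" approach described above.
# run() echoes rather than executes, so this can be inspected without a cluster.
run() { echo "+ $*"; }

REMOTE=hdfs://archive-cluster:8020/hbase-archive   # hypothetical archive cluster
STAMP=$(date +%Y-%m-%d)                            # label for this cycle's window

for TABLE in customer transactions; do
  SNAPSHOT="${TABLE}-${STAMP}"
  # Take the snapshot from the HBase shell, then export it off-cluster in full.
  run "echo \"snapshot '${TABLE}', '${SNAPSHOT}'\" | hbase shell"
  run hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
      -snapshot "${SNAPSHOT}" -copy-to "${REMOTE}"
done
```

Note that every cycle exports the tables in full; that full-copy cost is precisely what incremental backup avoids.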
Incremental Backup and Restore in HBase
A more robust approach to backup and restore was originally proposed in HBASE-7912. Most importantly, this proposal introduced the notion of incremental backup and restore, but it also included the concepts of backup sets for managing related tables and of retaining backup metadata within HBase itself.
Incremental Backup and Restore Approaches
Most likely you will want to choose one of these basic approaches for backup/restore:
Approach 1: Back up within a cluster. This approach is really only suitable for validating your backup and restore procedures (think: test/dev), or if HDFS is itself backed up in some other way.
Approach 2: Back up to a dedicated HDFS cluster. This is appropriate if the archive cluster uses a more economical hardware profile or is in a different failure domain (for example, another data center).
Approach 3: Back up to a cloud storage solution. You may choose to back up to a public cloud provider like Azure, AWS, or Google Cloud. Alternatively, you can back up directly to one of the many on-premises storage arrays, such as EMC Isilon, or any other array that provides HDFS, S3, or other Hadoop-compatible APIs.
Example: Incremental Backup to Amazon S3
Here’s an example showing full and incremental backup of related tables to an Amazon S3 bucket. Let’s assume that we have an application called “green” that includes a table of customers and also keeps track of customer transactions. The transactions could be sales transactions, call detail records, or many other things. The important thing is that these tables need to be backed up and restored as a group to properly support the higher-level applications.
To ensure these tables are backed up and restored as a set we first create a backup set called “green”:
$ hbase backup set add green transactions
$ hbase backup set add green customer
From now on we can just use the backup set name for all operations. The first application backup we do must be a full backup. Here’s what it looks like when backing up to S3:
$ ACCESS_KEY=ABCDEFGHIJKLMNOPQRST
$ SECRET_KEY=0123456789abcdefghijklmnopqrstuvwxyzABCD
$ sudo -u hbase hbase backup create full \
    s3a://$ACCESS_KEY:$SECRET_KEY@prodhbasebackups/backups -set green
Backups and restores should be run as the hbase superuser (which is called “hbase” by default).
When it comes time to do the incremental backup, we run essentially the same procedure:
$ sudo -u hbase hbase backup create incremental \
    s3a://$ACCESS_KEY:$SECRET_KEY@prodhbasebackups/backups -set green
Note that if you are planning to back up to Amazon S3, be aware of HADOOP-3733: if you are running an older version of Hadoop, your secret keys must not contain slash (/) characters.
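One way to sidestep the URL entirely is to put the credentials in Hadoop configuration instead. A minimal core-site.xml fragment, assuming the Hadoop s3a connector is in use (these are the standard s3a property names):

```xml
<!-- Hypothetical core-site.xml fragment: s3a credentials supplied via
     configuration rather than the URL, avoiding the HADOOP-3733 slash issue. -->
<property>
  <name>fs.s3a.access.key</name>
  <value>ABCDEFGHIJKLMNOPQRST</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>0123456789abcdefghijklmnopqrstuvwxyzABCD</value>
</property>
```

With the credentials configured this way, the backup destination shortens to s3a://prodhbasebackups/backups.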
Finally, the restore command in its most basic form looks like this:
$ sudo -u hbase hbase restore -set green \
    s3a://$ACCESS_KEY:$SECRET_KEY@prodhbasebackups/backups backup_1467823988425 \
    -overwrite
The backup_1467823988425 argument is a backup ID that was created when the backup ran. HBase maintains a record of successful backups that can be accessed using the hbase backup history command, which shows the backup IDs, whether each backup was full or incremental, plus other details.
This example is suitable for a complete restore that overwrites all data. There are a number of restore strategies that suit different needs.
Now Our Backup Is Done, What About Restore?
Something went wrong and you’re under pressure to fix it ASAP. Hopefully, you’ve documented and tested your restore strategy. When planning your recovery strategy, you should think through: (1) the amount of data that you’re willing to lose (the recovery point objective, or RPO) and (2) the amount of custom development or scripting that you’re willing to do to minimize data loss.
The Simple Approach: Overwrite
If you’re optimizing for simplicity and predictability, the absolute easiest thing to do is restore from your latest backup and overwrite existing data. This entails a single CLI command and you know exactly how much data you’re going to lose. The simplicity of the approach makes it attractive if the application is not mission critical or if you can re-ingest the data. The S3 example above takes this overwrite approach.
If You Must Have Lower RPO
If you can’t tolerate application downtime or data loss, there are still a few simple options available to you.
Restore from Staging:
If data loss is not acceptable, it is possible to restore a table under a different name, extract the data you need from it, and re-populate the original table. On the plus side, this avoids both downtime and data loss. It comes at a cost in complexity: you must develop the procedures to identify, extract, and apply the data you need.
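The staging flow just described might be sketched as follows. The staging table name is hypothetical, the restore tool’s table-mapping flags (-t for the backed-up table, -m for the restored name) should be checked against your version, and run() echoes each command rather than executing it:

```shell
#!/bin/sh
# Sketch of "restore from staging": restore under a new name, copy back, clean up.
# run() echoes rather than executes, so this can be reviewed without a cluster.
run() { echo "+ $*"; }

BACKUP_DEST=s3a://ACCESS:SECRET@prodhbasebackups/backups   # as in the S3 example
BACKUP_ID=backup_1467823988425

# 1. Restore the backed-up customer table under a staging name (flag spellings
#    for table mapping may differ by HBase version).
run sudo -u hbase hbase restore "$BACKUP_DEST" "$BACKUP_ID" \
    -t customer -m customer_staged

# 2. Copy the rows you need back into the live table with the stock CopyTable
#    job; --starttime/--endtime can narrow the copy to the damaged window.
run sudo -u hbase hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
    --new.name=customer customer_staged

# 3. Drop the staging table once the live table is repaired.
run "echo \"disable 'customer_staged'\" | hbase shell"
run "echo \"drop 'customer_staged'\" | hbase shell"
```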
If your tables are large, it may not be possible to restore them under a different name due to space constraints. In that case you can work with the backup files directly: the really powerful thing about HBase backups is that they are stored as WAL files, which can be parsed through a simple interface, either from Java or with the “hbase wal” utility.
Consider this scenario: A customer rep deleted some data because he thought it was unimportant. A week later the customer is upset because the data was important and you need to restore these few pieces of information. With HBase backups, all you need to do is parse through the backups with a WAL reader and extract the historical values, which you can then add back in. With other databases, you would have to bring another database instance online and load the backups into it. Having backups in open, well-understood formats unlocks many powerful opportunities and can bring recovery times down from days to minutes.
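A sketch of that extraction with the hbase wal utility follows. The WAL path and row key are hypothetical, the flag spellings (-j for JSON output, -p to print cell values, -w to filter by row) should be verified against your HBase version, and run() echoes each command rather than executing it:

```shell
#!/bin/sh
# Sketch: pull the deleted customer's historical cells out of a backed-up WAL.
# run() echoes rather than executes, so this can be reviewed without a cluster.
run() { echo "+ $*"; }

# Hypothetical path to a WAL file inside an incremental backup image.
WAL=/backups/backup_1467823988425/WALs/example.wal
ROW=customer-0042                      # row key the rep deleted

# Dump matching WAL entries: -j emits JSON, -p includes cell values,
# -w restricts output to the given row.
run hbase wal -j -p -w "$ROW" "$WAL"
```

From the JSON output you can recover the pre-delete cell values and re-insert them with ordinary puts.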
Reviewing the Restore Options
The list of options presented here is by no means comprehensive, but it does cover the simplest approaches. Here’s an image summarizing these.
Should You Back Up or Should You Replicate? Yes.
A solid backup strategy is only one part of continuous availability for mission-critical applications. Combining incremental backup to the cloud with replication gives you all the tools you need for true continuous availability.
Incremental backup and restore is currently being developed in the HBASE-7912 branch of Apache HBase and is planned to land in the mainline HBase 2 release. Hortonworks Data Platform users will be able to use Incremental Backup and Restore as a technical preview feature starting in the upcoming HDP 2.5 release.