We recently had the opportunity to participate in a podcast with Demetrius Malbrough, Chief Data Protection Chef and Next-Generation Backup & Recovery Leader, to discuss why data protection for distributed databases is critical to success in today’s world of cloud and big data.
To listen to the podcast, featuring Jeannie Liou and Shalabh Goyal of Datos IO, please click here. Below, you will find a transcript of the podcast.
1) How is the IT application stack changing and what are some impacts of this change?
As social, mobile, and cloud continue to make their way into the enterprise, we see that organizations are rapidly adopting next-generation applications. These applications gather large amounts of data at a high ingestion rate and process that data in real time to deliver actionable insights.
Some examples of these applications include IoT, analytics, and e-commerce. Most of these applications are deployed either on-premises on distributed databases such as Apache Cassandra, MongoDB, and Apache HBase, or on cloud-native databases such as Amazon DynamoDB, Amazon Aurora, and Google Bigtable.
Some of the impacts are that as these organizations adopt next-generation distributed databases, they also need an ecosystem of data management products to protect the data and extract value from that data.
The challenge is that the distributed databases that support these next-generation applications lack enterprise-class data protection solutions. It’s this gap that’s putting enterprises at risk of data loss and, furthermore, limiting their adoption of this new infrastructure. Enterprises will not be able to onboard their mission-critical applications on distributed databases unless this gap is addressed.
I want to quickly mention that what’s interesting here is that Mitch Betts, in his CIO article, points out that tools that manage missing or inaccurate data are required to actually get value out of big data. And furthermore, one of the customer surveys we conducted showed that 61% of app architects and DBAs believe that the lack of backup and recovery solutions is inhibiting the growth of distributed databases in the enterprise.
2) Why do enterprises need data protection?
I would say there are two key reasons why data protection is important:
- First, to minimize application downtime in the event of data loss due to hardware failures or human error. Human errors, such as fat-fingered commands, occur all the time. In today’s ‘always-on’ world, customers want instant access, and application downtime can be detrimental, leading to loss of customers, damage to the brand, and ultimately loss of revenue.
- Second, there are compliance requirements in certain verticals that require organizations to retain and be able to recover data over its lifetime.
Thus, it is important for enterprises to have the ability to protect, recover and manage their data over its lifecycle.
One other use case that customers care about today is refreshing their test/dev environment with production data to enable continuous development. A backup copy of production data can also be easily used to enable this use case.
3) How are the data protection requirements different for next-generation distributed databases? Why can’t customers use legacy backup and recovery products?
There are fundamental differences in legacy relational databases and the next-generation distributed databases. These differences are resulting in new data protection requirements that cannot be fulfilled by legacy backup and recovery products. Let me go through these new requirements:
- Storage Is No Longer Unified Or Shared
Next-generation databases are scale-out in nature and use commodity storage at the node level (also referred to as DAS, or direct-attached storage), creating a highly distributed storage pool. This distributed storage pool is low cost and high performance, but it is not conducive to using traditional database backup and recovery tools, which rely on a shared storage architecture for a point-in-time backup.
- Next-generation databases are eventually consistent in nature
They are not built around the traditional ACID transaction model, which creates additional challenges when you need a point-in-time backup copy that is consistent cluster-wide. If the backup copy is not consistent, you pay the price at restore time: additional repair can run into hours or days, or anomalies may surface at the application level.
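To make this concrete, here is a minimal Python sketch of the consistency problem (a toy model with hypothetical node names and values, not any vendor's implementation): each node snapshots its own local state while a write is still propagating, so the per-node snapshots do not describe any single point in time for the cluster.

```python
# Toy model: three replicas of one record in an eventually consistent
# store. A write is still propagating when each node takes its local
# snapshot, so no single point-in-time view of the cluster exists in
# the resulting "backup". All names and values are hypothetical.

replicas = {
    "node1": {"balance": 100},
    "node2": {"balance": 100},
    "node3": {"balance": 100},
}

def write(key, value, reached):
    """Apply a write that has so far reached only some replicas."""
    for node in reached:
        replicas[node][key] = value

# A client writes balance=150; by snapshot time it has reached node1 only.
write("balance", 150, reached=["node1"])

# Each node snapshots its own local state independently.
snapshots = {node: dict(state) for node, state in replicas.items()}

# The per-node snapshots disagree: the backup is not cluster-consistent,
# and restoring it means repair work or application-level anomalies.
values = {snap["balance"] for snap in snapshots.values()}
print(sorted(values))  # [100, 150]
```

Restoring such a set of snapshots forces the database to reconcile the divergent replicas afterward, which is exactly the repair cost described above.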
- Deduplication Matters at Scale
With traditional databases, block-level snapshots and block-based deduplication techniques resulted in space efficiency on secondary storage. However, in a distributed scale-out database, replication of data leads to a rapid increase in secondary storage requirements: there are 3-6 copies of every write operation. So, effective data reduction at scale is critical to the economics of data protection in this environment.
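A small Python sketch of the idea (hypothetical file contents, replication factor 3): content-hash deduplication on secondary storage keeps one physical copy per unique content, so backup storage shrinks roughly by the replication factor.

```python
import hashlib

# Toy sketch: a distributed database with replication factor 3 stores
# three identical copies of each data file. Deduplicating by content
# hash on secondary storage keeps one physical copy per unique content.
# File contents here are hypothetical placeholders.

def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Three replicas of the same two files (replication factor 3).
replica_files = [b"sstable-A", b"sstable-B"] * 3

store = {}  # hash -> data: one physical copy per unique content
for data in replica_files:
    store.setdefault(content_hash(data), data)

raw = sum(len(d) for d in replica_files)
deduped = sum(len(d) for d in store.values())
print(raw, deduped)  # 54 18 -> a 3x reduction from deduplicating replicas
```

Real systems deduplicate at a finer granularity than whole files, but the economics are the same: without replica-aware data reduction, secondary storage costs scale with the replication factor, not with the logical data size.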
- Single Points of Failure Do Fail
Legacy data protection approaches have commonly involved a media server, which presents a choke point in the secondary data path and is a single point of failure in the data protection infrastructure. Given the very high ingest rates of modern scale-out databases, this media-server-based approach to data protection is not prudent and may not be capable of keeping up with your rate of data change. In addition, the risk of failure is high given the trend towards commodity infrastructure.
So, to summarize, the new data protection requirements are:
- A new point-in-time backup technique is needed to capture a consistent state across a cluster.
- Backup and recovery needs to be scalable and failure resilient.
- Backups need to be maintained in native formats for data management services, such as search.
- A flexible deployment model is required, whether on-premises or in the public cloud.
4) What are customers doing now and are they doing enough?
Our customer survey revealed that more than half the customers are just relying on database native replication capabilities for data protection. That is clearly not enough. Let me explain.
Most distributed databases today provide native replication to protect data in the event of hardware failures, and thus address availability requirements. Replication, however, does not provide point-in-time backup and recovery capability.
Reputable industry studies have concluded that as many as two-thirds of the data disasters that ultimately lead to application downtime are the result of human error. Because human errors occur randomly and have a big impact on business operations, they cannot be addressed by solutions such as replication. Replication does not provide point-in-time versioning and recovery, so enterprises cannot go back and fix these errors. In fact, if errors are introduced, the database's native replication can lead to almost immediate corruption across all nodes of the database cluster.
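This failure mode can be sketched in a few lines of Python (a toy model with hypothetical node and table names): replication faithfully propagates every operation, including a mistaken delete, so without a point-in-time backup there is no earlier version left to recover.

```python
# Toy sketch: native replication propagates every operation, including
# a mistaken delete, to all replicas. Node and table names are
# hypothetical; this is not any specific database's replication code.

replicas = {f"node{i}": {"users": ["alice", "bob"]} for i in range(1, 4)}

def replicate(op):
    """Apply an operation to every replica, as replication would."""
    for state in replicas.values():
        op(state)

# A fat-fingered command drops the table; replication spreads it
# to every node almost immediately.
replicate(lambda state: state.pop("users", None))

# The data is gone on all nodes, and replication holds no prior version.
print(all(state == {} for state in replicas.values()))  # True
```

The only way back from this state is an independent, versioned point-in-time copy: exactly what replication alone does not give you.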
We have seen some customers implement manual scripted solutions that are fragile and do not scale. In fact, our conversations with noted Apache Cassandra committer Aaron Morton have revealed that, most of the time, these solutions cannot be relied upon to recover data. The specific drawbacks of such scripted solutions are:
- Operationally intensive to manage (they require hiring in-house Cassandra experts)
- Consume a lot of secondary storage as there is no deduplication of replicas
- They are very prone to failures as the database nodes fail all the time
- Very siloed in nature, given that each script is written for one specific database
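For illustration, here is a dry-run sketch in Python of the kind of per-node snapshot script we see in the field, built around Cassandra's `nodetool snapshot` command. The node addresses, snapshot tag, and keyspace name are hypothetical; the sketch only assembles the commands rather than executing them.

```python
# Sketch of a manual, per-node snapshot script (dry run). Node
# addresses, tag, and keyspace are hypothetical. Each nodetool
# invocation snapshots only one node's local data, so the script must
# loop over every node; a single node failing mid-run leaves an
# incomplete backup, and every replica is stored in full (no dedup).

NODES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical cluster
TAG = "nightly-2017-01-01"                    # hypothetical snapshot tag
KEYSPACE = "app_data"                         # hypothetical keyspace

def snapshot_commands(nodes, tag, keyspace):
    """Build one nodetool invocation per node (not executed here)."""
    return [["nodetool", "-h", host, "snapshot", "-t", tag, keyspace]
            for host in nodes]

for cmd in snapshot_commands(NODES, TAG, KEYSPACE):
    print(" ".join(cmd))
```

Even when such a script runs cleanly, the snapshots it produces are local to each node and taken at slightly different moments, which brings back the cluster-consistency problem discussed earlier.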
So, clearly, there is a big gap as far as enterprise, cloud-native data protection for distributed databases is concerned.
5) What is at stake here? Why should this be my priority as a customer?
The stakes are actually quite high. Consider a scenario where a consumer-facing application is running on Apache Cassandra. Due to human error, a member of the IT team accidentally deletes a database, causing major functionality of that application to become unavailable to consumers.
What will happen?
The customers will lose confidence in the organization, and the brand value of the organization will take a massive, unrecoverable hit. How many times have we seen front-page news about customers who could not get their data back? It leads to enterprises losing customers to the competition and suffering significant business losses.
Who will be responsible for the loss? Ultimately, it is the business owner.
6) Given that the stakes are so high, what should the customers do?
As you onboard these next-generation distributed databases, you should plan for the eventuality of data loss. You should think about four things:
- Business loss
How will my business suffer if I lose data? How much downtime can my applications or business handle?
- How much risk do I have?
Do you have a point-in-time recovery solution? Even if you have a manually scripted solution, ask whether you have ever validated data recovery, and whether it will work when you need it.
Will my data protection scale with application and data growth? Can I rely on it when the nodes fail?
- Operational efficiency & productivity
Can I efficiently refresh my test/dev environment to improve my development agility?
How much am I spending for manual effort to develop and maintain my customized backup solution?
- Compliance
Can I prove compliance with regulatory or line-of-business requirements?
If you need help, we have experts at Datos IO who can evaluate your environment for risk of data loss and suggest improvements.
About Demetrius Malbrough
Demetrius Malbrough is the host of the podcast series “Data Protection Gumbo” and the owner of the LinkedIn group “Backup & Recovery Professionals,” which has 18,000+ members worldwide. A graduate of Tennessee State University, Malbrough is currently with Dell as a Data Protection Solutions Architect and holds numerous professional certifications.