Revenue and business operations are highly dependent on digital technologies, so it is a no-brainer for any organization to invest in resilient IT systems. Still, oftentimes organizations take shortcuts or do not take adequate measures to ensure durability, business continuity, and disaster recovery. Time after time, IT leaders fail to put disaster recovery processes and systems in place, which results in massive costs to the business.
This week offered a headline-making example of a massive system failure: a power outage brought down Delta Airlines' IT infrastructure, grounding a large portion of its global fleet for hours. The resulting costs are likely a multi-million-dollar hit to the company, combined with a ripple effect of negatives, from losing face as a trusted brand among customers to the legal and regulatory hammers likely to fall on the airline industry. Sadly, this is not an isolated event. American, United, JetBlue, and Southwest Airlines all faced similar situations in the past 24 months.
Because of their real-time, far-reaching impact on consumers and the global economy, airlines are at special risk of making headlines when a technological catastrophe occurs, but it is important to recognize that data loss and system failure events happen every day in organizations across countless industries, with equally dire results.
So, what was the cause? Did Delta not foresee any failure risk or (worse) fail to judge the impact of an outage? Did Delta never consider the multiple ways in which a system error or threat could snowball into system chaos? Whatever the cause, to risk brand and revenue by not investing in an adequate recovery system and process, many costing only a few thousand dollars, reflects unsound decision-making at the highest levels. We can only speculate about the rationale, but (coming from longtime experts in the datacenter industry) here is a list of probable excuses organizations come up with for not incorporating failsafe data protection into their systems and processes:
- It has never happened to us in the past (and hence it will never happen to us).
- We understand the risk but we don’t have budget to put this in place this year.
- We have other higher priorities at the moment (and we will think about it later).
- We use ‘put your favorite cloud provider here’ and hence we are fully protected.
- We have ‘put your favorite product/technology here’ and hence we are fully protected.
- Yes, we have backup systems (but we have never tested them!).
Ensuring fully resilient IT operations is not achieved through a point product or process. Multiple levels of protection mechanisms need to be put in place in a layered approach (for example, availability alone may not be enough; organizations need to complement it with point-in-time recovery solutions for bullet-proof data protection). By themselves, such systems do not guarantee recoverability; they require matching organizational processes to make sure that data can be recovered quickly. The layers include:
- Physical infrastructure layer: redundancy, checksums, clustering
- Storage layer: geo-replication (synchronous, asynchronous), snapshots
- Database layer: clustering (availability), backup & recovery, archival, logs
- Data center layer: active/active or active/passive sites, power supply backups
- Processes: recovery plans, periodic validation and fine-tuning of recovery plans
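To make the layered model concrete, here is a minimal sketch in Python (the layer and mechanism names are illustrative, not taken from any particular product) of an inventory that records which mechanisms are deployed at each layer and flags any layer left unprotected:

```python
# Hypothetical inventory of protection mechanisms per layer.
# An empty list means the layer currently has no protection deployed.
LAYERS = {
    "physical_infrastructure": ["redundancy", "checksums", "clustering"],
    "storage": ["geo_replication_async", "snapshots"],
    "database": ["clustering", "backup_and_recovery", "archival_logs"],
    "data_center": [],  # e.g. no active/passive site configured yet
    "processes": ["recovery_plan"],
}

def unprotected_layers(layers):
    """Return the names of layers that have no protection mechanism."""
    return [name for name, mechanisms in layers.items() if not mechanisms]

print(unprotected_layers(LAYERS))  # → ['data_center']
```

Even a simple audit like this makes gaps visible: here the data center layer has no failover site, exactly the kind of blind spot that turns a power outage into a multi-day incident.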
Additionally, as organizations onboard newer database technologies such as Apache Cassandra, MongoDB, and others, as well as modern deployment models such as public cloud, they need to think holistically about failure resiliency, and not simply rely on cloud provider solutions. With today's patchwork quilt of mix-and-match technologies and deployment models, it is frankly lazy and reckless to depend on legacy systems to guarantee data protection and business continuity. In a nutshell, organizations need to look at new protection solutions for the new era of IT technologies.
This fundamental shift is creating a critical gap in the recovery and data protection solutions underneath modern application architectures, putting enterprises at considerable risk of data loss. To reduce that risk for next-generation application environments, here are some steps organizations can take to develop a reliable data protection and availability strategy:
- Implement redundancy at physical infrastructure level, e.g. backup power supplies, clustered servers, and storage systems.
- Understand the different data protection technologies such as replication, backup & recovery, snapshots, and what each technology offers for protection.
- Know your recovery point and recovery time objectives so you can choose the right data protection product for specific requirements.
- List all possible failure scenarios that may occur in a given environment.
- Map all failure scenarios to one or more data protection systems/products.
- Understand the failure resiliency of the underlying data protection product — no one wants their data protection product to fail when it’s needed most.
- Create a recovery plan and test that plan regularly (every quarter) to make sure people and products work as expected during emergency situations.
- Do due diligence for data protection when onboarding a new technology or a new deployment model.
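The scenario-mapping and RPO/RTO steps above can be sketched in a few lines of Python; the system names, scenarios, and objective numbers below are hypothetical, chosen only to show the shape of the exercise:

```python
from dataclasses import dataclass

@dataclass
class Protection:
    name: str
    rpo_minutes: int   # worst-case data loss (recovery point objective)
    rto_minutes: int   # worst-case downtime (recovery time objective)

# Hypothetical data protection systems and their objectives.
SYSTEMS = {
    "sync_replication": Protection("sync_replication", 0, 5),
    "snapshots": Protection("snapshots", 60, 30),
    "nightly_backup": Protection("nightly_backup", 24 * 60, 4 * 60),
}

# Enumerate failure scenarios and map each to the systems that cover it.
SCENARIO_MAP = {
    "disk_failure": ["sync_replication"],
    "accidental_delete": ["snapshots", "nightly_backup"],
    "site_power_outage": [],  # gap: nothing covers this yet
}

def coverage_gaps(scenario_map):
    """Return failure scenarios no protection system covers."""
    return [s for s, systems in scenario_map.items() if not systems]

def meets_objectives(system, rpo_req_min, rto_req_min):
    """Check whether a system satisfies the required RPO and RTO."""
    return system.rpo_minutes <= rpo_req_min and system.rto_minutes <= rto_req_min

print(coverage_gaps(SCENARIO_MAP))                      # → ['site_power_outage']
print(meets_objectives(SYSTEMS["snapshots"], 120, 60))  # → True
```

Note that replication alone would not catch an accidental delete (it faithfully replicates the deletion), which is why the mapping pairs that scenario with point-in-time mechanisms such as snapshots and backups.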
It’s not hard to point to an event where an organization suffered a system outage for a customer-facing application deployed on a modern, distributed database, because most such outages stem from reliance on archaic data protection processes, many of them never tested. In instances like these, as Delta Airlines has exemplified, it can take more than a day to recover from an outage, at a cost of hundreds of thousands to millions of dollars in revenue and brand equity. Factor in the global chaos a delayed recovery can create, like the outage of a major airline’s datacenter, and it becomes clear that backup is as crucial as, if not more crucial than, the deployment of modernized applications.