An ineffective backup strategy was recently identified as the main culprit for data loss at source-code hub Gitlab.com. Gitlab is hugely popular among developers who appreciate the fact that it’s an all-in-one solution, providing everything a developer needs over the course of a project. At the core is a Git-based version control system, which is paired with helpful extras. As a result, a lot of companies depend on it, ranging from smaller startups and individual developers to larger enterprises like Intel and Red Hat.
Last Tuesday evening, Gitlab.com found that a fatigued system administrator, working late at night in the Netherlands, had accidentally deleted a directory on the wrong server during a database replication process. It appears that he deleted a folder containing 300GB of live production data that was due to be replicated. By the time he canceled the action, only 4.5GB remained, and the last potentially viable backup was taken six hours beforehand.
From the detailed information available on Gitlab.com, it is clear that they knew about possible data protection techniques, ranging from volume snapshots to replication, and backup and recovery. However, it is unclear whether they had the right expertise or data protection and recovery products in house to use these techniques correctly. We have seen this before; enterprises with critical applications either fail to leverage the right tools available for backup and recovery, or they deploy legacy solutions for distributed cloud applications. Worse yet, they simply rely on replication as a backup strategy. From what we know, Gitlab may have experienced data loss for any one of the following reasons:
- Neglecting purpose-built backup and recovery tools in favor of home-grown scripts.
- Delaying the deployment of backup and recovery tools until after a data loss has occurred.
- Not performing a thorough analysis of business requirements, such as application uptime, before choosing a solution that must meet them.
- Never testing a recovery operation, and blindly believing that it will work.
Other reasons exist, but what matters is how they fix the issue. First, though, it is important to know who is ultimately responsible when errors like this happen. Is it the operator who made the mistake, the database administrator who is responsible for the database, the architect who designed the end-to-end application stack, or the application owner who is impacted by business loss?
We all know humans aren’t perfect. We make mistakes, even in our daily IT security practices. What organizations can do to protect themselves from these potential incidents is take a point-in-time backup of mission-critical databases. In its simplest form, this could be a snapshot of all the nodes in a cluster that is transferred to backend storage. However, given the distributed nature of scale-out databases and their frequent hardware failures, patchwork solutions such as node-by-node snapshots become operational nightmares to manage. In the best scenario, it takes several days to recover data, resulting in significant application and business downtime. In the worst scenario, the data may never be recoverable!
That is why a more robust solution is needed to reduce data loss risk for next-generation application environments. Listed below are some steps organizations can take to develop a reliable data protection and availability strategy:
- List all possible failure scenarios that may occur in a given environment. Don’t forget the human errors!
- Understand the failure resiliency of the data protection product — no one wants their data protection product to fail when it’s needed most.
- Know your recovery point objective (RPO) and recovery time objective (RTO) so you can choose the right data protection product for your specific requirements.
- Understand what each available data protection technology (replication, backup and recovery, snapshots) offers, and where its limitations lie.
- Create a recovery plan and test that plan regularly (every quarter) to make sure people and products work as expected during emergency situations.
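To make the RPO step above concrete, here is a minimal Python sketch of the arithmetic involved. The function names, the placeholder timestamp, and the one-hour RPO are hypothetical and chosen only for illustration; the six-hour gap mirrors the age of the last potentially viable backup in the incident described earlier.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, failure_time: datetime) -> timedelta:
    """Anything written after the last usable backup cannot be restored from it."""
    return failure_time - last_backup


def meets_rpo(last_backup: datetime, failure_time: datetime, rpo: timedelta) -> bool:
    """True when the data loss window fits inside the recovery point objective."""
    return data_loss_window(last_backup, failure_time) <= rpo


if __name__ == "__main__":
    failure = datetime(2020, 1, 1, 23, 0)        # arbitrary placeholder timestamp
    last_backup = failure - timedelta(hours=6)   # backup age taken from the incident
    rpo = timedelta(hours=1)                     # hypothetical business requirement
    print(meets_rpo(last_backup, failure, rpo))  # False
```

A check like this is only as good as its definition of "usable": the timestamp that matters is that of the last backup known to restore, not the last one that was scheduled.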
Whatever the cause of a failure, the best way to keep it from harming your organization is to verify your backups by performing regular test restores. Although testing your backups regularly won’t prevent failures, it can help you notice a problem early enough to fix it.
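As a rough illustration of what a test restore can check, the sketch below compares a restored directory against checksums recorded at backup time. It is a simplified, file-level stand-in, not a description of how Gitlab.com or any particular backup product verifies restores; a real database restore would also need to be started and queried. The names `build_manifest` and `verify_restore` are hypothetical.

```python
import hashlib
from pathlib import Path


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reads whole files into memory; acceptable for a sketch only.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def verify_restore(expected: dict[str, str], restored_root: Path) -> list[str]:
    """Return the relative paths that are missing or differ after a restore."""
    actual = build_manifest(restored_root)
    return sorted(p for p, digest in expected.items() if actual.get(p) != digest)
```

In use, the manifest would be recorded when the backup is taken, and `verify_restore` would be run against a scratch location after each scheduled test restore. An empty list means every recorded file came back intact; anything else is a problem found before an emergency rather than during one.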
It’s important to highlight that even in this incident, Gitlab.com showed its dedication to transparency, even on its worst days.