How a Distributed Database Architecture Enabled Kiwi.com To Run Nonstop Through the OVHcloud Fire
Learn how Kiwi.com architected their data infrastructure for extreme resiliency, how their strategy held up when their data center burned down, and lessons they learned.
When a cloud provider’s data center burned to the ground, 3.6 million websites went dark. But one leading travel service kept running without a hitch, thanks to the design of their environment-aware distributed database. Learn how they architected their data infrastructure for extreme resiliency, how their strategy held up, and what lessons they learned.
Disasters can strike any business on any day. This particular disaster, a fire at the OVHcloud Strasbourg data center, struck recently and the investigation and recovery are still ongoing. This is an initial report of one company’s resiliency in the face of that disaster.
Shortly after midnight on Wednesday, March 10, 2021, at 0:47 CET, a fire broke out in a room at the SBG2 data center of OVHcloud, the popular French cloud provider, in the city of Strasbourg. Within hours the fire had been contained, but not before wreaking havoc: it almost entirely destroyed SBG2 and gutted four of the twelve rooms in the adjacent SBG1 data center. Combating the fire also required proactively switching off the two other data centers on the site, SBG3 and SBG4.
Netcraft estimates the disaster knocked out 3.6 million websites spread across 464,000 domains. Of those, 184,000 websites across nearly 60,000 domains were in the French country code Top Level Domain (ccTLD) .FR — about 1 in 50 servers for the entire .FR domain. As Netcraft stated, “Websites that went offline during the fire included online banks, webmail services, news sites, online shops selling PPE to protect against coronavirus, and several countries’ government websites.”
(Image: SDIS du Bas Rhin )
Kiwi.com Keeps Running
However, one company that had its servers deployed in OVHcloud fared better than others: Kiwi.com, the popular online travel site. ScyllaDB, the distributed database Kiwi.com had standardized upon, was designed from the ground up to be highly available and resilient, even in the face of disaster.
Around 01:12 CET, about a half-hour after the fire initially broke out, Kiwi.com’s monitoring dashboards produced alerts as nodes went down and left the cluster. There were momentary traffic spikes as these nodes became unresponsive, but soon the two other OVHcloud European data centers used by Kiwi.com took over requests bound for Strasbourg.
Out of a thirty-node distributed NoSQL cluster, ten nodes suddenly became unavailable. Other than a brief blip around 1:15, Kiwi.com’s database cluster continued working seamlessly. Load on the remaining online nodes rose from ~25% before the outage to ~30-50% three hours later. (Source: Kiwi.com)
Kiwi.com had just lost 10 server nodes out of 30 nodes total, but the remaining database cluster was capable of rebalancing itself and handling the load. Plus, because their database was data center topology-aware and kept multiple copies of data geographically distributed, their database kept running with zero data loss.
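That topology awareness comes down to quorum arithmetic: with one replica of each partition in each of three data centers (a replication factor of 3), losing any single data center leaves two of the three copies reachable, which still satisfies a quorum. A minimal sketch of that arithmetic, with illustrative data center names (not Kiwi.com's actual site list):

```python
def surviving_replicas(replica_dcs, failed_dc):
    """Replicas of a partition still reachable after losing one data center."""
    return sum(1 for dc in replica_dcs if dc != failed_dc)

def quorum(replication_factor):
    """Minimum replicas that must respond for a quorum read or write."""
    return replication_factor // 2 + 1

# One copy per data center, mirroring a three-site layout.
replicas = ["strasbourg", "gravelines", "roubaix"]

for failed in replicas:
    # Losing any single site leaves 2 of 3 copies -- still a quorum.
    assert surviving_replicas(replicas, failed) >= quorum(len(replicas))
```

With this layout, quorum reads and writes keep succeeding through the loss of any one site, which is why the cluster could continue serving requests with zero data loss.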
According to Kiwi.com’s Milos Vyletel, “As we designed our database to be running on three independent locations — every location at least 200 kilometers from another — Kiwi.com survived without any major impact of services.”
OVHcloud’s multi-local infrastructure enabled Kiwi.com to build a robust, scalable, triple-replicated database spanning three data centers in separate locations. The secure OVHcloud vRack linked the three sites over a reliable private network, giving the cluster efficient replication and scalability across locations.
Indeed, Kiwi.com had done their disaster planning years before, even joking about their resiliency by having their initial database cluster launch party in a Cold War-era nuclear fallout shelter. Now their planning, and their technology choice, had paid back in full.
(Image: AP Photo/Jean-Francois Badias)
With the dawning of a new day, load on Kiwi.com’s database picked up, which taxed the remaining servers, yet their database kept performing.
The Road to Disaster Recovery
In total, ten production nodes and two development servers located in SBG2 were lost and are unrecoverable. The next step is to wait for the other OVHcloud SBG buildings to come back online, at which point Kiwi.com will refresh its hardware with new servers. Kiwi.com is also considering using this opportunity to update the servers in its other data centers.
Milos provided this advice from Kiwi.com’s perspective: “One thing we have learned is to test full data center outages on a regular basis. We always wanted to test it on one product, as one of the devs was pushing us to do, but never really had taken the time.”
“Fortunately, we sized our database cluster in a way that two DCs were able to handle the load just fine. We applied the same principles to other (non-Scylla) clusters as well, but over time as new functionality was added we have not been adding new capacity for various reasons — COVID impact being the major one over this last year or so. We are kind of pushing limits on those clusters — we had to do some reshuffling of servers to accommodate for the lost compute power.
“The bottom line is it is more expensive to have data replicated on multiple geographically distributed locations, providing enough capacity to survive a full DC outage, but when these kinds of situations happen it is priceless to be able to get over it with basically no downtime whatsoever.”
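The sizing trade-off Milos describes can be checked with simple arithmetic: if traffic redistributes evenly, losing one of three data centers multiplies per-node load by 3/2. A sketch using the figures reported above (~25% utilization on 30 nodes, 10 nodes lost):

```python
def load_after_loss(nodes_total, nodes_lost, utilization):
    """Per-node utilization after evenly redistributing the work of lost nodes."""
    total_work = nodes_total * utilization          # aggregate work before the outage
    return total_work / (nodes_total - nodes_lost)  # spread over the survivors

# Figures from the outage: 30 nodes at ~25% load, 10 lost in SBG2.
print(load_after_loss(30, 10, 0.25))  # 0.375 -> ~37.5%
```

The result, roughly 37.5%, sits inside the observed 30-50% range, with the morning traffic pickup accounting for the upper end. Running the same formula against a cluster's normal peak utilization is a quick way to check whether it is sized to survive a full DC outage.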
I recently captured Kiwi.com’s experience in this 15-minute video. If you want more detail about what happened or lessons learned, take a look.
Published at DZone with permission of Peter Corless. See the original article here.