Recently, Dan wrote a great piece on testing network failures with NuoDB's support for geo-distribution. If you haven't read it, then go do that right now. It's cool, and it illustrates pretty clearly how you can tune the rules for durability based on awareness of regions.
We've made our tools region-aware: you can define where your database is running and what the criteria are for acknowledging a commit back to a client. So in Dan's test, he simulated a network partition that cut off the West and East coasts. No writes were lost. Consistency was maintained before, during and after the partition. Like I said, I think that's pretty cool.
The one catch with this test is that it wasn't physically running in Boston, New York and Los Angeles. It was running on systems physically close to each other that were simulating the topology of separate data centers. It proved that our region-aware management features work, but it didn't really stress our software in all the ways that a real deployment might. So yesterday morning I grabbed Dan and asked if we could re-run the same test, this time on Amazon across physically separate locations.
As with Dan's original test, we ran in three regions: Virginia (us-east-1), California (us-west-1) and Oregon (us-west-2). We set up one host in Oregon and two hosts in each of the other two regions to mimic the original layout:
SQL> select address, type, georegion from system.nodes order by georegion;

ADDRESS                        TYPE        GEOREGION
------------------------------ ----------- ----------
184.108.40.206                 Storage     us-east-1
ip-10-180-203-206.ec2.internal Transaction us-east-1
220.127.116.11                 Storage     us-west-1
18.104.22.168                  Transaction us-west-1
22.214.171.124                 Storage     us-west-2
126.96.36.199                  Transaction us-west-2
All instances were m1.xlarge; no additional services or optimizations were used. We spun up a sixth instance in Oregon to drive the jepsen client-load and partition simulation. As with Dan's original test we ran with --commit region:2 to ensure that all commits are made durable in at least two regions. See his previous post for details about how we started each TE/SM.
In this version of the test, Virginia plays the role that LA had before. The East coast region gets partitioned from the West coast. When that happens the two hosts in California and the one in Oregon can continue to talk with each other but the Virginia processes should fail. This time we've got real-world latency and real-world network address translation involved.
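To make the --commit region:2 rule concrete, here is a minimal Python sketch of the idea as described above: a commit only counts as durable once storage hosts in at least two distinct regions have acknowledged it. This is an illustration, not NuoDB's implementation; the host names, the HOST_REGION map and both function names are made up for the example.

```python
from typing import Iterable, Mapping, Set

# Hypothetical host names; only the region names come from the test setup.
HOST_REGION = {
    "sm-virginia": "us-east-1",
    "sm-california": "us-west-1",
    "sm-oregon": "us-west-2",
}


def regions_acknowledging(acks: Iterable[str], host_region: Mapping[str, str]) -> Set[str]:
    """Return the distinct regions of the hosts that acknowledged a commit."""
    return {host_region[host] for host in acks}


def commit_is_durable(acks: Iterable[str], host_region: Mapping[str, str],
                      min_regions: int = 2) -> bool:
    """True once acknowledgements cover at least `min_regions` regions."""
    return len(regions_acknowledging(acks, host_region)) >= min_regions


# West side of the partition: California and Oregon can still reach each other.
print(commit_is_durable(["sm-california", "sm-oregon"], HOST_REGION))  # True

# East side of the partition: Virginia alone covers only one region.
print(commit_is_durable(["sm-virginia"], HOST_REGION))  # False
```

Under this reading of the rule, the two West coast regions can still satisfy a two-region requirement between them, while an isolated single region cannot do so on its own.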
As you can probably guess, the results were exactly what we'd want to see:
0 unrecoverable timeouts
Collecting results.
Writes completed in 200.077 seconds
2000 total
1995 acknowledged
1995 survivors
all 1995 acked writes out of 2000 succeeded. :-)
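If you haven't read a summary like this before, the check behind it can be sketched in a few lines of Python: every write the database acknowledged must still be present once the partition heals. This is a rough sketch of the idea rather than the jepsen code itself; the function name and the toy numbers are made up for illustration and are not the data from our run.

```python
def summarize_writes(attempted, acknowledged, survivors):
    """Compare acknowledged writes against what is in the database afterwards."""
    attempted, acknowledged, survivors = set(attempted), set(acknowledged), set(survivors)
    return {
        "total": len(attempted),
        "acknowledged": len(acknowledged),
        "survivors": len(survivors),
        # Acknowledged but missing afterwards: this is the failure we care about.
        "lost": sorted(acknowledged - survivors),
        # Present afterwards without an acknowledgement: allowed, worth noting.
        "unacknowledged_survivors": sorted(survivors - acknowledged),
    }


# Toy data: 10 writes attempted, 2 never acknowledged, nothing lost.
attempted = range(10)
acknowledged = [0, 1, 2, 3, 4, 5, 6, 7]
survivors = [0, 1, 2, 3, 4, 5, 6, 7]

report = summarize_writes(attempted, acknowledged, survivors)
print(report["lost"])  # []
```

An empty "lost" list is the property that matters: a write that was never acknowledged may fail without breaking any promise, but an acknowledged one has to survive.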
The Virginia processes failed as expected, and could be re-started with no trouble once the network partition was fixed. In the meantime we could re-direct load as needed and maintain consistency and durability of the data. Again, pretty cool.
Expect to see some follow-up discussion about the next tests we're thinking about. We're also going to be putting up some recipes that make it easier for you to try running this set of tests (or other AWS experiments) yourself. In the meantime, we were pretty jazzed to be running real-world, cross-country failure handling tests so we wanted to share our excitement with all y'all.