That is one scary headline, isn’t it? Go into the story and read all about it; there is a plot twist at the end.
This post was written at 05:40 AM, and I have spent the entire night up and awake*.
A customer called me in a state of panic: their database was not loading, and nothing they tried worked. I worked with him for a while on understanding what was going on, and on how to recover from it.
Here is the story as I got it from the customer in question, and only embellished a lot to give the proper context for the story.
It all started with a test failure; in fact, it started with all the tests failing. The underlying reason was soon discovered: the test database disk was completely full, without even enough disk space to put half a byte. The customer took a single look at that, and ran a search on the hard disk to find what was taking so much space. The answer was apparent. The logs directory was full; in fact, the Webster dictionary would need to search far and wide to find a word to express how full it was.
So the customer in question did the natural thing, and hit Shift+Delete to remove all those useless debug logs that had been cluttering the disk. He then started the server again, and was off to the races. Except that there was a slight problem: when trying to actually load the database, the server choked, cried and ended up curled into a fetal position, refusing to cooperate even when a big stick was fetched and waved in its direction.
The problem was that the log files that were deleted were not debug logs. Those were the database transaction logs. Removing them from the disk left the database unable to recover, so it stopped and refused to work.
Now, remember that this is a test server, which explains why developers and operations guys are expected to do stuff to it. But the question was raised, what actually caused the issue? Can this happen in production as well? If it happens, can we recover from it? And more importantly, how can this be prevented?
The underlying reason for the unbounded transaction log growth was that the test server was an exact clone of the production system, and one of the configuration settings that was defined was “enable incremental backups”. When incremental backups are enabled, we can’t delete old journal files; we have to wait for a backup to actually happen, at which point we can copy the log files elsewhere, then delete them. If you don’t back up a database marked with enable incremental backups, it can’t free the disk space, and hence, the problem.
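To make the retention rule concrete, here is a minimal sketch in Python. It is a toy model and not RavenDB’s actual code: the function name, the numbered journals and the parameters are all assumptions made up for illustration.

```python
def journals_safe_to_delete(journal_numbers, last_flushed, last_backed_up,
                            incremental_backup_enabled):
    """Return the journal numbers that may be removed from disk.

    Hypothetical model: journals are numbered in write order,
    `last_flushed` is the newest journal already applied to the data file,
    and `last_backed_up` is the newest journal copied by a backup
    (None if no backup has ever run).
    """
    if incremental_backup_enabled:
        if last_backed_up is None:
            # No backup has happened yet, so nothing can be freed.
            return []
        limit = min(last_flushed, last_backed_up)
    else:
        limit = last_flushed
    return [n for n in journal_numbers if n <= limit]
```

With incremental backups enabled and no backup ever taken, the function returns an empty list no matter how many journals pile up, which is exactly how the test server filled its disk.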
In production, regular backups are being run, and there were no tx log files being retained. But no one bothered to do any maintenance work on the test server, and the configuration explicitly forbade us from automatically handling this situation. But in a safe-by-default mindset, we would do anything for the operations guy to notice it with enough time to do something about it. That’s why for 3.0 we are taking a proactive step toward this case, and we will alert when the database is about to run out of free space.
Now, for the actual corruption issue. Any database makes certain assumptions, and chief among them is that when you write to disk, and actually flush the data, it isn’t going away or being modified behind our back. That assumption can be broken by disk corruption, manual intervention in the files, stupid anti-virus software or just someone randomly deleting files by accident. At that point, all bets are off, and there isn’t much that can be done to protect ourselves from this situation.
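The kind of check I have in mind can be sketched in a few lines of Python. This is only an illustration, not the 3.0 implementation: the function names and the 10% threshold are assumptions, and the only real API used is the standard library’s `shutil.disk_usage`.

```python
import shutil


def low_space_warning(total_bytes, free_bytes, min_free_fraction=0.10):
    """Return a warning message when free space is below the threshold, else None.

    The 10% default is an arbitrary value picked for this sketch.
    """
    free_fraction = free_bytes / total_bytes
    if free_fraction < min_free_fraction:
        return f"Low disk space: only {free_fraction:.1%} free"
    return None


def check_path(path, min_free_fraction=0.10):
    """Check the drive that holds `path` (for example, the database directory)."""
    usage = shutil.disk_usage(path)
    return low_space_warning(usage.total, usage.free, min_free_fraction)
```

Running `check_path` on a timer and sending the message to whoever is on call would have flagged the test server long before the disk hit zero.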
The customer in question? Federico from Corvalius.
Note that this post is pretty dramatized. This is a test server, not a production system, so the guarantees, expectations and behavior toward them are vastly different. The reason for making such a big brouhaha from what is effectively a case of fat fingers is that I wanted to discuss the high availability story with RavenDB.
The general recommendation we are moving toward in 3.0 is that any High Availability story in RavenDB has to take the Shared Nothing approach. In effect, this means that you will not be using technologies such as Windows Clustering, because that relies on a common shared resource, such as the SAN. Issues there can creep up on you (running out of quota space in the SAN can happen very easily) and take down the whole system, even though you spent a lot of time and money on a supposedly highly available solution.
A shared nothing approach limits the potential causes for failure by having multiple nodes that can each operate independently. With RavenDB, this is done using Replication: you define a master/master replication between two nodes, and you can run it with one primary node that your servers usually connect to. At that point, any failure in this node would mean automatic switching over to the secondary, with no downtime. You don’t have to plan for it, you don’t have to configure it, It Just Works.
Now, that is almost true, because you need to be aware that in a split brain situation, you might have conflicts, but you can set a default strategy for that (server X is the authoritative source) or a custom conflict resolution policy.
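The “server X is the authoritative source” strategy is simple enough to sketch. The Python below is a hypothetical illustration, not RavenDB’s conflict resolution API; the function name and the (node, document) pairs are assumptions.

```python
def resolve_conflict(versions, authoritative_node):
    """Pick the winning version of a conflicted document.

    `versions` is a list of (node_name, document) pairs, one per node that
    wrote the document while the nodes could not see each other.
    """
    for node, document in versions:
        if node == authoritative_node:
            return document
    # The authoritative node never wrote this document, so this simple
    # policy cannot decide; a custom policy would have to take over here.
    raise LookupError(f"no version from authoritative node {authoritative_node!r}")
```

A custom policy would replace the loop with whatever rule makes sense for your data, such as merging fields or keeping the latest write.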
Having two nodes means that you always have a hot spare, which can also handle scale-out scenarios by taking some of the load from the primary server if needed.
Beyond replication, we also need to ensure that the data that we have is kept safe. A common request from admins that we heard is that a hot spare is wonderful, but they don’t trust that they have a backup if they can’t put it on a tape and file it on a shelf somewhere. That also helps with issues such as offsite data storage in case you don’t have a secondary data center (if you do, put a replicated node there as well). This may sound paranoid, but consider a batch job that was supposed to delete old customers and deleted all customers instead. You won’t be very happy to realize that this batch delete was actually replicated to your entire cluster and your customer count is set at zero, and then starts declining from there. That is the easy case. A bad case is when you had a bug in your code that wrote bad data over time; you really want to be able to go back to the database as it was two weeks ago, and you can only do that from cold storage.
One way to do that is to actually do backups. The problem with doing that is that you usually go for full backups, which means that you might be backing up tens of GB on every backup, and that is very awkward to deal with. Incremental backups are easier to work with, certainly. But when building Highly Available systems, I usually don’t bother with full backups. I already have the data in one or two additional locations, after all. I don’t care for quick restore at this point, because I can do that on one of the replicated nodes. What I do care about is that I have an offsite copy of the data that I can use if I ever need to. Because time to restore isn’t a factor, but convenience and management are, I would usually go with the periodic export system.
This is what it looks like:
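From the client’s point of view, switching over to the hot spare boils down to trying the nodes in order. Here is a minimal sketch in Python, assuming a hypothetical `fetch(node)` callable that raises `ConnectionError` when a node is unreachable; it illustrates the idea and is not the RavenDB client.

```python
def read_with_failover(nodes, fetch):
    """Try each node in order and return (node, result) from the first that answers.

    `nodes` lists the primary first, then the secondaries.
    `fetch` is a caller-supplied function (hypothetical) that performs the read.
    """
    failures = []
    for node in nodes:
        try:
            return node, fetch(node)
        except ConnectionError as exc:
            # Remember the failure and move on to the next node.
            failures.append((node, str(exc)))
    raise ConnectionError(f"all nodes failed: {failures}")
```

The same ordering trick also covers the scale-out case: put a different node first for the requests you want to offload from the primary.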
The Q drive is a shared network drive, and we do an incremental export to it every 10 minutes and a full export every 3 days.
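The schedule itself is simple. A minimal sketch of the decision in Python follows, using the intervals above; the function and constant names are assumptions for illustration and not the periodic export implementation.

```python
from datetime import datetime, timedelta

INCREMENTAL_EVERY = timedelta(minutes=10)
FULL_EVERY = timedelta(days=3)


def next_export(now, last_incremental, last_full):
    """Return "full", "incremental" or None for the export due at `now`.

    `last_incremental` and `last_full` are the times of the previous exports,
    or None if that kind of export has never run.
    """
    if last_full is None or now - last_full >= FULL_EVERY:
        return "full"
    if last_incremental is None or now - last_incremental >= INCREMENTAL_EVERY:
        return "incremental"
    return None
```

A full export takes priority when both are due, since it already contains everything the incremental one would have written.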
I am aware that this is a pretty paranoid setup: we have multiple nodes holding the data, and exports of the data. Sometimes I even have each node export the data independently, for the “no data loss, ever”
Oh, and about Federico’s issue? While he was trying to see if he could fix the database in the event such a thing happens in production (in the 3 live replicas at once), he was already replicating to the test sandbox from one of the production live replicas. With big databases it will take time, but a high-availability setup allows it. So even if the data file appears to be corrupted beyond repair, everything is good now.
* To be fair, that is because I’m actually at the airport waiting for a plane to take me on vacation, but I thought it was funnier to state it this way.