While working on restoring backups, the developer in charge came up with the following problematic scenario:
- Start restoring the backup of database Northwind on node A, which can take quite some time for a large database.
- Create a database named Northwind on node B while the restore is taking place.
The problem is that the database doesn't exist in a proper form in the cluster until the restore is done. During that time, if an administrator attempts to create a database with the same name, the operation will look like it is working, but it will actually create a new database on all the other nodes and fail on the node where the restore is running.
When the restore completes, it will either remove the previously created database or join it and replicate the restored data to the rest of the nodes, depending on the exact timing of the restore and the new database creation.
Resolving this issue properly would involve coordinating the restore process across the cluster. However, that would also mean sending heartbeats during the restore (to the entire cluster) and handling timeouts and recovery, taking on the burden of some pretty complicated code. Indeed, the first draft of a fix for this issue was fragile enough that it would only work when running on a single node, and would work in cluster mode only in very specific cases.
This is a very rare scenario. It requires an admin (not just a standard user) to do two things that you wouldn't usually expect to happen together, and while the outcome is a bit confusing for whoever manages the database, there isn't any data loss.
The solution was to document that during the restore process you shouldn't create a database with the same name; instead, let RavenDB complete the restore and then let the database span additional nodes. That is a much simpler alternative than diving into distributed-systems reasoning just to handle what is an operator error in the first place.
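To make the documented rule concrete, here is a minimal sketch of the kind of guard a coordinated fix would have required: refuse to create a database whose name matches a restore that is still in progress on any node. This is not RavenDB's implementation (RavenDB deliberately chose documentation over this coordination); the class, node names, and in-memory cluster state are all hypothetical and exist only to illustrate the race.

```python
class ClusterState:
    """Toy in-memory model of a cluster, for illustration only."""

    def __init__(self, nodes):
        # Map node name -> set of database names currently being restored.
        self.restores_in_progress = {node: set() for node in nodes}
        # Map node name -> set of databases that exist on that node.
        self.databases = {node: set() for node in nodes}

    def start_restore(self, node, db_name):
        self.restores_in_progress[node].add(db_name)

    def finish_restore(self, node, db_name):
        self.restores_in_progress[node].discard(db_name)
        self.databases[node].add(db_name)

    def create_database(self, db_name):
        # The guard: if *any* node is restoring a database with this name,
        # reject the create outright instead of partially creating it on
        # the other nodes (the confusing behavior described above).
        for node, restoring in self.restores_in_progress.items():
            if db_name in restoring:
                raise RuntimeError(
                    f"'{db_name}' is being restored on node {node}; "
                    "wait for the restore to complete")
        for node in self.databases:
            self.databases[node].add(db_name)


cluster = ClusterState(["A", "B", "C"])
cluster.start_restore("A", "Northwind")
try:
    # An admin tries to create Northwind on node B mid-restore: rejected.
    cluster.create_database("Northwind")
except RuntimeError as e:
    print(e)
cluster.finish_restore("A", "Northwind")
cluster.create_database("Northwind")  # succeeds once the restore is done
```

Note that even this toy version assumes every node can see a consistent view of `restores_in_progress`; maintaining that view across real nodes is exactly the heartbeat-and-timeout machinery that made the coordinated fix unattractive.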