A couple of weeks ago, we finished the development of RavenDB. All that remains now is to actually get it out the door, and we take that very seriously.
To the right, you can see one of our servers, which has been running a longevity test for the past two months in a production environment. We currently have a few teams doing nasty stuff to the network and hardware to see how hostile they can make the environment and how RavenDB behaves under these conditions. For the past few days, if I had to go to the bathroom, I needed to watch out for random network cables strewn all over the place as we create ad hoc networks and break them.
We have been able to test some really interesting scenarios this way and uncover some issues. I might post about a few of these in the future, since some of them are interesting. Another team has been busy seeing what kind of effects you can get when you abuse the network at the firewall level, doing everything from random packet drops to reducing the quality of service to almost nothing, and checking whether we recover properly.
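To give a feel for the random packet drop scenario, here is a toy, in-process sketch in Python. It is not how our firewall-level tests are built; `make_lossy_send`, `send_with_retries`, the 50% drop rate, and the retry limit are all made up for illustration. The only point is the shape of the check: inject random failures, then verify that everything still arrives, in order.

```python
import random


def make_lossy_send(drop_rate, rng, delivered):
    """Build a hypothetical send function that randomly drops payloads."""
    def send(payload):
        if rng.random() < drop_rate:
            return False          # simulated packet drop
        delivered.append(payload)  # simulated successful delivery
        return True
    return send


def send_with_retries(send, payload, max_attempts=50):
    """Retry until the payload goes through; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        if send(payload):
            return attempt
    raise ConnectionError(f"gave up on {payload!r} after {max_attempts} attempts")


def run_demo(seed=42):
    rng = random.Random(seed)      # seeded so the run is repeatable
    delivered = []
    send = make_lossy_send(0.5, rng, delivered)  # assumed 50% drop rate
    attempts = [send_with_retries(send, i) for i in range(20)]
    return delivered, attempts


if __name__ == "__main__":
    delivered, attempts = run_demo()
    print("delivered in order:", delivered == list(range(20)))
    print("total attempts:", sum(attempts))
```

A real test would apply the drops on the wire rather than inside the process, but the recovery check at the end looks much the same.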
One of the bugs that we uncovered in this manner happened during the disposal of a connection that had timed out. We would wait for the TCP close synchronously, which turned an error that had already been handled into a freeze of the server.
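The general shape of a fix for this kind of problem is to put a bound on how long disposal may wait for a clean close, and to abort the connection if the bound is exceeded. The following is a minimal Python sketch of that idea, not the actual RavenDB code; `HangingConnection`, `graceful_close`, `abort`, and the 0.1 second timeout are all hypothetical names and values chosen for the example.

```python
import asyncio


class HangingConnection:
    """Hypothetical stand-in for a timed-out connection whose peer never answers."""

    def __init__(self):
        self.aborted = False

    async def graceful_close(self):
        # Simulates a close handshake that never completes.
        await asyncio.Event().wait()

    def abort(self):
        # Drops the connection immediately, without waiting for the peer.
        self.aborted = True


async def dispose(conn, close_timeout=0.1):
    """Dispose a connection, waiting at most close_timeout seconds for a clean close."""
    try:
        await asyncio.wait_for(conn.graceful_close(), timeout=close_timeout)
        return "closed"
    except asyncio.TimeoutError:
        # The peer is not cooperating; give up on the clean close instead of hanging.
        conn.abort()
        return "aborted"


def run_demo():
    conn = HangingConnection()
    result = asyncio.run(dispose(conn))
    return result, conn.aborted


if __name__ == "__main__":
    print(run_demo())
```

With an unbounded wait in place of `asyncio.wait_for`, the `dispose` call in this sketch would never return, which is the same kind of hang described above.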
Yet another team is working on finishing the documentation and smoothing the setup experience. We care very deeply about the “five minutes out of the gate” experience, and it takes a lot of work to ensure that it doesn’t take a lot of work to set up RavenDB properly (and securely).
We are making steady progress, and the list of stuff that we are working on grows smaller every day.
We are now in the last portion: running longevity and failure tests and basically taking the measure of the system. One of the things that I’m really happy about is that we are actively abusing our production system (to the point where, if there were Computer Protective Services, we would probably have CPS take them away), but the overall system is running just fine. For example, this blog has been running on RavenDB 4.0, and the sum total of all issues there after the upgrade was not handling the document ID change. Through the cluster going down, individual machines going crazy or being taken down, network barriers, and so on, it just works!