[This post was originally written by Trek Palmer]
Hello, blog readers. Some of you may know that Kyle Kingsbury of Jepsen fame turned the baleful eye of his test framework against NuoDB 1.2. Specifically, the Jepsen test attempts to quantify how a system with durability guarantees behaves in the presence of short-lived network partitions. Unfortunately, the Jepsen tests ran into node instability issues while testing was underway. Since the big push for 2.0.x is for good multiple-datacenter support, we have been looking at this stuff pretty closely (in between all the other stuff we're working hard at here in the NuoDB volcano lair). The Jepsen stuff is all open source and available on github. We've forked Mr. Kingsbury's repo and made some changes to resolve some of the issues that were cropping up.
The first difference is that we didn't use his salticid stuff to setup NuoDB. We were running on top of 5 physical servers, and NuoDB has an easy-to-use set of management tools that's increasing in power and expressiveness with every release. Since our management system is fully scriptable, we just ginned up some scripts to start up nodes (an example in the forked repo is at
setup/nuodb/setupJepsen.sh). No ruby knowledge required, and no strange ncurses dependencies to sort out. Even though NuoDB doesn't require a storage manager running on every node, nor does it require running a transaction engine on every node either; the test setup for the Jepsen blog post was 5 ubuntu containers running one TE and one SM each. Therefore, even though we were running on bare metal rather than LXC containers, we ran with the same configuration for the sake of comparability. Some of the changes to the Jepsen clojure code were to make the node names more configurable, so that the machines didn't have to be named 'n1', 'n2', etc.
Our configuration was 6 ubuntu machines. 5 were running 1 TE and 1 SM each, while the 6th machine was the client simulator, and where we ran Jepsen. For those using the forked code, you just need to map the symbolic names :n1, :n2 etc. to the names of the machines/VMs that you're using. This is a clojure map defined in
The Jepsen Client
The Jepsen application is simulating client load where each client may connect to any of the 5 machines and will attempt to add a unique integer to a set. This is a simplification of real application behavior, but it makes verification of writes after a run super easy. For some reason, the client used for the blog post was operating against a single-row table. The row consisted of an integer column and a CLOB (character blob, basically an unbounded string). The CLOB itself was actually the string representation of a JSON integer set. The way it worked was: a Jepsen client would read in the CLOB, convert it into a clojure map via JSON parsing then stick the unique integer into the map, re-encode it into a JSON string then build a CLOB out of the string and update the single table row to be that CLOB. This worked, insofar as all succeeding writes would have a string version of their integer value in the JSON string. However, it is a strange way to represent a set of integers in a SQL database.
The Postgres client code for Jepsen just uses a table, where each client just inserts a row into the table whose value is the unique integer directly. Since it's easier to write SQL around processing rows in a table rather than strings, we changed the NuoDB client so that it could operate in an 'insert mode' where it would behave more like the Postgres client. This has the advantage of making it easier to inspect the database with the NuoSQL command-line tool afterwards, as well as making the NuoDB client more comparable to the client for the only other SQL database being tested (Postgres). For those interested in running in this mode, just pass "-X insert" as an argument when kicking this off with leiningen.
Dealing With Crashes
Unfortunately for everyone, Mr. Kingsbury's first experience with NuoDB was to have some nodes start failing on him. We take crashes seriously here at NuoDB, so we began investigating right away. The culprit was in the way the Jepsen client's chosen SQL interface was building connections. Jepsen used a clojure SQL module called Korma. It has a nicer LISP-ier feel to it than just doing string manipulation and passing that through to Java JDBC code. I was not familiar with Korma before, and I am by no means an expert; however, what Jepsen does is to build up a host configuration for each of the 5 hosts and then point some subset of the workers at each host. By default, Korma doesn't do pooling, and due to the way the request tasks were being queued, it was possible for the Jepsen client process to build up thousands and thousands of connections (I verified this by hacking the JDBC driver to periodically output high-water marks for open connection counts). It was not uncommon for the Jepsen code to establish 8000+ open JDBC connections in the space of a few dozen seconds.
NuoDB is a distributed system, but it is also a SQL database. That means that connections to NuoDB are more like those to Oracle or Postgres, rather than like those to some REST service. DB connections are stateful, weighty affairs, and because of this, connection pooling has been de riguer in the DB programming community since I wore a younger man's clothes. There is definitely a mismatch between how Jepsen is scheduling work against an abstract 'node', how Korma is modeling a database host and how NuoDB brokers balance client load. After too brief an exploration, I found that simply telling Korma to cap the pool size per host was enough to stop this runaway connection bug. The default in the forked code is 200 per node (that's 1000 database connections).
This is not to say that all the stability issues are somehow Jepsen client problems. It is absolutely a bug in NuoDB that we don't handle connection overflow gracefully. And you may rest assured that we are aware of it and that it is being worked on by top men. However, even if we did gracefully handle clients establishing thousands upon thousands of connections per node, that is almost certainly not the behavior that Jepsen was attempting to simulate. This conclusion is backed up by the fact that the Postgres test established only 1 connection per client 'worker' (there are 5 workers). Therefore, I would argue that the connection pooling fix was reasonable, as that's the recommended way of interacting with a SQL database. Of course, the webby folks out there have a point and a RESTful interface is definitely better in some circumstances. Currently, however, there is no RESTful query interface for NuoDB so we'll have to stick with JDBC connection-based client best practices. Stay tuned to this space for interesting announcements, though.
Dealing with Disconnections
NuoDB's connections are negotiated in a 2-step fashion. First, a client requests that a nearby broker return the address of the 'best' TE. Where 'best' is determined via some load-balancing logic, but is meant to decrease latency or increase performance or both. Then the client establishes a more traditional SQL connection to the TE that the broker informed the client about. As Kyle points out in his Postgres blog post, partition isn't just between server-side processes. Clients can become partitioned from the servers as well. And, because NuoDB nodes may shut themselves down or the hardware they're running on may crash, clients may also have to deal with the SQL connection failing. We recommend that clients that do their own connection management reconnect on java IOExceptions. The J2EE Data Provider we ship with does this automatically under the hood. Kyle's original code actually associated a 65 second timeout with each request, and would fail if the timeout expired without a successful commit or an update/insert failure. On insert/update failure, the client would retry. We just extended it to retry when an IOException is caught by the client as well.
After some additional usability enhancements (configurable username/password stuff, configurable ports, etc), we're now ready to run the new-and-improved NuoDB 2.0.1 against the Jepsen partitioning machine. At this point, one should be able to setup a 5 machine database and run Jepsen with something like
lein run nuodb -n 2000 -f noop -u <uname> -p <passwd> -X insert -t <port> , and watch it insert-away without incident. Next up will be a walkthrough of a Jepsen run, and then an overview of how NuoDB detects silent network failures. However, before I end this post, I'd like to share the result of a run I did in the background while writing this blog post:
Pretty cool eh? No lost writes, and only those clients connected to the minority partition had any write errors.