This is an interesting post because I'm going to lay down some of the options that we have for concurrency inside the database engine while not really discussing them in depth. In particular, even with concurrency control, you get into… strange situations with databases (see transaction isolation levels and what problems each is trying to solve) because of the nature of the beast.
The ideal is, of course, that each client feels like they are the only client touching the database, and no one can touch the database while it is running. The easiest way to do that is to allow just a single client at a time to enter the database. That is actually something that happens frequently, but in practice, that isn't something that you really want to do. Even embedded databases allow at least multiple readers, so that isn't really something that we'll deal with.
Before we get into the actual concurrency discussion, we first need to figure out what we are talking about with regards to concurrency inside the database engine. The Holy Grail is writers that don't hold up readers and readers that don't hold up writers, allowing both reads and writes to proceed without blocking.
Before we get to the actual complexities involved in implementing that, though, let's see what kind of problems we have after we've solved this. In particular, the issue is when we have multiple clients reading/writing to the same value concurrently. What are we supposed to do then? If we have W1 and W2 both trying to mutate the same record, which one will win? Do we serialize the accesses? What happens if we have a W1 and R2 both modifying and reading from the same record? Until the write transaction is completed, do we give the reader the new value, make it wait?
The classic model in databases used to be that the database would take locks. You can think about them as reader/writer locks whenever it read/wrote a particular value. The release schedule for those locks would impact the transaction isolation level, but that isn't really important for this discussion.
Note that a lock per value is pretty expensive, so one of the things that a database engine would do was to escalate the lock. If it noted that there were too many locks in a particular page, it would escalate to a page lock (and onward until the entire tree/table was locked).
That exposed a pretty nasty issue to users.
If you had a hotspot in your database (recently modified records), it was easy to get into a situation where all the clients were waiting for the same lock, effectively causing a train. Note that in this case, both readers and writers are waiting for each other, and the idea is that we gain concurrency by spreading the locks around in the hope that they won't contend so much.
Another way to handle this is called MVCC (Multi-Versioning Concurrency Control), in this manner, instead of overwriting a record immediately after a change, we keep the old value so readers don't have to wait for the end of the transaction to get it. Writers still need to wait for each other if they modify the same record, but we just ensure that writers and readers don't need to wait for one another. MVCC is a bit complex because you need to maintain multiple concurrent versions, but it is a very common choice today.
But a lot of the complexity around reader and writer locks is actually embedded in the notion of having a conversation with the database. The ability to issue multiple statements to the database in the same connection, with potentially human reactions behind that, makes for a much more complex system. You have to hold all the locks for the duration of the connection. In most newer databases, there is no such concept — a write transaction is held for the duration of a single command (or a batch of commands), which is processed independently and then committed immediately.
Under that scenario, it actually makes a lot more sense to skip the notion of concurrency writers and move to a concurrent readers/single writer mode. In that mode, there can be only a single write transaction, and the only concurrency you have to deal with is with the readers, which can be solved efficiently with MVCC. That makes for a much simpler database design. Combine that with the serial nature of I/O, which databases depend on for durability (more on that in a future post), and this actually makes a lot of sense, since it removes a lot of the complexity from the database code.
RavenDB, LMDB, MongoDB, and CouchDB are all built with a single concurrent writer in mind. In fact, even databases such LevelDB or RocksDB are effectively single writer (they just do concurrent transaction merges).
So, let's talk about transaction merging for a while, shall we? LevelDB in particular is an interesting case because you can use the notion of WriteBatch to write to it and multiple threads can submit WriteBatch at the same time, giving us concurrent writes. The way it is implemented, though, is quite interesting. All those threads submitting all those WriteBatch instances will add their batch to a queue, then compete on a lock. The first one that wins will run through all the WriteBatch in the queue and commit them all.
It is a simple, attractive model, but I seriously dislike it. The problem is that it is you might have two WriteBatches that modify the same record without really having a good way of knowing that. So you can't really reason about it. Because a single lock is taken anyway, I much prefer the single writer model, in which the write transaction knows that it is the only one that can modify things, so it doesn't need to worry about concurrency with other transactions that might have different ideas about what is supposed to be in the database.
When we implemented transaction merging, we did this with explicit optimistic concurrency in mind, and it worked. But it was a complex model, and switching to a single writer made the whole thing much simpler.