I mentioned that fsync is a pretty expensive operation, but it's pretty much required if you need proper durability in case of power loss. Most database systems tend to just call fsync and get away with that, using various backoff strategies to reduce its cost.
LevelDB, for example, will not fsync by default. Instead, it relies on the operating system to handle writes and sync them to disk, and you have to take explicit action to tell it to actually sync the journal file to disk. Most databases give you some level of choice in how you call fsync (MySQL and PostgreSQL, for example, allow you to select fsync, O_DSYNC, none, etc.). MongoDB (using WiredTiger) only flushes to disk every 100 MB (or 2 GB, the docs are confusing), dramatically reducing the cost of flushing at the expense of potentially losing data.
Personally, I find such choices strange, and I had a direct goal that after every commit, pulling the plug will have no effect on the database. We started out using fsync (and its family of friends: fdatasync, FlushFileBuffers, etc.) and quickly realized that this wasn't going to be sustainable. We could only achieve good performance by grouping multiple concurrent transactions and getting them to the disk in one shot (effectively, doing the buffering ourselves). Looking at the state of other databases, it was pretty depressing.
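To make the "grouping multiple concurrent transactions" idea concrete, here is a minimal sketch of a group commit. It is not how any particular engine implements it; the `GroupCommitJournal` class, its `commit` method, and the `fsync_count` counter are all hypothetical names made up for illustration. Committers hand their bytes to a single writer thread, which writes whatever has piled up and pays for one fsync per batch instead of one per transaction.

```python
import os
import threading


class GroupCommitJournal:
    """Hypothetical journal: one writer thread, one fsync per batch of commits."""

    def __init__(self, path):
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self._cond = threading.Condition()
        self._pending = []        # list of (payload, done_event)
        self._closed = False
        self.fsync_count = 0      # only here so the batching is observable
        self._writer = threading.Thread(target=self._run, daemon=True)
        self._writer.start()

    def commit(self, payload):
        """Block until the payload has been written and fsynced."""
        done = threading.Event()
        with self._cond:
            self._pending.append((payload, done))
            self._cond.notify()
        done.wait()

    def _write_all(self, data):
        # os.write may write fewer bytes than requested, so loop until done.
        view = memoryview(data)
        while len(view) > 0:
            written = os.write(self._fd, view)
            view = view[written:]

    def _run(self):
        while True:
            with self._cond:
                while not self._pending and not self._closed:
                    self._cond.wait()
                if not self._pending:
                    return        # closed and fully drained
                # Take everything that queued up while we were busy.
                batch, self._pending = self._pending, []
            self._write_all(b"".join(payload for payload, _ in batch))
            os.fsync(self._fd)    # a single flush covers the whole batch
            self.fsync_count += 1
            for _, done in batch:
                done.set()

    def close(self):
        with self._cond:
            self._closed = True
            self._cond.notify()
        self._writer.join()
        os.close(self._fd)
```

While one fsync is in flight, new commits queue up behind it and get flushed together on the next pass, so the busier the system, the larger the batches. Error handling is left out of this sketch: a real implementation would have to wake the waiting committers with a failure if the write or the fsync raised.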
In an internal benchmark we did, we were in second place, ahead of pretty much everything else. The problem was that the database engine that was ahead of us was 40 times faster. You read that right, it was forty times faster than we were. And that sucked. Monitoring what it did showed that it didn't bother to call fsync. Instead, it used direct unbuffered I/O (FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH on Windows). Those flags have very strict usage rules (specific alignment for both the memory buffer and the position in the file), but the good thing about them is that they allow us to send the data directly from user memory all the way to the disk while bypassing all the caches. That means that when we write a few KB, we write just those few KB, and we don't need to wait for the entire disk cache to be flushed to disk.
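The alignment rules are the part that trips people up, so here is a small sketch of what they mean in practice. Everything in it is an assumption for illustration: the 4096-byte `SECTOR` size (the real value depends on the device), the helper names, and the use of `O_DIRECT | O_DSYNC`, which is only a rough Linux analog of the Windows flags above. The buffer comes from an anonymous `mmap`, which is page-aligned, and both its length and the file offset are rounded to a multiple of the sector size.

```python
import mmap
import os

SECTOR = 4096  # assumed sector size; query the device in real code


def align_up(n, alignment=SECTOR):
    """Round n up to the next multiple of alignment."""
    return (n + alignment - 1) // alignment * alignment


def aligned_buffer(payload, alignment=SECTOR):
    """Copy payload into a page-aligned, zero-padded buffer.

    The buffer length is a multiple of `alignment`. Anonymous mmap
    regions start on a page boundary, which covers the memory
    alignment requirement when the page size is a multiple of it.
    """
    size = align_up(max(len(payload), 1), alignment)
    buf = mmap.mmap(-1, size)          # zero-filled anonymous mapping
    buf[:len(payload)] = payload
    return buf


def write_direct(path, offset, payload):
    """Write payload at a sector-aligned offset, bypassing the OS cache.

    Unix-only sketch (uses os.pwrite). O_DIRECT is missing on some
    platforms and rejected by some filesystems, in which case os.open
    raises OSError or the flag silently falls back to 0 here.
    """
    if offset % SECTOR != 0:
        raise ValueError("file offset must be sector-aligned")
    flags = os.O_WRONLY | os.O_CREAT
    flags |= getattr(os, "O_DIRECT", 0) | getattr(os, "O_DSYNC", 0)
    fd = os.open(path, flags, 0o644)
    try:
        buf = aligned_buffer(payload)
        try:
            return os.pwrite(fd, buf, offset)   # bytes written, padding included
        finally:
            buf.close()
    finally:
        os.close(fd)
```

Note the cost of the padding: a 100-byte entry still occupies a full sector in the journal, so the entry format needs its own length field to tell real data from the zero fill.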
That gave us a tremendous boost. We also compressed the data that we wrote to the journal to reduce the amount of I/O, and again, preallocation and writing in a sequential manner helps quite a lot.
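A rough sketch of how compression, preallocation, and sequential writes fit together is below. The record layout (two little-endian 32-bit lengths followed by the compressed bytes), the function names, and the use of `zlib` are all assumptions chosen to keep the example in the standard library; they are stand-ins, not the format or the compressor the engine actually uses.

```python
import os
import struct
import zlib

# Hypothetical record header: compressed length, then uncompressed length.
HEADER = struct.Struct("<II")


def preallocate(path, size):
    """Reserve `size` bytes for the journal up front."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            os.posix_fallocate(fd, 0, size)   # really allocates the blocks
        except (AttributeError, OSError):
            os.ftruncate(fd, size)            # fallback: may leave a sparse file
    finally:
        os.close(fd)


def append_entry(f, offset, payload):
    """Compress payload, write it at `offset`, and return the next offset."""
    compressed = zlib.compress(payload)
    record = HEADER.pack(len(compressed), len(payload)) + compressed
    f.seek(offset)
    f.write(record)
    f.flush()
    os.fsync(f.fileno())
    return offset + len(record)


def read_entry(f, offset):
    """Read one record back; returns (payload, next_offset)."""
    f.seek(offset)
    compressed_len, raw_len = HEADER.unpack(f.read(HEADER.size))
    payload = zlib.decompress(f.read(compressed_len))
    if len(payload) != raw_len:
        raise ValueError("journal entry is corrupt")
    return payload, offset + HEADER.size + compressed_len
```

Because the file is already at its final size, each append only overwrites space that exists, so the file length never changes during normal operation, and every write lands right after the previous one.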
Note that in this post I'm only talking about writing to the journal, since that is typically what slows down writes. In my next post, I'll talk about writes to the data file itself.