Some indexing structures are write optimized in that they are better than B-trees at ingesting data. Other indexing structures are read optimized in that they are better than B-trees at query time. Even within B-trees, there is a tradeoff between write performance and read performance. For example, non-clustering B-trees (such as MyISAM and vanilla MongoDB) are typically faster at indexing than clustering B-trees (such as InnoDB), but are then slower at queries.
This post is the first of two about how to understand write optimization, what it means for overall performance, and what the difference is between different write-optimized indexing schemes. We’ll be talking about how to deal with workloads that don’t fit in memory—in particular, if we had our data in B-trees, only the internal nodes (perhaps not even all of them) would fit in memory.
As I’ve already said, there is a tradeoff between write and read speed in B-trees. There is also a mathematically optimal tradeoff that’s been proven between data ingestion and query speed, no matter what data structure you have.
But these are not the same tradeoff! B-trees are not even remotely optimal in terms of read/write tradeoff. I’d say this is the most common confusion I run into on this topic. The way it comes up is in the assumption people seem to make that if you are faster than a B-tree at indexing, you must pay a read penalty. This is simply not true, and I intend to convince you, but today I’ll just describe how I think people arrive at this confusion.
B-trees are not on the optimal insert/query tradeoff curve
What’s the best we can do if we only worry about writes? Or reads? By considering these extreme cases we can get a feel for the space of all possible solutions.
What’s the most write-optimized structure? Simply log all the data. You ingest data at the bandwidth of the storage system, and you can’t do better than that. The problem is that each query now requires all the data to be examined, and that’s a pretty lousy way to get query performance (unless, of course, you intend to look at it all anyway, like MapReduce does).
At the other extreme, you get maximum read performance by re-sorting the data and laying it out optimally for queries (for each index!) every time new data is inserted. The layout is technical, but can be done, and query performance doesn’t get any better than this. But then, insertion speeds are disastrously slow.
For most real workloads, we can’t sacrifice one aspect of performance to benefit another, we need something that works well for both.
A B-tree is one compromise between reads and writes. They’re easy to understand and they’ve been popular for decades, but as the data structure grows, their performance falls to pieces. Heavy use over the years has led to all kinds of refinements, and there are plenty of write optimizations for B-trees we could discuss at length. Here’s a little observation that will help us out:
Let’s examine some write optimizations and see how this observation applies. We’ll also see some drawbacks of each technique.
Insert in sequential order: B-trees are a couple of orders of magnitude faster at inserting data in sequential order compared to random order. This is because once we insert a row into a leaf, that leaf is in memory. Since the next run of rows are destined for the same leaf, they can all be inserted cheaply. According to the disk, it’s almost like you’re just logging the data.
This property of B-trees leads many people to bend over backwards trying to keep their insertions sequential. In many applications, this is not a natural or easy thing to do, and causes all sorts of problems elsewhere in the system.
Use a write buffer: We store up a bunch of insertions in memory. When we have too many, we pick a leaf and write all the rows we’ve stored that are going to that leaf.
This optimization works when we get to pick a bunch of stored rows that are going to the same leaf. When this happens, we see a speedup: you get to accomplish a bunch of work just for touching a leaf once.
You can expect a win of a factor of two or so for this optimization, which is nice, but leaves a factor of more than 100 on the table. The query cost doesn’t take much of a hit in this case. Even though you do have to query the write buffer, it’s in memory so it’s way faster than querying the tree on disk.
OLAP: OLAP is a bunch of things, but with respect to insertions, the idea is to save up a big batch of rows, pre-sort them, and then insert them into an existing B-tree in sorted order. You can think of it like an on-disk version of the insertion buffer. The big win happens when the amount you batch up is of a size comparable to the B-tree you already have, and this gets you the insertion boost that a write buffer can’t achieve.
The downside is that, unlike the in-memory write buffer, the rows you are buffering don’t get indexed—they just get logged, and we already said how bad query performance is for a log. In practice, OLAP users don’t even look at their data until it gets indexed.
So by batching more, you get better insertion speed, but wait longer before your data is available for queries. In this case, you get a write/freshness tradeoff, rather than a write/read tradeoff, but the fundamental reason is the same.
Use fewer indexes: In general, each row inserted must go to the leaf of every index. This technique is just: maintain fewer indexes, so we have less to do. We wanted to maximize the work accomplished per leaf we touch, but now we’re minimizing the number of leaves we need to touch per amount of work (row insertion), but we get the same effect.
This is less like a technique to be employed, and more like an artifact of bad B-tree performance, but it’s probably the most common “optimization.” Its downside makes it the clearest case for true write optimization: if you can’t afford to keep the right set of indexes, you kill your query performance, and this is without a doubt a read/write tradeoff.
These are pretty common tricks for speeding up B-tree insertions. You may have even tried some of them yourself. Each one has a downside, though, and often they’re not obvious in production. Maybe if you didn’t need to insert sequentially for speed, you could simplify your application in a useful way. Or if you thought you could afford more indexes, you’d spend the time to think up cool new ways to analyze your data.
I think the reason people are convinced that B-trees are on the optimal tradeoff curve is pretty simple. Since all of the popular write optimizations are modifications to a B-tree, most people end up stuck on the B-tree tradeoff curve. But contrary to conventional wisdom, it is possible to do a lot better. In the next post, I’ll explain why.