Explaining the usage pages seems to be one of the things is either hit of miss for me. Either people just get it or they struggle with the concept. I have written extensively on this particular topic, so I’ll refer to that post for the details on what exactly pages in a database are.
Voron is currently using 4KB pages. That is pretty much the default setting because everything else also works in units of 4KB. That means that we play nice with requirements for alignment, CPU page sizes, etc. However, 4KB is pretty small, and that leads to trees that have higher depth. The depth of the tree is one of the most major reasons for concern for database performance (the deeper the tree, the more I/O we have to do).
We previously tested using different page sizes (8 KB, 16 KB, and 32 KB) and we saw that our performance decreased as a result. That was surprising and completely contrary to our expectations. However, a short investigation revealed what the problem was. Whenever you modify a value, you dirty up the entire page. That means that we would need to write that entire page back to storage (which means making a bigger write to the journal, then applying a bigger write to the data filed, etc.).
In effect, when increasing the page size to 8KB, we also doubled the amount of I/O that we had to deal with. That was a while ago, and we recently implemented journal diffing as a way to reduce the amount of unnecessary data that we write to disk. A side effect of that is that we no longer had a 1:1 correlation between a dirty page and full page write to disk. That opened up the path to increasing the page sizes. There is still an O(PageSize) cost to doing the actual diffing, of course, but that is memory-to-memory cost and is negligible in compared to the saved I/O.
Actually making the change was both harder and easier than expected. The hard part was that we had to do a major refactoring working to split a shared value. Both the journal and the rest of Voron used the notion of page size. However, while we want the page size of Voron to change, we didn’t want the journal write size to change. That led to a lot of frustration where we had to go over the entire codebase and look at each value and figure out whatever it meant writing to the journal, or pages as they are used in the rest of Voron. I’ve got another post scheduled talking about how you can generate intentional compilation errors to make this easy for you to figure it out.
Once we were past the journal issue, the rest was mostly dealing with places that made silent assumptions on the page size. That can be anything from “the max value we allow here is 512 (because we need to fit at least so many entries in)” to tests that wrote 1,000 values and expected the resulting B+Tree to be of a certain depth.
The results are encouraging, and we can see them mostly on the system behavior with very large data sets, those used to generate very deep trees, and this change reduced them significantly. To give some context, let's assume that we can fit 100 entries per page using 4KB pages.
That means that if we have as little as 2.5 million entries, we’ll have (in the ideal case):
- One root page holding 3 entries.
- Three branch pages holding 250 entries.
- 25,000 leaf pages holding the 2.5 million entries.
With 8 KB pages, we’ll have:
- 1 root page holding 63 entries.
- 12,500 lead pages holding 2.5 million entries.
That is a reducing of a full level. The nice thing about B+Trees is that in both cases, the branch pages are very few and usually reside in main memory already, so you aren’t directly paying for their I/O.
What we are paying for is the search on them.
The cost of searching the 4KB tree is:
- O(log2 of 3) for searching the root page.
- O(log2 of 100) for searching the relevant branch page.
- O(log2 of 100) for searching the leaf page.
In other words, about 16 operations. For the 8 KB page, that would be:
- O(log2 of 63) for searching the root page.
- O(log2 of 200) for searching the leaf page.
It comes to 14 operations, which doesn’t seem like a lot, but a lot of our time goes on key comparisons on the key. Anything helps.
However, note that I said that the situation above was the ideal one. This can only happen if the data was inserted sequentially, which usually isn't. Page splits can cause the tree depth to increase very easily (in fact, that is one of the core reasons why non-sequential keys are so strongly discouraged in pretty much all databases.
However, the large page size allows us to pack many more entries into a single page, and that also reduce the risk of page splits significantly.