Reviewing Sled: Part III
Look at part three of a review of Sled, an embedded database engine written in Rust.
Join the DZone community and get the full member experience.
Join For FreeUnusually for me, I had a bit of a pause in reviewing Sled. As a reminder, Sled is an embedded database engine written in Rust. I last stopped looking at the buffer management, but I still don’t really have a good grasp of what is going on. You can find Part 1 here and Part 2 here.
The next file is the iterator. It looks like it translates between segments and messages in these segments. Here is the struct:
As you can see, the log iterator holds an iterator of segments, and iterating over it looks like it will go over all the messages in the segments in order. Yep, here is the actual work being done:
The next() method is fairly straightforward, I found. But I have to point out this:
First, the will need call is really interesting. Mostly because you have a pretty obvious way to do conditional compiling that doesn’t really suck. #if is usually much more jarring in the code.
Second, I think that the style of putting really important functions inside an if result in a pretty dense code. Especially since the if is entered only on error. I would have preferred to have it as a stand-alone variable and then check if it failed.
What I don’t understand is the read_segment call. Inside that method, we have:
There are also similar calls on the segment trailer. It looks like we have a single file for the data, but stuff that is too large is held externally in the blob files.
We then get to this guy, which I find to be a really elegant way to handle all the different states.
That is about it for interesting bits in the iterator. The next fun bit is the Log. I do have to admit that I don’t like the term log. It is too easy to get it confused with a debug log.
The fact that you need to figure out where to get the offset of the data you are about to write is really interesting. This is the method that does the bulk of the work:
Note that it just reserved and completed the operation. This also does not flush the data to disk. That is handled by the flusher or by an explicit call. The reserve() method calls to reserve_internal(), and there we find this gem:
I know what it does (conditional compilation), but I find it really hard to follow. Especially because it looks like a mistake with buf being defined twice. This is actually a case where an #if statement would be better in my opinion.
Most of the code in there is to manage calls to the iobuf, which I already reviewed. So I’m going to skip ahead and look at something that is going to be more interesting, the page cache. Sled has an interesting behavior, in that it can shred a page into multiple location, requiring some logic to bring it all back together. That is going to be really interesting to look at I hope.
The file starts with this:
And this…takes a while to unpack. Remember that epoch is a manual GC pattern for concurrent data structure without GC.
The cached_ptr value is a shared pointer to a Node (inside a lock-free stack) that holds a CacheEntry with static lifetime and thread safe to a generic argument that must have static lifetime and be thread-safe. And there is an unsigned long there as well.
No idea yet what is going on. But here is the first method on this struct:
That is a lot. The cache entry is a discriminated union with the following options:
There are some awesome documentation comments here, including a full-blown sample code that really helps me understand what is going on in the code.
There seems to be a few key methods that are involved here:
- allocate(val) — create a new page and write an initial value, gets back a page id.
- link(id, val) — make a write to a page id. Which simply write a value out.
- get(id) — read all the values for a page id, and uses a materializer to merge them all to a single value.
- replace(id, val) — set the page id to the new value, removing all the other values it had.
The idea here, as I gather. Is to allow sequential writes to the data, but allow fast reads, mostly by utilizing SSD’s random read feature.
I’m trying to follow the code, but it is a bit complicated. In particular, we have:
This tries to allocate either a free page or allocate a new one. One of the things that really messes with me here is the use of the term Page. I’m using B+Tree, where a page is literally some section of memory. Here, it refers to something more nebulous. The key point here is that I don’t see where the size is specified. But given that you can link items to page, that sort of make sense. I just need to get used to Pages != storage.
The fact that all of this is generic also makes it hard to follow what is actually going on. I’m getting lost in here, so I think that I’ll stop for now. Stay tuned for the next part!
Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments