MongoDB’s Storage Engine Bit by Bit
I made the trip to Santa Clara for 10gen’s big MongoDB conference, MongoSV, and I thought it’d be fun to report back on the event for those of you who couldn’t make the trip.
Presented by Mathias Stearn
Mathias is set to present about MongoDB storage internals and the room is filling up. Somebody asks a question: what is his favorite Mongo feature? He’s partial to find-and-modify, but his favorite is just the general ease of use. He’s also talking a bit about the new aggregation framework, which should be coming out in 2.2. Declarative aggregations: much easier than map/reduce, with better performance. Another question: what’s the most common mistake? One: thinking MongoDB is the same as a relational DB. Two: thinking MongoDB is totally different from a relational DB. Basic concepts still matter, but you also can’t think “too relationally”. The room is now SRO and the talk is starting.
Mathias is talking about the structure of the files in /data/db. You’ll see mongod.lock, which is a lock file that just contains the PID (a few bytes). There’s also a .ns file per DB, which is 16MB and is basically just a big hash table (more details later). Then there are the data files: dbname.0, dbname.1, etc. Those files grow up to 2GB each, and there’s always a pre-allocated empty file: this is why an almost-empty DB will still use >200MB of space. It’s a trade-off of space for speed.
Now we’re hearing about mmap: an OS feature that maps a file on disk into virtual memory. A common point of confusion is thinking “MongoDB is an in-memory DB”. Mongo maps its files into virtual memory: the data doesn’t need to fit in RAM. Now he’s showing a diagram of the virtual address space: kernel space, stack, heap, program text, etc. Program text is mmap’ed just like Mongo’s data files. He’s showing where the mapped data files live in the virtual address space.
Now Mathias is talking about the 32-bit limitation caused by relying on mmap: starting from 4GB of address space (2^32), only roughly 2.5GB is left over after the kernel, heap, stack, program text, etc. With journaling the effective limit is even smaller: don’t use 32-bit. On 64-bit, even with journaling, you still effectively have 64TB of capacity.
Describing a doubly-linked list: data structure used a lot in internals. Each element has both a next & prev pointer to allow traversal in either direction.
Now we’re talking about the .ns file: Namespace Details. 16MB divided up into ~1KB buckets. Inside each bucket we have the name of a collection, some stats (size, count, etc.), pointers to first extent & last extent (doubly-linked list), free list (actually an array of bucketed free-lists), and index data. There are namespaces for collections and also one for each index. There’s one special namespace: $freelist for free extents. ~16000 namespaces per database, which is usually enough but can be configured if more is needed.
The pointers are stored as DiskLocs, which is a data structure with a fileNum and an offset: two ints (a 64-bit struct). Note that max offset is 2^32 - 1: same as max file size.
Now we’re talking about the data files (.0, .1, etc.). Each data file is broken up into extents: contiguous blocks within a file, each owned by a single namespace. Each extent has its own location, pointers to the next and previous extents, pointers to the first and last records in the extent, and the length of the extent. Then, obviously, there’s a data section. The first and last record pointers are just 4-byte offsets into the data section. Extents are nice because they keep data local per namespace.
Now on to records. Each record has a length, its offset within the extent, and then the offsets of the next and previous records. Lastly there’s the data itself (BSON objects, B-tree buckets, etc.). How big is the data section? In the BSON case it’s bigger than the object it contains: some padding gets pre-allocated to try to avoid moves when documents grow. That padding factor is dynamic and calculated from how often moves actually occur: the more documents move, the larger the padding factor gets. It’s local to the namespace (index records never grow, so index namespaces don’t need padding). The padding factor varies between 1 (no padding) and 2 (double the data size), starting at 1. The extent offset is what lets you go from a record back to the extent header data.
Raw queries ($natural order) just walk over the next/prev links directly.
Talking about the “swiss-cheese problem”. If documents change size a lot you can end up with gaps that are too small to be reused by any new documents: hence the new “compact” command to remove those gaps. It’s a blocking operation: best to run it on a secondary and then step down the master. Right now there aren’t great metrics on fragmentation: you can get some clue by comparing dataSize and storageSize. If storageSize - lastExtentSize ≫ dataSize, then there’s a lot of empty space that compact would clean up.
On an initial replica-set sync or re-sync you’ll basically get compaction for free, too.
Now we’re talking about the B-tree implementation. It’s a regular B-tree with a few modifications. Each bucket is 8KB, with keys kept in sorted order. One modification is the way splits happen: rather than always splitting in the middle of a node, Mongo will do a 90/10 split if you’re doing incremental / ordered insertions. That can save a lot of space.
Bucket data structure: a parent link, a link to the right-most child, and key nodes. Key nodes are fixed-size blocks, each with a link to the left child, a link to the data record, and a key offset. The offset points into a key-objects section (a BSON object with the key names stripped out, just the values). That key-objects section actually has a new format in 2.0 (the v1 index format) that’s more compact than BSON. Key nodes are sorted (so you can binary search), but the key objects are unsorted and not fixed size.
Here’s a link to the entire series of posts.