DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones AWS Cloud
by AWS Developer Relations
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones
AWS Cloud
by AWS Developer Relations
The Latest "Software Integration: The Intersection of APIs, Microservices, and Cloud-Based Systems" Trend Report
Get the report
  1. DZone
  2. Data Engineering
  3. Databases
  4. MongoDB’s Storage Engine Bit by Bit

MongoDB’s Storage Engine Bit by Bit

Mike Dirolf user avatar by
Mike Dirolf
·
Dec. 12, 11 · Interview
Like (0)
Save
Tweet
Share
12.06K Views

Join the DZone community and get the full member experience.

Join For Free

I made the trip to Santa Clara for 10gen’s big MongoDB conference, MongoSV, and I thought it’d be fun to report back on the event for those of you who couldn’t make the trip.

MongoDB’s Storage Engine Bit by Bit

Presented by Mathias Stearn

Mathias is set to present about MongoDB storage internals and the room is filling up. Somebody asks a question: what is his favorite Mongo feature?. He’s partial to find-and-modify but his favorite is just general ease-of-use. He’s also talking a bit about the new aggregation framework which should be coming out in 2.2. Declarative aggregations: much easier than map reduce and better performance. Another question: what’s the most common mistake? One: thinking MongoDB is the same as a relational DB. Two: thinking MongoDB is totally different from a relational DB. Basic concepts still matter but also can’t think “too relational”. The room is now SRO and the talk is starting.

Mathias is talking about the structure of the files in /data/db. You’ll see mongod.lock, which is a lock file that just contains the PID (a few bytes). There’s also a .ns file per DB, which is 16MB and is basically just a big hash table (more details later). There are also data files: dbname.0, dbname.1, etc. Those files grow up to 2GB, and there’s always a pre-allocated empty file: this is why an almost empty DB will still use >200MB of space. Trading off space for speed.

Now we’re hearing about mmap: an OS feature that maps a file on disk into virtual memory. A common confusion point is that “MongoDB is an in-memory DB”. Mongo maps to virtual memory: doesn’t need to fit in RAM. Now he’s showing a diagram of virtual address space: kernel space, stack, heap, program text, etc. Program text is mmap’ed just like Mongo data files. Showing where the mapped data files live in the virtual address space.

Now Mathias is talking about the 32-bit limitation caused by relying on mmap: roughly 2.5GB of space left over in the address space after kernel, heap, stack, program text, etc. (start w/ 4GB: 2^32). With journaling this is effectively even smaller: don’t use 32-bit. With journaling you still effectively have 64TB of capacity w/ 64-bit.

Describing a doubly-linked list: data structure used a lot in internals. Each element has both a next & prev pointer to allow traversal in either direction.

Now we’re talking about the .ns file: Namespace Details. 16MB divided up into ~1KB buckets. Inside each bucket we have the name of a collection, some stats (size, count, etc.), pointers to first extent & last extent (doubly-linked list), free list (actually an array of bucketed free-lists), and index data. There are namespaces for collections and also one for each index. There’s one special namespace: $freelist for free extents. ~16000 namespaces per database, which is usually enough but can be configured if more is needed.

The pointers are stored as DiskLocs, which is a data structure with a fileNum and an offset: two ints (a 64-bit struct). Note that max offset is 2^32 - 1: same as max file size.

Now we’re talking about data files (.0, .1, etc.). Each data file is broken up into extents: contiguous blocks w/in a file that is owned by a single namespace. Each extent has a location, pointers to the next and previous extents, pointers to the first and last records in the extent, and the length of the extent. Then, obviously, there’s a data section. The first and last record pointers are just 4-byte offsets into the data section. Extents are nice because they keep data local per namespace.

Now on to records. Each record has a length, offset w/in the extent, and then offsets of the next and previous record. Lastly there’s data (BSON objects, b-tree buckets, etc.). How big is the data section? In the BSON case it’s bigger than the object it contains: there’s some padding that gets pre-allocated to try to avoid moves when documents grow. That padding factor is dynamic and calculated by how often moves end up occurring. The more documents move, the larger the padding factor will be. It’s local to the namespace (i.e. index collections never grow). Padding factor can vary between 1 (no padding) and 2 (double data size), starts at 1. The extent offset is necessary to go from a record back to the extent header data.

Raw queries ($natural order) just walk over the next/prev links directly.

Talking about the “swiss-cheese problem”. If documents change sizes a lot you can end up with gaps that are too small to be used by any new documents: hence the new “compact” option to remove those gaps. This is a blocking operation: best to run on a secondary node and then step-down the master. Right now there’s not great metrics about fragmentation: can get some clue by looking at dataSize vs storageSize. If storageSize - lastExtentSize » dataSize then there is a lot of empty space that compact would clean up.

On initial RS sync or re-sync you’ll basically get compaction too.

Now we’re talking about the B-tree implementation. Regularly B-tree w/ a few modifications. Each bucket is 8KB, w/ keys in sorted order. One modification is the way splits happen: rather than always splitting in the middle of the node Mongo will do a 90/10 split if you’re doing incremental / ordered insertions. That can save a lot of space.

Bucket data structure: parent link, link to the right-most child, and key nodes. Key nodes are fixed size blocks, each with a link to the left child, data record, and a key offset. The offset points into a key objects section (BSON object w/ stripped key names and values). That key objects section actually has new format in 2.0 (v1 index format): more compact than BSON. Key nodes are sorted (can do binary search), but key objects are unsorted and not fixed size.

Here’s a link to the entire series of posts.

Source: http://blog.fiesta.cc/post/13975691790/mongosv-live-blog-mongodbs-storage-engine-bit-by-bit

Data file MongoDB Database Extent (file systems) Engine Space (architecture) Data structure

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • 19 Most Common OpenSSL Commands for 2023
  • Java REST API Frameworks
  • File Uploads for the Web (2): Upload Files With JavaScript
  • Use AWS Controllers for Kubernetes To Deploy a Serverless Data Processing Solution With SQS, Lambda, and DynamoDB

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: