Rescue 911: Stories from the MongoDB Trenches
These are cautionary tales, but also an opportunity for schadenfreude. All true stories.
1. The first example is of a user who ran w/o journaling. Important to note that journaling isn’t a panacea: OSes can have bugs, hardware can fail, etc.
Journaling is good, but replication is even more good.
So. No journaling. All of the mongod processes were in the same data center. Lost power and everything shut down. When things came up they needed to run `repair`. (This is like fsck for MongoDB). They were warned that it would take a long time (about 12 hrs in their case). They decided to go back into production w/o repairing to avoid the 12 hr wait.
After a few days, things weren’t happy. Mongod was stable but some cursors were aborting when they saw corruption (they were seeing “Invalid BSON” in the log files). They didn’t notice the log file problem but just saw queries aborting. One of the cursors that was aborting was the replication cursor. The corruption was such that mongod could run, but queries were failing after a certain corruption point.
Note: this isn’t guaranteed operational behavior, just the way this particular corruption played out. Ultimately, the fix was to take the system offline and repair. It fixed the problem but took 12 hrs.
tl;dr Use journaling unless you’re absolutely sure you don’t need it.
2. A user decided to shut down a config server (in a sharded deployment) and delete all of it’s data files.
The config server is one of three independent servers that use 2PC to maintain identical copies of their data.
Once the config server was down (because 2 were still up) reads and writes still worked but data rebalancing stopped. At some point they brought down the second and third config servers as well. At the time they had some mongos servers running that probably had config server data cached, but that data isn’t persisted. They had two options:
1. Dump each shard’s data. Set up a new cluster and re-load all of the data.
2. Analyze data on each shard to figure out what the actual ranges of data were on each shard. That could reconstruct the data that was lost. This is a very tricky/hard program to write.
Either of these approaches could lead to a problem because sometimes there are multiple copies of data across shards (during migrations from one shard to another, after an aborted migration, etc.). This is normally the job of mongos/config servers. They ended up having to resolve those conflicts manually.
tl;dr Don’t delete your config servers!
3. User was seeing replication lagging and write ops taking forever on the primary. Application was basically unresponsive.
Primary was showing large #s of page faults and a high lock % (time spent holding the write lock).
As it turned out, 100% of their write load consisted of pushing elements onto lists. Number of pushes grew with the square of the number of users. They also didn’t have indexes on userId. So 100% of writes and reads were table scans, and each write action was N^2. This load was saturating the primary, so secondaries couldn’t keep up.
Adding an index fixed immediate issues, but they needed schema/application logic changes for long-term fix.
tl;dr Pay attention to your indexes!
4. A user deployed but vastly underestimated their uptake. Needed to shard within 3 hours of launch.
Going from N shards to N+1 shards is easier for large N. In any case, it’s hard if you’re already overloaded - can’t migrate data. Luckily, they hada few collections, each equally written to.
All they needed to do was set up some replica sets and dump out specific collections to move them off. Then they set up config servers manually. Once they had headroom, they added shards and used a more conventional sharding setup.
tl;dr Get reasonable performance measurements in advance, and a reasonable set of requirements.Need to do capacity planning in advance or over-provision: too hard to add capacity when it’s absolutely needed.