Working with MongoDB MultiMaster
Learn all about working with MongoDB MultiMaster.
Since starting my own consulting business, I've had the opportunity to work with lots of interesting technologies. Today I wanted to tell you about some interesting technology I got to develop for MongoDB: multi-master replication.
MongoMultiMaster Usage (or tl;dr)
Before getting into the details of how MongoDB MultiMaster works, here's a short intro on how to actually use MongoMultiMaster. Included in the distribution is a command-line tool named mmm. To use mmm, you need to set up a topology for your replication in YAML, assigning UUIDs to your servers (the values of the UUIDs don't matter, only that they're unique). In this case, I've pointed MMM to two servers running locally:
server_a:
  id: '2c88ae84-7cb9-40f7-835d-c05e981f564d'
  uri: 'mongodb://localhost:27019'
server_b:
  id: '0d9c284b-b47c-40b5-932c-547b8685edd0'
  uri: 'mongodb://localhost:27017'
Once this is done, you can view or edit the replication config. Here, we'll clear the config (if any exists) and view it to ensure all is well:
$ mmm -c test.yml clear-config
About to clear config on servers: ['server_a', 'server_b'], are you sure? (yN) y
Clear config for server_a
Clear config for server_b
$ mmm -c test.yml dump-config
=== Server Config ===
server_a (2c88ae84-7cb9-40f7-835d-c05e981f564d) => mongodb://localhost:27019
server_b (0d9c284b-b47c-40b5-932c-547b8685edd0) => mongodb://localhost:27017
=== server_a Replication Config
=== server_b Replication Config
Next, we'll set up two replicated collections:
$ mmm -c test.yml replicate --src=server_a/test.foo --dst=server_b/test.foo
$ mmm -c test.yml replicate --src=server_a/test.bar --dst=server_b/test.bar
And confirm they're configured correctly:
$ mmm -c test.yml dump-config
=== Server Config ===
server_a (2c88ae84-7cb9-40f7-835d-c05e981f564d) => mongodb://localhost:27019
server_b (0d9c284b-b47c-40b5-932c-547b8685edd0) => mongodb://localhost:27017
=== server_a Replication Config
=== server_b Replication Config
- test.foo <= server_a/test.foo
- test.bar <= server_a/test.bar
Now, let's make the replication bidirectional:
$ mmm -c test.yml replicate --src=server_b/test.foo --dst=server_a/test.foo
$ mmm -c test.yml replicate --src=server_b/test.bar --dst=server_a/test.bar
And verify that it's correct...
$ mmm -c test.yml dump-config
=== Server Config ===
server_a (2c88ae84-7cb9-40f7-835d-c05e981f564d) => mongodb://localhost:27019
server_b (0d9c284b-b47c-40b5-932c-547b8685edd0) => mongodb://localhost:27017
=== server_a Replication Config
- test.foo <= server_b/test.foo
- test.bar <= server_b/test.bar
=== server_b Replication Config
- test.foo <= server_a/test.foo
- test.bar <= server_a/test.bar
Now we can run the replicator:
$ mmm -c test.yml run
And you get a program that will replicate updates, bidirectionally, between server_a and server_b. So how does this all work, anyway?
Replication in MongoDB
For those who aren't familiar with MongoDB replication, MongoDB uses a modified master/slave approach. To use the built-in replication, you set up a replica set with one or more mongod nodes. These nodes then elect a primary node to handle all writes; the other servers become secondaries, which can serve reads but cannot accept writes. The secondaries then replicate all writes on the primary to themselves, staying as closely in sync as possible.
There's lots of cool stuff you can do with replication, including setting up "permanent secondaries" that can never be elected primary (for backup), keeping a fixed delay between a secondary and its primary (for protection against accidental catastrophic user error), making replicas data-center aware (so you can make your writes really safe), etc. But in this article, I want to focus on one limitation of the replica set configuration: it is impossible to replicate data from a secondary; all secondaries read from the same primary.
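For concreteness, here's what such a replica set configuration might look like when initiated from Python. This is a minimal sketch, not anything from MMM; the hostnames and set name are made up:

# Minimal sketch of a replica set config with a hidden "permanent secondary"
# and a delayed member -- hostnames and the set name are invented.
from pymongo import Connection  # pymongo 2.x style, matching the era of this article

config = {
    '_id': 'myReplSet',
    'members': [
        # a normal, electable member
        {'_id': 0, 'host': 'db0.example.com:27017'},
        # priority-zero, hidden member: replicates, but can never become primary
        {'_id': 1, 'host': 'db1.example.com:27017', 'priority': 0, 'hidden': True},
        # delayed member: stays a fixed hour behind the primary
        {'_id': 2, 'host': 'db2.example.com:27017', 'priority': 0, 'slaveDelay': 3600},
    ],
}

conn = Connection('db0.example.com', 27017)
conn.admin.command('replSetInitiate', config)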
The client problem
My client has a system where there is a "local" system and a "cloud" system. On the local system, there is a lot of writing going on, so he initially set up a replica set where the local database was set as primary, with the cloud database set as a permanent (priority-zero) secondary. Though it's a little bit weird to configure a system like this (more often, the primary is remote and you keep a local secondary), it wasn't completely irrational. There were, however, some problems:
- The goal of high availability wasn't really maintained, as the replica set only had a single server capable of being a primary.
- It turned out that only a small amount of data needed to be replicated to the cloud server, and replicating everything (including some image data) was saturating the upstream side of his customers' DSL connections.
- He discovered that he really would like to write to the cloud occasionally and have those changes replicated down to the local server (though these updates would usually be to different collections than the local server was writing to frequently).
What would really be nice to have is a more flexible replication setup where you could
- Specify which collections get replicated to which servers
- Maintain multiple MongoDB "primaries" capable of accepting writes
- Handle bidirectional replication between primaries
First Step: Build Triggers
And so I decided to build a so-called multi-master replication engine for MongoDB. It turns out that MongoDB's method of replication is for the primary to write all of its operations to an oplog. Following the official replication docs on the MongoDB site and some excellent articles by 10gen core developer Kristina Chodorow on replication internals, the oplog format, and using the oplog, along with a bit of reverse-engineering and reading the MongoDB server code, it's possible to piece together exactly what's going on with the oplog.
It turns out that the first thing I needed was a way to get the most recent operations in the oplog and to "tail" it for more. In Python, this is fairly straightforward:
>>> # find everything in the oplog newer than our last checkpoint
>>> spec = dict(ts={'$gt': checkpoint})
>>> q = conn.local.oplog.rs.find(
...     spec, tailable=True, await_data=True)
>>> q = q.sort('$natural')
>>> for op in q:
...     # do some stuff with the operation "op"
What this does is set up a find on the oplog that will "wait" until operations are written to it, similar to the tail -f command on UNIX systems. Once we have the operations, we need to look at the format of operations in the oplog. We're interested in insert, update, and delete operations, each of which has a slightly different format. Insert operations are fairly simple:
{ op: "i", // operation flag
ts: {...} // timestamp of the operation
ns: "database.collection" // the namespace the operation affects
o: {...} // the document being inserted
}
Updates have the following format:
{ op: "u", // operation flag
ts: {...} // timestamp of the operation
ns: "database.collection" // the namespace the operation affects
o: { ... } // new document or set of modifiers
o2: { ... } // query specifier telling which doc to update
b: true // 'upsert' flag
}
Deletes look like this:
{ op: "u", // operation flag
ts: {...} // timestamp of the operation
ns: "database.collection" // the namespace the operation affects
o: { ... } // query specifier telling which doc(s) to delete
b: false // 'justOne' flag -- ignore it for now
}
Once that's all under our belt, we can actually build a system of triggers that will call arbitrary code. In MongoMultiMaster (mmm), we do this by using the Triggers class:
>>> from mmm.triggers import Triggers
>>> t = Triggers(pymongo_connection, None)
>>> def my_callback(**kw):
...     print kw
>>> # fire my_callback for (i)nsert, (u)pdate, and (d)elete ops on test.foo
>>> t.register('test.foo', 'iud', my_callback)
>>> t.run()
Now, if we make some changes to the foo collection in the test database, we can see those oplog entries printed.
Next Step: Build Replication
Once we have triggers, adding replication is straightforward enough. MMM stores its current replication state in an mmm collection in the local database. The format of documents in this collection is as follows:
{ _id: ..., // some UUID identifying the server we're replicating FROM
checkpoint: Timestamp(...), // timestamp of the last op we've replicated
replication: [
{ dst: 'slave-db.collection', // the namespace we're replicating TO
src: 'master-db.collection', // the namespace we're replicating FROM
ops: 'iud' }, // operations to be replicated (might not replicate 'd')
...
]
}
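As an illustration of how that document gets used (again, a sketch with invented helper names rather than MMM's exact code), loading the state and advancing the checkpoint might look like this:

def load_state(conn, source_uuid):
    # fetch the replication state for the server we're replicating FROM
    return conn.local.mmm.find_one({'_id': source_uuid})

def save_checkpoint(conn, source_uuid, ts):
    # record the timestamp of the last oplog entry we replicated, so a
    # restarted replicator can pick up where it left off
    conn.local.mmm.update({'_id': source_uuid}, {'$set': {'checkpoint': ts}})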
The actual replication code is fairly trivial, so I won't go into it except to mention how MMM solves the problem of bidirectional replication.
Breaking Loops
One of the issues that I initially had with MMM was getting bidirectional replication working. Say you have db1.foo and db2.foo set up to replicate to one another. Then you can have a situation like the following:
- User inserts doc into db1.foo
- MMM reads insert into db1.foo from oplog and inserts doc into db2.foo
- MMM reads insert into db2.foo from oplog and inserts doc into db1.foo
- Rinse & repeat...
In order to avoid this problem, operations in MMM are tagged with their source server's UUID. Whenever MMM sees a request to replicate an operation originating on db1 back to db1, it drops it, breaking the loop:
- User inserts doc into db1.foo on server DB1
- MMM reads insert into db1.foo and inserts doc into db2.foo on DB2, tagging the document with the additional field {mmm:DB1}
- MMM reads the insert into db2.foo, sees the tag {mmm:DB1} and decides not to replicate that document back to DB1.
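In code, the loop-breaking check boils down to something like the following sketch. The mmm field comes from the design described above; the function and variable names are invented:

def should_replicate(op, dst_uuid):
    # drop any operation whose document is already tagged with the destination
    # server's UUID -- that op originated there, so sending it back would loop
    return op.get('o', {}).get('mmm') != dst_uuid

def tag_and_apply(dst_collection, op, src_uuid):
    # tag the document with the UUID of the server where the write originated,
    # so the reverse replication path can recognize it and break the loop
    doc = dict(op['o'])
    doc['mmm'] = src_uuid
    dst_collection.save(doc)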
Conclusion & Caveats
So what I've built ends up being a just-enough solution for the client's problem, but I thought it might be useful to some others, so I released the code on GitHub as mmm. If you end up using it, some things you'll need to be aware of include:
- Replication can fall behind if you're writing a lot. This is not handled at all.
- Replication begins at the time when mmm run was first called. You should be able to stop/start mmm and have it pick up where it left off.
- Conflicts between masters aren't handled; if you're writing to the same document on both heads frequently, you can get out of sync.
- Replication inserts a bookkeeping field into each document to signify the server UUID that last wrote the document. This expands the size of each document slightly.
So what do you think? Is MMM a project you'd be interested in using? Any glaring omissions to my approach? Let me know in the comments below!
Published at DZone with permission of Rick Copeland.