Get Familiar with MongoDB Replica Set Internals: Syncing

By Kristina Chodorow · May 08, 2012

Prerequisites: I’m assuming you know what replica sets are and you’ve configured a set, written data to it, read from a secondary, etc. You understand the terms primary and secondary.

Syncing

When a secondary is operating normally, it chooses a member to sync from (more on that below) and starts pulling operations from the source’s local.oplog.rs collection. When it gets an op (call it W, for “write”), it does three things:

  1. Applies the op
  2. Writes the op to its own oplog (also local.oplog.rs)
  3. Requests the next op

If the db crashes between 1 & 2 and then comes back up, it’ll think it hasn’t applied W yet, so it’ll re-apply it. Luckily (i.e., due to massive amounts of hard work), oplog ops are idempotent: you can apply W once, twice, or a thousand times and you’ll end up with the same document.

For example, if you have a doc that looks like {counter:1} and you do an update like {$inc:{counter:1}} on the primary, you’ll end up with {counter:2} and the oplog will store {$set:{counter:2}}. The secondaries will replicate that instead of the $inc.
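
You can watch this happen from the shell on a primary. This is a sketch: the exact oplog entry layout varies by server version, and it assumes db is the test database:

> db.foo.insert({_id: 1, counter: 1})
> db.foo.update({_id: 1}, {$inc: {counter: 1}})
> db.getSisterDB("local").oplog.rs.find({ns: "test.foo"}).sort({$natural: -1}).limit(1)
{ "ts" : Timestamp(...), "op" : "u", "ns" : "test.foo", "o2" : { "_id" : 1 }, "o" : { "$set" : { "counter" : 2 } } }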

w

To ensure a write is present on, say, two members, you can do:

> db.runCommand({getLastError:1, w:2})

Syntax varies by language (consult your driver's documentation), but it's always an option for writes. The way this works is kind of cool.

Suppose you have a member called primary and another member syncing from it, called secondary. How does primary know where secondary is synced to? Well, secondary is querying primary's oplog for more results. So, if secondary requests an op written at 3pm, primary knows secondary has replicated all ops written before 3pm.

So, it goes like:

  1. Do a write on primary.
  2. Write is written to the oplog on primary, with a field “ts” saying the write occurred at time t.
  3. {getLastError:1,w:2} is called on primary. primary has done the write, so it is just waiting for one more server to get the write (w:2).
  4. secondary queries the oplog on primary and gets the op.
  5. secondary applies the op from time t.
  6. secondary requests ops with {ts:{$gt:t}} from primary's oplog (see the query sketch after this list).
  7. primary notes that secondary has applied everything up to t, because it is requesting ops > t.
  8. getLastError notices that primary and secondary both have the write, so w:2 is satisfied and it returns.
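
Step 6 is an ordinary query with a tailable cursor, so you can approximate it by hand in the shell. This is only a sketch of what the secondary actually does internally (it would run this against its sync source, not its own oplog):

> use local
> var t = db.oplog.rs.find().sort({$natural: -1}).limit(1).next().ts  // the last op we've applied
> db.oplog.rs.find({ts: {$gt: t}}).addOption(DBQuery.Option.tailable).addOption(DBQuery.Option.awaitData)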

Starting up

When you start up a node, it takes a look at its local.oplog.rs collection and finds the latest entry in there. This is called the lastOpTimeWritten: the optime of the latest op this secondary has applied.

You can use this shell helper to get the current last op:

> rs.debug.getLastOpWritten()

The “ts” field is the last optime written.
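
The helper is essentially a convenience; the same document is just the newest entry in the oplog, fetched in reverse natural order:

> use local
> db.oplog.rs.find().sort({$natural: -1}).limit(1)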

If a member starts up and there are no entries in the oplog, it will begin the initial sync process, which is beyond the scope of this post.

Once it has the last op time, it will choose a target to sync from.

Who to sync from

As of 2.0, servers automatically sync from whoever is “nearest” based on average ping time. So, if you bring up a new member it starts sending out heartbeats to all the other members and averaging how long it takes to get a response. Once it has a decent picture of the world, it’ll decide who to sync from using the following algorithm:

for each member that is healthy:
    if member[state] == PRIMARY
        add to set of possible sync targets

    if member[lastOpTimeWritten] > our[lastOpTimeWritten]
        add to set of possible sync targets

sync target = member with the min ping time from the possible sync targets

The definition of a "healthy" member has changed somewhat over the versions, but you can generally think of it as a "normal" member: a primary or secondary. In 2.0, "healthy" debatably includes slave-delayed nodes.
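
For the curious, here's the same logic as illustrative JavaScript. This is purely a sketch: the real implementation is server-side C++, and the members array, its field names, and the numeric optimes are all made up for this example:

function chooseSyncTarget(members, self) {
    var candidates = [];
    members.forEach(function (m) {
        if (!m.healthy) return;                 // "healthy": a normal primary or secondary
        if (m.state === "PRIMARY" ||
            m.lastOpTimeWritten > self.lastOpTimeWritten) {
            candidates.push(m);                 // possible sync target
        }
    });
    // of the candidates, take the one with the lowest average ping time
    candidates.sort(function (a, b) { return a.pingTime - b.pingTime; });
    return candidates[0];                       // undefined if there's no one to sync from
}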

You can see who a server is syncing from by running db.adminCommand({replSetGetStatus:1}) and looking at the “syncingTo” field (only present on secondaries). Yes, yes, it probably should have been syncingFrom. Backwards compatibility sucks.
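
For example, on a secondary:

> db.adminCommand({replSetGetStatus:1}).syncingTo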

Chaining Slaves

The algorithm for choosing a sync target means that slave chaining is semi-automatic: start up a server in a data center and it'll (probably) sync from a server in the same data center, minimizing WAN traffic. (Note that you can't end up with a sync loop, i.e., A syncing from B and B syncing from A, because a secondary can only sync from another secondary with a strictly higher optime.)

One cool thing to implement was making w work with slave chaining. If A is syncing from B and B is syncing from C, how does C know where A is synced to? The way this works is that it builds on the existing oplog-reading protocol.

When A starts syncing from B (or any server starts syncing from another server), it sends a special “handshake” message that basically says, “Hi, I’m A and I’ll be syncing from your oplog. Please track this connection for w purposes.”

When B gets this message, it says, "Hmm, I'm not primary, so let me forward that along to the member I'm syncing from." So it opens a new connection to C and says, "Pretend I'm A; I'll be syncing from your oplog on A's behalf." Note that B now has two connections open to C: one for itself and one for A.

Whenever A requests more ops from B, B sends the ops from its oplog and then forwards a dummy request to C along “A‘s” connection to C. A doesn’t even need to be able to connect directly to C.

A        B        C
           <====>
  <====>   <---->

<====> is a “real” sync connection. The connection between B and C on A’s behalf is called a “ghost” connection (<---->).

On the plus side, this minimizes network traffic. On the negative side, the absolute time it takes a write to get to all members is higher.

Coming soon to a replica set near you…

In 2.2, there will be a new command, replSetSyncFrom, that lets you change who a member is syncing from, bypassing the “choosing a sync target” logic.

> db.adminCommand({replSetSyncFrom:"otherHost:27017"})
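
Assuming a member is actually running at otherHost:27017 and is an eligible sync source, you can confirm the switch with the status command from earlier:

> db.adminCommand({replSetGetStatus:1}).syncingTo
otherHost:27017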
