Unpacking MySQL Semi-Synchronous Replication: Durability, Consistency and Split-Brains
Discussion of durability, consistency, and split-brains, with scenarios requiring higher-level intervention to ensure availability and to avoid split-brains.
MySQL semi-synchronous replication is a plugin mechanism on top of asynchronous replication that can offer better durability and even consistency. It helps in high availability solutions, but can in itself reduce availability. In this article, we will look at some basics and then present scenarios that require higher-level intervention to ensure availability and avoid split-brains.
As a quick recap, semi-synchronous replication is a mechanism where a commit on the primary does not apply the change onto the internal table data and does not respond to the user until the changelog is guaranteed to have been persisted (though not necessarily applied) on a preconfigured number of replicas. We limit our discussion to MySQL 5.7 or equivalent.
I recommend reading this semi-sync blog post by Jean-François Gagné (aka JFG), which illustrates the internals of the semi-sync implementation and debunks some myths about semi-sync. We will overlap a bit with another recommended post by JFG, about high availability and recovery.
Specifically, the primary is configured with rpl_semi_sync_master_wait_for_slave_count, which is one or above, if semi-sync is to be used. An INSERT transaction, for example, will not report being committed (i.e. the user will not get “OK”) until at least that number of replicas have acknowledged receipt of that transaction’s changelog (entries in the binary log). In practice, this means at least the specified number of replicas have written the changelog onto their relay logs where the change is now persisted. The replicas will apply the change later on, normally as soon as they can. For the scope of this post, we assume replicas are not intentionally stopped, and we can also ignore delayed replicas.
The primary waits until rpl_semi_sync_master_timeout, after which it falls back to asynchronous replication mode, committing and responding to the user even if not all expected replicas have acknowledged receipt. In the scope of this post, we are interested in an “infinite” timeout, which we will accept to be a very large number.
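To make the interplay of the two settings concrete, here is a minimal sketch in Python. It is a toy model, not MySQL's implementation: the class name SemiSyncPrimary, the commit method, and the idea of passing in per-replica acknowledgment delays are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SemiSyncPrimary:
    # Hypothetical model; the fields mirror the two MySQL settings by name only.
    wait_for_slave_count: int  # rpl_semi_sync_master_wait_for_slave_count
    timeout_ms: float          # rpl_semi_sync_master_timeout

    def commit(self, ack_delays_ms):
        """Return (mode, wait_ms) for a single transaction.

        ack_delays_ms holds, per semi-sync replica, the delay after which it
        acknowledges receipt; float("inf") means it never answers.
        """
        delays = sorted(ack_delays_ms)
        if len(delays) >= self.wait_for_slave_count:
            # The commit is released once the k-th fastest acknowledgment arrives.
            needed = delays[self.wait_for_slave_count - 1]
            if needed <= self.timeout_ms:
                return ("semi-sync", needed)
        # Not enough acknowledgments in time: fall back to asynchronous mode.
        return ("async-fallback", self.timeout_ms)


primary = SemiSyncPrimary(wait_for_slave_count=1, timeout_ms=10_000)
print(primary.commit([5.0, 40.0, float("inf")]))
print(primary.commit([float("inf"), float("inf")]))
```

With an "infinite" timeout the second branch is effectively never taken, which is the configuration the rest of this post assumes.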
The only replicas to acknowledge receipt of the changelog are replicas where semi-sync is enabled. There could be other non-semi-sync replicas in our topology. They, too, pull the changelog from the primary but do not acknowledge receipt. As JFG points out in his post, the semi-sync replicas aren’t necessarily the most up-to-date. It’s possible that a non-semi-sync replica has pulled more changelog data than some or all semi-sync replicas.
The most straightforward use case for semi-sync is to guarantee the durability of the data. We only approve a commit once the transaction (in the form of a changelog) is persisted elsewhere, on a different server, or on multiple servers.
As mentioned earlier, this does not mean the changelog has been applied on those servers. Replication may be lagging, and the servers may be too busy to apply the changelog in a timely fashion. But, at the very least, if the primary crashes, there’s “forensic evidence” in the relay logs of a semi-sync replica that can tell us what the latest changes were.
The term “consistency” is overloaded, especially in distributed systems. The CAP theorem definition of consistency is, for example, very strict: once data was written on one server, any immediate follow-up read on any other server must reflect that write (or any later write).
We will settle for eventual consistency. This in itself is an overloaded term, so let’s clarify: in the case of a primary outage, we consider our system consistent if we’re able to promote a new primary within any amount of time (obviously in practice we expect the time to be short) such that it is consistent with the previous primary’s advertised state. I write the “advertised” state because the end-user or applications should not be surprised by any data changes when the new primary is promoted, regardless of how transactions/replications work internally.
A split-brain is a situation where two servers simultaneously take writes, each thinking they’re the (single) primary. In a split-brain scenario, the data diverge between the two servers. If those servers also have replicas, then we have two divergent replication trees. Fixing and merging those divergent trees is complex, and we usually revert the changes on one to make it look like the other (either by backup and restore or by some rollback method).
To reiterate JFG’s observation: while engineers usually frown upon split-brains, the damage of a split-brain may well be within a product’s allowance.
Topologies and Scenarios
Getting the best possible durability, consistency, response times, and split-brain mitigation is costly. Product owners and engineers make reasonable tradeoffs.
Let’s consider a few configured topologies and see what scenarios we can expect upon failure. In particular, let's learn how durable our data is, how we can expect it to be consistent, and when we can expect split-brains.
General Scenario: Wait for a Single Replica and Multiple Semi-Sync Replicas
In this common scenario, our primary is configured with rpl_semi_sync_master_wait_for_slave_count=1, and there are multiple replicas configured with semi-sync. Let’s say there are four semi-sync replicas, named R1, R2, R3, and R4. We nickname this setup “1-n”.
In this setup, a write takes place on the primary, and before approving the write to the user/app, the primary waits for an acknowledgment from just one replica. It doesn’t matter which. For some writes, it will be replica R1; at some other time, R4 is the first to acknowledge.
It’s important to note that the changelog (the binary log) is sequential. A replica that acknowledges some changelog event has necessarily received all of its prior events. This is why the primary doesn’t need to care that different acknowledgments come from different replicas. An acknowledgment means there’s a replica with full durability of committed transactions up to and including the one in question.
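The sequential property can be sketched in a few lines of Python. This is an illustration only: the function name, the (replica, position) tuples, and the use of plain integers as binary log positions are assumptions, and it models the “1-n” setup where a single acknowledgment releases a commit.

```python
def replicas_holding_all_commits(acks):
    """Given (replica_name, log_position) acknowledgments in any order,
    return the highest acknowledged position and the replicas that reached it.

    Because the changelog is sequential, a replica that acknowledged position p
    has received every event up to and including p.
    """
    highest = {}
    for replica, position in acks:
        highest[replica] = max(highest.get(replica, 0), position)
    if not highest:
        return 0, []
    committed_up_to = max(highest.values())
    holders = sorted(r for r, p in highest.items() if p >= committed_up_to)
    return committed_up_to, holders


# Different replicas acknowledge different writes, yet one of them always
# holds the full prefix of committed transactions.
acks = [("R1", 100), ("R4", 200), ("R2", 200), ("R1", 300)]
print(replicas_holding_all_commits(acks))
```

In this example R1 acknowledged last, so R1 alone is guaranteed to hold everything the primary reported as committed, even though R4 and R2 acknowledged earlier writes.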
Should the primary fail, we have at least one replica guaranteed to have all approved commits in its relay logs. These are not necessarily applied, but if our replication cluster is normally low-lagged, we can expect those commits to be few, and we can expect the replica to catch up (apply those commits) within a few seconds or less.
A possible situation is that there may be non-semi-sync replicas even more up-to-date than our semi-sync replica. We can choose to use them. The fact they have more transactions than the most up-to-date semi-sync replica means they have received binary log events not acknowledged by a semi-sync replica. While this may seem confusing at first, these are fine to accept and do not contradict our promise to our users. We promised that “if the primary tells you a commit is successful, then the data is durable elsewhere”. We never said anything about the state of the data before telling the user the commit is successful. In fact, this scenario is no different than if the primary actually gets a commit acknowledged by the replica, begins to communicate that to the user, and then crashes halfway. The user cannot distinguish between the two cases at the time of failure (though post-crash analysis may indicate which of the two was true).
And so we can choose to use the data from that even more up-to-date non-semi-sync replica, seed the rest of the replicas with that data, or we can choose to throw away any server that is more up-to-date than our most-up-to-date semi-sync replica and recycle them. It’s our choice and it’s down to operational considerations (time, capacity, load, etc.).
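As a rough sketch of how that decision could be laid out, the Python below compares what each replica has received against the most up-to-date semi-sync replica. The function name, the dictionary of positions, the replica names N1 and N2, and the integer positions are all hypothetical; a real tool would compare binary log coordinates or GTID sets instead.

```python
def promotion_options(received, semi_sync):
    """received: replica name -> highest changelog position in its relay logs.
    semi_sync: names of the replicas that run with semi-sync enabled.
    """
    # Everything the primary reported as committed is at or below this position.
    baseline = max(received[name] for name in semi_sync)
    # Non-semi-sync replicas that pulled events no semi-sync replica acknowledged.
    ahead = sorted(
        name for name, pos in received.items()
        if name not in semi_sync and pos > baseline
    )
    most_advanced = max(received, key=received.get)
    return {
        "semi_sync_baseline": baseline,
        "ahead_of_baseline": ahead,
        "most_advanced": most_advanced,
    }


received = {"R1": 300, "R2": 280, "N1": 320, "N2": 250}
print(promotion_options(received, semi_sync={"R1", "R2"}))
```

Either choice honors the promise made to users: promote or seed from the most advanced server, or discard the servers listed as ahead of the baseline and rebuild them.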
If Data Is Durable, Does That Mean It’s Available?
Not necessarily. Consider this possible, even likely, scenario: the primary and R1 are both in the same data center. R2, R3, R4 are in different availability zones, possibly remote, and so have higher latency than R1 communicating to the primary.
The app issues an update on the primary. R1 acknowledges the write. R2, R3, R4 do not, as of yet. But the primary does not care, it commits and reports success to the app. Then the primary and R1 both get network isolated together because the DC they’re in has lost network. The loss of network cut short the delivery of the changelog to R2, R3, and R4. They never got it.
Should we promote any of R2, R3, or R4, we lose consistency. The app expects the result of that update to be there, but none of these servers has the data. Moreover, even human intervention is limited. Since the data is only found on the primary and R1, we cannot get hold of it remotely. A human can always travel to the physical server to extract the data and send it over via cellular network, or even by car. It’s likely that this will take substantial time.
A predefined failover plan may choose to promote, say, R2 after all, losing some data but cutting short on outage time. Reconciliation will have to take place afterward.
Do We Even Know the Status?
Suppose the primary and one of the semi-sync replicas go offline. In that case, we have an additional problem: we can’t say for sure whether any of the remaining semi-sync replicas got the latest updates from the primary. It’s possible that they did. But it’s also possible that the single semi-sync replica that went offline was the single one to get and acknowledge some last writes taking place on the primary.
To generalize, if the number of lost semi-sync replicas is equal to, or greater than rpl_semi_sync_master_wait_for_slave_count, then we do not know whether the remaining replicas are consistent.
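This rule is simple enough to write down as a check. The sketch below is illustrative; the function name and the set-based inputs are assumptions, not part of any MySQL tooling.

```python
def survivors_known_consistent(semi_sync_replicas, reachable, wait_for_slave_count):
    """Return True only if we can be sure that at least one reachable semi-sync
    replica holds every commit the primary acknowledged to its clients.
    """
    lost = set(semi_sync_replicas) - set(reachable)
    # If fewer replicas are lost than the primary waits for, at least one
    # acknowledger of every commit must still be among the reachable ones.
    return len(lost) < wait_for_slave_count


all_semi_sync = {"R1", "R2", "R3", "R4"}
print(survivors_known_consistent(all_semi_sync, {"R2", "R3", "R4"}, 1))
print(survivors_known_consistent(all_semi_sync, {"R2", "R3", "R4"}, 2))
```

A False result does not mean data was lost, only that we cannot rule it out from the surviving replicas alone.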
But, arguably worse, we can now be in a split-brain situation. Apps in the primary data center were still writing to the old primary when the network went down. Those writes continued to run. The old primary now receives writes never seen by R2 (the newly promoted primary). Once we do wish to reconcile, we are faced with contradicting information.
The geo-distribution of our servers plays a key part in how tolerant our system is to failure and what outcomes we can expect.
Co-Located Primary and Semi-Sync Replicas
On a normal day, writes to the primary have low latency: acknowledgments from semi-sync replicas are quick to arrive thanks to the fast network within a data center.
When the primary fails, we have at least one good semi-sync replica to promote (or we can use it to seed a different server to promote). Because we promote within the same datacenter, app behavior is unlikely to be affected, other than the obvious disruption during the failover time.
If the datacenter is network-isolated, we need to make a decision: wait out the outage, incurring downtime to our services, or promote a new primary in another datacenter. We risk losing transactions (we cannot know whether we lose transactions until we are again able to connect to the original primary). We also risk a split-brain scenario.
Cross-Site Semi-Sync Replication
If we only run semi-sync replicas in a different datacenter than the primary’s, we first pay with increased write latency, as each commit runs a roundtrip outside the primary’s data center and back. With multiple semi-sync replicas, it’s the time it takes for the fastest replica to respond.
When the primary goes down, we have the data durable outside its data center. But we can then also compare with non-semi-sync replicas in the primary’s datacenter — they may yet have all the transactions, too, in which case we can promote one of them. Again, promoting within the same data center tends to be less disruptive to the application. Or, we can spend some time seeding them from one of our semi-sync replicas.
Or, we can choose to promote our semi-sync replicas, switching the primary data center for the replication cluster. We then must reassign semi-sync replicas, and ensure none run from within the new primary’s datacenter.
If the primary’s data center becomes network-isolated, we promote one of our semi-sync replicas in a different DC. We reconfigure semi-sync replicas before making the new primary writable. This avoids a split-brain scenario, assuming all of our semi-sync replicas are available to us.
What If Two or More Sites Get Network Isolated at the Same Time?
It largely depends on your hardware distribution. If your servers are distributed across three sites, then the majority of your sites are now down. Chances are you were only planning for a single-site outage at any given time, and a two-site outage is a hard downtime for you. You may or may not have the capacity to run with just one-third of your hardware. You’d need to risk consistency and split-brain.
Let’s assume we run out of five different sites (or availability zones) and two are down. In our “1-n” setup, split-brain is still possible, even though the majority of sites are up. That’s because our semi-sync setup is about the communication between a primary and any of its replicas.
But this is also something you can attach a number to. What is the probability of two availability zones going down at the same time? For a given vendor? Across vendors? Is there a number you’re willing to accept? What is the probability that both zones stay down for more than X minutes? Will you be willing to wait out X minutes to maintain durability and consistency?
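As a back-of-the-envelope sketch, and under the strong assumption that zones fail independently with the same probability, the chance of at least two out of n zones being down at once follows a binomial tail. The function name and the 1% figure below are hypothetical, chosen only to show the arithmetic; real outages are often correlated, so treat the result as a lower bound on concern rather than a forecast.

```python
from math import comb


def prob_at_least_k_down(zones, p_down, k=2):
    """Probability that at least k of `zones` zones are down at the same moment,
    assuming independent failures with identical probability p_down.
    """
    return sum(
        comb(zones, i) * p_down**i * (1 - p_down) ** (zones - i)
        for i in range(k, zones + 1)
    )


# Hypothetical input: five zones, each down 1% of the time.
print(prob_at_least_k_down(zones=5, p_down=0.01))
```

The same function can be rerun with the vendor's published availability figures, or with k=3, to see how quickly the risk drops as the tolerated number of simultaneous outages grows.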
In our physical world, there is no hard limit to how many sites could go down at the same time. We’re generally willing to acknowledge that some risks are so low that we can accept them.
Of special interest is that in our “1-n” scenario, we have a quorum of two servers out of five or more. The primary, with a single additional replica, is able to form a quorum and to accept writes.
That’s how we got to have a split-brain. While R2, R3, and R4 form a majority of the servers, writes took place without their agreement.
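The quorum arithmetic can be checked with a short Python sketch. The function names are made up for illustration, and the model counts only the primary plus its semi-sync replicas as voting servers, which is an assumption of this sketch rather than anything MySQL enforces.

```python
def write_quorum_size(wait_for_slave_count):
    # The primary itself plus the replicas it waits for.
    return 1 + wait_for_slave_count


def majority_can_miss_writes(total_servers, wait_for_slave_count):
    """True if a strict majority of servers can exist that took no part in
    acknowledging a write, which is what leaves room for a split-brain.
    """
    outside = total_servers - write_quorum_size(wait_for_slave_count)
    return outside > total_servers / 2


# Primary + R1..R4 in the "1-n" setup: a quorum of two out of five.
print(majority_can_miss_writes(total_servers=5, wait_for_slave_count=1))
# Waiting for two replicas makes the write quorum a majority (three of five).
print(majority_can_miss_writes(total_servers=5, wait_for_slave_count=2))
```

With a wait count of one, R2, R3, and R4 can all be unaware of a committed write; raising the wait count so that the write quorum is itself a majority removes that possibility in this model, at the cost of higher commit latency and stricter availability requirements.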
Published at DZone with permission of Shlomi Noach. See the original article here.