
Five Reasons Why MySQL Replication Lag Flaps Between 0 and XXXXX

· Performance Zone ·


This post comes from Roman Vynar at the MySQL Performance Blog.

Working day to day with Percona Remote DBA customers, we occasionally face an issue where MySQL replication lag flaps constantly between 0 and XXXXX – e.g., Seconds_Behind_Master is zero for a few seconds, then 6,287 or 25,341, then zero again, and so on. Below are five different scenarios and symptoms that produce this behavior – some are obvious and some are not.
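The flapping is easy to observe by polling the slave once a second and watching the counter jump. A minimal sketch (connection options are omitted; add `-u`, `-p`, and `-h` for your environment):

```shell
# Poll the slave once a second and watch Seconds_Behind_Master flap
# between 0 and a large value:
while true; do
  mysql -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master
  sleep 1
done
```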

1. Duplicate server-ids on two or more slaves

Symptoms: The MySQL error log on a slave shows that the slave thread is constantly connecting to and disconnecting from the master.
Solution: Check that every node in the replication topology has a unique server-id.
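The check can be run on each node directly; the id value below is only an example. Note that on older MySQL versions the variable is not dynamic, so the change must go into my.cnf followed by a restart:

```sql
-- Run on every node in the topology; each must return a distinct value:
SHOW GLOBAL VARIABLES LIKE 'server_id';

-- To fix a duplicate on MySQL 5.6+, set a unique id at runtime
-- (102 is an example value), then restart the slave threads;
-- otherwise set server-id in the [mysqld] section of my.cnf and restart:
SET GLOBAL server_id = 102;
```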

2. Dual-master set-up, “log_slave_updates” enabled, server-ids changed

Scenario: You stop MySQL on the first master, then on the second one. Afterwards, you perform some migration or maintenance work and decide it would be convenient to change the server-ids to match the last octet of the new IP addresses. You bring everything back online and notice strange behavior: a portion of the binlogs cycles endlessly between the masters because log_slave_updates is enabled. Why does changing the server-ids matter? Some data was written to the second master after the first one was already shut down. When both nodes came back with new server-ids, each recognized those old events as not its own – the server-id recorded in the binlogs before the shutdown no longer matches either node – so each node applied them and passed them back down the replication stream, producing an infinite loop.
Solution: Simply reset the slave position on one of the masters so the "ghost" binlogs finally stop cycling.
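The reset boils down to repointing one master's slave thread at the other master's *current* coordinates, skipping the cycling events. A sketch – the binlog file name and position are placeholders you would read from `SHOW MASTER STATUS` on the other node:

```sql
-- On the master that is replaying the "ghost" events:
STOP SLAVE;

-- On the *other* master, note its current coordinates:
--   SHOW MASTER STATUS;   -- e.g. File: mysql-bin.000042, Position: 107

-- Back on the first master, point replication at that current
-- position so the old cycled events are never fetched again:
CHANGE MASTER TO
  MASTER_LOG_FILE = 'mysql-bin.000042',
  MASTER_LOG_POS  = 107;
START SLAVE;
```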

3. MySQL options “sync_relay_log”, “sync_relay_log_info”, “sync_master_info”

Symptoms: According to the "SHOW SLAVE STATUS" output, at one moment the slave thread is queueing events with some delay, at another it is waiting for the master with zero lag, and so on – yet, judging by the real master position, there should indeed be a XXXXX delay. Another symptom is I/O saturation: the disk IOPS number is high, yet pt-diskstats shows the disk only about half as busy as that number would suggest. In my case, I observed 1,500 IOPS on the master and ten times that – 15,000 IOPS at 60 percent busy – on the slave. You might suspect row-based replication (binlog_format=ROW) with constant updates on the master, but that still would not explain why the delay flaps, or why disk I/O is high yet not at 100 percent of capacity.
Possible reason: Check whether you have sync_relay_log, sync_relay_log_info, and sync_master_info enabled, which make the slave sync data to disk on every event – e.g., sync_relay_log=1 forces one write of the relay log to disk per statement if autocommit is enabled, and one per transaction otherwise. Enabling these options slows your system down unless the disk has a battery-backed write cache, which makes the synchronization very fast.
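Checking and relaxing these settings on the slave can look like the following. On recent MySQL versions the variables are dynamic; on older ones, put the values in the `[mysqld]` section of my.cnf and restart instead:

```sql
-- Check the current values on the slave (non-zero means
-- a disk sync every N events):
SHOW GLOBAL VARIABLES LIKE 'sync_relay_log%';
SHOW GLOBAL VARIABLES LIKE 'sync_master_info';

-- Relax them at runtime: 0 lets the OS flush in the background
-- instead of forcing a sync on every event:
SET GLOBAL sync_relay_log      = 0;
SET GLOBAL sync_relay_log_info = 0;
SET GLOBAL sync_master_info    = 0;
```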

4. Network latency

Symptoms: There is a poor network link between the master and the slave. When the link is saturated, the slave's IO_thread has a hard time keeping up with the master. The SQL thread reports zero lag while it simply waits for more events to be stored by the IO_thread; once the IO_thread manages to write more into the relay log, the SQL thread reports an XXX seconds delay again.
Solution: Check the network throughput.
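Two quick ways to confirm this scenario (host names below are placeholders, and iperf is just one of several throughput testers you could use):

```shell
# If Master_Log_File/Read_Master_Log_Pos on the slave trail the
# coordinates from SHOW MASTER STATUS on the master, the IO thread
# (i.e., the network) is the bottleneck, not the SQL thread:
mysql -h slave-host  -e "SHOW SLAVE STATUS\G" \
  | grep -E 'Master_Log_File|Read_Master_Log_Pos'
mysql -h master-host -e "SHOW MASTER STATUS\G"

# Measure raw throughput between the hosts
# (run "iperf -s" on the master first):
iperf -c master-host -t 10
```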

5. Late committed transactions

Scenario: If you open a transaction on the master and, for some reason (maybe an application bug or a front-end problem), commit it some time later, the delay shows up as replication latency on the slave. For example, if you wait one hour before committing, the slave will report 3,600 seconds of latency while it executes that transaction.
Solution: Use the pt-heartbeat tool, which measures lag from an actual replicated timestamp rather than from Seconds_Behind_Master.
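A minimal pt-heartbeat setup looks like this; the `percona` database name is an assumption – any schema that replicates to the slave will do:

```shell
# On the master: update a timestamp row every second in the
# background ("percona" is a placeholder schema name):
pt-heartbeat --update -D percona --daemonize

# On the slave: continuously report true replication delay,
# computed from the replicated timestamp:
pt-heartbeat --monitor -D percona

# One-shot check: print the current delay in seconds and exit:
pt-heartbeat --check -D percona
```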



Published at DZone with permission of

