Working day to day with Percona Remote DBA customers, we face an issue from time to time where MySQL replication lag flaps constantly between 0 and XXXXX – e.g., Seconds_Behind_Master is zero for a few seconds, then it’s 6,287 or 25,341, then zero again, and so on. Below are five scenarios that can cause this, with their symptoms – some are obvious and some are not.
1. Duplicate server-ids on two or more slaves
Symptoms: The MySQL error log on a slave shows the slave thread constantly connecting to and disconnecting from the master.
Solution: Check that every node in the replication topology has a unique server-id.
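A quick way to verify is to compare the configured server-id on each node; every value must be unique across the topology:

```sql
-- Run on the master and on every slave; no two nodes may return the same value
SELECT @@server_id;
```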
2. Dual-master set-up, “log_slave_updates” enabled, server-ids changed
Scenario: You stop MySQL on the first master, then on the second. After some migration or maintenance work, you decide it would be better to change the server-ids to match the end part of the nodes’ new IP addresses. You bring everything back online and notice strange behavior: some portion of the binlog events cycles between the two masters because log_slave_updates is enabled. But the server-ids are different now – so why the loop? Normally a server skips replicated events that carry its own server-id, and that is what breaks the loop in a dual-master setup. Some data written to the second master just before its shutdown (while the first was already off) was logged under the old server-ids; after the change, neither node recognizes those events as its own, so both keep applying them and, with log_slave_updates enabled, writing them back into their binlogs. So we have an infinite loop.
Solution: Simply reset the slave position on one of the masters to the other’s current coordinates, so the “ghost” binlog events finally stop cycling.
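A minimal sketch of that reset, assuming a classic binlog-coordinate dual-master setup (the file name and position below are placeholders – take the real ones from SHOW MASTER STATUS on the peer):

```sql
-- On one of the masters: discard the relayed "ghost" events and
-- re-point replication at the peer's *current* coordinates
STOP SLAVE;
RESET SLAVE;
CHANGE MASTER TO
  MASTER_LOG_FILE='mysql-bin.000123',  -- placeholder: peer's current binlog file
  MASTER_LOG_POS=4;                    -- placeholder: peer's current position
START SLAVE;
```

RESET SLAVE throws away the downloaded relay logs, so the looping events that are still queued locally are discarded rather than re-applied.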
3. MySQL options “sync_relay_log”, “sync_relay_log_info”, “sync_master_info”
Symptoms: The “SHOW SLAVE STATUS” output is inconsistent – one moment the slave thread reports a delay while queuing events, the next it reports zero lag while waiting for the master, and so on, even though, judging by the real master position, there should be a steady XXXXX-second delay. Another symptom is I/O saturation: a high disk IOPS number, yet pt-diskstats shows the disk only about half busy. In my case, I observed 1,500 IOPS on the master and ten times that – 15,000 IOPS at 60 percent busy – on the slave. You might suspect row-based replication (binlog_format=ROW) with constant updates on the master, but that doesn’t explain why the delay flaps, or why disk I/O is high yet not at 100 percent of capacity.
Possible reason: Check whether you have sync_relay_log, sync_relay_log_info, and sync_master_info enabled, which makes the slave sync data to disk on every event, e.g. sync_relay_log ensures one write to the relay log per statement if autocommit is enabled, and one write per transaction otherwise. Enabling those options makes your system slow, unless the disk has a battery-backed cache, which makes synchronization very fast.
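If the slave’s disk has no battery-backed write cache, one option is to relax these per-event fsyncs – a configuration sketch, with the usual trade-off that replication metadata is less crash-safe:

```ini
# my.cnf on the slave: stop syncing relay-log data and replication
# metadata to disk on every event (faster, but less durable on crash)
[mysqld]
sync_relay_log      = 0
sync_relay_log_info = 0
sync_master_info    = 0
```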
4. Network latency
Symptoms: The network link between master and slave is poor or saturated, so the slave’s IO_thread has a hard time keeping up with the master. The SQL thread reports zero lag because it has already applied everything the IO_thread has fetched and is simply waiting for more events. Once the IO_thread manages to write more into the relay log, Seconds_Behind_Master jumps back to XXX seconds.
Solution: Check the network throughput.
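One way to confirm the IO_thread is the bottleneck is to compare positions in the slave status output against the master:

```sql
-- On the slave: if Exec_Master_Log_Pos has caught up with
-- Read_Master_Log_Pos, but Read_Master_Log_Pos trails the position
-- reported by SHOW MASTER STATUS on the master, the SQL thread is
-- idle and the network/IO_thread is the limiting factor
SHOW SLAVE STATUS\G
```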
5. Late committed transactions
Scenario: If you open a transaction on the master and then, for some reason (maybe an application bug or a front-end problem), commit it some time later, the delay shows up as replication latency on the slave. For example, if you wait one hour before committing, the slave will report 3,600 seconds of latency while that transaction is being applied.
Solution: Use the pt-heartbeat tool, which measures actual replication delay from replicated timestamp rows instead of relying on Seconds_Behind_Master.
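A minimal pt-heartbeat setup might look like this, assuming a `percona` schema exists for the heartbeat table and connection credentials are available (e.g., in ~/.my.cnf):

```shell
# On the master: update a timestamp row every second in the background
pt-heartbeat --database percona --update --interval 1 --daemonize

# On the slave: report true lag based on the replicated timestamp
pt-heartbeat --database percona --monitor
```

Because the lag is computed from a real replicated timestamp, a long-open but late-committed transaction no longer produces a misleading spike.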