Troubleshooting Relay Log Corruption in MySQL
Have you ever seen the replication stopped with message like this:
Last_Error: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.
This is relay relay log corruption and you can check details in the
MySQL Error log file. The error message describes few reasons and
indeed because there is little validation (ie no checksums) in the
replication there are multiple reasons for bad event to show up in
Really this is only one of various error messages you could see if relay log corrupted. You could also see malformed queries (with some junk), complaining about event to big etc if there is a garbage in relay logs.
If relay logs are corrupted it is surely worth to check what could cause it - it could be network (especially if replicating over unreliable long distance networks), MySQL bugs on master or slave, hardware problems and few others. In any case it is worth investigating.
Investigating is what you do later but how do you fix the problem first ? The important question you need to have answered - are logs corrupted on the master ? If logs on the master are OK you can just run SHOW SLAVE STATUS on slave experiencing error and use CHANGE MASTER TO to re-point replication to Relay_Master_Log_File:Exec_Master_Log_Pos:
localhost:(none)> slave stop;
Query OK, 0 rows affected (0.00 sec)
localhost:(none)> CHANGE master TO master_log_file=Relay_Master_Log_File,master_log_pos=Exec_Master_Log_Pos
Query OK, 0 rows affected (1.16 sec)
localhost:(none)> slave start;
Query OK, 0 rows affected (0.00 sec)
This will purge existing relay logs re-fetch all events which have not been executed yet. Doing this command make sure your master is operational and it still has all the logs needed to re-fetch events.
How would you know if logs are OK on the master ? Well in this case there were probably 5 another slaves which did not have the problem - which means Master is most likely OK. In any case it is little harm to try restarting from the same position - if logs are bad on the master you would get the same error message again and can continue with investigation.
What if logs on the master are corrupted ?
In this case you have couple of choices (and you also potentially have multiple slaves to deal with). You can use mysqlbinlog (or you favorite hex editor if mysqlbinlog does not work) to find the next event start and potentially recover "corrupted" event to be manually executed on the slaves.
Skipping around event makes master and slave potentially inconsistent and you should access the risks depending on applications (and on amount of events which were corrupted) you may want to let replication continue from the new position or resync the slaves to the master.
How can you recover the slave ?
As all slaves are likely to be affected in this case you can't clone another slave. You also can't use classical method of recovery from backup - because you would need relay logs to roll forward, and they are corrupted. You can either re-clone the data from Master. (This is where LVM or similar techniques can help you a lot) or skip bad events as described and when use Maatkit mk-table-checksum to check what tables are out of sync and when use mk-table-sync to resync them.
Last method works in particularly well in case you can afford to run for a while with slaves which are a bit out of sync, which is quite often better than having just master available (also having extra load of data copied from it).