MHA (Master High Availability Manager and tools for MySQL) is one of the most important pieces of our managed services. When properly set up, it can check replication health, move writer and reader virtual IPs, perform failovers, and have its output constantly monitored by Nagios. Is it easy to deploy and follows the KISS (Keep It Simple, Stupid) philosophy that I love so much.
This blog post is a quick start guide to try it out and play with it in your own testing environment. I assume that you already know how to install software, deal with SSH keys and setup replication in MySQL. The post just covers MHA configuration.
Taken from /etc/hosts:
192.168.1.116mysql-server1 192.168.1.117 mysql-server2 192.168.1.118 mysql-server3 192.168.1.119 mha-manager
mysql-server1: Our master MySQL server with 5.6
mysql-server2: Slave server
mysql-server3: Slave server
mha-manager: The server monitors the replication and from where we manage MHA. The installation is also required to meet some Perl dependencies.
We just introduced some new concepts, the MHA Node and MHA Manager:
It is installed and runs on each MySQL server. This is the piece of software that it is invoked by the manager every time we want to do something, like for example a failover or a check.
As explained before, this is our operations center. The manager monitors the services, replication, and includes several administrative command lines.
- Replication must already be running. MHA manages replication and monitors it, but it is not a tool to deploy it. So MySQL and replication need to be running already.
- All hosts should be able to connect to each other using public SSH keys.
- All nodes need to be able to connect to each other’s MySQL servers.
- All nodes should have the same replication user and password.
- In the case of multi-master setups, only one writable node is allowed. All others need to be configured with read_only.
- MySQL version has to be 5.0 or later.
- Candidates for master failover should have binary log enabled. The replication user must exist there too.
- Binary log filtering variables should be the same on all servers (replicate-wild%, binlog-do-db…).
- Disable automatic relay-log purge and do it regularly from a cron task. You can use an MHA-included script called “purge_relay_logs”.
While that is a large list of requisites, I think that they are pretty standard and logical.
As explained before, the MHA Node needs to be installed on all the nodes. You can download it from this Google Drive link.
This post shows you how to install it using the source code, but there are RPM packages available. Deb too, but only for older versions. Use the installation method you prefer. This is how to compile it:
tar -xzf mha4mysql-node-0.57.tar.gz perl Makefile.PL make make install
The commands included in the node package are save_binary_logs, filter_mysqlbinlog, purge_relay_logs, apply_diff_relay_logs. Mostly tools that the manager needs to call in order to perform a failover, while trying to minimize or avoid any data loss.
On the manager server, you need to install MHA Node plus MHA Manager. This is due to MHA Manager dependence on a Perl library that comes with MHA Node. The installation process is just the same.
We only need one configuration file on the Manager node. The example below is a good starting point:
# cat /etc/app1.cnf [server default] # mysql user and password user=root password=supersecure ssh_user=root # working directory on the manager manager_workdir=/var/log/masterha/app1 # working directory on MySQL servers remote_workdir=/var/log/masterha/app1 [server1] hostname=mysql-server1 candidate_master=1 [server2] hostname=mysql-server2 candidate_master=1 [server3] hostname=mysql-server3 no_master=1
So pretty straightforward. It specifies that there are three servers, two that can be master and one that can’t be promoted to master.
Let’s check if we meet some of the pre-requisites. We are going to test if replication is working, can be monitored, and also if SSH connectivity works.
# masterha_check_ssh --conf=/etc/app1.cnf [...] [info] All SSH connection tests passed successfully.
It works. Now let’s check MySQL:
# masterha_check_repl --conf=/etc/app1.cnf [...] MySQL Replication Health is OK.
Start the Manager and Operations
Everything is setup, we meet the pre-requisites. We can start our manager:
# masterha_manager --remove_dead_master_conf --conf=/etc/app1.cnf [...] [info] Starting ping health check on mysql-server1(192.168.1.116:3306).. [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
The manager found our master and it is now actively monitoring it using a SELECT command. –remove_dead_master_conf tells the manager that if the master goes down, it must edit the config file and remove the master’s configuration from it after a successful failover. This avoids the “there is a dead slave” error when you restart the manager. All servers listed in the conf should be part of the replication and in good health, or the manager will refuse to work.
Automatic and Manual Failover
Good, everything is running as expected. What happens if the MySQL master dies!?!
[...] [warning] Got error on MySQL select ping: 2006 (MySQL server has gone away) [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/var/lib/mysql,/var/log/mysql --output_file=/var/log/masterha/app1/save_binary_logs_test --manager_version=0.57 --binlog_prefix=mysql-bin Creating /var/log/masterha/app1 if not exists.. ok. Checking output directory is accessible or not.. ok. Binlog found at /var/log/mysql, up to mysql-bin.000002 [info] HealthCheck: SSH to mha-server1 is reachable. [...]
First, it tries to connect by SSH to read the binary log and save it. MHA can apply the missing binary log events to the remaining slaves so they are up to date with all the before-failover info. Nice!
Theses different phases follow:
* Phase 1: Configuration Check Phase.. * Phase 2: Dead Master Shutdown Phase.. * Phase 3: Master Recovery Phase.. * Phase 3.1: Getting Latest Slaves Phase.. * Phase 3.2: Saving Dead Master's Binlog Phase.. * Phase 3.3: Determining New Master Phase.. [info] Finding the latest slave that has all relay logs for recovering other slaves.. [info] All slaves received relay logs to the same position. No need to resync each other. [info] Starting master failover.. [info] From: mysql-server1(192.168.1.116:3306) (current master) +--mysql-server2(192.168.1.117:3306) +--mysql-server3(192.168.1.118:3306) To: mysql-server2(192.168.1.117:3306) (new master) +--mysql-server3(192.168.1.118:3306) * Phase 3.3: New Master Diff Log Generation Phase.. * Phase 3.4: Master Log Apply Phase.. * Phase 4: Slaves Recovery Phase.. * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase.. * Phase 4.2: Starting Parallel Slave Log Apply Phase.. * Phase 5: New master cleanup phase..
The phases are pretty self-explanatory. MHA tries to get all the data possible from the master’s binary log and slave’s relay log (the one that is more advanced) to avoid losing any data or promote a slave that it was far behind the master. So it tries to promote a slave with the most current data as possible. We see that server2 has been promoted to master, because in our configuration we specified that server3 shouldn’t be promoted.
After the failover, the manager service stops itself. If we check the config file, the failed server is not there anymore. Now the recovery is up to you. You need to get the old master back in the replication chain, then add it again to the config file and start the manager.
It is also possible to perform a manual failover (if, for example, you need to do some maintenance on the master server). To do that you need to:
- Stop masterha_manager.
- Run masterha_master_switch –master_state=alive –conf=/etc/app1.cnf. The line says that you want to switch the master, but the actual master is still alive, so no need to mark it as dead or remove it from the conf file.
And that’s it. Here is part of the output. It shows the tool making the decision on the new topology and asking the user for confirmation:
[info] From: mysql-server1(192.168.1.116:3306) (current master) +--mysql-server2(192.168.1.117:3306) +--mysql-server3(192.168.1.118:3306) To: mysql-server2(192.168.1.117:3306) (new master) +--mysql-server3(192.168.1.118:3306) Starting master switch from mha-server1(192.168.1.116:3306) to mha-server2(192.168.1.117:3306)? (yes/NO): yes [...] [info] Switching master to mha-server2(192.168.1.117:3306) completed successfully.
You can also employ some extra parameters that are really useful in some cases:
–orig_master_is_new_slave: if you want to make the old master a slave of the new one.
–running_updates_limit: if the current master executes write queries that take more than this parameter’s setting, or if any of the MySQL slaves behind master take more than this parameter, the master switch aborts. By default, it’s 1 (1 second). All these checks are for safety reasons.
–interactive=0: if you want to skip all the confirmation requests and questions masterha_master_switch could ask.
Check this link in case you use GTID and want to avoid problems with errant transactions during the failover:
Since this is a quick guide to start playing around with MHA, I won’t cover advanced topics in detail. But I will mention a few:
- Custom scripts. MHA can move IPs around, shutdown a server and send you a report in case something happens. It needs a custom script, however. MHA comes with some example scripts, but you would need to write one that fits your environment.The directives are master_ip_failover_script, shutdown_script, report_script. With them configured, MHA will send you an email or a message to your mobile device in the case of a failover, shutdown the server and move IPs between servers. Pretty nice!
Hope you found this quickstart guide useful for your own tests. Remember, one of the most important things: don’t overdo automation! These tools are good for checking health and performing the first initial failover. But you must still investigate what happened, why, fix it and work to avoid it from happening again. In high availability (HA) environments, automate everything and cause it to stop being HA.