Over a million developers have joined DZone.

Percona XtraDB Cluster on Ceph (Part 1)

See how well XtraDB Cluster and Ceph complement each other in this two-part series. First, you'll see how the mixture creates a some problems (and learn how to fix them).

· Database Zone

Sign up for the Couchbase Community Newsletter to stay ahead of the curve on the latest NoSQL news, events, and webinars. Brought to you in partnership with Coucbase.


This post discusses how XtraDB Cluster and Ceph are a good match, and how their combination allows for faster SST and a smaller disk footprint.

My last post was an introduction to Red Hat's Ceph. As interesting and useful as it was, it wasn’t a practical example. Like most of the readers, I learn about and see the possibilities of technologies by burning my fingers on them. This post dives into a real and novel Ceph use case: handling of the Percona XtraDB Cluster SST operation using Ceph snapshots.

If you are familiar with Percona XtraDB Cluster, you know that a full state snapshot transfer (SST) is required to provision a new cluster node. Similarly, SST can also be triggered when a cluster node happens to have a corrupted dataset. Those SST operations consist essentially of a full copy of the dataset sent over the network. The most common SST methods are Xtrabackup and rsync. Both of these methods imply a significant impact and load on the donor while the SST operation is in progress.

For example, the whole dataset will need to be read from the storage and sent over the network, an operation that requires a lot of IO operations and CPU time. Furthermore, with the rsync SST method, the donor is under a read lock for the whole duration of the SST. Consequently, it can take no write operations. Such constraints on SST operations are often the main motivations beyond the reluctance of using Percona XtraDB cluster with large datasets.

So, what could we do to speed up SST? In this post, I will describe a method of performing SST operations when the data is not local to the nodes. You could easily modify the solution I am proposing for any non-local data source technology that supports snapshots/clones, and has an accessible management API. Off the top of my head (other than Ceph) I see AWS EBS and many SAN-based storage solutions as good fits.

The Challenges of Clone-Based SST

If we could use snapshots and clones, what would be the logical steps for an SST? Let's have a look at the following list:

  1. New node starts (joiner) and unmounts its current MySQL datadir.
  2. The joiner and asks for an SST.
  3. The donor creates a consistent snapshot of its MySQL datadir with the Galera position.
  4. The donor sends to the joiner the name of the snapshot to use.
  5. The joiner creates a clone of the snapshot name provided by the donor.
  6. The joiner mounts the snapshot clone as the MySQL datadir and adjusts ownership.
  7. The joiner initializes MySQL on the mounted clone.

As we can see, all these steps are fairly simple but hide some challenges for an SST method based on cloning. The first challenge is the need to mount the snapshot clone. Mounting a block device requires root privileges — and SST scripts normally run under the MySQL user. The second challenge I encountered wasn't expected. MySQL opens the datadir and some files in it before the SST happens. Consequently, those files are then kept opened in the underlying mount point, a situation that is far from ideal. Fortunately, there are solutions to both of these challenges as we will see below.

SST script

So, let's start with the SST script. The script is available in my GitHub.

You should install the script in the /usr/bin directory, along with the other user scripts. Once installed, I recommend:

chown root.root /usr/bin/wsrep_sst_ceph
chmod 755 /usr/bin/wsrep_sst_ceph

The script has a few parameters that can be defined in the [sst] section of the my.cnf file.


The Ceph pool where this node should create the clone. It can be a different pool from the one of the original dataset. For example, it could have a replication factor of 1 (no replication) for a read scaling node. The default value is: mysqlpool


What mount point to use. It defaults to the MySQL datadir as provided to the SST script.


The options used to mount the filesystem. The default value is: rw,noatime


The Ceph keyring file to authenticate against the Ceph cluster with cephx. The user under which MySQL is running must be able to read the file. The default value is: /etc/ceph/ceph.client.admin.keyring


Whether or not the script should cleanup the snapshots and clones that are no longer is used. Enable = 1, Disable = 0. The default value is: 0

Root Privileges

In order to allow the SST script to perform privileged operations, I added an extra SST role: "mount." The SST script on the joiner will call itself back with sudo and will pass "mount" for the role parameter. To allow the elevation of privileges, the follow line must be added to the /etc/sudoers file:

mysql ALL=NOPASSWD: /usr/bin/wsrep_sst_ceph

Files Opened by MySQL Before the SST

Upon startup, MySQL opens files at two places in the code before the SST completes. The first one is in the function mysqld_main, which sets the current working directory to the datadir (an empty directory at that point).  After the SST, a block device is mounted on the datadir. The issue is that MySQL tries to find the files in the empty mount point directory. I wrote a simple patch, presented below, and issued a pull request:

diff --git a/sql/mysqld.cc b/sql/mysqld.cc
index 90760ba..bd9fa38 100644
--- a/sql/mysqld.cc
+++ b/sql/mysqld.cc
@@ -5362,6 +5362,13 @@ a file name for --log-bin-index option", opt_binlog_index_name);
+  /*
+   * Forcing a new setwd in case the SST mounted the datadir
+   */
+  if (my_setwd(mysql_real_data_home,MYF(MY_WME)) && !opt_help)
+    unireg_abort(1);        /* purecov: inspected */
   if (opt_bin_log)

With this patch, I added a new my_setwd call right after the SST completed. The Percona engineering team approved the patch, and it should be added to the upcoming release of Percona XtraDB Cluster.

The Galera library is the other source of opened files before the SST. Here, the fix is just in the configuration. You must define the base_dir Galera provider option outside of the datadir. For example, if you use /var/lib/mysql as datadir and cephmountpoint, then you should use:


Of course, if you have other provider options, don't forget to add them there.

And that's all for now! With the setup complete and the problems solved, be ready for the next post, which will cover the walkthrough and see XtraCB Cluster and Ceph working hand-in-hand to complete an SST operation.

The Getting Started with NoSQL Guide will get you hands-on with NoSQL in minutes with no coding needed. Brought to you in partnership with Couchbase.

ceph ,sst ,percona xtradb cluster ,script

Published at DZone with permission of Yves Trudeau, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}