
Percona XtraDB Cluster on Ceph (Part 2)

With the setup out of the way, see XtraDB Cluster and Ceph in action as they work to complete an SST operation.



Welcome back for the finale of our two-part series. In part one, we worked out the kinks of running an XtraDB Cluster on Ceph. Now, we're going to get them working together to perform an SST operation. Without further ado, let's get to it.


So, what are the steps required to use Ceph with Percona XtraDB Cluster? (I assume that you have a working Ceph cluster.)

1. Join the Ceph Cluster

The first thing you need is a working Ceph cluster with the needed CephX credentials. While the setup of a Ceph cluster is beyond the scope of this post, we will address it in a subsequent post. For now, we'll focus on the client side.

You need to install the Ceph client packages on each node. On my test servers using Ubuntu 14.04, I did:

wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
sudo apt-add-repository 'deb http://download.ceph.com/debian-infernalis/ trusty main'
apt-get update
apt-get install ceph

These commands also installed all the dependencies. Next, I copied the Ceph cluster configuration file, /etc/ceph/ceph.conf...

fsid = 87671417-61e4-442b-8511-12659278700f
mon_initial_members = odroid1, odroid2
mon_host =,,
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_journal = /var/lib/ceph/osd/journal
osd_journal_size = 128
osd_pool_default_size = 2

...and the authentication file /etc/ceph/ceph.client.admin.keyring from another node. I made sure these files were readable by all. You can define more refined privileges for a production system with CephX, the security layer of Ceph.
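For example, a more restricted set of credentials for the database nodes could be created along these lines (the client name client.pxc and the exact capabilities are assumptions for illustration; the pool referenced here is the one we create in the next step):

# illustrative only: read access to the monitors, read/write on mysqlpool
ceph auth get-or-create client.pxc mon 'allow r' osd 'allow rwx pool=mysqlpool' \
    -o /etc/ceph/ceph.client.pxc.keyring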

Once everything is in place, you can test if it is working with this command:

root@PXC3:~# ceph -s
    cluster 87671417-61e4-442b-8511-12659278700f
     health HEALTH_OK
     monmap e2: 3 mons at {odroid1=,odroid2=,serveur-famille=}
            election epoch 474, quorum 0,1,2 odroid1,odroid2,serveur-famille
     mdsmap e204: 1/1/1 up {0=odroid3=up:active}
     osdmap e995: 4 osds: 4 up, 4 in
      pgmap v275501: 1352 pgs, 5 pools, 321 GB data, 165 kobjects
            643 GB used, 6318 GB / 7334 GB avail
                1352 active+clean
  client io 16491 B/s rd, 2425 B/s wr, 1 op/s

That gives the current state of the Ceph cluster.

2. Create the Ceph pool

Before we can use Ceph, we need to create a first RBD image, put a filesystem on it, and mount it for MySQL on the bootstrap node. Since RBD images are stored in a Ceph pool, we need at least one pool, which we create with the command:

ceph osd pool create mysqlpool 512 512 replicated

Here, we have defined the pool mysqlpool with 512 placement groups. On a larger Ceph cluster, you might need to use more placement groups (again, a topic beyond the scope of this post). The pool we just created is replicated. Each object in the pool will have two copies as defined by the osd_pool_default_size parameter in the ceph.conf file. If needed, you can modify the size of a pool and its replication factor at any moment after the pool is created.
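For instance, bumping the replication factor of the pool from two to three copies would look like this (shown only as a sketch; it is not needed for this walkthrough):

ceph osd pool get mysqlpool size      # check the current replication factor
ceph osd pool set mysqlpool size 3    # keep three copies of each object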

3. Create the First RBD Image

Now that we have a pool, we can create a first RBD image:

root@PXC1:~# rbd -p mysqlpool create PXC --size 10240 --image-format 2

And "map" the RBD image to a host block device:

root@PXC1:~# rbd -p mysqlpool map PXC

The map command returns the local RBD block device that corresponds to the RBD image. The rest of the steps are not specific to RBD images; we need to create a filesystem and prepare the mount points:

mkfs.xfs /dev/rbd1
mount /dev/rbd1 /var/lib/mysql -o rw,noatime,nouuid
chown mysql.mysql /var/lib/mysql
mysql_install_db --datadir=/var/lib/mysql --user=mysql
mkdir /var/lib/galera
chown mysql.mysql /var/lib/galera

You need to mount the RBD device and run the mysql_install_db tool only on the bootstrap node. You need to create the directories /var/lib/mysql and /var/lib/galera on the other nodes and adjust the permissions similarly.
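On those other nodes, that preparation is just a couple of commands (a minimal sketch; the SST script takes care of mapping and mounting the clone later):

mkdir -p /var/lib/mysql /var/lib/galera
chown mysql.mysql /var/lib/mysql /var/lib/galera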

4. Modify the my.cnf Files

You will need to set or adjust the wsrep_sst_ceph-specific settings in the my.cnf file of all the servers.
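As a rough sketch of what that could look like (only wsrep_sst_method=ceph and the cephlocalpool and cephcleanup options discussed later in this post come from the series; placing the script options in an [sst] section follows the usual PXC convention, and the values shown are illustrative assumptions rather than my actual settings):

[mysqld]
wsrep_sst_method=ceph        # hand the SST over to the wsrep_sst_ceph script

[sst]
cephlocalpool=mysqlpool      # pool where the clones are created (assumed: the pool created above)
cephcleanup=1                # drop the old clone and snapshot when a fresh SST is performed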


At this point, we can bootstrap the cluster on the node where we mounted the initial RBD image:

/etc/init.d/mysql bootstrap-pxc

5. Start the Other XtraDB Cluster Nodes

The first node does not perform an SST, so nothing exciting so far. With the patched version of MySQL, starting MySQL on a second node triggers a Ceph SST operation. In my test environment, the SST takes about five seconds to complete on low-powered VMs. Interestingly, the duration is not directly related to the dataset size, so a much larger dataset on a quiet database should take about the same time. A very busy database may need more time, since an SST requires a "flush tables with read lock" at some point.
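For completeness, starting the joiners is just the usual init script call (the same style as the bootstrap command above):

/etc/init.d/mysql start      # on PXC2, then on PXC3; the first start of each new node triggers the Ceph SST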

So, after their respective Ceph SST, the other two nodes have:

root@PXC2:~# mount | grep mysql
/dev/rbd1 on /var/lib/mysql type xfs (rw,noatime,nodiratime,nouuid)
root@PXC2:~# rbd showmapped
id pool      image           snap device
1  mysqlpool PXC2-1463776424 -    /dev/rbd1
root@PXC3:~# mount | grep mysql
/dev/rbd1 on /var/lib/mysql type xfs (rw,noatime,nodiratime,nouuid)
root@PXC3:~# rbd showmapped
id pool      image           snap device
1  mysqlpool PXC3-1464118729 -    /dev/rbd1

The original RBD image now has two snapshots that are mapped to the clones mounted by the other two nodes:

root@PXC3:~# rbd -p mysqlpool ls
root@PXC3:~# rbd -p mysqlpool info PXC2-1463776424
rbd image 'PXC2-1463776424':
        size 10240 MB in 2560 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.108b4246146651
        format: 2
        features: layering
        parent: mysqlpool/PXC@1463776423
        overlap: 10240 MB
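If you want to list the snapshots on the parent image directly, the rbd snap subcommand does it, following the same -p convention used above:

rbd -p mysqlpool snap ls PXC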


Apart from allowing faster SST, what other benefits do we get from using Ceph with Percona XtraDB Cluster?

The first benefit is that the inherent data duplication over the network removes the need for local data replication. Thus, instead of using RAID-10 or RAID-5 with an array of disks, we could use a simple RAID-0 stripe set, since the data is already replicated to more than one server.

The second benefit is a bit less obvious: you don’t need as much storage. Why? A Ceph clone only stores the delta from its original snapshot. So, for large, read-intensive datasets, the disk space savings can be very significant. Of course, over time, the clone will drift away from its parent snapshot and use more and more space. When we determine that a Ceph clone uses too much disk space, we can simply refresh it by restarting MySQL and forcing a full SST. The SST script will automatically drop the old clone and snapshot when the cephcleanup option is set, and it will create a fresh clone. You can easily evaluate how much space a clone consumes with the following command:

root@PXC2:~# rbd -p mysqlpool du PXC2-1463776424
warning: fast-diff map is not enabled for PXC2-1463776424. operation may be slow.
PXC2-1463776424      10240M 164M
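As a sketch of such a refresh, assuming the standard Percona XtraDB Cluster behavior where removing grastate.dat forces a full SST on the next start:

/etc/init.d/mysql stop               # on the node whose clone has drifted too far
rm /var/lib/mysql/grastate.dat       # make the node request a full SST instead of an IST
/etc/init.d/mysql start              # with cephcleanup set, the old clone and snapshot are dropped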

Also, nothing prevents you from using different Ceph pool configurations within the same XtraDB cluster; a Ceph clone can use a different pool than its parent snapshot. That’s the whole purpose of the cephlocalpool parameter. Strictly speaking, only one node needs to use a replicated pool, as the other nodes could run on clones that store their data in a non-replicated pool (saving a lot of storage space). Furthermore, we can define the OSD affinity of the non-replicated pool so that it stores data on the host where it is used, reducing cross-node network latency.
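As an illustration of that idea, a non-replicated pool could be created like this (the pool name is hypothetical, and the CRUSH rule needed to pin its data to the local host is left out, as it depends on your cluster layout):

ceph osd pool create mysqllocalpool 128 128 replicated
ceph osd pool set mysqllocalpool size 1    # single copy; if it is lost, a fresh SST simply rebuilds the clone

The nodes that should use it would then point cephlocalpool at mysqllocalpool.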

Using Ceph for XtraDB Cluster SST operations demonstrates just one of the many possibilities Ceph offers to MySQL. We continue to work with the Red Hat team and Red Hat Ceph Storage architects to find new and useful ways of addressing database issues in the Ceph environment. There are many more posts to come, so stay tuned!



Published at DZone with permission of Yves Trudeau, DZone MVB. See the original article here.

