Over the last year, the Ceph world drew me in. Partly because of my taste for distributed systems, but also because I think Ceph represents a great opportunity for MySQL specifically and databases in general. The shift from local storage to distributed storage is similar to the shift from bare disks host configuration to LVM-managed disks configuration.
Most of the work I've done with Ceph was in collaboration with folks from Red Hat (mainly Brent Compton and Kyle Bader). This work resulted in a number of talks presented at the Percona Live conference in April and the Red Hat Summit in San Francisco at the end of June. I could write a lot about using Ceph with databases, and I hope this post is the first in a long series on Ceph. Before I starting with use cases, setup configurations and performance benchmarks, I think I should quickly review the architecture and principles behind Ceph.
Introduction to Ceph
Inktank created Ceph a few years ago as a spin-off of the hosting company DreamHost. RedHat acquired Inktank in 2014 and now offers it as a storage solution. OpenStack uses Ceph as its dominant storage backend. This blog, however, focuses on a more general review and isn't restricted to a virtual environment.
A simplistic way of describing Ceph is to say it is an object store, just like S3 or Swift. This is a true statement, but only up to a certain point. There are minimally two types of nodes with Ceph — monitors and object storage daemons (OSDs). The monitor nodes are responsible for maintaining a map of the cluster or, if you prefer, the Ceph cluster metadata. Without access to the information provided by the monitor nodes, the cluster is useless. Redundancy and quorum at the monitor level are important.
Any non-trivial Ceph setup has at least three monitors. The monitors are fairly lightweight processes and can be co-hosted on OSD nodes (the other node type needed in a minimal setup). The OSD nodes store the data on disk, and a single physical server can host many OSD nodes — though it would make little sense for it to host more than one monitor node. The OSD nodes are listed in the cluster metadata (the "crushmap") in a hierarchy that can span data centers, racks, servers, etc. It is also possible to organize the OSDs by disk types to store some objects on SSD disks and other objects on rotating disks.
With the information provided by the monitors' crushmap, any client can access data based on a predetermined hash algorithm. There’s no need for a relaying proxy. This becomes a big scalability factor because these proxies can be performance bottlenecks. Architecture-wise, it is somewhat similar to the NDB API, where — given a cluster map provided by the NDB management node — clients can directly access the data on data nodes.
Ceph stores data in a logical container call a pool. With the pool definition comes a number of placement groups. The placement groups are shards of data across the pool. For example, on a four-node Ceph cluster, if a pool is defined with 256 placement groups (pg), then each OSD will have 64 pgs for that pool. You can view the pgs as a level of indirection to smooth out the data distribution across the nodes. At the pool level, you define the replication factor ("size" in Ceph terminology).
The recommended values are a replication factor of three for spinners and two for SSD/Flash. I often use a size of one for ephemeral test VM images. A replication factor greater than one associates each pg with one or more pgs on the other OSD nodes. As the data is modified, it is replicated synchronously to the other associated pgs so that the data it contains is still available in case an OSD node crashes.
So far, I have just discussed the basics of an object store. But the ability to update objects atomically in place makes Ceph different and better (in my opinion) than other object stores. The underlying object access protocol, rados, updates an arbitrary number of bytes in an object at an arbitrary offset, exactly like if it is a regular file. That update capability allows for much fancier usage of the object store — for things like the support of block devices, rbd devices, and even a network file systems, cephfs.
When using MySQL on Ceph, the rbd disk block device feature is extremely interesting. A Ceph rbd disk is basically the concatenation of a series of objects (4MB objects by default) that are presented as a block device by the Linux kernel rbd module. Functionally it is pretty similar to an iSCSI device as it can be mounted on any host that has access to the storage network and it is dependent upon the performance of the network.
The Benefits of Using Ceph
In a world striving for virtualization and containers, Ceph gives easily moves database resources between hosts.
On a single host, you have access only to the IO capabilities of that host. With Ceph, you basically put in parallel all the IO capabilities of all the hosts. If each host can do 1,000 iops, a four-node cluster could reach up to 4,000 iops.
Ceph replicates data at the storage level, and provides resiliency to storage node crash. A kind of DRBD on steroids.
Ceph rbd block devices support snapshots, which are quick to make and have no performance impacts. Snapshots are an ideal way of performing MySQL backups.
You can clone and mount Ceph snapshots as block devices. This is a useful feature to provision new database servers for replication, either with asynchronous replication or with Galera replication.
The Caveats of Using Ceph
Of course, nothing is free. Ceph use comes with some caveats.
Ceph Reaction to a Missing OSD
If an OSD goes down, the Ceph cluster starts copying data with fewer copies than specified. Although good for high availability, the copying process significantly impacts performance. This implies that you cannot run a Ceph with a nearly full storage, you must have enough disk space to handle the loss of one node.
The "no out" OSD attribute mitigates this and prevents Ceph from reacting automatically to a failure (but you are then on your own). When using the "no out" attribute, you must monitor and detect that you are running in degraded mode and take action. This resembles a failed disk in a RAID set. You can choose this behavior as default with the mon_osd_auto_mark_auto_out_in setting.
Every day and every week (deep), Ceph scrubs operations that, although they are throttled, can still impact performance. You can modify the interval and the hours that control the scrub action. Once per day and once per week are likely fine. But you need to set osd_scrub_begin_hour and osd_scrub_end_hour to restrict the scrubbing to off hours. Also, scrubbing throttles itself to not put too much load on the nodes. The osd_scrub_load_threshold variable sets the threshold.
Ceph has many parameters so that tuning Ceph can be complex and confusing. Because distributed systems push hardware, properly tuning Ceph might require things like distributing interrupt load among cores and thread core pinning, handling of Numa zones — especially if you use high-speed NVMe devices.
Hopefully, this post provided a good introduction to Ceph. I've discussed the architecture, the benefits and the caveats of Ceph. In future posts, I'll present use cases with MySQL. These cases include performing Percona XtraDB Cluster SST operations using Ceph snapshots, provisioning async slaves and building HA setups. I also hope to provide guidelines on how to build and configure an efficient Ceph cluster.
Finally, a note for the ones who think cost and complexity put building a Ceph cluster out of reach. The picture below shows my home cluster (which I use quite heavily). The cluster comprises four ARM-based nodes (Odroid-XU4), each with a two TB portable USB-3 hard disk, a 16 GB EMMC flash disk, and a gigabit Ethernet port.
I won't claim record breaking performance (although it's decent), but cost-wise it is pretty hard to beat (at around $600)!