ColumnStore: Storage Architecture Choices
Let's take a walk through MariaDB's ColumnStore, including the available architecture choices and how to set up disaster recovery.
MariaDB ColumnStore provides a complete solution for automated high availability of a cluster. On the data processing side:
- Multiple front-end or User Module (UM) servers can be deployed to provide failover on SQL processing.
- Multiple back-end or Performance Module (PM) servers can be deployed to provide failover on distributed data processing.
Due to the shared-nothing data processing architecture, ColumnStore requires the storage tier to deliver high availability for a complete solution. In this blog, I intend to provide some clarity and direction on the architecture and choices available.
As a recap, here is an outline of MariaDB ColumnStore’s architecture:
ColumnStore provides data processing redundancy by individually scaling out the front-end MariaDB Server UM nodes and back-end Distributed Query Engine PM nodes.
All ColumnStore data is accessed and managed by the PM nodes. Each PM manages one or more DBRoots, which contain Extents that hold the data values for a single column. A DBRoot belongs to exactly one PM server (at a point in time). The complete data for a given column is spread across all DBRoots with a given column value being stored exactly once. To learn more about the storage architecture, please refer to my previous blog.
During installation, a choice must be made between utilizing:
- Local Storage: The DBRoot is created on the local disk for the PM node, specifically under /usr/local/mariadb/columnstore/data<DBROOTID>.
- External Storage: The DBRoot will be stored on storage external to the PM server, then will be mounted for access. An entry for each DBRoot mount must exist in the /etc/fstab file on each PM server.
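For illustration, a DBRoot mount entry in /etc/fstab on a PM server might look like the following. The device path and filesystem type here are hypothetical placeholders; substitute the actual block device or network volume your storage layer presents:

```
# Hypothetical example: mount an external volume as DBRoot 1 on this PM server
/dev/sdb1  /usr/local/mariadb/columnstore/data1  ext4  defaults,noatime  0 0
```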
To provide data redundancy, ColumnStore relies on external storage to provide resilient storage and enable a particular DBRoot volume to be remounted on another PM server. This generally implies a remote networked storage solution, although filesystems such as GlusterFS can allow deployment without additional servers.
When local storage is utilized, journaling filesystems and RAID deployment provide resilient storage. However, because the storage is visible only within a given PM server, it cannot be remounted on another PM server should that server fail. In this case, the failed server must be recovered before ColumnStore can continue serving queries.
With external storage, ColumnStore can provide automated failover and continuity in the event a PM server fails. This is because a given DBRoot storage is external to the failed PM server and can be remounted on another PM server. The following diagram illustrates how this works:
In this case:
- Server PM4 crashes.
- The ProcMgr process on PM1 detects that PM4 is no longer reachable and instructs PM3 to mount DBRoot4 and process reads and writes in addition to DBRoot3.
- When PM4 recovers, ProcMgr instructs PM3 to unmount DBRoot4, and then PM4 to mount DBRoot4, thereby returning the system to a steady state.
- If PM1, the server hosting the active ProcMgr process, crashes, the system promotes another PM server to act as the active ProcMgr, and that server then initiates the process above.
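The reassignment behavior described above can be sketched as a small simulation. This is a toy model for intuition only, not ColumnStore's actual ProcMgr implementation; the `Cluster` class and its method names are invented for this example:

```python
# Toy model of ColumnStore's DBRoot failover behavior (not real internals).
# Steady state: DBRoot N is mounted on PM N. When a PM fails, its DBRoots
# are remounted on a surviving PM; when it recovers, they are moved back.

class Cluster:
    def __init__(self, pm_count):
        self.mounts = {n: n for n in range(1, pm_count + 1)}  # dbroot -> pm
        self.home = dict(self.mounts)  # steady-state assignment
        self.alive = {n: True for n in range(1, pm_count + 1)}

    def fail(self, pm):
        """PM crashes: remount each of its DBRoots on a surviving PM."""
        self.alive[pm] = False
        survivors = [p for p, up in self.alive.items() if up]
        for dbroot, owner in self.mounts.items():
            if owner == pm:
                self.mounts[dbroot] = survivors[-1]  # e.g. PM3 takes DBRoot4

    def recover(self, pm):
        """PM returns: remount its home DBRoots, restoring steady state."""
        self.alive[pm] = True
        for dbroot, home_pm in self.home.items():
            if home_pm == pm:
                self.mounts[dbroot] = pm

cluster = Cluster(4)
cluster.fail(4)     # DBRoot4 is now mounted on PM3 alongside DBRoot3
cluster.recover(4)  # DBRoot4 moves back to PM4; steady state restored
```

The key property the storage layer must supply is the one the `fail` step assumes: a DBRoot volume can be detached from one PM and attached to another.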
Storage choices that support this model include:
- AWS EBS (if deployed on AWS) – The ColumnStore AMI image utilizes this and provides automation around storage management.
- Other Cloud Platforms – Similar capabilities are available from other providers, for example Persistent Disks on Google Cloud.
- SAN / NAS – Many vendor choices exist and may include capabilities such as snapshotting to simplify backup and asynchronous replication to DR storage.
- GlusterFS – Open source software filesystem (with support available from RedHat). On a small cluster this can be co-deployed with the PM servers for a simplified topology.
- CEPH – Open source storage cluster (with support available from RedHat). This enables deployment of a lower cost software-defined storage cluster.
- DRBD – A community member has been working with us on testing DRBD (Distributed Replicated Block Device) as a storage layer, though it is not certified yet.
- Other Options – Other storage solutions will work as long as they present a compliant file system layer to the Linux operating system and ensure that a data volume can be remounted onto a different PM server.
To enable a warm standby DR setup, a second MariaDB ColumnStore cluster should be installed and configured identically in a secondary location, with the storage layer's replication configured to replicate data between the locations. Should the cluster in Data Center 1 fail, the cluster in Data Center 2 can be brought online to provide continuity.
To enable an active-active DR setup, a second MariaDB ColumnStore cluster should be installed, configured and run in a secondary location. Potentially this could have a different topology of PMs from the primary data center. The data feed output of the ETL process must be duplicated and run on both clusters in order to provide a secondary active cluster.
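Duplicating the ETL output amounts to fanning each batch out to both clusters' loaders. Here is a minimal sketch of that idea; `fan_out` and the loader callables are hypothetical stand-ins for whatever bulk-load mechanism each site actually uses:

```python
# Hypothetical fan-out of one ETL feed to two ColumnStore clusters.
# Each loader is a callable that ingests one batch at one site.

def fan_out(batches, loaders):
    """Deliver every batch to every cluster's loader, in order."""
    for batch in batches:
        for load in loaders:
            load(batch)

# Demo with in-memory stub loaders standing in for real bulk loads:
dc1_rows, dc2_rows = [], []
fan_out([["row1"], ["row2"]], [dc1_rows.extend, dc2_rows.extend])
# Both sites now hold the same data.
```

In practice the two loads should be made independent (e.g., queued per site), so a slow or unreachable secondary does not stall ingestion at the primary.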
MariaDB ColumnStore offers fully automated high availability. To provide this at the storage level, it relies upon the storage layer to provide both:
- Resilient, redundant storage of the data itself.
- The ability to remount a DBRoot on a different PM server.
In the majority of cloud providers, such networked storage capabilities are a baseline offering. Private cloud platforms such as OpenShift come with comparable capabilities, for example the CEPH storage cluster. Finally, for a bare metal installation, storage appliances and software-defined storage offerings can provide this. Storage-based replication is tried and trusted, and handles the many edge cases that arise in real-world replication scenarios.
Not needing to focus on this enables my engineering team to support more customer use cases to broaden the reach and impact of ColumnStore. For example, in response to community and customer feedback, we are working on improving the text data type as well as providing a bulk write API to better support streaming use cases (i.e., Kafka integration). These improvements, plus other roadmap items, will be addressed in our 1.1 release later this year.