Here’s a common question we get asked at tradeshows, on customer calls, by analysts, and by our fellow devs and DBAs in the industry. Since this has come up a lot, let’s answer it in detail!
To begin, a quick high-level architecture review is useful. ClustrixDB is at heart a single logical database implemented as a cluster of shared-nothing nodes. The minimum amount of nodes for a production ClustrixDB cluster is three (3). Each of the nodes can accept reads or writes, and a minimum of two instances of the entire dataset is hash-distributed across all the nodes. This means no single node has a complete copy of all the data. Having two copies of the data also means the cluster is fault tolerant to the loss of a single node, by default. Additional fault tolerance can be set using nResiliency.
ClustrixDB architecture is fundamentally different than what is typically called a “cluster” in the MySQL world.
A “cluster” running MySQL is usually an association between a “master” MySQL instance, and one or many slaves. The master accepts writes from the application, and the slave(s) accept only reads. Each slave is running MySQL replication and strives to be an exact copy of the master. This allows the slave to function as a “live backup,” or DR (Disaster Recovery) for the master. In addition, the application can route read-only queries to the slave, offloading that workload from the master. The trade off is latency; typical MySQL replication is asynchronous or semi-synchronous, neither of which guarantee the slave to be in sync with the master.
Or in the case of MySQL clusters using Galera replication (Maria Galera Cluster, or Percona XtraDB Cluster), each node is considered a “master” and can accept writes, while the other nodes receive that write asynchronously via “certification replication.” (A further discussion of the various types of MySQL clusters, including MySQL NDB Cluster, is in a forthcoming blog. Stay tuned!)
Long story short, MySQL replication using master/slave is asynchronous.
Each COMMIT occurs only synchronously on the local node, and the other nodes update in the background. Thus, there are no guarantees of consistent transactional states between master and slave. This is also the case with Galera certification replication, such as MariaDB Galera Cluster or Percona XtraDB Cluster.
ClustrixDB uses synchronous COMMIT across all participating nodes.
This is how ClustrixDB maintains strong consistency across all the nodes in the cluster. All nodes participating in a write must provide a synchronous acknowledgment before a write is complete. Writes are performed in parallel.
All ClustrixDB nodes participating in a write must provide an acknowledgment before a write is complete.
This means every node participating in the transaction has to PREPARE (“transaction is semantically correct and all writes are durably logged”) and ACCEPT (“transaction has been durably logged by the acceptors”) before the write is complete, and the application can be notified of COMMIT.
For example, in the simplest write use case, i.e. a point-insert, with ClustrixDB there will be two participants in the transaction. Yes, a point-insert will update only one logical location, e.g. a single table in a single data slice, but each ClustrixDB data slice by default has two replicas, a primary and a secondary, and both must be updated for a transaction to be complete.
Note: The number of ClustrixDB data replicas can be modified, but is recommended to be at least two, to ensure fault tolerance transparent to the loss of a single node. To be fault tolerant despite the loss of multiple nodes, nResiliency can set the number of data replicas (and transaction Paxos Acceptors) to withstand the loss of up to one less than half the nodes in the cluster. Why that particular limit? That’s due to handling split-brain: a cluster can only be recovered if there are at least one more than half of the original nodes still available.
ClustrixDB writes are performed in parallel across all nodes containing the data.
Each node containing data slice(s) affected by a transaction has to both PREPARE and ACCEPT that transaction before signaling to the originating node (i.e. the Global Transaction Manager for that transaction) the COMMIT for that transaction. Each of the participating nodes both PREPARE and ACCEPT in parallel, after being told the by GTM, and they do this simultaneously.
In a complex write query, multiple slices will be updated simultaneously, which means multiple nodes must PREPARE and ACCEPT for multiple slices (and the multiple replicas for each slice). Each of these operations can use available CPU cores, so not only can multiple different transactions happen in parallel on a single node, but multiple different slice updates for a single transaction can happen on a single node as well.
Strong consistency across all nodes in the cluster is always maintained.
All nodes in a ClustrixDB cluster have the identical transaction state. As a data slice is updated (on multiple nodes, due to its multiple replicas). MVCC guarantees read transactions see the correct previous state of the data, until the current write(s) COMMIT.
An example of this is online schema changes. ClustrixDB performs schema changes without blocking reads or writes to the table(s) undergoing an ALTER. These changes can range from simply adding an index or column to completely redesigning the schema. This avoids having to require partial or full table locks to allow background validation while schema changes process each row and transform the data structure.
ClustrixDB avoids locking during schema changes by creating a new temporary layered tree (or set of layered trees) and copying the data into its new structure. If there are any DML operations to a table that is being modified, a temporary transaction log records any write or update transactions run against the old layered trees, then applies them to the new layered trees. Once all the original records have been copied and logged transactions have been processed, the ALTER transaction commits, the new table goes live, and the old table is dropped.
ClustrixDB uses synchronous communication to ensure all participating nodes actually complete and durably log the local transaction(s), before the application is informed of the COMMIT. There is no possibility of skew between transactional state between the nodes.
For reference, MySQL replication is usually an asynchronous process by which a slave RDBMS consumes the changes made on the master RDBMS. This results in skew between the master and slave. Even with Galera’s “certification-based replication,” there is still skew between master and the slave. Replication is designed to provide multiple full copies of the RDBMS for fault tolerance, with the trade off of some skew.