Lifecycle of a Node in Couchbase Server Demystified: Adding & Removing Nodes, Rebalancing & Failover
One of the top attributes of Couchbase Server is its simplicity when it comes to deploying and managing a cluster. Changing the topology of a cluster takes only a few actions and a few node states in Couchbase, and that's what I will present in this post.
Every node is identical in Couchbase Server in the binaries it carries and services it provides. As more nodes get added to the cluster, they mostly inherit settings from the first node, though you have options to configure certain settings per node like data file location.
The first node in Couchbase Server starts life with the initialization of a cluster. The initial setup captures your cluster-wide settings, sample buckets, and the admin password for the cluster. Add Server, Remove Server, Rebalance and Failover are the main verbs that cause nodes to transition between states. These verbs work on nodes in 3.0.
Active and Inactive represent the steady states of nodes. Active means a node is part of the cluster and is taking traffic. Inactive means a node is no longer part of the cluster and is not taking traffic. All the other states serve an important purpose, so let's take a deeper look.
Add and Remove Server are obvious operations. However, there is one important thing to note about them: these operations don't immediately commit the change to topology but put the node in a pending add or pending remove state. Failover operates in a similar fashion. Failover is the verb that causes the promotion of replicas to masters - if you have replicas to fail over to. It also puts the node in a failed over state and does not immediately inactivate the node.
Why do we need these intermediate states? Well, if you want to be efficient about data movement, you want to queue up all the changes to the topology and commit them at once, so the transition completes in a single data movement step. One move for all the changes is obviously a lot more efficient than many intermediate moves of data.
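To make the efficiency argument concrete, here is a toy calculation in Python. It is a rough sketch, not how the rebalancer actually plans moves: it assumes the 1024 active vBuckets of a bucket are spread evenly across nodes, ignores replicas entirely, and assumes a rebalance moves only the vBuckets that have to land on newly added nodes. The function names are made up for this illustration.

```python
NUM_VBUCKETS = 1024  # active vBuckets per bucket


def balanced_counts(num_nodes, total=NUM_VBUCKETS):
    """Even split of vBuckets across nodes; the first few nodes absorb any remainder."""
    base, extra = divmod(total, num_nodes)
    return [base + 1 if i < extra else base for i in range(num_nodes)]


def moved_when_growing(old_nodes, new_nodes, total=NUM_VBUCKETS):
    """Toy lower bound: only the vBuckets destined for the new nodes are moved."""
    return sum(balanced_counts(new_nodes, total)[old_nodes:])


# Grow a 2-node cluster to 4 nodes, rebalancing after every single add...
one_at_a_time = moved_when_growing(2, 3) + moved_when_growing(3, 4)

# ...versus queuing both adds and committing them with one rebalance.
all_at_once = moved_when_growing(2, 4)

print(one_at_a_time, all_at_once)
```

Under these assumptions the step-by-step path moves more vBuckets than the single commit, because some data is shipped to the third node only to be shipped again when the fourth arrives.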
Rebalance is the verb that commits the topology change and initiates the data movement, building or rebuilding replicas within the cluster. Rebalance has a lot of smarts, like the ability to detect that you are adding and removing an equal number of servers. This case is called a swap-rebalance and requires no data movement on nodes that are not in pending remove or pending add state.
Most of this does not surprise people, except for one thing in this picture (even some of the veterans are surprised by it). That one surprise is that the remove server action does not take a node out of rotation immediately. A node in pending remove state still takes traffic until a rebalance is issued. If you are interviewing at Couchbase, watch out for the trick question on this topic!
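The states and verbs described in this post can be sketched as a small state machine in Python. This is a toy model for illustration only, not Couchbase code: the class and function names are made up, and a few details are my assumptions rather than statements from the product, namely that failover is only applied to active nodes and that nodes in pending add or failed over state serve no traffic.

```python
from enum import Enum


class State(Enum):
    INACTIVE = "inactive"
    PENDING_ADD = "pending add"
    ACTIVE = "active"
    PENDING_REMOVE = "pending remove"
    FAILED_OVER = "failed over"


# Verbs that act on one node: (verb, current state) -> next state.
# Assumption: failover is only modeled for active nodes.
NODE_VERBS = {
    ("add_server", State.INACTIVE): State.PENDING_ADD,
    ("remove_server", State.ACTIVE): State.PENDING_REMOVE,
    ("failover", State.ACTIVE): State.FAILED_OVER,
}

# Rebalance commits every pending change in one step.
REBALANCE = {
    State.PENDING_ADD: State.ACTIVE,
    State.PENDING_REMOVE: State.INACTIVE,
    State.FAILED_OVER: State.INACTIVE,
}

# A pending remove node keeps taking traffic until the rebalance.
TAKES_TRAFFIC = {State.ACTIVE, State.PENDING_REMOVE}


class ToyCluster:
    def __init__(self, active_nodes):
        self.nodes = {name: State.ACTIVE for name in active_nodes}

    def apply(self, verb, name):
        """Apply add_server, remove_server or failover to a single node."""
        current = self.nodes.get(name, State.INACTIVE)
        if (verb, current) not in NODE_VERBS:
            raise ValueError(f"cannot {verb} a node that is {current.value}")
        self.nodes[name] = NODE_VERBS[(verb, current)]

    def is_swap_rebalance(self):
        """True when an equal, non-zero number of adds and removes is pending."""
        states = list(self.nodes.values())
        adds = states.count(State.PENDING_ADD)
        return adds > 0 and adds == states.count(State.PENDING_REMOVE)

    def rebalance(self):
        """Commit all pending topology changes at once."""
        self.nodes = {n: REBALANCE.get(s, s) for n, s in self.nodes.items()}

    def serving(self):
        return sorted(n for n, s in self.nodes.items() if s in TAKES_TRAFFIC)


cluster = ToyCluster(["node1", "node2"])
cluster.apply("add_server", "node3")
cluster.apply("remove_server", "node1")
serving_before = cluster.serving()   # node1 is pending remove but still serving
swap = cluster.is_swap_rebalance()   # one add and one remove are pending
cluster.rebalance()
serving_after = cluster.serving()
print(serving_before, swap, serving_after)
```

In this model, node1 stays in the serving list after Remove Server and only drops out once the rebalance runs, which is exactly the behavior that tends to surprise people.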
By the way, with the next version of the product, we are looking to make some major changes in this area. If you are interested in giving feedback, feel free to reach out to me at email@example.com.