Elasticsearch Distributed Consistency Principles Analysis, Part 1

DZone 's Guide to

Elasticsearch Distributed Consistency Principles Analysis, Part 1

In Part 1 of this two-part series, we look at ES cluster composition, node discovery, and master election. Read to get started mastering Elasticsearch!

· Big Data Zone ·
Free Resource

Elasticsearch (ES) is the most common open-source distributed search engine. It's based on Lucene, an information-retrieval library, and provides powerful search and query capabilities. To learn its search principles, you must understand Lucene. To learn the ES architecture, you must know how to implement a distributed system. Consistency is at the core of distributed systems.

This article describes the ES cluster composition, node discovery, master election, error detection, and scaling. In terms of node discovery and master election, ES uses its own implementation instead of external components such as ZooKeeper. We will describe how this mechanism works, and the problems with it. This series covers:

  1. ES cluster composition (Part 1)
  2. Node discovery (Part 1)
  3. Master election (Part 1)
  4. Error detection (Part 2)
  5. Cluster scaling (Part 2)
  6. Comparison with the implementation method, such as Zookeeper and Raft (Part 2)
  7. Summary (Part 2)

ES Cluster Composition

First, an Elasticsearch cluster (ES cluster) is composed of multiple nodes, which have different types. Through the configuration below, four types of nodes can be generated:

    node.master: true/false
    node.data: true/false

The four types of nodes are combinations of the true/false node.master and node.data. Other types of nodes, such as IngestNode (which is used for data pre-processing), are not within the scope of this article.

When node.master is true, the node is a master node candidate and can participate in the election. It is often referred to as a master-eligible node in ES documentation, which is similar to MasterCandidate. The ES can only have one master (that is, leader) during normal operation, as having more than one master would cause a split-brain.

When node.data is true, the node acts as a data node, stores the shard data assigned to the node, and is responsible for the write and query of the shard data.

In addition, a node in any cluster can perform any request. The cluster forwards the request to the corresponding node for processing. For example, when node.masterand node.data are both false, this node acts as a proxy-like node, accepts requests, and forwards aggregated results.

Elasticsearch cluster

The figure above is a diagram of an ES cluster, where Node_A is the master of the current cluster, and Node_ B and Node_C are the master node candidates; Node_A and Node_ B are also DataNodes; in addition, Node_D is a simple DataNode; and Node_E is a proxy node.

Here are some questions to consider: how many master-eligible nodes should be configured for an ES cluster? When there are insufficient storage or computing resources for the cluster, and scaling is needed, what type should the added nodes be set to?

Node Discovery

After a node is started, it needs to be added to the cluster through node discovery. ZenDiscovery is an ES module providing functionality, like node discovery and master election, without having to rely on tools such as ZooKeeper. See the official documentation:


In short, node discovery relies on the following configuration:

    discovery.zen.ping.unicast.hosts: [,,]

This configuration creates an edge from each node to every other host. When all nodes in the cluster form a connectivity map, each node can see other nodes in the cluster, preventing silos.

The official recommendation is that the unicast hosts list is maintained as the master-eligible nodes list in a cluster. Therefore, it is recommended that the unicast hosts list is maintained as the master-eligible nodes list in a cluster.

Master Election

As mentioned above, there may be more than one master-eligible node in a cluster, and master election ensures that there is only one elected master node. If more than one node is elected master, a split-brain will occur, which can affect data consistency and lead to chaos in the cluster with varying unexpected results.

To avoid a split-brain, ES uses a common distributed system concept, ensuring that the elected master is recognized by the master-eligible node of the quorum, resulting in only one master. This quorum is configured as follows:

    discovery.zen.minimum_master_nodes: 2

This configuration is critical for the cluster.

1. Who Initiated the Master Election and When Did it Start?

The master election is initiated by a master-eligible node when the following conditions are met:

  1. The current state of the master-eligible node is not master.
  2. The master-eligible node queries other known nodes in the cluster using ZenDiscovery's ping operation, and confirms that no nodes are connected to the master.
  3. There are currently more than minimum_master_nodes nodes (including this node) that are not connected to the master.

In short, when a node determines that the master-eligible nodes within the quorum, including itself, think that the cluster has no master, then master election can be initiated.

2. When Master Election Is Required, Which Node Should Be Elected?

The first question is, which node should be elected? As shown in the following source code, the first MasterCandidate (that is, master-eligible node) after sorting is elected.

    public MasterCandidate electMaster(Collection<MasterCandidate> candidates) {
        assert hasEnoughCandidates(candidates);
        List<MasterCandidate> sortedCandidates = new ArrayList<>(candidates);
        return sortedCandidates.get(0);

Then, how are they sorted?

public static int compare(MasterCandidate c1, MasterCandidate c2) {
    // we explicitly swap c1 and c2 here. The code expects "better" to be lower in a sorted
    // list, so if c2 has a higher cluster state version, it needs to come first.
    int ret = Long.compare(c2.clusterStateVersion, c1.clusterStateVersion);
    if (ret == 0) {
        ret = compareNodes(c1.getNode(), c2.getNode());
    return ret;

As shown in the source code above, the clusterStateVersion of the nodes is compared, with higher clusterStateVersion taking priority. When nodes have the same clusterStateVersion, the program goes to compareNodes, in which the IDs of the nodes are compared (IDs are randomly generated when the node initially starts).

In summary:

  1. The higher clusterStateVersion takes priority. This ensures that the new master has the latest clusterState (that is, the meta of the cluster), avoiding loss of committed meta changes. When the master is elected, it is updated based on the clusterState of this version (one exception is when the cluster restarts and none of the nodes have meta; in this case, a master needs to be elected first, then the master uses persistent data for meta recovery, and then performs meta synchronization).
  2. When nodes have the same clusterStateVersion, the node with a lower ID takes priority. That is, a node with a low ID tends to be selected. The ID is a random string generated when the node initially starts. This is designed to ensure the stability of the election results, avoiding election failure due to too many master candidates.

3. What Is a Successful Election?

When a master-eligible node (Node_A) initiates an election, it chooses an approved master according to the sorting strategy above. The process varies depending on whether Node_A selects itself or Node_B as master.

Assuming Node_A Selects Node_B as the Master:

Node_A sends a join request to Node_ B, then:

  1. If Node_ B has become master, it adds Node_A to the cluster, and publishes the latest cluster_state, which contains the information for Node_A. It is similar to adding a new node under normal circumstances. Node_A completes the join when a new cluster_state is published for Node_A.
  2. If Node_ B is running for master, it will take this join as a vote. In this case, Node_A waits until timeout to see whether Node_ B becomes the master, or another node is elected as master.
  3. If Node_ B thinks it is not the master (at any time), it will reject this join. In this case, Node_A initiates the next election.

Assuming That Node_A Selects Itself as Master:

Node_A waits for other nodes to join, that is, waits for votes from other nodes. When more than half of the votes are collected, it regards itself as master, changes the master node in the cluster_state to itself, and sends a message to the cluster.

For more information, see the following source code:

        if (transportService.getLocalNode().equals(masterNode)) {
            final int requiredJoins = Math.max(0, electMaster.minimumMasterNodes() - 1); // we count as one
            logger.debug("elected as master, waiting for incoming joins ([{}] needed)", requiredJoins);
            nodeJoinController.waitToBeElectedAsMaster(requiredJoins, masterElectionWaitForJoinsTimeout,
                    new NodeJoinController.ElectionCallback() {
                        public void onElectedAsMaster(ClusterState state) {
                            synchronized (stateMutex) {

                        public void onFailure(Throwable t) {
                            logger.trace("failed while waiting for nodes to join, rejoining", t);
                            synchronized (stateMutex) {

        } else {
            // process any incoming joins (they will fail because we are not the master)
            nodeJoinController.stopElectionContext(masterNode + " elected");

            // send join request
            final boolean success = joinElectedMaster(masterNode);

            synchronized (stateMutex) {
                if (success) {
                    DiscoveryNode currentMasterNode = this.clusterState().getNodes().getMasterNode();
                    if (currentMasterNode == null) {
                        // Post 1.3.0, the master should publish a new cluster state before acknowledging our join request. We now should have
                        // a valid master.
                        logger.debug("no master node is set, despite the join request completing. Retrying pings.") ;
                    } else if (currentMasterNode.equals(masterNode) == false) {
                        // update cluster state

                } else {
                    // failed to join. Try again...

Following the process above, here is a simple scenario to make it clearer:

Assuming that a cluster has 3 master-eligible nodes, Node_A, Node_ B, and Node_C, and the election priority order is Node_A, Node_ B, Node_C. Each of the three nodes determines that there is no current master. Each node initiates an election, and based on the priority order, all nodes elect Node_A. So Node_A waits for joins. Node_B and Node_C send join requests to Node_A. When Node_A receives the first join request, along with its own vote, it has two votes in total (more than half) and becomes master. At this point, the cluster_state contains two nodes. When Node_A receives a join request from the remaining node, the cluster_state contains all three nodes.

4. How Can the Election Avoid Split Brain?

The basic principle lies in the quorum strategy. If only the node approved through quorum becomes master, it is impossible for two nodes to be approved by the quorum.

In the process above, the master candidate needs to wait for nodes that submitted approval in the quorum to join before becoming master. This ensures that this node was approved by the quorum. While the process above looks reasonable and works well in most scenarios, there is a problem.

This process has no restriction on how many times a node can vote in the election process. Under what circumstances would a node be allowed to vote twice? For example, Node_B votes for Node_A once, but Node_A hasn't become master after a certain period of time. Node_ B can't wait, and initiates the next election. At this point, it determines that the cluster contains Node_0, which has a higher priority than Node_A, so Node_B votes for Node_0. Assuming that both Node_0 and Node_A are waiting for votes, then Node_B has voted twice, each time for different candidates.

How can we solve this problem? For example, the Raft algorithm introduces the concept of election term, ensuring that each node can vote only once during each election term. Additional votes would be counted in term+1. If both the last two nodes think they are the master, one term must be greater than the other. Because quorum votes are received for both terms, the quorum node has a greater term, ensuring that the node with the smaller term cannot commit any status changes (commits require the quorum node for successful log persistence, and quorum persistence conditions cannot be met due to the term check). This ensures that status changes within the cluster are always consistent.

ES (v6.2) has not solved this problem yet. In test cases in similar scenarios, sometimes two masters are elected, and both nodes consider themselves master and publish a status change to the cluster. Publishing includes two phases. First, it ensures that the quorum node "accepts" this change, then all nodes are required to commit this change. Unfortunately, the two masters may both complete the first phase, and enter the commit phase. This causes inter-node status inconsistency, which isn't an issue in Raft. How can both masters complete the first phase? Because in the first phase, ES puts the new cluster_state into the memory queue after a simple check. If the master of the current cluster_state is empty, it will not be checked. In other words, after accepting the cluster_state where Node_A becomes master (before committing), Node_B also can be accepted as master in the cluster_state. This allows both Node_A and Node_B to meet the commit condition and initiate the commit command, which leads to inconsistent cluster status. Of course, split-brain situations like this will be automatically recovered quickly, because when a master publishes cluster_state again after the inconsistency occurs, the quorum condition will no longer be met, or it is automatically downgraded to a candidate because its followers no longer constitute a quorum.

When compared with mature consistency solutions, ES's ZenDiscovery modules have issues handling some specific scenarios. We will analyze other scenarios where ES consistency has issues in the following description of the meta change process.

That's all for Part 1! Tune back in on Monday when we'll cover topics such as error detection, cluster scaling, and more. 

big data, elasticsearch, elasticsearch clusters, elasticsearch tutorial for beginners, zookeeper

Published at DZone with permission of Leona Zhang . See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}