Elasticsearch Distributed Consistency Principles Analysis and Data, Part 2
We finish up this two-part article by focusing on the PacificA algorithm, how it fits into, and compares against, Elasticsearch.
Join the DZone community and get the full member experience.Join For Free
Welcome back! If you missed Part 1, you can check it out here.
The Elasticsearch data replication model is based on the primary-backup model and is described very well in the PacificA paper of Microsoft Research. That model is based on having a single copy from the replication group that acts as the primary shard. The other copies are called replica shards. The primary serves as the main entry point for all indexing operations. It is in charge of validating them and making sure they are correct. Once an index operation has been accepted by the primary, the primary is also responsible for replicating the operation to the other copies.
There are few articles on the Internet that provide details about this algorithm, so, in this article, we give a brief introduction to the algorithm based on the PacificA paper. The algorithm has the following features:
- Has strong consistency.
- Synchronizes the data from a single Primary node with multiple Secondary nodes.
- Uses additional consistency components for Configuration maintenance.
- Supports writes even when a minority of Replica nodes are available.
First, let us take a look at some terms used by this algorithm:
- Replica Group: A dataset in which each piece of data is a copy of another, and each copy is a Replica node. Only one copy in a Replica Group is the Primary node; the rest are Secondary nodes.
- Configuration: Configuration of a Replica Group describes which copies are included in the Replica Group and which one is the Primary.
- Configuration Version: The version number of the Configuration. The version number increments by 1 whenever Configuration changes occur.
- Configuration Manager: This manages global Configuration components, which ensures the consistency of Configuration data. Configuration change requests are initiated by a Replica node and are then sent to the Configuration Manager along with the Version. The Configuration Manager verifies that the Version is correct. If not, the change request is rejected.
- Query and Update: There are two types of Replica Group operations, Query and Update. Query does not change the data, while Update does.
- Serial Number (sn): This represents the order of each Update operation execution. It increments by 1 for every Update operation, and it is a consecutive number.
- Prepared List: This is the preparation sequence for Update operations.
- Committed List: This is the commit sequence for Update operations. The operations in the commit sequence definitely take effect (unless all copies crash). On the same Replica node, the Committed List must come before the Prepared List.
With the PacificA algorithm, an error detection mechanism is required to satisfy the following invariant.
When a Replica node deems itself the Primary node at any time, the configuration maintained in the Configuration Manager also considers it to be the current Primary. At any time, only one Replica node deems itself the Primary node in this Replica Group.
The Primary Invariant can ensure that, when a node deems itself the Primary, it must be the current Primary node. If the Primary Invariant cannot be satisfied, Query requests would likely be sent to the Old Primary, which would result in legacy data being read.
How do you ensure the Primary Invariant is satisfied? According to the paper, this can be achieved by adopting a Lease mechanism, which is a common method used in distributed systems. Specifically, the Primary node periodically obtains a Lease, and once successfully obtained, it deems itself to be the only Primary node for a set period. It loses Primary status if it has not obtained a new Lease once the period has expired. As long as the CPU in each machine does not have significant clock skew, the effectiveness of the lease mechanism is guaranteed.
As described in the paper, the Lease mechanism has the Primary node send a heartbeat to all Secondary nodes to obtain a Lease, instead of having all nodes obtain a Lease from a centralized component. Using this decentralized model ensures that there is no centralized component that, if it fails, causes all nodes to lose their leases.
The Query process is relatively simple. Queries can only be sent to the Primary node, and the Primary node returns the corresponding values based on the latest committed data. Since this algorithm requires the Primary Invariant condition to be met, Queries always read the latest committed data.
The update process is as follows:
- Primary node assigns a Serial Number (sn) to an UpdateRequest.
- The Primary node adds this UpdateRequest to its own Prepared List. Meanwhile, it sends the Prepare request to all Secondary nodes, requiring them to add this UpdateRequest to their Prepared Lists.
- When all Replica nodes complete Prepare, that is, when the Prepared Lists of all Replica nodes contain the Update request, the Primary node starts to commit the request, adding the UpdateRequest to Committed List and applying the Update. Note that, on the same Replica node, Committed List always comes before the Prepared List, so the Primary node increases the Committed Point when including the Update Request.
- The result is returned to the Client, and the Update operation is successful.
When the Primary node sends the next request to a Secondary node, the current Committed Point of the Primary is attached to the request, and the Secondary node increases its Committed Point.
We can derive the following invariant from the Update process:
We mark the Committed List of a Secondary node as SecondaryCommittedList, the Prepared List as SecondaryPreparedList, and the Committed List of the Primary as PrimaryCommittedList. SecondaryCommittedList must come before PrimaryCommittedList, and PrimaryCommittedList must come before SecondaryPreparedList.
Reconfiguration: Secondary Failure, Primary Failure, Newly Added Node
1. Secondary Failure
When a Secondary node fails, the Primary node sends a Reconfiguration request to the Configuration Manager, removing the failed node from the Replica Group. Once the Replica node is removed, it no longer belongs to the Replica Group, and requests are no longer sent to it.
Assume that a network fault occurs between a Primary node and Secondary node. In this case, both can nonetheless connect to the Configuration Manager. At this time, the Primary node detects that there is no response from the Secondary node, and, likewise, the Secondary node detects that there is no response from the Primary node. Both try to send a Reconfiguration request to have the other removed from Replica Group. The strategy applied here is the First Win principle: the first request received by Configuration Manager takes effect, and the sender remains in the Replica Group; because the other node no longer belongs to the Replica Group, it can no longer update the Configuration. Because the Primary node requests a Lease from the Secondary node, the Secondary node does not execute Reconfiguration while the Lease is valid, and the probe interval of the Primary node must be less than the Lease probing interval. In this situation, it appears that the tendency is always that the Primary node executes Reconfiguration to have the Secondary node removed.
2. Primary Failure
When a Primary node fails, the Secondary node stops receiving heartbeats from the Primary node. If the Lease is expired, the Secondary node sends a Reconfiguration request to have the Primary removed, which also follows the First Win principle: the Secondary node sending the successful request becomes the new Primary node.
After a Secondary node becomes a Primary node, it must go through a phase called Reconciliation before providing a service. Because of the Committed Invariant mentioned above, the Committed List of the previous Primary node must come before the Prepared List of the new Primary node. This means that, when we align the Prepared List content of the new Primary node with other nodes in the current Replica Group, which is equivalent to recommitting the uncommitted records of this node on all nodes, all previous Commit records must be included. That leads to the next invariant:
Reconfiguration Invariant: When a new Primary node completes Reconciliation at T time, the Committed List of any node before T time (including the original Primary node) takes precedence over the current Committed List of the new Primary node.
The Reconfiguration Invariant indicates that the committed data is not lost during the Reconfiguration process.
3. Newly Added Node
The newly added node must become a Secondary Candidate first, then the Primary node starts to send it Prepare requests. Meanwhile, this node tries to catch up with records that were not previously synchronized. Once it catches up with the records, it sends a request to be a Secondary node, after which the Primary node sends a configuration change request to Configuration Manager to add the node to the Replica Group.
There is another scenario: A node was in the Replica Group and removed due to temporary failure, and now needs to be re-added to the Replica Group. At this time, the data in Committed List on this node must have been committed, while the data in Prepared List may not have been committed. So, the uncommitted data should be removed, and the data should be requested from the Primary beginning at the Committed Point.
PacificA Algorithm Summary
PacificA is an algorithm with strong consistency that meets both read and write requirements. It separates data consistency from Configuration consistency and uses additional consistency components (Configuration Manager) to maintain configuration consistency. This way, when less than half of the copies of data are available, new data can still be written and strong consistency can be ensured.
The ES design refers to the PacificA algorithm. It maintains Index Meta through the Master, which is similar to Configuration maintenance by the Configuration Manager, as discussed in the paper. In IndexMeta, InSyncAllocationIds represents the currently available Shards, which is similar to Replica Group maintenance in the paper. Next, we introduce the SequenceNumber and Checkpoint in ES. These two classes are similar to the Serial Number and Committed Point in the PacificA algorithm. Afterward, we compare the similarities and differences between ES implementation and PacificA.
SequenceNumber, Checkpoint, and Failure Discovery
PacificA, a consistency algorithm model used by ES, is described above. It is important to note that each PacificA Update operation has a corresponding Serial Number, which indicates the order of execution. In the previous versions of ES, some functionality was limited because each write operation lacked a Serial Number or similar mechanism. In 2015, ES officials began planning to add SequenceNumber for each write operation and assumed there would be many application scenarios.
Further details are available at the following two links:
Next, we give a brief introduction to the definitions of Sequence and Checkpoint and discuss their application scenarios.
Term and SequenceNumber
Each write operation is assigned two values: Term and SequenceNumber. Term increments by 1 whenever the Primary changes, which is similar to Configuration Version in the PacificA paper. SequenceNumber increments by 1 after each operation, which is similar to Serial Number in the PacificA paper.
Because the read request is always sent to the Primary node, it assigns the Term and SequenceNumber. When the synchronization request is sent to the Replica node, the two values are attached.
LocalCheckpoint and GlobalCheckpoint
LocalCheckpoint indicates that all requests in this Shard with values less than this value have been processed.
GlobalCheckpoint indicates that all requests with values less than this value have been processed on all Replica nodes. GlobalCheckpoint is maintained by the Primary node. Each Replica node reports its LocalCheckpoint to the Primary node, and then Primary increases the GlobalCheckpoint based on that information.
GlobalCheckpoint is a global safety point indicating that all requests before it have been processed properly by the Replica node and can be used to repopulate data after recovering from a node failure. GlobalCheckpoint can also be used for the Translog GC because there is no longer a need to save the previous operation records. However, the Translog GC strategy in ES is applied based on size or time, while GlobalCheckpoint does not seem to be used.
Fast Failure Recovery
When a Replica node fails, ES removes it. When the failure exceeds a specific period, ES assigns a new Replica node to the new Node. At this point, full data synchronization is needed. But, if the previously failed Replica node returns, simply repopulating the data after the failure recovery and adding the node back once catching up with the records results in fast failure recovery. There are two conditions that must be met to enable fast failure recovery: first, all the operations and their orders during the failure can be saved; second, the node that started data synchronization must be determined. The first condition can be met by saving Translog for a specific amount of time; the second condition can be met using Checkpoint, ultimately achieving fast failure recovery. This is the first important application scenario using SequenceNumber and Checkpoint.
Comparison Between Elasticsearch and PacificA
- Meta consistency and Data consistency are handled separately: In PacificA, Configuration consistency is maintained through Configuration Manager; in ES, Meta consistency is maintained through Master.
- Maintain the copies collection in synchronization: In PacificA, the Replica Group is maintained; in ES, InSyncAllocationIds is maintained.
- SequenceNumber: In both PacificA and ES, write operations use SequenceNumber to record the operation order.
The main difference is that ES complies with PacificA; however, its implementation still does not meet all the requirements of the algorithm, meaning strict strong consistency is not guaranteed. The key points are as follows:
- Meta consistency: We analyzed the Meta consistency issue in ES in the previous section, and we can see that ES cannot guarantee Meta consistency, so it certainly cannot strictly guarantee Data consistency.
- Prepare phase: PacificA has the Prepare phase, which ensures that the data is not committed until it is prepared successfully on all nodes and that the committed data is not lost. In ES, the data is written directly, as it lacks this phase.
- Read consistency: In ES, all InSync Replica nodes can be read, which improves data readability; however, legacy data may also be read. On the other hand, even if only the Primary node can be read, ES also needs a mechanism like Lease, so that the Old Primary is not read. Given that ES is a near real-time system, the requirements for read consistency may not be very strict.
This article analyzed the consistency issues of the data flow in Elasticsearch. While ES has made substantial progress addressing these issues recently, many issues remain. This article is the last of the Elasticsearch Distributed Consistency Principles Analysis series. This series covers the research, analysis, and summary for ES, with step-by-step details covering node discovery, Master election, Meta consistency, Data consistency, and other aspects.
Published at DZone with permission of Leona Zhang. See the original article here.
Opinions expressed by DZone contributors are their own.