Eventual Consistency Part 1: The Hinted Handoff Queue

This blog series looks at eventual consistency, the consistency model used by many distributed systems and microservices.

By Katy Farmer · Jun. 04, 18 · Tutorial

In this article series, we're going to explore eventual consistency, a term that can be hard to define without having all the right vocabulary. This is the consistency model used by many distributed systems, including InfluxDB Enterprise edition. There are two concepts we need to know in order to understand eventual consistency: the hinted handoff queue and anti-entropy, both of which require special attention.

Part I

What Is the Hinted Handoff Queue?

Despite having a pretty cool name, the hinted handoff (HH) queue doesn't get a lot of attention. The HH queue has a pretty important job, but unless you're a system admin, you rarely interact with it directly. Let's dig into what exactly a hinted handoff queue is and why it matters to you.

In order to talk about the HH queue, we have to talk about distributed computing, just a little. One reason systems like InfluxDB Enterprise exist as distributed systems is to eliminate single points of failure. InfluxDB Enterprise uses a replication factor (RF) to determine how many copies of any one set of data should exist. Setting the RF above 1 means the system has a higher chance of servicing a request successfully and not returning an error during a data node outage, because we no longer have only one copy of the data that could be lost or unavailable. Distributed systems also present unique challenges: how do we know that the data is consistent across the system, especially when storing multiple copies of data?
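To make RF a little more concrete, here is a minimal Go sketch of shard placement under a replication factor. The ring-style placement rule and the ownersFor function are invented for illustration; they are not InfluxDB Enterprise's actual shard-assignment algorithm.

    package main

    import "fmt"

    // ownersFor returns the data nodes that should hold a copy of a shard,
    // using a simple ring placement. This rule is illustrative only, not
    // InfluxDB Enterprise's real shard-assignment logic.
    func ownersFor(shardID, nodeCount, rf int) []int {
        owners := make([]int, 0, rf)
        for i := 0; i < rf && i < nodeCount; i++ {
            owners = append(owners, (shardID+i)%nodeCount)
        }
        return owners
    }

    func main() {
        // With RF = 2 and three nodes, every shard lives on two nodes,
        // so losing any single node never loses the only copy.
        for shard := 0; shard < 3; shard++ {
            fmt.Printf("shard %d -> nodes %v\n", shard, ownersFor(shard, 3, 2))
        }
    }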

First, we have to understand some of the promises made by eventual consistency. Spoiler alert: the data in the system must eventually be consistent. When we ask for information from a distributed system, there are points in time when the answers we receive may not agree with one another. There is some "drift" in the answers we receive as data is stored and replicated throughout the system, but over time, that "drift" should be eliminated. In practice, this means that the most recent time ranges may show the most variation in their results, but that variation disappears as the system works through the mechanics of making the same information available everywhere.

[Figure: kinds of acceptable drift]

If we promise that the system will eventually be consistent, how do we account for failed writes? Data nodes can go offline for any number of reasons, from running out of disk space to plain old hardware malfunctions. If a node is missing data points from the time it spent offline, it can never be consistent, and thus, our promise of eventual consistency would become a lie.

Failed writes can also affect the replication factor across the system. Maintaining the specified RF is another promise that we have to keep, and it's another possible point of failure for writes if data nodes are offline.

Example

Let's explore the simplest example: InfluxDB Enterprise with two data nodes and a database with an RF of 2. Data arrives at your favorite load balancer via some collection agent (e.g., Telegraf), and the load balancer distributes the writes (also reads, but we'll use writes in this example) to the underlying data nodes. Typically, the load balancer distributes the writes in round-robin fashion. The data node receiving the data stores and replicates it (sends it to the other data node), and voila: the RF of 2 is achieved.
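Here is a hedged Go sketch of that flow. The node type, its write method, and the round-robin loop are hypothetical stand-ins for the load balancer and data nodes, not InfluxDB Enterprise internals.

    package main

    import "fmt"

    // node is a stand-in for a data node; the type and its methods are
    // hypothetical, for illustration only.
    type node struct {
        name string
        peer *node
        data []string
    }

    // write stores a point locally, then replicates it to the peer node.
    // That replication is what achieves RF = 2 in this two-node cluster.
    func (n *node) write(point string) {
        n.data = append(n.data, point)
        n.peer.data = append(n.peer.data, point)
    }

    func main() {
        a, b := &node{name: "data-1"}, &node{name: "data-2"}
        a.peer, b.peer = b, a

        // The load balancer round-robins incoming writes across the nodes.
        nodes := []*node{a, b}
        points := []string{"cpu=0.61", "cpu=0.70", "cpu=0.55", "cpu=0.64"}
        for i, p := range points {
            nodes[i%len(nodes)].write(p)
        }

        fmt.Println(a.name, a.data) // both nodes end up with all four points
        fmt.Println(b.name, b.data)
    }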

Note: Not pictured in the diagrams are the meta nodes, which you can read about here.

We still need a solution for failed or delayed writes. Let's say one of the nodes in our system overheats and goes offline. Without a backup, any unsuccessful writes are dropped completely, never to be seen again.

Enter the HH queue.

The HH queue is a durable, disk-based queue. It's a fundamental part of how InfluxDB Enterprise attempts to ensure eventual consistency: it helps guarantee that all of the data nodes will eventually hold a consistent set of data, and that the replication factor for each database is ultimately met.
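As a rough illustration of "durable, disk-based," here is a minimal Go sketch that appends failed replica writes to a file so they survive a process restart. The hintedHandoff type, its one-line-per-write file format, and the enqueue method are invented for this example and do not mirror InfluxDB Enterprise's actual implementation.

    package main

    import "os"

    // hintedHandoff is a toy durable queue: one pending write per line in
    // an append-only file, so queued writes survive a crash or restart.
    type hintedHandoff struct {
        path string
    }

    // enqueue records a write that could not be replicated to an offline node.
    func (q *hintedHandoff) enqueue(point string) error {
        f, err := os.OpenFile(q.path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
        if err != nil {
            return err
        }
        defer f.Close()
        _, err = f.WriteString(point + "\n")
        return err
    }

    func main() {
        q := &hintedHandoff{path: "hh-queue.log"}
        // A replica write to the offline node failed; park it durably.
        if err := q.enqueue("cpu=0.42 1528070400"); err != nil {
            panic(err)
        }
    }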

Now, let's revisit the scenario where one of the data nodes in our cluster goes offline. There are tons of reasons a node could go offline: hardware defects, disk space limitations, or even regular maintenance. Without the hinted handoff queue, the unsuccessful writes died before they could ever be stored, but now we have a safe place for them to land.

Any unsuccessful writes get directed to the HH queue, and when the node comes back online, it checks the HH queue for pending writes. The node can then complete those writes until the queue is drained. Bam: eventual consistency achieved.
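Continuing the hintedHandoff sketch above (it also needs the "errors" and "strings" imports), draining could look like this. The sendFn parameter stands in for the real node-to-node write path, and the replay-from-file loop is an assumption for illustration, not InfluxDB Enterprise's actual drain logic.

    // drain replays queued writes to the recovered node until the file is
    // empty, then truncates it.
    func (q *hintedHandoff) drain(sendFn func(string) error) error {
        data, err := os.ReadFile(q.path)
        if errors.Is(err, os.ErrNotExist) {
            return nil // nothing pending
        }
        if err != nil {
            return err
        }
        for _, line := range strings.Split(string(data), "\n") {
            if line == "" {
                continue
            }
            // If the node drops offline again, stop and keep the queue;
            // replaying a point twice is harmless, since writing an
            // identical point simply overwrites the existing one.
            if err := sendFn(line); err != nil {
                return err
            }
        }
        return os.Truncate(q.path, 0)
    }

Note that this gives at-least-once delivery: if a drain is interrupted, some points are replayed on the next attempt, which works because re-writing a point is idempotent.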

Summary

This is a look at what happens inside an eventually consistent cluster, but there are a few considerations from the outside: When data is successfully written to one node but fails to replicate correctly, does the user see success or failure? What do healthy patterns in the HH queue look like? What does it mean for overall system health if the HH queue is constantly filling and draining? In the next article, we'll discuss how to troubleshoot and identify problematic patterns in our InfluxDB Enterprise clusters.

Published at DZone with permission of Katy Farmer, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
