DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Modernize your data layer. Learn how to design cloud-native database architectures to meet the evolving demands of AI and GenAI workkloads.

Secure your stack and shape the future! Help dev teams across the globe navigate their software supply chain security challenges.

Releasing software shouldn't be stressful or risky. Learn how to leverage progressive delivery techniques to ensure safer deployments.

Avoid machine learning mistakes and boost model performance! Discover key ML patterns, anti-patterns, data strategies, and more.

Related

  • Caching Strategies for Resilient Distributed Systems
  • Vector Search: The Hottest Topic in Information Retrieval
  • Cache Wisely: How You Can Prevent Distributed System Failures
  • Scaling Java Microservices to Extreme Performance Using NCache

Trending

  • Streamlining Event Data in Event-Driven Ansible
  • Unlocking AI Coding Assistants Part 1: Real-World Use Cases
  • How to Practice TDD With Kotlin
  • Emerging Data Architectures: The Future of Data Management
  1. DZone
  2. Data Engineering
  3. Data
  4. Distributed Systems: Consistent Hashing

Distributed Systems: Consistent Hashing

We will learn consistent hashing, its usage in distributed systems, and how it plays a role in designing distributed systems such as databases and caches.

By 
Vetriselvan M user avatar
Vetriselvan M
·
Dec. 07, 23 · Tutorial
Likes (2)
Comment
Save
Tweet
Share
1.6K Views

Join the DZone community and get the full member experience.

Join For Free

Welcome to the distributed systems series. In this article, we are going to learn about consistent hashing and its usage in distributed systems. Why consistent hashing is important and how it plays a role in designing distributed systems such as databases, cache, etc. Let’s first understand what is hashing and how it is used to distribute data across machines. Then, we will understand what is consistent hashing.

Hashing

Hashing is a technique that generates a unique ID for an object. A simple example would be the hashcode function in Java, which returns a unique ID for an immutable object. This returned ID is used to choose the bucket from an array of buckets for storage and retrieval. In order for this hashing function to return the correct value, the object or key that we use to hash should be immutable. This is how hashing works in Java to store and retrieve the value in the HashMap data structure. If you know how hashmap works, the concept is pretty similar in distributed systems. In distributed systems, we have an array of machines to store the data, and we have to decide which machines should hold the specific data. The following diagram explains how hashing is used to store {key, value} data on different machines.

hashing

The above example is pretty good for storing and distributing a large number of objects. Since the data is partitioned horizontally, a simple hash function based on key hash(key)%N would decide which machine to store the given {key, value}. N represents the number of machines in the cluster. What is the problem with the above approach? If we have to add or remove machines from the cluster, Everything {key, value} object that is stored on the cluster should be redistributed. This is not efficient due to the movement of all the keys in the cluster. 

Why is this inefficient? Let’s just imagine we have a cache cluster of 100 machines. We have to redistribute all the keys when a new machine is added or removed from the cluster. We have to recalculate hash(key)%N for all keys and move them to corresponding machines. Surely, it’s not efficient. A better way to solve this problem is to use a different hashing mechanism called consistent hashing.

Consistent Hashing

Consistent hashing is a technique used in distributed systems to store and manage data efficiently. Let’s first understand how it works. It works by creating a hashring cluster with multiple points that range from start and end. The ring cluster is sorted in order. When a new machine is added to the ring cluster, It will take care of managing certain points in the circle. The below diagram explains how consistent hashing works. We created a simple hashring that has points ranging from 0 to 100. The points are ordered from 0 to 100, and each machine would take care of handling a subset of the entire range. As specified below, the First machine will store points up to 20, which means if the hash(key) returns a value that is ≤ 20, then that {key, value} is stored in the machine {20}. 

How do we find the machine from a hash(key)? We can simply iterate or do a binary search in the range {0...100} to find the machine since points are already sorted. {key, value} is stored on the machine, which has the next higher range for the hash ID. As specified in the below example, hash(k1) returns 18, and hash(k2) returns 38. k1 is stored on machine {20} and k2 is stored on machine {40}.

consistent hashing

diagram

So far, we have seen how to add {key, value} to the respective machine using consistent hashing. The process is the same for deleting objects from the cluster. We have to get the hash(key) to find the node that has higher points for the computed hash ID. Once it is found, we could simply delete the object from the cluster using a key. 

In our example, we have built a distributed ring cluster with points that range from 0 to 100. But what if the hash ID for the key is higher than 100? In that case, we will add {key, value} to the first node {20} of the cluster. The previously mentioned scenario is applicable for deleting {key, value} from the cluster. If the hash(key) is higher than 100, then it would search the {key, value} in the first node to delete.

Adding or Removing Nodes

In the previous section of this post, we saw how to add or remove data from the cluster using consistent hashing. In this section, we will see how adding or removing nodes from the cluster affects the movement of the data in the cluster.

While adding a new node to the cluster, we will calculate a hash(serverId) to find where this node can be placed on our ring cluster. In the below example, we have removed two nodes {40} {50} from the cluster. When we remove these nodes, we could simply copy all the data that is stored on those nodes to node {60}. Now node{60} will take care of all the requests for hash(k2) and hash(k3). If we really look at this example, Instead of moving all the keys across the cluster, we have only moved the subset of the entire key ranges. This is the advantage of consistent hashing. Now, we have reduced the movement of key ranges to k/n, where k is the total number of keys and n is the number of machines in the cluster.

hash(key)

Let’s add a new node to our cluster. Hash (serverId) lies somewhere between {30} and {60}. Let’s consider this new node can take points up to {40}. When a new node is added, It has to copy all the data that belongs to {40} from {60}. Once it is done, {40} will take care of storing and retrieving hash(k2), which is 38, and node{60} will serve the request for hash(k3), which is 48.

add or remove from cluster

So far, we have seen how consistent hashing is used to add or remove {key, value} from the cluster. How does it effectively minimize the data movements across the cluster while a new node is added or removed from the cluster?

One more thing to understand here is that even though consistent hashing helps to minimize data movement, It will not guarantee that data is uniformly distributed across all the nodes of the cluster. This completely depends on the hashing algorithm that we use.

Conclusion

In this article, we learned what consistent hashing is, why it is used in distributed systems such as databases and caches, and how it minimizes the data movement across the cluster. We have also understood how adding or removing a node in the cluster affects the movement of a subset of key ranges in the cluster, and finally, we understood why consistent hashing is used in distributed systems.

Data structure Cache (computing) systems

Published at DZone with permission of Vetriselvan M. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Caching Strategies for Resilient Distributed Systems
  • Vector Search: The Hottest Topic in Information Retrieval
  • Cache Wisely: How You Can Prevent Distributed System Failures
  • Scaling Java Microservices to Extreme Performance Using NCache

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!