Distributed Locking in Cloud-Native Applications: Ensuring Consistency Across Multiple Instances

In modern cloud-native systems, services often run across multiple pods or nodes for scalability and high availability, introducing challenges in data consistency.

Navin Kaushik

Oct. 15, 25 · Analysis


Overview

Most of us have used some form of locking during development, or have run into incorrect results that occur only in certain states and are difficult to reproduce. Things are not that complex when locking has to be managed within a single process, or even across multiple processes on the same machine. These days, however, many of us build cloud-native applications/services that run as multiple instances, for high availability, load balancing, or both.

With multiple instances of a service, things become trickier when certain operations must be performed in a synchronized manner, because the contention is no longer between threads/processes but between pods/nodes in a cloud-native environment.

This blog covers that aspect in detail so that you are aware of the challenges and can choose among the available options with due diligence. The primary focus is on concepts: identifying the purpose of locking, the challenges, the available options, and the decision-making factors.

Assumption

It is assumed that the reader is aware of race conditions and why they must be handled in certain scenarios. It is also assumed that multiple instances of the application run in a distributed environment and issue concurrent requests, so locking is required for the task to be performed as expected.

Use Cases/Problem Statement

There are two types of motivations for which you may require locking:

1. Efficiency

In this case, you want to prevent multiple copies of an operation/job from running at the same time. Nothing breaks if it does happen, since the end result/state is still correct, but it wastes resources. For example, consider a job that copies files from one folder to another. Since multiple instances are running, all of them may try to perform this operation at the same time. This may not lead to a corrupt state, but there is no need to run the operation concurrently, as the final result would be the same.

2. Correctness

In this case, allowing concurrent operations on the same state may lead to incorrect results, so it must be prevented. For example, if you open two online banking sessions and try to transfer money from the same account at the same time, locking must be in place to prevent an incorrect result.

Locking can happen at different levels. Sometimes the underlying framework, such as a database, takes care of it, and sometimes you need to lock explicitly because your business workflow/use case is not covered by the underlying technology/framework. For example:

Taken care of by the database itself: UPDATE employee SET daalchini_balance = daalchini_balance - order_amount WHERE employeeid = xyz. Even if this query runs concurrently, the result is correct because the database itself takes care of locking.
Explicit locking: Imagine a scenario where a parking sticker is to be given to an employee only if the employee does not already have one and stickers are still available. There are two tables: one holding the total number of stickers, and another holding each employee's details, car details, and sticker details. Even if multiple concurrent calls come from the same employee, a sticker must be issued only if one has not already been given. This is a read-then-update scenario: you may need to take an explicit lock so that checking whether the employee already has a sticker, and issuing one if not and if stickers are available, happens without a race condition.
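The read-then-update flow of the parking sticker example can be sketched as follows. This is a minimal, in-process illustration only: the two tables are replaced by hypothetical in-memory variables, and a threading.Lock stands in for the distributed lock that a multi-pod deployment would need.

```python
import threading

# Hypothetical in-memory stand-ins for the two tables in the example.
stickers_available = 2
sticker_by_employee = {}  # employee_id -> sticker number

# Stand-in for a distributed lock. Across pods this would have to be a lock
# held in an external service, not a threading.Lock.
issue_lock = threading.Lock()


def issue_sticker(employee_id):
    """Issue a sticker only if the employee has none and stock remains."""
    global stickers_available
    with issue_lock:  # the check and the update form one critical section
        if employee_id in sticker_by_employee:
            return sticker_by_employee[employee_id]  # already issued
        if stickers_available == 0:
            return None  # nothing left to issue
        stickers_available -= 1
        sticker_by_employee[employee_id] = f"STK-{len(sticker_by_employee) + 1}"
        return sticker_by_employee[employee_id]


# Ten concurrent requests from the same employee.
threads = [threading.Thread(target=issue_sticker, args=("emp-1",)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the lock, two requests could both pass the "already issued" check before either one records the sticker, and the stock would be decremented twice.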

Notes

Please note that a lock may be required even in scenarios that do not involve a database; the database-based example is used here only for simplicity.
Please also note that a lock has a cost in terms of latency/throughput, so it needs to be efficient, i.e.:
1. The lock should be held for as short a time as possible.
2. The lock should be as granular as possible, i.e., don't block things that can be done concurrently. For example, a row-level lock is preferable to a table-level lock wherever possible. Sometimes a combination works best, as in the parking sticker example: lock at the business-logic level on employeeid, and rely on the database's implicit locks when updating records to reach the desired state.
Please note that atomicity is not the focus here, so it is not discussed, but it must still be maintained wherever required.
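To make the granularity point concrete, here is a small sketch of keeping one lock per business key (for example, per employeeid) instead of one lock for everything. The KeyedLocks class is a hypothetical helper written for this illustration, and it works only within a single process; a distributed setup would apply the same idea by including the key in the lock name.

```python
import threading


class KeyedLocks:
    """Hands out one lock per key so unrelated keys never block each other."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the dictionary itself
        self._locks = {}

    def lock_for(self, key):
        with self._guard:
            if key not in self._locks:
                self._locks[key] = threading.Lock()
            return self._locks[key]


locks = KeyedLocks()
lock_emp1 = locks.lock_for("emp-1")
lock_emp2 = locks.lock_for("emp-2")

with lock_emp1:
    # Work for emp-1 is in progress; emp-2 can still be locked independently.
    emp2_free = lock_emp2.acquire(blocking=False)
    if emp2_free:
        lock_emp2.release()
```

A single global lock would have made the request for emp-2 wait even though it touches different state.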

Challenges

Providing locking in a distributed environment, where multiple service instances run on the same or different nodes, or even in different zones/regions, is more challenging because the locking information itself needs to be highly available and fault-tolerant.

Locking may be provided in several ways: you may write your own locking mechanism, either within the same service (which is complex) or as a separate locking service, or you may use an existing third-party service that is already part of your ecosystem, like MySQL, GCP Cloud Storage, Redis, etc. Whatever mechanism is used, it must cope with the challenges of a distributed environment.

A distributed environment has some basic characteristics (from the current context point of view), like:

There may be network delays or loss of connectivity.
There may be data loss during failover, as replication is mostly asynchronous for performance reasons.
Any instance of a service can go down at any point in time.

Some Scenarios

Imagine that instance 1 of a service takes the lock from an external service but, due to some issue, cannot proceed with its work for some time. Meanwhile, the lock times out and instance 2 gets the lock. Now both instances are able to change the same state, which may lead to an incorrect result. This means you have to be very careful with the time-out: if the original owner cannot release the lock before the time-out expires, it must either finish/commit its work before the time-out or roll back/undo it, but it must not commit after the time-out. There is also a trade-off: the longer the time-out, the longer other instances have to wait for the lock if the owning instance crashes.
Imagine that instance 1 gets a lock and then a failover happens on the locking server. Because replication is asynchronous, the lock information may not have been propagated to the other nodes and may be lost. As a result, instance 2 can also acquire the lock and alter the same state, leading to an incorrect result.
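The "don't commit after the time-out" rule from the first scenario can be sketched roughly as below. LeaseLock is a hypothetical in-memory stand-in for an external lock service, and do_work_with_lease, work, commit, and rollback are illustrative names, not part of any real library. The clock is injectable so the stall can be simulated.

```python
import time


class LeaseLock:
    """In-memory stand-in for an external lock service with time-outs."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._owner = None
        self._expires_at = 0.0

    def acquire(self, owner, ttl_seconds):
        now = self._clock()
        # The lock is free if nobody holds it or the previous lease expired.
        if self._owner is None or now >= self._expires_at:
            self._owner = owner
            self._expires_at = now + ttl_seconds
            return True
        return False

    def still_held_by(self, owner):
        return self._owner == owner and self._clock() < self._expires_at


def do_work_with_lease(lock, owner, ttl_seconds, work, commit, rollback):
    """Run work under a lease and commit only if the lease is still valid."""
    if not lock.acquire(owner, ttl_seconds):
        return "skipped"
    result = work()
    if lock.still_held_by(owner):
        commit(result)
        return "committed"
    rollback(result)  # the lease expired while working, so undo instead
    return "rolled back"
```

Note that this check narrows the window rather than closing it: an instance can still stall between the check and the commit, so the time-out should leave comfortable headroom over the expected duration of the work.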

Distributed Lock Options

RDBMS: Check whether your database provider offers a lock that is independent of your schema/tables, in other words, one you can use even if you store no data in the database and just want to use the RDBMS as a distributed lock manager (DLM). For example, in MySQL you can use a named/user-level lock to synchronize multiple threads/instances. We have used it because MySQL was already part of our ecosystem, it is very simple to use, and it met our requirements. It is a good fit especially when locking is required for correctness.
1. Reference: https://dev.mysql.com/doc/refman/5.7/en/locking-functions.html
Redis: You may use Redis for distributed locking as well. It is simple to use for efficiency purposes, as it is quite fast and it is acceptable if, in rare cases, multiple instances get hold of the lock at the same time. If you want to use it for correctness purposes, you may need to use the Redlock algorithm/implementation. Whether the algorithm is supported depends on whether you are using self-hosted Redis or a managed Redis offering, like the one on GCP.
1. References:
Google Cloud Storage: This is an interesting way to leverage Google Cloud Storage for locking. If you use GCP as your cloud provider, already have Google Cloud Storage in your ecosystem, and have no other option like MySQL/Redis available, you may go for it.
1. Reference: https://www.fullstaq.com/knowledge-hub/blogs/a-robust-distributed-locking-algorithm-based-on-google-cloud-storage
There are many other options, like ZooKeeper, Hazelcast, etcd, HashiCorp Consul, etc.
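As an illustration of the MySQL named-lock option, the sketch below wraps the GET_LOCK and RELEASE_LOCK functions in a context manager. It assumes conn is a DB-API connection to MySQL whose driver uses the %s parameter style and returns rows as tuples (for example PyMySQL or mysql-connector-python); the helper name mysql_named_lock and the lock name in the usage note are made up for this example.

```python
from contextlib import contextmanager


@contextmanager
def mysql_named_lock(conn, name, timeout_seconds=5):
    """Hold a MySQL named lock for the duration of the with-block.

    Assumption: `conn` is an open DB-API connection to MySQL. Named locks
    belong to the session, so the same connection must stay open while the
    lock is held.
    """
    cur = conn.cursor()
    try:
        cur.execute("SELECT GET_LOCK(%s, %s)", (name, timeout_seconds))
        (got,) = cur.fetchone()
        if got != 1:  # 0 means the wait timed out, NULL means an error
            raise TimeoutError(f"could not acquire lock {name!r}")
        try:
            yield
        finally:
            cur.execute("SELECT RELEASE_LOCK(%s)", (name,))
            cur.fetchone()
    finally:
        cur.close()
```

Usage would look like `with mysql_named_lock(conn, "sticker:emp-1"): ...`, with the critical section inside the block. If the connection drops, MySQL releases the lock when the session ends, so a crashed instance should not hold the lock indefinitely.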

Warning

Please note that locking has a cost associated with it; choose it wisely based upon your needs, and identify whether locking is required for efficiency or correctness.
Please keep the lock as granular as possible and hold it for as short a time as possible.
It is highly recommended to do performance testing with concurrent requests that work on the same state.

Recommendations

If locking is needed for efficiency, you may go for Redis, and if it is needed for correctness, you may go with MySQL, provided these are already used in your ecosystem.
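For the efficiency case, a single-instance Redis lock could look roughly like the sketch below, applied to the file-copy job from the Efficiency use case. It assumes client behaves like a redis-py client (set with nx/px arguments, and eval for a Lua script); the function names, the key name, and the 30-second time-out are illustrative choices. This is a best-effort lock and is not meant for correctness-critical work.

```python
import uuid

# Lua script: delete the key only if it still holds our token, so one
# instance cannot release a lock that has since passed to another.
RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
"""


def try_acquire(client, key, ttl_ms):
    """Return a token if the lock was taken, or None if someone else has it."""
    token = uuid.uuid4().hex
    # SET key token NX PX ttl: succeeds only if the key does not exist yet.
    if client.set(key, token, nx=True, px=ttl_ms):
        return token
    return None


def release(client, key, token):
    return client.eval(RELEASE_SCRIPT, 1, key, token) == 1


def run_job_once(client, job, key="lock:copy-files", ttl_ms=30_000):
    """Run job only if no other instance currently holds the lock."""
    token = try_acquire(client, key, ttl_ms)
    if token is None:
        return False  # another instance is already doing the work
    try:
        job()
        return True
    finally:
        release(client, key, token)
```

The time-out makes sure the lock disappears if the holder crashes, and the random token prevents a slow instance from deleting a lock that another instance acquired after the time-out.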

Summary

I hope you found this blog useful, and if you are already using distributed locking in your project, do share in the comment section on the mechanism used and your experience so far.


Opinions expressed by DZone contributors are their own.
