Distributed Vs Replicated Cache
Caching facilitates faster access to data that is repeatedly being asked for. In this article, we discuss the pros and cons of distributed and replicated caches.
Caching facilitates faster access to data that is repeatedly being asked for. The data might have to be fetched from a database, accessed over a network call, or calculated by an expensive computation. We can avoid repeated calls for this data by storing it closer to the application (generally in memory or on local disk). Of course, all of this comes at a cost. We need to consider the following factors when implementing a cache:
- Additional memory is needed for applications to cache the data.
- What if the cached data is updated? How do we invalidate the cache? (Needless to say, caching works well only when the cached data does not change often.)
- We need to have eviction policies (LRU, LFU, etc.) in place to delete entries when the cache grows too large.
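As a sketch of the eviction point above, Java's `LinkedHashMap` in access-order mode can be turned into a small LRU cache. The class name `LruCache` and the capacity are illustrative choices for this example, not part of any library discussed here:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache sketch built on LinkedHashMap's access-order mode.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruCache(int maxEntries) {
        super(16, 0.75f, true); // true = order entries by access, not insertion
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least recently used entry once the cache exceeds the limit.
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(2);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.get("a");      // touch "a" so it becomes most recently used
        cache.put("c", "3"); // evicts "b", the least recently used entry
        System.out.println(cache.keySet()); // "b" is gone; "a" and "c" remain
    }
}
```

Production caches (Ehcache, Caffeine, etc.) implement the same idea with far more sophistication, but the eviction trigger is conceptually this `removeEldestEntry` check.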
Caching becomes more complicated when we think of distributed systems. Let us assume we have our application deployed in a 3-node cluster:
- What happens to the cached data when it is updated (by a REST API call or through a notification)? The data gets updated only on the node that received the API call or processed the notification; the data cached on the other nodes becomes stale.
- What if the data to be cached is so large that it does not fit in the application heap?
Using a distributed or replicated cache helps us address the above problems. Here we will see when to use a distributed vs. a replicated cache, and the pros and cons of each.
Replicated Cache
In a replicated cache, every node in the cluster holds all cached entries: if an entry exists on one node, it exists on all the other nodes too, so the cache size is uniform across the nodes of the cluster. As the data is stored on multiple nodes, it contributes to higher availability of the application.
When an entry has to be updated in the cache, this kind of implementation also provides mechanisms to replicate the update to the other nodes, either synchronously or asynchronously. We have to be mindful that with asynchronous replication, the cached data is inconsistent across nodes for a short duration until the replication completes. Typically this duration is negligible when the data to be transferred over the network is small.
Ehcache provides different mechanisms for replicating cache across multiple nodes. Here we will see how JGroups can be used as the underlying mechanism for the replication operations in Ehcache.
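A sketch of what such wiring can look like: the Ehcache JGroups replication module is configured in `ehcache.xml` via a peer-provider factory and a per-cache replicator listener. The cache name, sizes, and the JGroups protocol stack below are illustrative assumptions; consult the `ehcache-jgroupsreplication` module documentation for the exact attributes your version supports.

```xml
<!-- Sketch of an ehcache.xml wiring JGroups as the replication transport.
     The multicast address/port and protocol stack are example values. -->
<ehcache>
  <cacheManagerPeerProviderFactory
      class="net.sf.ehcache.distribution.jgroups.JGroupsCacheManagerPeerProviderFactory"
      properties="connect=UDP(mcast_addr=231.12.21.132;mcast_port=45566):PING:
                  MERGE2:FD_SOCK:VERIFY_SUSPECT:pbcast.NAKACK:UNICAST:
                  pbcast.STABLE:FRAG:pbcast.GMS"/>

  <cache name="replicatedCache"
         maxEntriesLocalHeap="1000"
         eternal="false"
         timeToLiveSeconds="600">
    <!-- Replicate puts/updates/removals to peers; asynchronously here. -->
    <cacheEventListenerFactory
        class="net.sf.ehcache.distribution.jgroups.JGroupsCacheReplicatorFactory"
        properties="replicateAsynchronously=true, replicatePuts=true,
                    replicateUpdates=true, replicateUpdatesViaCopy=false,
                    replicateRemovals=true"/>
  </cache>
</ehcache>
```

Setting `replicateAsynchronously=false` would give the synchronous behavior described above, at the cost of higher put latency.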
Distributed Cache
In a distributed cache, not all nodes in the cluster hold all entries; each node holds only a subset of the overall cached entries. As in a replicated cache, multiple copies (replicas) of the data are maintained to provide redundancy and fault tolerance, but unlike in a replicated cache, the replication count is generally lower than the number of nodes in the cluster.
The way multiple copies of data are stored across nodes in a distributed cache differs significantly from a replicated cache. Here, data is stored in partitions that are spread across the cluster, where a partition can be defined as a range of hash values. Whenever we want to cache a value against a key, we serialize the key to a byte array, calculate its hash, and store the value in the partition whose hash range the calculated hash falls into. The data is thus stored in the corresponding primary partition and in its replica (secondary) partitions as well. This scheme of mapping keys to partitions is commonly referred to as consistent hashing.
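The key-to-partition step above can be sketched in plain Java. The partition count, the CRC32 hash, and the "replicas on the next partitions" rule are simplifying assumptions for illustration; real products differ in detail (Hazelcast, for instance, defaults to 271 partitions and uses its own hashing):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;

// Illustrative sketch of partition selection in a distributed cache.
public class PartitionRouter {
    static final int PARTITION_COUNT = 271; // assumption for this example

    // Serialize the key to bytes, hash it, map the hash onto a partition id.
    static int partitionFor(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % PARTITION_COUNT);
    }

    // Primary partition plus `replicas` backup partitions (wrapping around).
    static int[] partitionsFor(String key, int replicas) {
        int primary = partitionFor(key);
        int[] ids = new int[replicas + 1];
        for (int i = 0; i <= replicas; i++) {
            ids[i] = (primary + i) % PARTITION_COUNT;
        }
        return ids;
    }

    public static void main(String[] args) {
        // Same key always routes to the same primary + backup partitions.
        System.out.println(Arrays.toString(partitionsFor("user:42", 1)));
    }
}
```

Because every node runs the same deterministic routing, any node can compute which member owns a given key without a central lookup.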
A distributed cache provides a far greater degree of scalability than a replicated cache. We can store any amount of data by adding more nodes to the cluster as needed, without modifying the existing nodes; this is referred to as horizontal scaling. To grow a replicated cache, we need vertical scaling instead, i.e. adding more resources to the existing nodes themselves, as every node stores all the keys.
There are many open-source distributed cache implementations, such as Redis and Hazelcast. Here we will see how Hazelcast IMDG (In-Memory Data Grid) can be leveraged for a distributed cache implementation.
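A minimal usage sketch, assuming the Hazelcast jar is on the classpath (the map name `cities` and the entry are illustrative; the `IMap` import package is `com.hazelcast.core` in Hazelcast 3.x and `com.hazelcast.map` in 4.x+):

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

// Two embedded members started in one JVM discover each other and form a
// cluster; the "cities" map's entries are partitioned across both members.
public class DistributedCacheDemo {
    public static void main(String[] args) {
        HazelcastInstance node1 = Hazelcast.newHazelcastInstance();
        HazelcastInstance node2 = Hazelcast.newHazelcastInstance();

        IMap<String, String> cache = node1.getMap("cities");
        cache.put("IN", "Hyderabad");

        // The entry is visible cluster-wide, whichever member owns its partition.
        System.out.println(node2.<String, String>getMap("cities").get("IN"));

        Hazelcast.shutdownAll();
    }
}
```

By default Hazelcast keeps one backup copy of each partition on another member, which is exactly the replica-count-smaller-than-cluster-size behavior described above.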
Finally, here is a quick comparison of both the cache types:
| | Distributed Cache | Replicated Cache |
|---|---|---|
| Availability | Improved, as data is stored in partitions across multiple nodes and every partition has a replica partition as well. | Improved, as the entire cached data set is stored on every node. |
| Scalability | Highly scalable, as more nodes can be added to the cluster. | Scaling is a problem, as we need to add more resources to the existing nodes themselves. |
| Consistency | Reads are in general served by primary partitions, so even while data is being copied to replica partitions, we read the correct data from the primary. | We might not see a consistent view of the data while it is being replicated to other nodes. Note that any node can serve the data, as there is no primary-secondary or master-slave concept here. |
| Predictability | Preferable when we don't know beforehand exactly what data will be cached, as it scales well. | Better for a small and predictable number of frequently accessed objects. |
You can find the source code for this post on GitHub.
Published at DZone with permission of Prasanth Gullapalli. See the original article here.