Caching, Parallelism and Scalability
When your boss asks you to rewrite your application to be more performant and handle greater throughput, what do you do? Once upon a time, when Moore’s Law held steady, all you had to do was to go drink a soy latte, play some Doom, twiddle your thumbs for a few months on the pretext of rewriting your application, and then redeploy it. Unchanged. Just onto more current hardware. Moore’s Law - from Intel co-founder Gordon Moore - ensured that the rate at which computers got faster was enough to provide performance gains that can automatically be realized by your application to cope with growth in demand. The key word here is “automatically”.
"Moore’s Law is dead. Amdahl’s Law rules."
Since about 2005 though, Moore’s Law has stopped being relevant. Today, Amdahl’s Law - coined by Gene Amdahl - is far more applicable. This law has to do with the fact that by throwing more processors to work on a problem, you don’t get the expected overall throughput calculated by adding the individual throughputs of each processor. Factors, such as the degree to which the original problem can be broken down into subproblems, the efficiency with which the subproblems can be distributed to the multiple processors and the effort in collating results, will have an effect on your overall throughput.
Amdahl’s law is relevant because, since about 2005, hardware manufacturers have hit limits regarding how much performance they can squeeze out of CPUs. And their current approach to building faster computers and greater throughput has been to put more cores on a single chip, more multi-core chips in a single server, even more servers in a single cluster or grid. Because such throughput increases are achieved by distributing workloads, these are no longer “automatically” realized by your software the way they were during the era of increasingly faster cores.
So no more playing Doom, there is work to be done. You need to be conscious of the fact that workloads will be distributed and your software would need to be written specifically to take advantage of this parallelism. If your software wasn’t written with parallelism in mind in the first place, portions may have to be rewritten.
While making the most of parallelism does involve extra effort, the solution is still workable. It may involve some rewriting, using common approaches such as more piecemeal workloads, producer/consumer design patterns, use of efficient concurrency libraries and inter-thread synchronization mechanisms.
But what about libraries and subsystems outside of your control? Specifically databases that touch disks that inherently involve sequential, serial processing? And there is no way you can change your database vendor’s code. Systems that involve all but the simplest, most infrequent database use would be facing a massive bottleneck thanks to this serialization. Databases are just one common example of process serialization, but there could be others as well. Serialization is the real enemy here, as it undoes any of the throughput gains parallelism has to offer. Serialization would only allow a single core in a multi-core system to work at a time, and limit usable computing power. And this is made even worse in a cluster or grid. Using a bigger and beefier database server reduces this somewhat, but still does not overcome the problem that as you add more servers to your grid, you still have process serialization going on in your database server.
And this is where caching comes in. Let’s consider a few definitions first:
Cache: When I refer to a cache, I think of a cheaply accessible data structure that allows thread-safe access to in-memory data. A cache may or may not exhibit “enterprise” features such as transactional or querying abilities. A cache would, on the other hand, almost certainly exhibit features such as eviction and expiration of cached data, and passivation/overflow of in-memory state to a more persistent store.
Clustered cache: Here, I refer to a cache system where each cache instance is aware of other cache instances in a cluster and is capable of synchronizing operations with its peers. Cache contents are typically mirrored.
Distributed cache: Distributed caches takes clustered caches a step further. They distribute any cached state across a cluster to maximize retrieval efficiency, reduce overall memory used, and guarantee data redundancy. This means data could be fragmented and fragments stored on different cache instances. There is no requirement that a piece of data is resident on every cache instance in the cluster.
Now how do caches help you harness parallelism? Essentially, unlike a database which is inherently serial in nature, a cache lives in memory and is highly parallel due to the nature of hardware memory buses. A well-written cache would minimize any locking or the use of inter-thread signaling, and would use modern techniques such as compare-and-swap to ensure data integrity in a scalable manner.
So, caching can help systems scale and deal with far greater loads, achieving higher throughput. A local cache in front of your database - or any potentially serial data source - works great in a single-server, multi-core environment, allowing you to maximize usage and efficiency of all cores.
But why are clustered caches important? One of the biggest problems with caches is that cached data needs to be kept relevant and accurate. We need to detect when changes are made to the data store, and update any caches accordingly. Now in a cluster or grid, this is impossible unless the caches are aware of the goings-on in the other caches in the cluster, as non-cluster-aware caches would end up serving stale and incorrect data to the application tier. Clustered caches allow multiple cores and servers in a grid to work on the same data set in parallel, making the most of computing power available, and this is where large throughput gains are to be realized and highly scalable systems built.
Clustered caches employ several mechanisms to maintain synchronization between the caches and ensure data validity is maintained.
Pessimistic cluster-wide locks
One way clustered caches achieve this is to use cluster-wide locks, either using a central lock server or communicating with every instance in the cluster to request for locks. Each cache instance would acquire a cluster-wide lock, make changes which are then replicated to each cache instance, and then release the cluster-wide lock. This is a pessimistic approach and, as with most pessimistic approaches to shared access to data, suffers scalability issues due to the heavy use of locks.
A more optimistic approach is where data sets are copied, worked on, and when complete, locks are acquired and state is written to the caches cluster-wide. While far more scalable, this suffers from the need to retry a transaction if there is a collision - i.e. two transactions simultaneously writing to a data set.
Replicate or invalidate?
Orthogonal to how operations work in relation to each other, caches may also employ one of two approaches to deal with an update of cached state. They may either replicate modified state across a cluster to each node, or broadcast an invalidation message to inform other caches that a cached entry has changed and to remove the entry from their local caches. The latter approach only works if the caches front a data store, and the application tier can, upon not finding an entry in the cache, retrieve the entry from the data store again.
So the above so far only speaks of clustered caches in the sense that each cache is local, but aware of its peers and is able to synchronize changes with them. While this approach works well for most applications, it still limits the addressable memory the caches may use. Another approach is not to mirror data between caches, but to aggregate the usable memory of all the caches in a cluster, and distribute data across them. Locating where your data is could involve either replicated metadata - a map of sorts - or a consistent hashing function that would deterministically point to the instance containing the entry you are looking for. Typically, distributed caches maintain some degree of data redundancy such that the system is resilient to cache servers failing or restarting. While distributing data in this manner can involve unnecessary network lookups, it exposes a much larger usable memory space for use by the cache system. It also makes the overall system far more scalable.
Clustering for high availability
As a side effect of clustered or distributed caches, applications are now able to cluster their state for high availability. Any application state that is stored in a cache is available cluster-wide. Hardware or software failure would not be a problem to clients if the application tier is fronted by a load balancer, which could redirect the request to the next available application server, which in turn would look up conversational state in the cache system. When using a clustered cache, the state would be available on every instance. When using a distributed one, enough redundancy would exist in the cache system to ensure no data is lost. Applications and frameworks wishing to add high-availability characteristics typically employ clustered or distributed caches to achieve this.
To conclude, caches are an important tool in harnessing parallelism and removing data retrieval or calculation bottlenecks. Considering the architecture of your application and the volume and access patterns of your data, a local, clustered or distributed cache would be appropriate. Most mature caching frameworks or products would offer a choice of operational modes to suit your use case. As with most things though, caches aren’t silver bullets and should not be used indiscriminately. As quoted by Java performance tuning expert Kirk Pepperdine, “Measure, don’t guess.” . Make good use of a profiler, and introduce caches only where necessary, where data retrieval or calculation bottlenecks are hampering scalability.