Application Clustering For Scalability and High Availability
Application clustering is a sub-topic of “Parallel Computing”. Today many types of software support parallel computing in some form. Why do we need parallel, clustered applications? For two reasons: scalability and high availability. Does every application need clustering? No. However, if your system is underperforming or not available enough to meet customer requirements, that is a sign that you need clustering. There are many kinds of clustering at different software tiers; please note that application clustering is distinct from the others. A stand-alone application may not run smoothly in some types of clustered environment and needs some tweaks, which I’ll share in this article.
What benefits can we gain from clustering? There are two main ones:
1. Load Balancing (Scalability): By distributing processing, we make vertical or horizontal scaling possible and use resources more effectively. One example is the garbage collection (GC) advantage in Java: if we distribute user sessions across several JVMs, each JVM generally has a smaller heap to manage, so its GC work is lighter and performance improves. We can also work around the heap size limit of a single JVM.
2. Fail-over (High Availability): Fail-over means fewer and shorter service interruptions. Some SLAs require a specific availability percentage, such as two-nines (99%) availability. A server may be down for many reasons: system failures, planned outages, hardware or network problems, and so on. Clustering is one important technique for achieving HA. Fail-over may occur automatically or manually. We may prepare a redundant backup system (Cold Failover), or our load-balanced servers may also take over the fail-over function (Hot Failover). Note that there is a whole engineering discipline for this topic: Reliability Engineering. Moreover, fail-over is not solved by clustering alone; it also depends on your system design and characteristics. Applications designed to withstand interruptions are called “Fault-Tolerant”. When dealing with this issue, we shouldn’t forget recovery time (Disaster Recovery): failures will happen, and what matters is how quickly you can bring the system back up. Downtime is costly, and its cost should be calculated to get a better understanding of HA.
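To make an availability target such as two nines concrete, the allowed downtime can be computed directly from the percentage. The sketch below is only an illustration: the class and method names are made up for this example, it assumes a 365-day year, and the cost per hour is a hypothetical flat figure you would have to estimate for your own business.

```java
/** Illustrative helper (not from any library): turns an availability target into a downtime budget. */
final class DowntimeBudget {
    private static final double HOURS_PER_YEAR = 365 * 24; // assumes a non-leap year

    private DowntimeBudget() {}

    /** Allowed downtime in hours per year for a target such as 99.0 (two nines). */
    static double allowedHoursPerYear(double availabilityPercent) {
        if (availabilityPercent < 0 || availabilityPercent > 100) {
            throw new IllegalArgumentException("availability must be between 0 and 100");
        }
        return (100.0 - availabilityPercent) / 100.0 * HOURS_PER_YEAR;
    }

    /** Rough yearly downtime cost, assuming a flat (hypothetical) cost per hour. */
    static double yearlyCost(double availabilityPercent, double costPerHour) {
        return allowedHoursPerYear(availabilityPercent) * costPerHour;
    }
}
```

Under these assumptions, two nines (99%) allows about 87.6 hours of downtime per year, and three nines (99.9%) about 8.76 hours.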
There are many kinds of clustering in different tiers. Each tier may use a different name for essentially the same idea, such as virtualization, partitioning, or mirroring. Let me try to list the clustering tiers:
1. Hardware Clustering: RAID is one such technology, used for both disk performance and disk reliability.
2. OS (Server) Virtualization: Many OS-level virtualization products can run several logical servers on the same physical server at the same time.
3. DB Clustering: Many DBMSs support clustered database instances in different topologies. We can also put JDBC-level cluster libraries, which simulate a clustered database, in this category.
4. JVM Clustering: Some vendors cluster JVMs transparently, without the running applications being aware of it. Some of them use bytecode instrumentation to do this.
5. AS Clustering: Application servers (and HTTP servers) support clustering with varying capabilities. Some can replicate session state across AS instances. Some include load balancers to distribute incoming requests. Some offer transparent fail-over, and so on.
6. Application Clustering: This is the category I want to explore today. Application clustering goes hand in hand with application server clustering, but it is not the same thing. If you don’t make your application cluster-enabled, it won’t benefit from AS clustering. Even if you configure your servers for clustering, you may still need to change your application code to run on them.
Preparing a stand-alone application to run in a clustered environment requires some adjustments. Briefly, some of your application services need to be reorganized to preserve state integrity. These services probably use the Singleton pattern, and a singleton is unique only within one JVM, not across the cluster. If you build your application on frameworks or libraries, they must also support clustering in a consistent way. Otherwise, you won’t be able to make use of clustering on your own.
To keep state integrity, you may choose one of the following architectures (some application servers may already provide this infrastructure):
State Replication: You may have multiple service instances on the cluster nodes and synchronize their state by replicating it. There are many replication techniques you can use: plain socket communication, RMI, or JavaSpaces. This architecture is strong on fail-over but weaker on performance, since every state change has to be propagated to the other nodes.
Hub-and-Spoke (Master-Slave): In this architecture, only the master cluster node holds an instance of each service, and child nodes invoke the master node’s services through, for instance, RMI method calls. This architecture has an advantage in performance but a disadvantage in fail-over capability (if the master server fails, the child nodes lose the service state).
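The difference between the two architectures can be sketched in a few lines. This is a minimal in-memory illustration, not a real cluster: all type names are hypothetical, the "nodes" are plain objects in one JVM, and the direct method calls stand in for whatever transport (sockets, RMI, and so on) you would actually use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical service contract whose state must stay consistent across nodes. */
interface StateService {
    void put(String key, String value);
    String get(String key);
}

/** State replication: every node keeps a full copy and pushes each change to its peers. */
class ReplicatedNode implements StateService {
    private final Map<String, String> local = new ConcurrentHashMap<>();
    private final List<ReplicatedNode> peers = new ArrayList<>();

    void addPeer(ReplicatedNode peer) { peers.add(peer); }

    @Override public void put(String key, String value) {
        local.put(key, value);
        // In a real cluster this loop would be a network call per peer.
        for (ReplicatedNode peer : peers) {
            peer.applyRemote(key, value);
        }
    }

    /** Applies a change received from a peer without re-broadcasting it. */
    void applyRemote(String key, String value) { local.put(key, value); }

    @Override public String get(String key) { return local.get(key); } // local read, no network hop
}

/** Hub-and-spoke, hub side: the single authoritative copy of the state. */
class MasterNode implements StateService {
    private final Map<String, String> state = new ConcurrentHashMap<>();
    @Override public void put(String key, String value) { state.put(key, value); }
    @Override public String get(String key) { return state.get(key); }
}

/** Hub-and-spoke, spoke side: holds no state and forwards every call to the master. */
class SpokeNode implements StateService {
    private final StateService master; // stands in for a remote stub
    SpokeNode(StateService master) { this.master = master; }
    @Override public void put(String key, String value) { master.put(key, value); }
    @Override public String get(String key) { return master.get(key); }
}
```

In the replicated variant each write costs one call per peer, but any surviving node can still answer reads. In the hub-and-spoke variant a write is a single call, but every spoke depends on the master being reachable.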
Let’s delve into some services that must support clustering:
1. Locking: Application locks should be shared across cluster nodes. If DB locks are used, locking is already cluster-safe. If a singleton lock service does the locking, its acquire and release calls must be propagated synchronously across cluster nodes. For example, if a user runs a process on AS cluster node1 that holds a pessimistic lock on Entity A, then another user’s process running on cluster node2 must receive an exception when it tries to update Entity A.
2. Caches: Every cache service should be analyzed to see whether cluster-enablement is required. Some caches may simply be duplicated on each cluster node without causing any problem. On the other hand, let’s say we have a persistent object cache service that tracks updates to tables and checks whether any table has been modified. A process on another cluster node may update a table, and that change must be reflected in the caches on all the other cluster nodes.
3. Events: You may have designed some type of application events, and those may need to be cluster-aware. For example, you have a notification event that fires when a user sends a message to another user. If that event occurs on node1 and the other user is on node2, that user will never receive the event or the message unless the event is cluster-enabled.
4. Identities: Let’s say we have a sequence generator service that keeps a cache to avoid accessing the database every time a number is requested. If the identity service is not cluster-enabled, the identity services on different cluster nodes will generate the same identity numbers, and applications will get duplicate record errors.
5. Authentication: If users can log in on any cluster node, some authentication information needs to be shared. One example is a feature that locks a user account after multiple failed login attempts. If failed attempts are not shared, an attacker can simply continue the attempts on other nodes. I don’t list authorization here since ours did not need any adjustment to work in a clustered environment (your authorization service may need it).
6. Scheduling: Scheduled or batch tasks should be considered when clustering. The same scheduled process may try to run on every instance when a single execution is enough.
7. Logging (File IO): You have to deal with file operations in a clustered environment. One such issue is log files. If logging is not cluster-enabled, multiple cluster nodes may try to write to the same log file and corrupt it or cause IO errors. Our solution is to generate and use a different log file for each cluster node.
8. Messaging (Network IO): If you are using messaging services, you have to make sure that the messaging service is cluster-enabled. If you are using a JMS product, you need to test whether it behaves appropriately in a clustered environment. If you wrote the messaging service yourself and plan to run multiple instances of it, then you have to use different ports when cluster nodes are on the same server. We simply used a single instance of the messaging service to avoid this kind of problem.
9. Any Other Service That Needs to Be Clustered: Since there are many application services beyond this list, you may have a special application service that needs to be cluster-aware. You should provide an API that lets such services run compatibly with your cluster architecture. We did so. We have a very easy API to register services so that they run on cluster nodes seamlessly, without thinking about state sharing. Thus, any programmer can use such a service without dealing with clustering issues and details.
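The locking scenario from item 1 can be sketched as a single lock table that every node consults. This is a minimal in-memory illustration, not a production design: `ClusterLockService`, `LockedException` and the owner names are invented for the example, and in a real cluster the map would have to live in one shared place (for instance on the master node or in the database) and be reached over the network.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Thrown when another owner already holds the pessimistic lock (hypothetical type). */
class LockedException extends RuntimeException {
    LockedException(String message) { super(message); }
}

/** One shared lock table for the whole cluster; nodes must not keep private copies. */
class ClusterLockService {
    private final ConcurrentMap<String, String> ownerByEntity = new ConcurrentHashMap<>();

    /** Acquires the lock for the owner, or throws if a different owner holds it. */
    void acquire(String entityId, String owner) {
        String current = ownerByEntity.putIfAbsent(entityId, owner); // atomic check-and-set
        if (current != null && !current.equals(owner)) {
            throw new LockedException(entityId + " is locked by " + current);
        }
    }

    /** Releases the lock only if it is still held by the given owner. */
    void release(String entityId, String owner) {
        ownerByEntity.remove(entityId, owner);
    }
}
```

With this shape, a process on node2 that calls `acquire("EntityA", ...)` while a process on node1 holds the lock gets a `LockedException`, which matches the behavior described in the locking item.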
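For the identity problem in item 4, one common approach is to let each node reserve a whole block of numbers from a shared source and then hand them out locally. The sketch below assumes such a shared source exists; `BlockIdGenerator` and the `reserveBlockStart` supplier are hypothetical names, and in practice the supplier would be an atomic database operation rather than the in-memory counter used in the usage note.

```java
import java.util.function.LongSupplier;

/** Hands out IDs from a locally cached block; only block reservation touches the shared source. */
class BlockIdGenerator {
    private final LongSupplier reserveBlockStart; // must atomically reserve blockSize IDs, returning the first
    private final int blockSize;
    private long next;  // next ID to hand out
    private long limit; // exclusive end of the current block

    BlockIdGenerator(LongSupplier reserveBlockStart, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        this.reserveBlockStart = reserveBlockStart;
        this.blockSize = blockSize;
    }

    synchronized long nextId() {
        if (next == limit) { // block exhausted (or nothing reserved yet)
            next = reserveBlockStart.getAsLong();
            limit = next + blockSize;
        }
        return next++;
    }
}
```

As a usage example, two generators that share `AtomicLong shared = new AtomicLong(1)` through the supplier `() -> shared.getAndAdd(100)` draw from disjoint blocks, so the nodes never produce the same number. The trade-off is that the numbers are no longer handed out in strict global order, and unused numbers in a block are lost when a node restarts.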
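The per-node log file approach from item 7 needs little more than a node name in the file path. The helper below is a sketch under assumptions: `NodeLogFiles` and the system property `cluster.node.id` are names invented here, and how a node actually learns its own name depends on your application server.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Builds a distinct log file path per cluster node (illustrative helper). */
final class NodeLogFiles {
    private NodeLogFiles() {}

    /** Reads the node name from a hypothetical system property, e.g. -Dcluster.node.id=node1. */
    static String currentNodeId() {
        return System.getProperty("cluster.node.id", "standalone");
    }

    /** Returns a path such as logDir/application-node1.log for the given node. */
    static Path logFileFor(String logDir, String nodeId) {
        // Keep only characters that are safe in a file name.
        String safeId = nodeId.replaceAll("[^A-Za-z0-9_-]", "_");
        return Paths.get(logDir, "application-" + safeId + ".log");
    }
}
```

Each node would then open `NodeLogFiles.logFileFor(logDir, NodeLogFiles.currentNodeId())` at startup, so no two nodes ever write to the same file as long as their node names differ.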
It is very important to evaluate whether a service really needs cluster-enablement. For example, ordinary persistent objects should not be distributed across a clustered architecture, since moving them across the network would be a very large overhead (add the cost of the extra code to that). Sometimes cluster-enabling just a few services is enough to benefit from parallel computing.
A sample cluster topology: