Architecting for Failure
Read on for Gaurav Purandare's article from DZone's upcoming Guide to Building and Deploying Applications on the Cloud and glean some valuable tips for developing cloud environments—specifically, how to architect for failure.
The design of all credible, fault-tolerant architectures is based on one extremely important principle:
"Anything that can go wrong, will go wrong." - Murphy’s Law
Murphy’s Law makes it mandatory to treat failures as an integral part of any system, something that can be particularly tricky in cloud environments because users expect near 100 percent uptime. Today, many organizations embrace the cloud in the belief that they are automatically shielded from infrastructure-related issues, or that such issues are "just handled" by their cloud provider without impacting the application’s functionality. Although this can be the case for some incidents, there are many failures you still need to account for in your application design.
Isolate Components
As the level of complexity in applications continues to increase through new layers of abstraction and the introduction of cutting-edge technology, isolation becomes a powerful aid in anticipating and mitigating failures. Isolating individual application components onto different systems not only reduces overall complexity, but also provides clarity on which resources each component utilizes. Maxed-out resources are often an indicator of imminent failure, and isolation helps pinpoint such issues. Further, when component isolation is combined with replication, it prevents applications from having single points of failure, adding resilience and making failures less likely overall.
Determine the Reasons for Failures
Hardware and Infrastructure
Even though the risk of hardware failure might appear to be a dwindling concern, it cannot be ignored altogether. Applications should always be designed with hardware failure in mind.
For example, having sticky sessions between a load balancer (LB) and the servers that sit behind that LB causes all traffic for a specific user session to be routed to a single server. If that server fails, the user experiences the application or service becoming unavailable.
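One way to remove this single point of failure is to disable stickiness at the load balancer and keep session state in a shared store that every server can reach. As a rough sketch, assuming an AWS Application Load Balancer managed with boto3 (the target group ARN below is a placeholder), stickiness can be switched off like this:

```python
# Sketch: disable sticky sessions on an ALB target group so any healthy
# server behind the LB can handle any request.
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

elbv2.modify_target_group_attributes(
    # Placeholder ARN; substitute the target group used by your application.
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/web/abc123",
    Attributes=[{"Key": "stickiness.enabled", "Value": "false"}],
)
```

Session data then has to live somewhere shared, such as Redis or a database, rather than in the memory of a single instance.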
Paradoxically, rapid success can breed failure as well. An application that experiences rapid adoption may not be able to cope with the corresponding server or networking load. If an application supports auto-scaling, all implicated servers should be configured to start the necessary services at system boot so that new instances can serve traffic when demand spikes. Regular health checks (industry best practice is every five minutes or less) within and via the LB infrastructure become essential.
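The health check itself can be as simple as a lightweight HTTP endpoint that the LB polls. Here is a minimal sketch using only the Python standard library; the /health path and port 8080 are assumptions, not requirements:

```python
# Sketch: a minimal /health endpoint for LB health checks. A failing
# check lets the LB stop routing traffic to this instance and lets
# auto-scaling replace it.
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"OK")
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

In a real application, the handler would also verify critical dependencies (database connectivity, free disk space, queue depth) before reporting healthy.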
Similarly, if failure occurs with a database server (assume the worst-case scenario, i.e. the primary or master server is impacted), there needs to be a configuration in place to elect a new primary or master node. Application libraries need to adjust their database connections accordingly, and ideally this should be an automated process.
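Most modern database drivers handle this automatically if they are given the full replica set rather than a single host. A hedged sketch with PyMongo, where the hostnames and replica set name are placeholders:

```python
# Sketch: connect to a MongoDB replica set so the driver follows primary
# elections automatically instead of pinning to a single host.
from pymongo import MongoClient
from pymongo.errors import AutoReconnect

client = MongoClient(
    # Placeholder hosts and replica set name.
    "mongodb://db1.internal,db2.internal,db3.internal/?replicaSet=rs0",
    retryWrites=True,               # retry a write once after a failover
    serverSelectionTimeoutMS=5000,  # fail fast if no primary is reachable
)

def save_event(event):
    try:
        client.appdb.events.insert_one(event)
    except AutoReconnect:
        # The driver is rediscovering the new primary; retry once or queue the write.
        client.appdb.events.insert_one(event)
```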
Software and Configuration
Apart from expected, well-known, and well-understood issues (such as resource over-utilization), the application should be able to cope with systemic and global changes. A classic example is the leap second bug of 2012, which caused many issues for Linux servers. Today, applications are prepared for this issue, hence the leap second added in 2015 was a much smoother transition. Although no one can predict when the next undiscovered issue of this magnitude will strike, by following major industry blogs and key influencers on Twitter, you can discover and address new issues swiftly after they have been identified.
Sometimes it is neither the hardware nor the application, but ancillary configuration that causes failure. Perhaps it’s a simple firewall rule change, or the lack of a static IP address for a component that is referenced by several others. This is exacerbated when multiple people or teams collaborate on an application. A useful strategy to avoid such issues is to restrict access to critical rules and configuration to a small set of trusted individuals who have a holistic understanding of the overall environment.
Be Proactive
Having proper monitoring in place helps you proactively spot possible failures before they become an issue. Build scripts to perform daily sanity tests and to automate port scanning and access-control checks, so you’re not wasting time or getting sloppy with essential but repetitive tasks. That way, you only need to pay attention when a script issues an alert.
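A daily sanity script does not need to be elaborate. The sketch below checks which ports are open on a list of hosts and flags anything outside an allow-list; the hosts and allowed ports are placeholders to adapt to your environment:

```python
# Sketch: daily sanity check that flags unexpected open ports.
# HOSTS and ALLOWED_PORTS are placeholders for your environment.
import socket

HOSTS = ["web1.internal", "web2.internal"]
ALLOWED_PORTS = {22, 80, 443}
SCAN_PORTS = range(1, 1025)

def open_ports(host, ports, timeout=0.5):
    found = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            if s.connect_ex((host, port)) == 0:
                found.append(port)
    return found

if __name__ == "__main__":
    for host in HOSTS:
        unexpected = set(open_ports(host, SCAN_PORTS)) - ALLOWED_PORTS
        if unexpected:
            print(f"ALERT: {host} has unexpected open ports: {sorted(unexpected)}")
```

Wire the output into whatever alerting channel you already use so the script only demands attention when something changes.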
However, be sure to also monitor the monitoring scripts themselves. If your monitoring system doesn’t show a status of "all green", you know you have a problem. If it does show all green across the board, it is still recommended to confirm that there isn’t a more nefarious issue with the monitoring system itself and to verify that the correct metrics and parameters are being monitored.
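One simple safeguard is a heartbeat check that runs outside the monitoring system and verifies it is still producing fresh data, sometimes called a dead man's switch. The sketch below assumes your monitoring tool exposes the timestamp of its most recent data point over HTTP; the URL and JSON field are hypothetical and would need to be mapped to your tool's actual API:

```python
# Sketch: verify the monitoring system itself is alive by checking how old
# its most recent data point is. The endpoint and JSON field are hypothetical.
import json
import time
import urllib.request

MONITORING_STATUS_URL = "http://monitoring.internal/api/last-datapoint"  # hypothetical
MAX_AGE_SECONDS = 600  # alert if the monitoring system has been silent for 10 minutes

def check_monitoring():
    with urllib.request.urlopen(MONITORING_STATUS_URL, timeout=10) as resp:
        last_ts = json.load(resp)["timestamp"]  # hypothetical field name
    age = time.time() - last_ts
    if age > MAX_AGE_SECONDS:
        print(f"ALERT: monitoring system has produced no data for {age:.0f}s")

if __name__ == "__main__":
    check_monitoring()
```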
Rehearse Failure
Replicate your critical systems and then simulate failure. Review error messages and check the remedial actions documented on your recovery checklists. Try to determine the maximum scalability of different components in your system. While such measures might seem like overkill for a casual application, they are a worthwhile investment for all mission-critical ones. This particular precaution won’t necessarily decrease your failure rate, but it will help you detect and respond to issues faster whenever they occur on a key production system.
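On AWS, a rudimentary version of this rehearsal can be scripted by stopping a random instance in a test copy of your environment and watching whether health checks and auto-scaling recover on their own. A hedged sketch with boto3; the tag filter is a placeholder, and this should only ever target a staging replica, never production:

```python
# Sketch: a tiny chaos experiment. Stop one random instance in a *staging*
# environment and observe whether the system recovers without manual help.
import random
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Placeholder tag filter; point this at a staging replica, never production.
resp = ec2.describe_instances(
    Filters=[
        {"Name": "tag:Environment", "Values": ["staging"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)

instances = [
    i["InstanceId"]
    for r in resp["Reservations"]
    for i in r["Instances"]
]

if instances:
    victim = random.choice(instances)
    print(f"Stopping {victim} to rehearse failure...")
    ec2.stop_instances(InstanceIds=[victim])
```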
Never Stop Learning, Never Stop Tuning
Even with the best architecture, the most reliable technologies, and optimal operations, failures, big and small, will occur. When they do, make it a point to learn the lessons each failure invariably offers and to prevent the issue from recurring. Perform a root cause analysis. Discuss with engineering, with QA, with DevOps, and with the team consuming the app or service that was ultimately impacted. Update your checklists and use each failure to make yourself smarter and your systems more resilient.
Cloud Architecture Blueprint
The above diagram illustrates a high-level architecture based on Built.io Flow. All internet-facing services use an Elastic Load Balancer (ELB) layer to balance traffic across a pool of Amazon EC2 instances that belong to a corresponding auto-scaling group. These EC2 instances span multiple Availability Zones, which avoids downtime when a single zone fails.
Because of built-in redundancy, a connection failure to database instances (e.g., Redis and MongoDB) is handled gracefully by the application. If connectivity issues are observed for a primary/master database server, for example, the application can immediately reconnect to the newly elected primary/master server.
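For the Redis side, the same behavior can be achieved with Redis Sentinel, which tracks the current master and points clients to it after an election. A rough sketch, where the sentinel hosts and the service name are placeholders:

```python
# Sketch: ask Redis Sentinel for the current master so the application
# reconnects automatically after a failover.
from redis.sentinel import Sentinel

sentinel = Sentinel(
    [("sentinel1.internal", 26379), ("sentinel2.internal", 26379)],  # placeholders
    socket_timeout=0.5,
)

# "mymaster" is the service name configured in Sentinel (placeholder).
master = sentinel.master_for("mymaster", socket_timeout=0.5)
master.set("last_health_check", "ok")
```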
"Design for failure so things don’t fail."
Best Practices | Tools/Techniques
Isolate Components | Review architecture diagrams.
Hardware and Infrastructure | Perform stress testing using load-testing tools like Loader.io and LoadView.
Software and Configuration | Validate application configurations, e.g. if a config file is stored in JSON, validate it using JSONLint or an equivalent tool (see the sketch below this table).
Be Proactive | Build port scanning scripts; CloudCheckr works well for AWS.
Rehearse Failure | Test your system by stopping or deleting a random component. For larger operations, consider deploying Chaos Monkey.
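For the configuration check in the table above, even a few lines of Python can catch a malformed JSON config file before it reaches production; the file path below is a placeholder:

```python
# Sketch: fail fast if a JSON config file is malformed.
import json
import sys

CONFIG_PATH = "config/app.json"  # placeholder for your own config location

try:
    with open(CONFIG_PATH) as f:
        json.load(f)
except (OSError, json.JSONDecodeError) as exc:
    print(f"Invalid config {CONFIG_PATH}: {exc}")
    sys.exit(1)

print(f"{CONFIG_PATH} is valid JSON")
```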