Lessons in Software Reliability
First, stop writing lousy code
It’s unfortunate that few developers are familiar with The MITRE Corporation’s Common Weakness Enumeration (CWE), a list of common software problems. The CWE is a fascinating and valuable resource, not just for the software security community but for the broader development community. Reading through the CWE, it is disappointing to see how many common problems in software, problems that lead to serious security vulnerabilities and other serious failures, are caused by sloppy coding: not missing the requirements, not getting the design wrong or messing up an interface, but simple, fundamental, stupid construction errors. The CWE is full of mistakes like null pointers, missing initialization, resource leaks, string handling mistakes, arithmetic errors, bounds violations, bad error handling, leaving debugging code enabled, and language-specific and framework-specific errors and bad practices: not understanding, or improperly using, the frameworks and APIs. OK, there are some more subtle problems too, especially concurrency problems, although by now we should reasonably expect developers to understand and follow the rules of multi-threading to avoid race conditions and deadlocks.
The solutions to this class of problems are simple, although they require discipline:
- Hire good developers and give them enough time to do a good job, including time to review and refactor.
- Make sure the development team has training on the basics, that they understand the language and frameworks.
- Hold regular code reviews (or pair programming, if you’re into it) for correctness and safety.
- Use static analysis tools to find common coding mistakes and bug patterns.
Design for failure
Failures will happen: make sure that your design anticipates and handles them. Identify failures, contain them so that they don’t cascade, then retry, recover or restart. Fail safe. Look for the simplest HA design alternative: do you need enterprise-wide clustering or virtual synchrony-based messaging, or can you rely on simpler active/standby shadowing with fast failover?
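The contain, retry and fail-safe ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `call_with_retry`, its parameters and the fallback behaviour are hypothetical names invented for the example, and a real system would catch narrower exception types and log each failure.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry a failing operation, then fail safe instead of propagating."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            # Contain the failure here so it does not cascade to the caller.
            # Back off exponentially before the next attempt.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    # Every attempt failed: degrade to a known-safe result.
    return fallback()
```

The `sleep` parameter is injectable so that the backoff can be tested without actually waiting. Whether a fallback value is truly "safe" depends on the operation; for some calls the safer choice is to stop and raise an alert.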
Use design reviews to hunt down potential failures and look for ways to reduce the risk, prevent failure, or recover. Microsoft’s The Practical Guide to Defect Prevention, while academic at times, includes a good overview of Failure Modes and Effects Analysis (FMEA), a structured design review and risk discovery method similar to security threat modeling, focused on identifying potential causes of failures, then designing them out of the solution, or reducing their risk (impact or probability).
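As a rough illustration of how a review like this can prioritize its findings, the sketch below scores each potential failure by impact and probability and sorts the list. The 1 to 5 scales, the `FailureMode` class and the example entries are all assumptions made up for this example; a full FMEA typically uses a richer scoring scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    impact: int       # 1 (minor) to 5 (severe); hypothetical scale
    probability: int  # 1 (rare) to 5 (frequent); hypothetical scale

    @property
    def risk(self) -> int:
        # Simple risk score: impact multiplied by probability.
        return self.impact * self.probability


def rank_by_risk(modes):
    """Return failure modes ordered from highest to lowest risk score."""
    return sorted(modes, key=lambda m: m.risk, reverse=True)


# Made-up entries for illustration only.
modes = [
    FailureMode("database primary unreachable", impact=5, probability=2),
    FailureMode("log disk full", impact=3, probability=4),
    FailureMode("stale cache entry served", impact=2, probability=3),
]

for mode in rank_by_risk(modes):
    print(f"{mode.risk:>2}  {mode.name}")
```

Reducing either number lowers the score, which matches the two options described above: design the failure out (lower probability) or limit the damage when it happens (lower impact).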
Cornell University’s College of Engineering also includes a course on risk management and failure modes analysis in its new online education program on Systems Approach to Product and Service Design.
Keep it Simple
Attack complexity: where possible, apply Occam’s Razor, and choose the simplest path in design or construction or implementation. Simplify your technology stack, collapse the stack, minimize the number of layers and servers.
Use static analysis to measure code complexity (cyclomatic complexity or other metrics) and its trend: is the code getting more or less complex over time? There is a correlation between complexity and quality (and security) problems. Identify code that is over-complex, look for ways to simplify it, and in the short term increase test coverage.
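To make the complexity measurement concrete, here is a minimal sketch that approximates cyclomatic complexity for Python functions using the standard `ast` module. It is an approximation written for illustration, not a substitute for a real static analysis tool: the set of counted node types is a simplifying assumption, and nested functions are also counted toward their enclosing function.

```python
import ast

# Node types that each add one branch; an approximation, not a full metric.
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> dict:
    """Map each function name in `source` to an approximate complexity."""
    results = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            score = 1  # one path through straight-line code
            for child in ast.walk(node):
                if isinstance(child, _BRANCH_NODES):
                    score += 1
                elif isinstance(child, ast.BoolOp):
                    # `a and b and c` adds two extra decision points
                    score += len(child.values) - 1
            results[node.name] = score
    return results


# Hypothetical code under analysis.
sample = '''
def classify(n):
    if n < 0:
        return "negative"
    for d in range(2, n):
        if n % d == 0 and d > 1:
            return "composite"
    return "other"
'''

print(cyclomatic_complexity(sample))
```

Running a measurement like this on every build and recording the results is one way to see the trend over time, rather than a single snapshot.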
Test… test… test….
Testing for reliability goes beyond unit testing, functional and regression testing, integration, usability and UAT. You need to test everything you can every way you can think of or can afford to.
A key idea behind Software Reliability Engineering (SRE) is to identify the most important and most used scenarios for a product, and to test the system the way it is going to be used, as close as possible to real-life conditions: scale, configuration, data, workload and use patterns. This gives you a better chance of finding and fixing real problems.
One of the best investments that we made was building a reference test environment as big, and as close to the production deployment configuration, as we could afford. This allowed us to do representative system testing with production or production-like workloads, as well as variable load and stress testing, operations simulations and trials.
Stress testing is especially important: identifying the real performance limits of the system, pushing the system to, and beyond, design limits, looking for bottlenecks and saturation points, concurrency problems – race conditions and deadlocks – and observing failure of the system under load. Watching the system melt down under extreme load can give you insight into architecture, design and implementation weaknesses.
Other types of testing that are critical in building reliable software:
- Regression testing – relying especially on strong automated testing safety nets to ensure that changes can be made safely.
- Multi-user simulations – unstructured, or loosely structured group exploratory testing sessions.
- Failure handling and failover testing – creating controlled failure conditions and checking that failure detection and failure handling mechanisms work correctly.
- Soak testing (testing standard workloads for extended periods of time) and accelerated testing (playing at x times real-life load conditions) to see what breaks, what changes, and what leaks.
- Destructive testing – take the attacker’s perspective, purposefully set out to attack the system and cause exceptions and failures. Learn How to Break Software.
- Fuzz testing: simple, brute force automated attacks on interfaces, a testing technique that is valuable for reliability and security. Read Jonathan Kohl’s recent post on fuzz testing.
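The fuzz testing idea from the list above can be shown with a very small brute-force harness. This is a sketch under assumptions: `fuzz` and `parse_percentage` are hypothetical names written for this example, and the function under test contains a deliberate bug so that the harness has something to find.

```python
import random
import string


def fuzz(target, expected=(ValueError,), runs=1000, max_len=20, seed=0):
    """Feed random strings to `target`; return inputs that crash it unexpectedly."""
    rng = random.Random(seed)  # fixed seed so failures are reproducible
    failures = []
    for _ in range(runs):
        length = rng.randint(0, max_len)
        data = "".join(rng.choice(string.printable) for _ in range(length))
        try:
            target(data)
        except expected:
            pass  # a documented rejection of bad input is fine
        except Exception as exc:
            failures.append((data, type(exc).__name__))
    return failures


def parse_percentage(text):
    """Hypothetical function under test: parse '42%' into 0.42."""
    if text[-1] != "%":  # deliberate bug: IndexError on empty input
        raise ValueError("missing % sign")
    return float(text[:-1]) / 100


failures = fuzz(parse_percentage)
print(f"{len(failures)} unexpected failures; first: {failures[:1]}")
```

Even random input this crude finds the unhandled empty string. Dedicated fuzzing tools go much further by mutating valid inputs and tracking code coverage.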
Get in the trenches with ops
Get the development team, especially your senior technical leaders, working closely with operations staff: understanding operations' challenges, the risks that they face, the steps that they have to go through to get their jobs done. What information do they need to troubleshoot, to investigate problems? Are the error messages clear, are you logging enough useful information? How easy is it to start up, shut down, recover and restart? The more steps, the more problems. Make it hard for operations to make mistakes: add checks and balances. Run through deployment, configuration and upgrades together: what seems straightforward in development may have problems in the real world.
Build in health checks – simple ways to determine that the system is in a healthy, consistent state, to be used before startup, after recovery / restart, after an upgrade. Make sure operations has visibility into system state, instrumentation, logs, alerts – make sure ops know what is going on and why.
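A health check can start out very simple. The sketch below runs a set of named checks and produces a report that operations can read at a glance. All names here (`run_health_checks`, `is_healthy`, the example checks and the `state` dictionary) are hypothetical; real checks would query the database, message queues, disk space and so on.

```python
from typing import Callable, Dict


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, str]:
    """Run each named check and report 'ok', 'failed', or an error reason."""
    report = {}
    for name, check in checks.items():
        try:
            report[name] = "ok" if check() else "failed"
        except Exception as exc:
            # A check that blows up counts as unhealthy; keep the reason for ops.
            report[name] = f"error: {exc}"
    return report


def is_healthy(report: Dict[str, str]) -> bool:
    """The system is healthy only if every check reported 'ok'."""
    return all(status == "ok" for status in report.values())


# Stand-in state and checks for illustration only.
state = {"config_loaded": True, "pending_jobs": 12}
checks = {
    "config_loaded": lambda: state["config_loaded"],
    "job_backlog_sane": lambda: state["pending_jobs"] < 1000,
}

report = run_health_checks(checks)
print(report, "healthy" if is_healthy(report) else "UNHEALTHY")
```

The same function can be called before startup, after a recovery or restart, and after an upgrade, so that operations gets one consistent answer in each situation.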
When you encounter a failure in production, work together with the operations team to complete a Root Cause Analysis, a structured investigation where the team searches for direct and contributing factors to the failure, and defines corrective and preventative actions. Dig deep, look past immediate causes, keep asking why. Ask: how did this get past your checks and reviews and testing? What needs to be changed in the product? In the way that it is developed? In the way that it is implemented? Operated?
And ensure that you follow up on your corrective action plan. A properly managed RCA is a powerful tool for organizational learning and improvement: it forces you to think and to work together, and it creates a sense of accountability and transparency.
Change is bad…. but change is good
You don’t need to become an expert in ITIL, but if you have anything to do with developing or supporting enterprise software, at least spend a day reading Visible Ops. This brief overview of IT operations management explains how to get control over your operations environment. The key messages are:
- Poor change management is the single leading cause of failures: 80% of IT system outages are caused by bad changes by operations staff or developers. 80% of recovery time (MTTR) is spent determining what changed.
- The corollary: control over change not only improves reliability, it makes the system cheaper to operate, and more secure.
Change can be good: as long as changes are incremental, controlled, carefully managed and supported by good tools and practices. When the scope of change is contained, it is easier to get your head around, review and test. And with frequent change, everyone knows the drill – the team understands the problems and is better prepared if any problems come up.
Implement change control and release management practices. Include backout planning, rollback planning and testing. Take compatibility into account in your design and implementation. Create checklists and reviews.
Reliable software, like secure software, doesn’t come for free, especially up front, when you need to effect changes, put in more controls. You must have management, and customer, support. You need to change the team’s way of thinking: to use risk management to drive priorities, shape design and implementation and planning. Get your best people to understand and commit: the rest will follow.
Keep in mind of course that there are limits, that tradeoffs need to be made: most of us are not building software for the space shuttle. In Software Quality at Top Speed, Steve McConnell shows that development teams that build better quality, more reliable software actually deliver faster, up to a peak efficiency of 95% defects removed before production release. However, you reach a point of rapidly diminishing returns as you approach the end of the curve, attempting to hit 100% defect-free software, where costs and schedule increase significantly.
Timeboxing is an effective technique to contain scope and cost: do as much as you can, as well as you can, within a hard time limit. Following Japanese manufacturing principles, make sure that anyone on the team can pull the cord and postpone a release or cancel a feature because it is unstable.
It is sobering, almost frightening, how easy it is, how natural it is, for developers and managers to short-change quality practices, to place feature delivery ahead of reliability, especially under pressure. Ensure that you build support across the organization, build a culture that puts reliability first. Like any change, it will require patience, commitment, and unrelenting follow-up.