On Architects, Architecture, and Failures
Thoughts on architecture, the things that can go wrong, and how to deal with them
Let’s consider two things:
1.) Bad things happen to good people
2.) Architects are people
Ergo, bad things happen to good architects.
In other words, at some point, no matter how much effort you and your team put into designing resilient, high-performing, well-architected systems – something is going to blow up spectacularly and make you look silly. Call it Murphy’s Law.
Why Things Fail
When we design systems, we usually try to do it well. We write good code, we write test cases, we follow frameworks and best practices. All of these things are under our control (and even these aren’t always bulletproof). However, the problem comes in when things are not under our control or we don’t even consider the possibility that something can go wrong (the unknown unknowns).
A couple of examples of why things fail:
- Code can have unexpected bugs that we couldn’t foresee (“that can be null? No way!”).
- Third-party services that we depend on can go down.
- A server can run out of disk space.
- Databases can go down (or misconfigured anti-virus software can scan your data files 24/7 and cause horrible slow-downs).
- The network may not be as reliable as you think.
Ultimately, if we look beyond the code that we interact with, there’s a ton of complexity under the surface – from the hardware that something runs on (yes, that’s there, even if you are in the cloud) to operating systems, containers, virtual machines and runtime environments, networks, etc.
Consider the code below.
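Something like the canonical Java hello-world makes the point:

```java
// The canonical "Hello World!" -- deceptively simple, yet it depends on the
// compiler, the JVM, the operating system, the hardware, and the console
// all working together correctly.
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello World!");
    }
}
```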
A multitude of things have to come together in order for it to print some characters to a console. Now, think beyond “Hello World!” to distributed enterprise systems with multiple components, produced by multiple parties, running on multiple different tech stacks.
As such, we should be wary of overconfidence in our own ability – systems are far from trivial, even if they are “simple” systems.
How to Make it Less Painful When Things Fail
In order to make it easier to deal with failure, we should first accept that failure at some point is pretty much inevitable. Once you’ve made peace with this, and it becomes a standing consideration in the back of your mind whenever you design something, you can start looking past some of your blind spots.
Do Not Make Things More Complex Than They Need to Be
Systems are complex already, so if you design something, consider whether it can be simplified (while still being fit-for-purpose). Unnecessary complexity increases both the likelihood of failure (due to more moving parts) and the difficulty involved in trying to fix a failure. This is a good point to plug in a reminder of Kernighan’s Law.
“Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?”
One of the dangers here comes in the form of resume-driven development. Sure, the shiny tech/framework/approach will look great on your CV, but is it actually necessary?
Microservices have lots of benefits, but if your employee-leave tracking system will only ever have 50 users – does “LeaveService + EmployeeService + HolidayService + orchestration + all the overhead that goes with it” really give you any meaningful benefit over “LeaveTrackingSystem.jar”?
There’s already a problem to solve, so be aware of the essential complexity and try to avoid creating additional problems through accidental complexity.
Detect Failures Early
Since there are many moving parts to a system, see if you can find a way to detect issues early.
For code issues, the first place to catch problems is in your automated testing process. So make sure you have CI/CD pipelines and decent unit and integration test coverage (we can debate what “decent” means, but it’s definitely not 100%). This will tell you if something obvious breaks before you put it in production.
Once your application is deployed, you also start caring about all the other moving parts. This is where tools like log aggregation systems and infrastructure monitoring become critical. There are loads of fancy commercial offerings, but I’ve also worked on a team where we had our own set of monitoring tools – nothing complex, but enough to let us know if a process didn’t kick off when it was supposed to, so that someone could intervene. This is the canary in your coal mine.
What is key, regardless of what kind of tooling you use, is to make sure that failures are visible as soon as they happen. In the pre-pandemic days, a big screen in the office was a great way to do that. Now that WFH has become the norm, messages from the tooling to Slack or Teams might be a better option. Nonetheless, if responses from a service start taking longer than expected or if a database goes down, you want someone to know about it immediately.
Gather as Much Information as You Can
Once you know that something has failed, you want as much information as possible to track down that failure.
There are a couple of ways of doing this.
- If you pass request/response messages around (and you have something that logs those messages), use a unique identifier (think of something like correlation IDs) throughout the entire interaction to tie messages together. This is particularly useful if a failure is caused by data in the payload of a message.
- Log timestamps (and make sure that the time is synced across all the servers in your solution). This in itself doesn’t give you an answer, but it helps you to figure it out. For example, if something fails exactly 120 seconds after invoking a service, that smells like a timeout.
- In your implementation (and it’s up to your team to decide what you deem to be best practice), make sure you log sufficient detail in your log messages to help you trace something through the system. “Payment failed” is a terrible message, especially if there are hundreds of instances of it in your logs. “Payment failed. Payment ref: 123-456. Error: java.lang.NullPointerException at …” is much better.
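The first two points above can be sketched together: a minimal logging helper that stamps every line with a timestamp and a per-request correlation ID. The class and method names here are illustrative, not from any particular logging library.

```java
import java.time.Instant;
import java.util.UUID;

// Sketch of correlation-ID logging: one ID is generated per incoming request
// and attached to every log line produced while handling that request.
public class CorrelatedLogger {
    private final String correlationId;

    public CorrelatedLogger(String correlationId) {
        this.correlationId = correlationId;
    }

    public static CorrelatedLogger forNewRequest() {
        // Pass this ID along to every downstream call so that log lines
        // from different services can be tied together afterwards.
        return new CorrelatedLogger(UUID.randomUUID().toString());
    }

    public String format(String message) {
        // Timestamp + correlation ID + detail: enough to trace an interaction.
        return Instant.now() + " [" + correlationId + "] " + message;
    }

    public static void main(String[] args) {
        CorrelatedLogger log = CorrelatedLogger.forNewRequest();
        System.out.println(log.format("Payment failed. Payment ref: 123-456."));
    }
}
```

In a real system the ID would arrive via an HTTP header or message property rather than being generated locally, but the principle is the same.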
Make it Easy to Isolate a Failure
Your approach to design can be used to isolate failures. Think of high cohesion and low coupling, as well as single responsibility for different components.
If you use a layered architecture and you apply coding standards that are clear on the separation of concerns and what belongs in each layer, your errors and exceptions alone will provide some context as to where something went wrong. Database connection issues? Go check the data access layer. Business-specific exceptions? Go check the service layer.
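As a sketch of that idea, layer-specific exception types (the names below are hypothetical) make the error itself point at the layer to investigate:

```java
// Illustrative layer-specific exceptions; the names are hypothetical.
public class LayeredErrors {
    static class DataAccessException extends RuntimeException {
        DataAccessException(String message) { super(message); }
    }

    static class BusinessRuleException extends RuntimeException {
        BusinessRuleException(String message) { super(message); }
    }

    // The exception type alone narrows down where something went wrong.
    static String whereToLook(RuntimeException e) {
        if (e instanceof DataAccessException) return "data access layer";
        if (e instanceof BusinessRuleException) return "service layer";
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(whereToLook(new DataAccessException("connection refused")));
    }
}
```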
If you depend on different services and some are not critical, how do you prevent a failure in one of those components from bringing down your entire solution? If a third-party service provides you with a list of public holidays (let’s say that is non-critical to your leave tracking system) and it becomes inaccessible for a while, do you really want to render your entire user interface unusable, or do you want to display a message saying something like “we can’t display holidays right now” but leave everything else in a working state? The latter option definitely feels preferable in my mind.
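A minimal sketch of that graceful degradation, assuming a hypothetical `HolidayClient` interface standing in for the third-party holiday service:

```java
import java.util.List;

// Sketch: degrade gracefully when a non-critical dependency fails.
// HolidayClient is a hypothetical stand-in for the third-party service.
public class HolidayWidget {
    interface HolidayClient {
        List<String> fetch() throws Exception;
    }

    static String render(HolidayClient client) {
        try {
            return "Holidays: " + String.join(", ", client.fetch());
        } catch (Exception e) {
            // Only this widget degrades; the rest of the UI stays usable.
            return "We can't display holidays right now.";
        }
    }

    public static void main(String[] args) {
        // Simulate the holiday service being down.
        System.out.println(render(() -> { throw new Exception("service down"); }));
    }
}
```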
Alternatively, if you have a set of third parties that you depend on for some or other service (let’s say a delivery partner for your e-commerce system) – do you really want an outage on their side to knock your business out and prevent you from generating revenue? In this scenario, isolate yourself from failures on the other side by putting an asynchronous messaging layer in between. Assuming that it’s not time-critical, your solution can put a message on a queue and something else can process that request once it is available.
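The queue-based isolation can be sketched with an in-memory `BlockingQueue` (in production this would be a durable broker such as a message queue, not an in-process collection):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of queue-based isolation: the order flow only needs to enqueue a
// request; a separate worker forwards it to the delivery partner when the
// partner is available again.
public class DeliveryBuffer {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    public void requestDelivery(String orderId) {
        // Returns immediately, even if the partner's API is down right now.
        queue.offer(orderId);
    }

    public String nextPending() throws InterruptedException {
        // A worker thread calls this and retries against the partner API.
        return queue.take();
    }

    public static void main(String[] args) throws InterruptedException {
        DeliveryBuffer buffer = new DeliveryBuffer();
        buffer.requestDelivery("order-42");
        System.out.println("Pending: " + buffer.nextPending());
    }
}
```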
Make it Easy to Recover From Failure
A particularly difficult failure to recover from is one that brings down an entire environment and requires it to be recreated. This is easier to work around in cloud-based environments, where things like auto-scaling groups can automatically bring up new instances. You also want to make sure that you have a DR strategy in place.
Much of this hinges on being able to recreate your environments quickly, which is exactly what infrastructure-as-code (IaC) gives you. Having someone set up servers by hand means that it takes much longer to recreate an environment, with an increased likelihood of human error and more time-consuming debugging. Think cattle, not pets.
What to Do After Things Have Failed
Once something has failed, don’t just frantically rush to fix it. Do a root cause analysis and figure out exactly why it failed. Then, put something in place to prevent the same kind of failure from happening again.
If it was a NullPointerException in your code, add a unit test to make sure your code can deal with it if it happens again.
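A minimal sketch of such a fix: guard against the null, then pin the behavior down with the kind of check a unit test would encode (plain assertions here; in practice this would live in a JUnit test). The class and method names are illustrative.

```java
// Sketch of a null-guard fix for the "that can be null? No way!" case.
public class PaymentFormatter {
    static String describe(String paymentRef) {
        if (paymentRef == null) {
            return "Payment ref missing";
        }
        return "Payment ref: " + paymentRef;
    }

    public static void main(String[] args) {
        // The regression checks a unit test would capture:
        if (!describe(null).equals("Payment ref missing")) throw new AssertionError();
        if (!describe("123-456").equals("Payment ref: 123-456")) throw new AssertionError();
        System.out.println("checks passed");
    }
}
```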
If a server runs out of disk space and your application crashes, allocate more disk space, put a clean-up job in place to remove old files, and make sure you have monitoring that alerts you when disk space runs low again.
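The monitoring side of that can be sketched as a simple disk-space canary; the 10% threshold below is an illustrative choice, not a recommendation:

```java
import java.io.File;

// Minimal disk-space canary: alert before the disk fills up, instead of
// crashing when it does. A scheduled job would run this periodically and
// send the result to your alerting channel.
public class DiskSpaceCheck {
    static boolean isLow(long usableBytes, long totalBytes, double threshold) {
        return totalBytes > 0 && (double) usableBytes / totalBytes < threshold;
    }

    public static void main(String[] args) {
        File root = new File(".");
        boolean low = isLow(root.getUsableSpace(), root.getTotalSpace(), 0.10);
        System.out.println(low ? "ALERT: disk space low" : "Disk space OK");
    }
}
```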
Also, make sure you learn from failures. Have retrospectives, document the outcomes, and make sure that knowledge is distributed to the rest of your team and your organization. Failure isn’t always cheap, so make sure you don’t have to pay for the same lesson multiple times.
In summary – accept that architectures are not infallible, and be prepared to deal with failures when they happen. If you have any other thoughts on the topic, let me know in the comments!
Published at DZone with permission of Riaan Nel. See the original article here.