Solve Every Problem Twice
One habit that I think every software developer, if not practically every professional in any field, can benefit from is that of solving every problem twice.
Joel Spolsky, writing about customer service, calls this “Fix Everything Two Ways”:
Almost every tech support problem has two solutions. The superficial and immediate solution is just to solve the customer’s problem. But when you think a little harder you can usually find a deeper solution: a way to prevent this particular problem from ever happening again.
Obviously, I believe this principle applies to more than just customer service.
A related concept comes out of Toyota: the Five Whys. Quoting from Wikipedia:
Five Whys is an iterative interrogative technique used to explore the cause-and-effect relationships underlying a particular problem. The primary goal of the technique is to determine the root cause of a defect or problem by repeating the question "Why?". Each answer forms the basis of the next question.
When tackling an observed problem, whether it be in code, business processes, or potentially even a leaky sink, I like to combine these principles into a technique I call “Solve Every Problem Twice”.
Solve Every Problem Twice
As with the Five Whys, don’t take “twice” too literally. In practice, this technique should always yield at least two solutions, and will often yield five or more.
The steps I follow are:
1. The Five Whys
Use the Five Whys to determine the multiple causes of the observed problem.
2. Solve Each Problem At Least Once
Apply Joel Spolsky’s advice of solving each cause at least once. Each cause should have an immediate fix, and most will also have at least one deeper solution.
3. Repeat for the Problem-solving Process
Go through the first two steps again, this time for the process of solving the original observed problem.
A Real-World Example
To illustrate the technique in practice, let me describe a problem I ran into recently.
I was updating one of my websites when I ran into a problem. I host the code for this website on GitLab, where I use GitLab-CI for continuous integration and deployment. I have GitLab-CI configured to create a review environment whenever a merge request is created.
When I recently pushed a change, I discovered the review environment was not working: Chrome showed its familiar “Your connection is not private” warning, which appears when a site’s SSL certificate is broken.
I use Let’s Encrypt to manage my SSL certificates. Sometimes it can take a few minutes to get a new certificate, so I was patient. But half an hour later it was still not working, so I knew I had a legitimate problem.
With a little digging through my Kubernetes logs, I found the cause of the problem:
Status:
  Acme:
    Uri:
  Conditions:
    Last Transaction Time: 2019-11-18T08:41:49Z
    Message: Failed to verify ACME account: acme: urn:ietf:params:acme:error:rateLimited: Your ACME client is too old. Please upgrade to a newer version.
    Reason: ErrRegisterACMEAccount
    Status: False
    Type: Ready
I then looked in the configuration for my Kubernetes cluster and found that I was requiring version 0.5.2 of the cert-manager package.
helm install stable/cert-manager \
  --name cert-manager \
  --version 0.5.2 \
  --set ingressShim.defaultIssuerName=letsencrypt-prod \
  --set ingressShim.defaultIssuerKind=ClusterIssuer \
  --namespace kube-system \
  --tls
At the time, the latest version was 0.11.0, so an upgrade was clearly in order. With the immediate and root causes determined, let’s go through the steps outlined above.
The Five Whys
The observed problem was that the SSL certificate was broken, which leads to our first why:
1. Why is the SSL certificate broken?
The reason, as discovered above, is that I was using an old, unsupported version of the cert-manager package. This leads to the second why:
2. Why do I need to upgrade the certificate manager?
As you may recall from above, I was explicitly requesting version 0.5.2 of the cert-manager package. Perhaps it would be reasonable to always install the latest version.
Solve Each Problem At Least Once
Now I can go through the two problems I identified above, and resolve to solve each at least once.
1. Install the latest version of the certificate manager
This will solve the immediate, superficial problem, and get my website working again.
2. No longer require a specific version, and always install the latest
This will prevent the problem from recurring. Of course, this may expose my system to a new risk if a new version of cert-manager somehow breaks something, but it may be a risk worth taking.
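The version gap behind this problem is easy to misjudge by eye: 0.11.0 is newer than 0.5.2, even though it sorts earlier when compared as plain text. Here is a minimal Python sketch of a numeric comparison, assuming simple dotted versions with no pre-release suffixes. The function names are hypothetical and are not part of Helm or cert-manager; a real check would more likely rely on a proper version-parsing library.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn "0.11.0" into (0, 11, 0). A leading "v" is tolerated.

    Assumption: only plain dotted integers, no suffixes like "-beta.1".
    """
    return tuple(int(part) for part in text.lstrip("v").split("."))


def is_behind(pinned: str, latest: str) -> bool:
    """True when the pinned version is older than the latest one."""
    # Tuples of ints compare component by component, so 5 < 11 as expected.
    return parse_version(pinned) < parse_version(latest)
```

With the versions from this example, `is_behind("0.5.2", "0.11.0")` is true, whereas comparing the raw strings would put 0.5.2 after 0.11.0.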
Repeat for the Problem-solving Process
But don’t forget the final step! Repeat for the problem-solving process itself. In my example, I found two areas where I believe I could have improved the process of fixing the problem.
1. I should have noticed the problem sooner
I don’t update MinimalPairs.net very often. For all I know, this problem may have been lying in wait for weeks before I attempted an update and noticed. Two possible solutions come to mind for this problem. The first is to use a simple monitoring service to alert me when the website’s SSL certificate is no longer working.
Second, and more proactively, I could use the same error logs which I used to debug the problem, and have them sent to a service such as Sentry.io, which can notify me immediately whenever a problem occurs.
In the spirit of solving each problem twice, I should do both of these.
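As a rough illustration of the first option, here is a minimal sketch of a certificate check in Python, using only the standard library. It is not how any particular monitoring service works; the function names `days_until_expiry` and `check_certificate` are made up for this example, and a real setup would run the check on a schedule and send the result somewhere I would actually see it.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the format returned by getpeercert(),
    for example "Nov 28 08:41:49 2019 GMT". Negative means expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def check_certificate(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect to the host and report whether its certificate verifies."""
    # The default context checks both the certificate chain and the hostname.
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
    except OSError as exc:
        # Covers verification failures, timeouts, and refused connections.
        return f"broken: {exc}"
    remaining = days_until_expiry(cert["notAfter"], datetime.now(timezone.utc))
    return f"ok: {remaining:.0f} days left"
```

Calling `check_certificate("example.com")` from a scheduled job and alerting on any result that starts with `broken` would have surfaced this problem without waiting for my next deployment.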
2. It should have been easier to find the failure logs
Once the problem was identified, debugging it took longer than it should have, largely because Kubernetes doesn’t keep all logs in a centralized location. This could be solved by setting up a centralized logging system. I already use Loggly for most of my logging, so I can set it up to track my Kubernetes logs as well.
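Whichever system ends up collecting the logs, the point is to be able to pull the failure out quickly. As a small sketch, assuming log lines shaped like the cert-manager message shown earlier, the following Python function picks out ACME error types from a stream of lines. The name `find_acme_errors` is hypothetical, and this is only an illustration of the idea, not a feature of Loggly or Kubernetes.

```python
import re
from typing import Iterable, List

# ACME errors are reported as URNs such as
# "urn:ietf:params:acme:error:rateLimited"; capture the final segment.
ACME_ERROR = re.compile(r"urn:ietf:params:acme:error:(\w+)")


def find_acme_errors(lines: Iterable[str]) -> List[str]:
    """Return the ACME error type from each line that contains one."""
    found = []
    for line in lines:
        match = ACME_ERROR.search(line)
        if match:
            found.append(match.group(1))
    return found
```

Run against the status message above, this returns `["rateLimited"]`, which is the kind of signal that could be searched for or alerted on once the logs live in one place.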
Using my technique, I came up with five potential solutions to a simple SSL certificate problem:
- Upgrade the certificate manager
- Don’t depend on a specific version of the certificate manager
- Set up monitoring for the website
- Set up error alerting
- Set up better logging
By applying all five of these solutions, I not only solve the immediate problem but also improve the overall health of my entire system, so that the next problem, wherever it happens in the technology stack, will be that much easier to solve.
Published at DZone with permission of Jonathan Hall. See the original article here.