Building High-Quality Software
How do you define good-quality software? How do you develop, measure, and ensure its quality? Read this post to learn the answers to these questions and more.
I have interviewed many engineers and managers lately, and one of the standard questions I ask is how to build high-quality software. Of course, I provide more context and explanation, but the gist is the same. I have heard all kinds of answers. However, I was puzzled that almost none were systematic; people immediately jumped to a specific pet peeve. As part of this exercise, I felt I had to crystallize my own answer to this question and write it down.
Let me start with some high-level thoughts (specifically, to make it systematic). First of all, I want to concentrate on software code quality (as opposed to larger topics such as problem definition, documentation, UX, and design). High-quality software is software that has fewer bugs (and a shorter tail of fixing the remaining issues). There are other things, like code readability, maintainability, and debuggability, that can easily be swept under the quality umbrella. Let's concentrate on the core: the product operates as expected.
I visualize the software development process as a pipeline going from idea to the product used by customers. There are other ways to imagine it. However, the bottom line is that it goes through multiple steps to get to the usable/delivered product.
The way to get high quality is reasonably straightforward: we need to catch issues as the work moves through these steps. (Yes, Captain Obvious is reporting for duty.)
I don’t believe that there is one magical approach that will capture all problems. As a result, we need defense in depth: multiple gates along this pipeline that gradually filter out issues. The more gates you have, the higher the probability that you will end up with fewer issues.
Gates that catch problems earlier in the process are better, because it is cheaper to fix issues early. Automated gates are better than manual ones. Blocking gates that prevent you from moving to the next stage are more effective than gates that sit on the side. Gates that catch a higher percentage of problems are better, too.
OK, all of the above is a very generic description, useful for a theoretical book about quality but useless without specifics. Let me get down to the real stuff applicable to most of the bread-and-butter software companies build.
Start as Early as Possible
It’s better to add these gates as early as possible, and much better to build your process around quality checks than to retrofit checks into an existing process. Classic NIST research showed that catching a bug at the beginning of the development process can be more than ten times cheaper than fixing it after it reaches production. If you start catching bugs early, it will save you time fixing them later.
Design Review
Design review is a very powerful tool when used well. It sits at the very beginning of the process, before the code is written, and can save an immense amount of time down the road (preventing someone from spending tons of time just to reach a dead end). It really helps to talk through the problem, the solution, alternative ideas, corner cases, and so on. I really like what one of the smartest people with whom I worked said: “A good design is a design where you can see the code.” It’s like working with the code without writing it.
Unfortunately, I know multiple very senior engineers who really like the “fire, aim, ready” approach: put together a prototype (even before thinking about alternatives), call that prototype an alpha version, and fix bugs and limitations in it for years to come. Saving the few hours it takes to prepare and hold a design review will cost hundreds (if not thousands) of hours fixing issues down the road.
Unit Tests
I shouldn’t have to say this in 2021, but I have never seen a quality product without unit tests. Period. There are so many benefits: unit tests prove that your code does what it should, they remove all the simple problems, and they help get rid of flaky behavior. The list of benefits goes way beyond catching bugs. They are not a silver bullet, but unit tests can easily catch a very high percentage of all your bugs.
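As a minimal sketch of what this looks like in practice, here is a set of unit tests for a hypothetical `parse_price` helper (the function and its behavior are illustrative, not from this article); each test pins down one small, specific behavior, including a corner case:

```python
def parse_price(text: str) -> float:
    """Parse a price string like "$1,299.99" into a float.

    Hypothetical helper, used only to illustrate unit testing.
    """
    cleaned = text.strip().lstrip("$").replace(",", "")
    return float(cleaned)  # raises ValueError on garbage input


# pytest-style unit tests: each one proves one small behavior.
def test_plain_number():
    assert parse_price("42") == 42.0


def test_currency_symbol_and_commas():
    assert parse_price("$1,299.99") == 1299.99


def test_garbage_input_raises():
    # Corner cases are exactly what unit tests are good at pinning down.
    try:
        parse_price("not a price")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

Each test is tiny and deterministic, which is what makes this gate cheap to run on every change.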
Code Review
Again, nothing new here. Somebody looking at your code and saying “WTF?” is a great way to see where your code is overly complex or brittle, or doesn’t handle some scenario. An important note: as with any non-automated check, you get out of it as much as you invest (rubber-stamping PRs won’t add any value).
Monitoring
We (humans) are terrible at imagining all possible permutations of a system with billions and billions of possible states. All of our testing (both unit tests and integration tests) covers a tiny sliver of those states. Unfortunately, the only place where you can see everything that can happen is production.
It’s incredible how many people entirely ignore monitoring. You may think that you know how the system works; in the best case, you know only how the system was designed to work. Many more complex and subtle problems emerge only in production and can be caught only via monitoring, alerting, and analysis.
This is probably the newest addition to my list. Like everything else here, I had to learn it the hard way. After several outages that could have been prevented by trivial monitoring, alerting, or analysis, you start treating monitoring as a first-class citizen.
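Even “trivial” monitoring can be very little code. The sketch below, written for illustration only (real systems would use something like Prometheus and Grafana), tracks recent request outcomes in a sliding window and fires an alert when the failure ratio crosses a threshold:

```python
from collections import deque


class ErrorRateMonitor:
    """Toy sliding-window error-rate alert, for illustration only.

    Records the success/failure of recent requests and reports when the
    failure ratio over the window exceeds a configured threshold.
    """

    def __init__(self, window_size: int = 100, threshold: float = 0.05):
        self.outcomes = deque(maxlen=window_size)  # True = success
        self.threshold = threshold

    def record(self, success: bool) -> None:
        # Old outcomes automatically fall out of the bounded deque.
        self.outcomes.append(success)

    def should_alert(self) -> bool:
        if not self.outcomes:
            return False
        failures = sum(1 for ok in self.outcomes if not ok)
        return failures / len(self.outcomes) > self.threshold
```

The point is not the code itself but the habit: every service ships with at least one signal that tells you, in production, that something funky is going on.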
Manual Testing
Yep, I said it. We live in a time when everybody is irked by manual testing. I tend to agree that you don’t want to spend tons of time doing only manual testing. However, for most products it is a must-have. Automated testing catches predicted problems but is almost useless for unpredicted ones.
There were so many times when one of the best QA people I have worked with came to me saying something like, “I don’t know. It works, but there is something funky in there.” That sentence is not a binary test result, and if some automated tool reported it, people would easily dismiss it as a false positive. However, as soon as I hear it from this QA person, it raises a huge red flag.
Analysis of Escaped Bugs
You don’t need to analyze each tiny bug. However, as soon as severe bugs start escaping, you need to figure out whether you should beef up one of the gates (which should have caught them) or introduce additional gates to detect these types of bugs.
Static Code Analysis (and Similar Tools)
Efficiency depends a lot on the language and the tool. The beauty of static analysis is that it is completely automatic and, as a result, very cheap. For some languages (like C++), it should be on the must-have list; other languages may be harder to handle with such tools.
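As one small illustration (the snippet is mine, not from the article), a static type checker such as mypy can reject a whole class of "forgot to handle the empty case" bugs before the code ever runs, given ordinary type annotations:

```python
from typing import Optional


def average(values: list) -> Optional[float]:
    """Return the arithmetic mean, or None for an empty list."""
    if not values:
        return None
    return sum(values) / len(values)


# With the annotations above, a checker such as mypy rejects a call site
# like the one below before the code ever runs, because `average` may
# return None and None does not support `+`:
#
#     total = average([]) + 1
#
# At runtime, the same mistake would only surface as a TypeError,
# possibly in production.
```

This is what makes the gate so cheap: it runs on every commit with no human in the loop.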
End-to-End (Integration) Tests
Some level of integration testing is helpful to see that your system works as a whole. However, it is useful as a seasoning for unit tests, not as the main dish.
I suggest having maybe one or two end-to-end tests for your major features. However, these are not unit tests: you can’t cover everything, and, more importantly, maintaining them will cost you, so you shouldn’t even try to cover everything.
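To make the "seasoning, not main dish" point concrete, here is a sketch with a tiny hypothetical word-counting pipeline: a single happy-path test exercises every stage once, while the permutations of each stage are left to unit tests (all names here are illustrative):

```python
def normalize(text: str) -> str:
    """Stage 1: trim whitespace and lowercase."""
    return text.strip().lower()


def tokenize(text: str) -> list:
    """Stage 2: split into words."""
    return text.split()


def count_words(text: str) -> dict:
    """The 'major feature': normalize, tokenize, and count words."""
    counts = {}
    for token in tokenize(normalize(text)):
        counts[token] = counts.get(token, 0) + 1
    return counts


def test_count_words_end_to_end():
    # One test through every stage of the pipeline; corner cases of
    # normalize() and tokenize() belong in their own unit tests.
    assert count_words("  The the cat ") == {"the": 2, "cat": 1}
```

One such test per major feature is usually enough to confirm the pieces still fit together.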
Excessive Manual Regression Testing
On the one hand, I understand that as a company’s customer base grows, the impact of bugs becomes more significant, so there is a desire to catch every regression. However, excessive regression testing usually indicates a lack of other gates that catch problems, which results in an overemphasis on the final regression verification.
End-to-End Tests as a Replacement for Unit Tests
As a counter-reaction to manual regression testing taking more and more time, companies try to replace it with excessive automated end-to-end tests. Unfortunately, this happens especially often with poor-quality code that has low unit-test coverage. It almost always ends up being a costly endeavor (even more expensive than the regression testing), resulting in many very fragile tests that fail left and right.
I saw a company that tried to retrofit quality like that and created a set of 8,000 copy/paste end-to-end tests. The last I heard, about 80% pass and 20% fail on each run. That 20% is mostly ignored, because analyzing 1,600 failed tests is pretty much impossible. In the best case, they are rerun, defeating the whole purpose of the exercise while also spending a great deal of time, money, and energy.
Manage Quality Purely via Metrics
Making high-quality products requires a lot of attention to detail: understanding where the problems are, the best ways to catch them, where the strengths are, and so on. Metrics abstract you away from all those details. You can gauge metrics quickly, but you can’t (read: shouldn’t) make decisions purely based on them.
To be honest, this fixation on metrics boggles my mind. I saw a company spend a nontrivial amount of time gathering all these statistics, asking people to constantly fill out a gazillion JIRA fields, Google spreadsheets, and so on, just to conclude, “This component is in good shape, and this one is in bad shape.” The funny thing is that any SRE who has worked at the company for more than a year could have provided this information in 10 minutes without wasting the time of half the engineering organization.
A side note: as soon as some process (like gathering metrics) becomes a goal (vs. being a tool), you will see more of these time-wasting activities with little or no output.
Conclusion
As you can see, nothing here is magical, and very little is unconventional. However, as I mentioned at the start, what I see missing in many of these discussions is the systematic analysis: defense in depth, choosing the proper gates, and being retrospective and detail-oriented. Even more sobering, many companies have very few people with a clear mental model of how to build high-quality software.
The list above is obviously not exhaustive; it focuses on high-level items that can easily be plugged into the development process and applied to a whole team. There are many other practices that can improve quality at a personal level (e.g., TDD, thinking through edge cases, code conciseness, and so on).
Published at DZone with permission of Victor Ronin. See the original article here.