In a recent blog post, I wrote about Test Gap analysis, our analysis that identifies changed code that was never tested before a release. Often, these areas (the so-called test gaps) are far more error-prone than the rest of the system, which is why test managers try to test them most thoroughly.
Test Gap analysis provides test managers with an overview of remaining test gaps during the test phase. In practice, our code quality software Teamscale identifies changed but untested methods in the source code and displays the results on a treemap.
However, in many projects testers are non-coders or do not know the source code, so they cannot derive actions merely from knowing which methods have or have not been executed. In our experience, interpreting the results requires knowledge of both the source code and the test cases, so developers and testers typically need to interpret them together. It would be helpful to represent test gaps in a way that testers can understand and that allows them to decide which actions to take.
Fortunately, there are artifacts that help us to bridge this gap. Change requests document the changes that have been performed in a software system. They are written in natural language and therefore also understandable to non-coders. In addition, in many systems, there is a reliable connection between change requests and code changes via commit messages, which allows us to link from change requests to test gaps. Since new features typically correspond to one or more change requests, we can lift test gap information from code to features.
Building on this, we came up with feature coverage — the aggregation of code changes and test execution based on change requests. Our theory was that this would make Test Gap analysis results more understandable and actionable to testers, as feature coverage lifts test gaps to the level of functional units that have not been executed.
To better understand the benefits and limitations of feature coverage, we carried out an empirical study, which is part of the Bachelor’s Thesis of Jakob Rott. In the following, I will give an overview of selected results.
How to Calculate Feature Coverage
Test Gap analysis combines two analyses: a static analysis to detect code changes, and a dynamic analysis to detect method execution. What do we need to modify to get change-request-specific results?
Figure 1. Commit with associated change request in Teamscale.
First, in many systems, developers write commit messages that contain the ID of the change request that they worked on in the commit. In Teamscale, we use these identifiers to group commits according to change requests to provide an overview of the quality-related effects per change request, as illustrated in Figure 1.
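To make the grouping step concrete, here is a minimal sketch in Python. It is not how Teamscale implements it. The ID format `CR#<number>`, the function name, and the sample commits are all assumptions made up for illustration; real projects follow their own issue tracker conventions.

```python
import re
from collections import defaultdict

# Hypothetical ID format such as "CR#101"; adapt the pattern to your tracker.
CHANGE_REQUEST_ID = re.compile(r"CR#(\d+)")


def group_commits_by_change_request(commits):
    """Map each change request ID to the commits whose message mentions it."""
    groups = defaultdict(list)
    for commit_hash, message in commits:
        # set() avoids listing a commit twice if its message repeats an ID.
        for cr_id in sorted(set(CHANGE_REQUEST_ID.findall(message))):
            groups[cr_id].append(commit_hash)
    return dict(groups)


# Made-up sample history: (commit hash, commit message)
commits = [
    ("a1f3", "CR#101: add export dialog"),
    ("b2e4", "CR#102: fix sorting of findings"),
    ("c3d5", "CR#101: handle empty export"),
    ("d4c6", "update build script"),
]

grouped = group_commits_by_change_request(commits)
for cr_id, hashes in grouped.items():
    print(cr_id, hashes)
```

Commits without an ID, like the last one in the sample, end up in no group. Whether such commits should be reported separately is a design decision this sketch leaves open.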
Since Test Gap analysis works on the level of methods, the link between change request and commit is not enough. We need to establish a link between change request and each method that was touched during the commit.
The second part of Test Gap analysis is gathering execution information. This information is already collected on a method-granular level, so we do not need to modify anything on this end.
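Putting both parts together can be sketched as a simple set operation per change request. The following Python snippet is an illustrative sketch under simplifying assumptions: method names are plain strings, the mapping from change requests to changed methods is already available, and the timing of changes and test runs is ignored (a real analysis has to check that a method was executed after its last change). All names and data are hypothetical.

```python
def feature_coverage(changed_methods_by_cr, executed_methods):
    """Split the changed methods of each change request into tested and untested."""
    report = {}
    for cr_id, changed in changed_methods_by_cr.items():
        tested = changed & executed_methods
        test_gaps = changed - executed_methods
        # A change request without changed methods has no meaningful ratio.
        ratio = len(tested) / len(changed) if changed else None
        report[cr_id] = {
            "tested": tested,
            "test_gaps": test_gaps,
            "coverage": ratio,
        }
    return report


# Made-up input: changed methods per change request and executed methods.
changed_methods_by_cr = {
    "101": {"Export.open", "Export.write", "Export.close"},
    "102": {"Sorter.sort"},
}
executed_methods = {"Export.open", "Export.write", "Main.start"}

report = feature_coverage(changed_methods_by_cr, executed_methods)
for cr_id, entry in report.items():
    print(cr_id, sorted(entry["test_gaps"]), entry["coverage"])
```

In this made-up example, change request 101 has one remaining test gap (`Export.close`), while change request 102 has not been tested at all.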
Figure 2. Feature coverage treemap.
Figure 2 shows the change-request-specific results on a treemap: red and orange rectangles are test gaps within one specific change request. Light green rectangles correspond to methods that have been changed in the course of the change request but have been tested afterward. Finally, dark green rectangles depict other test coverage.
To further investigate feature coverage, we chose our tool Teamscale as a study subject. As the first step, Jakob calculated feature coverage for a random sample of 54 change requests.
How to Separate the Wheat From the Chaff?
In general, not all test gaps are equally relevant. While some test gaps are rather simple methods, like getters that just return a value, others comprise more complex logic, such as a sorting method. An important aspect of Test Gap analysis is that it allows us to make a conscious decision whether a specific piece of code should be tested or not. Typically, this decision is made based on the criticality of the method at hand.
To make Test Gap analysis as helpful as possible, it should not report test gaps that are not worth testing. Of course, what is worth testing will differ from project to project. But there may well be methods that would never be considered worth testing in any context. Those could be excluded automatically.
To investigate the relevance of the different test gaps we had found for the 54 change requests, Jakob asked the respective developers to rate them. Only around 22% of the test gaps were rated irrelevant. These included simple getters, methods with only one or two statements, and implementations of the toString() method. These methods could be filtered out automatically, allowing us to report more accurate test gaps.
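Such a filter could look roughly like the following Python sketch. The `Method` record, the statement threshold, and the getter heuristic (a name starting with `get` and a single statement) are assumptions chosen to mirror the categories named above. They are not the rules used in the study or in Teamscale.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Method:
    """Hypothetical minimal description of a method reported as a test gap."""
    name: str
    statement_count: int


def is_trivial(method, max_statements=2):
    """Heuristically decide whether a test gap is not worth reporting."""
    is_to_string = method.name == "toString"
    # Overlaps with the statement rule below; kept separate for readability.
    is_simple_getter = method.name.startswith("get") and method.statement_count == 1
    is_very_short = method.statement_count <= max_statements
    return is_to_string or is_simple_getter or is_very_short


# Made-up test gaps for illustration.
test_gaps = [
    Method("getName", 1),
    Method("toString", 5),
    Method("sortFindings", 12),
    Method("reset", 2),
]

relevant_gaps = [m for m in test_gaps if not is_trivial(m)]
print([m.name for m in relevant_gaps])
```

Since what counts as trivial differs between projects, a threshold like `max_statements` would most likely need to be configurable.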
The rest, around 78% of the test gaps, were rated relevant. Interestingly, a substantial number of test gaps were caused by refactorings. Since these are carried out by the IDE and do not change functionality, they could safely be ignored, as well.
The takeaway is twofold. First, a large share of the test gaps found are relevant, which means that not testing these methods bears a risk. Second, most of the test gaps that were rated irrelevant can be filtered out automatically.
How Test-Specific is Feature Coverage?
Before manual tests can be executed, the system under test typically has to be started or set up. When collecting execution information, this startup procedure already creates coverage. However, the generated coverage (we call it startup coverage) is independent of the actual test case. Consequently, it does not allow us to draw conclusions about the tested functionality itself.
Figure 3. Different test coverage types for a single change request.
Figure 3 illustrates the different types of test coverage for a single change request. The gray portion represents the methods that have been executed during startup. The light green portion depicts the methods that have been executed in a specific manual test of the change request. Red methods have not been executed at all.
If most methods already get executed during startup, the significance of feature coverage is limited. Therefore, one goal of Jakob’s thesis was to investigate how test-specific feature coverage is. To this end, he studied which portion of the coverage is independent of the manual test at hand, and therefore meaningless.
For our sample of 54 change requests, Jakob studied the portion of the change set that was executed simply by starting Teamscale compared to the portion executed when performing the associated manual test case.
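The comparison boils down to classifying each changed method into one of three groups. Here is a minimal Python sketch of that idea, assuming that we have one coverage recording of a plain startup and one of the manual test (which naturally includes the startup as well). The function and method names are hypothetical, as is the data.

```python
def classify_execution(changed, startup_executed, test_executed):
    """Classify the changed methods of one change request by how they were executed."""
    startup = changed & startup_executed
    # Executed during the test, but not already covered by merely starting the system.
    test_specific = (changed & test_executed) - startup_executed
    not_executed = changed - startup_executed - test_executed
    return {
        "startup": startup,
        "test_specific": test_specific,
        "not_executed": not_executed,
    }


def shares(classification):
    """Turn the classification into fractions of all changed methods."""
    total = sum(len(methods) for methods in classification.values())
    if total == 0:
        return {}
    return {kind: len(methods) / total for kind, methods in classification.items()}


# Made-up recordings for a single change request.
changed = {"A.init", "A.export", "A.validate", "A.cleanup"}
startup_executed = {"A.init", "Main.boot"}
test_executed = {"A.init", "Main.boot", "A.export", "A.validate"}

classification = classify_execution(changed, startup_executed, test_executed)
print(shares(classification))
```

In this made-up example, a quarter of the changed methods is startup coverage, half is test-specific, and a quarter remains unexecuted.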
Figure 4. Differentiation of method execution.
Figure 4 depicts how many methods have been executed specifically by a test case, how many have been executed by the startup procedure, and how many have not been executed at all—across all studied change requests.
In half of all studied cases, more than 75% of all methods were executed specifically by the test case. In contrast, the startup coverage was rather low. For 75% of the studied change requests we measured a startup coverage of under 18%.
The results show that feature coverage depends to a high extent on the actual test case. In fact, for many change requests, it was completely test-specific. This means that feature coverage provides meaningful insights about how thoroughly a feature has been tested.
Of course, results will differ slightly for other systems. However, our experience with Test Gap analysis at our customers makes us confident that the portion of test-specific coverage is similar. Moreover, creating startup coverage is simple for most systems, so the "noise" could be filtered out.
Jakob’s thesis shed light on the benefits and limitations of feature coverage. In this proof of concept, we learned that we can calculate feature coverage on most systems. We are also confident that feature coverage is test-specific in many cases and that irrelevant test gaps can be excluded systematically. This suggests that feature coverage provides meaningful insights about how thoroughly a feature has been tested.
Overall, we are confident that feature coverage adds value for our customers. It shows the test progress for change requests, which are functional, human-readable descriptions. This gives testers a better understanding of what their tests missed. We hope that feature coverage will also slightly simplify the selection of test cases necessary to close test gaps since it associates them with system functionality, for which test cases are typically written.
Given these positive results, we decided to implement feature coverage in Teamscale. It will come with one of the next releases, so stay tuned. I’ll keep you posted.