Finding meaning in manual tests
How do you assess the overall quality of your application when you have too many manual/functional acceptance tests to run them all after every sprint? Perhaps you’ve been working on an application for some time and want to predict when the quality will be good enough to ship.
(Here some will say, “We don’t need manual tests; we have unit tests for everything.” If your automated suites thoroughly test integration and fully exercise your UI, fine. Otherwise, we’ll assume that you need or want to augment your automated tests.)
One approach is to run all the manual tests for a functional area with each iteration. This is often coordinated with a push to fix bugs in the same area. It’s an efficient way to use testing resources, and when coordinated with a bug sweep, it helps you find the things you broke when you swept.
Be aware, though, that it tells you little about the quality of the entire application. A different approach, which can be used in combination with a focused testing effort, is to select a set of tests at random and execute those.
- Select a different random set of tests to run with each iteration.
- Execute each test and record whether it passes or fails.
- Calculate the overall pass rate for the tests you ran.
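The steps above can be sketched in a few lines of Python. This is only an illustration: the test IDs, the sample size of 50, and the recorded outcomes are all made up, and in practice a tester would fill in the results by hand.

```python
import random

def sample_tests(test_ids, sample_size, seed=None):
    """Pick sample_size distinct tests uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(list(test_ids), sample_size)

def pass_rate(results):
    """results maps a test id to True (pass) or False (fail)."""
    if not results:
        raise ValueError("no results recorded")
    return sum(results.values()) / len(results)

# Hypothetical suite of 1,000 documented test cases.
all_tests = [f"TC-{n:04d}" for n in range(1, 1001)]

# Draw a fresh sample each iteration (use a different seed, or none at all).
this_iteration = sample_tests(all_tests, 50, seed=42)

# Placeholder outcomes: the first 5 sampled tests are marked as failures,
# the rest as passes, purely so there is something to calculate.
results = {test_id: i >= 5 for i, test_id in enumerate(this_iteration)}

print(f"pass rate: {pass_rate(results):.0%}")
```

Sampling without replacement matters here: running the same test twice in one iteration would add no information.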
Easy. Now what do you do with the failing tests? In terms of learning about your application, it doesn’t matter whether you fix the issue or not - but it’s essential that if you do fix it, you don’t change the recorded result. The test still counts as a failure in that iteration’s pass rate; rewriting it just pollutes your data.
Let’s say 90% of your sample tests pass. Can you assume that 90% of the tests you didn’t run would also pass? Not necessarily. What’s cool about sampling is that it tells you how much to trust your results.
How many tests is enough?
To know how many tests to run for a given level of precision, you can use a sample size calculator like the one at http://www.surveysystem.com/sscalc.htm.
To calculate sample size, you have to provide some guidance. First, tell it how many tests you’re sampling from. This is your population. Let’s assume you have 1,000 tests.
Next, select a confidence interval (what statisticians usually call the margin of error) of, say, plus or minus 5%. If your sample tests pass at 90%, you can now say the pass rate for all tests (run and not run combined) is probably 90% plus or minus 5%, i.e. between 85% and 95%.
Note that I said “probably.” To be more specific, select a confidence level (usually 95% or 99%). If you pick 95%, you can now be very specific about what “probably” means: “I’m 95% sure the pass rate for all tests is between 85% and 95%.” Or rather, you could say that if your sample size is big enough. In this case, the calculator shows that you’d have to run 278 randomly selected tests for that level of precision.
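If you’d rather not depend on a web page, the same figure can be reproduced with the textbook sample size formula plus a finite population correction. The sketch below assumes that is roughly what the calculator does, including its worst-case assumption that the true pass rate is 50%; the function name and parameters are my own.

```python
import math
from statistics import NormalDist

def required_sample_size(population, margin, confidence=0.95, p=0.5):
    """Tests to run for a given margin of error and confidence level.

    p is the assumed true pass rate; 0.5 is the most conservative choice
    because it maximizes p * (1 - p) and therefore the sample size.
    """
    # Two-sided z-score for the confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite population.
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Finite population correction: a small suite needs fewer tests.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

print(required_sample_size(population=1000, margin=0.05))
```

For 1,000 tests, a margin of plus or minus 5% and a 95% confidence level, this gives the same 278 as the calculator.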
The moral of the story
If that seems like a lot of tests for little precision, then you’ve uncovered the most important lesson here. Think about how many times you’ve seen someone take a similar pass rate from an even smaller, non-random sample and act as if it were perfectly accurate: “Last month we passed 91%, this month we passed 90%. Why are we getting worse?”
If you’re using sampling, you know that a difference that small is probably meaningless. The real value of being precise about the limits of your knowledge is that it can keep you from chasing random fluctuations and making things worse. The way to judge your improvement is to wait until you have a handful of results, plot them and look for trends.
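To put a number on “probably meaningless,” you can work backward from the sample size to the margin of error and check whether a change falls inside it. The sketch below reuses the figures from the example (278 tests sampled from 1,000) and the same conservative 50% assumption; the helper names are hypothetical, and the comparison is a rough screen rather than a formal significance test.

```python
import math
from statistics import NormalDist

def margin_of_error(sample_size, population, confidence=0.95, p=0.5):
    """Half-width of the confidence interval for a sampled pass rate."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    # Finite population correction for sampling without replacement.
    fpc = math.sqrt((population - sample_size) / (population - 1))
    return z * standard_error * fpc

def is_within_noise(rate_a, rate_b, margin):
    """True if two pass rates differ by no more than the margin of error."""
    return abs(rate_a - rate_b) <= margin

margin = margin_of_error(sample_size=278, population=1000)
print(f"margin of error: +/- {margin:.1%}")
print("91% vs 90% within noise:", is_within_noise(0.91, 0.90, margin))
```

A one-point drop sits well inside a five-point margin, so on its own it says nothing about whether quality changed.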
The frame is not the universe
Before ending, we should be clear about one more thing: All we really know is the pass rate for our tests. We’ve been making an implicit assumption that the suite would provide an accurate measure of overall quality if we ran every test in it. That remains to be proven.
If you think of each test as exercising a particular path through the application, then some terms from sampling theory can help make the remaining limits of our understanding clearer:
Universe: What we really want to measure. In this case, perhaps the application’s quality over the set of all possible user paths.
Frame: The set of accessible paths from which we draw our sample. In this case, the list of all paths we’ve documented as unique test cases.
Sample: A randomly selected subset of the frame.
In software terms, we need to understand the coverage our test suite provides. There are numerous ways we can define coverage, but that’s a subject for another day.