Hypothesis-based development
I think that agile development works because it is the application of the scientific method to software development.
A fundamental aspect of that is the importance of forming a hypothesis before you start, so that you can understand the results that you observe. By this I don’t mean some grand-unified-theory kind of hypothesis at the beginning of a project, but the small, everyday, fine-grained use of predictions of outcomes before you commence some software development activity.
I have adopted this approach over the years and I think it has probably become a fairly ingrained part of my approach to things. It is probably most obvious when we are debugging. When working with a pair, or small group of people, discussing a bug I find myself regularly pausing and restating what we know to be facts, and discussing what our current theory is. The outcome of this is invariably that we can more clearly identify the next experiment that will move our understanding on a step or two.
But it is more than that. I always run a failing test before writing the code that will make it pass, and almost always, before running it, I state out loud the nature of the failure that I expect to see. That is, I state my hypothesis (the nature of the failure), then I carry out the experiment (run the test), then I observe the results (look at the results of the test) and see if they match my hypothesis. Often they don’t, which means I have got the test wrong, so I correct it until I see the failure that I expect. This is the main reason that we run the test before we write the code.
Ok, ok, I know that both of these examples are a bit simplistic, but this is an enormously powerful approach to problem solving. In fact it is undisputedly the most proven model of problem solving in human history.
My bet is that as I describe it in my simple examples here, everyone who reads this is nodding sagely and thinking “of course, that is how we always work”. But is it really? My observation is that this is actually a fairly uncommon approach. Certainly agile development in general and TDD in particular apply a gentle pressure in this direction, but it has been my experience that, all too often, we wander around problems in loose, unstructured ways. We often randomly prod at things to see what happens. We frequently jump to conclusions and hare off implementing solutions that we don’t know that we need.
We had a good example of this at work today. We have recently made some improvements in some of our high-performance messaging code. We put this into the system at the start of the iteration to give us time to see if we had introduced any errors. Our continuous delivery system, which includes some sophisticated functional and performance tests as part of our deployment pipeline, found no problems for the whole iteration, until today.
Today is the end of our iteration (I’ll explain why we finish on Wednesdays another time), so we had taken our release candidate for the iteration and it was undergoing final checks before release. This morning one of our colleagues, Darren, told us at stand-up that he had seen a weird messaging failure on his development workstation when running our suite of API acceptance tests. He had apparently seen a thread that was blocked in our underlying 3rd-party pub-sub messaging code. He tried to reproduce it and could, but only on that particular pairing-station. Hmmmm.
Later this afternoon, we had started work on the new iteration. Almost immediately our build grid showed a dramatic failure with lots of acceptance tests failing. We started exploring what was happening and noticed that one of our services was showing a very high CPU load – unusual, our software is generally pretty efficient. On further investigation we noticed that our new messaging code was apparently stuck. Damn! This must be what Darren saw – clearly we have a problem with our new messaging code!
We reacted immediately. We went and told the business that the release candidate that we had taken might not be ready for release. We asked the QA folks who were doing their final sanity checks before the release to wait until we had finished our investigation before making any decisions. We started to think that we might have to take a branch, something that we generally try to avoid, and back out our messaging changes.
We did all of this before we stopped and thought about it properly. “Hang on, this doesn’t make any sense, we have been running this code for more than a week and we have now seen this failure three times in a couple of hours.”
So we stopped and talked through what we knew, collected our facts; we had upgraded the messaging at the start of the iteration; we had a thread-dump that showed the messaging stalled; so had Darren, but his dump looked stalled in a different place; we had been running all of these tests in our deployment pipeline repeatedly and successfully for more than a week, with the messaging changes. At this point we were stuck. Our hypothesis, failing messaging, didn’t fit the facts. We needed more facts so that we could build a new hypothesis. We started where we usually start, but had omitted to earlier because the conclusion looked so obvious. We looked at the log files. Of course, you have guessed, we found an exception that clearly pointed the finger at some brand new code.
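The mismatch between our first hypothesis and the facts can be sketched in a few lines of Python. This is only an illustration: the `Run` records, the day numbers and the helper `green_runs_after` are all invented, not taken from our real pipeline. The idea is that a change followed by a long run of green builds is a weaker suspect than a change followed immediately by failures.

```python
from dataclasses import dataclass


@dataclass
class Run:
    day: int       # hypothetical: days since the start of the iteration
    passed: bool   # did the pipeline run pass?


def green_runs_after(change_day, runs):
    """Count the passing runs between a change and the first failure after it."""
    later = sorted((r for r in runs if r.day >= change_day), key=lambda r: r.day)
    count = 0
    for run in later:
        if not run.passed:
            break
        count += 1
    return count


# Invented data: green on days 0 to 7, then three failures on day 8.
runs = [Run(day=d, passed=True) for d in range(8)]
runs += [Run(day=8, passed=False) for _ in range(3)]

messaging_change_day = 0  # the messaging change went in at the start of the iteration
recent_commit_day = 8     # hypothetical: the day the brand new code was committed

green_after_messaging = green_runs_after(messaging_change_day, runs)
green_after_recent = green_runs_after(recent_commit_day, runs)
print(green_after_messaging, green_after_recent)
```

With this made-up data the messaging change is followed by eight green runs and the recent commit by none, which is the same shape as the facts we had in front of us. A count like this does not prove anything on its own, since intermittent bugs can hide for a while, but it does show which hypothesis fits the facts badly.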
To cut a long story short, the messaging problem was a symptom, not a cause. We were actually looking at a thread dump of messaging code that was in a waiting state and working as it should. What had really happened was that we had found a threading bug in some new code. It was obvious and simple to fix, and we would have found it in 5 minutes with no fuss if we hadn’t jumped to the conclusion that it was a messaging problem. In fact we did fix it in 5 minutes, once we stopped to think and built our hypothesis based on the facts that we had. It was only when we stopped that we realized that the conclusions that we had jumped to really didn’t fit the facts. It was this and this alone that prompted us to go and gather more facts, enough to solve the problem that we had rather than the problem that we imagined that we had.
We have a sophisticated automated test system and yet we ignored the obvious. It was obvious that we must have committed something that broke the build. Instead we joined together various facts and jumped to the wrong conclusion because there was a sequence of events that led us down the path. We built a theory on sand, not validating as we went, but building new guesses on top of old. It created an initially plausible, seemingly “obvious” cause – except it was completely wrong.
Science works! Make a hypothesis. Figure out how to prove or disprove it. Carry out the experiment. Observe the results and see if they match your hypothesis. Repeat!