Let’s pretend that we’re in the house building industry (c’tors Inc.). One day, while you’re getting fresh air and working hard, the building inspector comes along, climbs to the top of our not-yet-complete structure, yells “There’s something wrong with the left side of the building!” and then goes away. As a pretend construction professional, what do you think are the chances that someone would fix the problem?
If the scenario above sounds crazy to you, that’s okay. Unfortunately, I see it unfold daily.
Most companies these days have some kind of automatic build process (and I use the term loosely). Files get checked in, submitted, or pushed (all of the above?) to source control, and the server tries its best to build the new source (and maybe run some tests) according to a preconfigured trigger: anything from immediately to the next day. The problem starts when that process fails. At that point there are two possible outcomes: someone fixes the problem and gets the build passing again (a.k.a. green), or everybody ignores that server for a long time, which could become forever.
I’ve usually noticed that when software developers ignore the broken build they do not do so out of malice or laziness.
Unfortunately, a broken build means that although someone (perhaps yourself) took the time to automate parts (or all) of the build/test process, all of that hard work is wasted because no one will fix the build. I’ve noticed that when the build system is left broken for a long time, it happens due to one of the following reasons:
- No or little build visibility
- Lack of knowledge
- No definition of individual responsibility
Ideally, every relevant member of the team must know when a build fails. Better yet, all of the company should have easy access to the current build state.
Consider the following:
- All of the team has access to the build server by URL
- Email is sent to the relevant person when a build fails
- 60-inch screen in the middle of the dev room shows the current build status
- When a build fails a big red light mounted in the dev room or hallway blinks
- When a build fails a picture of the person who broke the build shows up on every screen in every conference room
I think the last one is going too far, but you get the point.
If you think that installing a build server and making the URL available for the whole company is good enough – I’ve got news for you. People are way too busy to go to that URL and try to understand what the build server is showing them. Adding email notifications in case of failure is also a good idea but not sufficient – after a few of those, some (read: most) developers learn to ignore them. If you add email notifications on a successful build, you’ll only make this process (of ignoring builds) happen faster.
A failing build should be visible and impossible to ignore
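To make the point about notification fatigue concrete, here is a minimal sketch of a policy that only sends email when the build changes state, while a light (or big screen) stays red for as long as the build is broken. The names `BuildState`, `email_for` and `light_is_on` are made up for illustration and do not belong to any particular build server.

```python
from enum import Enum
from typing import Optional


class BuildState(Enum):
    GREEN = "green"
    RED = "red"


def email_for(previous: BuildState, current: BuildState) -> Optional[str]:
    """Return an email subject only when the build changes state.

    Staying green (or staying red) sends nothing, so the inbox is not
    flooded with messages that people learn to ignore.
    """
    if previous is BuildState.GREEN and current is BuildState.RED:
        return "Build broken"
    if previous is BuildState.RED and current is BuildState.GREEN:
        return "Build fixed"
    return None


def light_is_on(current: BuildState) -> bool:
    """The hypothetical red light stays on for as long as the build is red."""
    return current is BuildState.RED
```

The email is the nudge and the light is the part nobody can file away in a folder. Whether this split suits your team is a judgment call.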
At one company I worked with, some developers didn’t even know what the build URL was, and had no idea how to find out why the build had just failed…
Another important factor is how easy or how hard it is to discover why the build has failed. Not all build servers were created equal – some do a better job of showing the root cause of the failure and some require reading 10 pages of logs. My point is that fixing a broken build happens when you need to do something else (developing software), and as such should be as simple and painless as possible.
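If your build server is one of those that make you read 10 pages of logs, even a tiny script can help. The following is a rough sketch, assuming the log is plain text and that failures are marked by words such as "error", "failed" or "exception". That pattern is a guess and will need adjusting to your own tools, and `first_errors` is a made-up helper name.

```python
import re
from typing import List, Tuple

# Assumed failure markers; adjust to whatever your compiler and test runner print.
ERROR_PATTERN = re.compile(r"\b(error|failed|exception)\b", re.IGNORECASE)


def first_errors(log_text: str, limit: int = 3) -> List[Tuple[int, str]]:
    """Return up to `limit` (line number, line) pairs that look like failures."""
    hits: List[Tuple[int, str]] = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if ERROR_PATTERN.search(line):
            hits.append((number, line))
            if len(hits) >= limit:
                break
    return hits
```

Putting those few lines at the top of the failure email or on the status page saves everyone from scrolling through the full log just to find out what went wrong.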
This is usually a problem if the build script does too many things. Let’s go back to our imaginary scenario, where the building inspector shouts about a problem in one of the building’s components – and I’m not familiar with that component, or I don’t have the right expertise to fix that particular problem. In that case I’m going to continue working as if nothing happened – or go and grab a cup of coffee until the problem resolves itself.
The problem with big build scripts that do a lot of things is that it’s hard to tell why a specific step (or 100 tests) has just failed, and then everyone on the team gets a bad case of “it’s somebody else’s problem”.
After fixing the visibility problem we know the build has failed, and with some investigation we can tell why – and yet it does not matter if the problem domain is so complex that no one knows how to understand the reason for the failure.
The right solution is to try and split the build into several individual builds where each team (and each team-member) knows exactly where their responsibility (development wise) starts and ends.
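As a rough illustration of that split, the sketch below models each individual build as a small record with a name, an owning team and a function that runs it, and then groups the failures by owner. `Build`, `failures_by_owner` and the team names are all invented for this example; a real build server would express the same idea in its own configuration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Build:
    name: str
    owner: str                # team responsible for keeping this build green
    run: Callable[[], bool]   # returns True when the build passes


def failures_by_owner(builds: List[Build]) -> Dict[str, List[str]]:
    """Run every build independently and group the failed ones by owning team."""
    failures: Dict[str, List[str]] = {}
    for build in builds:
        if not build.run():
            failures.setdefault(build.owner, []).append(build.name)
    return failures


# Usage with stand-in build functions:
builds = [
    Build("ui", "frontend team", lambda: True),
    Build("api", "backend team", lambda: False),
    Build("db-migrations", "backend team", lambda: False),
]
report = failures_by_owner(builds)
```

With a report like this, a red build is no longer “somebody else’s problem”: it names the team whose part is broken.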
At the heart of a healthy process lies personal responsibility and integrity.
When a build fails, the last person to commit code is responsible for making sure that the build passes again as quickly as possible. Anyone affected by this failure is responsible for not making the problem worse by blindly committing more code, and for helping if asked. Simple as that. This kind of personal integrity can only be achieved if the build failures are visible and easy to investigate. Some teams need a manager to tell them so and some need a simple reminder from time to time. It usually helps if there is someone who is passionate about the build, although this is a team effort, not “Joe’s” problem.
I would avoid shaming (e.g. showing the build breaker’s name on all conference screens), and instead try to understand why people don’t care that the build is broken. Usually it has something to do with one of the previous points and not with a lack of commitment.
A broken build is not a pretty sight and should be fixed as quickly as possible. The good news is that it’s easily solved with the proper tools, education and plain old nagging, as long as you take the time to understand the reasons that other talented developers seem content to leave it broken.
Try it out – you might be surprised to find out that you’re not the only one who cares.