The worst bug I’ve ever tracked down and fixed was a system freeze hidden in some 300,000 lines of code. It only showed up when the device was left untouched for about an hour (typically a lunch break) while mounted in a grader and connected to a high-precision GPS. I only had a few days to find and fix it.
The device used GPS measurements to automatically control the height and angle of the grader’s blade by connecting to the grader’s hydraulic system. We’re talking about a system where the data sent out actually did things in the physical world instantaneously.
We had been working on the system for nearly a year, and it was during the final field tests that the bug was found. It only occurred once every few days, but that was frequent enough to block the release of the product. The big problem was that we were never able to reproduce the bug in a lab with a debugger attached. It only occurred when the device was wired to a proper million-dollar machine. The GPS receivers alone cost tens of thousands of dollars.
The code base was large and completely multithreaded. The only thing I could do was start reading.
I started at the core GUI message pump and tried to follow every code path to find out where the GUI thread could possibly get stuck. After a few days of digging, I found that someone had created an extra message pump in a section of the GUI that displayed the results of some background tasks. It was a hack to keep the background tasks running and reporting status so that the GUI could be updated. Once I removed it and reworked that section into a proper message-passing design, we no longer experienced any hangs.
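The original code is not shown here, so what follows is only a minimal Python sketch of the general idea behind a message-passing design, not the actual fix. All names (`background_task`, `run_message_loop`, `demo`) are hypothetical, and a plain `queue.Queue` stands in for whatever message queue the real GUI toolkit provided. The point it illustrates: the background tasks only post status messages, and a single loop is the only place that dispatches them, so there is no second, nested pump that could get stuck.

```python
import queue
import threading


def background_task(task_id, steps, outbox):
    """Hypothetical worker: posts status messages and never touches the GUI."""
    for step in range(1, steps + 1):
        outbox.put((task_id, step, steps))
    # A step of None is the sentinel meaning "this task has finished".
    outbox.put((task_id, None, steps))


def run_message_loop(inbox, pending_tasks, render):
    """The single message loop: the only place where messages are dispatched."""
    while pending_tasks:
        task_id, step, total = inbox.get()  # blocks until a message arrives
        if step is None:
            pending_tasks.discard(task_id)
            render(f"task {task_id}: done")
        else:
            render(f"task {task_id}: {step}/{total}")


def demo():
    """Run two workers and collect what the 'GUI' would have displayed."""
    inbox = queue.Queue()
    displayed = []
    workers = [
        threading.Thread(target=background_task, args=(task_id, 3, inbox))
        for task_id in (1, 2)
    ]
    for worker in workers:
        worker.start()
    # Here 'render' is just list.append; a real GUI would update a widget.
    run_message_loop(inbox, {1, 2}, displayed.append)
    for worker in workers:
        worker.join()
    return displayed
```

Calling `demo()` returns the status lines in the order the loop handled them. Messages from the two workers may interleave, but each worker's own messages stay in order because the queue is first-in, first-out.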
The project manager was happy too: he got another bug to mark as fixed in his Excel sheet, although he complained that I had taken a long time to fix mine compared to the other bugs, which were mostly fix-a-typo or move-a-widget-2px-left style bugs…