Until I started working on the uSwitch energy website around 8 months ago I had not really done any support of a production system so I learnt some interesting lessons in my time there.
Look at the new code first
We had our application wired up to Airbrake so whenever a user did anything which resulted in an exception being thrown we received a report with the stack trace, environment variables and which page they were on.
When trying to work out what had happened I initially started from scratch and tried to work backwards from the source and create a scenario in my head of what they might have done to get that error.
After a few times of doing this it became clear that there was a reasonable chance that if a user was experiencing a problem it was probably because of some new code that we’d just introduced.
We therefore tweaked our bug hunting algorithm to initially check code that had been changed recently and only after ruling that out did we work back from first principles.
It may never have worked
Sometimes it became clear that new code wasn’t to blame but it seemed implausible that the error could have actually happened.
There was a tendency to assume that the user must be deliberately doing something to make the application break but it soon became clear that they had just managed to hit a code path that had not been hit before.
Even if you’ve done extensive testing on a system users still seem to find paths through the code that haven’t failed previously so it seems best to just assume that is going to happen at some stage.
Log all the things
As I mentioned earlier we were using a 3rd party service to collect errors and other helpful information which was really useful for helping us find the root cause of problems.
The type of logging that you need varies so for a product like neo4j as well as logging exceptions we also log system information and memory settings.
Obviously I’m quite new to this type of work so I’m sure others will have useful bits of advice to share as well.