You are probably aware that you need to monitor your production systems for errors, and to add health monitoring for your servers.
But are you monitoring negative events? What is a negative event? Something that should have happened and didn’t.
For example, every week you have a process that runs to update the tax rates that apply to your customers. This is implemented as a scheduled process, but for some reason (the computer was being rebooted, the user’s password expired, etc.) that process didn’t run. There isn’t an error, per se. You won’t get an error because nothing had a chance to actually happen.
Another example would be getting a callback confirmation that an order payment has been correctly processed. That usually happens within 1 to 5 minutes, and you get an OK/Fail notification. But what happens if that notification just never comes?
This is a much more dangerous scenario, because you not only have to be prepared to handle errors, you also have to be prepared for… nothing to happen.
What it means is that you need some way to set up expectations in the system, and to act on them when you don’t get a confirmation (negative or positive) within a given time frame.
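To make that a bit more concrete, here is a minimal sketch of what setting up expectations could look like. It is only an illustration, not a production design: the `ExpectationTracker` class, its method names, and the keys used below are all hypothetical, and a real system would persist the deadlines somewhere that survives a restart rather than keep them in memory.

```python
import time


class ExpectationTracker:
    """Hypothetical in-memory tracker for events that must be confirmed in time."""

    def __init__(self):
        # key -> deadline, in seconds on the same clock as the `now` arguments
        self._deadlines = {}

    def expect(self, key, timeout_seconds, now=None):
        """Register that `key` must be confirmed within `timeout_seconds`."""
        now = time.monotonic() if now is None else now
        self._deadlines[key] = now + timeout_seconds

    def confirm(self, key):
        """Record a confirmation. OK and Fail both count: something happened.

        Returns True if the expectation was still pending, False otherwise.
        """
        return self._deadlines.pop(key, None) is not None

    def overdue(self, now=None):
        """Return (and forget) every expectation whose deadline has passed."""
        now = time.monotonic() if now is None else now
        late = sorted(k for k, deadline in self._deadlines.items() if deadline <= now)
        for key in late:
            del self._deadlines[key]
        return late


if __name__ == "__main__":
    week = 7 * 24 * 3600
    tracker = ExpectationTracker()

    # Assumed keys and timeouts, loosely based on the two examples above.
    tracker.expect("payment:order-1001", timeout_seconds=5 * 60, now=0)
    tracker.expect("job:weekly-tax-rates", timeout_seconds=week, now=0)

    # The payment callback arrives; the weekly job never reports in.
    tracker.confirm("payment:order-1001")

    print(tracker.overdue(now=10 * 60))   # prints []
    print(tracker.overdue(now=week + 1))  # prints ['job:weekly-tax-rates']
```

The important part is that something has to call `overdue()` on a schedule and raise an alert for whatever it returns. That checker is itself a scheduled process, so it needs to be watched as well, ideally from a different machine.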