On one of my recent projects, I was added to a distribution list that receives alerts from our monitoring system. For the first few days I tried to read some of those notifications, but then one day I opened my email to find it flooded with about 500 messages. Some of them were more or less the same message arriving every few seconds or minutes. Over the course of a few weeks, I received 100 or more messages every day, most of them while I was asleep.
The most interesting things I noticed were:
- Most of these alerts, while they are warnings, don’t really threaten to bring any of our services down anytime soon. Some aren't even acted upon, and the underlying issues recover on their own.
- The team only acts on a couple of these; everything else is more like noise.
- Inboxes get flooded during the night while our core support team is asleep, so the team has no way of knowing whether something is about to fail.
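As a rough illustration of how much of this noise is plain repetition, here is a minimal Python sketch that lets the first alert for a given source and message through and suppresses identical repeats for a while. The `Alert` fields, the `suppress_repeats` name, and the 30-minute window are all assumptions made up for this example; none of them come from our monitoring setup.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Alert:
    source: str          # hypothetical: the service or host that raised the alert
    message: str         # hypothetical: the alert text
    raised_at: datetime  # when the monitoring system raised it


def suppress_repeats(alerts, window=timedelta(minutes=30)):
    """Keep an alert only if the same (source, message) pair has not been
    let through within the last `window`. The 30-minute default is an
    arbitrary assumption, not a recommendation."""
    last_sent = {}  # (source, message) -> time the last copy was let through
    kept = []
    for alert in sorted(alerts, key=lambda a: a.raised_at):
        key = (alert.source, alert.message)
        previous = last_sent.get(key)
        if previous is None or alert.raised_at - previous >= window:
            kept.append(alert)
            last_sent[key] = alert.raised_at
    return kept
```

Feeding it a night's worth of identical "disk at 85%" alerts would leave one message per window instead of one per minute. It does nothing about whether the alert deserved to be sent in the first place, which is the bigger question.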
The whole situation is like jumping into my car and finding every light on the dashboard brightly lit, every single time, to the point that one day I stop caring. Eventually something will fail. I just hope it’s not on the day I am driving somewhere in an emergency.
When I reached out to the team and explained my issue with our notification strategy, the prompt response was to create a new DL, which I believe will become the go-to list for all notifications. Yes, I will receive fewer emails, maybe none, but it solves nothing.
This is a huge symptom, and if you see it in your organization, you should wonder whether the team is really on top of knowing when something is going to fail. Or are you relying on a system that sends a notification for everything it thinks is wrong and leaves a bunch of humans to decide what to act on? You also can't avoid the fact that many of these notifications travel over a channel that has no way to “push” an issue to a user's attention.
Think of a car dashboard with all these lights sitting not in front of the driver, but in the glove box. Someone would have to open the glove box to see whether a light is on. The light may be on for hours before anyone realizes something’s gone wrong.
I don’t have a technical solution in place for my project yet, but the analogy I will leave you with is this: think about what a notification/alerting system should look like.
- Should your car’s dashboard light up with a green indicator when something has happened?
- Green and soft clicking sounds: eventually the driver will see it and turn it off, but you don’t want to alarm the driver, since nothing detrimental is happening.
- Should it light up in yellow, like a warning? My car lights up a fuel warning as soon as the level is dangerously low. I can still drive 80-100 km depending on how I drive, which is more than enough for me to eventually see it and get to a refueling station.
- Should it flash a bright red, like "doors open"? You wouldn't want to drive your car with the doors or hood open, hence a bright red warning, sometimes accompanied by a few sounds.
- Or should it beep every few seconds? I like how my car alerts me every few seconds when I don't have my seat belt on or when I drive over 120 km/h. It’s like being reminded every 10 seconds that something could go fatally wrong.
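To make the dashboard tiers a little more concrete, here is a minimal Python sketch of how they could map onto alert routing. It is only an illustration under stated assumptions: the severity names, the channel names (`dashboard`, `email-digest`, `page-on-call`), the two yes/no inputs to `classify`, and the 10-minute repeat interval are all hypothetical and not taken from any real system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    INFO = "green"      # something happened, nobody needs to be alarmed
    WARNING = "yellow"  # needs attention eventually, like a low-fuel light
    CRITICAL = "red"    # act now, like "doors open" or the seat-belt beep


@dataclass(frozen=True)
class Route:
    channel: str                         # hypothetical channel name
    repeat_every_seconds: Optional[int]  # None means notify once, no nagging


# Assumed mapping: only red alerts interrupt a human and keep repeating.
ROUTES = {
    Severity.INFO: Route("dashboard", None),
    Severity.WARNING: Route("email-digest", None),
    Severity.CRITICAL: Route("page-on-call", 600),
}


def classify(will_self_recover: bool, threatens_outage: bool) -> Severity:
    """Toy classification built on two hypothetical yes/no questions."""
    if threatens_outage:
        return Severity.CRITICAL
    if will_self_recover:
        return Severity.INFO
    return Severity.WARNING


def route(severity: Severity) -> Route:
    """Look up where an alert of this severity should be delivered."""
    return ROUTES[severity]
```

The point of the sketch is the shape, not the values: the decision about what deserves a human's attention is made once, up front, instead of being left to whoever happens to open the inbox.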
How this will translate for me and my project team is something I don’t know yet. But as we go about fixing this, I will post it here. What have you done to address your alerting strategy, or is it still all dashboard lights flashing all the time?