Ten Questions to Ask After a Network Outage
So your network just went down and you got it back up. We take a look at ten questions you can answer to assess what happened and how to prevent it from happening again.
A network outage can be disruptive and expensive, so you want to eliminate the causes. This checklist of the top 10 questions to ask after a network outage can help you identify the reasons for an outage and expose any gaps in your monitoring. Ideally, you'll learn from an outage to prevent others in the future. In addition to outages, your business may also experience network slowdowns or frequent bottlenecks that can be very disruptive. Here are the 10 questions to ask after a network outage or severe performance loss.
1. How was the outage detected?
The first step is to identify how you found out about the outage or network slowdown. If you got an alert immediately after the network went down, that's a good sign that your monitoring systems are in place and functioning properly. Even better would be if you received an alert before the system went down. The worst-case scenario is to receive a call from a customer or user who can't access the applications they need. Your network monitoring system should send root cause alerts that identify the precipitating event, rather than flooding you with hundreds of alerts that force you to search for the root cause.
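To make the idea of a root cause alert concrete, here is a minimal Python sketch of one way a monitoring tool might suppress downstream noise. It assumes you keep a map from each device to the device it depends on. The device names, the `upstream` map and the `root_cause_alerts` function are hypothetical and not taken from any particular product.

```python
def root_cause_alerts(down_devices, upstream):
    """Return only the down devices whose upstream parent is still up.

    down_devices: iterable of device names currently reported as down.
    upstream: dict mapping a device to the device it depends on
              (None or missing for a device with no parent).
    """
    down = set(down_devices)
    # A device is treated as a likely root cause when its parent is not
    # also down; everything below a failed parent is suppressed.
    return sorted(d for d in down if upstream.get(d) not in down)


# Hypothetical topology: a web server behind an access switch behind a core switch.
upstream = {
    "core-sw": None,
    "access-sw-1": "core-sw",
    "web-01": "access-sw-1",
    "db-01": "core-sw",
}

print(root_cause_alerts(["access-sw-1", "web-01"], upstream))  # ['access-sw-1']
```

Real networks have redundant paths and multiple dependencies, so treat this as an illustration of the principle rather than a complete correlation engine.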
2. Were the right people notified quickly?
The next item on your checklist should be identifying the people who received the alerts. A broadcast message that goes to everybody on the IT team is intrusive and counterproductive. You want a network monitoring solution that can identify the source of the issue and notify the responsible person. For example, it makes no sense to notify your hardware specialist if malware caused the outage. Isolating the problem to its root cause is important to a fast resolution. The person notified can then call in others for assistance if necessary.
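As a rough illustration of targeted notification, the sketch below routes an alert to one on-call owner based on its category instead of broadcasting it. The categories, the contact names and the `route_alert` function are assumptions made up for this example, so adapt them to however your own team divides responsibility.

```python
# Hypothetical mapping from alert category to the responsible on-call contact.
ROUTES = {
    "hardware": "hardware-oncall",
    "security": "security-oncall",
    "network": "network-oncall",
}


def route_alert(category, routes=ROUTES, fallback="duty-manager"):
    """Return the single contact who should receive an alert of this category.

    Unknown categories go to a fallback contact rather than to everyone.
    """
    return routes.get(category, fallback)


print(route_alert("security"))  # security-oncall
```

The person who receives the alert can still pull in colleagues, but the first notification lands with one accountable owner.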
3. What caused the outage?
If your network monitoring system notified you and identified a root cause of a slowdown or outage, you're ahead of the game. But once you're back up and running, make sure that the cause was properly identified. Sometimes you may think you have one problem when you actually have two. Or even three. Or an entirely different problem. Trace the problem back to its source so you can be sure you take the right corrective actions.
4. Is there malware on my network?
Any time you have a network outage, you should suspect malware. Study your intrusion detection logs and run a full system scan with updated virus definitions. You may also want to add security measures to protect against 'non-malware attacks', an increasing concern. According to research published in Small Business Trends, the number of ransomware attacks is up at least 500 percent this year. Don't get caught. Educate users and ensure you have good backups and procedures for isolating infected devices.
5. Has any of my hardware been damaged?
If hardware caused the outage, make sure you replace the failed device. Even if a reboot brings it back, it may simply fail again. And even if a device didn't cause the problem, it may have been damaged during the shutdown. Run diagnostics on everything to be sure.
6. How long will it take for my systems to come back?
You should have a disaster recovery plan that can help you with predefined procedures for bringing the network back up. That can help you estimate how long you'll be down or not operating at peak capacity. A big portion of the recovery time will be your backup and restore procedures, so make sure you always have a good backup plan for data.
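For a back-of-the-envelope recovery estimate, you can divide the amount of data to restore by the restore throughput you have actually measured and then add the fixed steps from your plan. The Python sketch below does that arithmetic. The function name, the figures and the use of decimal units (1 GB = 1000 MB) are assumptions for illustration only.

```python
def estimate_restore_hours(data_gb, throughput_mb_s, fixed_minutes=0):
    """Estimate recovery time in hours.

    data_gb: amount of data to restore, in gigabytes (1 GB = 1000 MB here).
    throughput_mb_s: sustained restore throughput in megabytes per second.
    fixed_minutes: time for steps that do not depend on data size,
                   such as rebooting hardware or running validation checks.
    """
    transfer_seconds = data_gb * 1000 / throughput_mb_s
    return transfer_seconds / 3600 + fixed_minutes / 60


# Hypothetical case: 720 GB restored at 100 MB/s, plus 30 minutes of fixed steps.
print(estimate_restore_hours(720, 100, fixed_minutes=30))  # 2.5
```

Use throughput numbers from a real restore test where you can, since quoted hardware speeds are usually optimistic.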
7. Did we lose any data?
Depending on the cause of the outage, your data may have been affected. With ransomware or other malware, your data is almost certainly corrupted to some degree. If it was a mechanical failure, validate that all in-process transactions were completed. Check your logs and roll back or re-enter any transactions that didn't make it.
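One simple way to find transactions that didn't make it is to compare the IDs your application logged as started with the IDs your database recorded as committed. The Python sketch below shows the comparison. It assumes both logs expose a transaction ID, and the IDs and the `find_incomplete` function are hypothetical.

```python
def find_incomplete(started_ids, committed_ids):
    """Return the IDs that were started but never committed, in original order."""
    committed = set(committed_ids)
    return [tx for tx in started_ids if tx not in committed]


# Hypothetical IDs pulled from an application log and a database commit log.
started = ["tx-1001", "tx-1002", "tx-1003", "tx-1004"]
committed = ["tx-1001", "tx-1003"]

print(find_incomplete(started, committed))  # ['tx-1002', 'tx-1004']
```

Each ID on the resulting list is a candidate for rollback or re-entry, subject to whatever your application's own recovery rules require.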
8. Were there other impacts on users or operations?
Check with critical operational departments to understand the impact. Were customers unable to place orders on your e-commerce site? Was finance unable to process invoices? Understand how the outage affected the business, not just the data center. Once you know, determine how you can minimize the impact on critical business areas if there's another outage or slowdown.
9. Were SLAs affected or violated?
Did the outage or performance issue violate your contractual SLA with an ISP, SaaS or hosting provider? If so, you may be owed a refund or credit on your next invoice. Don't forget to ask for the money since many companies won't provide it unless you do.
If you are the hosting provider, or otherwise the party bound by an SLA, make sure you send a note of apology to affected customers and explain the steps you have taken to prevent a recurrence.
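Whichever side of the SLA you are on, checking it comes down to simple arithmetic: availability is the share of the billing period during which the service was up. The Python sketch below computes that figure and compares it with a target. The function names, the 30-day period and the 99.9 percent target are illustrative assumptions, and your contract may define downtime and measurement windows differently.

```python
def availability_percent(downtime_minutes, period_minutes):
    """Percentage of the period during which the service was available."""
    return (period_minutes - downtime_minutes) / period_minutes * 100


def sla_violated(downtime_minutes, period_minutes, target_percent):
    """True when measured availability falls below the contractual target."""
    return availability_percent(downtime_minutes, period_minutes) < target_percent


# Hypothetical case: a 90-minute outage in a 30-day month against a 99.9% target.
MINUTES_IN_30_DAYS = 30 * 24 * 60  # 43200

print(round(availability_percent(90, MINUTES_IN_30_DAYS), 3))
print(sla_violated(90, MINUTES_IN_30_DAYS, 99.9))  # True
```

At a 99.9 percent target, a 30-day month allows only about 43 minutes of downtime, so a single 90-minute outage is enough to breach it.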
10. What preventive actions will we take as a result?
Now that you have identified the root cause of the outage or slowdown and all the affected business areas, either update your existing plan or pull together a new one that identifies the corrective actions and who is responsible for each. Publish the results, and incorporate everything into your operational and recovery procedures.
It may not be possible to prevent every network outage, but you can minimize the damage and prevent some future issues by asking the right questions when you do run into a serious problem.
Published at DZone with permission of AppNeta Marketing, DZone MVB.