
Lessons Learned From 2016's Biggest Outages

From massive DDoS attacks to purely accidental deletions, 2016 saw an unprecedented number of outages. What can we learn from them?

We saw an inordinate number of outages last year, affecting everything from government institutions to security experts and major DNS providers. As a DNS provider, we are constantly monitoring and investigating reports of outages and latency issues from around the world. In almost every single one of our investigations, we have found that the outages could have been avoided. While we know hindsight is 20/20, we think that valuable lessons could be learned from these events — lessons that apply to businesses of all sizes and industries. 

NS1

When: May.

Duration: Hours.

Cause: DDoS attacks.

This attack was unusual: most DNS providers only have to fend off attacks aimed at their clients’ domains, but in this case the massive DDoS attack targeted the provider itself, taking down everything from its status site and public website to its management services. Big-name clients like Yelp, Imgur, and Alexa were all unavailable during the outage. The majority of NS1’s thousands of clients were knocked offline, while the few that had a secondary DNS provider were able to stay online with minimal effects. The attack kicked off a trend that came to define 2016: the movement toward widespread adoption of multiple DNS providers.

Pokemon Go

When: July.

Duration: Two outages over two days.

Cause: DDoS attacks.

Pokemon Go suffered two large attacks during its opening weekend release in the U.S. and Canada. PoodleCorp and OurMine both claimed responsibility for the attacks. The app kept getting bogged down by both malicious and legitimate traffic over the first few weeks of its launch, which wasn’t surprising given that it broke multiple Guinness world records during its first week and month. In fact, the app had already seen widespread latency and server crashes in the week before the attacks, purely as a result of legitimate traffic.

Pokemon Go’s biggest downfall was its inability to scale quickly enough for its rapidly growing user base. This is a problem we see often with startups that go viral. The best way to scale up without draining your wallet is to look toward cloud-based networks, which are built to handle massive amounts of traffic and take only a few minutes to set up.

Scalr 

When: July.

Duration: Hours.

Cause: Accidental removal of all DNS records.

2016 saw a lot of “accidental” outages resulting from admins mistakenly deleting crucial records or even entire zones. Scalr’s was the first large-scale outage of this kind to make headlines around the world. Even DNS providers have fallen victim to this kind of mistake: in 2015, a provider was downed for a short time when an admin accidentally deleted a crucial record that kept the provider’s services online.

There are a few more of these “accidental” outages to come, so hold tight for the lessons learned.
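In the meantime, one simple safeguard against this class of mistake is a sanity check before any destructive change is applied. The sketch below is illustrative only: the client object and its list_records() and delete_record() methods are hypothetical stand-ins for whatever API your DNS provider exposes. The point is that a bulk deletion should have to clear an explicit hurdle before it touches production.

    # Illustrative guard rail for bulk DNS changes. The client object and its
    # list_records()/delete_record() methods are hypothetical placeholders for
    # a real provider API.
    MAX_UNCONFIRMED_DELETES = 5  # arbitrary threshold for illustration

    def delete_records(client, zone, record_ids, confirmed=False):
        """Delete records from a zone, refusing batches that look like mistakes."""
        existing = client.list_records(zone)
        if len(record_ids) >= len(existing):
            raise RuntimeError(f"Refusing to delete every record in {zone}")
        if len(record_ids) > MAX_UNCONFIRMED_DELETES and not confirmed:
            raise RuntimeError(
                f"{len(record_ids)} deletions requested for {zone}; "
                "re-run with confirmed=True if this is intentional"
            )
        for record_id in record_ids:
            client.delete_record(zone, record_id)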

Library of Congress 

When: July.

Duration: Three days.

Cause: DDoS attack.

It’s easy to say that the longest outage of the year was the result of a DDoS, but the real culprit was prehistoric IT practices. The Library of Congress had already been called out by watchdogs in 2015, who found 74 critical weaknesses in its aging infrastructure and IT security systems. Many businesses and institutions that have been around for at least a decade are prone to holding on to the services and software handed down to them by their predecessors. IT is evolving so fast that five- or even two-year-old strategies may not stand up against an attack (or even legitimate traffic). Admins and IT decision makers need to stay informed of industry trends and pick providers that are consistently pushing updates and new services to stay up to date.

Rio 2016 Olympics 

When: August.

Duration: Periodically over weeks.

Cause: DDoS attacks.

This wasn’t actually an outage, but a feat that teaches an important lesson about preparedness. Over the course of the 2016 Olympic games, “the publicly facing websites belonging to organizations affiliated with the Rio Olympics were targeted by sustained, sophisticated DDoS attacks reaching up to 540Gbps,” according to Arbor Networks. Attacks of this magnitude are rare and usually take down their target for hours or even days. Thanks to months of hard work and preparation for the worst, the organizers’ websites stayed online.

Blizzard (Creators of World of Warcraft)

When: August.

Duration: Three attacks in one month, with periodic latency and disconnections.

Cause: DDoS attacks.

Blizzard fell prey to a common plight of large gaming sites approaching a launch. The gaming giant was knocked offline by multiple waves of DDoS attacks after the release of a new World of Warcraft game. They had previously been attacked in April by the famed Lizard Squad, the same group that kept PSN and Xbox Live offline during Christmas 2015.

While attacks like this seem unavoidable, at least in Blizzard’s case, the key to staying online is redundancy. The only way to withstand a DDoS is to have enough capacity to absorb as much of the attack as possible without affecting client connections. Cloud networks and multiple service providers have proven to be the best strategy.

National Australia Bank

When: October.

Duration: Hours.

Cause: Accidental removal of DNS zone.

A member of National Australia Bank’s outsourced IBM team mistakenly deleted the bank’s production DNS zone. Customers were unable to make payments or use ATMs for a few hours. Mistakes like these are difficult to avoid, as they are usually the result of an internal issue. However, there are security measures that can be taken, like limiting each account user’s permissions or putting zone data under version control. IT decision makers should ensure that their service providers offer these features so they can avoid these kinds of situations.

Recently, an admin went to Reddit to confess that he had accidentally deleted his company’s DNS zone and knocked their systems offline. He later reported that his team had been backing up their DNS configurations for a while and were able to roll back the change.
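Backups like that don’t require anything elaborate. As a rough sketch, assuming the dnspython package and an authoritative nameserver that permits zone transfers from your host, you can pull a zone over AXFR and write it to a plain-text master file that can be committed to version control and rolled back later. The nameserver address and zone name below are placeholders.

    # Sketch of a zone backup over AXFR using dnspython. The nameserver address
    # and zone name are placeholders; real servers usually restrict AXFR to
    # whitelisted hosts.
    import dns.query
    import dns.zone

    NS_IP = "192.0.2.1"        # placeholder authoritative nameserver
    ZONE_NAME = "example.com"  # placeholder zone

    zone = dns.zone.from_xfr(dns.query.xfr(NS_IP, ZONE_NAME))

    # Writing the zone out as a standard master file gives a point-in-time
    # snapshot; committing it to version control gives diffs and easy rollbacks.
    with open(f"{ZONE_NAME}.zone", "w") as f:
        zone.to_file(f)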

Krebs on Security

When: September.

Duration: N/A.

Cause: DDoS attack, Mirai botnet.

The largest attack of the year surprisingly did not result in an outage. Akamai, the company that Krebs used to mitigate the attack, reported that it maxed out at over 665 gigabits of traffic per second, more than double the largest attack they had ever seen. The attackers reportedly used the Mirai botnet, which harnesses millions of compromised IoT devices to flood a target with queries. This kind of attack was unprecedented and has ushered in a new age of security. In this case, the lesson didn’t have to be learned the hard way: Krebs had multiple layers of redundancy, used a cloud-based network, and had DDoS mitigation in place.

Dyn 

When: October.

Duration: Three hours.

Cause: DDoS, Mirai botnet.

Most remember the Friday morning when “half the Internet was down,” at least according to the headlines. Big names like Twitter, Etsy, Spotify, and Netflix were all down for most of the morning. The common factor? They shared the same DNS provider that was knocked offline by the same kind of botnet that struck down Krebs on Security just a few weeks prior. Despite warnings against single-homed DNS after the NS1 outage earlier that year, over 58% of the top 100 domains that outsourced their DNS were still using only one DNS provider. In the weeks following the attack, our team monitored the top domains to see how many would add a secondary provider. Within three weeks, 5% of the top 100 domains had added another provider.

Overall, the tried-and-true best way to avoid downtime as the result of a DNS provider outage is to have more than one provider. Recent studies are also finding that it can actually give you a performance boost and cut down on DNS resolution times (more on this in the coming weeks). The second-best method is more of a habit: vet your service providers, your hosting services, your CDNs, and even their upstream providers. Look for a history of reliability, transparency, and innovation.
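Checking whether a domain is still single-homed takes only a few lines. The sketch below assumes the dnspython package and uses the last two labels of each nameserver hostname as a crude proxy for “provider,” which is good enough for a quick audit even if it isn’t exact for every TLD. The domain shown is a placeholder.

    # Rough single-provider check using dnspython. Grouping nameservers by the
    # last two labels of their hostnames is a heuristic, not an exact mapping
    # to providers.
    import dns.resolver

    def ns_providers(domain):
        answers = dns.resolver.resolve(domain, "NS")
        providers = set()
        for record in answers:
            ns_host = record.target.to_text().rstrip(".")     # e.g. dns1.p01.nsone.net
            providers.add(".".join(ns_host.split(".")[-2:]))  # e.g. nsone.net
        return providers

    if __name__ == "__main__":
        found = ns_providers("example.com")  # placeholder domain
        if len(found) < 2:
            print(f"Single DNS provider detected: {found}")
        else:
            print(f"Multiple DNS providers detected: {found}")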


Topics: dns, ddos, outages, performance

Published at DZone with permission of Blair McKee.

Opinions expressed by DZone contributors are their own.
