
DNS Outage Was Doomsday for the Internet

What was supposed to be a quiet Friday suddenly turned into a real “Black Friday” for us (as well as most of the internet) when Dyn suffered a major DDoS attack. In terms of internet disruption, the widespread damage the outage caused made it the worst we may have ever experienced.

By Mehdi Daoudi · Oct. 26, '16 · Opinion

What was supposed to be a quiet Friday suddenly turned into a real “Black Friday” for us (as well as most of the internet) when Dyn suffered a major DDoS attack. In terms of internet disruption, the widespread damage the outage caused made it the worst I have ever experienced.

At the core of it all, the managed DNS provider Dyn was targeted in a DDoS attack that impacted thousands of web properties, services, SaaS providers, and more.

The chart below shows the DNS resolution time and availability of twitter.com from around the world. There were three clear waves of outages:

  • 7:10 EST to 9:10 EST
  • 11:52 EST to 16:33 EST
  • 19:13 EST to 20:38 EST

[Chart: DNS resolution time and availability of twitter.com worldwide during the three outage windows]

The DNS failures were the result of Dyn nameservers not responding to DNS queries for more than four seconds.
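To make that four-second threshold concrete, here is a minimal sketch of the kind of check a synthetic DNS monitor performs. It is illustrative only: it assumes the dnspython library (the resolve() call is dnspython 2.x syntax), and the nameserver IP and the four-second budget are placeholders, not details of Dyn's or Catchpoint's infrastructure.

```python
import time

import dns.exception
import dns.resolver  # dnspython 2.x

# Query a specific nameserver directly (placeholder IP, not a real Dyn server).
resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["192.0.2.53"]
resolver.timeout = 4.0   # per-attempt budget
resolver.lifetime = 4.0  # total budget: anything slower counts as a failure

start = time.monotonic()
try:
    answer = resolver.resolve("twitter.com", "A")
    print(f"Resolved in {time.monotonic() - start:.2f}s: {[r.address for r in answer]}")
except dns.exception.Timeout:
    print("No response within 4 seconds -- recorded as a DNS availability failure.")
```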

We were impacted in three ways:

  • Our domain catchpoint.com was not reachable for a solid 30 minutes until we introduced our secondary managed DNS provider, Verisign. We also brought up and publicized to our customers a backup domain that was never on Dyn, so our customers could log in to our portal and keep an eye on their online services. All of these were in standby mode prior to the incident.
  • Our nodes could not reliably talk to our globally distributed command and control systems until we switched to IP-only mode, bypassing DNS lookups. This was a feature we had developed, tested, and put into production, but it was not yet active because our engineering teams had planned one more enhancement. Given the nature of the situation, we deemed turning it on to be lower risk than what we were experiencing. (A rough sketch of this kind of IP fallback follows this list.)
  • Many of the third-party vendors our company relies on stopped working: our customer support and online help solution, CRM, office door badging system, SSO, two-factor authentication services, one of our CDNs, a file-sharing solution, and the list goes on and on.
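As a rough illustration of that IP-only fallback idea (not Catchpoint's actual implementation), the sketch below tries a normal DNS lookup first and, if it fails or times out, falls back to a hard-coded list of last-known-good addresses. The hostname and IPs are hypothetical placeholders, and it again assumes dnspython.

```python
import dns.exception
import dns.resolver  # dnspython 2.x

# Hypothetical last-known-good addresses, refreshed while DNS is healthy.
FALLBACK_IPS = {"control.example.com": ["192.0.2.10", "192.0.2.11"]}

def resolve_with_fallback(hostname, lifetime=4.0):
    """Try DNS first; on failure or timeout, switch to 'IP-only mode'."""
    resolver = dns.resolver.Resolver()
    resolver.lifetime = lifetime
    try:
        return [r.address for r in resolver.resolve(hostname, "A")]
    except (dns.exception.Timeout, dns.resolver.NoNameservers, dns.resolver.NXDOMAIN):
        # DNS is unusable: bypass it and use the cached addresses directly.
        return FALLBACK_IPS.get(hostname, [])

print(resolve_with_fallback("control.example.com"))
```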

This blog post is not about finger-pointing; the folks at Dyn had a horrible day putting up with their worst nightmare, and they did an amazing job of dealing with it, from notifications to extinguishing the fire. This is about how to deal with a worst-case outage, as a company and as an industry.

As with every outage, it's important to take the time to reflect on what took place and how it can be avoided in the future.

Here are some of my takeaways from Friday, and the must-have solutions:

  • DNS is still one of the weakest links in our internet infrastructure and digital economy. We have to keep learning and sharing that knowledge with each other. Here are several articles we have written on DNS.
  • A single DNS provider is not an option anymore for anyone. No company, small or large, can rely on a single DNS provider.
  • DNS vendors should create knowledge base articles about how to introduce secondary DNS providers, and they must be easy to find and follow.
  • DNS vendors need to make the setup of automatic zone transfers easier to find. Having to open a ticket in the middle of a crisis to find the IPs of the zone transfer nameservers is simply not a viable option.
  • DNS vendors should not set high TTLs (two days) on the authoritative nameserver records they return in DNS responses, and it should be easy to drop or change the TTL. While a long TTL is great for avoiding changes to records at the TLDs, having those nameservers cached as authoritative for two days becomes a headache when trying to switch to or migrate from a backup solution. (A quick way to check these TTLs is sketched after this list.)
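One way to see whether you are exposed to that long-TTL problem is to look at the TTL your resolver reports for your domain's NS records. The snippet below is a quick sketch using dnspython, with example.com standing in for your own domain; the one-day cutoff is just an illustrative threshold.

```python
import dns.resolver  # dnspython 2.x

# Inspect the TTL the local resolver reports for a domain's NS records.
answer = dns.resolver.resolve("example.com", "NS")  # placeholder domain
ttl_seconds = answer.rrset.ttl

print("NS records:", [str(r.target) for r in answer])
print(f"TTL: {ttl_seconds}s (~{ttl_seconds / 3600:.1f} hours)")

if ttl_seconds > 86400:
    print("Heads up: resolvers may keep pointing at these nameservers "
          "for more than a day after you switch providers.")
```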


Introducing another DNS vendor wouldn't have achieved 100% of the result until you went into the Dyn configuration and added that other solution into the mix:

  • The community must work together to come up with commercial or open-source solutions to make DNS configurations compatible between vendors (this is for complex DNS setups like failover, geo load balancing, etc.). This is no longer a nice-to-have, but a must-have.
  • There needs to be a way to push registrar configurations faster. We need an emergency reload button at the registrar level, but also at the root and TLD levels, and that means some way to tell DNS resolvers to purge their caches. Waiting for two days is not going to work after a catastrophic event like this, just as keeping multiple DNS vendors active does not scale financially for most companies.
  • Lastly, it is time for the Internet Engineering Task Force to take a very close look at the DNS standards and figure out how to make this key protocol more redundant and flexible so it can deal with the challenges we are facing today.

Some takeaways from a monitoring standpoint:

I had people tell me, “But Mehdi, I am not seeing a problem in my RUM.” When your site isn't reachable, RUM (real user monitoring) won't tell you anything, because there is no user activity to show. This is why your monitoring strategy must include both synthetic monitoring and RUM. (A bare-bones synthetic check is sketched after the list below.)

  • DNS monitoring is critical to understanding the “why.”
  • DNS performance impacts web performance.
  • The impact was so widespread that some sites that didn't rely on Dyn still suffered outages or a degraded user experience, because they used third parties that did rely on Dyn.
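To show what “synthetic” means here in the simplest possible terms, the sketch below times a DNS lookup and an HTTP fetch separately, so a DNS failure shows up even when no real users reach the site and RUM stays silent. It is a bare-bones illustration, not a real monitoring product: the hostname, URL, and time budgets are placeholders, and it assumes dnspython plus the Python standard library.

```python
import time
import urllib.request

import dns.resolver  # dnspython 2.x

def synthetic_check(hostname, url, dns_budget=4.0, http_budget=10.0):
    """Time the DNS lookup and the HTTP fetch separately and report both."""
    result = {"dns_ok": False, "http_ok": False}

    resolver = dns.resolver.Resolver()
    resolver.lifetime = dns_budget
    start = time.monotonic()
    try:
        resolver.resolve(hostname, "A")
        result["dns_ok"] = True
    except Exception:
        pass  # timeout, SERVFAIL, NXDOMAIN, etc.
    result["dns_seconds"] = round(time.monotonic() - start, 3)

    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=http_budget) as response:
            result["http_ok"] = response.status == 200
    except Exception:
        pass  # connection errors, timeouts, non-HTTP failures
    result["http_seconds"] = round(time.monotonic() - start, 3)

    return result

print(synthetic_check("example.com", "https://example.com/"))  # placeholders
```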

We interact with many things on a daily basis (cars, cell phones, planes, hair dryers) that have some sort of certification. I urge whoever is responsible to consider the following:

  • A ban on any internet-connected device that does not force a change of the default credentials upon first start. There shouldn't be admin/admin for anything, including cameras, refrigerators, access points, routers, etc.
  • A ban on accessing such devices from anywhere on the internet. There should be some limitation, such as access only through the provider's interface or from the local network.
  • Consumers should also pressure the industry by not buying products that aren't safe. Maybe we need an “internet safety rating” from a governmental agency or worldwide organization.
  • A must-have feature on every home and SMB router and access point is the ability to detect abnormal traffic/activity and turn it off or slow it down; sending thousands of DNS requests in a minute is not normal. We should learn from Microsoft and what they did with Windows XP to limit an infected host. (A simple rate-check sketch follows this list.)
  • Local ISPs must have the capability to detect and stop rogue traffic.
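For what that “thousands of DNS requests in a minute is not normal” check could look like in its simplest form, here is a sliding-window sketch. The 1,000-queries-per-minute threshold is an illustrative guess, not a vendor recommendation, and a real router would apply something like this per client device in its firmware.

```python
import time
from collections import deque

class DnsRateWatch:
    """Flag a host whose DNS query rate over the last minute looks abnormal."""

    def __init__(self, max_queries_per_minute=1000):  # illustrative threshold
        self.max_queries = max_queries_per_minute
        self.timestamps = deque()

    def record_query(self):
        """Record one DNS query; return True if the host should be throttled."""
        now = time.monotonic()
        self.timestamps.append(now)
        # Keep only the queries from the last 60 seconds.
        while self.timestamps and now - self.timestamps[0] > 60:
            self.timestamps.popleft()
        return len(self.timestamps) > self.max_queries

watch = DnsRateWatch()
if watch.record_query():
    print("Abnormal DNS query rate -- slow this host down or cut it off.")
```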

The state of cybersecurity is dire. I hope this incident serves as a huge wake-up call for everyone. What happened Friday was a code blue event; we rely on the internet for practically everything in society today, and it's our job to do everything we can to protect it.

Thank you to Dyn for the prompt responses to our support tickets, to Verisign for fielding last-minute questions, to our customers who were very patient and understanding, to our entire support organization, and to some special friends at major companies who offered a helping hand with some amazing advice around DNS.


Published at DZone with permission of Mehdi Daoudi, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
