
Silos and Clouds: Design for Growth


In a world where breaking down barriers is key, it's important to understand the network limits, VPC behavior, and AWS features that can take your website down.


At 16:34 PDT on 2016-09-29, the operations team at Puppet received a notification from users that our website, puppet.com, was throwing HTTP 504 errors. This was also true for our other websites hosted on AWS: forge.puppet.com, puppetconf.com, etc.

The web infrastructure for puppet.com, forge.puppet.com, and other sites is a straightforward three-tier architecture. Traffic initially comes in to a cluster of HAProxy-based load balancers, which route the traffic to application servers, and then the application servers use databases hosted in Amazon’s Relational Database Service (RDS).
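
As a rough illustration of the first two tiers, a minimal HAProxy frontend/backend pair might look like the sketch below; the server names, addresses, and ports are illustrative only, not our actual configuration:

frontend www
    bind *:80
    default_backend app_servers

backend app_servers
    balance roundrobin
    # application servers in the AWS VPC
    server app01 10.228.1.10:8080 check
    server app02 10.228.1.11:8080 check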

One crucial implementation detail is that RDS databases must be accessed via a DNS FQDN, rather than a bare IP address. AWS handles RDS failover by pointing the RDS database’s DNS record to a database replica, so that clients don’t have to do anything to fail over when a replica is promoted during failover. Consequently, the application servers must be able to resolve DNS to access the database servers, and the DNS lookup results cannot be cached on the application servers, because that would prevent RDS failover from working as intended.
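
Because failover works by repointing that DNS record, the endpoint's answer typically carries a TTL of only a few seconds, and a plain lookup shows whichever instance is currently primary. A rough illustration (the endpoint name and address here are made up):

$ dig +short mydb.abc123xyz.us-west-2.rds.amazonaws.com
10.228.12.34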

By default, EC2 instances use Amazon's recursive DNS resolvers. We reconfigured them to use our internal corporate DNS servers, because we needed our AWS-based infrastructure to reach internal services on our corporate network via the VPC IPSEC connection. For example, our EC2 instances need to reach our internal Puppet masters and send logs to rsyslog, and the Icinga2 satellite in our AWS VPC needs to connect to the internal Icinga2 master.
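
In practice, that reconfiguration amounts to pointing each instance's resolver at the corporate DNS servers instead of the VPC-provided resolver, for example via /etc/resolv.conf or the VPC's DHCP options set. A minimal sketch, with placeholder addresses and search domain:

# /etc/resolv.conf on an EC2 instance
search corp.example.com
nameserver 10.16.0.10   # corporate DNS, reached over the VPC IPSEC tunnel
nameserver 10.16.0.11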

Initial triage determined that hostname resolution from our EC2 environment had stopped working; it relies on DNS servers in our co-location facility, which is connected via a VPC IPSEC connection. This dependence on corporate DNS servers for our AWS stack to function was known and documented.
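
A check along these lines from an application server would have made the symptom obvious: lookups against the corporate resolvers over the tunnel simply time out (the resolver address and endpoint name are the same placeholders used above):

$ dig @10.16.0.10 mydb.abc123xyz.us-west-2.rds.amazonaws.com
;; connection timed out; no servers could be reached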

Investigating the Network

Network Operations was asked to look at AWS VPC connections between our co-location and AWS. We peer with Amazon over the Northwest Access Exchange here in Portland (NWAX).

The IPSEC tunnels looked good (phase 1 and phase 2):

Phase 1

kris@opdx-fw3> show security ike sa | match 54.239.50.133
3284700 UP     7efe797e3f403535  767973eafb80c056  Main           54.239.50.133


{primary:node0}
kris@opdx-fw3> show security ike sa | match 54.239.50.132
3284701 UP     ee2939b1fa9c0aab  9d4acda0545743a3  Main           54.239.50.132


Phase 2

kris@opdx-fw3> show security ipsec sa | match 54.239.50.133
  <131073 ESP:aes-cbc-128/sha1 6c699ff5 1134/ unlim - root 500 54.239.50.133
  >131073 ESP:aes-cbc-128/sha1 76af7337 1133/ unlim - root 500 54.239.50.133


We also had IP reachability to the remote side of each tunnel interface:

kris@opdx-fw3> ping count 1 169.254.249.57
PING 169.254.249.57 (169.254.249.57): 56 data bytes
64 bytes from 169.254.249.57: icmp_seq=0 ttl=64 time=34.163 ms
<snip>


{primary:node0}
kris@opdx-fw3> ping count 1 169.254.249.61
PING 169.254.249.61 (169.254.249.61): 56 data bytes
64 bytes from 169.254.249.61: icmp_seq=0 ttl=64 time=8.015 ms
<snip>


Next was routing. We allocate 10.228.0.0/16 to our AWS hosts and receive that prefix over BGP from the VPC peer.

{primary:node0}
kris@opdx-fw3> show route 10.228.0.0/16


{primary:node0}
kris@opdx-fw3>


No route; getting interesting. Looking at BGP peers confirms we have a routing problem.

kris@opdx-fw3> show bgp summary | match 169.254.249
169.254.249.57         7224      69521      74770       0      20     15:33 Connect
169.254.249.61         7224      69526      74770       0      18     15:33 Connect


Can the logs tell us anything? We look at the BGP traceoptions logs.

kris@opdx-fw3> show log trace_bgp | match 7224
Sep 29 17:19:09.217  opdx-fw3 rpd[1780]: bgp_read_v4_message:10642: NOTIFICATION received from 169.254.13.161 (External AS 7224): code 6 (Cease) subcode 1 (Maximum Number of Prefixes Reached) AFI: 1 SAFI: 1 prefix limit 100


BGP peering is failing because of a prefix limit — a limit not configured on our end of the link. Once this information hit the chat room, another team member pointed us to the AWS policy that limits received routes over a VPC link to 100 — a limit we had not familiarized ourselves with.

When we went over the 100-prefix limit and BGP failed, DNS resolution stopped working for puppet.com and forge.puppet.com application servers, because they were configured to access our DNS servers via that IPSEC link. After DNS resolution stopped working, application servers could no longer resolve DNS-based RDS database endpoints, and requests failed. Our sites can’t function without their database backends, so those sites served error pages instead of the expected content.


Previously, we sent routes to AWS by redistributing OSPF into BGP, and we had fallen over the 100-route cliff. Here is the policy that was in place:

[edit policy-options policy-statement AS7224_OUT]
    term ADVERTISE-OSPF {
        from {
            protocol ospf;
            area 0.0.0.0;
        }
        then accept;
    }


To get back under the 100-route limit, this was quickly turned into a summarization policy:

[edit policy-options]
    prefix-list PUPPET-10NET {
        10.0.0.0/8;
    }
[edit policy-options policy-statement AS7224_OUT]
    term ADVERTISE-SUMMARY {
        from {
            protocol static;
            prefix-list PUPPET-10NET;
        }
        then accept;
    }
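
For a term like this to export anything, a matching static route has to exist in the routing table; a typical companion (sketched here, not necessarily our exact configuration) is a discard route covering the summary:

[edit routing-options]
    static {
        route 10.0.0.0/8 discard;
    }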


This reduced our announced routes from 100-plus (internal OSPF routes) down to 14 (one summary, one static, and directly connected). BGP peers came up, reachability returned to normal, DNS resolution from app servers to RDS databases started working again, and websites came back online.

Now rewind a bit: the root cause of all of this traced back to our Belfast office. In preparation for a new test environment, an engineer based out of that office brought up some new subnets (properly) that propagated throughout the corporate routing tables, pushing us over the AWS threshold. A case of the butterfly effect, though any of us could have triggered this problem; it was a trap just waiting for us to grow to that hundredth route.

Lessons

Silos Can Happen in Small Teams

When we started using AWS, we gave one team overall responsibility for the AWS infrastructure; a second team was already responsible for network infrastructure; and a third team was responsible for the applications on top of that infrastructure. The information needed to prevent this outage never crossed the boundaries of those silos, despite our teams being small and having strong working relationships. We've since changed team structures, but the effects of putting AWS knowledge in one team's silo were still being felt more than a year later.

Document and Communicate Failure Modes

When Puppet Forge was moved to AWS using this architecture, we did a failure mode analysis that correctly identified DNS resolution as a potential point of failure, and documented it. However, since that was a one-off assessment outside of a structured process, the tickets created for those known failure modes ended up on the back burner behind more urgent work.

Some team members knew about the risk, but most didn’t realize that the IPSEC connection was in the critical path for website requests. We’re currently working on a more structured approach to assessing the operational readiness of new services so that failure modes can be identified and mitigated systematically, rather than depending on the best efforts of individuals.

Route Summarization

When we originally set up these VPCs, our company was much smaller. The number of routes in our corporate environment was well below the AWS limit, and modern network hardware can handle many routes with ease. Regardless, it's always good practice to summarize as you build, whether for remote offices, data centers, or tunnels to cloud infrastructure. Taking the time to plan your routing is also a chance to think about broader IP allocation and growth.
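
A lightweight habit that helps here is keeping an eye on how many prefixes you advertise to the VPC peer and alarming well before the 100-prefix limit. On Junos, something along these lines gives a quick approximation (output omitted; the count includes a few header lines, so it slightly overstates the number of prefixes):

{primary:node0}
kris@opdx-fw3> show route advertising-protocol bgp 169.254.249.57 | count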

Top Takeaways

  • If you're using AWS, everyone involved needs to read the AWS docs!
  • Route summarization is a good practice. We can't say this too emphatically.
  • AWS is an all-teams responsibility. You run much greater risks when you're siloed.



Published at DZone with permission of Kris Amundson, DZone MVB.
