Cloud services have cleared away many of the obstacles, entanglements, and distractions that stand between a business, big or small, and a robust, scalable, maintainable infrastructure. In the past, infrastructure could take hours or even days to spin up and configure. Many businesses continue to require dedicated and isolated infrastructure, mainly due to colocation with important services (e.g., the financial industry with the stock exchange), but for the majority, moving to a distributed platform that is self-managed and yet very hands-off yields a tremendous increase in efficiency and a decrease in the need for hands-on resources. This transition raises many new questions about monitoring, logging, debugging, and security. In this article, I seek to answer the question of security: have the security model and best practices changed with the advent of cloud infrastructure? I’ll be using AWS (and AWS terminology) for my examples.
The traditional architecture is fairly complex, having evolved to counter threats that are growing in both number and severity. There is an outermost layer that represents the public-facing services a business provides. This might be a website, a service-based platform, or, more recently, sets of APIs. These services sit behind hardware load balancers, which in turn sit behind network infrastructure intended to perform network intrusion detection to weed out known bad traffic, provide access control lists (ACLs) to limit availability of the service, load balance so that only healthy servers receive traffic, and so on. The public-facing endpoints have their own security measures: each runs a firewall (e.g., iptables/netfilter) that likely mirrors the rules of the hardware load balancer while still providing direct access for occasions such as code deploys and system updates. They also tend to run intrusion detection systems (perhaps even SELinux) to ensure that the operating system isn’t modified, that everything is logged, and that threats are identified, dropped, and in some contexts banned automatically.
These public-facing endpoints then communicate with the internal services, access to which could be controlled through additional ACLs, a reverse proxy, and so on. Likewise, there is hardware here that mirrors that of the external services, only all of it sits behind the corporate datacenter firewall. This is where the analogy to cloud infrastructure truly begins to break down. In the cloud, everything is internal to your VPC, which is split into subnets that can be considered public or private.
Note that once we’ve left the public-facing load balancer we are in the private subnet, which cannot be reached from outside the VPC regardless of the security group. Within AWS we have an opportunity to mimic the corporate model. In fact, we could mimic it exactly by using dedicated hardware (literally or figuratively) for firewalls and the detection services discussed earlier. Before we decide what’s right, let’s take a closer look at two prominent attack types, the ones with the lowest barriers to entry, that are likely to be with us for the duration of the internet.
The simpler of the two, a denial-of-service attack, involves throwing as many packets as quickly as possible at a given service. It is typically launched from many sources (hence, Distributed Denial of Service), and protecting against it can be difficult, as distinguishing good traffic from bad may be complicated.
In both the traditional corporate datacenter and the cloud, there could be a component responsible for inspecting and routing incoming traffic. In the traditional sense it would be one or more dedicated servers, and in a cloud service such as AWS it would be an AMI running on one or more instances. The network intrusion detection system (NIDS, e.g., Snort) has rules that determine which packets smell fishy. If a request looks invalid or malicious, the software drops the packet. At first, DDoS traffic typically passes these tests, as it uses the service as intended. It is only after many repeated requests that the attack is detected.
We’ll pretend this attack is on a web service directly (as is so often the case).
In the traditional datacenter, each server runs a web server like Apache, configured (correctly) with mod_security and mod_evasive. These can be configured to whitelist certain CIDRs, cap the number of requests per individual IP per second, and so on. Unfortunately, by this point the malicious traffic is already reaching your services behind the load balancer, so although Apache rejects the requests, each one still occupies an Apache worker long enough to process the request and return a 403 error.
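To make the per-IP cap and the CIDR whitelist concrete, here is a minimal sketch in Python. It is not how mod_evasive is implemented; the class name `RequestLimiter` and its parameters are made up for illustration, and the caller supplies the clock so the behavior is easy to follow.

```python
import ipaddress
from collections import defaultdict, deque


class RequestLimiter:
    """Sliding-window cap on requests per client IP, with a CIDR whitelist.

    Hypothetical helper for illustration only; not mod_evasive's actual logic.
    """

    def __init__(self, max_requests, window_seconds, whitelist=()):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.whitelist = [ipaddress.ip_network(cidr) for cidr in whitelist]
        self.history = defaultdict(deque)  # client IP -> recent request timestamps

    def allow(self, client_ip, now):
        """Return True to serve the request, False to answer with a 403."""
        address = ipaddress.ip_address(client_ip)
        if any(address in network for network in self.whitelist):
            return True  # whitelisted CIDRs are never throttled
        recent = self.history[client_ip]
        # Discard timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False
        recent.append(now)
        return True


limiter = RequestLimiter(max_requests=3, window_seconds=1.0, whitelist=["10.0.0.0/8"])
decisions = [limiter.allow("203.0.113.7", now=0.1 * i) for i in range(5)]
```

Notice that `allow` has to run for every request, including the ones it rejects. That is the same cost described above: the decision to refuse still ties up a worker.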
Once the Apache workers are likewise overrun by requests, servers will become unreachable or latency will climb to unacceptable levels. A human is then alerted by alarm bells and klaxons tied to the first metrics that exceed their thresholds, which could be the amount of incoming network traffic, service latency, or even server availability if CPU or memory has been pegged by the attack. If this is a large-scale datacenter, there’s a good chance someone is already awake and on-site. Otherwise, they’ll have to be woken, which adds minutes while they wake up and get online. Then the investigation begins. Odds are that this person has all of the tools needed to identify and null route (divert traffic to a dead end) bad IPs, but this can only be done by a human after investigation, as there is the distinct possibility that this is not an attack.
If this turns out not to be an attack but rather unusually spiky traffic or a sudden influx of valid traffic (e.g., a lottery website directly after a big drawing), then there is little the on-call engineer can do at that point other than divert traffic to a page apologizing for the interruption of service and, if authorized, request additional dedicated servers to handle the spike. Once these are in service they are likely never spun down, becoming a fixed cost that exists only to accommodate the rare times of unusually high traffic volume.
In the cloud environment, mod_security and mod_evasive are likely also running, but they are probably not the primary defense. Instead, there should be an outer DDoS protection service, such as the ones offered by Akamai, Cloudflare, and so on. These services work like a more sophisticated mod_security and mod_evasive, automatically identifying DDoS attacks and routing them away from your service or hosts. Like their human counterpart, they also attempt to keep good traffic flowing through the system at the same time, and they may have additional means of thwarting bad actors, such as interstitial waiting pages that eliminate impatient bots.
A network intrusion detection tool may still be used internally (as mentioned above), but with cloud services it can be scaled automatically to match the amount of traffic coming through the load balancer. The same goes for the services themselves. This creates a dynamic cost for both systems based on actual throughput, which, when traffic is tied to revenue, makes a lot of sense.
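As a rough illustration of what scaling to match throughput means, the sketch below sizes a fleet from the current request rate. This is not the algorithm AWS uses for its scaling policies; the function name, the per-instance target, and the bounds are all assumptions chosen for the example.

```python
import math


def desired_capacity(requests_per_second, target_per_instance, min_instances, max_instances):
    """Return how many instances are needed to keep each one near its target load.

    Hypothetical sizing rule for illustration; real autoscaling policies also
    consider cooldowns, warm-up time, and health checks.
    """
    needed = math.ceil(requests_per_second / target_per_instance)
    # Never drop below the floor (availability) or exceed the ceiling (budget).
    return max(min_instances, min(max_instances, needed))


quiet_night = desired_capacity(50, target_per_instance=100, min_instances=2, max_instances=20)
busy_day = desired_capacity(950, target_per_instance=100, min_instances=2, max_instances=20)
flood = desired_capacity(5000, target_per_instance=100, min_instances=2, max_instances=20)
```

The ceiling matters as much as the floor: without it, an attack that slips past the outer protection would scale your bill along with your fleet.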
It’s fairly obvious that one can use a DDoS protection service in front of a traditional datacenter. What is less obvious is how one goes about routing good traffic appropriately, as well as scaling for it automatically. A lot of people use the terms “webscale” and “big data,” which mostly mean low-latency, high-throughput systems to varying degrees. I instead wish to introduce the term “rightscale” (no association with the company), which means that your system is capable of scaling up (and down) to meet the current needs of the platform while remaining flexible enough to foresee and meet future needs.
In this scenario, an attacker has found a publicly accessible authentication service. For the purpose of this example, I’ll use the most common (and typically most powerful) one, SSH.
We’ll assume that the attacker can bypass the NIDS mentioned above.
The systems administrator in the traditional environment has installed some sort of brute-force banning automation on each server, likely something like fail2ban. It scans log lines and uses jail rules to determine whether an IP should be added to the ban list within its iptables chain. This has obvious shortcomings, such as not supporting IPv6 (yet) and not looking for key brute-force attempts against SSH (as opposed to password authentication). Arguably a private key created with adequate entropy will never be broken, so perhaps that’s a moot point. Other measures are put in place to ensure SSH security: root login is disallowed; typical usernames (like admin) go unused; ordinary users and groups are isolated; and sudoers is locked down and well maintained.
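The log-scanning idea can be sketched in a few lines of Python. This is not fail2ban's code: the regular expression is a simplified pattern for sshd password failures, and the `max_retry` and `find_time` parameters only loosely echo fail2ban's `maxretry` and `findtime` jail options. A real deployment would add the offender to a firewall chain rather than return a set.

```python
import re
from collections import defaultdict

# Simplified pattern for sshd password failures (an assumption, not fail2ban's filter).
FAILED_LOGIN = re.compile(r"Failed password for (?:invalid user )?\S+ from (\S+) port \d+")


def find_offenders(log_entries, max_retry=3, find_time=600):
    """Return IPs with at least max_retry failures within find_time seconds.

    log_entries is an iterable of (unix_timestamp, log_line) pairs in time order.
    """
    failures = defaultdict(list)
    banned = set()
    for timestamp, line in log_entries:
        match = FAILED_LOGIN.search(line)
        if match is None:
            continue
        ip = match.group(1)
        # Keep only the failures that are still inside the window.
        recent = [t for t in failures[ip] if timestamp - t < find_time]
        recent.append(timestamp)
        failures[ip] = recent
        if len(recent) >= max_retry:
            banned.add(ip)
    return banned


entries = [
    (0, "sshd[1]: Failed password for invalid user admin from 203.0.113.5 port 4242 ssh2"),
    (10, "sshd[1]: Failed password for root from 203.0.113.5 port 4243 ssh2"),
    (20, "sshd[1]: Accepted publickey for deploy from 198.51.100.9 port 5000 ssh2"),
    (30, "sshd[1]: Failed password for root from 203.0.113.5 port 4244 ssh2"),
    (40, "sshd[1]: Failed password for bob from 198.51.100.9 port 5001 ssh2"),
]
offenders = find_offenders(entries)
```

Because the pattern only matches password failures, repeated attempts with bad keys pass through unnoticed, which is the second shortcoming mentioned above.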
Additionally, servers behind a load balancer should be unreachable from outside of the corporate intranet, and yet, in the present day, this is one of the most commonly misconfigured security measures in this setting. Instead, much of the time servers are given IPs that are in fact static and routable, which makes the job of the network admin easier when setting up hardware load balancing. ACLs are then used to ensure that the server can only be reached by certain people on certain ports. Iptables/netfilter is employed to lock this down further, so that services such as SSH can only be accessed from certain CIDRs. Maintenance of this can become tiresome: many companies do not control anything but the servers themselves, and a corporate datacenter is a very busy place. In all of that chaos, it becomes exceedingly difficult to pick good IP ranges for whitelists. As a result, any hole in the security of the datacenter could very well mean further opportunity to exploit the remaining servers in the intranet.
In AWS (and likely other cloud services) the network or systems administrator creates their own networking configuration, meaning their own intranet and subnets. That intranet (or VPC) specifies a CIDR as large as /16, which is then typically cut into public and private subnets. This puts the power of routing and network communication, as well as IP assignment, directly into the hands of the consumer.
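Carving a /16 into subnets is plain CIDR arithmetic, which Python's standard `ipaddress` module can demonstrate. The address range, the /24 subnet size, and the subnet names below are arbitrary choices for the example, not AWS defaults.

```python
import ipaddress


def plan_subnets(vpc_cidr, new_prefix, names):
    """Assign consecutive equal-sized blocks of the VPC range to the given names."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = vpc.subnets(new_prefix=new_prefix)  # generator of /new_prefix networks
    return dict(zip(names, blocks))


# Hypothetical layout: one public and one private subnet out of a /16.
plan = plan_subnets("10.0.0.0/16", 24, ["public-a", "private-a"])
for name, block in plan.items():
    print(name, block, block.num_addresses)
```

A /16 holds 256 such /24 blocks, so there is plenty of room left over to add subnets per availability zone or per tier later.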
Once again, in this environment, a systems administrator is likely to run a service such as fail2ban. The difference here is that within a VPC (in the case of AWS) only IPv4 is routable, and, as a result, fail2ban becomes much more powerful (where password authentication is the norm). Instead of iptables/netfilter, AWS offers security groups, which control incoming and outgoing traffic by CIDR and by security group. When security group (A) references another security group (B) in its incoming rules, any server employing security group (B) can reach any server employing security group (A) on the listed port(s).
This dynamic whitelisting is incredibly powerful: it limits access to a given service to specific groups of servers or load balancers while remaining flexible, and it confines the whole collection of privileges to the well-defined intranet.
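The group-to-group reference described above can be modeled in a few lines. This is a toy evaluation function, not the AWS API: `IngressRule` and `is_allowed` are hypothetical names, and real security groups also cover protocols, port ranges, and outbound rules.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IngressRule:
    """One incoming rule: a port plus either a source CIDR or a source group."""
    port: int
    source_cidr: Optional[str] = None   # e.g. "10.0.0.0/24"
    source_group: Optional[str] = None  # name of another security group


def is_allowed(rules, port, source_ip, source_groups):
    """Check a connection against the destination's ingress rules.

    source_groups holds the names of the security groups attached to the sender.
    """
    address = ipaddress.ip_address(source_ip)
    for rule in rules:
        if rule.port != port:
            continue
        if rule.source_group is not None and rule.source_group in source_groups:
            return True
        if rule.source_cidr is not None and address in ipaddress.ip_network(rule.source_cidr):
            return True
    return False  # anything not explicitly allowed is denied


# Group A lets members of group B reach port 8080 and one CIDR reach SSH.
group_a = [
    IngressRule(port=8080, source_group="B"),
    IngressRule(port=22, source_cidr="10.0.0.0/24"),
]
app_call = is_allowed(group_a, 8080, "10.0.5.17", {"B"})
stranger_call = is_allowed(group_a, 8080, "10.0.5.17", {"C"})
```

The rule for port 8080 never mentions an IP address. Any instance launched into group B gains access the moment it exists and loses it the moment it is terminated, which is what makes the whitelist dynamic.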
When everything in the traditional corporate datacenter is configured correctly, there is no difference between the two from a security perspective. From a maintenance and scalability perspective, on the other hand, there is an evident gap, one that, when played out in real life, often leads to security problems down the line within the traditional datacenter.
A traditional datacenter environment is just as secure as its cloud counterpart (if not more so) when set up and maintained appropriately. Unfortunately for the traditional datacenter, one of the well-known principles of security is that it must be easy to accomplish. Otherwise, consumers will subvert security for the sake of ease of access.
Cloud services make security easy by providing powerful abstractions over networking and security, and by supplementing them with services that add the functionality and intelligence needed to assess and mitigate common threats.