Why servers and stuff are never provisioned to 100% capacity
It is an unwritten rule that web/app servers should never, ever be pushed to 100% capacity.
Ignoring this unwritten rule will invariably result in the phenomenon we’ll call “up for thee but not for me,” which is simply the situation in which a web site or app responds to the guy in the next cube – but not to you.
This usually occurs because the connection limit of the resources serving that app has been reached. The guy in the next cube already managed to get a connection before resources ran dry and thus he’s actually part of the problem. He’s got an open connection (or four) to the server and thus the fact that the server has no more connections to offer doesn’t affect him.
Until he times out or closes the connection. Then you get it, and he’s the one who can’t connect – no matter how many times he tries. Until someone else drops their connection. And so on and so forth. The thing is that you aren’t guaranteed to get a connection when someone else drops theirs. It all depends on whether or not your request to grab it gets there before anyone else’s.
In the end, this causes one person to think your site is up while another thinks it’s down. The struggle is real, guys. Very real.
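To make the mechanics concrete, here is a minimal sketch of a server with a hard connection cap. The Server class and the client names are made up for illustration; a real server tracks sockets rather than names, but the outcome is the same: whoever already holds a connection keeps getting served, and everyone else is refused until a slot frees up.

```python
class Server:
    """Toy server with a hard cap on concurrent connections (hypothetical)."""

    def __init__(self, max_connections):
        self.max_connections = max_connections
        self.open = set()

    def connect(self, client):
        # A client that already holds a connection keeps being served.
        if client in self.open:
            return True
        # New connections are refused once the cap is reached.
        if len(self.open) >= self.max_connections:
            return False
        self.open.add(client)
        return True

    def disconnect(self, client):
        self.open.discard(client)


server = Server(max_connections=2)
results = [
    server.connect("cube_neighbor"),  # gets in
    server.connect("someone_else"),   # gets in, server is now full
    server.connect("you"),            # refused: "up for thee but not for me"
]

# The neighbor drops his connection and you grab the free slot first.
server.disconnect("cube_neighbor")
results.append(server.connect("you"))            # now you are in
results.append(server.connect("cube_neighbor"))  # and he is locked out

print(results)
```

Note that the sketch hands the free slot to whoever asks first, which is exactly why you aren’t guaranteed to get it.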
Every web/app server has an upper bound on connections: a number at which the server can no longer create new connections but is able (mostly) to service existing ones. If you’ve ever looked at a network-hosted proxy (one that does load balancing) you should be able to find two different performance metrics: connections per second (CPS) and concurrent connections. The former is the rate at which the system can create new connections. The latter is the total number of connections the system can maintain at once, i.e. the total capacity of the system.
To avoid running afoul of the phenomenon, you need to make sure that you never hit the upper bound of a system’s capacity. The simplest way to do this is to run a load test and find out not only the ultimate limit (the “we can’t handle any more connections, period” limit) but also where the “performance load limit” is. That’s the point at which operational axiom #2 really kicks in and performance starts degrading. That’s your penultimate limit.
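If your load test gives you a series of measurements at increasing load, pulling out both limits is straightforward. The sketch below assumes a hypothetical Sample record (concurrent connections, observed response time, and whether new connections were being refused) and made-up numbers; your load testing tool will have its own output format. It uses the 8 second mark as the performance cutoff.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    connections: int      # concurrent connections during this test step
    response_secs: float  # observed response time at that load
    refused: bool         # True if new connections were being refused


def find_limits(samples, max_response_secs=8.0):
    """Return (performance_limit, ultimate_limit) in concurrent connections.

    performance_limit: lowest load at which response time exceeded the cutoff.
    ultimate_limit: lowest load at which new connections were refused.
    Either value is None if the test never reached that point.
    """
    performance_limit = None
    ultimate_limit = None
    for s in sorted(samples, key=lambda s: s.connections):
        if performance_limit is None and s.response_secs > max_response_secs:
            performance_limit = s.connections
        if ultimate_limit is None and s.refused:
            ultimate_limit = s.connections
    return performance_limit, ultimate_limit


# Made-up load test results, for illustration only.
samples = [
    Sample(1000, 0.4, False),
    Sample(2000, 1.1, False),
    Sample(3000, 3.5, False),
    Sample(4000, 9.2, False),
    Sample(5000, 15.0, True),
]

performance_limit, ultimate_limit = find_limits(samples)
print(performance_limit, ultimate_limit)  # 4000 5000
```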
You need to figure out which one you’re going to use, because in some cases (probably a lot more than we’d like to admit) the business defines “availability” as not only accessible, but performing within a prescribed set of parameters, like responding within 8 seconds*.
Regardless of which metric you use – the ultimate or penultimate limit – that number becomes the “Number that shall not be reached.” Even with the advent of cloud and just-in-time provisioning, servers and systems in the data path between the user and the app should not be pushed past this threshold. Provisioning of additional capacity should occur at some point before full capacity is reached.
If you have the numbers (you did load test, right?) then provisioning additional resources (i.e. scaling) becomes a matter of mathematics.
While I’m simplifying somewhat, this is a basic process through which you can figure out when to start provisioning to avoid the appearance of an outage for some users and not others. It’s very basic, and doesn’t take into consideration factors like the actual rate at which connections are occurring, a historical metric that you can extract from the right load balancing proxy. I use the CPS value here because it’s treated as an absolute maximum; any value you measure will likely be lower than this, and therefore you’ll actually have more time in which to provision new resources. Using the theoretical maximum means baking in a bit of padding, which isn’t always a bad thing given how variable provisioning times can be, especially if the process is manual. That is a good reason to expound on the value of automation (DevOps) and of predictability, but it falls outside the scope of this post.
A modern load balancing proxy will be programmable and offer the means (via an API or other control plane mechanism) by which you can monitor the current load on any given resource (web/app server) in terms of connections, and thus recognize when it is time to provision another resource.
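One plausible way to write that arithmetic down, and this is my reading rather than a formula you’ll find in a vendor manual, is to assume the worst case: new connections arrive at the full CPS rate and nobody disconnects while you provision. The connections that can pile up during provisioning are then CPS multiplied by the provisioning time, and you need to start that far below your limit. The function names and the numbers below are hypothetical.

```python
def provisioning_threshold(connection_limit, max_cps, provision_secs):
    """Connection count at which provisioning should start.

    Worst-case assumption: connections arrive at max_cps for the whole
    provisioning window and none of them close.
    """
    headroom = max_cps * provision_secs
    return max(0, connection_limit - headroom)


def seconds_until_limit(connection_limit, current_connections, max_cps):
    """Worst-case time left before the limit is hit at the max_cps rate."""
    if max_cps <= 0:
        raise ValueError("max_cps must be positive")
    remaining = max(0, connection_limit - current_connections)
    return remaining / max_cps


# Illustrative numbers: a 100,000 connection limit (ultimate or penultimate,
# whichever you chose), 500 CPS, and 2 minutes to bring up a new server.
threshold = provisioning_threshold(100_000, max_cps=500, provision_secs=120)
time_left = seconds_until_limit(100_000, threshold, max_cps=500)

print(threshold)  # 40000
print(time_left)  # 120.0
```

With those numbers you would start provisioning at 40,000 connections. If provisioning takes longer than the limit divided by CPS, the threshold bottoms out at zero, which is the math telling you that you need standing spare capacity rather than just-in-time provisioning.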
You can also, if you’ve got the right proxy, use its programmability to trigger a notification upon reaching the provisioning threshold (or even kick off the provisioning process itself).
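Either approach boils down to the same loop: read the current connection count, compare it to the threshold, and act. Here is a minimal sketch of that loop. It deliberately doesn’t call any real proxy API; get_connections and on_threshold are placeholders you would wire up to whatever your proxy’s control plane and your provisioning system actually expose.

```python
import time


def watch_connections(get_connections, threshold, on_threshold,
                      poll_secs=5.0, max_polls=None):
    """Poll the connection count and fire on_threshold once it is reached.

    get_connections: callable returning the current concurrent connections
                     (placeholder for a call to your proxy's API).
    on_threshold:    callable invoked with the count that crossed the
                     threshold (send a notification, start provisioning...).
    Returns the count that triggered the action, or None if max_polls
    readings were taken without reaching the threshold.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        current = get_connections()
        if current >= threshold:
            on_threshold(current)
            return current
        polls += 1
        time.sleep(poll_secs)
    return None


# Stand-in readings instead of a real proxy, for illustration only.
readings = iter([38_000, 39_500, 41_200])
alerts = []

triggered_at = watch_connections(
    get_connections=lambda: next(readings),
    threshold=40_000,
    on_threshold=alerts.append,
    poll_secs=0,
)

print(triggered_at, alerts)  # 41200 [41200]
```

If your proxy can push an event when the threshold is crossed, prefer that over polling; the loop above is just the lowest common denominator.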
Whichever means you choose, the important point is that you know when to start scaling an app so as to avoid the “up for thee but not for me” phenomenon.
Because knowing is half the battle.
The other half is either red and blue lasers, or programmability. All depends on how you plan on dealing with the situation.
* Yes, I know about the 3 (and 5) second rule. But we’re talking about the “point at which an application’s performance is so bad a user thinks it’s not responding” rule, which is about 8 seconds.