The networking industry has an unusual relationship with high availability. For compute, storage, and applications, failures are more tolerable because they tend to be isolated (a single server going down rarely impacts the rest of the servers). The network’s central role in connecting resources, however, makes failures harder to contain. Because of this, network availability has become an exercise in driving uptime as close to 100 percent as possible.
It is absolutely good to minimize unnecessary downtime, but is the pursuit of perfect availability the right endeavor?
Device uptime vs application availability
We should be crystal clear on one thing: the purpose of the network is not so much to provide connectivity as to make sure applications and tenants have what they need. Insofar as connectivity is a requirement, it is important, but the job doesn’t end just because packets make it from one side to the other. Application availability and application experience matter far more in determining whether infrastructure is meeting expectations.
With that in mind, the focus on individual device uptime is an interesting but somewhat myopic approach to declaring IT infrastructure success. By focusing on building in availability at the device level, it is easy to overlook more network-wide resiliency, which might be far more impactful to the application. Worse yet, things that are easy to measure have a funny way of finding their way into scorecards that drive how companies and teams operate. So we end up with some organizations maniacally focused on the wrong measure of success.
Designing for correctness
On top of looking at the wrong thing, the industry generally takes the wrong approach to availability as well. Far too much of networking is about designing for correctness. The premise behind this approach is that you can test and fix your way into more highly available systems.
But after many decades of trying, it is about time we all admit that producing a defect-free experience, given the complexities of networking, is probably not a realistic goal. Code bases from the major vendors are measured in the tens of millions of lines of code. When you deploy that type of code base over a distributed network, defects are going to happen.
And even if you could wave a defect wand and guarantee perfectly functioning equipment, users would have to be equally diligent before an environment could be declared pristine.
In pursuit of correctness, we focus a lot of effort—on both the vendor and customer sides—on testing. It’s not that testing isn’t necessary (obviously you want to catch what you can), but executing thousands of tests doesn’t guarantee correctness. If you really want to create an infrastructure that produces highly available application experiences, at least as much effort needs to be put into making sure applications are available even in the face of defects.
Put differently, we need to add another arrow to our quiver. In addition to striving for correctness, we need to be architecting for resilience.
Resilience in horizontally-scaled applications
In scale-out applications, the role of the network is to provide the interconnect. Whenever that interconnect is compromised, resources can be stranded, which impacts application availability. So how can the network help?
If the network provides paths from A to B, then the surest path to resilience is ensuring there are multiple paths to get from A to B. This means that the metric that architects ought to be looking at is path diversity. How many different ways are there to get between resources? And how quickly can a device switch from one of those ways to another?
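One way to put a number on path diversity is to count how many paths between two resources share no links, since each of those paths survives the failure of any link on the others. The sketch below is an illustration only: the function name and the small leaf-spine topology are assumptions, not something taken from a particular product. It counts link-disjoint paths with a unit-capacity max-flow search.

```python
from collections import defaultdict, deque


def edge_disjoint_paths(links, src, dst):
    """Count link-disjoint paths between src and dst (unit-capacity max flow)."""
    if src == dst:
        raise ValueError("src and dst must be different nodes")

    # Each undirected link becomes two directed arcs with capacity 1.
    cap = defaultdict(int)
    adj = defaultdict(set)
    for u, v in links:
        cap[(u, v)] += 1
        cap[(v, u)] += 1
        adj[u].add(v)
        adj[v].add(u)

    count = 0
    while True:
        # Breadth-first search for a path that still has spare capacity.
        parent = {src: None}
        queue = deque([src])
        while queue and dst not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if dst not in parent:
            return count

        # Consume one unit of capacity along the path that was found.
        v = dst
        while parent[v] is not None:
            u = parent[v]
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
            v = u
        count += 1


# Hypothetical topology: two leaves (A, B) connected through two spines.
LEAF_SPINE = [("A", "S1"), ("A", "S2"), ("S1", "B"), ("S2", "B")]
DIVERSITY = edge_disjoint_paths(LEAF_SPINE, "A", "B")
```

For this topology the count is 2: A and B stay connected after any single link failure, but not necessarily after two. Adding a spine, or a direct A-to-B link, raises the number.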
Here, some of the age-old mechanisms at the heart of networking actually work against us. The underlying path computation that virtually every routing protocol relies on is based on Shortest Path First, and ECMP only spreads traffic across paths that tie for the lowest cost. Even where multiple paths between resources exist, the network is forced to choose among those with the lowest cost (typically the smallest number of hops), even if those paths are very obviously not the best ones.
If we want to increase path diversity, we need to free ourselves from limitations like these. In effect, where we have become reliant on equal-cost multipathing (ECMP), we need to consider fanning out to all available paths. We need to look at non-equal-cost multipathing to increase the path diversity between scale-out application resources.
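To make the difference concrete, the sketch below enumerates every loop-free path between two nodes in a small, made-up topology and compares the set ECMP would use (only the paths tied for the fewest hops) with the full set. The inverse-hop-count traffic split is purely an assumption for illustration, and exhaustive enumeration like this is only practical on small graphs.

```python
from fractions import Fraction


def build_adjacency(links):
    """Turn a list of undirected links into an adjacency map."""
    adj = {}
    for u, v in links:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def simple_paths(adj, src, dst, seen=()):
    """Yield every loop-free path from src to dst as a tuple of nodes."""
    seen = seen + (src,)
    if src == dst:
        yield seen
        return
    for nxt in adj.get(src, []):
        if nxt not in seen:
            yield from simple_paths(adj, nxt, dst, seen)


def ecmp_paths(paths):
    """Keep only the paths that tie for the fewest hops."""
    best = min(len(p) for p in paths)
    return [p for p in paths if len(p) == best]


def inverse_hop_shares(paths):
    """Assumed policy: weight each path by 1 / hop count, then normalize."""
    weights = [Fraction(1, len(p) - 1) for p in paths]
    total = sum(weights)
    return {p: w / total for p, w in zip(paths, weights)}


# Hypothetical topology: two 2-hop paths via spines, one 3-hop path via C and D.
LINKS = [("A", "S1"), ("A", "S2"), ("S1", "B"), ("S2", "B"),
         ("A", "C"), ("C", "D"), ("D", "B")]
ADJ = build_adjacency(LINKS)
ALL_PATHS = list(simple_paths(ADJ, "A", "B"))
EQUAL_COST = ecmp_paths(ALL_PATHS)
SHARES = inverse_hop_shares(ALL_PATHS)
```

Here ECMP uses two of the three available paths and leaves the longer one idle. Under the assumed weighting, the 3-hop path would carry a quarter of the traffic and each 2-hop path three eighths.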
Make failover fast
Simply having multiple ways to get from A to B is not sufficient. Individual switches have to be able to quickly detect that something is wrong with a particular path and then distribute workloads along other available paths.
The need for fast failover adds two additional requirements to the underlying network: failure detection must be fast, and backup paths must be computed ahead of time. If failure detection is fast but paths are not precomputed, there is a delay between the failure and the path calculations. To drive availability up, the time between failure and failover must be minimized.
SDN provides a useful architectural framework here. With a global view of the network, the controller can identify all available paths, precompute the paths for use by the switches, and then distribute those paths before they are needed. When an issue is detected, the switch can immediately forward traffic along backup paths.
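The division of labor described here can be sketched in a few lines. In this hypothetical example, a controller function with a global view computes a primary path plus one backup path for every link on it, and a switch object simply looks up the stored backup when a link goes down. The names (`precompute`, `Switch`, `on_link_down`) are invented for illustration and do not correspond to any real controller API. The sketch handles a single failure only.

```python
from collections import deque


def build_adjacency(links):
    """Turn a list of undirected links into an adjacency map."""
    adj = {}
    for u, v in links:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def shortest_path(adj, src, dst, banned=frozenset()):
    """Fewest-hop path via breadth-first search, skipping links in `banned`."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj.get(u, []):
            if v not in parent and frozenset((u, v)) not in banned:
                parent[v] = u
                queue.append(v)
    return None


def precompute(adj, src, dst):
    """Controller side: primary path plus a backup for each primary link."""
    primary = shortest_path(adj, src, dst)
    backups = {}
    if primary is None:
        return None, backups
    for u, v in zip(primary, primary[1:]):
        link = frozenset((u, v))
        backups[link] = shortest_path(adj, src, dst, banned={link})
    return primary, backups


class Switch:
    """Switch side: failover is a table lookup, not a path calculation."""

    def __init__(self, primary, backups):
        self.active = primary
        self.backups = backups

    def on_link_down(self, u, v):
        backup = self.backups.get(frozenset((u, v)))
        if backup is not None:
            self.active = backup
        return self.active


# Hypothetical topology, distributed to the switch before any failure occurs.
LINKS = [("A", "S1"), ("A", "S2"), ("S1", "B"), ("S2", "B")]
PRIMARY, BACKUPS = precompute(build_adjacency(LINKS), "A", "B")
SWITCH = Switch(PRIMARY, BACKUPS)
```

The point of the design is that all of the graph work happens in `precompute`, ahead of time. When the switch detects that the A-to-S1 link is down, `on_link_down` only performs a dictionary lookup and starts forwarding along the path through S2.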
The bottom line
Architecting a perfect device is impossible, and implementing a perfect network is doubly so. For application architects charged with improving uptime, this means a change in design criteria. Rather than looking at MTBF numbers and relying on ITIL practices to reduce the likelihood of downtime, architects should embrace calamity, designing for downtime rather than trying to avoid it. Only by changing the mindset and explicitly building for what happens when things inevitably go wrong can anyone really impact application uptime in a meaningful way.
[Today’s fun fact: Beavers were once the size of bears. Not sure if bears have grown or beavers have shrunk though.]