For those of you into expanding your experience through reading, there is a foundational reference at the core of many MBA programs. The book, Eliyahu Goldratt’s The Goal, introduces a concept called the Theory of Constraints. Put simply, the Theory of Constraints is the premise that systems will tend to be limited by a very small number of constraints (or bottlenecks). By focusing primarily on the bottlenecks, you can remove limitations and increase system throughput.
The book uses this theory to talk through management paradigms as the main character works through a manufacturing problem. But the theory actually applies to all systems, making its application useful in more scenarios than management or manufacturing.
Understanding the Theory of Constraints
Before we get into networking applications, it is worth walking through some basics about the Theory of Constraints. Imagine a simple set of subsystems strung together in a larger system. Perhaps, for example, software development requires two development teams, a QA team, and a regressions team before new code can be delivered.
If output relies on each of these subsystems, then the total output of the system as a whole is determined by the lowest-output subsystem. For instance, imagine that SW1 is capable of producing 7 widgets, SW2 5 widgets, QA 4 widgets, and Regressions 3 widgets. The total number of widgets that get through the system is 3, and you create a backlog of partially-completed widgets with each run through the process.
If you want to increase the output of the system from 3 to 4, the only thing to focus on is increasing the Regressions output from 3 to 4. Any other optimizations across the other subsystems will not impact the total throughput for the system.
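To make the arithmetic concrete, here is a minimal Python sketch of the widget example. The function names are made up for illustration, and the backlog calculation makes a simplifying assumption: within a run, each stage only receives what the previous stage finished, and leftover work from earlier runs is ignored.

```python
# Per-run capacity of each stage, in pipeline order (numbers from the example above).
stages = {"SW1": 7, "SW2": 5, "QA": 4, "Regressions": 3}


def system_throughput(capacities):
    """The system delivers only as much as its slowest stage can handle."""
    return min(capacities.values())


def backlog_per_run(capacities):
    """Partially completed widgets left waiting in front of each stage after one run.

    Assumption: each stage receives what the previous stage finished in this run.
    """
    backlog = {}
    arriving = None
    for name, capacity in capacities.items():
        if arriving is None:
            arriving = capacity  # the first stage simply produces at its capacity
            continue
        backlog[name] = max(0, arriving - capacity)
        arriving = min(arriving, capacity)
    return backlog


print(system_throughput(stages))  # 3
print(backlog_per_run(stages))    # {'SW2': 2, 'QA': 1, 'Regressions': 1}

# Speeding up a non-bottleneck stage changes nothing; speeding up the bottleneck does.
faster_sw1 = {**stages, "SW1": 10}
faster_regressions = {**stages, "Regressions": 4}
print(system_throughput(faster_sw1))          # 3
print(system_throughput(faster_regressions))  # 4
```

Note that once Regressions reaches 4, QA (also at 4) becomes a co-bottleneck, so the next improvement has to lift both.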
Theory of Constraints in a network context
We can apply the same basic framework to the datacenter. If you have servers that are separated by some number of datacenter links, then the total capacity from server to server is determined not by the aggregate bandwidth in the datacenter but by the smallest link in the path.
Part of the design philosophy in datacenter networks is that you cannot know with certainty which servers are talking to which servers, so predicting bandwidth utilization is either prohibitively difficult or outright impossible. In the absence of information, you build out capacity across every link in the datacenter so that your weakest link is sufficiently provisioned to handle any load.
Of course, overbuilding a datacenter of even moderate size is costly. So you start playing the odds a little bit. You assume that not all servers will be driving traffic at the same time, so you can safely oversubscribe the network and rely on the exceedingly small chance that traffic ends up being shunted through a single link at one time.
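Playing the odds is usually expressed as an oversubscription ratio: the bandwidth facing the servers divided by the bandwidth facing the rest of the network. The sketch below works through one hypothetical access switch. The port counts and speeds are illustrative assumptions, not figures from any particular product or design.

```python
def oversubscription_ratio(server_ports, server_port_gbps, uplinks, uplink_gbps):
    """Server-facing bandwidth divided by uplink bandwidth for one switch."""
    downstream = server_ports * server_port_gbps
    upstream = uplinks * uplink_gbps
    return downstream / upstream


def worst_case_share_gbps(server_ports, uplinks, uplink_gbps):
    """Uplink bandwidth per server if every server sends off-switch at once."""
    return (uplinks * uplink_gbps) / server_ports


# Hypothetical switch: 48 x 10 Gb/s server ports, 4 x 40 Gb/s uplinks.
ratio = oversubscription_ratio(48, 10, 4, 40)
share = worst_case_share_gbps(48, 4, 40)

print(f"oversubscription {ratio:.1f}:1")             # oversubscription 3.0:1
print(f"worst-case share {share:.2f} Gb/s per server")  # worst-case share 3.33 Gb/s per server
```

A 3:1 ratio is a bet that, on average, no more than a third of the server-facing bandwidth needs to leave the switch at the same moment.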
But also of course, in networks, stuff happens. So you protect yourself from the statistical anomalies by increasing buffers across the board, essentially trading the cost of bandwidth for the cost of memory. In doing so, you pick up additional queuing delays, but so long as the applications can deal with that, it might not be that big a tradeoff.
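The cost of that trade is easy to estimate: a full buffer adds roughly its size divided by the drain rate of the link. The sketch below uses hypothetical numbers (a 12 MB buffer in front of a 10 Gb/s link) and assumes the buffer is dedicated to that one link and drains at line rate.

```python
def max_queuing_delay_ms(buffer_bytes, link_gbps):
    """Time for a completely full buffer to drain onto the link, in milliseconds."""
    buffer_bits = buffer_bytes * 8
    seconds = buffer_bits / (link_gbps * 1e9)
    return seconds * 1000


# Hypothetical: 12 MB of buffer draining onto a 10 Gb/s link.
delay_ms = max_queuing_delay_ms(12_000_000, 10)
print(f"{delay_ms:.1f} ms")  # 9.6 ms
```

Doubling the buffer doubles the worst-case delay, which is why the tradeoff only works when the applications tolerate the added latency.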
The nature of the problem
Understand that the real problem to be solved here is ensuring adequate transport bandwidth along the bottleneck link. Not every link is a bottleneck link, so distributing bandwidth uniformly across the datacenter is a bit of a heavy-handed approach. The argument here generally goes something along the lines of: “Bandwidth solves everything, and bandwidth is cheap.” Keep in mind that this argument is typically made in the same breath as “Networking is too expensive.”
The reason over-provisioning and buffering are acceptable is that most networks, at least from a physical perspective, are extremely static. You cannot add or move cables, change connectivity, or otherwise alter the physical characteristics that determine the smallest link.
The protocols that determine path are equally limiting. If a link is congested, you cannot just forward packets over other links because virtually every protocol in the network relies on underlying Shortest Path First algorithms. Sure, you can load balance across a small number of equal-cost paths, but once you are on those paths, there is no other redress. The best you can do is make as many equal-cost paths as possible, which is essentially over-provisioning, except from an interconnect perspective.
Another way to solve the problem
Solving the constrained-link problem is actually not conceptually difficult. If you have traffic that exceeds the capacity of a link, you would add additional capacity to it if you could. And if that proved impossible (or just not pragmatic), you would forward traffic along other paths to avoid congestion.
In the first case, the missing ingredient is movable capacity. If paths are statically tied to the physical cabling, then moving capacity is not feasible. However, technology has existed for years on the carrier side of the house that allows for the dynamic allocation of bandwidth via programmable transceivers. Applying that same technology in the datacenter makes perfect sense. However, introducing optical technology has traditionally been more expensive than just over-buying capacity. But what happens if costs come down? Well, they have.
In the second case, the missing ingredient has been a different set of pathing algorithms that do not rely on SPF. Non-equal-cost multi-pathing would allow traffic to take alternate paths. Of course, if those paths require additional switch hops, then the question is whether the additional switch lookups incur a steeper latency penalty than the queuing delay on congested links. But combining non-equal-cost multi-pathing with optical paths allows for direct connectivity minus the queuing delays. You end up avoiding congestion AND the additional switch hops in between.
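One way to picture a path computation that is not shortest-path-first is a "widest path" search, which picks the route whose smallest link has the most available capacity. This is a generic textbook variation on Dijkstra's algorithm, offered only as an illustration and not as the algorithm any particular product uses. The topology and the available-capacity figures below are hypothetical.

```python
import heapq


def widest_path(graph, src, dst):
    """Return (bottleneck, path) for the route that maximizes the smallest link.

    graph maps node -> {neighbor: available capacity in Gb/s}.
    """
    best = {src: float("inf")}       # best known bottleneck to each node
    prev = {}                        # predecessor on the best route found
    heap = [(-float("inf"), src)]    # max-heap via negated widths
    while heap:
        neg_width, node = heapq.heappop(heap)
        width = -neg_width
        if node == dst:
            break
        if width < best.get(node, 0):
            continue  # stale heap entry
        for neighbor, capacity in graph.get(node, {}).items():
            candidate = min(width, capacity)
            if candidate > best.get(neighbor, 0):
                best[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (-candidate, neighbor))
    if dst not in best:
        return 0, []
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    path.reverse()
    return best[dst], path


# Hypothetical available capacity per link. A-B-D is the fewest hops,
# but its first link is congested and has only 2 Gb/s to spare.
graph = {
    "A": {"B": 2, "C": 8},
    "B": {"D": 10},
    "C": {"E": 8},
    "E": {"D": 6},
    "D": {},
}

print(widest_path(graph, "A", "D"))  # (6, ['A', 'C', 'E', 'D'])
```

The route it picks is one hop longer, which is exactly the latency question raised above: the extra lookup has to cost less than the queuing delay it avoids.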
The bottom line
The real reason the Theory of Constraints remains a statistical game in networking has nothing to do with technology. The reality is that once anyone grows accustomed to solving a problem in a particular way, they use that method to solve it over and over again. When the current approaches were devised, the technology needed to do things more sensibly was either too expensive or didn’t exist. That is no longer the case. New capabilities open up new options, some of which might even be faster than the traditional approaches. Check out this paper from Google research: A 10 µs Hybrid Optical-Circuit/Electrical-Packet Network for Datacenters