Through http://blog.ipspace.net I landed on this article on acm.org discussing the complexity of distributed systems. Through some good examples, George Neville-Neil makes it clear that creating and scaling distributed systems is very complex and “any one that tells you it is easy is either drunk or lying, and possibly both”.
Networks are of course inherently distributed systems. Almost everyone who has managed a good-sized network knows that, as in the example in the article, minor changes in traffic or connectivity can have huge implications for the overall performance of a network. In my time supporting some very large networks I have seen huge chain reactions of events triggered by what appeared to be minor issues.
Very few networks are extensively modeled before they are implemented. Manufacturers of machines, cars and many other things go through extensive modeling to understand the behavior of what they create and the consequences of their design choices. Using modeling, they look at all possible inputs and outputs, conditions, failure scenarios and anything else they can think of to see how their product behaves.
There are few if any true modeling tools for networks. We build networks with extensive distributed protocols to control connectivity and reachability. The protocols and their behaviors are fairly well understood, and we design our networks based on our knowledge of how these protocols will react to failures. There is very little modeling of actual or expected traffic patterns. We have all seen networks melt down because of forwarding loops or excessive multicast or broadcast traffic that affected portions of the network we had not expected to be affected, followed by a chain reaction of failures. I have seen single transceivers go through soft failures (lots of errors and intermittent link state) and bring large networks to their knees.
Of course we learn from our mistakes (mostly) and have created protection mechanisms for the things we know can go wrong. We created “hello” based protocols to guard against those soft transceiver failures. We created rate limiting and overload protection mechanisms to guard against those broadcast and multicast storms. We have implemented dampening mechanisms in many of our protocols to ensure state transitions can be managed more gracefully.
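To make the dampening idea concrete, here is a minimal sketch loosely modeled on penalty-based flap damping: every state change adds a penalty, the penalty decays exponentially over time, and the link or route stays suppressed while the penalty is high. The class name `FlapDamper` and all thresholds are assumptions picked for illustration, not values from any particular protocol or vendor.

```python
class FlapDamper:
    """Toy penalty-based dampening with exponential decay (illustrative values)."""

    def __init__(self, penalty_per_flap=1000.0, suppress_at=2000.0,
                 reuse_at=750.0, half_life=15.0):
        self.penalty_per_flap = penalty_per_flap
        self.suppress_at = suppress_at    # suppress at or above this penalty
        self.reuse_at = reuse_at          # release once penalty drops below this
        self.half_life = half_life        # seconds for the penalty to halve
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now):
        # Halve the penalty for every half_life seconds that have elapsed.
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_life)
        self.last_update = now

    def flap(self, now):
        """Record a state change at time `now`; return True if suppressed."""
        self._decay(now)
        self.penalty += self.penalty_per_flap
        if self.penalty >= self.suppress_at:
            self.suppressed = True
        return self.suppressed

    def is_suppressed(self, now):
        """Check the suppression state at time `now`, releasing if decayed."""
        self._decay(now)
        if self.suppressed and self.penalty < self.reuse_at:
            self.suppressed = False
        return self.suppressed


# Three flaps within two seconds push the penalty over the suppress threshold.
damper = FlapDamper()
for t in (0.0, 1.0, 2.0):
    damper.flap(t)
```

Note that the thresholds are exactly the kind of artificial limit that is tuned from past experience: the mechanism only protects against the flapping patterns we anticipated when choosing them.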
But in the end, like the example in the article, we have created artificial limits and protection mechanisms. And they are based on the things we know, because they have hurt us in the past. Less protective but more proactive, we have implemented hashing mechanisms in our networks to distribute the traffic. As explained, these are generic mechanisms designed for uniform, best-case scenarios. Provide them with less uniform traffic, and the resulting distribution is not what we want it to be. The “elephant flow”, the favorite example of 2014, shows perfectly how a single change in a network can drastically change its behavior and performance.
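The effect of hashing on non-uniform traffic can be sketched in a few lines. The example below is an illustration under stated assumptions, not how any specific switch hashes: flows are identified by a hypothetical 5-tuple, `pick_link` uses SHA-256 purely as a convenient stable hash, and the addresses, ports and byte counts are made up.

```python
import hashlib


def pick_link(five_tuple, num_links):
    """Map a flow to a link index with a stable hash of its 5-tuple."""
    key = "|".join(str(field) for field in five_tuple).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_links


def link_loads(flows, num_links):
    """Sum the bytes each link carries; flows is a list of (five_tuple, bytes)."""
    loads = [0] * num_links
    for five_tuple, size in flows:
        loads[pick_link(five_tuple, num_links)] += size
    return loads


# 1000 small flows ("mice") of equal size, differing only in source port.
mice = [(("10.0.0.1", "10.0.1.1", 6, 1024 + i, 443), 1_000)
        for i in range(1000)]
# One elephant flow carrying as many bytes as all the mice combined.
elephant = (("10.0.0.2", "10.0.1.2", 6, 50000, 443), 1_000_000)

uniform = link_loads(mice, 4)
skewed = link_loads(mice + [elephant], 4)

print("mice only:     ", uniform)
print("with elephant: ", skewed)
```

The hash sees only the flow identifier, never the flow size, so the elephant lands on one link in its entirety while the other three links carry exactly what they carried before.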
Yes, we have gotten better at protecting this distributed animal that is the network. We have stubbed our toes, bashed our heads and figured out how to protect against it. But we have not taken it to that next level of engineering. The network is a product. It is a highly distributed system where we know way too little about the inputs. We do not model the network and its behavior. We do not test its resiliency. I have raised this before, both in this forum and with customers. Every customer has maintenance windows to do proactive maintenance on servers, OSs and the like. Every school and company has fire drills. I have yet to meet a single customer that actually schedules network failure drills to ensure that the resiliency designed into the distributed system actually performs as expected.
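Even a very small model can answer the first question a failure drill would ask. The sketch below is a toy under stated assumptions: the topology, the node names and the function names are hypothetical, and it checks connectivity only, not capacity or protocol convergence. It fails every link in turn and reports which single failures partition the network.

```python
from collections import deque


def reachable(links, start):
    """Return the set of nodes reachable from `start` over undirected links."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def single_link_failures(nodes, links):
    """Return the links whose loss leaves some node unreachable."""
    critical = []
    for failed in links:
        remaining = [link for link in links if link != failed]
        if reachable(remaining, nodes[0]) != set(nodes):
            critical.append(failed)
    return critical


# A triangle of core nodes (a, b, c) with node d hanging off a single link.
nodes = ["a", "b", "c", "d"]
links = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]

critical = single_link_failures(nodes, links)
print(critical)
```

A model like this only confirms what the design intends. A real drill is still needed to show that the deployed protocols actually converge the way the model assumes.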
We have to start treating the network like the complex distributed system it is. That means that complex algorithms that manage the distribution of traffic and take failure scenarios into account need to be in charge of our networks. Local decision making based on partially complete information can never be enough to truly create a network that pushes the boundaries of what it is capable of. It is not easy, but we have to recognize what the network is. And then approach it with the right tools and methods to model and manage it.
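As a rough illustration of what more complete information buys, the sketch below places flows with knowledge of their sizes instead of hashing them blindly. It is a simple greedy heuristic (largest flow first, each onto the currently least-loaded link), not a claim about how any real controller works, and it assumes flow sizes are known in advance, which is a strong assumption. The flow names and sizes are made up.

```python
def place_globally(flows, num_links):
    """Greedy placement: largest flows first, each onto the least-loaded link.

    `flows` maps a flow name to its size. Returns (placement, loads).
    """
    loads = [0] * num_links
    placement = {}
    ordered = sorted(flows.items(), key=lambda item: item[1], reverse=True)
    for name, size in ordered:
        # Pick the link carrying the least traffic so far (lowest index on ties).
        link = min(range(num_links), key=lambda i: loads[i])
        placement[name] = link
        loads[link] += size
    return placement, loads


# One elephant and nine mice sharing two links.
flows = {"elephant": 900}
flows.update({f"mouse{i}": 100 for i in range(1, 10)})

placement, loads = place_globally(flows, 2)
print(loads)
```

With the sizes known, the elephant gets one link to itself and the mice share the other, a result that a size-blind local hash can only reach by luck.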