There is probably no group of companies in the infrastructure space more envied than the major cloud properties. Senior executives across the globe are actively engaged in discussions with their teams to determine how to evolve their infrastructure operations practices. They throw out words like DevOps, automation, and machine learning, all in a bid to transform their practices to be more agile and scalable.
But some of the most foundational principles that make cloud operations work are too often overlooked.
For those who have traveled within the United States, Southwest is a well-known airline. It was founded in Texas with the corporate objective of democratizing the skies. At the time, air travel was a luxury. For the average family, it was unattainable because it was simply too expensive to fly. Southwest took on the mission to change that, effectively aiming to make air travel affordable to the everyman.
At the center of their mission is the strategic thesis that a plane on the ground is not making money. So they focus on one primary thing: get planes back in the air as quickly as possible. Everything they do in their operations is to speed up gate turnaround times so that planes are always making money. If planes are making more money, airfare will be cheaper, and their corporate mission will be served.
If you have ever wondered why Southwest does not assign seats and boards in a free-for-all, it’s because that’s the fastest way to board a plane. Some of you engineers are probably imagining that a back-to-front or outside-in approach is faster, but this has been tested, and the fastest way to board a plane is free-for-all. The reason no one else does this? Because passengers hate the experience. But Southwest isn’t optimizing for customer satisfaction. If it doesn’t allow them to lower cost, it isn’t a priority (also why they have traditionally not had meals, by the way).
So how else does Southwest minimize gate turnaround times? It doesn’t matter where you are flying. If you are on Southwest, you are on a 737. Flying 500 miles? 737. Flying 2000 miles? 737. Flying with 140 people? 737. Just 17 of you on the flight? 737. Southwest has standardized their fleet on the Boeing 737.
The Power of Uniformity
Because they fly the same plane, every pilot can fly every plane. Every crew can work every plane. Every mechanic can service every plane. Every spare part can be used on every plane. When the operating environment is uniform, it allows for all kinds of simplification, which leads to optimization.
When the surrounding tools and processes can be narrow in scope, they don’t have to consider a lot of variation. This means that they can be uniquely tailored to very specific situations, delivering very specific outcomes. This is the key to operating anything efficiently at scale.
Cloud Properties and Uniformity
It is well known that cloud architectures run BGP all the way down to the top-of-rack switch. But why?
BGP fans will tell you that BGP is superior in all kinds of ways. But the operational optimization required to make clouds work is only possible if you standardize on the smallest set of technologies possible. In many ways, the standardization on draft-lapukhov, since published as RFC 7938, is about operations more than transport (though non-blocking architectures are obviously critical).
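To make the operational point concrete, here is a minimal sketch of what "one small set of technologies" can look like in tooling: every top-of-rack switch gets its BGP settings from a single function of its rack index, with no per-device exceptions. Everything specific here is an assumption for illustration and not something the cloud operators have published: the `TorBgpProfile` type, the `tor-NNNN` naming scheme, the `10.0.0.0` loopback pool, the one-ASN-per-rack allocation, and the default of four spines.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address

# Assumed values for illustration only; none of these come from the article.
PRIVATE_ASN_BASE = 64512                   # start of the 16-bit private ASN range
PRIVATE_ASN_LAST = 65534                   # end of the 16-bit private ASN range
LOOPBACK_BASE = IPv4Address("10.0.0.0")    # hypothetical loopback pool


@dataclass(frozen=True)
class TorBgpProfile:
    """Hypothetical per-rack BGP settings; identical in shape for every rack."""
    hostname: str
    asn: int
    router_id: str
    ecmp_paths: int


def tor_profile(rack_index: int, spine_count: int = 4) -> TorBgpProfile:
    """Derive a ToR's BGP settings from its rack index alone."""
    asn = PRIVATE_ASN_BASE + 1 + rack_index
    if rack_index < 0 or asn > PRIVATE_ASN_LAST:
        raise ValueError(f"rack index {rack_index} is outside the ASN pool")
    return TorBgpProfile(
        hostname=f"tor-{rack_index:04d}",
        asn=asn,
        router_id=str(LOOPBACK_BASE + 1 + rack_index),
        ecmp_paths=spine_count,   # one equal-cost path per spine uplink
    )


if __name__ == "__main__":
    for index in range(3):
        print(tor_profile(index))
```

Because the function takes no device-specific input, any tool that consumes these profiles can assume the same shape for every rack, which is the property the rest of the operational stack depends on.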
The cloud companies innovate not in the infrastructure, but in the operations of that infrastructure. For their tools and processes to work, they need them to be applied ubiquitously across all of the infrastructure. That means that anything that causes variance has to be designed out. Anything that requires contextual differences needs to be justified heavily, or it is architected out of existence.
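One way to picture "designing variance out" is to measure it first. The sketch below counts how many distinct configuration profiles exist in a fleet and lists the devices that deviate from the most common one. The inventory, its field names, and the `variance_report` helper are all hypothetical, invented here for illustration; a real inventory would come from whatever source of truth you already run.

```python
from collections import Counter

# Hypothetical inventory; names, fields, and values are invented for illustration.
INVENTORY = [
    {"name": "tor-0001", "role": "tor", "routing": "bgp", "os": "nos-1.2"},
    {"name": "tor-0002", "role": "tor", "routing": "bgp", "os": "nos-1.2"},
    {"name": "tor-0003", "role": "tor", "routing": "ospf", "os": "nos-1.2"},
    {"name": "tor-0004", "role": "tor", "routing": "bgp", "os": "nos-1.1"},
    {"name": "tor-0005", "role": "tor", "routing": "bgp", "os": "nos-1.2"},
]

# The attributes that together define a device's "profile" in this sketch.
PROFILE_KEYS = ("role", "routing", "os")


def profile_of(device, keys=PROFILE_KEYS):
    """Reduce a device record to the tuple of attributes being compared."""
    return tuple(device[k] for k in keys)


def variance_report(devices, keys=PROFILE_KEYS):
    """Return (number of distinct profiles, names of devices off the most common one)."""
    if not devices:
        return 0, []
    counts = Counter(profile_of(d, keys) for d in devices)
    standard = counts.most_common(1)[0][0]
    outliers = [d["name"] for d in devices if profile_of(d, keys) != standard]
    return len(counts), outliers


if __name__ == "__main__":
    distinct, outliers = variance_report(INVENTORY)
    print(distinct, outliers)   # 3 ['tor-0003', 'tor-0004']
```

In a perfectly uniform fleet the first number is 1 and the list is empty. Every entry in the list is a device that either needs a heavy justification or needs to be brought back in line.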
In a word, the cloud companies want uniformity. Randy Bias’s phrase about servers being treated as cattle rather than pets is how to think of all infrastructure. It’s not about unique snowflakes. The more uniform, the more optimized. And if you want to extract every iota of operational performance, you have to be draconian in your application of this principle.
737s Are Not Analogous to Hardware
When I tell this story, it’s important to point out that the thing to learn is not that all the hardware has to be the same. This isn’t a case of going single vendor. This is about driving variance out, which is a level below that. It’s not about the vendor so much as it is about the protocols that drive network behavior. Having a rainbow of devices from the same vendor, all with different behavior, would create the same madness that divergent architectures create. This is about common technology building blocks.
Uniformity and Disaggregation
A lot of people seem to think that disaggregation is so important to cloud properties because of the cost of networking equipment. This is a convenient way to think about it, though it lacks nuance. The truth is that the cloud companies are basically printing money, and they order in enough volume that they could negotiate great deals with any vendor. Even if they could save a few dollars on one vendor over another, that saving would have to be weighed against the operational implications.
The real reason that disaggregation matters is that it breaks large systems down into smaller parts. This means that those parts can be developed independently, which in turn allows for the cloud properties to settle on a set of building blocks that are uniform across as much of the infrastructure as is practical.
Put differently, disaggregation is a means of achieving uniformity, especially in the face of multi-vendor solutions. If everyone is using Broadcom-based switching silicon, for example, the ability to lock that component in and then iterate on top is interesting. If everyone is using a similar protocol stack, then locking that in and moving higher up makes sense.
Disaggregation goes downward too. If everyone is using the same optics and cables, the principle holds up. For larger platforms, they would like to settle on a common set of silicon, so the isolation of more than just Broadcom will ultimately be a discussion as well.
The Bottom Line
Companies that aspire to be more cloud-like need to recognize that their own decisions impact their ability to survive the migration. Every legacy feature deemed absolutely critical is another impediment to uniformity. This, more than anything, will render efforts to evolve moot. It’s not about the tools and the machine learning that help drive automation. If you are not fundamentally building on the simplest possible infrastructure, no amount of tooling is going to save you.
In essence, evolving like the clouds means removing the evolutionary cruft that has attached itself to your infrastructure over decades. If you want to grow, you have to start by shrinking.