Explicitly-defined failure domains in the datacenter
While the bulk of the networking industry’s focus is on CapEx and automation, the two major trends driving changes in these areas will have a potentially greater impact on what matters most: availability. In fact, although datacenter purchasing decisions skew towards CapEx, the number one requirement for most datacenters is uptime.
If availability is so important, why is it not already the number one purchasing criterion?
First, it’s not that availability doesn’t matter. It’s more that when everyone is building the same thing, availability ceases to be a differentiating point. Most switch vendors have converged on Broadcom silicon and a basic set of features required to run datacenter architectures that have remained largely unchanged for a decade or more. But are those architectures going to continue unscathed?
SDN and network architecture
For those who believe in the transformative power of SDN, the answer is unequivocally no. If, after all of the SDN work is said and done, we emerge with the same architectures with an extra smidge of automation sprinkled on top, we will have grossly under-delivered on what should be the kind of change that happens once every couple of decades.
The rise of a central controller is about more than just pushing provisioning to a single pane of glass. It is about central control over a network, using a global perspective to make intelligent resourcing and pathing decisions. While this does not necessarily mean that legacy networking cannot (or should not) co-exist, the model is dramatically different from what exists today.
Switching is just a stop along the way for bare metal
Bare metal switching will also change IT infrastructure in a meaningful way. Again, the scope of public discourse is fairly narrow. The story goes something like this: if someone makes a commodity switch, pricing will come down. But there are two things that are really happening here.
First, what we are really seeing is a shift in monetization from hardware to software. This shift should not be surprising, as software investment on the vendor side has dwarfed hardware R&D for 15 years now. The real change here is that the companies that emerge have a tolerance for lower margins than the behemoths already entrenched in the space. Anyone can drop price; the question is at what margins a business is still attractive. What will play out over the next couple of years is a game of chicken with price.
Second, the objective of bare metal switching is less about the switching and more about the bare metal. Taken to its logical conclusion, the hope has to be that all infrastructure is eventually run on the same set of hardware. Whether something is a server, a storage device, an appliance, or a switch should ultimately be determined by the software that is being run on it. In this case, we see multi-purpose devices whose role depends on the context in which they are deployed. This would eventually allow for the fungibility of resources across what are currently very hard silos.
Domain, Domain, Domain
Both SDN and bare metal lead to architectures very different from those that exist today. But as architects consider how they will evolve their own instantiations of these technologies, they need to be clear about a couple of facts that get glossed over.
If availability really is the number one requirement for datacenters, then architectures need to explicitly consider how they impact overall resource availability. Consider that there are a number of sources for downtime:
- Human error – By far the leading source of downtime in most networks, human error is why there is such momentum around things like ITIL. Put differently, when is uptime the highest for most datacenters? Holidays, when everyone is away from the office.
- System issues – After human error, the next biggest cause of downtime is issues in the systems themselves. These are most likely software bugs, but they can include device and link failures as well.
- Maintenance – Another major contributor to downtime is overall infrastructure maintenance. When individual systems need to be upgraded or replaced, there is frequently some interruption to service. Of course, if maintenance is planned, then the impact on overall downtime should be low.
- Other – The Other category covers things like power outages and backhoes.
Of these, SDN promises to improve the first one. By expanding the management domain, it reduces the number of opportunities for pesky humans to make mistakes. Automated workflows that are executed from a single point of control and orchestrated across disparate elements of the infrastructure should help drive the number of provisioning mistakes in the datacenter down.
Additionally, a central point of control helps improve visibility (or at least it will over time). This helps operators diagnose issues more quickly, which will lower the Mean-Time-to-Repair (MTTR) for network-related issues.
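To make the MTTR point concrete, here is a minimal Python sketch using the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The function names and any figures plugged into them are illustrative assumptions, not measurements from a real network.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up.

    mtbf_hours: mean time between failures (must be positive).
    mttr_hours: mean time to repair (must be non-negative).
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_minutes(avail: float) -> float:
    """Expected minutes of downtime per year at a given availability."""
    if not 0.0 <= avail <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - avail) * MINUTES_PER_YEAR
```

With a hypothetical MTBF of 1,000 hours, cutting MTTR from 4 hours to 2 moves the result from availability(1000, 4) to availability(1000, 2), which roughly halves the expected annual downtime. Faster diagnosis shows up directly in the uptime number, even if failures are no less frequent.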
But the management domain is not the only one that matters. There are at least two others that impact downtime: failure domains and maintenance domains. The impact of SDN and bare metal on each of these needs to be explicitly understood.
While there are tremendous operational benefits of collapsing domains under a single umbrella, one thing that becomes more difficult is managing the impact of failures when they do occur.
For instance, if a network is under the control of a single SDN controller, what happens if that controller is not reachable? If the controller is an active part of the data path, there is one set of outcomes. If the controller is not an active part of the data path, there is a different set of outcomes.
The point here is not to advocate for one or the other, but rather to point out that architects need to be explicit in defining the failure domain so that they can adjust operations appropriately. For instance, it might be the case that you prefer to balance the control benefits with failure scenarios, opting to create several smaller management domains, each with a correspondingly smaller failure domain. This gives you some benefit over a completely distributed management environment (where management domains are defined by the devices themselves) without putting the entire network under the same failure domain.
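The trade-off above can be sketched in a few lines of Python. This is a toy model under stated assumptions: ManagementDomain, partition, and blast_radius are hypothetical names, switches are assigned to controllers round-robin, and losing a controller is assumed to affect every switch in its domain and nothing outside it.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ManagementDomain:
    """A controller and the switches it manages (hypothetical model)."""
    controller: str
    switches: tuple[str, ...]


def partition(switches: list[str], n_domains: int) -> list[ManagementDomain]:
    """Split switches round-robin into n_domains management domains."""
    if n_domains <= 0:
        raise ValueError("n_domains must be positive")
    return [
        ManagementDomain(f"controller-{i}", tuple(switches[i::n_domains]))
        for i in range(n_domains)
    ]


def blast_radius(domains: list[ManagementDomain], failed_controller: str) -> float:
    """Fraction of all switches that sit in the failed controller's domain."""
    total = sum(len(d.switches) for d in domains)
    if total == 0:
        return 0.0
    affected = sum(
        len(d.switches) for d in domains if d.controller == failed_controller
    )
    return affected / total
```

With twelve switches, partition(switches, 1) puts the whole network behind one controller, so losing it touches every switch. partition(switches, 4) gives up some of the global view, but losing any one controller now touches only a quarter of the network. Neither split is correct in general; the point is that the number becomes an explicit design input.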
The same is true with bare metal. If bare metal leads to platform convergence, it allows architects to co-host compute, storage, and networking in the same device. Whether you actually group them depends an awful lot on how you view failure domains. Again, any balance is useful so long as it is explicitly chosen. Collapsing everything to a single device creates a different failure domain, which might make sense in some environments and less so in others.
The same discussion extends to maintenance domains. Collapsing everything might create a single maintenance domain (depending on architecture, of course). Keeping things separate might enable much smaller maintenance domains. There is no right size, but whatever architecture is chosen, the maintenance domain needs to be an explicit requirement.
The bottom line
Architectures are changing. There is little doubt that technology advances in IT generally and networking specifically are enabling us to do things we couldn’t really even consider just a few years ago. When deciding how to do those things, though, we need to be explicitly designing for availability. It has always been a requirement, but it has been discussed less and less as architectures matured, and it should be dominating discussions again. Depending on your own specific requirements, this could lead to some unexpected architectural decisions.
Published at DZone with permission of Mike Bushong, DZone MVB. See the original article here.