This article was originally written by Marten Terpstra at the Plexxi blog.
A book I have had on my fairly small pile of books to keep is “The Five Dysfunctions of a Team” by Patrick Lencioni. In this book he outlines five critical failures of teams, key areas for any leader to address. Patrick walks through these with a fictional story to show how each of them can have a disastrous effect on team dynamics and performance. Each of these dysfunctions has a network equivalent, probably with many variations. These are the ones I have come across more than once.
Absence of Trust
Trust in a network comes in many different flavors. In a true network sense, trust is closely related to security, access control, firewalls, intrusion detection and everything else that comes with it. It is easy to pick at network security in relation to the absence of trust, because it is so hard to create, implement and maintain a consistent and logical security strategy, but there is something more dangerous when it comes to trust.
It is Simon. Simon is the one and only guy who understands the ins and outs of the network. The network map is in his head; the one printed or online is out of date and he knows it. No one else really understands all the intricacies of the design and implementation. The network is Simon’s baby and he protects it like a mother or father would. And any suggestion you may have to improve the network, well I am sorry, Simon will at least tweak it to “improve” your improvement, or flat out declare it the most ridiculous idea ever. You have to trust Simon and Simon trusts no one. If you have a Simon, he has to go. Trust me.
Fear of Conflict
Even without a Simon, making changes to a network is the scariest task for most network engineers and operators. Most networks have evolved from their totally logical initial design and implementation into something that is no longer well understood and, at least in certain areas, extremely fragile. This fragility may not be real, but the perception is enough to scare us away from making anything more than the most basic changes. We certainly know how to build robust protocols and networks, so why are so many changes still relegated to those extremely hard to get “maintenance windows”?
We need to create systems where we are no longer afraid to make changes beyond the most basic. We need to get to an infrastructure where even complex routing, forwarding and topology changes can be made without fear, because we have made it predictable, understandable and may even have the simulation tools to let us know what will change and how it will impact the services provided.
Lack of Commitment
In a very literal sense, few networks, except those delivered and run by carriers, provide true Service Level Agreements. Most networks are still created and run as a best-effort service. There is very little traffic accounting, capacity planning and true performance engineering in support of creating guaranteed network services. Having sFlow and friends running in the network and creating pretty graphs is a good first step, but unless this information is used to truly engineer, plan, measure and enforce service levels for the users of the network, it is nothing more than a pretty set of graphs.
Those users are not humans. They are applications: distributed, with a combination of local and remote storage, interacting with other applications, and with very specific service expectations for performance, availability and connectivity. True SLAs. We may not like the term much, but we expect very specific service levels from our applications. They should expect the same from the network. The network needs to commit to the services it provides.
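To make "measure and enforce" a bit more concrete, here is a minimal Python sketch of checking measured latency samples against a service level. Everything in it is an assumption for illustration: the `ServiceLevel` type, the `sla_report` function, the latency target and the sample values are hypothetical, and a real SLA would likely also cover availability and connectivity.

```python
from dataclasses import dataclass


@dataclass
class ServiceLevel:
    """Hypothetical service level: a latency target and how often it must be met."""
    max_latency_ms: float
    min_compliance: float  # fraction of samples that must be within the target


def sla_report(samples_ms: list[float], target: ServiceLevel) -> tuple[float, bool]:
    """Return (measured compliance, whether the service level was met)."""
    if not samples_ms:
        raise ValueError("no measurements to evaluate")
    within = sum(1 for sample in samples_ms if sample <= target.max_latency_ms)
    compliance = within / len(samples_ms)
    return compliance, compliance >= target.min_compliance


if __name__ == "__main__":
    # Made-up latency samples (milliseconds) for one application flow.
    samples = [2.0, 3.0, 2.0, 4.0, 12.0, 3.0, 2.0, 3.0, 2.0, 3.0]
    compliance, met = sla_report(samples, ServiceLevel(max_latency_ms=5.0, min_compliance=0.95))
    print(f"compliance={compliance:.0%} met={met}")
```

The point of the sketch is only that the same data that feeds the graphs can be turned into a yes or no answer per application.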
Avoidance of Accountability
I have mentioned our collective lack of accurate documentation and bookkeeping in the past. We have no up-to-date diagrams, but it’s the first thing we need when something is not quite right. The rules of who can change what and when are somewhat loosely defined. Exactly what SLAs the network is expected to provide and how is typically not very well documented or understood. Business and network continuity plans may exist, but do you ever test them?
Except for those businesses where the network is the business, there is little process rigidity guiding the network and the services it provides. While process frameworks like ITIL may be heavy for some networks, the only way to create very tangible accountability is to have a well-defined and documented set of procedures, processes, tasks and checklists to govern the network and its services. Without it, what may have started out as a well-defined and well-functioning service will become unpredictable, untraceable and unwieldy very quickly.
Inattention to Details
The devil is in the detail. Minor changes sometimes have the most dramatic impact on how a network behaves and performs. The problem is that we rarely notice the minor changes. It is rare to see line-by-line configuration change accounting and logging. It is even rarer to see network attachment tracking. Who and what is connected where and when? Where and when did a specific MAC address come and go? How much of what type of traffic is normal for any device or application? How do we define normal?
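As a rough illustration of both kinds of accounting, the Python sketch below compares two configuration snapshots line by line and compares two snapshots of where MAC addresses are attached. The function names, the configuration text and the `{mac: port}` snapshot format are all assumptions made up for this example; how you actually collect the snapshots depends on your devices and tooling.

```python
import difflib


def config_changes(old: str, new: str) -> list[str]:
    """Return only the removed ('- ') and added ('+ ') lines between two config snapshots."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return [line for line in diff if line.startswith(("+ ", "- "))]


def attachment_events(before: dict[str, str], after: dict[str, str]) -> list[tuple[str, str, str]]:
    """Compare two {mac: port} snapshots and report what joined, left or moved."""
    events = []
    for mac in sorted(before.keys() | after.keys()):
        old_port, new_port = before.get(mac), after.get(mac)
        if old_port is None:
            events.append(("joined", mac, new_port))
        elif new_port is None:
            events.append(("left", mac, old_port))
        elif old_port != new_port:
            events.append(("moved", mac, f"{old_port} -> {new_port}"))
    return events


if __name__ == "__main__":
    # Hypothetical snapshots; the syntax is illustrative, not any vendor's.
    old_cfg = "hostname sw1\ninterface eth1\nmtu 1500\n"
    new_cfg = "hostname sw1\ninterface eth1\nmtu 9000\n"
    for change in config_changes(old_cfg, new_cfg):
        print(change)

    before = {"00:11:22:33:44:55": "eth1", "66:77:88:99:aa:bb": "eth2"}
    after = {"00:11:22:33:44:55": "eth3", "cc:dd:ee:ff:00:11": "eth2"}
    for event in attachment_events(before, after):
        print(event)
```

Logged with a timestamp on every run, even something this simple starts to answer the where and when questions.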
Almost every network with serious issues I worked on in the past had no concept of “normal”. There was no well-defined state of the network. I always tried to convince a customer that once a network was stable and performing as expected (usually measured by rather soft indicators), a baseline of just about everything should be taken. Device configuration, traffic levels, network attachments, error indicators, memory utilization, CPU utilization, you name it: any piece of data that could contribute to a definition of normal. Because only if you understand what normal is can you determine when you leave normal and why. And only when you understand normal can you truly create an SLA-based network.
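One simple way to turn that idea into something checkable is sketched below in Python: record samples while the network is known to be healthy, store a mean and standard deviation per metric, and flag later readings that fall too far outside. The metric names, sample values and the tolerance of three standard deviations are assumptions chosen for illustration, not a recommendation; many metrics need a smarter definition of normal (time of day, seasonality) than this.

```python
from statistics import mean, stdev


def build_baseline(samples: dict[str, list[float]]) -> dict[str, tuple[float, float]]:
    """Summarize each metric as (mean, standard deviation); needs 2+ samples per metric."""
    return {name: (mean(values), stdev(values)) for name, values in samples.items()}


def deviations(
    baseline: dict[str, tuple[float, float]],
    current: dict[str, float],
    tolerance: float = 3.0,
) -> dict[str, float]:
    """Return the current readings that sit more than `tolerance` deviations from normal."""
    flagged = {}
    for name, value in current.items():
        if name not in baseline:
            continue  # no definition of normal for this metric yet
        avg, spread = baseline[name]
        if spread == 0:
            if value != avg:
                flagged[name] = value
        elif abs(value - avg) > tolerance * spread:
            flagged[name] = value
    return flagged


if __name__ == "__main__":
    # Hypothetical samples taken while the network was stable.
    healthy = {
        "cpu_percent": [20.0, 22.0, 21.0, 19.0, 23.0],
        "memory_percent": [60.0, 61.0, 59.0, 60.0, 60.0],
    }
    baseline = build_baseline(healthy)
    print(deviations(baseline, {"cpu_percent": 45.0, "memory_percent": 60.5}))
```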
While a network is not a team, it may have just as many dysfunctions that impact its ability to provide the service it is created for. Our job as a vendor is to provide tools and functionality to remove these.