[This article originally written by Marten Terpstra.]
While doing some competitive analysis, I read a paper presented at the 21st Symposium on High Performance Interconnects. The paper discusses the data path performance of spine and leaf networks and was written by Insieme Networks’ Mohammad Alizadeh and CTO Tom Edsall. Mohammad has co-authored several research papers on fabrics and networks, all very worthwhile reading. The paper describes findings from a leaf and spine simulation, focused on the impact of buffer space, fabric link capacity and oversubscription on the overall traffic load.
The paper describes how the authors created a leaf and spine model consisting of 5 racks with 20 servers each, connected with 10GbE. The 5 simulated ToR switches (the leafs) are then connected to modeled spine switches at various oversubscription rates. Unfortunately, the paper does not state whether each spine switch takes a single fabric link from each leaf switch (which would require a lot of spine switches) or multiple fabric links from each. And if the latter, are those links aggregated using LAG, or treated as individual paths offered to ECMP? I believe each of those options would result in different observed behavior.
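To see why the single-link-per-spine reading implies a lot of spine switches, here is some back-of-the-envelope arithmetic. The helper function and the specific uplink speeds are my own illustration, derived from the paper's 20 x 10GbE servers per rack:

```python
# Back-of-the-envelope uplink math for the paper's topology
# (my arithmetic, not the authors'): 20 x 10GbE servers per leaf.
SERVERS_PER_LEAF = 20
SERVER_GBPS = 10

def uplinks_needed(oversub, uplink_gbps=10):
    """Fabric links each leaf needs at a given oversubscription ratio."""
    return (SERVERS_PER_LEAF * SERVER_GBPS) / (oversub * uplink_gbps)

# At 1:1 with 10GbE uplinks, a leaf needs 20 fabric links; one link per
# spine would then mean 20 spine switches for this 5-rack simulation.
links_1to1_10g = uplinks_needed(1)      # 20 uplinks
# With 40GbE uplinks the same 1:1 design needs only 5 links (and spines).
links_1to1_40g = uplinks_needed(1, 40)  # 5 uplinks
```

With fewer, faster uplinks the single-link-per-spine design becomes practical, which is one reason the LAG-versus-ECMP question matters.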
Within this network they create dynamic flows that are tracked for performance, plus background traffic that is tracked for large flows only and creates the load and congestion that impacts the primary workload. The primary workload creates an average of 10 file transfer requests per second per server, with randomized arrival times to remove synchronization. Each request asks for a 1Mb file from n servers, where n is a random number between 1 and the number of servers, and each server provides 1/nth of the file when asked. The time taken for all portions of the file to be received is tracked; the paper calls this Query Completion Time (QCT). The large background flows that are tracked are measured by Flow Completion Time (FCT).
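To make the query workload concrete, here is a minimal Python sketch of a single query. This is my own illustration, not the authors' simulator; excluding the requester from the responder set is my simplification:

```python
import random

NUM_SERVERS = 100           # 5 racks x 20 servers, per the paper's model
FILE_SIZE_BITS = 1_000_000  # one 1Mb file per query

def generate_query(servers=NUM_SERVERS):
    """One query: fetch 1/n-th of the file from each of n random servers."""
    requester = random.randrange(servers)
    n = random.randint(1, servers)
    # Simplification: the requester never asks itself, so cap n at servers-1.
    pool = [s for s in range(servers) if s != requester]
    responders = random.sample(pool, min(n, len(pool)))
    chunk = FILE_SIZE_BITS / len(responders)
    return requester, [(r, chunk) for r in responders]

requester, parts = generate_query()
# The 1/n-th shares always add back up to the full file.
total_bits = sum(size for _, size in parts)
```

Tracking the time until the slowest of the n responses arrives would give the Query Completion Time the paper measures.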
When looking at the traffic being generated, it is essentially an any-to-any traffic pattern. The query traffic is requested by any server from any other server, its random nature guaranteeing an even distribution across all servers over time. The background flows are also evenly distributed (although not explicitly stated), with flow sizes based on real-world traffic pattern studies.
And this is where we at Plexxi would have a first difference of opinion. We do not believe that data centers create uniform data flows, in bursts or over time. We strongly believe there are patterns of communication in a data center that can be recognized. We also do not believe all traffic is equally important: some workloads are very sensitive to delay, others not at all. I fully understand why this traffic distribution was chosen; it is extremely hard to find patterns of cause and effect in unbalanced environments. The simulation also forced all traffic to leave a rack; there was no traffic between servers inside the same rack (at least for the background flows). Many real-life solutions are engineered to exploit this locality, which provides cheap one-hop, non-oversubscribed bandwidth; Hadoop is a great example.
The authors found that changing link speeds between leaf and spine from 10GbE to 40GbE or 100GbE (without changing the overall bandwidth between spine and leaf) improves the FCT significantly. This is a good finding, but the way the conclusion is phrased leaves me with more questions. They state that “… ECMP load-balancing is inefficient when the fabric link speed is 10Gbps.” While true for the specific simulated environment, I believe the real explanation is that ECMP hashing does not create a perfect distribution. When multiple background flows originating from 10GbE-attached servers, with their TCP windows wide open, start blasting traffic at full speed, it only takes two of them landing on the same hashed fabric link to have a significant impact. With fewer uplinks of a higher speed, ECMP inefficiency is less pronounced; the higher link speed absorbs some of this burstiness. If you look at the primary query traffic for the same comparison, the delta between multiple 10GbE uplinks and fewer 40GbE uplinks is much less pronounced and barely noticeable under very heavy load. It’s a nice result to highlight, but I believe it is directly related to the type of traffic offered: TCP with a wide-open window.
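To see why per-flow ECMP hashing cannot distribute perfectly, consider a toy sketch (hypothetical flows and link counts, not the paper's simulation). With eight long-lived flows hashed onto four equal-cost uplinks, the pigeonhole principle alone guarantees that at least one uplink carries two wide-open flows:

```python
import random
from collections import Counter

def ecmp_link(flow_tuple, num_links):
    # Per-flow hashing: the 5-tuple deterministically picks one uplink,
    # regardless of how loaded that uplink already is.
    return hash(flow_tuple) % num_links

# Eight hypothetical long-lived TCP flows (src IP, dst IP, proto, sport, dport)
random.seed(42)
flows = [("10.0.0.%d" % i, "10.0.1.%d" % i, 6,
          random.randrange(1024, 65535), 80) for i in range(8)]
load = Counter(ecmp_link(f, 4) for f in flows)
# With 8 flows on 4 links, max(load.values()) >= 2: some 10GbE uplink
# carries two wide-open 10GbE flows while another may sit nearly idle.
```

Because the hash ignores link load, the unlucky uplink sees 20Gbps offered against 10Gbps of capacity for as long as both flows last.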
When comparing oversubscription levels, the paper finds that the oversubscribed versions of the network perform very similarly to the non-oversubscribed version when offering up to about 60% relative load. When pushing the relative load to 70% or higher, oversubscribed spine and leaf networks degrade faster than non-oversubscribed versions. No explanation is given, but a combination of ECMP effectiveness and buffer effectiveness has to contribute to this degradation.
The challenge to the network architect is trying to understand what the right oversubscription ratio is. Once designed and deployed with a certain ratio, changing that ratio in a fixed spine and leaf network is extremely cumbersome. Would it not be nice if you could dynamically change connectivity and therefore oversubscription based on workload needs?
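For reference, the oversubscription ratio in question is simply total server-facing capacity divided by total fabric-facing capacity. A small sketch with hypothetical port counts (my numbers, not the paper's) shows why changing it later means re-cabling:

```python
def oversubscription(down_ports, down_gbps, up_ports, up_gbps):
    """Leaf oversubscription = downlink capacity / uplink capacity."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)

# 20 x 10GbE server ports over 4 x 10GbE fabric uplinks -> 5:1
ratio_5to1 = oversubscription(20, 10, 4, 10)   # 5.0
# Re-cabling the same leaf to 2 x 40GbE + 2 x 10GbE uplinks -> 2:1,
# exactly the kind of physical change that is painful in a fixed fabric.
ratio_2to1 = (20 * 10) / (2 * 40 + 2 * 10)     # 2.0
```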
The previous two parameters are fully under the control of those who build a network: you can pick how much oversubscription you want, need, or can afford, and on many of the latest-generation switches the choice between 10GbE and 40GbE has become a user-configurable option. The last parameter examined in the simulation is the impact of the buffer space available at each of the switches. The authors picked 10Mb as the standard shared buffer, very similar to what today’s 1U data center switches have. Not surprisingly, the simulation showed that more buffer made the network perform better. A minor surprise is that increasing buffer space on the leaf is more impactful than doing the same on the spine. While the paper mentions even queue utilization due to the all-to-all traffic pattern in use, it does not explain why leaf buffer size is more valuable than its spine equivalent under this traffic pattern, suggesting that incast issues in this simulation occur at the leaf egress rather than the spine egress ports. Unfortunately, you as a buyer have little control over the amount of buffering in your leaf switch. Modular spine switches have always had more buffer memory; perhaps this paper is a reason to ask why.
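As a back-of-the-envelope illustration of why buffer size matters at these speeds (my numbers, not the paper's): when two wide-open 10GbE flows hash onto one 10GbE link, even a dedicated 10Mb buffer absorbs the excess for only about a millisecond before drops begin:

```python
BUFFER_BITS = 10_000_000   # the paper's 10Mb shared buffer
ARRIVAL_GBPS = 2 * 10      # two colliding wide-open 10GbE flows
DRAIN_GBPS = 10            # one 10GbE link drains the queue

# Excess arrival rate that must be absorbed by the buffer, in bits/sec:
excess_bps = (ARRIVAL_GBPS - DRAIN_GBPS) * 1e9
# Time until the buffer fills and packet drops start:
time_to_fill_ms = BUFFER_BITS / excess_bps * 1_000  # ~1.0 ms
```

In practice the 10Mb is shared across all ports, so the headroom before drops is even smaller than this single-port arithmetic suggests.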
In the end, the paper is a very interesting read and provides good insight into how leaf and spine networks can behave. There is, however, a challenge in translating this to any real network. The traffic simulation was (not without reason) specifically designed to create an even any-to-any mix, with explicit burstiness due to open TCP windows and no traffic between ports on the same switch.
We believe there is no one-size-fits-all network. Applications are different. Workloads are different. There is no such thing as uniformity: there is localization, there are hot spots, and different workloads want different things from the network. That is why we believe we should not be building uniform networks, or forwarding traffic based on uniform algorithms. If you understand what the workloads are and where the large portions of traffic flow, would it not be nice to adjust topologies to create less oversubscription between those portions of the network, while allowing more oversubscription elsewhere? By understanding workloads, hot spots can be avoided, and more links and associated buffer space can be applied where they are needed. There is no question that spine and leaf networks are a great improvement over the multi-tiered networks of the past, but why stop there?
[Today's Fun Fact: The fingerprints of Koala Bears are virtually indistinguishable from those of humans, so much so that they could be confused at a crime scene. Reasonable doubt anyone?]