This post was originally written by Marten Terpstra at the Plexxi blog.
This week, Ivan Pepelnjak wrote an article describing queueing in today’s Ethernet switches. He walks through FIFO queueing, then Class Based Queueing, and on to Virtual Output Queueing in line card based switches. It is a very nicely written article explaining the basics of queueing mechanisms and how they are used in Ethernet switches (and routers alike). I blatantly took Ivan’s post title to create my own.
Throughout the packet processing in a switch, a packet that has arrived is placed in the switching hardware buffer memory. For most single chip 10GbE ToR switches there is very little buffer memory. Depending on which of the few vendors is used, the amount of buffer memory is typically somewhere between 8 and 12 Mbytes. Yes, that’s all. 12 Mbytes is the equivalent of running a single 10GbE interface for about 10 milliseconds. On a 96 port ToR switch, that is the equivalent of about 100 microseconds worth of traffic if each interface was running full out. And on these switches, this buffer memory is shared among all interfaces.
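The buffer arithmetic is easy to check for yourself. Below is a minimal sketch, assuming "12 Mbytes" means 12 x 10^6 bytes and that every port runs at the full 10 Gbit/s line rate; the function name is just for illustration.

```python
def buffer_time_s(buffer_bytes, link_bps, ports=1):
    """Seconds of line-rate traffic that fit in the buffer across `ports` interfaces."""
    return buffer_bytes * 8 / (link_bps * ports)

BUFFER_BYTES = 12 * 10**6   # assumption: 12 Mbytes = 12 x 10^6 bytes
LINK_BPS = 10 * 10**9       # 10GbE line rate in bits per second

single_port = buffer_time_s(BUFFER_BYTES, LINK_BPS)
all_ports = buffer_time_s(BUFFER_BYTES, LINK_BPS, ports=96)

print(f"one 10GbE port: {single_port * 1e3:.1f} ms")   # 9.6 ms
print(f"96 10GbE ports: {all_ports * 1e6:.0f} us")     # 100 us
```

Changing BUFFER_BYTES to 8 * 10**6 shows the lower end of the range quoted above.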
The queues that Ivan describes essentially hold pointers to the packets that have been placed in buffer memory. More buffer memory means the ability to support more queues, or longer queues. As Ivan mentioned, the whole point of queueing is the management of packets when too much traffic is attempting to leave an egress interface. It protects against momentary bursts of traffic by slowing things down.
Which brings us to the first impact of queueing: delay. Every time a packet gets queued, it needs to wait until the packets ahead of it have been transmitted. A 1Kbyte packet takes over 800ns to transmit on 10GbE. If there are only a handful of packets of this size ahead of you, you’ll be waiting 4 microseconds before it’s your turn. If this handful are jumbo packets, you will have to wait about 35 microseconds. Think about that against the switch latency discussions. Big buffers may sound good at first, because they mean you can withstand more bursts of traffic, but the size of the buffer directly impacts the delay through a switch under load. With big buffers, there goes your low latency switching.
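The waiting times follow directly from the serialization delay of each packet ahead of you. Here is a rough sketch, assuming a "handful" is 5 packets, a 1Kbyte packet is 1024 bytes, and a jumbo packet is 9000 bytes; it ignores preamble and inter-frame gap, so real numbers will be slightly higher.

```python
def serialization_delay_s(packet_bytes, link_bps=10e9):
    """Time to put one packet on the wire at the given link speed."""
    return packet_bytes * 8 / link_bps

def wait_time_s(packets_ahead, packet_bytes, link_bps=10e9):
    """Time spent waiting behind `packets_ahead` equally sized packets."""
    return packets_ahead * serialization_delay_s(packet_bytes, link_bps)

HANDFUL = 5          # assumption: "a handful" means 5 packets
KBYTE_PACKET = 1024  # assumption: 1Kbyte = 1024 bytes
JUMBO_PACKET = 9000  # assumption: common jumbo frame size

print(f"{serialization_delay_s(KBYTE_PACKET) * 1e9:.0f} ns per 1Kbyte packet")
print(f"{wait_time_s(HANDFUL, KBYTE_PACKET) * 1e6:.1f} us behind 1Kbyte packets")
print(f"{wait_time_s(HANDFUL, JUMBO_PACKET) * 1e6:.1f} us behind jumbo packets")
```

Raise HANDFUL to a few hundred packets and the wait moves from microseconds into milliseconds, which is the delay a deep buffer can add under load.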
Any time a switch needs to make a queueing decision, it makes a priority decision based on the policy you have instructed it with. The number of queues, the de-queueing strategy, the depth of the queues, drop priority, and everything else associated with queueing creates a “one packet is more important than another” choice.
I have previously mentioned Lossless Ethernet, essentially a mechanism implemented on top of Ethernet that uses queue utilization as its key to take action. Once a specific queue fills to a configured threshold, the switch will push back on all sending interfaces for traffic that is meant for that queue, and on a sample basis send “slow down” messages to the original sender. Larger queues and heavier queue utilization lead to more push back, and again that pesky delay.
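To make the threshold idea concrete, here is a toy model of a queue that signals push back once it fills past a configured depth and sends a "slow down" message for every Nth packet seen above that depth. It is only a sketch of the behavior described above, not how any switch or standard implements it; the class name, the signal strings, and the threshold and sample_every parameters are all made up for illustration.

```python
from collections import deque

class ThresholdQueue:
    """Toy egress queue that signals push back once it fills past a threshold."""

    def __init__(self, threshold, sample_every=4):
        self.packets = deque()
        self.threshold = threshold        # hypothetical depth, counted in packets
        self.sample_every = sample_every  # hypothetical sampling interval
        self.over_threshold_count = 0

    def enqueue(self, packet, sender):
        """Queue a packet and return the signals the switch would emit."""
        self.packets.append(packet)
        signals = []
        if len(self.packets) >= self.threshold:
            # Push back on the interfaces feeding this queue.
            signals.append("pause-upstream")
            # Sampled "slow down" message to the original sender.
            self.over_threshold_count += 1
            if self.over_threshold_count % self.sample_every == 0:
                signals.append(f"slow-down:{sender}")
        return signals

    def dequeue(self):
        """Transmit the packet at the head of the queue, if any."""
        return self.packets.popleft() if self.packets else None

queue = ThresholdQueue(threshold=3, sample_every=2)
results = [queue.enqueue(n, "hostA") for n in range(4)]
print(results)
```

With a threshold of 3, the first two packets pass silently, the third triggers push back, and the fourth triggers push back plus a sampled "slow down" to the sender.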
You cannot avoid queueing in modern day networking. Traffic is bursty in almost any application, and some tolerance to this burstiness is needed to avoid throwing away packets at a whim. There have been plenty of times where we have had lively conversations with customers and other experts about queueing and buffering. Some have tried to convince us that only specific types of switching hardware had enough buffering to support their traffic.
Our view however is a bit different. Datacenter switches are built with relatively little buffer space. And that is by design. At relatively small distances with high speeds, the cost of a dropped packet is much different than that same dropped packet transmitted at lower speeds and longer distances.
More importantly though, if your network requires constant queueing and switches are consistently pushing their buffering capabilities, you do not have a “lack of buffer space” problem, you have a network engineering problem. And network engineering problems require network engineering solutions. Full queues and full buffers point to congestion, they point to hot spots in your network. For each hot spot you will likely find another spot that is running well under capability.
The network engineering solution is to distribute that capacity, those queues, those buffers and the paths that drive them in such a way that a more even utilization of all these resources is achieved. And that requires a different level of control over your traffic, the type of control only an end-to-end traffic managed fabric can give you. We are all still hand managing our queues and buffers, attempting to squeeze the best performance out of them. When you really think about it, it’s pretty much impossible for a person to get this right. 100s of queues per switch, 10s or 100s of switches in a network. It does not matter how big your whiteboard is, no matter how much like Sheldon you may be, this is the age of algorithms and number crunching.
If your network is one big hot spot, well, the engineering solution to that is relatively straightforward, you need more network for your traffic. This you can try and brute force, but physics is not on your side. It’s better to have math help you solve that.