One of the really attractive aspects of working at a start-up like DataSift is managing the challenges that come with the rapidity of organic growth. Our experiences with the networking aspects of Hadoop are an excellent example of this.
When we started working with Hadoop in mid-2011 we very quickly started to find all the little gotchas in networking kit that, academically, you knew were there but never thought you’d hit.
Our initial Hadoop network was very simple. At the core were two Cisco 3750s stacked together for redundancy, with multiple Cisco 2960s LACP’d to create 4Gb uplinks.
Almost immediately Cacti showed that the busiest servers were experiencing unacceptable volumes of discards. The 2960s are only rated for 2,700,000 packets per second, with switching / forwarding bandwidth of 16 / 32Gbps respectively, and this didn’t seem to be sufficient.
After following the traffic patterns and confirming our suspicions, we experimented by upgrading some of the TOR switches to the Cisco 3560 platform, which is lab rated at 38,700,000 packets per second with a similar switching / forwarding capacity of 32Gbps.
Packet loss dropped but was still in the thousands per second for certain hosts. Attempts were made to tweak the buffers to make better use of what buffer space we had but it just couldn’t keep up.
At this point we’d reached the limit of the Cisco Enterprise Access layer / Campus LAN portfolio and were heading into what they term ‘Core and Distribution’. Nearly doubling the 3560’s packets-per-second throughput was the Cisco 4948, with a healthy 72,000,000 packets per second and a 96Gbps switching capacity. The 4948 is effectively a 4500 without the modularity.
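To put those packets-per-second ratings in context, here is a rough sketch that converts each rating into the number of 1Gb ports it could feed at wire speed. It assumes the ratings are quoted for minimum-size 64-byte frames (the usual convention, but an assumption here), and the function names are my own for illustration.

```python
# Per-frame overhead on the wire that is not part of the frame itself:
# 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte inter-frame gap.
WIRE_OVERHEAD_BYTES = 20


def line_rate_pps(link_bps: float, frame_bytes: int = 64) -> float:
    """Maximum frames per second a link can carry at a given frame size."""
    return link_bps / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8)


def ports_at_wire_speed(rated_pps: float, link_bps: float = 1e9,
                        frame_bytes: int = 64) -> float:
    """How many ports a pps rating could keep busy at full line rate."""
    return rated_pps / line_rate_pps(link_bps, frame_bytes)


# Ratings quoted in this article.
RATED_PPS = {"2960": 2_700_000, "3560": 38_700_000, "4948": 72_000_000}

for model, pps in RATED_PPS.items():
    ports = ports_at_wire_speed(pps)
    print(f"{model}: ~{ports:.1f} x 1Gb ports at 64-byte line rate")
```

A 1Gb port tops out at roughly 1.49 million 64-byte frames per second, so a pps rating only tells you how the switch copes with the worst case of tiny packets. It says nothing about how much buffer each port has, which turned out to matter just as much.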
The 4948s had cracked it: intra-switch transit was flawless. Unfortunately we now had core-grade switches on our access layer that were bottlenecked by the uplinks to the ‘core’, which were still suffering heavily from discards.
The Hadoop cluster was growing and the network needed to be ready so we took the opportunity to redesign from scratch.
Designing the Network
When you’re creating a network that has to scale as fast as your platform, and you need upwards of 20Gb/s at the aggregation layer, things can get a little complicated.
Challenge #1 – Buffers
One of the primary reasons we were suffering such high discards was the inability of the switches to allocate enough buffer memory to absorb the volumes of data that we could pile into the platform (Cisco has some documentation on the issue). Couple that with head-of-line blocking when all the servers in the cab need to use the uplinks back to the core, and there are plenty of things to optimise and check.
Challenge #2 – Uplink Over-subscription
Ideally there should be enough bandwidth on the uplinks to allow the servers to utilise the full extent of their individual 1Gb NICs without contention.
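As a rough illustration of the arithmetic, the sketch below compares the aggregate host bandwidth in a cab with the uplink bandwidth leaving it. The 48 host ports per switch is an assumed figure for illustration only; the 4 x 1Gb LACP uplink is the original design described above, and the 4 x 10Gb case matches the four 10Gb links per switch mentioned later in this article.

```python
def oversubscription_ratio(host_ports: int, host_gbps: float,
                           uplink_ports: int, uplink_gbps: float) -> float:
    """Aggregate host bandwidth divided by aggregate uplink bandwidth."""
    downstream = host_ports * host_gbps
    upstream = uplink_ports * uplink_gbps
    return downstream / upstream


def uncontended_hosts(host_gbps: float, uplink_ports: int,
                      uplink_gbps: float) -> int:
    """How many hosts can drive their NICs flat out before the uplinks fill."""
    return int((uplink_ports * uplink_gbps) // host_gbps)


# Assumed: 48 x 1Gb hosts per TOR switch (hypothetical port count).
lacp_1gb = oversubscription_ratio(48, 1, uplink_ports=4, uplink_gbps=1)
four_10gb = oversubscription_ratio(48, 1, uplink_ports=4, uplink_gbps=10)

print(f"4 x 1Gb LACP uplink: {lacp_1gb:.1f}:1")
print(f"4 x 10Gb uplinks:    {four_10gb:.1f}:1")
print(f"Hosts at full rate on 4 x 1Gb:  {uncontended_hosts(1, 4, 1)}")
print(f"Hosts at full rate on 4 x 10Gb: {uncontended_hosts(1, 4, 10)}")
```

Under those assumptions a 4Gb bundle is 12:1 over-subscribed and only four hosts can run at full rate at once, whereas 40Gb of uplink brings that down to 1.2:1.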
Challenge #3 – Extensibility
The volume of data we process climbs every day and there’s no point designing something without anticipating how it’ll scale and how long for.
Several designs were considered: the Mesh, the Chassis and the traditional Core + Access.
Since we already had some 4948s, which have Layer 3 capabilities, and because Hadoop is cab aware and therefore likes to talk between cabs, one idea was to build a meshed ring using 2Gb/4Gb LACP links. Creating links to each and every other switch and using OSPF allows the switches to efficiently route traffic only to the switch that contains the target subnet, rather than sending all data up to a core just to immediately egress again on what is likely to be a heavily over-subscribed link.
As soon as the network hits 13 cabinets it uses more ports for routing than it has left for hosts, and the cost per host port just keeps climbing. There is also still the issue that if all hosts need to talk to one particular rack, the links to that rack are over-subscribed.
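The port arithmetic behind that can be sketched as follows. It assumes one switch per cabinet, 48 usable ports per switch and a 2-port LACP bundle to every other cabinet; those figures and the function name are assumptions for illustration rather than the exact numbers we planned with.

```python
def mesh_port_budget(cabinets: int, ports_per_switch: int = 48,
                     links_per_peer: int = 2) -> tuple[int, int]:
    """Return (mesh_ports, host_ports) per switch in a full mesh."""
    # Every switch needs a bundle to each of the other cabinets.
    mesh_ports = (cabinets - 1) * links_per_peer
    host_ports = max(ports_per_switch - mesh_ports, 0)
    return mesh_ports, host_ports


for cabinets in range(10, 16):
    mesh, hosts = mesh_port_budget(cabinets)
    print(f"{cabinets} cabinets: {mesh} mesh ports, {hosts} host ports per switch")

# With 4-port bundles the budget runs out much sooner.
print(mesh_port_budget(13, links_per_peer=4))
```

With 2-port bundles, half of every switch is given over to the mesh at 13 cabinets and the mesh ports outnumber host ports from 14 onwards. With 4-port bundles, 13 cabinets would consume every port on the switch.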
The mesh was dead in the water.
The Cisco Chassis
With the 2T supervisors you’d be looking at around 720,000,000 packets per second of performance with 80Gbps of inter-blade transit. Unfortunately a chassis switch brings several complications aside from performance concerns.
With 577 potential ports, that’s a lot of Cat6a to consolidate into a single cab; even with decent cable management it was going to be a lot of cable to run to one place.
Another disadvantage of the chassis is that instead of having x TOR switches with y PSUs each, the entire infrastructure relies on just 2 PSUs on 2 divergent power supplies, and it doesn’t take a fancy formula to see that this reduces redundancy.
Additional issues arise from consolidating switching. You’d want to make full use of each line card, so the loss of a single line card could affect twice as many servers as it would if there were just 20 servers per TOR switch. There is also the issue of how to cross-connect additional chassis without over-subscription once there are more than 500 servers.
During our investigations it turned out that even the 6513 suffers from head-of-line blocking and has insufficient per-port buffers to fully handle what we were planning on doing with the platform.
So when you take into consideration the reduced redundancy, the larger impact of a failed line card, the allocation of almost an entire cab footprint and the impact of a failed fan tray / backplane, the chassis was out of the running too.
Core and Distribution a.k.a Leaf and Spine
So far we’d dismissed two designs over scalability, resilience, redundancy, uplink subscription, head-of-line blocking, port buffering, the colour scheme of the chassis and a bunch of other factors. It was time to pull something out of the hat.
The 4948 comes in several flavours: some with 10Gb XENPAK slots and some without. We had the ones without, which meant either trading in and finding a compatible core too, or looking at a completely new vendor.
Switching vendors is a daunting task. It comes with risks around reliability, training, costs, inter-operability with existing infrastructure and scalability, and if you’re really unlucky the new kit uses some crazy cable and serial config, so it pays to stand on the shoulders of giants. In this case our closest documented peer was StumbleUpon, in the form of Benoit Sigoure’s presentation at a Hadoop user group in 2011. We got in touch with Benoit, who provided more information than was in the slides, and after a few emails back and forth we were sold on the idea of using Arista as our new vendor.
I then took over one of our meeting rooms to create a mini network lab. Luckily our CTO likes shiny kit as much as we in the Operations team do;
— Nick Halstead (@nik) August 7, 2012
Within hours I’d resolved most of my initial reservations about switching vendors: dual PSUs, hot-swap fans, an OS (named EOS) that is similar enough to IOS not to present any training problems, up to four 10Gb links per 7048 and even a normal serial cable in ‘Cisco blue’.
I was confident that the Arista gear was the way forward and a couple of days later it was racked and ready for the transition;
A beautiful start to our new network. 80Gbs of awesome and we’re just getting started….. twitter.com/NetworkString/…
— Gareth Llewellyn (@NetworkString) August 29, 2012
This article won’t cover the migration of a multi-petabyte Hadoop cluster onto an entirely new infrastructure, but I will tell you that the first time we tested an HDFS rebalance it resulted in throughput that exceeded 24Gb/s without a single dropped packet, and we’ve been running for some time since without a single issue:
Ethernet49 is up, line protocol is up (connected)
  Hardware is Ethernet, address is 001c.7316.4bd0
  Description: ECMP 10Gbit Fibre Uplink
  Internet address is xx.xx.xx.xx/yy
  Broadcast address is 255.255.255.255
  Address determined by manual configuration
  MTU 1500 bytes, BW 10000000 Kbit
  Full-duplex, 10Gb/s, auto negotiation: off
  Last clearing of "show interface" counters never
  5 minutes input rate 456 Mbps (4.6% with framing), 48972 packets/sec
  5 minutes output rate 1.16 Gbps (11.7% with framing), 100369 packets/sec
     108948784171 packets input, 141933737051584 bytes
     Received 1183 broadcasts, 2155053 multicast
     0 runts, 0 giants
     0 input errors, 0 CRC, 0 alignment, 0 symbol
     0 PAUSE input
     142359795981 packets output, 190129640181753 bytes
     Sent 2 broadcasts, 136398 multicast
     0 output errors, 0 collisions
     0 late collision, 0 deferred
     0 PAUSE output
Ethernet50 is up, line protocol is up (connected)
  Hardware is Ethernet, address is 001c.7316.4bd0
  Description: ECMP 10Gbit Fibre Uplink
  Internet address is xx.xx.xx.xx/yy
  Broadcast address is 255.255.255.255
  Address determined by manual configuration
  MTU 1500 bytes, BW 10000000 Kbit
  Full-duplex, 10Gb/s, auto negotiation: off
  Last clearing of "show interface" counters never
  5 minutes input rate 499 Mbps (5.1% with framing), 51724 packets/sec
  5 minutes output rate 1.13 Gbps (11.4% with framing), 102552 packets/sec
     108946562727 packets input, 139989778742013 bytes
     Received 14 broadcasts, 2155115 multicast
     0 runts, 0 giants
     0 input errors, 0 CRC, 0 alignment, 0 symbol
     0 PAUSE input
     141614511135 packets output, 189850408828125 bytes
     Sent 2 broadcasts, 136409 multicast
     0 output errors, 0 collisions
     0 late collision, 0 deferred
     0 PAUSE output
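If you want to sanity-check counters like these, the following sketch reproduces the "with framing" percentages and works out the average packet size from the figures above. It assumes "with framing" means the 20 bytes of preamble and inter-frame gap are added to every packet; that is my reading of the output rather than something documented here, and the function names are illustrative.

```python
# Assumed per-packet framing overhead: 8 bytes of preamble/SFD
# plus a 12-byte inter-frame gap.
PREAMBLE_AND_IFG_BYTES = 20


def utilisation_with_framing(rate_bps: float, packets_per_sec: float,
                             link_bps: float = 10e9) -> float:
    """Link utilisation in percent, including per-packet framing overhead."""
    framing_bps = packets_per_sec * PREAMBLE_AND_IFG_BYTES * 8
    return 100 * (rate_bps + framing_bps) / link_bps


def average_packet_bytes(total_bytes: int, total_packets: int) -> float:
    """Mean packet size over the lifetime of the counters."""
    return total_bytes / total_packets


# Input-side figures copied from the two uplinks shown above.
eth49 = utilisation_with_framing(456e6, 48972)
eth50 = utilisation_with_framing(499e6, 51724)
print(f"Ethernet49 input: {eth49:.1f}% with framing")
print(f"Ethernet50 input: {eth50:.1f}% with framing")

avg = average_packet_bytes(141933737051584, 108948784171)
print(f"Ethernet49 average input packet: {avg:.0f} bytes")
```

Both input percentages come out matching the switch output, and the average input packet on Ethernet49 works out at roughly 1,300 bytes. The output-side percentages can land a tenth of a percent out, because the displayed rate is already rounded to two decimal places of a Gbps.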
That was August. It’s now October and we’re already at 12 switches powering hundreds of Hadoop servers and, with more data sources and augmentations being added, the cluster will only grow further.
So, if you need to run Hadoop at scale I can’t recommend Arista highly enough.
Whilst none of these books have any specific information about dealing with this topic I would strongly recommend giving them a read through;