Platform Performance Gains with Arista Switches

By Gareth Llewellyn · Mar. 29, 2013

In late 2012 I wrote about the migration of our [DataSift.com's] Hadoop cluster to Arista switches, but what I didn't mention was that we also moved our real-time systems over to Arista too.

Within the LAN

During our fact-finding trek through the Cisco portfolio we acquired a bunch of 4948 and 3750 switches, which were repurposed into the live platform. Unfortunately, the live platform (as opposed to Hadoop-sourced historical data) would occasionally experience performance issues: the fan-out design of our distributed architecture amplified the impact of micro-bursts during high-traffic events.

Every interaction we receive is augmented with additional metadata such as language designation, sentiment analysis, trend analysis and more. To acquire these values an interaction is tokenised into the relevant parts (e.g. a Twitter user name for a Klout score, sentences for sentiment analysis, trigrams for language analysis, etc.). Each of those tokens is then dispatched to the relevant service endpoints for processing, as sketched below. A stream of 15,000 interactions a second can instantly become 100,000+ additional pieces of data traversing the network, which puts load on NICs, switch backplanes and core uplinks.
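
To make that amplification concrete, here is a minimal sketch of the fan-out in Go. The Token type, the tokenise logic and the service labels are illustrative stand-ins, not DataSift's actual pipeline:

    package main

    import (
        "fmt"
        "strings"
        "sync"
    )

    // Token is one fragment of an interaction bound for a specific service.
    type Token struct {
        Service string // e.g. "klout", "sentiment", "language"
        Payload string
    }

    // tokenise splits an interaction the way the article describes: a user
    // name for the Klout score, sentences for sentiment analysis, trigrams
    // for language detection.
    func tokenise(user, text string) []Token {
        toks := []Token{{Service: "klout", Payload: user}}
        for _, s := range strings.Split(text, ". ") {
            toks = append(toks, Token{Service: "sentiment", Payload: s})
        }
        words := strings.Fields(text)
        for i := 0; i+3 <= len(words); i++ {
            toks = append(toks, Token{Service: "language", Payload: strings.Join(words[i:i+3], " ")})
        }
        return toks
    }

    func main() {
        toks := tokenise("@example", "Arista switches are fast. Augmentation latency dropped.")
        var wg sync.WaitGroup
        for _, t := range toks {
            wg.Add(1)
            go func(t Token) {
                defer wg.Done()
                // In production this would be a network call: one interaction
                // has already become len(toks) requests on the wire.
                fmt.Printf("dispatch %q to %s endpoint\n", t.Payload, t.Service)
            }(t)
        }
        wg.Wait()
    }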

If a particular request were to fail, precious time would be wasted waiting for the reply, processing the failure and then re-processing the request. To combat this you might duplicate calls to service endpoints (speculative execution, in Hadoop parlance) and double your chances of success, but those ~100,000 streams would then become ~200,000, putting further stress on all your infrastructure.
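
One way to picture that trade-off is a hedged call: fire the same request at two replicas, take whichever answers first and cancel the loser. The sketch below assumes a made-up callService that simulates a jittery endpoint; nothing here is DataSift's actual code:

    package main

    import (
        "context"
        "fmt"
        "math/rand"
        "time"
    )

    // callService stands in for a real augmentation endpoint with variable latency.
    func callService(ctx context.Context, replica int) (string, error) {
        select {
        case <-time.After(time.Duration(rand.Intn(20)) * time.Millisecond):
            return fmt.Sprintf("reply from replica %d", replica), nil
        case <-ctx.Done():
            return "", ctx.Err()
        }
    }

    // hedged races two copies of the same request and returns the first reply,
    // doubling traffic in exchange for a shorter latency tail.
    func hedged(ctx context.Context) (string, error) {
        ctx, cancel := context.WithCancel(ctx)
        defer cancel() // abandon the slower replica once we have an answer
        out := make(chan string, 2)
        for r := 0; r < 2; r++ {
            go func(r int) {
                if resp, err := callService(ctx, r); err == nil {
                    out <- resp
                }
            }(r)
        }
        select {
        case resp := <-out:
            return resp, nil
        case <-ctx.Done():
            return "", ctx.Err()
        }
    }

    func main() {
        resp, err := hedged(context.Background())
        if err != nil {
            fmt.Println("both replicas failed:", err)
            return
        }
        fmt.Println(resp)
    }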

At DataSift we discuss internal platform latency in terms of microseconds and throughput in tens of gigabits, so adding an unnecessary callout here or an extra millisecond there isn't acceptable. We want to be as efficient, fast and reliable as possible.

When we started looking at ways of improving the performance of the real-time platform, it was obvious that many of the arguments that made Arista an obvious choice for Hadoop made it ideal for our real-time system too. The Arista 7050s we'd already deployed have some impressive statistics with regard to latency, so we needed little more convincing that we were on the right path (and the 1.28 Tbps and 960,000,000 packets-per-second figures don't hurt either). For truly low-latency switching at the edge one would normally look at the 7150 series, but from our testing the 7048s were well within the performance threshold we wanted and enabled us to standardise our edge.

We made use of our failure-tolerant platform design (detailed further below) to move entire cabinets at a time over to the Arista 7048s with no interruption of service to customers.

Once all cabinets were migrated, and with no other optimisations at that point, we saw an immediate difference in key metrics:

Simply by deploying Arista switches for our 'real-time' network we decreased augmentation latency from ~15,000µs down to ~2,200µs. Further optimisations to our stack, and to how we leverage the Linux kernel's myriad options, have improved things even more.
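
The article doesn't say which kernel options were involved, so treat the following as an assumed illustration of the usual socket-level suspects for a latency-sensitive service, expressed through Go's net package rather than sysctl:

    package main

    import (
        "log"
        "net"
    )

    // tune applies latency-oriented socket options to a TCP connection.
    // These mirror common Linux tunables and are an assumed example, not
    // DataSift's documented configuration.
    func tune(c net.Conn) {
        tcp, ok := c.(*net.TCPConn)
        if !ok {
            return
        }
        tcp.SetNoDelay(true)        // disable Nagle so small RPCs leave immediately
        tcp.SetReadBuffer(4 << 20)  // akin to raising net.core.rmem_max
        tcp.SetWriteBuffer(4 << 20) // akin to raising net.core.wmem_max
    }

    func main() {
        ln, err := net.Listen("tcp", "127.0.0.1:0")
        if err != nil {
            log.Fatal(err)
        }
        defer ln.Close()
        go net.Dial("tcp", ln.Addr().String()) // self-connect so Accept returns
        conn, err := ln.Accept()
        if err != nil {
            log.Fatal(err)
        }
        tune(conn)
        log.Println("tuned connection from", conn.RemoteAddr())
    }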

Epic Switches are only half the story

One of the great features of the Arista 7048 switches is their deep-buffer architecture, but in certain circumstances another buffer in the path is the last thing you want: each buffer potentially adds latency to the system before the upstream can detect the congestion and react to it.

The stack needs to be free of bottlenecks to prevent the buffers from filling up, and the 7048 switches can provide up to 40Gb/s of throughput to the core, which fits nicely with 40 1U servers in a 44U cabinet. That said, we're not ones to waste time and bandwidth by leaving the ToR switch if we don't have to.

By pooling resources into 'cells' we can reduce uplink utilisation and decrease the RTT of fan-out operations, splitting the workload into per-cabinet pools.

With intelligent health checks and resource routing, coupled with the Aristas' non-blocking, full wire-speed forwarding, the processing servers can call out cross-rack with very little penalty in the event of a resource pool suffering failures, as sketched below.
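
A toy version of that routing decision, with hypothetical pool names, health flags and addresses: prefer the pool behind the local ToR switch, and only fall back cross-rack when health checks rule the local pool out.

    package main

    import "fmt"

    // Pool is a set of service endpoints living in one cabinet.
    type Pool struct {
        Rack      string
        Endpoints []string
        Healthy   bool // driven by real health checks in practice
    }

    // pick prefers the pool in the caller's own rack and only falls back
    // cross-rack when the local pool is unhealthy.
    func pick(localRack string, pools []Pool) *Pool {
        var fallback *Pool
        for i := range pools {
            p := &pools[i]
            if !p.Healthy {
                continue
            }
            if p.Rack == localRack {
                return p // stay inside the cabinet: no uplink traffic at all
            }
            if fallback == nil {
                fallback = p // a healthy cross-rack option, cheap on wire-speed fabric
            }
        }
        return fallback
    }

    func main() {
        pools := []Pool{
            {Rack: "cab1", Endpoints: []string{"10.0.1.10"}, Healthy: false},
            {Rack: "cab2", Endpoints: []string{"10.0.2.10"}, Healthy: true},
        }
        if p := pick("cab1", pools); p != nil {
            fmt.Println("routing to rack", p.Rack, "via", p.Endpoints)
        }
    }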

That’s Great but I’m on the other side of the Internet

We are confident enough in our ability to provide low-latency, real-time, filtered and augmented content that we publish live latency statistics of a stream being consumed by an EC2 node on the other side of the planet on our status site: http://status.DataSift.com.

We can be this confident because we control and manage every aspect of our platform, from influencing how data traverses the Internet to reach us, through our routers and switches, all the way down to the SSD chipset or SAS drive spindle speed in the servers. (You can't say that if you're on someone else's public cloud!)

When you consider the factors outside of our control it speaks volumes about the trust we have in what we’ve built.

  User latency (they could be next door to a social platform's DC, or in Antarctica)             10ms – 150ms
+ Source platform processing time (e.g. time for Facebook or Twitter to process & send it on)    ???ms
+ Trans-Atlantic fibre optics (e.g. San Jose to our furthest European processing node)           ~150ms
+ Trans-Pacific fibre optics (e.g. from a European processing node to a customer in Japan)       ~150ms
= Total                                                                                          ~500ms

When dealing with social data on a global scale there can be a lot of performance uncertainty (undersea fibre cuts, carrier issues, entire IX outages), but we can rest assured that once the data hits our edge we can process it with low latency and high throughput.

In conclusion, I've once again been impressed by Arista and would wholeheartedly recommend their switches to anyone working with high-volume, latency-sensitive data.

Reading List:

Arista Warrior – Gary A. Donahue
Arista switches were already a joy to work with (access to bash on a switch, what's not to love?), but Gary's insights and advice make it all the better.

The Linux TCP/IP Stack: Networking for Embedded Systems – Thomas F. Herbert
Even with all the epicness of this hardware, if you're lazy about the steps your data goes through before it becomes a frame on the switch, you're gonna have a bad time; for heavy-duty reading, this book may help.

Published at DZone with permission of Gareth Llewellyn, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
