Lies, Service Level Agreements, Trust and Failure Mores

Oren Eini
·
Dec. 29, 14 · Interview

I had a very interesting discussion with Kelly Sommers on Twitter, but 140 characters isn’t really enough to explain things. It is also an interesting topic in general.

Kelly disagreed with this post: http://www.shopify.ca/technology/14909841-kafka-producer-pipeline-for-ruby-on-rails


You can read the full discussion here.

The basic premise is this: there is a highly reliable distributed queue that is used to process messages, but because they didn’t have operational experience with it, they used a local queue to store the messages before sending them over the network. Kelly seems to think that this decreases reliability. I disagree.

The underlying question is simple: when do you consider it acceptable to lose a message? If returning an error to the client is fine, sure, go ahead and do that when you can’t reach the cluster. But if you are currently processing a 10 million dollar order, that is going to kinda suck, and anything that you can do to reduce the chance of that happening is good. Note the key part of that statement: we can only reduce the chance of this happening, we can’t eliminate it.

One way to try that is to get a guaranteed SLA from the distributed queue. Once we have that, we can rely on it to work. This seems to be what Kelly is aiming at.

And that would be true, if you could rely on SLAs. Just this week we had a multi-hour, multi-region Azure outage. Outages, including outages that violate SLAs, are unfortunately common.

If we look at recent history, we have:

  • February 2012 – Azure – An incorrect leap year calculation took down multiple regions.
  • October 2012 – AWS – A memory leak caused by a misconfiguration took down a single availability zone, and API throttling caused other availability zones to be affected.
  • December 2012 – AWS – A developer was running against production and deleted some key data, resulting in Netflix (among others) being unable to stream video.
  • August 2013 – Azure – Bringing more servers online to increase capacity caused a misconfigured network appliance to believe that it was under attack, resulting in Azure Europe going dark.

There are actually more of them, but I think that 5 outages in 2 years is enough to show a pattern.

And note that I’m specifically talking about global outages like the ones above. Most people don’t realize that complex systems operate in a constant mode of partial failure. If you have ever read an accident investigation report, you’ll know that there is almost never just a single cause of failure. For example: the road was slippery, and the driver was speeding, and the ABS system failed, and the road barrier foundation had rotted since being installed. Changing even one of those would downgrade the accident from a fatal crash to “didn’t happen” or to “honey, we need a new car”.

You can try to rely on the distributed queue in this case, because it has an SLA. Toyota also promises that your car won’t suddenly accelerate into a wall, but if you had a Toyota Camry in 2010… well, you know…

From my point of view, saving the data locally before sending it over the network makes a lot of sense. In general, the local state of the machine is much more stable than the network. And if there is an internal failure in the machine, it is usually too hosed to do anything about anyway. I would try to write to disk, and I would still write to the network even if the disk write fails, because I want to do my utmost to not lose the message.

Now, let us consider the possible failure scenarios. I’m starting all of them with the notion that I just got a message for a 10 million dollar order, and I need to send it to the backend for processing.

  1. We can’t communicate with the distributed queue. That can be because it is down (hopefully that isn’t the case), but from our point of view, our node being split from the network has the same effect. We are writing the message down to disk, so when the network becomes available again, we’ll be able to forward the stored message to the distributed queue.
  2. We can’t communicate with the disk. Maybe it is full, or there is an error, or something like that. We can still talk to the network, so we place the message in the distributed queue and go on with our lives.
  3. We can’t communicate with the disk, and we can’t communicate with the network. We can’t keep the message in memory (we might overflow the memory), and if we are out of both disk and network, we are probably going to be rebooted soon anyway. SOL, there is nothing else we can do at this point.
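The three scenarios above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline from the Shopify post: `handle_message`, `spool_to_disk`, `forward_spooled`, `MessageLost`, the one-file-per-message spool layout, and the `send_to_queue` callable (assumed to raise `ConnectionError` when the cluster is unreachable) are all hypothetical names invented for this example.

```python
import json
import os
import uuid


class MessageLost(Exception):
    """Neither the local disk nor the distributed queue accepted the message."""


def spool_to_disk(spool_dir, message):
    """Durably write one message to its own file and return the file path."""
    path = os.path.join(spool_dir, uuid.uuid4().hex + ".json")
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(message, f)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the bytes to the device
    os.replace(tmp_path, path)  # atomic rename: readers never see half a file
    return path


def handle_message(message, spool_dir, send_to_queue):
    """Try the disk first, then the network; fail only if both are unavailable."""
    spooled_path = None
    try:
        spooled_path = spool_to_disk(spool_dir, message)
    except OSError:
        pass  # disk is unavailable (scenario 2 or 3), still try the network

    try:
        send_to_queue(message)  # assumed to raise ConnectionError on failure
    except ConnectionError:
        if spooled_path is None:
            # Scenario 3: no disk and no network, nothing else we can do.
            raise MessageLost("disk and network are both unavailable")
        return "spooled"  # scenario 1: forward it later

    if spooled_path is not None:
        os.remove(spooled_path)  # delivered, the local copy is no longer needed
    return "forwarded"


def forward_spooled(spool_dir, send_to_queue):
    """Forward stored messages once the network is back; return how many were sent."""
    forwarded = 0
    for name in os.listdir(spool_dir):  # arrival order is not preserved
        if not name.endswith(".json"):
            continue  # skip half-written .tmp files
        path = os.path.join(spool_dir, name)
        with open(path, encoding="utf-8") as f:
            message = json.load(f)
        send_to_queue(message)  # on ConnectionError the file stays for a retry
        os.remove(path)
        forwarded += 1
    return forwarded
```

One design consequence worth noting: if the process dies after `send_to_queue` succeeds but before the local file is removed, the message is sent again on the next pass. The sketch therefore gives at-least-once delivery, and the backend would need to tolerate duplicates.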

Note that the first case assumes that we actually do come back up. If the admin just blew this node away, then the data on that node isn’t coming back, obviously. But since the admin knows that we are storing things locally, s/he will at least try to restore the data from that machine.

We are safer (not safe, but safer than without it). The question is whether this is worth it. If your messages aren’t actually carrying financial information, you can probably just drop a few messages, as long as you let the end user know about it so they can retry. If you really care about each individual message, if it is important enough to go the extra few miles for it, then the store and forward model gives you a measurable level of extra safety.
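To make “measurable” a bit more concrete, here is a back-of-the-envelope sketch. It assumes that the failures are independent (which real outages often are not), and the probabilities in the usage lines are invented purely for illustration. The function and parameter names are hypothetical.

```python
def loss_without_spool(p_network_down):
    """Without a local queue, a message fails whenever the cluster is unreachable."""
    return p_network_down


def loss_with_spool(p_network_down, p_disk_fails, p_node_never_returns):
    """With store and forward, a message is lost only if the network is down AND
    either the disk write fails or the node never comes back to forward it.
    Assumes the three failures are independent of each other."""
    stuck_locally = p_disk_fails + (1.0 - p_disk_fails) * p_node_never_returns
    return p_network_down * stuck_locally


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from any real system.
    p_net, p_disk, p_gone = 0.01, 0.001, 0.01
    print("without local queue:", loss_without_spool(p_net))
    print("with local queue:   ", loss_with_spool(p_net, p_disk, p_gone))
```

The exact figures matter less than the structure: the local queue multiplies the chance of losing a message by the chance that the local machine also fails, and that second factor is usually small.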


Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
