
Application Safety and Correctness Cannot Be Offloaded Onto Istio or Any Service Mesh

Let's talk about what needs to be done to ensure correctness and safety in distributed applications, and why we can't leave it to the service mesh.

by Christian Posta · Aug. 15, 18 · Opinion

I've recently started giving a talk about the evolution of integration and the adoption of service mesh, specifically Istio. I've been excited about Istio ever since I first heard about it back in January 2017; in fact, I've been excited about this new wave of technology helping to make microservices and cloud-native architectures a possibility for organizations. Maybe you can tell, as I've been writing a lot about it (follow along for the latest @christianposta).

Istio builds on some of the goals of containers and Kubernetes: provide valuable distributed-systems patterns as language-agnostic idioms. For example, Kubernetes manages containers across a fleet of machines by doing things like start/stop, health checking, scaling/autoscaling, etc., regardless of what's actually running in the containers. Similarly, Istio can solve challenges of reliability, security, policy, and traffic by transparently applying solutions outside of the application's container.

With the announcement of Istio 1.0 on July 31st, 2018, we're seeing a large uptick in Istio usage and adoption. One question I have been seeing is, "If Istio provides reliability for me, do I have to worry about it in my application?"

The answer is: abso-freakin-lutely!

I wrote a post almost exactly a year ago that included this distinction, but didn't make it forcefully enough; this post is my attempt to rectify that, and it builds on the talk referenced earlier.

Just to set some context, Istio provides application-networking "reliability" capabilities like:

  • automatic retry
  • retry quota/budget
  • connection timeout
  • request timeout
  • client-side load balancing
  • circuit breaking
  • bulkheading

These capabilities are essential when dealing with distributed systems. Networks are not reliable, and they break a lot of the nice, safe assumptions and abstractions we have in a monolith. We're forced to either solve these problems or suffer unpredictable system-wide outages.

Taking a Step Back

The larger problem here is actually just getting applications to talk to each other to solve some business functionality. That's why we write software, ultimately: to deliver some kind of business value. And that software uses constructs from the business's domain like "customer," "shopping cart," "account," etc. We see from domain-driven design that each service may have a slightly different understanding of each of those concepts.

These poorly specified concepts and the larger business constraints (i.e., a customer is uniquely identified by name and email, or a customer can have only one type of checking account, etc.), along with unreliable networking and overall unpredictable infrastructure (build your services with the assumption that things can, and will, fail!), make building things correctly very difficult.

End-to-End Correctness and Safety

The fact remains, however, that the responsibility for building correct and safe applications lies with the application (and all those who support it). We can try to build lower levels of reliability into components of the system for performance or optimization, but the overall responsibility still remains with the applications. This principle was covered in "End-to-End Arguments in System Design" by Saltzer, Reed, and Clark in 1984. Specifically:

"The function in question can completely and correctly be implemented only with the knowledge and help of the application standing at the endpoints of the communication system."

Here, "function" is meant to be one of the application requirements like "book a reservation" or "add an item to a shopping cart." This kind of functionality cannot be generalized to the communication system or its components/infrastructure (the "communication system" here refers to the network, the middleware, and anything providing infrastructure for applications to do their job):

"Therefore, providing that questioned function as a feature of the communication system itself is not possible."

However, we can do things to the communication system to make parts of it reliable and generally assist in accomplishing a higher-order application requirement. We do these things to optimize an area so the application doesn't have to worry about it "as much," but it's not something the application can ignore:

"Sometimes an incomplete version of the function provided by the communication system may be useful as a performance enhancement."

For example, the Saltzer paper uses the example of transferring a file from application A to application B.

What do we need to do (safety) to ensure the file gets delivered intact (correctness)? At any point in that transfer, things can fail: 1) the storage mechanism can have failed sectors, transposed bits, or corruption, so when application A reads the file, it's reading a faulty file; 2) the application could have a bug reading the file into memory or sending it out; 3) the network could mix up the byte ordering, duplicate parts of the file, etc. There are optimizations we can make, like using a more reliable transport such as TCP or a message queue, but TCP doesn't know the semantics of "delivering a file correctly," so the best we can hope for is that what we put on the network gets delivered reliably.

For full end-to-end correctness, we may need to use something like a file checksum that gets stored with the file on its initial write, and then have B verify the checksum when it receives the file. However we choose to verify that the transfer took place correctly (an implementation detail), the responsibility lies with the application to figure out the solution and get it right, not with TCP or a message queue.
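As a rough illustration of the checksum idea, here is a minimal Python sketch. The function names (`package_file`, `receive_file`), the use of SHA-256, and the dictionary "message" are my own assumptions for illustration; they do not come from the Saltzer paper or from any particular library.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def package_file(data: bytes) -> dict:
    """Application A (hypothetical): pair the file with a checksum computed at the source."""
    return {"payload": data, "checksum": sha256_of(data)}


def receive_file(message: dict) -> bytes:
    """Application B (hypothetical): accept the file only if it matches the checksum."""
    if sha256_of(message["payload"]) != message["checksum"]:
        raise ValueError("checksum mismatch: file is not intact")
    return message["payload"]
```

Whatever transport carries the message, the check happens at the endpoints, which is the only place that knows what "delivered intact" means for this file.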

What Are Typical Patterns That Crop Up?

In an effort to solve for application correctness and safety in distributed applications, certain patterns recur. Earlier we mentioned some of the reliability patterns that Istio gives us, but those are not the only ones. Generally, there are two related classes of patterns that we can use to assist in building applications correctly and safely. I call those classes "application integration" and "application networking." Both are the responsibility of the application. Let's take a look:

Application Integration

These patterns crop up in the form of:

  • call sequencing, multicasting, and orchestration
  • aggregate responses, transforming message semantics, splitting messages, etc
  • atomicity, consistency issues, saga pattern
  • anti-corruption layers, adapters, boundary transformations
  • message retries, de-duplication/idempotency
  • message re-ordering
  • caching
  • message-level routing
  • retries, timeouts
  • backend/legacy systems integration

Using a simple example of "add an item to a shopping cart," we can illustrate these concepts.

When a user clicks "add to cart," they expect to see the item added to their shopping cart. In the system, this may involve coordinating and sequencing calls to a recommendation engine (hey, we added this to the cart, wonder if we can compute recommended offers to go along with it), an inventory service, and others before we actually call the service to insert into the shopping cart. We need to be able to handle transforming the message for the different backends and dealing with failures (and rolling back any changes we initiated), and in each one of the services, we need to be able to deal with duplicates. What if, for some reason, the call ends up being slow and the user clicks "add to cart" again? No amount of reliable infrastructure can save us from a user doing this; we need to detect duplicates and implement idempotent services in the application.
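To make the duplicate-handling point concrete, here is a minimal in-memory Python sketch of an idempotent "add to cart" operation. The `ShoppingCart` class and the `request_id` parameter are hypothetical names of mine, and the sketch assumes the client attaches the same `request_id` when it resends the same action.

```python
class ShoppingCart:
    """Hypothetical cart that ignores repeated requests with the same ID."""

    def __init__(self) -> None:
        self.items = {}             # sku -> quantity
        self._seen_requests = set() # request IDs already applied

    def add_item(self, request_id: str, sku: str, quantity: int = 1) -> bool:
        """Apply the request once; return False if it was a duplicate."""
        if request_id in self._seen_requests:
            return False
        self._seen_requests.add(request_id)
        self.items[sku] = self.items.get(sku, 0) + quantity
        return True
```

A real service would persist the seen request IDs alongside the cart so the check survives restarts, but the decision about what counts as "the same request" still belongs to the application.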

Application Networking

These patterns come in the form of:

  • automatic retry
  • retry quota/budget
  • connection timeout
  • request timeout
  • client-side load balancing
  • circuit breaking
  • bulkheading
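To give a feel for what two of these patterns look like when written as in-process application code, here is a small Python sketch of automatic retry guarded by a retry budget. This is an illustration only, not how Istio or any specific library implements it; `RetryBudget`, `call_with_retry`, and their parameters are names I made up for the example.

```python
class RetryBudget:
    """Hypothetical budget: retries may not exceed a fraction of total calls."""

    def __init__(self, ratio: float = 0.2, min_retries: int = 1) -> None:
        self.ratio = ratio
        self.min_retries = min_retries
        self.calls = 0
        self.retries = 0

    def record_call(self) -> None:
        self.calls += 1

    def try_acquire(self) -> bool:
        """Spend one retry if the budget allows it."""
        allowed = max(self.min_retries, int(self.calls * self.ratio))
        if self.retries < allowed:
            self.retries += 1
            return True
        return False


def call_with_retry(operation, budget: RetryBudget, max_attempts: int = 3):
    """Call operation(), retrying on ConnectionError while the budget permits."""
    budget.record_call()
    last_error = None
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError as err:
            last_error = err
            # Stop on the final attempt or when the shared budget is exhausted.
            if attempt == max_attempts - 1 or not budget.try_acquire():
                break
    raise last_error
```

The budget is shared across calls, so a widespread outage cannot turn every request into several requests.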

There are also other complications of dealing with applications communicating over the network:

  • canary rollout
  • traffic routing
  • metrics collection
  • distributed tracing
  • traffic shadowing
  • fault injection
  • health checking
  • security
  • organizational policy

How Do We Use These Patterns?

In the past, we tried to commingle these areas of application responsibility. We would shove everything (application networking + application integration) into centralized infrastructure that was counted on to be basically 100% available. We put application concerns into this centralized infrastructure (which was supposed to make us more agile) but then suffered bottlenecks and rigidity when it came to making changes to applications quickly. These dynamics manifested in the way we implemented the enterprise service bus.

Alternatively, I believe the big cloud companies (Netflix, Amazon, Twitter, etc.) recognized this "application responsibility" aspect of these patterns and just commingled the application-networking code into the application. Think of things like Netflix OSS, where we had different libraries for circuit breaking, client-side load balancing, service discovery, etc.

As you know, the Netflix OSS libraries around application networking were very Java focused. As organizations started to adopt Netflix OSS and derivatives like spring-cloud-netflix, they met head-on with the fact that operationalizing an architecture like that became prohibitive as soon as you started adding other languages. Netflix had the maturity and automation in place to pull it off; other organizations are not Netflix. Some of the problems that come up when trying to operationalize application libraries and frameworks that solve the application-networking spectrum of problems:

  • Each language/framework has its own implementation of these concerns.
  • The implementations won't be 100% the same; they'll vary, differ, and sometimes be wrong.
  • How do you manage, update, and patch these libraries (i.e., lifecycle management)?
  • These libraries muddy up the logic of the application.
  • Lots of trust is placed in developers implementing the basics correctly.

Istio and service mesh in general aim to solve the application-networking class of problems. Moving the solution to these problems into the service mesh is an optimization for operability. This does not mean it's no longer the application's responsibility; it just means the implementation of these capabilities exists out of process and must be configured.

By doing so, we can optimize operability by gaining the following:

  • one single implementation of these capabilities everywhere
  • consistent functionality
  • correct functionality
  • programmable by both application operators and application developers

Istio and service mesh don't allow you to offload responsibility to the infrastructure; they just add some level of reliability and optimize for operability. Likewise, in the end-to-end argument, TCP doesn't allow you to offload application responsibilities.

Istio helps with the application-networking class of problems, but what of the application-integration class of problems? Luckily for developers, there's a myriad of frameworks to help with the application-integration aspects. My favorite for Java developers is Apache Camel, which provides a lot of the pieces needed to write correct and safe applications, including:

  • call sequencing, multicasting, and orchestration
  • aggregate responses, transforming message semantics, splitting messages, etc
  • atomicity, consistency issues, saga pattern
  • anti-corruption layers, adapters, boundary transformations
  • message retries, de-duplication/idempotency
  • message reordering
  • caching
  • message-level routing
  • retries, timeouts
  • backend/legacy systems integration

Other frameworks include Spring Integration and even an interesting new programming language from WSO2 called Ballerina. Mind you, it's nice to reuse existing patterns and constructs, especially if they exist and are mature for your language of choice, but none of these patterns require you to use a framework.
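As one framework-free illustration, the "rolling back any changes we initiated" part of the saga pattern can be sketched in a few lines of plain Python. The `run_saga` helper and its (action, compensation) pairing are my own simplification for this example, not the API of Camel, Spring Integration, or Ballerina.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If an action raises, run the compensations of the already-completed
    steps in reverse order, then re-raise the original error.
    """
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            raise
        completed.append(compensate)
```

For the shopping cart example, a step might pair "reserve inventory" with "release inventory." Deciding what a valid compensation is requires knowledge of the business domain, which is why this logic stays in the application.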

What About Smart Endpoints, Dumb Pipes?

With respect to microservices, a friend of mine posed a question about the catchy but simplistic phrase "smart endpoints and dumb pipes," and how "making the infrastructure smarter" affects that premise.

The answer I gave was:

The pipes are still dumb; we're not coercing application logic about application correctness and safety into the infrastructure by using a service mesh. We're simply making it more reliable, optimizing for operational aspects, and simplifying what the application has to implement (not what it is responsible for).

Feel free to leave comments or reach out on Twitter @christianposta if you disagree or have additional thoughts.

If you'd like to learn more about Istio, check out http://istio.io or the book I wrote about Istio.


Published at DZone with permission of Christian Posta, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
