Distributed Systems 101
Learn the core challenges of distributed systems and how to address them with scalability, observability, and smart design choices.
Distributed systems are all around us: Facebook, Uber, Revolut — even the Google search engine is one of them. One search in Google can trigger tens (or hundreds) of calls to different microservices owned by Google.
What is more, they are at the core of what we work with: multiple services working together, a service talking to a database, a service or two with a cache layer, or services connected via an asynchronous message queue.
All of them share similar traits and problems. In this text, I will try to describe at least the most common of these problems — what they are, how they may impact your system, and how you can potentially mitigate them.
Let’s start with a definition of distributed systems.
What Are Distributed Systems?
And the topic gets tricky right from the beginning, because there are multiple answers to that question. As far as I am aware, there are at least three different ones, and almost everybody writing a book on the topic comes up with their own approach.
I am certainly not going to add yet another one to the list. Instead, I would like to point out that all of these definitions describe systems that share a few common traits:
- Distribution (hehehe) — the system is split across more than one node, usually many more than that
- Communication — different nodes in the system communicate with one another in either an asynchronous or a synchronous manner
- Cooperation — nodes in the system work together towards a common goal, like allowing you to order your ride in the case of Uber
As for the exact definition, in my opinion, the oldest and the funniest one is the best. Quoting Leslie Lamport, a man with a tremendous impact on how the distributed systems landscape looks:
A distributed system is one in which the failure of a computer you did not even know existed can render your own computer unusable.
This definition, while somewhat humorous, perfectly captures the key aspects of distributed systems (and in fact of any system built using a microservices architecture): cooperation towards a common goal and a split across multiple nodes.
Key Challenges In Distributed Systems
As you can see, while the definitions may be ambiguous, a few traits describe every distributed system. The same holds true for the challenges: there are a few key problems that you will encounter sooner or later while working with this class of systems.
Availability
These days, when each millisecond of delay can translate into dollars or even thousands of dollars lost, availability is probably the single most important trait that a system exposes.
Availability describes how our system handles failures, and it determines the system’s uptime. Usually, we describe the availability of a system in “nines” notation: 99% availability allows a maximum of 14.4 minutes of downtime per day, while 99.999% (the so-called five nines) reduces this time to about 864 milliseconds.
Most cloud services come with an SLA that guarantees between three and five nines of availability for end users.
| Availability (%) | Downtime per day (~) | Downtime per month (~) | Downtime per year (~) |
|---|---|---|---|
| 90 | 144 minutes (2.4 hours) | 73 hours | 36.53 days |
| 99 | 14.4 minutes | 7.3 hours | 3.65 days |
| 99.9 | 1.44 minutes | 44 minutes | 8.77 hours |
| 99.99 | 8.6 seconds | 4.4 minutes | 52.6 minutes |
| 99.999 | 864 milliseconds | 26 seconds | 5.3 minutes |
| 99.9999 | 86.4 milliseconds | 2.6 seconds | 31.5 seconds |
Additionally, the term high availability (HA) is used to describe services that guarantee at least three nines of availability.
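To make the arithmetic behind these numbers concrete, here is a minimal sketch (plain Java, with values and formatting chosen purely for illustration) that derives the allowed downtime per day from an availability percentage, using downtime = period * (1 - availability):

```java
public class DowntimeCalculator {

    private static final double MILLIS_PER_DAY = 24 * 60 * 60 * 1000.0;

    // Allowed downtime per day for a given availability expressed as a percentage.
    static double allowedDowntimeMillisPerDay(double availabilityPercent) {
        double unavailableFraction = 1.0 - availabilityPercent / 100.0;
        return MILLIS_PER_DAY * unavailableFraction;
    }

    public static void main(String[] args) {
        double[] levels = {90, 99, 99.9, 99.99, 99.999, 99.9999};
        for (double level : levels) {
            System.out.printf("%-9s -> %.1f ms of downtime per day%n",
                    level + "%", allowedDowntimeMillisPerDay(level));
        }
    }
}
```

For 99.999%, this prints 864.0 ms per day, matching the five-nines row in the table above.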
There is a famous tension between availability and consistency, formalized in the CAP theorem. The common notion is that in case of a failure we can have either one or the other. While in most cases this is true, the topic as a whole is vastly more nuanced and complex. For example, CRDTs put this blanket statement into question; the same is true for Google’s Spanner.
Moreover, we can use various techniques to balance both of these traits, and our system may favor one over the other in certain places but not in others. Just remember that this tension exists and is one of the most important areas of study in distributed systems research.
What Limits Availability?
- Single points of failure — running a single instance of a service, or of a tool such as a database, is an availability killer. In case of any serious failure, our service goes offline right away, and we start burning money.
- Statefulness — while in some cases stateful services or processes are required and totally understandable, we should limit them as much as we can and reduce the number of services involved in stateful flows.
- Synchronous communication — synchronous communication creates a direct dependency between services. If one side of the communication becomes slow, the availability of the other is automatically impacted. As with stateful processing, if you rely too heavily on synchronous communication, you can easily hurt the whole system’s availability.
What Increases Availability?
- Redundancy — having multiple instances of a service that can handle incoming requests. If one instance fails, another can easily take over and continue doing the job.
- Automatic failover — switching to a healthy instance of a database or other service when a failure occurs keeps downtime close to zero.
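As an illustration of both points, here is a minimal sketch of a client that fails over between redundant instances. The Replica interface and the endpoints are hypothetical stand-ins, not a real API:

```java
import java.util.List;

/** A minimal failover sketch: try redundant replicas in order until one succeeds. */
public class FailoverClient {

    interface Replica {
        String handle(String request) throws Exception; // may throw on failure
    }

    private final List<Replica> replicas;

    FailoverClient(List<Replica> replicas) {
        this.replicas = replicas;
    }

    String call(String request) {
        Exception last = null;
        // Automatic failover: if the current replica fails, move on to the next one.
        for (Replica replica : replicas) {
            try {
                return replica.handle(request);
            } catch (Exception e) {
                last = e; // remember the failure and keep trying
            }
        }
        throw new IllegalStateException("All replicas failed", last);
    }

    public static void main(String[] args) {
        Replica broken = request -> { throw new RuntimeException("instance down"); };
        Replica healthy = request -> "handled: " + request;
        FailoverClient client = new FailoverClient(List.of(broken, healthy));
        System.out.println(client.call("GET /orders/42")); // falls over to the healthy replica
    }
}
```

A real client would add retries with backoff, health checks, and timeouts, but the core idea stays the same: no single instance is allowed to take the whole flow down with it.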
Scalability
This property describes a system’s readiness to handle increased load. The better the system scales, the more concurrent incoming requests it can process before users start to notice any performance degradation.
It is crucial to design your system with scalability in mind from day one; if scalability is not part of the design, handling increased load will sooner or later require architectural changes, and that may not be the most pleasant experience on a living, breathing production system.
In terms of importance, there is a tie between scalability and availability. Deciding which one matters more is very hard and in most cases depends on the exact use case. They usually go hand in hand, and the same actions often mitigate problems with both traits.
The limiting factors are almost the same as for availability, as it is hard to scale a system that is not available. Additionally, we can add tight coupling between components and, to a degree, monolithic architecture, as both make it hard to scale individual components separately, forcing us to scale the system as a whole.
How to increase scalability (a minimal load-balancing sketch follows this list):
- Asynchronous communication
- Load balancing
- Caching
- Microservice architecture
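As a small illustration of the load-balancing point above, here is a minimal round-robin sketch. The instance addresses are made up, and a real load balancer would also track health checks, weights, and timeouts:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** A minimal round-robin load balancer sketch. */
public class RoundRobinBalancer {

    private final List<String> instances;
    private final AtomicInteger counter = new AtomicInteger();

    RoundRobinBalancer(List<String> instances) {
        this.instances = List.copyOf(instances);
    }

    /** Picks the next instance in a thread-safe, round-robin fashion. */
    String next() {
        int index = Math.floorMod(counter.getAndIncrement(), instances.size());
        return instances.get(index);
    }

    public static void main(String[] args) {
        RoundRobinBalancer balancer = new RoundRobinBalancer(
                List.of("10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"));
        for (int i = 0; i < 6; i++) {
            System.out.println("request " + i + " -> " + balancer.next());
        }
    }
}
```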
Sounds interesting? I have dived somewhat deeper into the topic of scalability in a separate text.
Maintainability
Besides making a system available and scalable, there is also one important thing that we have to keep in mind. We will have to maintain this system after we release it. Some may say that this trait is even more important than both of the previous ones. Even the perfect system may cause a lot of headaches if we have problems maintaining it.
What do you do when there is a production issue and you have no logs to reason about? How do you notice performance degradation when there are no metrics? Sure, we can count on our users to report such problems, but that may not be the smartest business decision.
How to make a system maintainable:
- Observability — a catch-all term for logs, metrics, alerting, and tracing. Without them, maintaining the system is an order of magnitude harder. We need timely and adequate responses when an issue occurs, and nobody likes being woken up by a phone call because something bad is happening to our software (a tiny timing-metric sketch follows this list).
- Tests — it’s as simple as that: tests are mandatory. Saying “I know it should work” is not a good approach.
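Here is the promised sketch: a hand-rolled example of emitting one structured latency log line per operation. The operation name and output format are arbitrary; in a real system you would likely reach for a metrics or tracing library such as Micrometer or OpenTelemetry instead of System.out:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/** A tiny observability sketch: time an operation and emit a structured log line. */
public class Timed {

    static <T> T timed(String operation, Supplier<T> body) {
        Instant start = Instant.now();
        try {
            return body.get();
        } finally {
            long millis = Duration.between(start, Instant.now()).toMillis();
            // One line per call, easy to parse, aggregate, and alert on.
            System.out.printf("metric=latency operation=%s duration_ms=%d%n",
                    operation, millis);
        }
    }

    public static void main(String[] args) {
        String result = timed("load-user-profile", () -> {
            // Placeholder for a real call to a downstream service or database.
            return "profile-42";
        });
        System.out.println("result=" + result);
    }
}
```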
Complexity
System design is a constant struggle to handle more, faster, and better. In all of this race, we must not forget about the complexity of our solution; it is important not to overengineer the system.
The tendency in software is for everything to grow, sooner or later, into an unmanageable engine that does everything, half of which we do not even remotely understand. As engineers, we must delay this process for as long as possible.
Truisms like “the more complex our system is, the harder it will be to add or change anything inside it” or “the more complex it is, the higher the business costs” are repeated everywhere, but nevertheless, here they are again; it seems they are not reaching the right people anyway.
Just remember to keep everything as simple as possible at any given point and leave as much design space as you can for later; complexity will come on its own with time.
As for a few more concrete examples:
- Maybe you do not need this or that technology or tool.
- Maybe you do not need yet another language somewhere in your overall architecture.
- The latest fancy trend, while fancy, may not be a good long-term option.
Common Parts and Trade-Offs
These are just a few of the common problems that arise in distributed systems. There are more of them, and some of the points above are even more nuanced than described here.
Some approaches can address more than one pitfall.
- For example, replication is a tool that impacts both availability and scalability: we can use replicas for reading data while we continue writing to the primary (a small read/write routing sketch follows this list).
- The same goes for load balancing: we can spin up one or more load balancers that split the load between our services, which improves both availability (automatic failover) and scalability (requests are routed to multiple instances for processing).
- On the other hand, migrating to microservices or some other horizontally scalable architecture significantly increases the system’s complexity.
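Here is the read/write routing sketch mentioned above, assuming a hypothetical primary plus two read replicas. The connection strings are placeholders, and a real setup would also have to deal with replication lag and failover of the primary:

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/** A minimal read/write splitting sketch: writes go to the primary, reads to replicas. */
public class ReplicaRouter {

    private final String primary;
    private final List<String> readReplicas;

    ReplicaRouter(String primary, List<String> readReplicas) {
        this.primary = primary;
        this.readReplicas = List.copyOf(readReplicas);
    }

    /** All writes go to the primary to keep a single source of truth. */
    String routeWrite() {
        return primary;
    }

    /** Reads can be served by any replica, trading a bit of staleness for scale. */
    String routeRead() {
        int index = ThreadLocalRandom.current().nextInt(readReplicas.size());
        return readReplicas.get(index);
    }

    public static void main(String[] args) {
        ReplicaRouter router = new ReplicaRouter(
                "db-primary:5432", List.of("db-replica-1:5432", "db-replica-2:5432"));
        System.out.println("INSERT goes to " + router.routeWrite());
        System.out.println("SELECT goes to " + router.routeRead());
    }
}
```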
Moreover, all this redundancy, load balancing, monitoring, and so on also affects complexity; it grows with each new component.
Probably the biggest game-changer for these problems is stateless processing/services. It addresses the concerns above in one manner or another (a small before/after sketch follows the list):
- Availability — You can spin up multiple instances, and they can easily pick up the job of failed ones
- Scalability — Spin up new instances for as long as your budget allows
- Complexity — No state means fewer moving parts, and it’s easier to reason about what exactly is happening
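To show the contrast, here is a minimal before/after sketch of a hypothetical shopping-cart handler. The stateful version keeps the cart in instance memory; the stateless one pushes it to an external session store, represented here by a simple interface standing in for Redis, a database, or similar:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** A sketch of the stateful vs. stateless contrast; the cart API is hypothetical. */
public class CartHandlers {

    // Stateful variant: the cart lives inside this instance, so requests from the same
    // user must always hit the same node, and a crash loses the cart.
    static class StatefulCartHandler {
        private final Map<String, String> cartsInMemory = new ConcurrentHashMap<>();

        void addItem(String userId, String item) {
            cartsInMemory.merge(userId, item, (old, added) -> old + "," + added);
        }
    }

    // Stateless variant: the instance keeps nothing between requests, so any node can
    // handle any request, which helps availability, scalability, and reasoning.
    interface SessionStore {                 // stand-in for Redis, a database, etc.
        String get(String key);
        void put(String key, String value);
    }

    static class StatelessCartHandler {
        private final SessionStore store;

        StatelessCartHandler(SessionStore store) {
            this.store = store;
        }

        void addItem(String userId, String item) {
            String cart = store.get(userId);
            store.put(userId, cart == null ? item : cart + "," + item);
        }
    }

    public static void main(String[] args) {
        Map<String, String> backing = new ConcurrentHashMap<>();
        SessionStore store = new SessionStore() {  // in-process stand-in for an external store
            public String get(String key) { return backing.get(key); }
            public void put(String key, String value) { backing.put(key, value); }
        };
        StatelessCartHandler handler = new StatelessCartHandler(store);
        handler.addItem("user-42", "book");
        handler.addItem("user-42", "laptop");
        System.out.println("cart for user-42: " + backing.get("user-42"));
    }
}
```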
Here you can find Neal Ford’s talk on trade-offs and their consequences; I strongly recommend watching it.
Summary
With this optimistic note on trade-offs, we reach the end of today’s journey. Designing systems is hard, and it is even harder to do it well.
Below are the key takeaways I would like you to take from this blog:
- Remember that your choices have consequences and may impact multiple areas of your system.
- Always try to keep everything simple.
- Stateless is better than stateful.
- Observability and tests are both essential.
Good luck with designing architectures, and have fun doing it. Thank you for your time.
Published at DZone with permission of Bartłomiej Żyliński. See the original article here.