Debugging Microservices Networking Issues — An Introduction
Broadly speaking, many of the new debugging challenges you are expected to face with distributed microservices can be categorized as networking problems between the different parts of the infrastructure.
Note that inter-service communication in distributed systems is implemented either as a request/response synchronous communication (REST, gRPC, GraphQL) or asynchronous event-driven messaging (Kafka, AMQP, and many others).
Synchronous mechanisms are the clear winners – at least as of late 2020 – because it is much easier to develop, test, and maintain synchronous code. But they bring with them a host of problems. Let’s take a look at some of the possible friction points first, and then explore a few of the possible tools we can use to tackle them.
Inconsistent Network Layers
Your microservices might be deployed across different public clouds or on-premises, which means the networking layer a given service is built on top of can vary drastically between services. This is often the cause of sudden, non-reproducible timeouts, bursts of increased latency, and low throughput. These are a sad daily routine, the majority of which is out of your control.
Microservices are dynamic, so the routing should be as well. It’s not clear to a service where exactly in the topology its companion service is located, so specialized tooling is needed to allow each service to dynamically detect its peers.
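To make the dynamic-routing problem concrete, here is a minimal in-memory sketch of the kind of registry such tooling maintains. It is a toy, not tied to any real discovery system (Consul, Eureka, Kubernetes DNS, etc.); all names and addresses are hypothetical. Instances register with periodic heartbeats, and lookups only return peers whose heartbeat is still within the TTL:

```python
import time

class ServiceRegistry:
    """Toy registry: instances heartbeat on register(), and lookup()
    returns only the instances whose last heartbeat is still fresh."""

    def __init__(self, ttl_seconds=30):
        self.ttl = ttl_seconds
        self._instances = {}  # service name -> {address: last_heartbeat}

    def register(self, name, address, now=None):
        now = time.time() if now is None else now
        self._instances.setdefault(name, {})[address] = now

    def lookup(self, name, now=None):
        now = time.time() if now is None else now
        # Drop instances that stopped heartbeating (crashed, rescheduled).
        fresh = {addr: ts for addr, ts in self._instances.get(name, {}).items()
                 if now - ts <= self.ttl}
        self._instances[name] = fresh
        return sorted(fresh)

registry = ServiceRegistry(ttl_seconds=30)
registry.register("orders", "10.0.0.5:8080", now=100)
registry.register("orders", "10.0.0.6:8080", now=100)
print(registry.lookup("orders", now=120))  # both instances still fresh
print(registry.lookup("orders", now=200))  # both expired -> []
```

Real discovery systems add health checks, watches, and replication on top, but the core contract is the same: peers are looked up at call time, never hard-coded.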
Cascading Failures and Propagated Bottlenecks
Any microservice may start responding slower to the network requests from other services because of high CPU, low memory, long-running DB queries, and other factors. This may end up causing a chain reaction that will slow down other services, causing even more bottlenecks or making them drop connections.
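One common defense against this chain reaction is to propagate a deadline with each request and fail fast once the budget is spent, rather than queueing more work behind an already slow service. A minimal sketch (the names and the `downstream` stand-in are hypothetical, not a specific framework's API):

```python
import time

class DeadlineExceeded(Exception):
    pass

def call_with_deadline(operation, deadline):
    """Refuse the call outright if the request's deadline has already
    passed; otherwise pass the remaining budget down as the timeout."""
    remaining = deadline - time.monotonic()
    if remaining <= 0:
        raise DeadlineExceeded("no time budget left; failing fast")
    return operation(timeout=remaining)

def downstream(timeout):
    # Stand-in for an RPC; real code would hand `timeout` to its client.
    return "ok"

deadline = time.monotonic() + 0.5   # the whole request gets 500 ms
print(call_with_deadline(downstream, deadline))  # → ok
```

Because the same deadline shrinks as it travels down the call chain, a slow service makes its callers give up quickly instead of amplifying the bottleneck.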
Error Recovery and Fault Tolerance
Microservices – by definition – have many more moving parts that can fail along the way in comparison to monolith applications. This makes the graceful handling of the inevitable communication failures both critical and complicated.
Language-specific networking SDKs may handle various edge cases in different ways, which adds instability and chaos to inter-service communication.
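A typical building block for graceful failure handling is a retry with exponential backoff and jitter, applied uniformly so every service treats transient errors the same way. A sketch under the assumption that transient failures surface as `ConnectionError` (real SDKs raise their own exception types):

```python
import random
import time

def retry(operation, attempts=3, base_delay=0.1, max_delay=2.0):
    """Retry a flaky call with exponential backoff and full jitter,
    so every service handles transient failures consistently."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))  # jitter avoids thundering herds

# Simulated flaky dependency: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return "ok"

print(retry(flaky, base_delay=0.01))  # → ok, after two retries
```

The jitter matters: if every caller backs off on the same schedule, the retries themselves arrive as a synchronized burst and re-trigger the failure.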
Load Balancing Complexity
In a world of monolithic applications, the traffic is primarily north-south (from the Internet to the applications servers), and there are plenty of well-known solutions like API Gateways and load balancers that take care of the load. Microservice applications communicate with each other constantly, adding far more east-west traffic, which introduces an additional level of complexity.
One of the biggest advantages of the microservice approach, as we already mentioned, is independent scalability – each part of the system can be scaled on its own. Synchronous communication literally kills this advantage: if your API Gateway synchronously communicates with a database or any other downstream service, any peak load in the north-south traffic will overwhelm those downstream services immediately. As a result, all the services down the road will need rapid and immediate scaling.
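The standard remedy is to put a buffer between the gateway and the downstream services, so a burst is absorbed at enqueue speed and drained at whatever pace the consumer can sustain. A toy in-process sketch of that decoupling (a real system would use a broker such as Kafka or SQS, not a Python deque):

```python
from collections import deque

class MessageBuffer:
    """Toy queue: the gateway publishes at burst speed; the downstream
    service drains in small batches at its own pace."""

    def __init__(self):
        self._queue = deque()

    def publish(self, message):
        self._queue.append(message)

    def drain(self, batch_size):
        batch = []
        while self._queue and len(batch) < batch_size:
            batch.append(self._queue.popleft())
        return batch

buffer = MessageBuffer()
for i in range(100):                      # a burst of north-south traffic
    buffer.publish({"order_id": i})
print(len(buffer.drain(batch_size=10)))   # downstream takes 10 at a time → 10
```

With a buffer in place, a traffic spike raises queue depth and latency rather than forcing every downstream service to scale instantly.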
Difficult Security Configuration
East-west traffic requires many more SSL certificates, firewalls, and ACL policies, whose configuration and enforcement are non-trivial and error-prone, especially when done manually.
To sum it up, one can say that implementing a synchronous style of inter-service communication literally contradicts the whole point of breaking your monolith into the microservices. Some even claim that it turns your microservices back into a monolith. At the very least, synchronous RPC mechanisms introduce tight coupling, cascading failures, and bottlenecks, and increase load balancing and service discovery overhead. These issues make it hard for the application to scale well.
For a synchronous request/response style architecture, a service mesh is the current de-facto standard solution. In a nutshell, a service mesh manages all the service-to-service, east-west traffic. It consists of two main parts: a data plane (the sidecar proxies deployed alongside each service, which act as its transport layer and actually move the traffic) and a control plane (which configures and controls the data plane).
The idea of a service mesh is to offload all the inter-service communication tasks and issues to a separate abstraction layer that takes care of all of this transportation hassle – allowing the microservice code to focus on business logic only. Typical service mesh solutions offer at least some of the following features:
- Traffic control features – routing rules, retries, failovers, dynamic request routing for A/B testing, gradual rollouts, canary releases, circuit breakers, etc.
- Health monitoring – such as health checks, timeouts/deadlines, circuit breaking.
- Policy enforcement – throttling, rate limits, and quotas.
- Security – TLS, application-level segmentation, token management.
- Configuration and secret management.
- Traffic observability and monitoring – top-line metrics (request volume, success rates, and latencies), distributed tracing, and more.
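To illustrate one of the features above, here is a minimal sketch of the circuit-breaker logic a mesh sidecar applies on your behalf (real implementations such as Envoy's are configuration-driven and far richer; the class and its thresholds here are invented for illustration):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `max_failures` consecutive errors
    the circuit opens and calls fail fast until `reset_timeout` elapses."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, operation, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is not None:
            if now - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None   # half-open: allow one trial call
            self.failures = 0
        try:
            result = operation()
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = now   # trip the breaker
            raise
        self.failures = 0              # any success resets the count
        return result

breaker = CircuitBreaker(max_failures=2, reset_timeout=30.0)

def always_down():
    raise ConnectionError("service unreachable")

for _ in range(2):                 # two failures trip the breaker
    try:
        breaker.call(always_down, now=0)
    except ConnectionError:
        pass
try:
    breaker.call(always_down, now=1)   # rejected without a network call
except RuntimeError as exc:
    print(exc)                          # → circuit open: failing fast
```

Failing fast like this is what stops the cascading-failure scenario described earlier: callers stop hammering a struggling service and give it room to recover.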
Service mesh solutions address most of the challenges mentioned above, remove the need for costly API Gateways and load balancers for east-west traffic, standardize the handling of network issues and configuration across polyglot services, and take care of service discovery.
Service meshes are not silver bullets, however. Let’s take a moment to talk about a few possible pitfalls they have:
- Relatively new — The technology is still at its early adoption phase, and is subject to constant and breaking changes.
- Cost — Service meshes require an upfront investment in a platform, an expense that can be difficult to justify when applications are still evolving.
- Performance — A performance penalty (both in-network latency and runtime resources consumption) is inevitable and practically unpredictable.
- Operational Complexity — Service mesh functionality may duplicate existing application logic, which can lead, for example, to redundant retries and duplicated transactions. Also, by virtue of being another component in the request path, a mesh adds one more layer to the stack that you need to operate and maintain.
- Multi-cluster topologies are generally not well supported.
The most important drawback, though, is the lack of support for asynchronous event-driven architecture. While we are not going to discuss synchronous vs. asynchronous microservices communication here, the asynchronous approach suits the microservices paradigm much better (if that interests you, by the way, you should read the Microsoft and AWS blog posts on the topic to see why and how exactly it solves many of the challenges synchronous communication introduces). For now, before we jump on the bandwagon, let's close by looking at the issues and challenges asynchronous communication brings alongside the benefits:
- Distributed transactions – Because of the networking layer of inter-service communication, the atomicity of DB operations cannot be enforced by a DB alone. You might need to implement an additional abstraction layer to enforce it, which is not a trivial task: a two-phase commit protocol can cause performance bottlenecks (or even deadlocks!), and the Saga pattern is pretty complicated – so data consistency issues are pretty common. Note that this is not a networking issue per se, and, strictly speaking, is also relevant for synchronous communication.
- Message queue TCO – Message queues are not easy to integrate with, test, configure, and maintain. Maintenance and configuration get much easier with managed solutions (e.g., AWS SQS and SNS), but then you may face budget and vendor lock-in issues.
- Observability – Understanding what’s going on in distributed, asynchronously communicating applications is hard. The three pillars of observability – logs, metrics, and traces – are extremely difficult to implement and manage in a way that will make sense for further debugging and monitoring (which is also true for synchronous communication, by the way).
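To make the first point above concrete, the Saga pattern replaces one atomic distributed transaction with a sequence of local steps, each paired with a compensating action that undoes it if a later step fails. A stripped-down sketch (the order-processing step names are invented for illustration):

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order; if a step fails,
    run the compensations for every completed step in reverse order."""
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for _, comp in reversed(completed):
                comp()   # roll back what already happened
            raise
        completed.append((action, compensate))

log = []

def fail_shipping():
    raise RuntimeError("shipping failed")

steps = [
    (lambda: log.append("reserve stock"), lambda: log.append("release stock")),
    (lambda: log.append("charge card"),   lambda: log.append("refund card")),
    (fail_shipping,                       lambda: log.append("cancel shipment")),
]

try:
    run_saga(steps)
except RuntimeError:
    pass

print(log)  # → ['reserve stock', 'charge card', 'refund card', 'release stock']
```

Even this toy version shows why sagas are tricky: compensations must themselves be reliable and idempotent, and the system is only eventually consistent while a saga is in flight.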
Asynchronous message queues are more of a solution than a problem, at least in comparison to the difficulties that come with synchronous communication. As for the issues they bring to the table: because of the inherent complexities, there are no drop-in silver-bullet solutions that will solve them for you; it takes careful design, implementation, and operation to ensure high reliability.
I hope this has been a good introduction to the somewhat intricate art of debugging microservices. The complexity presented here may seem to undermine the benefits achieved by using the pattern, so I want to end this article by mentioning that, as with all technology, there is no single correct answer. Evaluate the need, spec out what fulfilling it will entail, and make educated choices based on the situation. There are no silver bullets, only well-thought-out designs with advantages and disadvantages at every point along the way.