How to Pinpoint and Fix Distributed Problems Across Microservices
While logical separation of APIs fosters parallel development of independent functions, complexity and interdependency become harder to manage.
Microservices are separated logically but are highly dependent on each other to deliver the expected functionality. This means performance problems become distributed across multiple services and can be difficult to trace.
Gone are the days of a developer being able to just run a debugger to find problem code — now multiple contributing factors affect the state of an application in microservices, as well as in Kubernetes environments. A testing harness that closely mirrors the production setup and incoming traffic has become a requirement for highly distributed microservices and containerized environments.
The challenge is introducing and testing realistic scenarios: ones that cover the variety of use cases actual customers will exercise through the APIs, and that generate enough load to properly stress-test the system.
A lack of observability prevents developers and test engineers from observing real-life user workflows and their impact on the underlying microservices. To make things more challenging, the underlying microservices are often used by multiple internal applications, while the infrastructure lives in public clouds or data centers. Each microservice could have its own technology stack with different performance characteristics.
Additionally, some extensions and tools used in production for scaling and observability are configured differently from the development environment. It’s common for cluster administrators to configure horizontal pod autoscaling (HPA) within the production cluster, for example. This automatically launches additional instances of the application. If the application is not designed to take advantage of HPA, it can fail miserably in production. To use auto-scaling techniques, services need to be completely stateless or employ a storage engine that supports mounting shared volumes.
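To make the HPA point concrete, here is a minimal sketch of a manifest a cluster administrator might apply in production but not in development; the deployment name and thresholds are hypothetical.

```yaml
# Hypothetical HPA for a stateless "checkout" deployment:
# scales between 2 and 10 replicas based on average CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```

If the `checkout` service keeps per-request state in local memory or on a local disk, the extra replicas this HPA launches will not behave like the single dev instance, which is exactly the production-only failure mode described above.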
Things to consider when attempting to pinpoint distributed problems across microservices include:
- The number of stacks.
- Differences in configuration of each stack.
- The characteristics of each stack hosted by different cloud providers.
- A lack of knowledge of the actual end-user workflows.
- The load that internal applications place on the microservices.
Goodbye UI Tests
In microservices scenarios — including Kubernetes — test engineers need to understand API behavior, instead of observing the corresponding workflow through the GUI. Otherwise, when a microservice and its API are deployed, there is no guarantee that code deployed by another team for a completely different microservice will not negatively impact the application. Simply hammering an isolated API with traffic is not a solution: individual performance often depends on interactions with an invisible network of internal and external API calls that can vary by time of day and come from different applications.
Finding and fixing distributed problems across microservices must involve automatically simulating an environment and its traffic connections and then rebuilding “mock” suites. Each microservice comes with its own characteristics in terms of input and output and performance. Some microservices are very robust when it comes to weathering latency on the input side, while others will respond either by adding lots of additional latency or by spitting out faulty responses to incoming API requests. This is especially problematic for dependencies on external API calls where performance could change by time of day based on customer demand or in a much more erratic manner.
Another issue with distributed applications arises from abruptly changing usage patterns. Maybe one developer changed their mind about limiting the data input to only the few required fields. They may now see tremendous upside in ingesting all fields, which changes both the payload size and the time it takes to serve a request.
An ideal test environment needs to simulate the myriad of interdependencies between microservices in a manner that reflects real-life usage patterns. The goal is not to break the application by loading it up with brute-force request patterns but to figure out its boundaries during real-life everyday use. Therefore, a proper testing environment needs the following:
1) Real-life usage patterns that do not rely on the assumed happy path but replay sets of actual workflows that can be triggered by other software or via human interaction through a GUI.
2) In addition to recording these workflows, the test framework needs to replicate the involved microservices, internal and external, and factor in their received load (average and peak) based on use by other applications.
3) The test environment needs to simulate the production environments using tools that plug into the organization's Kubernetes stack, mirroring specific configurations for its cloud setup, handling of application state and data connections, integration points with external services, and so on.
4) DevOps pipeline integration is critical to catch issues early in the development process. This can prevent architectural decisions that would make the boundaries between microservices brittle and therefore prone to failure. This is especially important in situations where developers create quick workarounds for problems that can be detected right before release time, but can become significant bottlenecks once in production.
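The first two requirements above — recording workflows and replaying them with realistic timing — can be sketched in a few lines. This is a minimal illustration, not any particular tool's API: `RecordedCall` and `replay` are hypothetical names, and `send` stands in for whatever actually issues the HTTP request.

```python
import time
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RecordedCall:
    offset_s: float   # seconds since the start of the recorded session
    method: str
    path: str
    body: str = ""

def replay(workflow: List[RecordedCall],
           send: Callable[[RecordedCall], object],
           speedup: float = 1.0) -> list:
    """Replay a recorded workflow, preserving the original
    inter-request timing (optionally compressed by `speedup`)."""
    responses = []
    start = time.monotonic()
    for call in workflow:
        # Wait until this request is "due" relative to the recording,
        # so the replayed traffic keeps its real-life pacing.
        due = call.offset_s / speedup
        delay = due - (time.monotonic() - start)
        if delay > 0:
            time.sleep(delay)
        responses.append(send(call))
    return responses
```

The key design point is that pacing comes from the recording itself, not from a synthetic load script, which is what distinguishes this from brute-force request hammering.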
A proper testing harness that meets all of the above criteria should allow the DevOps team to continuously stress-test their APIs using real-world traffic without the need for complicated scripting. This can be achieved with Jenkins — and the proper testing-harness platform — such that a simple kubectl command can apply a specific YAML file from a repository into a quality-assurance (QA) namespace. This quick operation allows the application to be deployed into a simulation environment, where it will encounter the same data and dependencies as if it were in production.
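The "simple kubectl command" step might look like the following; the `qa` namespace, deployment name, and image are all hypothetical placeholders for whatever the organization actually uses.

```yaml
# Applied with: kubectl apply -f qa-deploy.yaml -n qa
# All names here (qa namespace, checkout service, image tag) are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
  namespace: qa
spec:
  replicas: 1
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: registry.example.com/checkout:candidate
          ports:
            - containerPort: 8080
```

Because the QA namespace already contains the mocked dependencies and recorded traffic, this one `kubectl apply` is all a Jenkins stage needs to run to put the candidate build under production-like conditions.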
In addition to taking into account all of the complexities of the working microservices environment, the test harness must be highly assertive. It should push the application or system with extreme loads — even stopping or deleting services — to see how it will hold up once deployed.
Following the completion of such a test during any stage of the CI/CD process, the following metrics should be provided:
- Response Time (latency or duration): how long a transaction takes to complete.
- Throughput (traffic or utilization or rate): how many transactions are happening over a unit of time (such as transactions per second or minute).
- Error Rate: a calculation of the rate of contractual/data response errors.
- Infrastructure-saturation metrics during stress-test replay of production-like traffic.
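Given a recording of the transactions observed during a test run, the first three metrics fall out of a short computation. The shape of `Txn` and the p95 choice are illustrative assumptions, not a prescribed report format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Txn:
    start_s: float     # when the transaction began (seconds)
    duration_s: float  # how long it took
    ok: bool           # True if the response met the contract

def summarize(txns: List[Txn]) -> dict:
    """Compute the core test-report metrics from recorded transactions:
    latency (p95), throughput, and error rate."""
    durations = sorted(t.duration_s for t in txns)
    # Observation window: first start to last start; avoid division by zero.
    window = (max(t.start_s for t in txns) - min(t.start_s for t in txns)) or 1.0
    p95 = durations[int(0.95 * (len(durations) - 1))]
    return {
        "latency_p95_s": p95,
        "throughput_tps": len(txns) / window,
        "error_rate": sum(1 for t in txns if not t.ok) / len(txns),
    }
```

Emitting these numbers at every CI/CD stage turns "the app feels slow in QA" into a trend line that can gate a release.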
Let’s take an e-commerce environment as an example. In a proper test harness for microservices, the application might be tested with a mock Stripe service or another payment system, along with other backend dependencies. A traffic generator container is then used to serve responses that the system under test will typically encounter.
A proper mock environment for microservices should also provide chaos testing. This might involve shutting down six out of a dozen customer APIs or injecting latency into certain microservices' connections to payment systems such as Stripe and PayPal. For the responder container, it is possible to set the test harness so that external APIs from Stripe are unresponsive 20% of the time. When the connection times out, a chaos test should determine whether the appropriate error message is communicated, as opposed to the user interface displaying a generic 404 error page.
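The "unresponsive 20% of the time" behavior can be sketched as a thin wrapper around any mock backend. `FlakyMock` is a hypothetical illustration of the idea, not a feature of any specific responder container.

```python
import random

class FlakyMock:
    """Wraps a mock backend call and fails a configurable fraction of
    requests, simulating an unreliable external API (e.g. a payment
    provider whose calls time out 20% of the time)."""

    def __init__(self, failure_rate: float = 0.2, rng: random.Random = None):
        self.failure_rate = failure_rate
        # Injectable RNG so chaos runs can be made reproducible.
        self.rng = rng or random.Random()

    def call(self, handler, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            raise TimeoutError("mock upstream did not respond")
        return handler(*args, **kwargs)
```

In a chaos run, the mock Stripe endpoint would be wrapped this way, and the assertion is on the system under test: when `TimeoutError` propagates, the user should see a specific payment-failure message, never a generic 404 page.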