Chaos Mesh + SkyWalking: Better Observability for Chaos Engineering
This tutorial demonstrates how to combine SkyWalking and Chaos Mesh to better observe the effects of chaos experiments on applications’ service performance.
Join the DZone community and get the full member experience.Join For Free
Chaos Mesh is an open-source cloud-native chaos engineering platform. You can use Chaos Mesh to conveniently inject failures and simulate abnormalities that might occur in reality, so you can identify potential problems in your system. Chaos Mesh also offers a Chaos Dashboard which allows you to monitor the status of a chaos experiment. However, this dashboard cannot let you observe how the failures in the experiment impact the service performance of applications. This hinders us from further testing our systems and finding potential problems.
Apache SkyWalking is an open-source application performance monitor (APM), specially designed to monitor, track, and diagnose cloud-native, container-based distributed systems. It collects events that occur and then displays them on its dashboard, allowing you to observe directly the type and number of events that have occurred in your system and how different events impact the service performance.
When you use SkyWalking and Chaos Mesh together during chaos experiments, you can observe how different failures impact the service performance.
This tutorial will show you how to configure SkyWalking and Chaos Mesh. You’ll also learn how to leverage the two systems to monitor events and observe in real-time how chaos experiments impact applications’ service performance.
Before you start to use SkyWalking and Chaos Mesh, you have to:
- Set up a SkyWalking cluster according to the SkyWalking configuration guide.
- Deploy Chao Mesh using Helm.
- Install JMeter or other Java testing tools (to increase service loads).
- Configure SkyWalking and Chaos Mesh according to this guide if you just want to run a demo.
Step 1: Access the SkyWalking Cluster
After you install the SkyWalking cluster, you can access its user interface. However, no service is running at this point, so before you start monitoring, you have to add one and set the agents.
In this tutorial, we take Spring Boot, a lightweight microservice framework, as an example to build a simplified demo environment.
- Create a SkyWalking demo in Spring Boot by referring to this document.
- Execute the command
kubectl apply -f demo-deployment.yaml -n skywalkingto deploy the demo.
After you finish deployment, you can observe the real-time monitoring results at the SkyWalking UI.
Note: Spring Boot and SkyWalking have the same default port number: 8080. Be careful when you configure the port forwarding; otherwise, you may have port conflicts. For example, you can set Spring Boot’s port to 8079 by using a command like
kubectl port-forward svc/spring-boot-skywalking-demo 8079:8080 -n skywalking to avoid conflicts.
Step 2: Deploy SkyWalking Kubernetes Event Exporter
SkyWalking Kubernetes Event Exporter is able to watch, filter, and send Kubernetes events into the SkyWalking backend. SkyWalking then associates the events with the system metrics and displays an overview of when and how the metrics are affected by the events.
If you want to deploy SkyWalking Kubernetes Event Explorer with one line of commands, refer to this document to create configuration files in YAML format, and then customize the parameters in the filters and exporters. Now, you can use the command
kubectl apply to deploy SkyWalking Kubernetes Event Explorer.
Step 3: Use JMeter To Increase Service Loads
To better observe the change in service performance, you need to increase the service loads on Spring Boot. In this tutorial, we use JMeter, a widely adopted Java testing tool, to increase the service loads.
Perform a stress test on
localhost:8079 using JMeter, and add five threads to continuously increase the service loads.
Open the SkyWalking Dashboard. You can see that the access rate is 100%, and that the service loads reach about 5,300 calls per minute (CPM).
Step 4: Inject Failures via Chaos Mesh and Observe Results
After you finish the three steps above, you can use the Chaos Dashboard to simulate stress scenarios and observe the change in service performance during chaos experiments.
Chaos Mesh user interface
CPU Load: 10%; Memory Load: 128 MB
The first chaos experiment simulates low CPU usage. To display when a chaos experiment starts and ends, click the switching button on the right side of the dashboard. To identify whether the experiment is
Applied to the system or
Recovered from the system, move your cursor onto the short, green line.
During the time period between the two short, green lines, the service load decreases to 4,929 CPM, but returns to normal after the chaos experiment ends.
The service load variation under the first chaos condition
CPU Load: 50%; Memory Load: 128 MB
The service load variation under the second chaos condition
CPU Load: 100%; Memory Load: 128 MB
When the CPU usage is at 100%, the service load decreases to only 40% of what it would be if no chaos experiments were taking place.
Published at DZone with permission of Ningxuan Wang. See the original article here.
Opinions expressed by DZone contributors are their own.