Micrometrics for CI/CD Pipelines
Learn about micrometrics for finding more performance issues and delivering better software to production, and how to integrate them with a CI/CD pipeline.
Continuous integration/continuous deployment (CI/CD) has become central to software development. To ensure high-quality software releases, smoke tests, regression tests, performance tests, static code analysis, and security scans are run in a CI/CD pipeline. Despite all these quality measures, OutOfMemoryError, CPU spikes, unresponsiveness, and degraded response times still surface in production environments.
These performance problems surface in production because, in CI/CD pipelines, only macro-level metrics such as static code quality metrics, test/code coverage, CPU utilization, memory consumption, and response time are studied. In many cases, these macro-metrics aren’t sufficient to uncover performance problems. Let’s review the micrometrics that should be studied in CI/CD pipelines to deliver high-quality releases to production. We will also learn how to source these micrometrics and integrate them into a CI/CD pipeline.
How Tsunamis Are Forecasted
You might wonder how tsunami forecasting is related to this article: there is a relationship! A normal sea wave travels at a speed of 5-60 miles per hour, whereas tsunami waves travel at a speed of 500-600 miles per hour. Even though a tsunami wave travels at 10 to 100 times the speed of normal waves, it’s very hard to forecast tsunami waves. Thus, modern-day technologies use micrometrics to forecast tsunami waves.
DART device to detect tsunamis.
To forecast a tsunami, multiple DART (Deep-ocean Assessment and Reporting of Tsunami) devices are installed throughout the world. Deep ocean water is about 6,000 meters in depth (roughly 20 times the height of San Francisco’s Salesforce Tower). Whenever the sea level rises more than 1 millimeter, the DART automatically detects it and transmits this information to a satellite. This 1-millimeter rise in seawater is a leading indicator of tsunami origination. Pause here for a second and visualize the length of 1 millimeter on the scale of a 6,000-meter sea depth. It’s negligible, but this micrometric analysis is what is used for forecasting tsunamis.
How to Forecast Performance "Tsunamis" Through Micrometrics
Similarly, there are a few micrometrics that you can monitor in your CI/CD pipeline. These micrometrics are leading indicators of several performance problems that you will face in production. A rise or drop in the value of any of these micrometrics is a strong indicator that a performance problem is originating.
Garbage Collection Throughput
Average GC pause time
Maximum GC pause time
Object creation rate
Peak heap size
Let’s study each micrometric in detail:
1. Garbage Collection Throughput
Garbage collection throughput is the percentage of time your application spends processing customer transactions, as opposed to the time it spends doing garbage collection.
Let’s say your application has been running for 60 minutes. In these 60 minutes, 2 minutes are spent on GC activities. This means the application has spent 3.33% of its time on GC activities (i.e., 2 / 60 * 100).
That means the garbage collection throughput is 96.67% (i.e. 100 – 3.33).
When there is a degradation in the GC throughput, it’s an indication of some sort of memory problem. Now, the question is, what is the acceptable throughput percentage? It depends on the application and business demands. Typically, one should target more than 98% throughput.
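The arithmetic above can be sketched as a small Java helper. This is only an illustration: the class and method names are made up for this example, and the 98% target is the rule of thumb from the text rather than a universal limit.

```java
final class GcThroughput {
    /** Percentage of run time NOT spent in garbage collection. */
    static double throughputPercent(double totalSeconds, double gcSeconds) {
        if (totalSeconds <= 0 || gcSeconds < 0 || gcSeconds > totalSeconds) {
            throw new IllegalArgumentException("invalid durations");
        }
        return 100.0 - (gcSeconds / totalSeconds) * 100.0;
    }

    public static void main(String[] args) {
        // 60-minute run with 2 minutes of GC activity, as in the example above.
        double throughput = throughputPercent(60 * 60, 2 * 60);
        System.out.printf("GC throughput: %.2f%%%n", throughput);
        // Compare against the 98% target suggested in the text.
        System.out.println(throughput >= 98.0 ? "meets target" : "below target");
    }
}
```

With the numbers from the example, the run falls below the 98% target, so a pipeline using this rule would flag it.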
2. Average Garbage Collection Pause Time
When a stop-the-world garbage collection event runs, the entire application pauses. Garbage collection has to mark every object in the application to see whether it is referenced by other objects; objects with no references have to be evicted from memory. Then, fragmented memory has to be compacted. To do all these operations, the application is paused, so customers will experience pauses/delays while garbage collection runs. Thus, one should always try to attain a low average GC pause time.
3. Max Garbage Collection Pause Time
Some garbage collection events might take a few milliseconds, whereas others might take several seconds or even minutes. You should measure the maximum garbage collection pause time to understand the worst possible impact on the customer. Proper tuning (and, if needed, application code changes) is needed to reduce the maximum garbage collection pause time.
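To see why both the average and the maximum pause are worth tracking, here is a minimal Java sketch. The pause durations are hypothetical values chosen for illustration; in a pipeline they would come from the analyzed GC log.

```java
import java.util.List;

final class GcPauseStats {
    /** Mean pause duration in milliseconds (0 when there are no pauses). */
    static double averageMillis(List<Double> pauses) {
        return pauses.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    /** Longest single pause in milliseconds (0 when there are no pauses). */
    static double maxMillis(List<Double> pauses) {
        return pauses.stream().mapToDouble(Double::doubleValue).max().orElse(0.0);
    }

    public static void main(String[] args) {
        // Hypothetical pause durations: four short pauses and one long one.
        List<Double> pauses = List.of(12.0, 8.0, 950.0, 15.0, 15.0);
        System.out.println("average pause (ms): " + averageMillis(pauses));
        System.out.println("maximum pause (ms): " + maxMillis(pauses));
    }
}
```

A single long pause dominates the maximum while being diluted in the average, which is why the two metrics are listed separately.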
4. Object Creation Rate
The object creation rate is the average amount of memory your application allocates for new objects per unit of time. Maybe, as of your previous code commit, the application was creating 100 MB per second, and starting from the most recent code commit, it creates 150 MB per second. This higher object creation rate can trigger a lot more GC activity, CPU spikes, potential OutOfMemoryError, and memory leaks when the application is running for a longer period.
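A commit-to-commit comparison like the one described can be sketched as follows. The helper names and the 10% tolerance are assumptions made for this example; a suitable tolerance depends on the application.

```java
final class AllocationRateCheck {
    /** Relative change from the baseline rate to the current rate, in percent. */
    static double percentChange(double baselineMbPerSec, double currentMbPerSec) {
        if (baselineMbPerSec <= 0) {
            throw new IllegalArgumentException("baseline must be positive");
        }
        return (currentMbPerSec - baselineMbPerSec) / baselineMbPerSec * 100.0;
    }

    /** True when the rate grew by more than the allowed tolerance. */
    static boolean regressed(double baselineMbPerSec, double currentMbPerSec,
                             double tolerancePercent) {
        return percentChange(baselineMbPerSec, currentMbPerSec) > tolerancePercent;
    }

    public static void main(String[] args) {
        // 100 MB/s on the previous commit, 150 MB/s on the latest one.
        System.out.println("change (%): " + percentChange(100.0, 150.0));
        // Hypothetical tolerance of 10% before the build is flagged.
        System.out.println("regressed: " + regressed(100.0, 150.0, 10.0));
    }
}
```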
5. Peak Heap Size
Peak heap size is the maximum amount of memory consumed by your application. If peak heap size goes beyond a limit, you must investigate it. There may be a memory leak in the application, or newly introduced code (or third-party libraries/frameworks) may be consuming a lot of memory. If the usage is legitimate, you will have to change your JVM arguments to allocate more memory.
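For a rough in-process cross-check, the JDK's standard `java.lang.management` API exposes peak usage per memory pool. The sketch below sums the peaks of all heap pools. That sum is an upper bound rather than the exact peak heap size, since the pools may not have peaked at the same moment. It also only reports on the JVM it runs in, so it would have to execute inside the application under test. The class name is made up for this example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

final class PeakHeap {
    /** Sum of the peak usage of every heap pool, in bytes. */
    static long peakHeapBytes() {
        long total = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip Metaspace, code cache, and other non-heap pools
            }
            MemoryUsage peak = pool.getPeakUsage();
            if (peak != null) { // null when the pool is no longer valid
                total += peak.getUsed();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("peak heap (MB): " + peakHeapBytes() / (1024 * 1024));
    }
}
```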
Garbage collection throughput, average GC pause time, maximum GC pause time, object creation rate, and peak heap size can all be sourced from garbage collection logs. As part of your CI/CD pipeline, you need to run a regression test suite or, ideally, a performance test. The garbage collection logs generated during the test should be passed to GCeasy’s REST API. This API analyzes garbage collection logs and responds with the above micrometrics. To learn where these micrometrics are sent in the API response and the JSON path expressions for them, refer to this article. If any value is breached, the build can be failed. GCeasy's REST API can also detect various other garbage collection problems, such as memory leaks, user time > sys + real time, sys time > user time, and invocation of System.gc() API calls. Any detected GC problems will be reported in the "problem" element of the API response. You might want to track this element as well.
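The "fail the build if any value is breached" step could look roughly like the Java sketch below. It assumes the micrometrics have already been extracted from the API response into a map. The metric names and limits here are hypothetical placeholders, not GCeasy's actual JSON field names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class MicrometricGate {
    /** Returns one message per breached limit; an empty list means the build may pass. */
    static List<String> breaches(Map<String, Double> measured,
                                 Map<String, Double> minimums,
                                 Map<String, Double> maximums) {
        List<String> problems = new ArrayList<>();
        minimums.forEach((name, min) -> {
            Double value = measured.get(name);
            // A missing metric is treated as a breach so it cannot pass silently.
            if (value == null || value < min) {
                problems.add(name + " = " + value + " (minimum " + min + ")");
            }
        });
        maximums.forEach((name, max) -> {
            Double value = measured.get(name);
            if (value == null || value > max) {
                problems.add(name + " = " + value + " (maximum " + max + ")");
            }
        });
        return problems;
    }

    public static void main(String[] args) {
        // Hypothetical metric names and values for illustration only.
        Map<String, Double> measured = Map.of(
                "gcThroughputPercent", 96.67,
                "avgPauseMillis", 200.0,
                "maxPauseMillis", 950.0);
        List<String> problems = breaches(measured,
                Map.of("gcThroughputPercent", 98.0),
                Map.of("avgPauseMillis", 250.0, "maxPauseMillis", 1000.0));
        problems.forEach(System.err::println);
        if (!problems.isEmpty()) {
            System.exit(1); // a non-zero exit code fails most CI steps
        }
    }
}
```

Keeping the limits in data rather than in code makes it easier to tune them per application, which matters because acceptable values depend on the application and business demands.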
6. Thread Count
Thread count is another key metric to monitor. If thread count goes beyond a certain limit, it can cause CPU and memory problems. Too many threads can cause "java.lang.OutOfMemoryError: unable to create new native thread" in a long-running production environment.
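As an illustration of the metric itself, the standard `ThreadMXBean` reports the live and peak thread counts of the JVM it runs in. The limit of 200 is an arbitrary placeholder, and the class name is made up for this sketch. To measure the application under test, the code would have to run inside that JVM.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

final class ThreadCountCheck {
    /** Number of live threads (daemon and non-daemon) in this JVM. */
    static int liveThreads() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    static boolean withinLimit(int liveThreads, int maxThreads) {
        return liveThreads <= maxThreads;
    }

    public static void main(String[] args) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        int live = bean.getThreadCount();
        System.out.println("live threads: " + live);
        System.out.println("peak threads: " + bean.getPeakThreadCount());
        // Hypothetical limit; choose one that fits the application.
        System.out.println("within limit: " + withinLimit(live, 200));
    }
}
```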
7. Thread States
Application threads can be in different thread states. To learn about various thread states, refer to this quick video clip. Too many threads in a RUNNABLE state can cause a CPU spike. Too many threads in a BLOCKED state can make the application unresponsive. If the number of threads in a particular thread state crosses a certain threshold, then you may consider generating appropriate alerts/warnings in the CI/CD report.
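Counting threads per state can be sketched with nothing more than `Thread.getState()`. The class name is made up for this example, and the snapshot covers only the JVM the code runs in.

```java
import java.util.EnumMap;
import java.util.Map;

final class ThreadStateHistogram {
    /** Counts how many of the given threads are in each Thread.State. */
    static Map<Thread.State, Integer> countByState(Iterable<Thread> threads) {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (Thread thread : threads) {
            counts.merge(thread.getState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Snapshot of all live threads in this JVM.
        Map<Thread.State, Integer> counts =
                countByState(Thread.getAllStackTraces().keySet());
        counts.forEach((state, count) -> System.out.println(state + ": " + count));
    }
}
```

A pipeline step could compare, say, the BLOCKED count against a threshold and emit a warning when it is exceeded.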
8. Thread Groups
A thread group represents a collection of threads performing similar tasks. There could be a servlet container thread group that processes all the incoming HTTP requests. There could be a JMS thread group, which handles all the JMS sending and receiving activity. There could be some other sensitive thread groups in the application as well. You might want to track the size of those sensitive thread groups. You don’t want their size to drop below a threshold or go beyond a threshold. Too few threads in a thread group can stall its activities. Too many threads can lead to memory and CPU problems.
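One simple way to approximate thread groups is to bucket threads by name after stripping the trailing counter, since pooled threads are commonly named with a shared prefix and a number. This heuristic is an assumption made for this sketch, not necessarily how any particular analysis tool groups threads. The thread names in the example are typical but hypothetical.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ThreadGroupSizes {
    /** Groups thread names by prefix, e.g. "pool-1-thread-3" becomes "pool-1-thread". */
    static Map<String, Integer> groupSizes(Collection<String> threadNames) {
        Map<String, Integer> sizes = new TreeMap<>();
        for (String name : threadNames) {
            // Drop a trailing number plus any separator characters before it.
            String group = name.replaceAll("[-_#\\s]*\\d+$", "");
            sizes.merge(group, 1, Integer::sum);
        }
        return sizes;
    }

    public static void main(String[] args) {
        List<String> names = List.of(
                "http-nio-8080-exec-1", "http-nio-8080-exec-2",
                "pool-1-thread-1", "main");
        groupSizes(names).forEach((group, size) ->
                System.out.println(group + ": " + size));
    }
}
```

Each group's size can then be checked against a lower and an upper bound, matching the two failure modes described above.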
The thread count, thread states, and thread groups micrometrics can be sourced from thread dumps. As part of your CI/CD pipeline, you need to run a regression test suite or, ideally, a performance test. Three thread dumps, taken 10 seconds apart, should be captured while the tests are running. Captured thread dumps should be passed to FastThread’s REST API. This API analyzes thread dumps and responds with the above micrometrics. To learn where these micrometrics are sent in the API response and the JSON path expressions for them, refer to this article. If any value is breached, the build can be failed. The FastThread REST API can also detect threading problems such as deadlocks, CPU-spiking threads, and prolonged blocking threads. Any detected problems will be reported in the "problem" element of the API response. You might want to track this element as well.
9. Wasted Memory
In the modern computing world, a lot of memory is wasted because of poor programming practices like duplicate object creation, duplicate string creation, inefficient collection implementations, sub-optimal data type definitions, and inefficient finalizations. The HeapHero API detects the amount of memory wasted due to all these inefficient programming practices. This can be a key metric to track. If the amount of wasted memory goes beyond a certain percentage, then the CI/CD build can be failed, or warnings can be generated.
10. Object Count
You might also want to track the total number of objects that are present in the application’s memory. The object count can spike because of inefficient code or the introduction of third-party libraries and frameworks. Too many objects can cause OutOfMemoryError, memory leaks, or CPU spikes in production.
11. Class Count
You might also want to track the total number of classes present in the application’s memory. Sometimes the class count can spike because of the introduction of third-party libraries and frameworks. A spike in the class count can cause problems in the Metaspace/PermGen region of memory.
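For a quick look at the class count without a heap dump, the standard `ClassLoadingMXBean` reports how many classes the current JVM has loaded. This is an in-process sketch with a made-up class name. It complements, rather than replaces, the heap dump analysis described in this article.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

final class ClassCountCheck {
    /** Number of classes currently loaded in this JVM. */
    static int loadedClasses() {
        return ManagementFactory.getClassLoadingMXBean().getLoadedClassCount();
    }

    public static void main(String[] args) {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        System.out.println("currently loaded: " + bean.getLoadedClassCount());
        System.out.println("loaded since start: " + bean.getTotalLoadedClassCount());
        System.out.println("unloaded: " + bean.getUnloadedClassCount());
    }
}
```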
Wasted memory size, object count, and class count micrometrics can be sourced from heap dumps. As part of your CI/CD pipeline, you need to run a regression test suite or, ideally, a performance test. Heap dumps should be captured after the test run is complete. Captured heap dumps should be passed to HeapHero’s REST API. This API analyzes heap dumps and responds with these micrometrics.
To learn where these micrometrics are sent in the API response and the JSON path expressions for them, refer to this article. If any value is breached, then the build can be failed. The HeapHero REST API can also detect memory problems such as memory leaks and object finalization. Any detected problems will be reported in the "problem" element of the API response. You might want to track this element as well.
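One way to capture the heap dump programmatically is sketched below. It assumes a HotSpot-based JVM, where `com.sun.management.HotSpotDiagnosticMXBean` is available, and it dumps the heap of the JVM it runs in, so it would have to be triggered inside the application under test. The class name and default file name are made up for this example.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Path;
import java.nio.file.Paths;

final class HeapDumper {
    /**
     * Writes a heap dump of this JVM to the target file.
     * The file must not already exist; recent JDKs also expect a ".hprof" suffix.
     */
    static void dump(Path target, boolean liveObjectsOnly) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            bean.dumpHeap(target.toString(), liveObjectsOnly);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical default file name; pass a path as the first argument to override it.
        Path target = Paths.get(args.length > 0 ? args[0] : "ci-heap.hprof");
        dump(target, true); // true = dump only objects that are still reachable
        System.out.println("heap dump written to " + target.toAbsolutePath());
    }
}
```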