
4 Key Observability Metrics for Distributed Applications

What to watch with your cloud applications: in this post, we'll cover the areas your metrics should focus on to ensure you're not missing key insights.

By Michael Bogan · Jul. 12, 21 · Tutorial

A common architectural design pattern these days is to break up an application monolith into smaller microservices. Each microservice is then responsible for a specific aspect or feature of your app. For example, one microservice might be responsible for serving external API requests, while another might handle data fetching for your frontend. 

Designing a robust and fail-safe infrastructure in this way can be challenging; monitoring the operations of all these microservices together can be even harder. It's best not to simply rely on your application logs for an understanding of your systems' successes and errors. Setting up proper monitoring will provide you with a more complete picture, but it can be difficult to know where to start. In this post, we'll cover service areas your metrics should focus on to ensure you're not missing key insights.

Before Getting Started

We're going to make a few assumptions about your app setup. Don't worry—you don't need to use any specific framework to start tracking metrics. However, it does help to have a general understanding of the components involved. In other words, how you set up your observability tooling matters less than what you track. 

Since a sufficiently large set of microservices requires some level of coordination, we're going to assume you are using Kubernetes for orchestration. We're also assuming you have a time series database like Prometheus or InfluxDB for storing your metrics data. You might also need an ingress controller, such as the one Kong provides to control traffic flow, and a service mesh, such as Kuma, to better facilitate connections between services.

Before implementing any monitoring, it's essential to know how your services actually interact with one another. Writing out a document that identifies which services and features depend on one another and how availability issues would impact them can help you strategize around setting baseline numbers for what constitutes an appropriate threshold. 

Types of Metrics

You should be able to see data points from two perspectives: Impact Data and Causal Data. Impact Data represents information that identifies who is being impacted. For example, if there's a service interruption and responses slow down, Impact Data can help identify what percentage of your active users is affected. 

While Impact Data determines who is being affected, Causal Data identifies what is being affected and why. Kong Ingress, which can monitor network activity, can give us insight into Impact Data. Meanwhile, Kuma can collect and report Causal Data. 

Let's look at a few data sources and explore the differences between Impact Data and Causal Data that can be collected about them.

Latency

Latency is the amount of time it takes between a user performing an action and its final result. For example, if a user adds an item to their shopping cart, the latency would measure the time between the item addition and the moment the user sees a response that indicates its successful addition. If the service responsible for fulfilling this action degraded, the latency would increase, and without an immediate response, the user might wonder whether the site was working at all. 

To properly track latency in an Impact Data context, it's necessary to follow a single event throughout its entire lifetime. Sticking with our purchasing example, we might expect the full flow of an event to look like the following:

  • The customer clicks the "Add to Cart" button

  • The browser makes a server-side request, initiating the event

  • The server accepts the request

  • A database query ensures that the product is still in stock

  • The database response is parsed, a response is sent to the user, and the event is complete

To successfully follow this sequence, you should standardize on a naming pattern that identifies both what is happening and when it's happening, such as customer_purchase.initiate, customer_purchase.queried, customer_purchase.finalized, and so on. Depending on your programming language, you might be able to provide a function block or lambda to the metrics service:


Ruby

# Time this block and report the duration under the given metric name
# (statsd-style client, as in the original pseudocode).
statsd.timing('customer_purchase.initiate') do
  # ... perform the work for this stage
end


By providing specific names for each stage, you can home in on which segment of the event was slow when a latency issue arises.


Tracking latency in a Causal Data context requires you to track the speed of an event between services, not just the actions performed. In practice, this means timing service-to-service requests:


Ruby

# Nest timing blocks: the outer metric captures the whole operation,
# while the inner metric isolates the cross-service database call.
statsd.histogram('customer_purchase.initiate') do
  statsd.histogram('customer_purchase.external_database_query') do
    # ... query the external database
  end
end


This shouldn't be limited to capturing the overall endpoint request/response cycle. That sort of latency tracking is too broad and ought to be more granular. Suppose you have a microservice with an endpoint that makes internal database requests. In that case, you might want to time the moment the request was received, how long the query took, the moment the service sent its response, and the moment the originating client received that response. This way, you can pinpoint precisely how the services communicate with one another, as the sketch below illustrates.
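
As a rough sketch of that kind of granular timing, following the same statsd-style pseudocode as the examples above (the service and metric names here are illustrative, not taken from any real system):

Ruby

# Illustrative only: time each stage of handling one internal request so that
# a slow database query or slow response serialization stands out on its own.
statsd.histogram('order_service.handle_request') do
  statsd.histogram('order_service.db_query') do
    # ... run the internal database query
  end
  statsd.histogram('order_service.build_response') do
    # ... serialize the result and send it back to the calling service
  end
end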


Traffic

You want your application to be useful and popular—but an influx of users can be too much of a good thing if you're not prepared! Changes in site traffic can be difficult to predict. You might be able to serve user load on a day-to-day basis, but events (both expected and unexpected) can have unanticipated consequences. Is your eCommerce site running a weekend promotion? Did your site go viral because of some unexpected praise? Traffic variances can also be affected by geolocation. Perhaps users in Japan are experiencing traffic load in a way that users in France are not. You might think that your systems are working as intended, but all it takes is a massive influx of users to test that belief. If an event takes 200ms to complete, but your system can only process one event at a time, it might not seem like there's a problem—until the event queue is suddenly clogged up with work.

Similar to latency, it's useful to track the number of events being processed throughout the event's lifecycle to get a sense of any bottlenecks. For example, tracking the number of jobs in a queue, the number of HTTP requests completed per second, and the number of active users are good starting points for monitoring traffic.
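
A minimal sketch of those starting points, again in the pseudo-client style used earlier (the counter and gauge names, along with job_queue and active_session_count, are purely illustrative):

Ruby

# Impact-side traffic metrics: how much work is flowing through the system.
statsd.increment('web.requests.completed')           # one more HTTP request served
statsd.gauge('worker.queue.depth', job_queue.size)   # jobs currently waiting in the queue
statsd.gauge('users.active', active_session_count)   # users with an active session right now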

For Causal Data, monitoring traffic involves capturing how services transmit information to one another, similar to how we did it for latency. Your monitoring setup ought to track the number of requests to specific services, their response codes, their payload sizes, and so on—as much about the request and response cycle as necessary. When you need to investigate worsening performance, knowing which service is experiencing problems will help you track the possible source much sooner.
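
On the causal side, the same pseudo-client could tag each inter-service request with those details; the metric and tag names below are assumptions for illustration only:

Ruby

# Causal-side traffic metrics: which service was called and how it answered.
statsd.increment('service.requests',
                 tags: ['target:inventory_service',  # which downstream service was called
                        'status:200',                # its response code
                        'payload_kb:14'])            # approximate response size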

Error Rates

Tracking error rates is rather straightforward. Any 5xx (or even 4xx) issued as an HTTP response by your server should be tagged and counted. Even situations that you've accounted for, such as caught exceptions, should be monitored because they still represent a non-ideal state. These issues can act as warnings for deeper problems stemming from defensive coding that doesn't address actual problems. 

Kuma can capture the error codes and messages thrown by your service, but this represents only a portion of actionable data. For example, you can also capture the arguments which caused the error (in case a query was malformed), the database query issued (in case it timed out), the permissions of the acting user (in case they made an unauthorized attempt), and so on. In short, capturing the state of your service at the moment it produces an error can help you replicate the issue in your development and testing environments.
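
As an illustrative sketch in the same style (hypothetical metric and tag names), an error counter that carries that contextual state might look like this:

Ruby

# Count the error and attach enough context to reproduce it later.
statsd.increment('api.errors',
                 tags: ['status:500',           # HTTP status returned
                        'endpoint:/checkout',   # where it happened
                        'cause:db_timeout',     # caught exception or failure cause
                        'user_role:guest'])     # permissions of the acting user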

Saturation

You should track the memory usage, CPU utilization, disk reads/writes, and available storage of each of your microservices. If your resource usage regularly spikes during certain hours or operations or increases at a steady rate, this suggests you’re overutilizing your server. While your server may be running as expected, once again, an influx of traffic or other unforeseen occurrences can quickly topple it over.

Kong Ingress only monitors network activity, so it's not ideal for tracking saturation. However, there are many tools available for tracking this with Kubernetes.
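
For quick spot checks of saturation across a cluster, the built-in resource metrics commands are a reasonable starting point (this assumes the metrics-server add-on is installed in your cluster):

Shell

$ kubectl top nodes
$ kubectl top pods -n your-app-namespace

For trends over time, the same signals are available in Prometheus through cAdvisor metrics such as container_cpu_usage_seconds_total and container_memory_working_set_bytes, which the Prometheus setup described below typically scrapes out of the box.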

Implementing Monitoring and Observability

Up to now, we've discussed the kinds of metrics that will be important to track in your cloud application. Next, let’s dive into some specific steps you can take to implement this monitoring and observability.

Install Prometheus

Prometheus is the go-to standard for monitoring, an open-source system that is easy to install and integrate with your Kubernetes setup. Installation is especially simple if you use Helm.


First, we create a monitoring namespace:

$ kubectl create namespace monitoring

Next, we use Helm to install Prometheus. We make sure to add the Prometheus charts to Helm as well:

Shell

$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

$ helm repo add stable https://charts.helm.sh/stable

$ helm repo update

$ helm install -f https://bit.ly/2RgzDtg -n monitoring prometheus prometheus-community/prometheus


The values file referenced at https://bit.ly/2RgzDtg sets the data scrape interval for Prometheus to ten seconds.
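
That values file isn't reproduced here, but as a rough idea of the kind of override it contains, a ten-second scrape interval for the prometheus-community chart typically looks something like this (the exact structure can vary by chart version, so treat it as an assumption rather than the file's actual contents):

YAML

# Hypothetical values override: scrape every target every ten seconds.
server:
  global:
    scrape_interval: 10s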

Enable Prometheus Plugin in Kong

Assuming you are using the Kong Ingress Controller (KIC) for Kubernetes, your next step will be to create a custom resource, a KongClusterPlugin, which integrates with the KIC. Create a file called prometheus-plugin.yml:

YAML
 
apiVersion: configuration.konghq.com/v1
kind: KongClusterPlugin
metadata:
  name: prometheus
  annotations:
    kubernetes.io/ingress.class: kong
  labels:
    global: "true"
plugin: prometheus


Install Grafana

Grafana is an observability platform that provides excellent dashboards for visualization of data scraped by Prometheus. We use Helm to install Grafana as follows:

$ helm install grafana stable/grafana -n monitoring --values http://bit.ly/2FuFVfV

You can view the bit.ly URL in the above command to see the specific configuration values for Grafana that we provide upon installation.

Enable Port Forwarding

Now that Prometheus and Grafana are up and running in our Kubernetes cluster, we'll need access to their dashboards. For this article, we'll set up basic port forwarding to expose those services. This is a simple, but not very secure, way to get access; it is not advisable for production deployments.

Shell

$ POD_NAME=$(kubectl get pods --namespace monitoring -l "app=prometheus,component=server" -o jsonpath="{.items[0].metadata.name}")
$ kubectl --namespace monitoring port-forward $POD_NAME 9090 &

$ POD_NAME=$(kubectl get pods --namespace monitoring -l "app.kubernetes.io/instance=grafana" -o jsonpath="{.items[0].metadata.name}")
$ kubectl --namespace monitoring port-forward $POD_NAME 3000 &

The above two commands expose the Prometheus server on port 9090 and the Grafana dashboard on port 3000.

Those simple steps should be sufficient to get you up and running. With Kong Ingress Controller and its integrated Prometheus plugin, capturing metrics with Prometheus and visualizing them with Grafana are quick and simple to set up.

Conclusion

Whenever you need to investigate worsening performance, your Impact Data metrics can help orient you to the magnitude of the problem: they should tell you how many people are affected. Likewise, your Causal Data identifies what isn't working and why. The former points you to the plume of smoke, and the latter takes you to the fire.

In addition to all of the above, you should also consider the rate at which your metrics are changing. For example, say your traffic numbers are increasing. Observing how quickly those numbers are moving can help you determine when (or if) it'll become a problem. This is essential for managing upcoming work with regular deployments and changes to your services. It also establishes what an ideal performance metric should be.

Google wrote an entire book on site reliability, which is a must-read for any developer. If you're already running Kong alongside your clusters, plugins such as its Prometheus plugin integrate directly with Prometheus, which means less configuration on your part to monitor and store metrics for your services.
