
Best Practices for Tracing and Debugging Microservices

When moving from a monolith to a distributed system, tracing and debugging can get really hard. Luckily, there are some things you can do to make it easier.

By Andrew Rivers · Feb. 16, 2017 · Opinion

In this article, I’ll be talking about some techniques for debugging and fault-finding in microservices architectures.

One of the issues we’ve had for a long time — in fact, ever since distributed computing became a thing — is debugging a single business process that runs across multiple machines at different times. When software was monolithic, we always had access to the full execution stack trace and we knew which machine we were running on. When we encountered an error, we could write all the information we needed to a log and later inspect it to see what went wrong.

When applications started running over small farms, we no longer knew exactly which machine we were running on; trawling the logs of the whole farm was an often-taken but highly inefficient route. With the advent of cloud computing, where compute instances are ephemeral, we cannot rely on machine logs still being there when we come to analyze log data. We need to get serious about how we manage our logging so that we have a decent chance of tracking down and fixing issues.

Here are my best practices for tracing and debugging your microservices.

1. Externalize and Centralize the Storage of Your Logs

The first thing you need to do is to treat the data integrity of your logs seriously. You cannot rely on retrieving information from a physical machine at a later date, especially in an auto-scaled virtual environment. You wouldn’t allow a design where important customer data was stored on a single virtual machine with no backup copy, so you shouldn’t do this for logs.

There’s no single right answer to how you store your log information, whether it’s in a database, on disk, or in an S3 bucket, but you need to be sure the storage you use is durable and available.

Tip: If you’re on AWS, you can always direct your output to CloudWatch. On Azure, consider using Application Insights. We will be going into more detail on logging in future articles, as getting this right is fundamental to success in making operationally supportable services.
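
As a concrete illustration of externalized storage, here is a minimal sketch that pushes a log event straight to CloudWatch Logs with the AWS SDK v2. The log group and stream names are hypothetical, and in most real services you would ship logs via an agent, a container log driver, or a logging-framework appender rather than calling the SDK directly.

import software.amazon.awssdk.services.cloudwatchlogs.CloudWatchLogsClient;
import software.amazon.awssdk.services.cloudwatchlogs.model.InputLogEvent;
import software.amazon.awssdk.services.cloudwatchlogs.model.PutLogEventsRequest;

import java.time.Instant;

public class DurableLogWriter {

    private final CloudWatchLogsClient client = CloudWatchLogsClient.create();

    // Writes a log line to an externally managed, durable store instead of
    // the local disk of an ephemeral instance. Assumes the log group and
    // stream already exist and a recent SDK version (no sequence token needed).
    public void write(String message) {
        InputLogEvent event = InputLogEvent.builder()
                .timestamp(Instant.now().toEpochMilli())
                .message(message)
                .build();

        client.putLogEvents(PutLogEventsRequest.builder()
                .logGroupName("/my-service/application")  // hypothetical group name
                .logStreamName("instance-logs")           // hypothetical stream name
                .logEvents(event)
                .build());
    }
}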

2. Log Structured Data

Many of the logging components available now will output a JSON document for each log entry instead of flat rows in a text file. This type of data structure is important because log processing and collation tools can easily process each record, and the information you provide in each record can be richer.
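
To make that concrete, here is a minimal sketch of one structured log record, hand-rolled with Jackson purely for illustration. In a real service you would normally let a JSON encoder for your logging framework (for example, logstash-logback-encoder for Logback) produce this shape for you; the field values shown are hypothetical.

import com.fasterxml.jackson.databind.ObjectMapper;

import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

public class StructuredLogExample {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void main(String[] args) throws Exception {
        // One log record as a set of named fields rather than a flat text row.
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("timestamp", Instant.now().toString());
        record.put("level", "INFO");
        record.put("service", "order-service");                              // hypothetical value
        record.put("correlationId", "7c2e0b4d-1234-4f6a-9c8e-2b5d1a3f0e77"); // ties in with practice #3
        record.put("message", "Order accepted");
        record.put("orderId", 1234);

        // Emits one JSON document per log entry, which collectors can parse directly.
        System.out.println(MAPPER.writeValueAsString(record));
    }
}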

3. Create and Pass a Correlation Identifier Through All Requests

When you receive an initial request that kicks off some processing, you need to create an identifier that you can trace all the way from that initial request through all subsequent processing. If you call out to other components or microservices in order to complete your processing, then you should pass the same correlation identifier each time.

Each of your subsequent components and microservices needs to use this identifier in its own logging so that you can collate a complete history of all of the work done to process a request.
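
Here is a minimal sketch of how this can look in a servlet-based Java service using SLF4J’s MDC. The Jakarta Servlet API, the X-Correlation-ID header name, and the class name are all assumptions; swap in javax.* imports or your own header name as needed.

import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import jakarta.servlet.http.HttpServletRequest;
import org.slf4j.MDC;

import java.io.IOException;
import java.util.UUID;

public class CorrelationIdFilter implements Filter {

    // Header name is an assumption; any name agreed across your services will do.
    static final String HEADER = "X-Correlation-ID";
    static final String MDC_KEY = "correlationId";

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest http = (HttpServletRequest) req;

        // Reuse the caller's identifier if one was passed in; otherwise start a new trace.
        String correlationId = http.getHeader(HEADER);
        if (correlationId == null || correlationId.isBlank()) {
            correlationId = UUID.randomUUID().toString();
        }

        // Every log statement written on this thread now carries the identifier,
        // provided your log pattern or JSON encoder includes the MDC field.
        MDC.put(MDC_KEY, correlationId);
        try {
            chain.doFilter(req, res);
        } finally {
            MDC.remove(MDC_KEY);
        }
        // When this service calls other microservices, copy the same value into
        // the outbound request's X-Correlation-ID header.
    }
}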

4. Return Your Identifier Back to Your Client

It’s all very interesting if you can trace the flow through your system for a given request, but not much fun if you get a support call and you have no idea which request you need to be looking at.

When your API client has made a request and initiated a process, you should return a transaction reference in the response headers. When an issue needs investigating, that transaction reference can be quoted with the support call, and your Ops team can use it to find the relevant information in the logs.
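
Assuming the filter sketched in the previous section, returning the reference is a two-line addition that echoes the identifier back on every response, success or failure.

import jakarta.servlet.http.HttpServletResponse;

// Added inside CorrelationIdFilter.doFilter(), before chain.doFilter() is called:
HttpServletResponse httpRes = (HttpServletResponse) res;
httpRes.setHeader(HEADER, correlationId);  // echo the identifier back to the caller

// The client or your error handler can then surface this value to the user,
// e.g. "Something went wrong. Please quote reference <correlation id> to support."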

5. Make Your Logs Searchable

It’s a good start to capture your log data, but you’re still left with a haystack of information even if you have the identifier of your needle. If you’re going to be serious about support, then you should be able to search and retrieve a collated and filtered set of information for a single request.

Even if your data isn’t initially being written into a data warehouse, you might want to look at offline processing that loads it into one. You’ve got plenty of options for this, such as the ELK stack or AWS Redshift. If you’re on the Azure stack, you could choose Application Insights, as mentioned above.
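
As an illustration of what “searchable” buys you, here is a minimal sketch that pulls every record for a single request out of an Elasticsearch index (the “E” in ELK) using only the JDK HTTP client. The index name (app-logs) and the field names and mappings (correlationId, a date-mapped timestamp) are assumptions about how your pipeline indexes the structured records from practice #2.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LogSearch {

    public static void main(String[] args) throws Exception {
        String correlationId = "7c2e0b4d-1234-4f6a-9c8e-2b5d1a3f0e77"; // hypothetical value

        // Match every log record written for one request, across all services,
        // assuming each record carries the correlationId field from practice #3.
        String query = """
                {
                  "query": { "match": { "correlationId": "%s" } },
                  "sort": [ { "timestamp": "asc" } ]
                }
                """.formatted(correlationId);

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/app-logs/_search")) // assumed index/host
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(query))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.body()); // the collated history for this one request
    }
}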

6. Allow Your Logging Level to Be Changed Dynamically

Most logging frameworks support multiple levels of detail. Typically, you’ll have error, warning, info, debug, and verbose as the default logging levels available to you, and your code should be instrumented appropriately. In production, you’ll probably log info and above, but if you have problems in specific components, then you should be able to change the tracing to debug or verbose in order to capture the required diagnostic information.

If you’re hitting problems, you should be able to change logging levels on the fly while your systems are running, diagnose the issue, and then return logging to normal afterward.
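
Here is a minimal sketch of what the switch itself can look like, assuming Logback is the SLF4J backend; the logger name passed in is hypothetical. In practice you would expose a method like this behind an authenticated admin endpoint (Spring Boot’s Actuator, for example, offers an equivalent loggers endpoint over HTTP).

import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.slf4j.LoggerFactory;

public class LogLevelAdmin {

    // Switch one component's logging to a more verbose level while the service
    // keeps running, then call it again with "INFO" once you have your diagnostics.
    public void setLevel(String loggerName, String level) {
        Logger logger = (Logger) LoggerFactory.getLogger(loggerName); // e.g. "com.example.payments"
        logger.setLevel(Level.toLevel(level));
    }
}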

Summary

Fault-finding in distributed microservices can be difficult if you don’t readily have access to the logs from all of the machines your code runs on. Virtualized cloud instances, which can disappear at any time, exacerbate the problem because you cannot go back to a machine instance later to see what has happened.

Planning how you manage logs and how you conduct fault-finding needs to happen at the design phase and be executed with the appropriate tooling and techniques. Once you’ve cracked this, supporting your systems will be a whole lot easier.


Published at DZone with permission of Andrew Rivers, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
