Metrics and Logs Are Out, Distributed Tracing Is In
This post recaps my talk with Chinmay Gaikwad, the tech evangelist at Epsagon, about distributed tracing and observability for microservices architectures.
Check out the transcript and video from our conversation below!
Question: Can you talk a little bit about what you do, where you come from and what's Epsagon famous for?
Chinmay: I come from a software development background. I started my career at Intel as a software developer. I was there for a few years, then moved around roles. I jumped from software engineering to application platform engineering to technical marketing engineering. And then, I also did a bit of product marketing. And now I am a technical evangelist here at Epsagon.
I'm based out of New York. I love the city, and I think it's one of the best cities in the world.
Question: I also live close to New York, and when people ask me where I'm from, I usually say I'm from New York, even though I technically live in New Jersey. I can see the skyline from my house, so I can also call myself a New Yorker. Some people might disagree, but you know, haters are going to hate—New York state of mind.
Chinmay: Epsagon is an observability platform, and we help monitor applications running in Kubernetes and serverless environments. And Epsagon started as a distributed tracing-based observability platform. And we can talk more about that as we go further along. But that's what Epsagon does, essentially.
Observability Has Changed Over the Years
Question: These days, the observability and health of production systems are becoming more important. Engineers are the builders who want to make sure that the system stays up 24/7 and no one gets paged in the middle of the night.
When I was a consultant, we provided some production support, but they would only call me as a developer when something was really bad. We developed the code, threw it over the fence, and other people had to support it. We didn't care what they used for monitoring and alerts, or who would be on call. Over the last 10 years or so, this situation has changed. How have you seen this change over the years in your career?
Chinmay: Over the last 10 years, I have seen two kinds of changes. The first one is on the architecture side of things. Earlier, we had monolithic-based applications, and now we have microservices-based applications. That means we have smaller applications that are easier to manage, scale and deploy. So that's a huge shift that I've seen over the last 10 years.
And the other shift that I've seen is developers versus a DevOps culture. And as you mentioned, just developing the code and throwing it over was kind of a thing back then.
These days, you are responsible for your code, and you’re responsible for monitoring the health of your code. So most companies have adopted the DevOps culture these days.
How Does Epsagon Help With Observability Challenges?
Question: Kong is a cloud connectivity company, and we truly care about connecting applications and distributed systems through APIs. For us, it's important to provide this connectivity, like electricity or water or even air, for modern digital enterprises. And part of that is observability and monitoring to make sure that things are going smoothly.
Let's talk a little bit about Epsagon's product and how it helps solve the problems we just discussed.
Chinmay: Traditional observability platforms were focused on monitoring monolithic applications. What I mean by that is since the logic of the code was within the application itself, it was easier to monitor the applications. And they were also focused on infrastructure because back then, infrastructure was very varied. We had virtual machines, dedicated servers, on-premises and in the cloud. These days, applications have gained more importance, with microservices coming into the picture. And infrastructure is becoming standardized, so you generally use a particular cloud provider or use Kubernetes-based infrastructure.
Epsagon came into the picture when the microservices revolution was complete. Then we realized that there are metrics and logs, but those two are ineffective in troubleshooting applications in a microservices world.
What’s the Value of Distributed Tracing?
Question: Usually, we use logs only as a last resort. And metrics might show some current state, for example, if we see something is happening in production. But if we cannot reproduce the problem because it was caused by a spike, how can we use these metrics to investigate? We probably need to do something better to connect logs and metrics. What do you think?
Chinmay: Exactly. So that's where distributed tracing comes into the picture. Imagine you have an application that consists of thousands of microservices. For example, if a user clicks on a particular application, they’re touching a subset of those microservices, not all of them.
Distributed tracing will tell you the story behind a particular user action or a request.
And that's where the problem might lie. For example, say a user cannot put an item into the cart and gets an error. Distributed tracing helps you pinpoint where the problems are occurring and which microservices are responsible, and it connects the related metrics and logs.
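To make the "story behind a request" idea concrete, here is a minimal sketch in plain Python. It is not Epsagon's data model or API; the `Span` fields, the service names and the helper functions are all hypothetical and chosen only for illustration. The key assumption is that every unit of work records a shared trace ID plus a pointer to its parent, which is what lets a tool rebuild the path of one user action and find the service that failed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One unit of work done by one service on behalf of a request (hypothetical shape)."""
    trace_id: str             # shared by every span of the same user request
    span_id: str
    parent_id: Optional[str]  # None for the entry-point span
    service: str
    duration_ms: float
    error: Optional[str] = None


def failing_services(spans, trace_id):
    """Return the services that recorded an error within one trace."""
    return [s.service for s in spans if s.trace_id == trace_id and s.error]


def call_path(spans, trace_id):
    """Rebuild the call tree of one trace as (depth, service) pairs, parents first."""
    children = {}
    for s in spans:
        if s.trace_id == trace_id:
            children.setdefault(s.parent_id, []).append(s)

    path = []

    def visit(parent_id, depth):
        for s in children.get(parent_id, []):
            path.append((depth, s.service))
            visit(s.span_id, depth + 1)

    visit(None, 0)
    return path


# Two user requests; only the first one touches the cart and the database.
spans = [
    Span("t1", "a", None, "api-gateway", 120.0),
    Span("t1", "b", "a", "cart-service", 95.0),
    Span("t1", "c", "b", "inventory-db", 80.0, error="timeout"),
    Span("t2", "d", None, "api-gateway", 15.0),
]

for depth, service in call_path(spans, "t1"):
    print("  " * depth + service)
print("failed in:", failing_services(spans, "t1"))
```

Filtering by trace ID is what separates the failing "add to cart" request from the healthy one that passed through the same gateway.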
Important Metrics to Track With Distributed Tracing
Question: So we have logs, metrics and now traces. Together they give us the best of all worlds: logs provide the historical data, metrics show some immediate data and traces show how things happened over time.
Usually, when we talk about metrics, what’s important? How should people think about which metrics or traces are actionable? How do you think about this, and how does Epsagon help to think about those things?
Chinmay: We have a lot of customers who work in different industries, such as construction, fashion or music.
Question: That’s very representative of New York. You see a lot of construction, and you see fashion and music.
Chinmay: Yeah, exactly. So we get to work with a lot of cool customers.
You start at a business level with your service level agreements (SLAs), and you define a list of KPIs based on them. The KPI might be as simple as how long the website is up. And then, if you need to figure out what it takes to keep the website up, you ask which components are required to do that. From there, you can define which metrics are important.
For example, if you're using AWS DynamoDB as your backend database, then you need to figure out how much time each request takes to fetch data from DynamoDB, so that might be one of your important metrics. It depends on the case. Business-level metrics define engineering metrics, and the flow continues down from there.
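As a rough sketch of how a business-level KPI can flow down into an engineering metric, the Python below times a data-fetch call and checks the 95th-percentile latency against a target. Everything here is an assumption for illustration: `fetch_item` is a stub standing in for a database read (it does not call DynamoDB or any AWS SDK), and the 50 ms target is an arbitrary example, not a figure from the conversation.

```python
import statistics
import time


def timed(fetch, *args):
    """Call fetch(*args) and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fetch(*args)
    return result, (time.perf_counter() - start) * 1000.0


def p95(samples_ms):
    """95th-percentile latency; needs at least two samples."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(samples_ms, n=100)[94]


def meets_target(samples_ms, target_ms):
    """True when the p95 latency is within the target derived from the KPI."""
    return p95(samples_ms) <= target_ms


def fetch_item(key):
    """Hypothetical stub for a backend read; a real one would query the database."""
    return {"id": key}


# Collect latency samples for the engineering metric.
samples = []
for i in range(20):
    _, elapsed_ms = timed(fetch_item, f"item-{i}")
    samples.append(elapsed_ms)

print("p95 within 50 ms target:", meets_target(samples, target_ms=50.0))
```

A percentile is used rather than an average because a handful of slow requests can hurt users while barely moving the mean.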
Epsagon Use Case
Question: Can you talk about some of the most exciting use cases that you've worked on?
Chinmay: Most of those involved customers making a transition from monolith to microservices in recent years. It's a huge mindset shift, right? And generally, people don't jump directly to serverless functions. They take an intermediate step, maybe moving to containers first and running monolithic applications in those containers.
It becomes hard to monitor monolithic applications within containers. We can definitely do that, but the ideal case would be running microservices within containers.
One of our customers, for example, did this huge transition, and it required a lot of engineering resources, and they did not see the results. So they did not see the transition from monolith to microservices working to their benefit. They were seeing a lot of website downtime and a lot of database downtime, and they essentially did not know what was happening.
And they were also using Ruby, which is not the most commonly used language in the cloud these days. For us, it was a challenge to figure out what exactly was going on. We were trying to solve two things: the technical challenge of monitoring Ruby applications, and helping the customer understand that the transition from monoliths to microservices is an effective one and that they should go that route.
So we helped them observe their entire architecture and their platform. Because of us, that troubleshooting time was reduced by 25 percent, which was a huge number for them, and they kind of got convinced that microservices are the way to go, and then they stuck with Epsagon as well.
How to Instrument Epsagon
Question: Say we have two services, one written in Ruby, another PHP and maybe a third written in Java. What does it take to install an integration with Epsagon? Does it require some sort of agent, or do those agents need to be aware of the language they're writing? Or is there some sort of generic agent that monitors traffic, and you extract information from that?
Chinmay: That’s basically instrumentation, and it is language-specific. Epsagon supports a bunch of languages, such as Node, Go, Java, Python, etc.
We’ve seen a wave of no-code/low code applications or architecture coming up. And Epsagon supports that kind of motion. We believe in low code implementation.
So for you to instrument your application, it requires four or five lines of code, and then you're good to go.
Question: So they need to use a library? What's the impact of implementing these things with libraries in our application?
Chinmay: In the past, people generally did not care about the agents’ CPU or memory consumption because they were using on-premises data centers and stuff like that, so the cost was not much. But now, since you're running almost everything in the cloud, the cost of CPU and memory is definitely high. And Epsagon is cognizant of that. Once we instrument the application, the CPU or memory overhead is just about 1-2%. So it's not much, even though we don't do any sampling in the customer's environment.
Sampling is the biggest consumer of CPU and memory resources. We do sampling at our end, so it does not affect the customer's environment, and that’s something we were cognizant of when we designed our application.
Question: And what about the network? Because you either need to submit all requests or come up with some smart batching strategies. You don't want to batch for too long. In that case, the information would not be super real-time. It's a trade-off, I understand, but how do users feel about this in general, and what's your recommendation?
Chinmay: We did a very basic cost analysis for the network data that gets sent. For example, customers need to send metrics data to Epsagon. And the cost is essentially in the low hundreds at most. So it's not very significant, and it’s minimal to the customer compared to what observability brings to the entire application.
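The batching trade-off raised in the question can be sketched in a few lines of Python. This is a generic illustration, not how Epsagon ships data: the `BatchSender` class, its limits and the `send` callback are hypothetical. The idea is to buffer events and flush when either the batch is large enough (fewer network calls) or the oldest event has waited too long (fresher data).

```python
import time


class BatchSender:
    """Buffer events and flush when the batch is full or too old."""

    def __init__(self, send, max_items=100, max_age_s=5.0, clock=time.monotonic):
        self._send = send            # callback that ships one batch (a list)
        self._max_items = max_items  # larger batches mean fewer network calls
        self._max_age_s = max_age_s  # smaller ages mean fresher data
        self._clock = clock
        self._buffer = []
        self._oldest = None          # time the first buffered event arrived

    def add(self, event):
        if not self._buffer:
            self._oldest = self._clock()
        self._buffer.append(event)
        too_big = len(self._buffer) >= self._max_items
        too_old = self._clock() - self._oldest >= self._max_age_s
        if too_big or too_old:
            self.flush()

    def flush(self):
        """Send whatever is buffered; does nothing when the buffer is empty."""
        if self._buffer:
            self._send(list(self._buffer))
            self._buffer.clear()
            self._oldest = None


# Usage with a fake clock so the age limit can be shown without waiting.
sent = []
now = [0.0]
sender = BatchSender(sent.append, max_items=3, max_age_s=5.0, clock=lambda: now[0])

for event in ("a", "b", "c"):
    sender.add(event)   # third event fills the batch and triggers a flush
sender.add("d")
now[0] = 6.0            # pretend six seconds have passed
sender.add("e")         # the batch is now too old, so it is flushed
print(sent)
```

Note that this sketch only checks the age limit when a new event arrives; a production sender would likely also flush from a background timer and on shutdown.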
Monitoring Workloads in the Cloud
Question: Makes sense. You mentioned the cloud and that people are moving more and more workloads there. How do you approach the question of monitoring these?
Chinmay: Epsagon started with monitoring Lambda applications. Our complete focus when we started back in 2018 was Lambda applications. And all you need to do is essentially just integrate your AWS environment. And that's it, and you don't have to worry about anything else. Because we specialize in monitoring Lambda applications, we have a dashboard and a function screen, which essentially tells you if there are any exceptions. It's very serverless focused. And recently, we also started supporting a lot of Kubernetes and container applications. It's a combination of both.
Demo: Epsagon in Action
Check out Chinmay’s demo in the video below. I especially appreciated the estimated monthly cost Epsagon includes because it’s an important metric for modern developers.
Published at DZone with permission of Vik Gamov, DZone MVB. See the original article here.