New StackPod Episode: OpenTelemetry - the Future of Observability?

For our latest StackPod episode, we invited StackState senior engineer Melcom van Eeden to talk about OpenTelemetry: What is it and is it the future of observability?

Annerieke Kortier

Sep. 17, 22 · Interview

OpenTelemetry has been getting a lot of attention in the observability field. Moreover, in StackState’s latest release, we added support for OpenTelemetry traces. Melcom van Eeden, software developer at StackState, was one of our developer champions who made this possible. In addition to joining us on this episode of StackPod, he wrote a blog post on how to leverage OpenTelemetry with StackState and he recorded a tutorial video about the topic. Melcom is obviously very enthusiastic (and knowledgeable) about this technology. You can imagine we had to have Melcom on the StackPod to talk more about this “knight in shining armor,” as he calls it.

Melcom has been doing software engineering for a long time. He started in school, picking IT subjects and building what he likes to call “weird stuff.” After school, Melcom started working in IT and got introduced to different databases and programming languages. “I think that piqued my interest,” he says. “I wanted to learn more about different languages and technologies.”

He worked in different tech jobs, going from backend to frontend roles and vice versa and got introduced to cloud architecture systems - specifically AWS - a few years ago. Eventually, he joined StackState. “OpenTelemetry is a fairly new service and framework,” he said. “StackState implementing OpenTelemetry and bringing it into their product, that’s the one thing that made me join StackState. It was amazing.”

During the StackPod, Melcom and Anthony dive into all kinds of topics related to OpenTelemetry and what it does for observability, such as:

What does serverless mean and what does it have to do with OpenTelemetry?
What is OpenTelemetry exactly?
Why is OpenTelemetry such an innovative new standard?
How can OpenTelemetry help IT - not just DevOps and SRE teams, but also developers - in their work?
How can you leverage OpenTelemetry with StackState?

"As soon as you add OpenTelemetry to observability, you get five more relations, you get two red blocks, it looks like you created chaos. But it did not, it brought more insight into what you're currently running and what is potentially going wrong in your system," says Melcom. "It might look all green and nice on top, what is happening below? All those missing things are filled in by OpenTelemetry - it's really amazing."

Also, here are some useful links to articles and videos about OpenTelemetry if you want to learn even more about it:

How to Use OpenTelemetry to Troubleshoot a Serverless Environment with StackState (tutorial blog post by Melcom)
Troubleshoot a Serverless Environment with OpenTelemetry and StackState (tutorial video by Melcom)
Read more about StackState’s support for OpenTelemetry

Enjoy the episode!

Episode transcript:

Melcom: As soon as you add OpenTelemetry into it, you get five more relations, you get all of a sudden two red blocks, it looks like you created chaos, like OpenTelemetry broke things. But it did not. It brought more insight into what you're currently actually running and what is potentially going wrong into your system. While it might all look green and nice on top, what is actually happening down below. So all those missing relations and those missing things are pulled in by OpenTelemetry. It's truly amazing. When I got introduced to OpenTelemetry and I saw the capabilities that it brought and the insight that it brought, it was like: "wow, this is pretty cool to see, these actual errors happening inside of my code."

Annerieke: Hey there, and welcome to the StackPod. This is the podcast where we talk about all things related to observability because that's what we do and that's what we're passionate about, but also what it's like to work in the ever-changing, dynamic, tech industry. So if you are interested in that, you are definitely in the right place.

Annerieke: So today, we are going to talk about a subject that’s been getting a lot of attention in the observability field lately: OpenTelemetry. In StackState’s latest release, we added support for OpenTelemetry traces and Melcom van Eeden was one of our great developers who built this feature. He also wrote a blog about it, he created a tutorial video on how to use OpenTelemetry within StackState and is, as you will hear in a minute, very passionate about the subject. So, needless to say, Melcom is the perfect person to talk about OpenTelemetry. He and Anthony discuss topics like: what does ‘serverless’ have to do with OpenTelemetry? What is OpenTelemetry exactly? Why is it such an innovative new standard that can help a lot of people (from developers to DevOps and SRE teams)? OpenTelemetry is a topic we are very excited about and we are happy to share this episode with you. So, without further ado, let’s get into it, and enjoy the podcast.

Anthony: Hi everybody. Welcome back to the StackPod. My name's Anthony Evans, your regular host, and we are going to delve into a very interesting topic today. It's a very modern topic and a very modern technology enhancement, maybe even a problem that people have been dealing with: how do we deal with infrastructure as code, what is OpenTelemetry, how does StackState use it and how are we incorporating it into our observability platform? First and foremost, I'm going to just let Melcom here introduce himself. Melcom van Eeden is based out of South Africa and works for StackState. But maybe you can give a better intro of yourself than I can.

Melcom: Anthony, thanks. I've been doing software engineering for as long as I can remember. It actually goes back to school, picking subjects, picking IT, getting interested in it, and starting to build weird stuff. And then from there, that kind of, I think, piqued my interest in programming in general. When I started working at the first company, I think I started with ASP or something. And there I got introduced to a few basic things, a few databases, a few languages. And the second company I joined kind of introduced different languages and different tech to me. And I think that kind of opened up the window where I got a bit of experience, not on a singular thing, but on multiple things. It kind of piqued that interest, like, "Okay, cool. I want to learn more about different languages and different technologies." And that kind of spiraled out of control for a bit.

Melcom: The following companies, it would be different things that would be frontend, backend, frontend. Then it got to architecture and I got, basically not long after that, introduced to the cloud architecture systems. And AWS kind of growing and getting a name out there. I would say that was about, down here in South Africa, that's about five years ago, maybe. I think thereabout, where it actually took traction here and people actually started integrating it. Companies started to spin up startups basically, attempting to build an entire company out of serverless. And they'd spin up these various things for clients based on this. Then I eventually joined StackState.

Melcom: And I think the one thing that piqued my interest about StackState was that there is no ... This kind of relates to the things I just mentioned. There is no set tool, there is no set language that things are in. It is, for example, if there's any type of integrations, if there's any specific thing a client uses, if there's a specific language, it all depends on the client's needs, then it gets, okay, cool, let's focus on that. Let's build it. Let's build a solution on top of that. And I think that's the one thing that kind of consumed me with like, "Oh, wow, okay. This is pretty cool." Because whatever we see at that time, for example, OpenTelemetry now, which is a fairly new service and framework, now we have StackState implementing this and bringing it into their product. That's the one thing that kind of made me join StackState. It was amazing.

Anthony: Let's take a step back. And that's a good explanation. Thank you for that. I didn't mean to just side roll that back. Well, thank you very much. That's it. Thank you.

Melcom: Yeah. It's like, okay, time's up. Thank you very much. And that is Melcom.

Anthony: You brought up a point though. I think South Africa was not alone. About five years ago, there was a huge digital transformation push. I think for the last 10 years or so, people have been looking to get software as a service products. They've been looking to migrate towards the cloud. I think a lot of the time when people originally thought of migration, it was effectively a data center relocation, which is like, "Hey, I'm going to go from Equinix to Microsoft Azure, and now I'm going to be in the cloud and all my problems will be solved." But the main thing that I've always recognized that the cloud gives you is unlimited resources. That elasticity to be able to do what you want without having to wait six months for the rack to be delivered, installed, set up, whatever. You get the ability to deploy your applications and you can do it in whatever way you wish.

Anthony: But the one thing you're never limited on is resources, outside of the potential cost associated with the resource. But that genesis of movement has only really been adopted in the last five years or so. To your point, people have really started saying to themselves, "Okay, maybe I don't need a like-for-like Windows server in one environment over another. Maybe I don't need an EC2 environment continuously running only to serve a portion of code that's only needed every so often." That's where that thing called serverless infrastructure and code has really come into play. And can you do everybody a favor, as part of the Jeffrey Evans test, which is named after my dad? By the way, my dad was a mailman his entire life. If you can get him to understand what serverless infrastructure is, I think you should be teaching at Stanford. Go ahead. Could you explain serverless technology for everybody, and then also how you got introduced to serverless technology and how you saw the benefits of it?

Melcom: Cool. Yeah. Basically, with serverless, just to make a small distinction between the two things I'm going to mention. And ironically enough, both are called serverless. We have the serverless side of things, basically on the cloud, and then we have the serverless framework. Firstly, I'm just going to explain what serverless is on the cloud. As you kind of mentioned, for example, we used to spin up entire machines to run code. These machines may have, let's give an example, 10 gigs of memory and the code might take that limit, or it might run with 32 gigs of memory, but it only uses 50% of that. Now the problem was that power consumption on those systems, running those systems, and even buying those systems got very costly. Eventually what happened is, and I'm talking about a bit further back now.

Melcom: What eventually kind of happened is, people started to migrate some of their onsite servers, this is now before AWS, to providers that basically provided servers. Now it's like, "Okay, cool. We pay these providers X amount. They give us a server. If we want an upgrade, we just increase those costs and they give us a new server. But we now have to take all our things and migrate them over from the original server and so forth." Then AWS came into play, and this entire serverless infrastructure around that. How that basically now assists everyone in saving costs and spinning up things a bit easier is that AWS has this vast amount of resources behind them that are actively running. They were thinking, "Okay, cool. How can we allow a user to execute a program that they don't necessarily need an entire server for, on this resource stack we have?"

Melcom: And basically what they then provided is these allocations of processes as required. What that means is, I now, for example, press the execution button on my script, some of those resources are allocated for the script, it executes and at the very end, it basically says, "Okay, cool. This is the amount of memory," or whatever other parameters they're looking at, "that was consumed." And they only charge you for that. Where that advantage comes in, if we take something really small, let's say you have something that needs to send out a mail every day. That script runs once a day, maybe at 8:00 AM, and that's the entire job of it. Now, originally, you had to run the server 24/7, having that script on there, and you were going to pay for that entire server. But now you have that same script just sit there and execute at that time.

Melcom: It maybe takes 30 seconds to run. You're only paying for those 30 seconds. If you look at the amount of money you're saving based on that, it's quite enormous. And that's only mentioning scripts. I don't want to jump too in-depth with this, but with serverless, there are various technologies, and this includes machine learning and script execution, and queues, and all these things that are basically serverless. And you only pay for what you use. You don't pay for having the actual service up and running. And then to the second point that I mentioned, the serverless framework. This is a well-known framework that's basically now used to deploy CloudFormation Stacks. A CloudFormation Stack is basically a template that you can deploy in AWS. And AWS understands what it needs to deploy with that. What the serverless framework basically does is make it easier for developers to create these CloudFormation Stacks.

Melcom: Now developers can go, they can type, "Okay, cool. I want two scripts. I want a database. I want a machine that I can basically spin up some virtualization on, maybe." And as soon as they click the deploy button on their code, AWS takes this CloudFormation script that was generated from it, reads it, and then actually creates these resources on the fly. This allows the developers to now control resources. If we talk again about like five, six years ago, you had to physically go to a DevOps engineer, you had to ask them, "Can you spin these programs up for me?" It might take a week, maybe even a month for these services to actually be installed, set up, sorted correctly, various things like that. But now a developer can easily go type four or five lines of code and he has a database, and now he types another two lines, I think it is literally only three lines of code you need to spin up an entire queue. Whereas you can imagine spinning up RabbitMQ on bare metal gets quite complicated.

Melcom: Serverless is there to basically assist us in saving costs and on the developer side, assist the developers to control the systems that are being deployed to the cloud architectures.
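To give a feel for what such a generated template looks like, here is a minimal sketch that builds a CloudFormation template containing a single SQS queue and prints it as JSON. The logical ID `MailQueue`, the description text and the helper name `build_template` are made up for illustration; a real template produced by the serverless framework would contain more resources and settings than this.

```python
import json


def build_template() -> dict:
    """Return a minimal CloudFormation template as a plain dict."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative stack with a single queue",
        "Resources": {
            # "MailQueue" is a hypothetical logical ID chosen for this example.
            # An SQS queue needs no properties; AWS picks defaults and a name.
            "MailQueue": {"Type": "AWS::SQS::Queue"},
        },
    }


# Serialize the template the way it would be handed to CloudFormation.
template_json = json.dumps(build_template(), indent=2)
print(template_json)
```

The point of the sketch is the size: the queue itself is a few lines of declaration, and the cloud provider does the provisioning when the template is deployed.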

Anthony: What is the main advantage, if any, of serverless over microservices? Threw you a real teaser there.

Melcom: Yeah. The structure around serverless is broad. I'm presuming you're talking about Lambdas, which is the script solution for that. If we are looking at microservices and a Lambda script, I would tell you that a Lambda script should be a microservice in the sense that it should only do one task. It should do a one-to-one task. And that is it. It all depends. The general term around that word is basically the same on both of those. But I would say, the advantage comes in where the one kind of describes the technique of it. Cool. You can still run that on a bare metal machine. You can still run that somewhere else. You can even run that microservice that you wrote on a serverless infrastructure. The difference between those two things is in the sense that the one is the definition of, cool, this is serverless, and the other one is more about, cool, what is the job of the script? What is the execution cycle of it?

Anthony: Yeah. And I would also say as well that when it comes to things like Kubernetes and whatnot, you were mentioning earlier around cost saving, but at the end of the day, in order to support Kubernetes, you still need EC2 servers running to support the cluster. Which means that, even if the microservices aren't being called all the time, you're still going to have a minimum amount of resources associated to it. Whereas if you define a Lambda script, it only ever runs and so you are only ever paying when it actually runs. And we're talking like pennies.

Melcom: It's like 0.00003 or something per execution cycle.
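To put rough numbers on the "pennies" point, the comparison can be sketched as simple arithmetic for the once-a-day, 30-second mail script. Both rates below are hypothetical placeholders chosen for illustration (the per-second figure just reuses the ballpark number mentioned above); they are not actual AWS prices, and the function names are invented for this sketch.

```python
def always_on_cost(hourly_rate: float, days: int) -> float:
    """Cost of keeping a server running 24/7 for the given number of days."""
    return hourly_rate * 24 * days


def pay_per_use_cost(price_per_second: float, seconds_per_run: float,
                     runs_per_day: int, days: int) -> float:
    """Cost when only the actual execution time is billed."""
    return price_per_second * seconds_per_run * runs_per_day * days


# Hypothetical rates, NOT real pricing.
HYPOTHETICAL_SERVER_HOURLY_RATE = 0.05
HYPOTHETICAL_PRICE_PER_SECOND = 0.00003

server_cost = always_on_cost(HYPOTHETICAL_SERVER_HOURLY_RATE, days=30)
function_cost = pay_per_use_cost(
    HYPOTHETICAL_PRICE_PER_SECOND, seconds_per_run=30, runs_per_day=1, days=30
)

print(f"always-on server: {server_cost:.2f} per month")
print(f"pay-per-use script: {function_cost:.4f} per month")
```

Whatever the real rates are, the shape of the calculation is the same: the always-on cost scales with hours in the month, while the pay-per-use cost scales only with the seconds the script actually runs.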

Anthony: Yeah. And it doesn't require you to spin up a ton of resources. And I think that really takes advantage of what AWS is good for. Which is that farm or that data center of just compute power that you can take advantage of without having to get the physical servers stacked up, without even having to wait to deploy an EC2 server, without even needing an operating system, really. Because even when you're in Kubernetes, you still need a container that is effectively running in some kind of operating system and running code, even if it's slimmed down. Whereas, with Lambda scripts, you can just run Python. You can run Node.js, whatever you want in the context of a script without having ...

Melcom: I would also argue, there are a few more things around Lambdas that I feel give us an advantage. Not to jump into the terminology too deeply, but basically, when you run a script the first time it's called a cold start, and then if you continue to run it within a few seconds after that, it's basically a warm execution. Because it has already been executed. Now, the thing is, a cold start is always slower at the very start, whereas the consecutive calls after that are a lot faster than the cold start. Now Lambda, because it's serverless, brings in this advantage of keeping these scripts warm for us, having functionality around there to keep it warm, having this synchronous execution structure, limiting memory, and various things around how you want to approach this execution of the script. Now imagine saying, "Cool, here's your bare metal machine. Please do all of this for me."

Melcom: Go control memory around the script. Limit the script to 200 megs, do this, do that. That's a lot of work to do on that machine. Now you go and you upgrade your machine or your instance gets deleted. You need to set up all those things again. And again, you have all these various things you need to run, probably various services or more scripts just to limit these resources and what's being used. I also feel that the serverless capability brings us more convenience and more power around the limitations and restrictions we can actually apply on top of scripts.
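The cold versus warm distinction can be illustrated with a small sketch. It assumes the usual Python Lambda handler shape, `handler(event, context)`, and relies on the fact that module-level state survives between invocations while the execution environment stays warm. The `_create_client` helper and the 50 ms sleep are stand-ins invented for this example, not real setup work.

```python
import time

# Module-level state: created once per execution environment and reused
# by later (warm) invocations that land on the same environment.
_expensive_client = None


def _create_client() -> dict:
    """Hypothetical stand-in for slow setup (loading libraries, connections)."""
    time.sleep(0.05)
    return {"ready": True}


def handler(event, context=None):
    """Report whether this invocation had to pay the cold-start setup cost."""
    global _expensive_client
    cold = _expensive_client is None
    if cold:
        _expensive_client = _create_client()
    return {"cold_start": cold, "echo": event}


# Simulate two invocations hitting the same environment.
first = handler({"n": 1})
second = handler({"n": 2})
print(first["cold_start"], second["cold_start"])
```

Only the first call pays for the setup; the second reuses what is already in memory, which is why keeping a function warm makes the following calls faster.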

Anthony: Well, that brings us to a good point. Because with great power comes great responsibility. Because everybody likes to think that companies live in a bubble where everybody kind of just sits in the same cubicle area and all makes changes to the same thing. But now as we work from home, global organizations are all over the world. You may have people in India developing Lambda scripts or enhancing existing scripts based on bugs, based on whatever. And then you've got people in the States, South Africa, wherever it may be. It's a real global thing. And the true advantage is that you can just modify a script, literally on the fly. Out of the box, AWS doesn't come with approval workflows and whatnot. If you really have the authorization to do it, you can go in and change a script.

Anthony: You can do whatever you need to. And we're not really going to talk about security right now. The main reason why I'm bringing this up is because with changes, inevitably there potentially come problems, less so from a security standpoint, but maybe from a reliability standpoint and a dependency standpoint. If you've got all these scripts that are not only running independently, but depend on each other in certain functions, you're going to potentially create issues if you've got a global team working on these things in different time zones, or even if you're in the same time zone, it can still be a complicated thing. One of the mechanisms that has really come out is the OpenTelemetry framework, in order to at least keep track of what's going on within these different environments. And can you explain for people at home what OpenTelemetry was originally designed to do, what it does, and how it helps organizations today?

Melcom: Yeah. Firstly, I'll explain the problem that we have and then I'll go into OpenTelemetry. I would go to where it is and I would implement it. In regards to serverless, before we fully jump off serverless, if you think about the last point that I mentioned about developers now controlling these stacks, deploying various things, with developers controlling that, they are building the stacks, they are defining the variables, they are defining communications and they're saying, "Cool. This is how Lambda communicates with this service." There are obviously security-related things they are handling. And now the problem comes in where these serverless stacks are growing, and becoming bigger and bigger and bigger, especially if it's one company that focuses solely on a serverless stack, or if the company builds multiple templates for different companies, then you have multiple serverless stacks going on there.

Melcom: And to add on top of that, one serverless stack can be dependent on another one. Because it can actually consume resources from another one. Meaning that you can have multiple serverless stacks being dependent on each other. Now the nice thing is, if all of that works, it's great. Usually, you deploy it, it works, it's fine, everything runs, sends messages as expected. Now the problem occurs when things start to expire or when people start to update things that are not immediately used and not observed at that time. If we think of something that is maybe scheduled to run once a week, or if we are thinking about API keys, both of those can be a problem. For the script that executes once a week, something might have changed in a name that the developer didn't know they also had to change on, for example, a scheduled job.

Melcom: Now that script's never going to execute or the API that is basically being hit, the key expires maybe every 31 days. And the person that's added the key into the stack knew that, but maybe left the company and now 20 days after that time or thereabouts, the key expires. And nobody knows about that, because it's not observed or nobody generally knows. "Okay, cool, wait, something's wrong. I need to go look at that point," if the knowledge was maybe not shared. Now that is where OpenTelemetry comes into play. If we give a bit of background about OpenTelemetry, I think it was, let me just think. I think the year was 2016, either OpenCensus or OpenTracing was created. That was the first tracing framework that was delivered. Then I think the year was either 2020 or 2021, then Google, they created OpenCensus, which was another tracing framework that basically started providing these things. Basically, these instrumentations that anyone can create. They were competing with each other. There was a word for it. I can't remember.

Melcom: There was a funny sentence they used for this warfare. And they were basically competing. And then somewhere in the year, I can't remember the year, maybe it wasn't 2020, maybe 2018 for Google. And then 2019, they decided to merge these two products and not have this competition anymore. And these two products were merged into this project called OpenTelemetry. They took both of these projects, brought in the advantages of both, and created OpenTelemetry. And that I think is where the flair and the amazingness of OpenTelemetry comes in, because these two big companies basically went and brought both of their good solutions into this one product.

Melcom: What does OpenTelemetry do? We've been throwing around this word, but how does it basically assist us and what does it show? You have a script and this script basically communicates with various things. For example, the script communicates with an API. And the problem comes in, now the API key expires and the script keeps on communicating with this API. And it fails and maybe the Lambda script doesn't fail and nobody knows what's going on. Maybe a day or two after, someone looks in the logs or they get a complaint from the client, they're going through their logs. They see, "Okay, there's maybe a problem." Debug the script, maybe another day or two. We're talking about like four days of work being lost, potentially products being damaged, it all depends. And this is all because the communication inside your script was hidden, was unknown. It was this black box.

Melcom: Now, OpenTelemetry brings in the solution where, if you add this onto your script, you can basically instrument different parts of your script. And what I mean by different parts is different libraries. AWS is a library, Azure will be a library, various things like that. You can instrument these modules to now basically be observed. And what I mean by that is, if an execution happens, for example, this HTTP request going out to the API, when that happens, OpenTelemetry captures that request. Not the request going out, but the actual code that was written for that request. So it sees: cool, this piece of code is executing, let's create a piece of data about what just happened. And then it waits for the response. If the response comes back bad, good, whatever the response is, even if we don't get a response back, it basically adds the response on top of that data stack.

Melcom: Now, essentially we're sitting with information about what happened, and at the very end of your script, this information is sent off to the agent. And I'll get to the agent part in a second. Now, the advantage of this and the monitoring of, for example, the HTTP one, is that anyone can write libraries for these. We can have any company, open source, anyone can go write AWS things. They can instrument their own ones. They can have it however their company is structured. They can write their own instrumentations and have that data being sent off. And the nice thing about that is now you can start seeing when things are going wrong and where they're going wrong almost immediately because of the script's execution cycle. Now, one thing as a developer I can confidently say is that when something goes wrong in a piece of code, if I do an HTTP request, it all depends on how I as a developer handle that.

Melcom: When that fails and I have that catch, or I have that continue, do I go and write the correct logging? Do I kill the Lambda script? Do I let it continue? Does it exit gracefully? There are so many rules. And every developer is different. Everyone adheres to their own standards they got used to. It might look like your script executed successfully, but internally, something failed. And I think that's where OpenTelemetry is amazing, by bringing in that value add and actually showing you what happened in your script. Whether your script failed or succeeded, there might be things wrong in there, and it shines a light on what is wrong inside of your script.
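The "script looks successful but something failed inside" scenario can be sketched in a few lines. This is a simplified, standard-library-only illustration of what an instrumentation wrapper does; it is not the OpenTelemetry implementation. The `instrument` decorator, the `captured` list (standing in for the data sent to an agent) and the fake `call_api` function are all hypothetical names invented for this example.

```python
import functools
import time

# Stand-in for the telemetry that would be shipped off at the end of the script.
captured = []


def instrument(name):
    """Wrap a function so every call is recorded, whether it succeeds or fails."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            record = {"name": name, "start": time.time(), "status": "ok", "error": None}
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # Record the failure, then re-raise so behavior is unchanged.
                record["status"] = "error"
                record["error"] = repr(exc)
                raise
            finally:
                record["end"] = time.time()
                captured.append(record)
        return wrapper
    return decorator


@instrument("GET /orders")
def call_api(api_key):
    """Fake outgoing request that fails when the key has expired."""
    if api_key == "expired":
        raise PermissionError("401: API key expired")
    return {"orders": []}


def script():
    """A script that swallows the failure and still reports success."""
    try:
        call_api("expired")
    except PermissionError:
        pass  # no logging, no re-raise: the failure is invisible from outside
    return "success"


result = script()
print(result, captured[0]["status"], captured[0]["error"])
```

The script's own return value hides the problem, but the recorded data still shows that the outgoing call failed and why, independent of how the developer chose to handle the exception.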

Anthony: That's a really interesting look at things. Because, to your point earlier, one of the big problems you have right now is tribal knowledge. People leave the organization. And also, the stuff that we're talking about isn't really taught in schools, unless you specifically go and learn about cloud infrastructure. You may learn about the programming languages that are in Lambda scripts. You may know how to debug a piece of code on a case-by-case basis. But the whole architecture around it isn't predominantly taught. At least with people working in technology today, they usually have to go back to school to kind of learn about this new thing because, to your point, these technologies have been around less than 10 years. A lot of people who started in tech are going to be there with a very stringent kind of dependency.

Anthony: I don't want to use the word legacy, but we're talking about the VMwares, we're talking about the EMC, the Dell, the Oracles of the world who are very much rack and stack type technologies. Now we're on the other end of the spectrum. Where it's completely serverless, it's infrastructure as code, as opposed to any kind of physical rack that's going on anywhere. And OpenTelemetry has effectively been designed and brought into play to kind of help make sense of these scripts as they call from one script to the other. Whether it's an API, whether it's a function, whatever it is, that helps us kind of understand that, hey, if something starts failing, it doesn't have to be an access key. You may have always had a mainframe in your old environment and now it can't reach the mainframe because, hey, guess what? Financial services still need mainframes and a lot of other companies.

Melcom: There's one thing around OpenTelemetry which makes this amazing. We described the problem it solves, but why is OpenTelemetry such a mainstream thing now? Why does everyone talk about OpenTelemetry? Because I mean, there are products out there that do the same thing. And now I'm going to call them smaller products, seeing that OpenTelemetry is taking such a big leap in the way of traces and metrics and eventually logs. And I think the biggest advantage OpenTelemetry brings us, is that it generalizes things. And what I mean by that is, if you would imagine, you want to trace something in your environment, you will have to go, you will have to write this platform that basically takes in the time, it takes in the information you are capturing. You have to tell it when it's done. It needs to know how to put the things you captured in maybe a certain type of format, send this somewhere, capture that somewhere, read it again. There's a lot of things you have to do to create the system.

Melcom: And it might not be as flexible. Now, where OpenTelemetry comes into play is, it creates a standard and the libraries and tools to use the standard. Now the nice thing is, you can literally go, import these libraries, you write, again, five, six lines of code to start your span. You say at the very end you're done. And then as an environment variable, you just say where it goes to. And that is literally what you do on the one side. On the other side, you spin up a thing they call a collector. It basically collects the information that is sent to it. And you sit with it there. You can use it in your code, you can send it anywhere. And I think that is why OpenTelemetry is becoming such a big mainstream thing, because it is standardizing things. It's not bringing anything new. It is standardizing and being open source. Anyone can write anything. They can deploy their own libraries. People can pull it in, but it's always on the same standard. You always expect the same results.
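The "start a span, say you're done, point it somewhere with an environment variable" flow can be sketched with the standard library alone. This is a toy model, not the OpenTelemetry SDK: `start_span` and `outbox` are hypothetical names, and nothing is actually sent over the network. The environment variable `OTEL_EXPORTER_OTLP_ENDPOINT` is the standard OpenTelemetry setting for the export destination; the fallback URL is an assumption for local experimentation. In the real OpenTelemetry Python API the comparable call is `tracer.start_as_current_span(...)`.

```python
import os
import time
import uuid
from contextlib import contextmanager

# Where spans should go. Falls back to an assumed local collector address.
ENDPOINT = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318")

# Stand-in for the network call to a collector: finished spans are queued here.
outbox = []


@contextmanager
def start_span(name):
    """Open a span, let the caller attach attributes, and queue it when done."""
    span = {
        "trace_id": uuid.uuid4().hex,
        "name": name,
        "attributes": {},
        "start": time.time(),
    }
    try:
        yield span
    finally:
        # Leaving the with-block is the "I'm done" signal.
        span["end"] = time.time()
        outbox.append({"endpoint": ENDPOINT, "span": span})


with start_span("send-daily-mail") as span:
    span["attributes"]["recipients"] = 3  # data captured while the work runs

print(outbox[0]["endpoint"], outbox[0]["span"]["name"])
```

Because every span has the same shape regardless of who produced it, the receiving side (the collector) can accept data from any library or team that follows the standard.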

Anthony: Yeah. And also I think the fact that AWS X-Ray uses the OpenTelemetry standard as well helps. Now you've all of a sudden got a service that can read some level of OpenTelemetry as well makes total sense. You don't need to buy a monitoring tool. You don't need to install an engine. You don't have to do product specific instrumentation of your code. Allow those other products to just support OpenTelemetry, and then this standardized framework is just used across the board the same way networking protocols and IP protocols and all this other stuff. There are a million ways you could do everything. But then sooner or later, something gets standardized and that becomes the industry standard. And I'm a big fan of open source because it allows you to extrapolate data in all sorts of new and unique ways.

Anthony: You get full understanding of the code. And for a product like StackState, which is a very open platform, we have two advantages in that. We've got a really open API, which allows us to take in data, whether it's telemetry or topology, whatever it may be, to build out our observability capabilities. But then you also have our agent as well, which can do its own level of scripting and advancement in order to get things. And then we also have our huge partnership with AWS, which means that we're allowed to build out Lambda scripts ourselves. We're allowed to build out different components to help support things. And they're all technically validated by AWS and it's all just part of one big thing, as far as the customer sees, and they can start taking advantage by just simply adopting the OpenTelemetry standard, which, guess what, is the primary capability that people have adopted.

Anthony: Speaking of StackState though, we recently released our first version of support for OpenTelemetry, as part of our "turnkey" integrations, if you will, that just natively work as soon as you deploy our capabilities. What types of data are we currently pulling back out of the box, and how are we leveraging that to identify root cause or to help reduce the amount of time it takes for people to troubleshoot when API keys are missing and things like that occur?

Melcom: Yeah. The first approach we took for version one was to take a service everyone knows, and that we have been mentioning quite a few times now, which is AWS, and a few services from AWS. We decided to take the AWS instrumentation, basically bring that into the StackState solution and bring OpenTelemetry onto the following services: SNS, SQS, S3, Step Functions and Lambda functions. Those specifically, we are automatically monitoring and retrieving information about. Now, what I mean by automatic, and the advantage of how we are using these instrumentations, is that the only physical thing you require on your side to start seeing these OpenTelemetry things is the CloudFormation template that you install, which contains our AWS StackPack. And after that is deployed, you basically add this OpenTelemetry layer in; no code change is required at all. And that is it.

Melcom: As soon as you then start producing data, that data is automatically sent to the agent, the same agent that has the trace agent inside of it. And you will see that data start appearing on your UI. The advantage of this, and why we chose the AWS instrumentation, is that those little blocks you're writing to actually do your AWS call, for example, you're saying, "Read this bucket name with these values, so forth," we are actually reading that piece of code that you're executing there and we can determine, cool, this is the bucket you want to talk to. Now on the StackState side of things, with the data flowing through the agent, we are actually creating a component on the dashboard based on that bucket name. Now the advantage comes in: if you have existing buckets from your AWS framework, you will see a nice relation between that Lambda and that bucket, and you can see, okay, cool, this Lambda is actually talking to the existing buckets.

Melcom: It's doing something with it. Uploading files, deleting files, whatever it might be, but it has communication with this bucket. Now the advantage comes in, if you accidentally type the wrong bucket name, it still works. Now you have this Lambda script that will have a relation to an OpenTelemetry component inside of StackState that shows you this bucket name. The component is defined differently. The icon is different. It's noticeable. You can instantly see, okay, there's something else going on there. And now we get to the second point. Okay, cool. Now you want to see the root cause. How can you identify if there's a problem and so forth? Now the status of that AWS call is also propagated with this data.
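
Editor's sketch: the bucket matching Melcom describes can be pictured roughly as below. A traced S3 call names a bucket; if that name matches a bucket already known from AWS, the relation points at the existing component, and otherwise a visibly different trace-only component is created. This is only an illustration under assumed names, not StackState's actual implementation: `KNOWN_BUCKETS`, `relation_for_call`, and the component type strings are all hypothetical.

```python
# Hypothetical topology data: buckets already discovered through the AWS integration.
KNOWN_BUCKETS = {"iot-sensor-data", "iot-sensor-archive"}


def relation_for_call(lambda_name, bucket_name, known_buckets):
    """Describe the relation that a traced S3 call would add to the topology."""
    is_known = bucket_name in known_buckets
    return {
        "source": lambda_name,
        "target": bucket_name,
        # A bucket seen only in traces gets its own component type so it stands out.
        "target_type": "aws-s3-bucket" if is_known else "otel-only-component",
    }


# A correct bucket name links the Lambda to the existing bucket component.
good = relation_for_call("ingest-lambda", "iot-sensor-data", KNOWN_BUCKETS)

# A mistyped bucket name still produces a relation, but to a different kind of component.
typo = relation_for_call("ingest-lambda", "iot-sensor-dta", KNOWN_BUCKETS)

print(good["target_type"])
print(typo["target_type"])
```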

Melcom: What that means is, as soon as you see the incorrect icon there, after a few seconds, you will also see the block turning red and this actual critical state moving up, showing you, cool, this is where the actual root cause starts. You can start now obviously debugging and you might notice, okay, cool, the bucket name is incorrect, and you can change that. And after that, it will propagate and go to the correct bucket. And you'll have the correct links. Now, the second advantage also is, the bucket may exist, but let's say you have a restriction where you can't talk to the bucket. That is also monitored. It's not specific to these things I'm mentioning. It is specific to the request being generated as a whole. Let's say, for example, if you go back to the restriction example, if it's restricted and you are trying to write an object into a bucket but it's failing, that still causes an error.

Melcom: It still causes the block to turn red and to propagate up and you can see, cool, that is where the root cause starts. I believe that is where the greatest support comes from. Having OpenTelemetry now with these AWS services, you don't have to do any code changes, but it still shows you a problem. And there's this recent blog I actually wrote that gives you a bit more insight into, okay, cool, how to set up StackState, how to use StackState with OpenTelemetry. And inside of that, you have this IoT sensor scenario where it shows you the two examples we've been mentioning. It shows you an API key being incorrect, and it actually shows you a component that has the wrong name that I just mentioned. And what StackState actually shows you, what it generates and how to solve that problem.
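
Editor's sketch: the idea of a critical state "moving up" from a failing call can be illustrated with a tiny dependency graph. A component turns critical if it, or anything it depends on, is critical, and the root cause is the deepest critical component. This is a simplified illustration and not StackState's propagation logic; the component names, `DEPENDS_ON`, `OWN_STATE`, and both functions are hypothetical.

```python
# Hypothetical topology: each component maps to the components it depends on.
DEPENDS_ON = {
    "api-gateway": ["ingest-lambda"],
    "ingest-lambda": ["sensor-bucket", "alerts-topic"],
    "sensor-bucket": [],
    "alerts-topic": [],
}

# Health derived from traces: the write to the bucket failed, everything else is fine.
OWN_STATE = {
    "api-gateway": "ok",
    "ingest-lambda": "ok",
    "sensor-bucket": "critical",
    "alerts-topic": "ok",
}


def propagated_state(component):
    """A component is critical if it, or any of its dependencies, is critical."""
    if OWN_STATE[component] == "critical":
        return "critical"
    if any(propagated_state(dep) == "critical" for dep in DEPENDS_ON[component]):
        return "critical"
    return "ok"


def root_causes(component):
    """Return the deepest critical components reachable from `component`."""
    causes = []
    for dep in DEPENDS_ON[component]:
        causes.extend(root_causes(dep))
    # Only report this component when nothing below it explains the failure.
    if not causes and OWN_STATE[component] == "critical":
        causes.append(component)
    return causes


print(propagated_state("api-gateway"))
print(root_causes("api-gateway"))
```

In this toy graph the gateway and the Lambda both turn red, but the trail ends at the bucket, which matches the "this is where the root cause starts" reading of the dashboard.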

Melcom: It takes you step by step through it to actually show, okay, go change that, go look at that. Various steps to the very end to actually get the thing correct and working. And I think the most amazing part about going through those demo slides would be that there is a part where you can clearly see, cool, this is before OpenTelemetry, and this is after OpenTelemetry. And without OpenTelemetry, there's like five missing relations. Because physically, this Lambda is talking to these services. It looks like these blobs or services, maybe a few talking to each other, and everything looks nice and green. As soon as you add OpenTelemetry into it, you get five more relations. You get all of a sudden two red blocks. It looks like you created chaos, like OpenTelemetry broke things. But it did not.

Melcom: It brought more insight into what you're currently actually running and what is potentially going wrong in your system. While it might look all green and nice on top, what is actually happening down below? All those missing traces and those missing things filled in by OpenTelemetry, it's really amazing. When I got introduced to OpenTelemetry and I saw the capabilities that it brought and the insight that it brought, it was just like, "Whoa, okay. This is actually pretty cool to see." These actual errors happening inside of my code.

Anthony: Two things. First of all, I really like the passion that is obviously coming across. One of the key things that I'm always, and you mentioned this earlier, I always join internal company meetings a few minutes before, and I'm always kind of being out there kind of thing.

Melcom: Hyper.

Anthony: Yeah. But it's good. If you're really making a difference for your customers and you're selling or providing technology that actually makes a difference and is useful and isn't just shelfware, we should all be really excited. And that's why we all work here at StackState. Taking that passion and putting it into outcomes is incredibly important. If you just come in every day and you just act like it's your day job, and you've got a Jira backlog that you've got to get rid of, you're not going to resonate with the client. Whereas if you've got the mindset that you have, which is more, okay, OpenTelemetry, really passionate about it. Let me tell you where it's great, but then also where I can fill the gaps from an observability standpoint so that I can really help you with those details. Because that's really where we're at.

Anthony: We don't want to appeal to the people that necessarily create the Lambda scripts, because, hey, guess what? They can fix their own scripts if they really want to. And hey, guess what? Developers are super expensive. And once they develop one thing, they're usually onto the next and they don't want to necessarily have to revisit logging or boring stuff like that to improve the operational ability of everything. By providing a product that allows for innovation to continue and the way I like to view it, is effectively, we can remove some red tape from code development pipelines and CD/CI pipelines because if something does break, we have the best opportunity to be able to see it firsthand. As well as capture the exact change that happened in order to quickly and effectively resolve the issue. You were going to say something.

Melcom: I'm actually really excited to see ... The things we've actually still planned out for OpenTelemetry, the growth of OpenTelemetry. Because OpenTelemetry is still quite a new thing. If we look at logs, in the sense of like, cool, they're still working on logs. Tracing is released, but logs are something they're still looking for, metrics are in beta state. OpenTelemetry is this new standard, which obviously is now growing with StackState. Both of us are growing and implementing and seeing these amazing new things coming in. We focused on the AWS instrumentations, and for upcoming releases, we're actually trying to bring in a more generic approach. I mentioned earlier that anyone can create their own instrumentation library. It's massively open source. Anyone can create anything.

Melcom: If we can get to the point where we have this endpoint that basically says, cool, send me anything. I don't care what you send me. As long as you're following the OpenTelemetry standard, I'll show that for you on the StackState dashboard. That is what we are now moving to. We have these libraries that make it easy to implement. You don't need any code changes. But we also want to give the opportunity to all of those hardcore programmers that want to go like, okay, cool. Let me create my spans. Let me create everything from scratch. I want to create, I want to see it on StackState with OpenTelemetry. We want to give them that package. We want to give them that solution to say, okay, cool, follow this, and you can exactly do that. You can build your own structure as you want. And we will handle showing you root causes, if anything goes wrong and so forth.

Anthony: Yeah. A lot of customers turn around and they're like, "Oh, we're a long way off from adopting these types of things." But actually, some of the biggest financial services companies, some of the biggest 5G Telco companies, streaming companies, they're all adopting this serverless architecture because of that. But the challenge you have, everybody has had a dropped phone call that's either a container going down or a script failing an API call somewhere that meant you didn't get your handshake, which meant that your phone dropped and you have to redial the number in order to make that connection again. Or if you're buffering your streaming service or you can't get access to the catalog, or you can't even check out from the Microsoft Store for whatever reason, half of the time, it's always to do with either something that's related to containerized technology these days, or serverless technology and the complications that come with adopting these fast-moving technologies that provide the service and the speeds that we as consumers expect without having to provide overhead.

Anthony: One of the recent things that we did with one of our customers, is that they have a 100-node Kubernetes cluster that does nothing but process tax returns for three months of the year. They deployed one of our competitors and the CPU overhead went through the roof. They had to have 200 nodes just to support the observability platform. We took them back down to 100 with our DaemonSet and our AWS integration because through those, we just didn't need the overhead that an agent-based tool requires in order to actually get the data out of them. And for anybody who's listening and really wants to see everything that Melcom just explained in terms of visualization and seeing the S3 bucket, seeing the code, seeing the relationships between everything, you can simply go to our website and get a 14-day free trial, StackState.com.

Anthony: And within that 14-day trial, you get complete access to everything in our AWS and Kubernetes and OpenShift integrations that you can try out as much as you want over a 14-day period. We'd be happy to set that up because it is something that's really cool and really futuristic and it's really establishing us as the kings of the cloud in a way, or kings and queens of the cloud in this day and age. But we can help out in whatever way that we need to in order to help accommodate for these really complicated high-velocity environments. Well, we're out of time right now in terms of what we wanted to cover, and also what we've got allotted for today. Is there anything you want to leave everybody with, any kind of books that you would recommend that you've read recently, or any kind of information that people should go and check out, if they're interested?

Melcom: Honestly, obviously I'm going to mention OpenTelemetry. But in regards to links, it's very difficult. I would say that because there's so many open source things, there's so many various points to this, various languages all differ for OpenTelemetry. But I would definitely type in Google just something general like, what is OpenTelemetry, and get a deeper, broader explanation. They go into depth about languages and various things like that. Maybe click on a few links there. But it is a really awesome tool. It might look scary on top when you look at OpenTelemetry and you see all these links and various things.

Melcom: What are all these words, like collectors and exporters and all these words they're using? But honestly, it's such a lightweight tool. It's so great to implement and start playing around with. I would definitely recommend, if you're interested in any type of tracing or metrics at least, jump into OpenTelemetry and read a bit about it, and see the capabilities it brings for you. And maybe even how to create your own instrumentation, if you are interested up to that point. How can I read this piece of code? How do I create this instrumentation? I can just install and it just kind of works.

Anthony: Awesome. Yeah. That's everything we've got for today. Again, thanks Melcom for joining and sharing your expertise and your passion. I've really enjoyed it. I think you're another great example of why it's really fun to work at StackState. It's just full of passionate people, who understand technology and just want to make a difference and really drive the future. Whatever that may bring, regardless of what happens and just helping people, good people, good technology, and good outcomes. Well, actually great outcomes. Let's leave it with that as opposed to just good. Thank you again for your time and thank you for everybody listening. Catch us next time on the StackPod. Bye.

Melcom: Bye-bye.

Annerieke: Thank you so much for listening. We hope you enjoyed it. If you'd like more information about StackState, you can visit stackstate.com and you can also find a written transcript of this episode on our website. So if you prefer to read through what they've said, definitely head over there and also make sure to subscribe if you'd like to receive a notification whenever we launch a new episode. So, until next time...

Subscribe to the StackPod on Spotify or Apple Podcasts.

Useful links:

Read Melcom's blog post about OpenTelemetry
Read about StackState supporting OpenTelemetry trace


Published at DZone with permission of Annerieke Kortier. See the original article here.
