The 2021 DevOps Trend Everyone Is Missing
DevOps is going through the roof. DevOps is everything and everything is DevOps nowadays. But there's one DevOps trend that I'm seeing that everyone else is missing.
When I recently looked up 2021 DevOps trends, the various predictions everyone has been making gave the impression that DevOps is going through the roof. DevOps is everything, and everything is DevOps nowadays.
Here’s a partial list of DevOps trends that are exploding:
- Hybrid Deployments
- Resilience Testing
- Testing in Production
- Microservices (of course)
- Cloud-centric Infrastructure
- Edge Computing
- Infrastructure as Code
- Application Performance Monitoring (APM) Tools
- Hybrid Computing
- Feature Toggles
- And the list goes on…
But what struck me after reading all those articles is that not one of them saw non-intrusive Production Debugging making inroads as a standard component of the DevOps toolchain. This is the DevOps trend that I’m seeing.
What Is Non-Intrusive Production Debugging?
Let’s start with what it isn’t, and that’s the usual go-to tactic when trying to resolve errors in Production — Log files. That painful, iterative process of:
1. Nasty bug.
2. Darn, I don’t have enough data.
3. Let’s add some log entries.
4. Reproduce the bug.
5. Show me the logs.
6. Got it?
7. NO: go back to step 2.
8. YES (after several time-consuming iterations): Done.
There’s also the option of attaching a remote debugger directly to Production, but, usually, Ops won’t allow that. There are security concerns, and you don’t want to interrupt service when you pause execution at a breakpoint.
Non-intrusive Production Debugging follows the concept of observability tools. These are the APMs (Application Performance Monitoring tools) that display, slice, and dice logs, metrics, and traces. That’s what observability is all about: understanding what’s going on with your system (so you can resolve errors) without interrupting or interfering with its operation.
But when it comes to fixing bugs in Production, these tools don’t quite provide enough data. The furthest they’ll usually get is to show you where an exception was thrown, the stack trace, and some general metadata about the error scenario, such as the browser or operating system.
That’s not usually enough to fix bugs. Modern software architectures like microservices and serverless make things even more difficult. Imagine tracking down a bug that crashes a node in your Kubernetes cluster, and Kubernetes just spins up a new instance. Or a logical bug in a serverless function. By the time you get to debugging these issues, the evidence is gone.
Non-intrusive Production Debugging takes observability a step further and shows how your application is behaving, line-by-line at the code level. Even for microservices and serverless code. This is what we call code-level observability, and it complements the observability DevOps gets from APMs.
Why Do I Think This Is a Trend?
If you haven't guessed yet, I'm into non-intrusive Production Debugging. That's what my company builds, and we’re seeing a clear shift in our meetings with prospects and customers. A year ago, our counterparts leading the meetings were developers, albeit senior developers or development managers, but developers, nonetheless. DevOps engineers might have been in the room, but they took a back seat in the process and got involved when we started discussing how our software may or may not affect their Production systems. There were questions about security, performance, deployments, and more, but not much about how our Production Debugger was used or what it would do for them — at least, not from the DevOps people. They just viewed it as another tool that developers needed them to maintain in Production.
Over the last year, there has been a clear shift in focus. DevOps engineers sitting in the room are taking the front seat, asking a lot more questions, and starting to realize how effective Production Debugging can significantly impact DevOps KPIs, even if it’s engineers in the development teams actually doing the root cause analysis and typing out the code.
Why Are DevOps Engineers Getting Interested in Production Debugging?
Part of this realization stems from the fact that a Production Debugger can also run on pre-Production environments like QA and Staging. DevOps engineers know that shifting debugging left, so that bugs are fixed more quickly in QA and Staging as well as in Production, improves some standard DevOps KPIs:
- Fewer bugs are going to get past Staging (think lower Change Failure Rate, lower Defect Escape Rate, and higher MTBF).
- Production Debuggers that autonomously capture and display exceptions not only notify you of an error as soon as it occurs, but also record the complete error execution flow. That lets you understand very quickly whether you're dealing with a real bug or background noise, how severe it is, and what its impact is (think lower MTTD).
- Production Debuggers significantly reduce the time it takes to identify, analyze, and fix Production errors (think lower MTTR). When we start talking about this KPI, all those DevOps engineers sitting in the room really sit up because this one determines how long a service will be down when things go south in Production. And we all know that these things happen, even to the household names we can’t live without nowadays. Just take a look at downdetector.com, and you’ll see what I mean.
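To make the KPIs in the list above concrete, here is a minimal sketch of how they might be computed from incident and deployment records. The records, field names, and counts are invented for illustration, and definitions of MTTD and MTTR vary between teams; here I assume MTTD runs from fault start to detection and MTTR from detection to resolution.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records (all values are made up for this sketch).
incidents = [
    {
        "started": datetime(2021, 1, 4, 9, 0),
        "detected": datetime(2021, 1, 4, 9, 20),
        "resolved": datetime(2021, 1, 4, 11, 0),
    },
    {
        "started": datetime(2021, 1, 18, 14, 0),
        "detected": datetime(2021, 1, 18, 14, 10),
        "resolved": datetime(2021, 1, 18, 15, 0),
    },
]

# Hypothetical deployment counts for the same period.
deployments = 20
failed_deployments = 2


def minutes_between(start, end):
    """Elapsed time between two datetimes, in minutes."""
    return (end - start).total_seconds() / 60


# Mean Time To Detect: fault start until someone (or something) notices.
mttd_minutes = mean(minutes_between(i["started"], i["detected"]) for i in incidents)

# Mean Time To Resolve: detection until the fix is in place (assumed definition).
mttr_minutes = mean(minutes_between(i["detected"], i["resolved"]) for i in incidents)

# Change Failure Rate: share of deployments that caused a failure.
change_failure_rate = failed_deployments / deployments

print(f"MTTD: {mttd_minutes} min, MTTR: {mttr_minutes} min, CFR: {change_failure_rate:.0%}")
```

Tracking numbers like these before and after introducing a new tool is one way to check whether it actually moves the KPIs.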
DevOps engineers and SREs are realizing that a Production Debugger is more about monitoring and observability than just being a developer tool. It also sits very comfortably with DevOps principles of feedback and collaboration, bridging a chasm between DevOps and Development that still exists in many organizations. Through the Production Debugger, developers get the data they need directly from Production systems in order to resolve errors. The direct collaboration between DevOps and Development to resolve an error within the Production Debugger is a catalyst to getting bugs fixed and resolving Production incidents more quickly.
How Do Production Debuggers Work?
Debugging in Production is not a completely new concept. Remote debugging has been around for a while, allowing you to inject breakpoints at runtime, collect data at each breakpoint hit, and then immediately resume the process and move on. While this is an easy way to get Production data, it is intrusive and has a significant performance impact, and is, therefore, not widely used today.
Another method is snapshot debugging, where the debugger forks a copy of your process (using copy-on-write technology) at designated points in the code, and you debug by examining the copy. While this method lets you examine the whole memory footprint of the debugged process, it too is intrusive, placing a significant memory load on the running host, so the number of points where you take snapshots is limited.
Modern Production Debuggers use a third method: bytecode instrumentation. They add instrumentation to the bytecode that performs different functions, such as measuring performance, capturing application state, capturing exceptions, and more. This is what APMs have been doing for years. Production Debuggers just took it further. They use the same technology in Production and pre-Production environments, only the goal is to resolve bugs and logical errors rather than performance issues.
Since bytecode isn’t human-readable, let’s see what it might look like if we were adding instrumentation to source code.
For code that looks like this:
After instrumentation, it might look like this:
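As a purely hypothetical illustration, not the output of any particular tool, the sketch below shows a small function next to a hand-instrumented version of it. The `record` helper, the `captured` list, and the `apply_discount` function are all names I made up; a real Production Debugger would inject equivalent probes into the bytecode and ship the data somewhere else.

```python
import time

# Hypothetical in-memory sink; a real tool would send events to its own backend.
captured = []


def record(event, **data):
    """Hypothetical probe: store a snapshot of the data without pausing execution."""
    captured.append({"event": event, "time": time.time(), **data})


# Original code: plain business logic.
def apply_discount(price, percent):
    discount = price * percent / 100
    return price - discount


# What the same logic might look like with probes written out by hand.
def apply_discount_instrumented(price, percent):
    record("enter", function="apply_discount", price=price, percent=percent)
    try:
        discount = price * percent / 100
        record("line", function="apply_discount", discount=discount)
        result = price - discount
        record("exit", function="apply_discount", result=result)
        return result
    except Exception as exc:
        # Capture the failure, then re-raise so behavior is unchanged.
        record("exception", function="apply_discount", error=repr(exc))
        raise


apply_discount_instrumented(200, 10)
for entry in captured:
    print(entry["event"], {k: v for k, v in entry.items() if k not in ("event", "time")})
```

The point to notice is that the instrumented version returns exactly what the original does; the probes only observe and never pause the code.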
Debugging instrumented code does not come without challenges. Most modern Production Debuggers that use instrumentation require a precise correlation to the Git commit from which the binaries being debugged in Production were built. Matching up all the right source files to what is currently running in Production isn’t always trivial, and you also need to match up a collection of build and compilation settings. And then what do you do about third-party code? Some of the tools get around this by decompiling the Production code they’re debugging. This makes life a lot easier because it removes the requirement to match source files, and third-party code is decompiled together with your own code.
Why Non-intrusive Beats Intrusive
Remote debuggers are very intrusive in that they attach to the host application and put breakpoints in live running systems. Even if the application only breaks briefly for the remote debugger to collect data, there is still a significant risk to stability that many Production systems cannot tolerate. Similarly, the memory overheads that snapshot debuggers cause to running systems through the intrusive copy-on-write technology that they use run the risk of depleting the system’s memory. For example, Microsoft’s Snapshot Debugger defaults to a maximum of five snapshots per minute to avoid out-of-memory exceptions.
Modern Production Debuggers that use bytecode instrumentation to get Production data are completely non-intrusive to the host application. No breakpoints are placed in the live application, and there’s no strain on memory resources. While the extra code added by instrumentation does have to run like the rest of the code, the performance hit is negligible, if there is any at all.
What Can Production Debuggers Do With Instrumentation?
Much of what modern Production Debuggers do is based on non-breaking breakpoints (a.k.a. tracepoints). You specify the lines of code where you want the debugger to instrument the corresponding bytecode to extract data. You can actually do quite a lot with this:
- Dynamic logs: Log data from anywhere in your code, including the value of local variables and method parameters.
- Dynamic metrics: Like dynamic logs, you can measure different metrics based on application-level data you extract from local variables.
- Integrations: Anything you can measure at a non-breaking breakpoint/tracepoint can be propagated to a third-party application through an API. So, you can create Slack notifications or pipe dynamic logs and metrics data to an APM where you can further slice and dice the data, view it in beautiful graphs and charts, and create meaningful alerts.
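To give a feel for what a non-breaking breakpoint does, here is a minimal sketch in Python. It uses the standard `sys.settrace` hook rather than bytecode instrumentation, so it illustrates the idea only and says nothing about how any commercial tool is built (tracing hooks like this also carry far more overhead than targeted instrumentation). The `make_tracepoint` helper and the `checkout` function are hypothetical names.

```python
import sys


def make_tracepoint(func, line_offset, sink):
    """Build a trace function that copies locals into `sink` whenever `func`
    is about to execute the line `line_offset` lines below its `def` line."""
    code = func.__code__
    target_line = code.co_firstlineno + line_offset

    def local_trace(frame, event, arg):
        # "line" events fire just before a line runs; nothing is paused.
        if event == "line" and frame.f_lineno == target_line:
            sink.append(dict(frame.f_locals))
        return local_trace

    def global_trace(frame, event, arg):
        # Only trace calls into the function we care about.
        if event == "call" and frame.f_code is code:
            return local_trace
        return None

    return global_trace


def checkout(cart):
    total = sum(cart)
    tax = total * 0.1
    return total + tax


snapshots = []
sys.settrace(make_tracepoint(checkout, 3, snapshots))  # offset 3 is the return line
try:
    result = checkout([10, 20])
finally:
    sys.settrace(None)  # always remove the hook

print(result, snapshots)
```

Each snapshot is effectively a dynamic log entry: the local variables at a chosen line, captured while the function ran to completion as usual. From there, forwarding the data to a chat channel or an APM is an ordinary API call.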
In addition to doing cool things with non-breaking breakpoints/tracepoints, some Production Debuggers also do this:
- Capture exceptions: This is something that many APMs do already, but a Production Debugger will provide more information about the exception, including the values of local variables at the point where it was thrown.
- Time-travel recording: Some Production Debuggers capture not only exceptions but also the complete error execution flow leading to an exception along with application data at each step of the way. This enables line-by-line debugging of the exception, which is very similar to the debugging experience in development IDEs.
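The exception-capture idea can be sketched with nothing more than Python's traceback objects. This is a simplified illustration under my own assumptions, not a description of any product: `capture_exception`, `average`, `report`, and `run_and_capture` are hypothetical names, and a real time-travel recording would also keep the state at every step leading up to the failure rather than only the final state of each frame.

```python
def capture_exception(exc):
    """Walk the traceback and record each frame's function, line, and locals."""
    frames = []
    tb = exc.__traceback__
    while tb is not None:
        frame = tb.tb_frame
        frames.append(
            {
                "function": frame.f_code.co_name,
                "line": tb.tb_lineno,
                # repr() keeps the snapshot serializable and detached from live objects.
                "locals": {name: repr(value) for name, value in frame.f_locals.items()},
            }
        )
        tb = tb.tb_next
    return {"type": type(exc).__name__, "message": str(exc), "frames": frames}


def average(values):
    count = len(values)
    return sum(values) / count  # fails when the list is empty


def report(values):
    label = "mean"
    return f"{label}: {average(values)}"


def run_and_capture(values):
    try:
        return report(values)
    except ZeroDivisionError as exc:
        return capture_exception(exc)


snapshot = run_and_capture([])
for frame in snapshot["frames"]:
    print(frame["function"], frame["locals"])
```

Compared with a bare stack trace, the captured frames show that `count` was `0` inside `average`, which is usually the piece of evidence you would otherwise go back and add a log line for.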
Who Are the Main Players?
APMs and observability have been around for about ten years now, and there are many great enterprise products available. Modern Production Debugging tools that provide the code-level observability needed to get to the root cause of bugs are newer. Since I am from Ozcode myself (full disclosure), rather than running the risk of being partial in my description of the main modern Production Debugging tools on the market, I’ll leave it up to you to browse their respective websites and make your own evaluations.
Non-Intrusive Production Debugging Will Become an Integral Part of the DevOps Toolchain
Any new technology takes time before it becomes a standard line-item in an enterprise’s yearly budget. APMs are there already, and any enterprise worth its weight in software uses these tools to manage, monitor, and troubleshoot its Production systems. However, DevOps professionals now realize that APMs don’t provide enough data when recovering from a Production incident requires digging into the code line-by-line. Non-intrusive Production Debuggers have proven that when you provide code-level observability, dynamic logs and traces, and time-travel debugging, you can cut Production debugging time by up to 80%. And when the cost of downtime can get as high as $5,600 per minute, that translates into real enterprise dollars that DevOps professionals can’t ignore.
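As a back-of-the-envelope check on those two figures, the sketch below works through one made-up incident. The 60-minute baseline is my own assumption, and so is the simplification that downtime shrinks in proportion to debugging time; in practice, only part of an outage is spent debugging.

```python
COST_PER_MINUTE = 5_600  # upper-end downtime cost quoted above, in dollars
REDUCTION_PERCENT = 80   # the "up to 80%" figure quoted above


def downtime_cost(minutes):
    """Cost of an outage of the given length, in dollars."""
    return minutes * COST_PER_MINUTE


# Hypothetical incident: one hour of downtime spent tracking down a bug.
baseline_minutes = 60

# Integer arithmetic avoids floating-point noise in the percentage.
reduced_minutes = baseline_minutes * (100 - REDUCTION_PERCENT) // 100

baseline_cost = downtime_cost(baseline_minutes)
reduced_cost = downtime_cost(reduced_minutes)
savings = baseline_cost - reduced_cost

print(f"Baseline: {baseline_minutes} min, ${baseline_cost:,}")
print(f"Reduced:  {reduced_minutes} min, ${reduced_cost:,}")
print(f"Savings:  ${savings:,}")
```

Under these assumptions, the hour shrinks to 12 minutes and a single incident costs $67,200 instead of $336,000. Both inputs are best-case figures, so treat the result as an upper bound rather than a forecast.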
APMs are a given today. It won’t be long before the value of non-intrusive Production Debuggers makes its mark on an enterprise’s bottom line and becomes a given too. The DevOps revolution has brought Operations closer to developers. It’s time for that collaboration to take the next step and move into the sphere of debugging.