Time-Travel Debugging Production Code
This article provides an overview of time-travel debugging and how it relates to debugging your production code execution.
Normally, when we use debuggers, we set a breakpoint on a line of code, we run our code, execution pauses on our breakpoint, we look at values of variables, and maybe the call stack, and then we manually step forward through our code's execution. In time-travel debugging, also known as reverse debugging, we can step backward as well as forward. This is powerful because debugging is an exercise in figuring out what happened: traditional debuggers are good at telling you what your program is doing right now, whereas time-travel debuggers let you see what happened. You can wind back to any line of code that is executed and see the full program state at any point in your program’s history.
History and Current State
It all started with Smalltalk-76, developed in 1976 at Xerox PARC. It had the ability to retrospectively inspect checkpointed places in execution. Around 1980, MIT added a "retrograde motion" command to its DDT debugger, which gave a limited ability to move backward through execution. In a 1995 paper, MIT researchers released ZStep 95, the first true reverse debugger, which recorded all operations as they were performed and supported stepping backward, reverting the system to the previous state. However, it was a research tool and not widely adopted outside academia.
ODB, the Omniscient Debugger, was a Java reverse debugger that was introduced in 2003, marking the first instance of time-travel debugging in a widely used programming language. GDB (perhaps the most well-known command-line debugger, used mostly with C/C++) added reverse debugging in 2009.
Now, time-travel debugging is available for many languages, platforms, and IDEs, including:
- WinDbg for Windows applications
- rr for C, C++, Rust, Go, and others on Linux
- Undo for C, C++, Java, Kotlin, Rust, and Go on Linux
- Various extensions (often rr- or Undo-based) for Visual Studio, VS Code, JetBrains IDEs, Emacs, etc.
There are three main approaches to implementing time-travel debugging:
- Record and replay: Record all non-deterministic inputs to a program during its execution. Then, during the debug phase, the program can be deterministically replayed using the recorded inputs in order to reconstruct any prior state.
- Snapshotting: Periodically take snapshots of a program's entire state. During debugging, the program can be rolled back to these saved states. This method can be memory-intensive because it involves storing the entire state of the program at multiple points in time.
- Instrumentation: Add extra code to the program that logs changes in its state. This extra code allows the debugger to step the program backward by reverting changes. However, this approach can significantly slow down the program's execution.
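The record-and-replay approach can be illustrated with a toy sketch. This is not how rr, Undo, or WinDbg are implemented (they work at the level of system calls and machine instructions); it only shows the idea that once every non-deterministic input is logged, re-running the same code against the log reconstructs the same states. The `Recorder`, `Replayer`, and `program` names are hypothetical and exist only for this example.

```python
import random
import time


class Recorder:
    """Performs non-deterministic calls for real and logs each result."""

    def __init__(self):
        self.log = []

    def call(self, fn, *args):
        result = fn(*args)
        self.log.append(result)
        return result


class Replayer:
    """Returns logged results in order instead of calling the real function."""

    def __init__(self, log):
        self._results = iter(log)

    def call(self, fn, *args):
        return next(self._results)


def program(env):
    """Deterministic apart from the values it obtains through env.call."""
    states = []  # the running total after each step
    total = 0
    for _ in range(5):
        total += env.call(random.randint, 1, 100)
        states.append(total)
    finished_at = env.call(time.time)
    return states, finished_at


# "Production" run: real randomness and real clock, with every input logged.
recorder = Recorder()
recorded = program(recorder)

# "Debug" run: same code, inputs served from the log, so every state matches.
replayed = program(Replayer(recorder.log))
```

Because the replayed run is identical to the recorded one, any earlier state (for example `replayed[0][2]`, the total after the third step) can be recovered by replaying up to that point, which is what lets a debugger appear to step backward.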
Time-Traveling in Production
Traditionally, running a debugger in prod doesn't make much sense. Sure, we could SSH into a prod machine and start the process that handles requests under a debugger with a breakpoint set, but once we hit the breakpoint, we're delaying responses to all current requests and unable to respond to new requests. Also, debugging non-trivial issues is an iterative process: we get a clue, we keep looking and find more clues, and discovering each clue typically means rerunning the program and reproducing the failure. So, instead of debugging in production, what we do is replicate on our dev machine whatever issue we're investigating, use a debugger locally (or, more often, add log statements), and re-run as many times as required to figure it out. Replicating takes time (in some cases a lot of time, and in some cases infinite time), so it would be really useful if we didn't have to.
While running traditional debuggers in prod doesn't make sense, time-travel debuggers can record a process execution on one machine and replay it on another machine. So we can record (or snapshot or instrument) production and replay it on our dev machine for debugging (depending on the tool, our machine may need to have the same CPU instruction set as prod). However, the recording step generally doesn't make sense to use in prod given its high overhead: if we set up recording and then have to use ten times as many servers to handle the same load, whoever pays our AWS bill will not be happy.
But there are a couple of scenarios in which it does make sense:
- Undo only slows down execution 2–5x, so while we don't want to leave it on just in case, we can turn it on temporarily on a subset of prod processes for hard-to-repro bugs until we have captured the bug happening, and then we turn it off.
- When we're already recording the execution of a program in the normal course of operation.
The rest of this post is about the second scenario, which is a way of running programs called durable execution.
First, a brief backstory. After Amazon (one of the first large adopters of microservices) decided that using message queues to communicate between services was not the way to go (hear the story first-hand here), they started using orchestration. Once they realized defining orchestration logic in YAML/JSON wasn't a good developer experience, they created AWS Simple Workflow Service to define logic in code. This technique of backing code by an orchestration engine is called durable execution, and it spread to Azure Durable Functions, Cadence (used at Uber for > 1,000 services), and Temporal (used by Stripe, Netflix, Datadog, Snap, Coinbase, and many more).
Durable execution runs code durably — recording each step in a database so that when anything fails, it can be retried from the same step. The machine running the function can even lose power before it gets to line 10, and another process is guaranteed to pick up executing at line 10, with all variables and threads intact. It does this with a form of record and replay: all input from the outside is recorded, so when the second process picks up the partially executed function, it can replay the code (in a side-effect–free manner) with the recorded input in order to get the code into the right state by line 10.
Durable execution's flavor of record and replay doesn't use high-overhead methods like software JIT binary translation, snapshotting, or instrumentation. It also doesn't require special hardware. It does require one constraint: durable code must be deterministic (i.e., given the same input, it must take the same code path). So, it can't do things that might have different results at different times, like use the network or disk. However, it can call other functions that are run normally ("volatile functions," as we like to call them), and while each step of those functions isn't persisted, the functions are automatically retried on transient failures (like a service being down).
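To see why the determinism constraint matters, here is a toy sketch of a replay check. It is not the API of Temporal or any other engine; `replay`, `NonDeterminismError`, and the workflow functions are hypothetical names, and a fixed sequence of coin flips stands in for a real non-deterministic source such as the network.

```python
class NonDeterminismError(Exception):
    """Raised when replayed code requests different steps than were recorded."""


def replay(workflow, recorded_steps):
    """Re-run workflow and check it requests the recorded steps in order."""
    requested = []
    workflow(requested.append)
    if requested != recorded_steps:
        raise NonDeterminismError(
            f"recorded {recorded_steps}, but replay requested {requested}"
        )
    return requested


def deterministic_workflow(step):
    # Same input, same code path: always the same two steps.
    step("charge_card")
    step("send_receipt")


# Stand-in for a non-deterministic source: a different answer on each run.
coin_flips = iter([True, False])


def flaky_workflow(step):
    # Branching on a value that changes between runs breaks replay.
    if next(coin_flips):
        step("charge_card")
    else:
        step("refund_card")


# Original executions: note which steps each workflow requested.
good_history = []
deterministic_workflow(good_history.append)
flaky_history = []
flaky_workflow(flaky_history.append)

# Replays: the deterministic workflow matches its history, the flaky one does not.
replay(deterministic_workflow, good_history)
try:
    replay(flaky_workflow, flaky_history)
    diverged = False
except NonDeterminismError:
    diverged = True
```

The flaky workflow took the "charge" branch when it first ran and the "refund" branch on replay, so the recorded history no longer describes what the code does. Routing the non-deterministic value through a recorded step instead would keep the two runs aligned.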
Only the steps that require interacting with the outside world (like calling a volatile function or calling sleep(30 days), which stores a timer in the database) are persisted. Their results are also persisted, so when you replay a durable function that died on line 10, if it previously called a volatile function on line 5 that returned "foo," then during replay "foo" is returned immediately (instead of the volatile function getting called again). While saving things to the database adds latency, Temporal supports extremely high throughput (tested up to a million recorded steps per second). In addition to function recoverability and automatic retries, durable execution comes with many more benefits, including extraordinary visibility into and debuggability of production.
Being able to debug any past production code is a huge step up from the other option (finding a bug, trying to repro locally, failing, turning on Undo recording in prod until it happens again, turning it off, then debugging locally). It's also a (sometimes necessary) step up from distributed tracing.
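The replay behavior described here can be sketched in a few lines. This is a simplified illustration, not Temporal's SDK: `Context`, `run_step`, `Crash`, and `volatile_lookup` are hypothetical names, a plain list stands in for the database, and the crash is simulated with a `crash_at` parameter.

```python
class Crash(Exception):
    """Stands in for the worker process dying mid-function."""


class Context:
    def __init__(self, history, crash_at=None):
        self.history = history    # persisted step results (list as a stand-in database)
        self.crash_at = crash_at  # simulate a failure just before this step index
        self._position = 0        # index of the next step in this run

    def run_step(self, fn, *args):
        if self._position < len(self.history):
            # Replay: return the recorded result without calling fn again.
            result = self.history[self._position]
        else:
            if self._position == self.crash_at:
                raise Crash()
            result = fn(*args)
            self.history.append(result)  # persist before moving on
        self._position += 1
        return result


calls = []  # tracks how many times the volatile function really ran


def volatile_lookup(key):
    calls.append(key)
    return "foo-" + key


def durable_function(ctx):
    first = ctx.run_step(volatile_lookup, "a")
    second = ctx.run_step(volatile_lookup, "b")
    return first + "," + second


history = []

# First worker: completes step 0, then dies before step 1.
try:
    durable_function(Context(history, crash_at=1))
except Crash:
    pass

# Second worker: replays step 0 from history, then executes step 1 for real.
result = durable_function(Context(history))
```

After both runs, `volatile_lookup` has been called once per step rather than twice for the first step, because the second worker received the first step's result from the history. The same history could be loaded on a dev machine and replayed under an ordinary debugger to reproduce the production execution.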
Published at DZone with permission of Loren Sands-Ramshaw. See the original article here.