How to Compare Core Dumps for Simple Time Travel Debugging
Use glibc's elf.h to open two dumps and compare PROGBITS sections with minimal code.
How can the difference between two Linux core dumps be identified, and why would this even come up? This article is lengthy, but it should answer both questions.
The Case for Comparing Core Dumps
Comparing two core dumps is only meaningful if they represent the same process at different points in time. If that's the case, they could be thought of as process snapshots. Consider an application that triggers a segmentation fault after a random uptime. If the root cause is suspected to be memory corruption and post-mortem debugging does not provide any hints, it would be helpful to go back in time to inspect the memory state before the fatal error.
In the best case, all of that should be done with minimal overhead because, in our thought experiment, the issue only occurs in production. Also, the actual memory locations of interest are unknown, so being able to visualize relevant memory changes before the fatal error would be desirable. A set of core dumps could provide simple, low-overhead time travel debugging with respect to process memory.
In most real-world debugging scenarios involving memory corruption, a memory diff would be too large to be useful. In specific cases involving mostly read-only memory and a limited set of debugging alternatives, going the diff route might just be what's needed to identify the constellation leading to the corruption. With all that talk about comparing two core dumps, how can this diff even be generated?
A core dump is represented by an ELF file that contains metadata and a specific set of memory regions (on Linux, this can be controlled via /proc/[pid]/coredump_filter) that were mapped into the given process at the time of dump creation.
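To make the filter a little more concrete, here is a minimal sketch that decodes the hexadecimal bitmask read from /proc/[pid]/coredump_filter. The bit meanings follow the core(5) manual page; the function names parse_coredump_filter and print_coredump_filter are hypothetical and only serve this illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Bit meanings as documented in core(5); index == bit number. */
static const char *const filter_bits[] = {
    "anonymous private mappings",
    "anonymous shared mappings",
    "file-backed private mappings",
    "file-backed shared mappings",
    "ELF headers",
    "private huge pages",
    "shared huge pages",
    "private DAX pages",
    "shared DAX pages",
};

/* Hypothetical helper: the proc file holds the mask as a hex string. */
unsigned long parse_coredump_filter(const char *text)
{
    return strtoul(text, NULL, 16);
}

/* Hypothetical helper: list which kinds of mappings would be dumped. */
void print_coredump_filter(unsigned long mask, FILE *out)
{
    for (size_t i = 0; i < sizeof filter_bits / sizeof filter_bits[0]; i++)
        if (mask & (1UL << i))
            fprintf(out, "bit %zu: %s\n", i, filter_bits[i]);
}
```

Feeding it the string read from the proc file of the process in question shows which kinds of mappings end up in the dump, and therefore which memory a later diff can cover at all.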
The obvious way to compare the dumps would be to compare a hex-representation:
The result is rarely useful because you're missing the context. More specifically, there's no straightforward way to get from the offset of a value change in the file to the offset corresponding to the process virtual memory address space.
So, more context is needed. The optimal output would be a list of VM addresses including before and after values.
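The missing link between a file offset and a virtual address is stored in the ELF headers of the dump. As a rough sketch, assuming a 64-bit dump whose program headers are already loaded into an array, the translation could look like this. The function name file_offset_to_vaddr is hypothetical; Elf64_Phdr, PT_LOAD, p_offset, p_filesz and p_vaddr come from elf.h.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: translate a file offset inside a 64-bit core dump
 * into the virtual address that byte was mapped at.
 * Returns 1 and stores the address in *vaddr on success,
 * 0 if no PT_LOAD segment covers the offset. */
int file_offset_to_vaddr(const Elf64_Phdr *phdrs, size_t count,
                         uint64_t offset, uint64_t *vaddr)
{
    for (size_t i = 0; i < count; i++) {
        const Elf64_Phdr *p = &phdrs[i];
        if (p->p_type != PT_LOAD)
            continue; /* notes and other metadata carry no process memory */
        if (offset >= p->p_offset && offset - p->p_offset < p->p_filesz) {
            *vaddr = p->p_vaddr + (offset - p->p_offset);
            return 1;
        }
    }
    return 0;
}
```

A plain hex diff skips exactly this step, which is why its offsets are so hard to relate to the process.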
Creating a Test Scenario
Before we can get on that, we need a test scenario to validate our comparison approach. The following sample includes a use-after-free memory issue that does not lead to a segmentation fault at first (a new allocation with the same size hides the issue). The idea here is to create a core dump using GDB (generate) during each phase based on breakpoints triggered by the code:
- dump1: Correct state
- dump2: Incorrect state, no segmentation fault
- dump3: Segmentation fault
The sample code:
Now, the dumps can be generated:
A quick manual inspection shows the relevant differences:
Based on that output, we can clearly see that *g_state changed but is still a valid pointer in dump2. In dump3, the pointer becomes invalid. Of course, we'd like to automate this comparison.
Knowing that a core dump is an ELF file, we can simply parse it and generate a diff ourselves. What we'll do:
- Open a dump
- Iterate over the PROGBITS sections of the dump
- Remember the data and address information
- Repeat the process with the second dump
- Compare the two data sets and print the diff
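The first three steps could be sketched roughly as follows. This is not the sample implementation discussed in this article, just a minimal illustration that assumes a 64-bit dump that is already loaded (or memory-mapped) into a buffer and that carries a section header table, which not every dump does. The struct region type and the function collect_progbits are hypothetical names.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record for one memory region found in a dump. */
struct region {
    uint64_t addr;       /* virtual address (sh_addr) */
    uint64_t size;       /* size in bytes (sh_size) */
    const uint8_t *data; /* points into the loaded file image */
};

/* Collect up to max PROGBITS sections from an ELF image held in memory.
 * Returns the number of regions stored, or -1 if the image is not a
 * 64-bit ELF file or its section header table is out of bounds. */
int collect_progbits(const uint8_t *image, size_t image_size,
                     struct region *out, int max)
{
    Elf64_Ehdr eh;
    if (image_size < sizeof eh)
        return -1;
    memcpy(&eh, image, sizeof eh);
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64)
        return -1;
    if (eh.e_shnum > 0 && eh.e_shentsize != sizeof(Elf64_Shdr))
        return -1;
    if (eh.e_shoff > image_size ||
        (image_size - eh.e_shoff) / sizeof(Elf64_Shdr) < eh.e_shnum)
        return -1;

    int n = 0;
    for (unsigned i = 0; i < eh.e_shnum && n < max; i++) {
        Elf64_Shdr sh;
        memcpy(&sh, image + eh.e_shoff + (size_t)i * sizeof sh, sizeof sh);
        if (sh.sh_type != SHT_PROGBITS)
            continue; /* only sections that carry memory contents */
        if (sh.sh_offset > image_size || sh.sh_size > image_size - sh.sh_offset)
            continue; /* skip sections whose data is truncated */
        out[n].addr = sh.sh_addr;
        out[n].size = sh.sh_size;
        out[n].data = image + sh.sh_offset;
        n++;
    }
    return n;
}
```

Calling this once per dump yields two region lists that can then be matched by address and size before comparing their contents.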
With elf.h, it's relatively easy to parse ELF files. I created a sample implementation that compares two dumps and prints a diff similar to what you get when comparing two hexdump outputs using diff. The sample makes some assumptions (x86_64; mappings either match in terms of address and size or exist only in dump1 or only in dump2), omits most error handling, and always chooses a simple implementation approach for the sake of brevity.
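To illustrate the comparison step under the same assumption (two mappings with identical address and size), here is a minimal sketch that walks both buffers in 8-byte words and prints each changed word with its virtual address. The function name diff_region and the output format are hypothetical and differ from the sample implementation's hexdump-style output.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: compare two snapshots of the same mapping.
 * base is the virtual address of the mapping; before/after hold its
 * contents in the first and second dump. Prints one line per changed
 * 8-byte word and returns the number of changed words. Trailing bytes
 * that do not fill a whole word are ignored for brevity. */
size_t diff_region(uint64_t base, const uint8_t *before,
                   const uint8_t *after, size_t size, FILE *out)
{
    size_t changed = 0;
    for (size_t off = 0; off + 8 <= size; off += 8) {
        uint64_t a, b;
        memcpy(&a, before + off, sizeof a); /* memcpy avoids unaligned reads */
        memcpy(&b, after + off, sizeof b);
        if (a != b) {
            fprintf(out, "0x%016" PRIx64 ": 0x%016" PRIx64
                         " -> 0x%016" PRIx64 "\n", base + off, a, b);
            changed++;
        }
    }
    return changed;
}
```

Working on whole words keeps pointer changes on one output line, which suits the x86_64 assumption stated above.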
With the sample implementation, we can re-evaluate our scenario above. An excerpt from the first diff:
The diff shows that the value at address 0x602260 was changed.
The second diff with only the relevant offset:
The diff shows that the value at address 0x602260 was changed.
There you have it: a core dump diff. Now, whether or not that can prove to be useful depends on various factors, one being the timeframe between the two dumps and the activity that takes place within that window. A large diff will possibly be difficult to analyze, so the aim must be to minimize its size by choosing the diff window carefully.
The more context you have, the easier the analysis will turn out to be. For example, the relevant scope of the diff could be reduced by limiting it to addresses of the .bss sections of the executable or library to be debugged if changes in there are relevant to the debugging scenario.
Another approach to reduce the scope is to exclude changes to memory that is not referenced by the debugging subject. The relationship between arbitrary heap allocations and the executable or specific libraries is not immediately apparent. Based on the addresses of changes in your initial diff, you could search for pointers in the .bss sections of the executable or library right in the diff implementation. This does not take every possible reference into account (most notably indirect references from other allocations, as well as register and stack references of library-owned threads), but it's a start.
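A minimal sketch of such a pointer search, assuming the contents of a .bss section have already been extracted from the dump into a buffer and that pointers are stored as aligned 8-byte words (x86_64). The function name bss_references is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if any aligned 8-byte word in the .bss
 * image holds a value inside [target, target + len), i.e. looks like a
 * direct pointer into the changed memory range; returns 0 otherwise. */
int bss_references(const uint8_t *bss, size_t bss_size,
                   uint64_t target, uint64_t len)
{
    for (size_t off = 0; off + sizeof(uint64_t) <= bss_size;
         off += sizeof(uint64_t)) {
        uint64_t word;
        memcpy(&word, bss + off, sizeof word);
        if (word >= target && word - target < len)
            return 1;
    }
    return 0;
}
```

A changed range without any match could then be dropped from the diff output, with the caveat that an integer that happens to fall into the range counts as a match, too.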
Published at DZone with permission of George R. See the original article here.