Wayland Compositor Debugging in C++: Hunting Null Pointer Crashes in the Display Stack
Debugging a wlroots Wayland compositor crash on ARM Linux: tracing a suspend/resume null pointer bug with gdb, ASan, and lifecycle analysis.
The bug looked simple. Resume from sleep, the screen flashes, the display server segfaults. About one in twelve resumes. The device, a Yocto-based industrial Linux box, ARM64, running a custom Wayland compositor on top of wlroots, would log nothing useful, drop to a black screen, and require a reboot. The customer’s complaint was three lines long. The fix took eleven weeks.
This article is what I learned, in the order I learned it, debugging a null pointer dereference inside a Wayland compositor’s lifecycle. It’s specific to wlroots and a custom compositor based on it, but most of the techniques transfer to any C++ system at this layer.
The Platform
Custom industrial device, ARM Cortex-A72, Linux 5.15 LTS, Wayland 1.21, wlroots 0.16. The compositor is roughly 12,000 lines of C++ that we wrote on top of wlroots’ Tinywl example. It exposes a kiosk-style interface, a single fullscreen surface, an IR-driven UI, and no windowing at all. Mali GPU with proprietary user-space drivers. A nightmare to debug because half the symbols are stripped and the GPU stack vendor is contractually unable to give us source.
The reproducer: suspend (systemctl suspend), wait 30 seconds, resume. About 8% of resumes crash. No pattern that mapped to load, time-since-boot, surface contents, or anything we could see from logs.
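With a crash rate that low, it helps to know how many resumes a test run needs before a quiet result means anything. The following is a back-of-the-envelope sketch, assuming each resume crashes independently with the same probability (an assumption; nothing in the logs confirms independence). The function name is made up for illustration.

```cpp
#include <cmath>

// Smallest number of suspend/resume cycles n such that, if each cycle crashes
// independently with probability p, at least one crash shows up with the
// requested confidence: 1 - (1 - p)^n >= confidence.
// Assumes 0 < p < 1 and 0 < confidence < 1.
int cyclesNeeded(double p, double confidence) {
    return static_cast<int>(
        std::ceil(std::log(1.0 - confidence) / std::log(1.0 - p)));
}
```

At the one-in-twelve rate, cyclesNeeded(1.0 / 12.0, 0.95) comes out to 35, so a handful of clean resumes after a change says very little either way.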
What Didn’t Help
The first three weeks were spent on things that did not work:
- dmesg/journalctl. Showed a clean suspend/resume cycle. The compositor exited with signal 11, and that was the only entry. No backtrace.
- The customer’s bug report. “Screen flashes black sometimes after sleep.” Six words and a video that didn’t show anything we could use.
- Reading the wlroots source. Useful in retrospect for understanding the lifecycle, but not for spotting the bug. The bug was in our code, not theirs.
- Trying to repro on a developer workstation. Workstations are x86, run different kernel versions, have different DRM drivers, and don’t actually suspend in a way that exercises the same code paths. Lakshmi on the team spent two weeks trying to repro on a NUC. It never crashed. She finally said, in a one-on-one, “I think I’m wasting your time.” She wasn’t, but the strategy was wrong; we needed to repro on the device.
- Adding printf logs. This is what I’d been doing for six days when I realized the bug was timing-sensitive: adding logs changed the timing enough to make the crash either much rarer or, occasionally, much more frequent. It was a textbook heisenbug, and the problem is real.
The thing that finally moved us was getting gdb running against the compositor on the device, with full debug symbols, and reproducing the crash live.
Setting Up gdb on the Target
The device runs Yocto, which can produce a debug-symbol package for any recipe. Our developer image didn’t include dbg-pkgs in its image features, so we added it.
In the local.conf for the developer image:
IMAGE_FEATURES += "dbg-pkgs"
EXTRA_IMAGE_FEATURES += "tools-debug"
# Don't strip binaries. Set like this, it applies to every recipe; a pn- override scopes it to the compositor recipe alone.
INHIBIT_PACKAGE_STRIP = "1"
# Keep debug symbols in the binary instead of splitting them into -dbg packages.
INHIBIT_PACKAGE_DEBUG_SPLIT = "1"
Note: This nearly doubles the image size on flash. We built a separate dev image for debugging.
Then on the device:
$ gdb /usr/bin/our-compositor
(gdb) handle SIGUSR1 nostop noprint pass
(gdb) handle SIGPIPE nostop noprint pass
(gdb) set follow-fork-mode child
(gdb) run
SIGUSR1 is what wlroots uses for some internal signaling, and you don’t want gdb stopping on it. SIGPIPE shows up on broken Wayland client connections, which is normal during a sleep/resume cycle.
Once the compositor was running under gdb I would systemctl suspend from a second terminal, wait, resume by pressing the front-panel button, and watch.
After about 40 minutes of attempts, the compositor crashed. gdb caught the SIGSEGV. Backtrace:
Program received signal SIGSEGV, Segmentation fault.
0x0000aaaaaab43a8c in our::OutputManager::handleOutputDestroy (
listener=0xaaaaaaab8a920, data=0xaaaaaaaadcd60)
at src/output_manager.cpp:284
284 output_state_t *state = output->user_data->state;
(gdb) print output->user_data
$1 = (our::OutputUserData *) 0x0
The crash was an output->user_data dereference where user_data was null. Specifically, this code:
void OutputManager::handleOutputDestroy(struct wl_listener *listener, void *data) {
    struct wlr_output *output = static_cast<struct wlr_output *>(data);
    output_state_t *state = output->user_data->state; // crash here
    // ... cleanup
}
output->user_data was null. Why?
Reading the wlroots Lifecycle
The relevant pattern: when a wlr_output is added to the compositor, we allocate an OutputUserData structure and stuff it into output->user_data. When the output is destroyed, we look up the user_data to clean up. Standard.
The lifecycle events the compositor cares about, in order:
1. The backend announces a new wlr_output through its new_output signal: the output appears (e.g., HDMI plugged in, or DRM connector activated post-resume).
2. We attach a destroy listener and allocate user_data.
3. The output is configured, framebuffers attached, frames rendered.
4. Suspend: DRM connectors are deactivated. Depending on the driver, the connector may or may not disappear, and the wlr_output with it.
5. Resume: DRM connectors come back. A new wlr_output may be created or the old one reused.
6. If the old one is destroyed, our handleOutputDestroy is called.
The bug, after a few hours of staring: during step 4, on this hardware, the Mali driver was tearing down the connector, but the wlroots backend was firing the destroy event for a wlr_output that had been partially torn down already. Specifically, the user_data pointer had been freed by an earlier teardown path that we hadn’t finished migrating from a previous architecture.
The Actual Bug
Here’s the code path that bit us. Simplified:
// In our compositor init, we attach two listeners:
void OutputManager::onNewOutput(struct wlr_output *output) {
    auto user_data = new OutputUserData();
    user_data->state = new output_state_t();
    output->user_data = user_data;
    user_data->destroy.notify = handleOutputDestroy;
    wl_signal_add(&output->events.destroy, &user_data->destroy);
    user_data->frame.notify = handleOutputFrame;
    wl_signal_add(&output->events.frame, &user_data->frame);
}
// On suspend, we explicitly tear down GPU resources
void OutputManager::onSuspend() {
    for (auto &output : outputs) {
        if (output->user_data) {
            delete output->user_data->state; // <-- problem
            // <-- problem: this also frees the destroy and frame listeners
            // embedded in user_data while they are still linked into the
            // output's signal lists.
            delete output->user_data;
            output->user_data = nullptr;
        }
    }
}
// On destroy event from wlroots:
void OutputManager::handleOutputDestroy(struct wl_listener *listener, void *data) {
    struct wlr_output *output = static_cast<struct wlr_output *>(data);
    output_state_t *state = output->user_data->state; // crash if user_data is null
    // ...
}
The flow was:
- Suspend triggers onSuspend.
- We free user_data and set it to null.
- Wlroots, deeper in its own teardown logic, fires the destroy event on the same output.
- Our handleOutputDestroy runs, dereferences user_data, crashes.
Why hadn’t we seen this before? Because on the NUC and earlier development hardware, the DRM driver kept the connector alive across suspend/resume and never destroyed the output. The Mali driver on the production hardware destroyed it. So the crash only manifested in production.
The original onSuspend was added during a refactor in 2022 to free GPU memory during suspend. The hypothesis, at the time, was that we needed to free explicitly because the driver wouldn’t. The hypothesis was wrong on Mali hardware: the driver does tear things down itself, so our early free left handleOutputDestroy looking at a null user_data when the destroy event arrived.
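The ordering can be replayed without wlroots at all. The sketch below uses hypothetical stand-in types (MockOutput, OutputUserData, OutputState are illustrative, not the wlroots structures or the compositor's real classes) and records what each step observes instead of crashing.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are not the wlroots types.
struct OutputState {};
struct OutputUserData { OutputState *state = nullptr; };

struct MockOutput {
    OutputUserData *user_data = nullptr;
    std::vector<std::function<void(MockOutput &)>> destroy_listeners;
    void emitDestroy() {
        for (auto &listener : destroy_listeners) listener(*this);
    }
};

// Replays new-output, suspend, destroy in the order that crashed and returns
// a trace of what each step observed.
std::vector<std::string> replayCrashOrdering() {
    std::vector<std::string> trace;
    MockOutput output;

    // Output appears: wrapper and state are allocated, listener attached.
    output.user_data = new OutputUserData();
    output.user_data->state = new OutputState();
    output.destroy_listeners.push_back([&trace](MockOutput &o) {
        // The original handler dereferenced o.user_data unconditionally here.
        trace.push_back(o.user_data ? "destroy: user_data valid"
                                    : "destroy: user_data is null");
    });
    trace.push_back("new_output: user_data allocated");

    // The old suspend path frees the wrapper and nulls the pointer.
    delete output.user_data->state;
    delete output.user_data;
    output.user_data = nullptr;
    trace.push_back("suspend: user_data freed");

    // The backend then tears the output down and emits destroy.
    output.emitDestroy();
    return trace;
}
```

The last trace entry is "destroy: user_data is null", which is the state the real handler was in at output_manager.cpp:284. The mock keeps its listeners inside the output, so it does not model the listener storage being freed along with the wrapper.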
The Fix
Two changes. First, handleOutputDestroy had to defend against null user_data. This is good practice anyway; wlroots decides when the destroy event fires, and you can’t assume your user_data is still valid when it does:
void OutputManager::handleOutputDestroy(struct wl_listener *listener, void *data) {
    struct wlr_output *output = static_cast<struct wlr_output *>(data);
    if (!output->user_data) {
        // Wrapper already gone; nothing left to clean up. Unlinking is only
        // safe if this listener's storage outlived the wrapper.
        wl_list_remove(&listener->link);
        return;
    }
    OutputUserData *ud = static_cast<OutputUserData *>(output->user_data);
    delete ud->state;
    // Unlink both listeners before freeing the struct that holds them.
    wl_list_remove(&ud->destroy.link);
    wl_list_remove(&ud->frame.link);
    delete ud;
    output->user_data = nullptr;
}
Second, the onSuspend should not be freeing user_data at all. The wlr_output lifecycle is owned by wlroots, and our user_data should be freed by handleOutputDestroy exclusively. The original “free GPU memory” goal can be achieved by freeing the state object (which holds the GPU buffer references) without freeing the user_data wrapper:
void OutputManager::onSuspend() {
    for (auto &output : outputs) {
        if (output->user_data) {
            OutputUserData *ud = static_cast<OutputUserData *>(output->user_data);
            // Free GPU-attached state but keep the user_data wrapper.
            // Wlroots will destroy the output and our handleOutputDestroy
            // will fire if the kernel actually destroys the connector.
            delete ud->state;
            ud->state = nullptr;
        }
    }
}
After this change, handleOutputDestroy had to additionally null-check ud->state:
output_state_t *state = ud->state; // may be null after suspend
if (state) {
    // GPU cleanup
}
Eleven weeks of debugging, six lines of fix.
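The ownership rule behind the fix (suspend may release the state, only the destroy handler releases the wrapper) is easy to check in isolation. This is a minimal sketch with hypothetical stand-in types and free functions that mirror the shape of the fixed code, not the compositor's actual classes; the live counters exist only so a test can see leaks.

```cpp
// Hypothetical stand-ins that count live objects.
struct OutputState {
    static int live;
    OutputState() { ++live; }
    ~OutputState() { --live; }
};
int OutputState::live = 0;

struct OutputUserData {
    static int live;
    OutputState *state = nullptr;
    OutputUserData() { ++live; }
    ~OutputUserData() { --live; }
};
int OutputUserData::live = 0;

struct MockOutput { OutputUserData *user_data = nullptr; };

void onNewOutput(MockOutput &output) {
    output.user_data = new OutputUserData();
    output.user_data->state = new OutputState();
}

// Suspend releases the GPU-attached state only; the wrapper stays.
void onSuspend(MockOutput &output) {
    if (!output.user_data) return;
    delete output.user_data->state;
    output.user_data->state = nullptr;
}

// Destroy is the single place that releases the wrapper.
void onDestroy(MockOutput &output) {
    if (!output.user_data) return;
    delete output.user_data->state;  // deleting a null pointer is a no-op
    delete output.user_data;
    output.user_data = nullptr;
}

// Runs one lifecycle and returns how many objects are still alive afterwards.
int leakedAfterLifecycle(bool suspend_first) {
    MockOutput output;
    onNewOutput(output);
    if (suspend_first) onSuspend(output);
    onDestroy(output);
    return OutputState::live + OutputUserData::live;
}
```

Both orderings, destroy with and without a preceding suspend, should return zero, and neither should touch a freed wrapper.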
Tools That Earned Their Keep
After this bug, I added some habits that I’d skipped before.
AddressSanitizer on every CI build. Not just the production build, every PR build runs ASan. The onSuspend/handleOutputDestroy interaction would have shown up as a use-after-free under ASan in seconds, with a clear stack trace of both the allocation and the free. The reason we hadn’t been running ASan: it adds about 2x overhead, and the team had it disabled “for performance.” Re-enabled. The performance hit on CI is fine; the cost of three-week bugs is not.
# In our CMake:
if(SANITIZE_ADDRESS)
    target_compile_options(compositor PRIVATE -fsanitize=address -fno-omit-frame-pointer -O1)
    target_link_options(compositor PRIVATE -fsanitize=address)
endif()
Valgrind on a small test harness. We can’t valgrind the full compositor on the device; it’s too slow to even boot. We can valgrind a unit-test harness that exercises the OutputManager in isolation against a mock wlroots backend. Worth setting up; would have caught this bug at unit-test time.
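One possible shape for the mock side of such a harness is sketched below. Everything here is hypothetical (MockListener, MockSignal, markFreed, danglingListeners are illustrative names, not wlroots or libwayland APIs). The idea is that the mock signal remembers which listeners the test has declared freed, so it can report a listener that was released while still registered instead of calling into dead memory.

```cpp
#include <algorithm>
#include <set>
#include <vector>

struct MockListener;
using NotifyFn = void (*)(MockListener *listener, void *data);

// Hypothetical stand-in for wl_listener.
struct MockListener {
    NotifyFn notify = nullptr;
};

// Hypothetical stand-in for wl_signal with extra bookkeeping for tests.
class MockSignal {
public:
    void add(MockListener *listener) { listeners_.push_back(listener); }

    void remove(MockListener *listener) {
        listeners_.erase(
            std::remove(listeners_.begin(), listeners_.end(), listener),
            listeners_.end());
    }

    // The test calls this where the code under test would free the memory
    // holding the listener. The harness keeps the real storage alive.
    void markFreed(MockListener *listener) { freed_.insert(listener); }

    // Calls every registered listener that is not marked freed, and counts
    // the ones that are still registered despite being marked freed.
    void emit(void *data) {
        const std::vector<MockListener *> snapshot = listeners_;
        for (MockListener *listener : snapshot) {
            if (freed_.count(listener)) {
                ++dangling_;
                continue;
            }
            listener->notify(listener, data);
        }
    }

    int dangling() const { return dangling_; }

private:
    std::vector<MockListener *> listeners_;
    std::set<MockListener *> freed_;
    int dangling_ = 0;
};

// Scenario: the wrapper is freed on suspend, with or without unregistering
// its listener first, and then the destroy signal fires.
int danglingListeners(bool unregister_before_free) {
    MockSignal destroy_signal;
    MockListener destroy_listener;  // would live inside OutputUserData
    destroy_listener.notify = [](MockListener *, void *) {};
    destroy_signal.add(&destroy_listener);

    if (unregister_before_free) destroy_signal.remove(&destroy_listener);
    destroy_signal.markFreed(&destroy_listener);  // simulated delete

    destroy_signal.emit(nullptr);
    return destroy_signal.dangling();
}
```

A unit test would assert that danglingListeners(false) reports one dangling listener and danglingListeners(true) reports none. Run under valgrind with a real delete in place of markFreed, the first case is the kind of access it flags.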
A wlroots-aware logging macro. Wlroots logs to its own facility. We’d been routing those to journalctl but not capturing them in our own crash dumps. After this bug, we wrote a wrapper that prepends [WLROOTS] to all wlroots log lines and dumps the last 200 to a file on crash:
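#include <cstdarg>
#include <cstdio>
#include <cstdlib>
#include <deque>
#include <string>
extern "C" {
#include <wlr/util/log.h>
}
// Entry layout inferred from how the ring is used below.
struct LogEntry {
    enum wlr_log_importance level;
    std::string msg;
};
static std::deque<LogEntry> g_log_ring; // last 200 wlroots log lines
static void wlroots_log_handler(enum wlr_log_importance level, const char *fmt, va_list args) {
    char buf[1024];
    vsnprintf(buf, sizeof(buf), fmt, args); // consumes args; use buf from here on
    g_log_ring.push_back({level, buf});
    if (g_log_ring.size() > 200) g_log_ring.pop_front();
    // also forward to stderr / journal
    fprintf(stderr, "[WLROOTS] %s\n", buf);
}
// In main():
wlr_log_init(WLR_DEBUG, wlroots_log_handler);
// In our SIGSEGV handler (best effort: stdio and std::string are not async-signal-safe):
void crash_handler(int sig) {
    FILE *f = fopen("/var/log/compositor-crash-trail.log", "w");
    if (f) {
        for (auto &entry : g_log_ring) {
            fprintf(f, "[WLROOTS] [%d] %s\n", static_cast<int>(entry.level), entry.msg.c_str());
        }
        fclose(f);
    }
    abort();
}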
The crash trail captured the wlroots-internal lifecycle events we couldn’t see before. On the next class of bugs we caught, we had context within minutes instead of weeks.
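One caveat with a crash trail like this: fopen, fprintf, and heap-backed containers are not async-signal-safe, so a handler that uses them can itself hang or crash if the fault happened inside the allocator. A more conservative variant, sketched here with hypothetical names (LogRing, kRingCapacity, kLineBytes) and a fixed-size buffer, touches only preallocated memory and write(2) when dumping.

```cpp
#include <cstddef>
#include <cstring>
#include <unistd.h>

constexpr std::size_t kRingCapacity = 200;  // lines kept, as in the text
constexpr std::size_t kLineBytes = 256;     // assumed per-line limit

// Fixed-size ring of log lines. No allocation after construction.
struct LogRing {
    char lines[kRingCapacity][kLineBytes] = {};
    std::size_t lengths[kRingCapacity] = {};
    std::size_t next = 0;   // slot the next push writes to
    std::size_t count = 0;  // valid lines, at most kRingCapacity

    // Called from the normal logging path; truncates overlong lines.
    void push(const char *msg) {
        std::size_t len = std::strlen(msg);
        if (len > kLineBytes - 1) len = kLineBytes - 1;
        std::memcpy(lines[next], msg, len);
        lines[next][len] = '\0';
        lengths[next] = len;
        next = (next + 1) % kRingCapacity;
        if (count < kRingCapacity) ++count;
    }

    // Maps a logical index (0 is the oldest line) to a storage slot.
    std::size_t slot(std::size_t i) const {
        return (next + kRingCapacity - count + i) % kRingCapacity;
    }

    const char *at(std::size_t i) const { return lines[slot(i)]; }

    // Intended for the crash handler: writes oldest to newest using only
    // write(2) on an already-open file descriptor.
    void dump(int fd) const {
        for (std::size_t i = 0; i < count; ++i) {
            const std::size_t s = slot(i);
            ssize_t ignored = write(fd, lines[s], lengths[s]);
            ignored = write(fd, "\n", 1);
            (void)ignored;  // nothing useful to do on failure while crashing
        }
    }
};
```

The handler would open the trail file with open(2) and pass the descriptor to dump. A crash that lands in the middle of push can still leave one partially written line, which is usually an acceptable trade for a handler that does not allocate.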
Smart pointers, finally. This bug was a delete problem. Mixing manual new/delete with C library lifecycles is a category of pain. We’ve been migrating OutputUserData and similar structures to std::unique_ptr with custom deleters that null out the wlroots user_data field. It’s not free; wlroots is a C library, and many of its callbacks pass raw pointers, but the structures we own should be unique_ptrs.
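A minimal sketch of that pattern, assuming a wrapper that keeps a back-pointer to its output. The types and the attachUserData helper are hypothetical stand-ins, not the compositor's code or wlroots structures; a real deleter would also unlink the wrapper's listeners before freeing it.

```cpp
#include <memory>

// Hypothetical stand-ins for illustration.
struct MockOutput { void *user_data = nullptr; };
struct OutputState {};

struct OutputUserData {
    MockOutput *output = nullptr;        // back-pointer to the C-side object
    std::unique_ptr<OutputState> state;  // GPU-attached state, may be reset
};

// Deleter that clears the C-side pointer before freeing the wrapper, so the
// output never keeps a dangling user_data.
struct UserDataDeleter {
    void operator()(OutputUserData *ud) const {
        if (ud->output && ud->output->user_data == ud) {
            ud->output->user_data = nullptr;
        }
        delete ud;
    }
};

using OutputUserDataPtr = std::unique_ptr<OutputUserData, UserDataDeleter>;

// Allocates the wrapper, links it to the output, and hands ownership back.
OutputUserDataPtr attachUserData(MockOutput &output) {
    OutputUserDataPtr ud(new OutputUserData());
    ud->output = &output;
    ud->state = std::make_unique<OutputState>();
    output.user_data = ud.get();
    return ud;
}
```

With this shape, a suspend path can call ud->state.reset() to drop the state while the wrapper stays attached, and whoever owns the OutputUserDataPtr decides when the wrapper goes away. The raw pointer stored in the output is then only ever a non-owning view.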
What I’d Tell My 11-Week-Ago Self
Three things.
The bug was not in wlroots. It was in our suspend cleanup. I spent ten days reading wlroots source code looking for “the wlroots suspend bug.” There was no such bug. Suspect your own code first, especially the parts you wrote during a refactor.
The repro environment matters more than the debugger. Lakshmi’s two weeks on a NUC produced no signal because the NUC’s DRM driver doesn’t do what the Mali driver does. As soon as I got gdb running on the actual device, the bug fell out within a few hours. If you can’t reproduce on the target hardware, you are not actually debugging.
Add the safety nets before you need them. ASan, valgrind, log ringbuffers, smart pointers. None of these would have prevented this bug from being written, but each of them would have shortened the time to find it. We added them all after. The next display-stack crash took two days to diagnose, not eleven weeks.
The compositor has been stable for 14 months since this fix. The team has switched away from raw new for any wlroots-related allocations. We test suspend/resume in a hardware-in-the-loop nightly job, 200 cycles, no crashes for 9 months running.
I still remember the line number of the segfault. output_manager.cpp:284. There’s a comment on it now that says // see https://internal-wiki/wayland-suspend-bug-2022-Q3 for why this nullcheck exists. The wiki page has 4,300 words and three diagrams and is the closest thing to a war story I’ve published anywhere. Now I guess this is the next closest thing.