Production Postmortem: the ARM Is Killing Me
We got a report that RavenDB was failing in the field, literally.
“If a tree falls in a forest and no one is around to hear it, does it make a sound?” This is a well-known philosophical question, and this story is its technological equivalent. We got a report that RavenDB was failing in the field. But the details around the failure were critical.
The failure happened in the field, literally. This is a system running an industrial robot on a custom ARM board. The failure would only happen on the robot in the field and would not reproduce in the user’s test environment or on our own systems. Initially, that was all the information we had: “This particular robot works fine for a while, but as soon as there is a break, RavenDB dies and needs to be restarted.” I have to say, that was the first time I had run into a system that would crash when it went idle instead of dying under load.
My recommendation that they just keep the robot busy at all times was shot down, and for a while, we were in the dark. It didn’t help that this was literally a custom ARM machine that we had no access to. We finally managed to figure out that the crash was some variant of SIGSEGV or SIGABRT. That was concerning. The ARM machine in question runs in 32 bits, and the worry was that our 32-bit code was somehow doing an out-of-bounds read. This was a crash in production, so we allocated a couple of people to investigate and try to figure out what was going on.
We started by reviewing all our 32-bit memory management code and, in parallel, attempted to reproduce the issue on a Raspberry Pi (the nearest machine we had to what was actually running in the field). We got a lucky break when someone did manage to kill the RavenDB process in our own lab somehow. The exit code was 139 (Segmentation fault), but we weren’t sure what was actually going on. We were trying all sorts of stuff on the machine, seeing what would cause this. We basically fed it all sorts of data that we had lying around to see if it would choke on any of it. One particular data export would sometimes cause a crash. Sometimes. I really, really hate this word. That meant that we were stuck trying to figure something out by repeated attempts and relying on the law of averages.
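As a side note on reading that exit code: shells conventionally report a process killed by a signal as 128 plus the signal number, which is how 139 maps to SIGSEGV (signal 11 on Linux). The sketch below only illustrates that convention; the helper name `describe_exit_code` is made up for this example and is not part of RavenDB or any tool we used.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a shell-style exit status (hypothetical helper).

    By convention, a status above 128 means the process was terminated
    by signal number (status - 128).
    """
    if code > 128:
        signum = code - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            # The number does not correspond to a signal known on this platform.
            return f"killed by unknown signal {signum}"
    return f"exited normally with status {code}"


print(describe_exit_code(139))
```

The same arithmetic applies to any other signal-based status that shows up in a service log.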
It took several more days, but we figured out that a certain sequence of operations would reliably cause a crash within 5 to 30 minutes. As you can imagine, this made debugging pretty hard. The same sequence of operations on Intel machines, either 32 bits or 64 bits, worked without issue, regardless of how many times we repeated it.
We followed several false trails in our investigation of RavenDB’s 32-bit memory management code. We had a few cases where we thought that we had something, but nothing panned out. We instrumented the code to verify that everything was kosher, and it certainly seemed to be, but the system still crashed on occasion.
RavenDB usually relies on mmap() to access the data on disk, but on 32 bits, we couldn’t do that for the whole file. With an addressable memory of just 2 GB, we cannot map the whole file to memory if it is too large. Because of that, we map portions of the file to memory as needed for each transaction. That led us to suspect that we were somehow unmapping memory while it was still in use or something like that. But we went through the code with a fine-tooth comb and found nothing. We used strace to try to pin down what was going on, and we could see that there were no surprise calls to munmap() that shouldn’t be there.
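To make the "map only a portion of the file" idea concrete, here is a minimal sketch in Python rather than RavenDB's actual C# code. The helper names `map_window` and `read_range` are hypothetical; the only real API used is the standard `mmap` module, whose `offset` must be a multiple of `mmap.ALLOCATIONGRANULARITY`. That is why the start of the window is rounded down and the difference is carried along.

```python
import mmap
import os
import tempfile


def map_window(fileno: int, start: int, length: int):
    """Map only the bytes [start, start + length) of a file (hypothetical helper).

    Returns the mapping and the position of `start` inside it, because the
    mapping itself has to begin on an allocation-granularity boundary.
    """
    gran = mmap.ALLOCATIONGRANULARITY
    aligned_start = (start // gran) * gran   # round down to a valid offset
    delta = start - aligned_start            # where `start` lands in the window
    window = mmap.mmap(fileno, length + delta,
                       access=mmap.ACCESS_READ, offset=aligned_start)
    return window, delta


def read_range(path: str, start: int, length: int) -> bytes:
    """Read a byte range through a short-lived partial mapping."""
    with open(path, "rb") as f:
        window, delta = map_window(f.fileno(), start, length)
        try:
            return window[delta:delta + length]
        finally:
            window.close()                   # unmap as soon as we are done


# Demo on a throwaway 256 KiB file whose byte at position i is i % 256.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 1024)
    path = tmp.name
chunk = read_range(path, start=70_000, length=8)
os.remove(path)
```

The key property is that any access must stay inside the window that is currently mapped; nothing guarantees that the addresses just past it are valid.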
What was really nasty was the fact that when we failed with SIGSEGV, the error was always on an address just past the area of memory that we had mapped. This led us to suspect that we had an out-of-bounds write and sent us chasing after that rogue pointer operation. We instrumented our code ever more heavily but weren’t able to find any such operation. All our reads and writes were in bounds, and that was incredibly frustrating.
RavenDB is a CoreCLR application. As such, debugging it on an ARM device is… challenging. We tried lldb and gdb. Both allow unmanaged debugging, but even with lldb, we couldn’t debug managed code or even just pull the managed stack properly on ARM. Eventually, we found this extension, which allowed us to do SSH debugging on the Raspberry Pi from a Windows machine.
That helped, and we finally figured out where in our managed code the error happened. This always happened during a copy of memory from a document write to a scratch buffer in a memory mapped file. The entire thing was wrapped in boundary checks and everything was good.
We went back to the drawing board and attempted to set it on fire because it was no good for us. Once we put the fire out, we looked at what remained and had a Eureka! moment. One of the differences between ARM and x86/x64 machines is in how they treat alignment. On x64/x86, alignment is pretty much a non-issue for most operations. On ARM, however, an unaligned operation can cause a CPU fault. That led us to suspect that the SIGABRT error we got was indeed an alignment issue. Most of our code already aligns its memory accesses, because while alignment isn’t mandatory on x64/x86, it can still give better performance in certain cases. But it was certainly possible that we had missed a spot.
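For readers who have not had to think about alignment before, the check itself is simple arithmetic: an access of N bytes (N a power of two) is aligned when the address is a multiple of N. The following is an illustrative sketch only; `is_aligned` and `align_up` are hypothetical names, not functions from RavenDB or the CoreCLR.

```python
def is_aligned(address: int, size: int) -> bool:
    """True if `address` is a multiple of `size` (size must be a power of two)."""
    return (address & (size - 1)) == 0


def align_up(address: int, size: int) -> int:
    """Round `address` up to the next multiple of `size` (a power of two)."""
    return (address + size - 1) & ~(size - 1)


# A 4-byte access at 0x1000 is aligned; the same access at 0x1002 is not.
print(is_aligned(0x1000, 4), is_aligned(0x1002, 4))
# The next 4-byte-aligned address at or after 0x1001:
print(hex(align_up(0x1001, 4)))
```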
We discovered a horrifying problem: we were using the CopyBlock method, and obviously, that was the issue, right? We wrote a small test program that simulated what we were doing and used unaligned CopyBlock, and it just worked. But maybe our situation was different? Using CopyBlockUnaligned on x86 led to a 40 percent performance drop (we call this method a lot), and initially, it looked like it fixed the problem on ARM. Except that on the third or fourth attempt to reproduce the problem, we ran into our good old SIGSEGV again, so that wasn’t it. This time, we went to the drawing board and broke it.
During this time, we have managed to capture the error inside the debugger several times. Here is what it looked like:
Reading ARM assembly is not something that I’m used to doing, so I looked at the manual, and it looks like this instruction is to store multiple registers in descending order and… no clue beyond that. It didn’t make any sort of sense to us.
At this point, we were several weeks and four or five people into this investigation (we consider such issues serious). We had instrumented our code to the point where it barely ran, we could reproduce the error in a relatively short time, and we were fairly convinced that we were doing things properly. We had gone over the kernel code for memory mapping and unmapping several times, and we had tried stracing, debugging, everything. We were stumped. But we also had enough data at this point to paint a fairly clear picture of what was going on. So we opened an issue for the CoreCLR about this, suspecting that the issue was in the implementation of this method.
We got a strange response, though: “This assembly code doesn’t make any sense.” I did mention that I have no idea about ARM assembly, right? We tried reproducing the same thing in gdb, instead of lldb, and got the following assembly code:
This looked a lot more readable, to be sure. And it was extremely suspicious. Let me explain why:
The faulting instruction is: ldr r3, [r0, #0]
What this says is, basically, read a word from the address pointed to by r0 (with 0 offset) into r3.
Now, r0, in this case, has this value: 0x523b3ffd. Note the last three characters, ffd.
We are running this on a 32-bit machine, so a word is 4 bytes in size. 0xFFD + 4 = 0x1001, which is past the last offset in the page (0xFFF).
In other words, we had a read beyond the current page boundary. In most cases, the next page is mapped, so everything goes smoothly. In some cases, the next page is not mapped, so you are going to get an access violation trying to read a byte from the next page.
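The arithmetic above can be checked in a few lines. This is only a sketch of the reasoning, assuming a 4 KB page size (which is what the "last three hex characters" observation implies); `crosses_page` is a hypothetical helper name, and the address is the value of r0 quoted above.

```python
PAGE_SIZE = 4096  # assumption: 4 KB pages, i.e. the low three hex digits are the page offset


def crosses_page(address: int, nbytes: int, page_size: int = PAGE_SIZE) -> bool:
    """True if reading `nbytes` starting at `address` touches more than one page."""
    first_page = address // page_size
    last_page = (address + nbytes - 1) // page_size
    return first_page != last_page


addr = 0x523B3FFD                             # value of r0 at the faulting instruction
offset_in_page = addr % PAGE_SIZE             # 0xFFD
bytes_left_in_page = PAGE_SIZE - offset_in_page

print(hex(offset_in_page), bytes_left_in_page, crosses_page(addr, 4))
```

Only three bytes remain in the page at that address, so a 4-byte load has to take its last byte from the following page, whether or not that page is mapped.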
The fix for this is here:
This is literally a single character change. And it is probably the worst ratio of lines of code to time invested that I have ever seen for any bug. Actually, there wasn’t even any code change in RavenDB’s codebase, so that is 0 lines of code / (4 people x 4 weeks).
The good thing is that at least we have proven that the 32-bit memory code is rock solid, and we have a much better understanding of how to resolve the next issue.
Published at DZone with permission of Oren Eini, CEO RavenDB , DZone MVB. See the original article here.