It was a pretty incredible coincidence. Only a few days apart, I had to tackle two problems that had to do with nested exception handlers. Specifically, an infinite loop of nested exceptions that led to a stack overflow. And that’s a pretty fatal combination. A stack overflow is an extremely nasty error to debug; a nested exception means the exception handler encountered an exception, which can’t be pretty; and to add insult to injury, a stack corruption was also involved behind the scenes. Read on to learn some more about the trickiness of diagnosing nested exceptions and what can cause them in the first place.
Case 1: Read error in VC’s exception filter
The client had multiple dump files for me to look at, and they all exhibited a pretty crazy pattern. An exception would occur in the application — one that is perfectly expected and handled. However, instead of being handled normally, it would cause an infinite cascade of nested exceptions that eventually crashed the process with a stack overflow. Here’s a quick illustration of what it looked like, edited for brevity:
0:000> kcn 10000
... <repeated hundreds more times>
19e9 <Unloaded_Helper.dll>
19ea NestedExceptions1!exception_filter
19eb NestedExceptions1!trigger_exception
19ec MSVCR120D!_EH4_CallFilterFunc
19ed MSVCR120D!_except_handler4_common
19ee NestedExceptions1!_except_handler4
19ef ntdll!ExecuteHandler2
19f0 ntdll!ExecuteHandler
19f1 ntdll!KiUserExceptionDispatcher
19f2 <Unloaded_Helper.dll>
19f3 NestedExceptions1!exception_filter
19f4 NestedExceptions1!trigger_exception
19f5 MSVCR120D!_EH4_CallFilterFunc
19f6 MSVCR120D!_except_handler4_common
19f7 NestedExceptions1!_except_handler4
19f8 ntdll!ExecuteHandler2
19f9 ntdll!ExecuteHandler
19fa ntdll!KiUserExceptionDispatcher
19fb <Unloaded_Helper.dll>
19fc NestedExceptions1!exception_filter
19fd NestedExceptions1!trigger_exception
19fe MSVCR120D!_EH4_CallFilterFunc
19ff MSVCR120D!_except_handler4_common
1a00 NestedExceptions1!_except_handler4
1a01 ntdll!ExecuteHandler2
1a02 ntdll!ExecuteHandler
1a03 ntdll!KiUserExceptionDispatcher
1a04 NestedExceptions1!trigger_exception
1a05 NestedExceptions1!main
1a06 NestedExceptions1!__tmainCRTStartup
1a07 NestedExceptions1!mainCRTStartup
1a08 kernel32!BaseThreadInitThunk
1a09 ntdll!__RtlUserThreadStart
1a0a ntdll!_RtlUserThreadStart
In the preceding call stack, it’s pretty clear that exception_filter is trying to call a function in an unloaded DLL (Helper.dll). That in turn causes an exception (an access violation, most likely), which transfers control back to exception_filter, and we’re ankle-deep in the nested exception loop. By the way, it’s pretty easy to follow the exception chain if you know what to look for. Here’s the kb output for a few frames:
006deb54 00a85706 00000000 00000000 00000000 <Unloaded_Helper.dll>+0x1115e
006dec28 00a85eab 006dec4c 0f4e3924 00000000 NestedExceptions1!exception_filter+0x26
006dec30 0f4e3924 00000000 00000000 00000000 NestedExceptions1!trigger_exception+0x6b
006dec44 0f4e9268 006deda0 006dedf0 00000001 MSVCR120D!_EH4_CallFilterFunc+0x12
006dec7c 00a866d2 00a90000 00a81041 006deda0 MSVCR120D!_except_handler4_common+0xb8
006dec9c 7794c881 006deda0 006df734 006dedf0 NestedExceptions1!_except_handler4+0x22
006decc0 7794c853 006deda0 006df734 006dedf0 ntdll!ExecuteHandler2+0x26
006ded88 7794c6bb 006deda0 006dedf0 006deda0 ntdll!ExecuteHandler+0x24
006ded88 58b3115e 006deda0 006dedf0 006deda0 ntdll!KiUserExceptionDispatcher+0xf
006df0d4 00a85706 00000000 00000000 00000000 <Unloaded_Helper.dll>+0x1115e
006df1a8 00a85eab 006df1cc 0f4e3924 00000000 NestedExceptions1!exception_filter+0x26
006df1b0 0f4e3924 00000000 00000000 00000000 NestedExceptions1!trigger_exception+0x6b
006df1c4 0f4e9268 006df324 006df374 00000001 MSVCR120D!_EH4_CallFilterFunc+0x12
006df1fc 00a866d2 00a90000 00a81041 006df324 MSVCR120D!_except_handler4_common+0xb8
006df21c 7794c881 006df324 006df734 006df374 NestedExceptions1!_except_handler4+0x22
006df240 7794c853 006df324 006df734 006df374 ntdll!ExecuteHandler2+0x26
006df30c 7794c6bb 006df324 006df374 006df324 ntdll!ExecuteHandler+0x24
006df30c 00a85e8f 006df324 006df374 006df324 ntdll!KiUserExceptionDispatcher+0xf
006df744 00a86068 00000000 00000000 7ebab000 NestedExceptions1!trigger_exception+0x4f
006df818 00a86a79 00000001 00c37b88 00c35178 NestedExceptions1!main+0x28
006df868 00a86c6d 006df87c 76dc919f 7ebab000 NestedExceptions1!__tmainCRTStartup+0x199
006df870 76dc919f 7ebab000 006df8c0 77960bbb NestedExceptions1!mainCRTStartup+0xd
006df87c 77960bbb 7ebab000 35ed4a97 00000000 kernel32!BaseThreadInitThunk+0xe
006df8c0 77960b91 ffffffff 7794c9d2 00000000 ntdll!__RtlUserThreadStart+0x20
006df8d0 00000000 00a812d5 7ebab000 00000000 ntdll!_RtlUserThreadStart+0x1b
On the ntdll!ExecuteHandler frames, the second argument shown (006dedf0 and 006df374) is the context record, and the argument before it (006deda0 and 006df324) is the exception record. You can inspect them in WinDbg using the .cxr and .exr commands:
0:000> .exr 006df324
ExceptionAddress: 00a85e8f (NestedExceptions1!trigger_exception+0x0000004f)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000000
NumberParameters: 2
Parameter[0]: 00000001
Parameter[1]: 00000000
Attempt to write to address 00000000

0:000> .cxr 006dedf0
eax=cccccccc ebx=00000000 ecx=00000000 edx=00000000 esi=006df0dc edi=006df1a8
eip=58b3115e esp=006df0d8 ebp=006df1a8 iopl=0 nv up ei pl nz na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010206
<Unloaded_Helper.dll>+0x1115e:
58b3115e ?? ???

0:000> .exr 006deda0
ExceptionAddress: 58b3115e (<Unloaded_Helper.dll>+0x0001115e)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000010
NumberParameters: 2
Parameter[0]: 00000008
Parameter[1]: 58b3115e
Attempt to execute non-executable address 58b3115e
This shows that the original exception was an access violation in the trigger_exception function, but it was shadowed by another access violation that was caused by executing code from the unloaded DLL.
But the original problem was somewhat more subtle than an unloaded DLL. In the actual application, exception_filter was the exception filter installed by Visual C++ to handle C++ exceptions: msvcr*!__InternalCxxFrameHandler. Somehow it would trigger a nested exception with exception code 0xc0000006 (STATUS_IN_PAGE_ERROR, an in-page I/O error) and I/O error code 0xc000020c (STATUS_CONNECTION_DISCONNECTED). That nested exception would transfer control to __InternalCxxFrameHandler again, which would hit the same error again and again.
Armed with this information, we began to investigate. The actual memory access that caused the nested exception was to a data structure passed to __InternalCxxFrameHandler, which is stored inside the binary and contains exception-handling information emitted by the C++ compiler. This information isn’t frequently accessed — it is only required when an exception occurs.
The final piece of the puzzle was that the application was running from a network drive, which was being accessed by a large number of machines, putting great load on the server’s network connection. The working hypothesis, then, was the following:
- An exception occurred early in the application’s initialization path
- A connectivity error also occurred, causing the network drive where the application’s binary resides to be temporarily disconnected
- The Visual C++ exception filter needed access to a data structure stored in the application’s binary, but that data structure wasn’t yet cached in RAM from the network drive because it is only accessed when an exception occurs
- The exception filter then failed with an I/O error, which transferred control to the exception filter again — causing the nested loop of exceptions
I was able to reproduce this issue by writing an exception filter that accesses a large read-only global data structure (read-only globals are also stored in the application binary), and running the sample app from a network drive. When I disabled the network adapter prior to causing the exception, I got exactly the symptoms described above.
How did we fix the problem? It turns out the Visual C++ linker has a setting called /SWAPRUN (specifically, /SWAPRUN:NET), which instructs the system to copy the application binary from the network to the local swap file and run it from there. Once the application has started, parts of the binary can no longer become unavailable because of a connectivity problem. It’s obviously better to avoid the connectivity problem in the first place, but given that networks are inherently unreliable, this is a must-use switch if you’re running an application from a network drive.
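If you want to check whether a given binary was really linked with /SWAPRUN:NET, the switch is recorded as the IMAGE_FILE_NET_RUN_FROM_SWAP bit (0x0800) in the Characteristics field of the PE file header. Here is a minimal sketch of that check over an in-memory copy of the file. The function names read_le and has_swaprun_net are made up for this illustration, and the code does only the bare minimum of header validation:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Characteristics bit set by /SWAPRUN:NET (IMAGE_FILE_NET_RUN_FROM_SWAP).
constexpr std::uint32_t kImageFileNetRunFromSwap = 0x0800;

// Reads a little-endian value of `width` bytes (at most 4) at `offset`.
// Returns false instead of reading past the end of the buffer.
bool read_le(const std::vector<std::uint8_t>& image, std::size_t offset,
             std::size_t width, std::uint32_t& out) {
    if (width > 4 || offset > image.size() || image.size() - offset < width)
        return false;
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < width; ++i)
        value |= static_cast<std::uint32_t>(image[offset + i]) << (8 * i);
    out = value;
    return true;
}

// Hypothetical helper: true if the PE image has the "run from swap file
// when on a network" bit set in its COFF file header.
bool has_swaprun_net(const std::vector<std::uint8_t>& image) {
    std::uint32_t mz = 0, pe_offset = 0, signature = 0, characteristics = 0;
    if (!read_le(image, 0, 2, mz) || mz != 0x5A4Du)           // "MZ"
        return false;
    if (!read_le(image, 0x3C, 4, pe_offset))                  // e_lfanew
        return false;
    if (!read_le(image, pe_offset, 4, signature) ||
        signature != 0x00004550u)                             // "PE\0\0"
        return false;
    // The 20-byte COFF header follows the 4-byte signature;
    // Characteristics is its last 2-byte field, at offset 18.
    const std::size_t characteristics_offset =
        static_cast<std::size_t>(pe_offset) + 4 + 18;
    if (!read_le(image, characteristics_offset, 2, characteristics))
        return false;
    return (characteristics & kImageFileNetRunFromSwap) != 0;
}
```

Reading the file into the vector is left to the caller, which keeps the sketch independent of any particular platform API.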
Case 2: The stack corruption that caused a stack overflow
The second case I had to look at also exhibited a nested exception chain leading up to a stack overflow. The only difference was that the stack was also corrupted. Here are some frames from the bottom of the stack:
0:000> kn
... <repeated hundreds more times>
f04 002be8c4 7794c881 0xcccccccc
f05 002be8e8 7794c853 ntdll!ExecuteHandler2+0x26
f06 002be9b0 7794c6bb ntdll!ExecuteHandler+0x24
f07 002be9b0 cccccccc ntdll!KiUserExceptionDispatcher+0xf
f08 002becfc 7794c881 0xcccccccc
f09 002bed20 7794c853 ntdll!ExecuteHandler2+0x26
f0a 002bede8 7794c6bb ntdll!ExecuteHandler+0x24
f0b 002bede8 cccccccc ntdll!KiUserExceptionDispatcher+0xf
f0c 002bf134 7794c881 0xcccccccc
f0d 002bf158 7794c853 ntdll!ExecuteHandler2+0x26
f0e 002bf224 7794c6bb ntdll!ExecuteHandler+0x24
f0f 002bf224 010b51d5 ntdll!KiUserExceptionDispatcher+0xf
f10 002bf674 cccccccc NestedExceptions2!trigger_exception+0x65
f11 002bf748 010b5cd9 0xcccccccc
f12 002bf798 010b5ecd NestedExceptions2!__tmainCRTStartup+0x199
f13 002bf7a0 76dc919f NestedExceptions2!mainCRTStartup+0xd
f14 002bf7ac 77960bbb kernel32!BaseThreadInitThunk+0xe
f15 002bf7f0 77960b91 ntdll!__RtlUserThreadStart+0x20
f16 002bf800 00000000 ntdll!_RtlUserThreadStart+0x1b
The 0xcccccccc addresses on the stack look like a sure symptom of a stack corruption. In fact, you probably already have a working hypothesis: the stack has been corrupted with 0xcccccccc, and the exception handling code in ntdll somehow attempts to execute code from the address 0xcccccccc. This causes a nested exception, and we’re in the same situation as in case #1. If we look at the exception records, we can confirm that’s the case (the first exception record is the root cause, and the second exception record is due to the invalid exception handler):
0:000> .exr 002bf23c
ExceptionAddress: 010b51d5 (NestedExceptions2!trigger_exception+0x00000065)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000000
NumberParameters: 2
Parameter[0]: 00000001
Parameter[1]: 00000000
Attempt to write to address 00000000

0:000> .exr 002bee00
ExceptionAddress: cccccccc
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000010
NumberParameters: 2
Parameter[0]: 00000000
Parameter[1]: cccccccc
Attempt to read from address cccccccc
But there are a couple of subtleties. Why would ntdll try to execute code from the invalid address 0xcccccccc? The answer is that in 32-bit Windows applications the exception filter’s address is stored on the stack, as part of a data structure called the exception registration record. If that structure is corrupted, the exception handling code in ntdll might attempt to execute an invalid address, thinking it’s a pointer to an exception filter.
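To make the layout concrete, here is a small portable sketch that models a 32-bit stack frame as a byte array: a local buffer followed by an exception registration record (a pointer to the next record and a pointer to the handler). The buffer size, the frame layout, and the helper names are assumptions made up for this illustration. The real layout depends on the compiler and the function, and the code only simulates the overwrite rather than corrupting a real stack:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// 32-bit layout of an exception registration record: two 4-byte pointers.
struct RegistrationRecord32 {
    std::uint32_t next;     // address of the next (outer) record in the chain
    std::uint32_t handler;  // address of the exception handler routine
};
static_assert(sizeof(RegistrationRecord32) == 8, "two 32-bit pointers");

constexpr std::size_t kBufferSize = 16;  // hypothetical local buffer size
constexpr std::size_t kFrameSize = kBufferSize + sizeof(RegistrationRecord32);

// Simulated stack frame: [ local buffer | registration record ].
using Frame = std::array<std::uint8_t, kFrameSize>;

// Builds a zeroed frame with `record` stored right after the local buffer.
Frame make_frame(RegistrationRecord32 record) {
    Frame frame{};
    std::memcpy(frame.data() + kBufferSize, &record, sizeof record);
    return frame;
}

// Writes `count` bytes of 0xCC starting at the local buffer. A count larger
// than kBufferSize models the overflow; the write is clamped to the frame so
// the simulation itself stays well defined.
void fill_buffer(Frame& frame, std::size_t count) {
    if (count > frame.size()) count = frame.size();
    std::memset(frame.data(), 0xCC, count);
}

// Reads back the registration record, as the dispatcher would on an exception.
RegistrationRecord32 read_record(const Frame& frame) {
    RegistrationRecord32 record{};
    std::memcpy(&record, frame.data() + kBufferSize, sizeof record);
    return record;
}
```

Filling exactly kBufferSize bytes leaves the record alone. Filling kFrameSize bytes makes read_record return 0xcccccccc for both fields, which is the handler address the dispatcher then tries to call in the dump above.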
But that’s actually a pretty serious security vulnerability! If an attacker can overwrite the exception registration record and then trigger an exception, they can achieve arbitrary code execution. In fact, exploiting the exception registration record is one of the ways to overcome stack canary defenses introduced in Visual C++ 2003 (the /GS flag).
Fortunately, Windows has a few tricks up its sleeve. First, Visual C++ 2003 introduced a linker flag called /SAFESEH. When this flag is used, the linker embeds a directory of all valid exception handlers in the binary. At runtime, Windows can check whether the exception handler that’s about to run is in the list of valid exception handlers. If it isn’t, it will flat out refuse to execute the invalid exception handler and halt the process, which is way safer than trying to execute an unknown exception handler.
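The core of that check is a table lookup. The sketch below is a simplified model of the idea and not the actual ntdll logic: ImageInfo and is_registered_handler are hypothetical names, the handler table is modeled as a sorted list of RVAs, and any address outside the image is simply rejected (the real check has additional rules for such addresses):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Simplified model of a loaded image linked with /SAFESEH.
struct ImageInfo {
    std::uint32_t base;                        // load address of the image
    std::uint32_t size;                        // size of the image in bytes
    std::vector<std::uint32_t> safe_handlers;  // sorted RVAs of valid handlers
};

// Returns true only if handler_address falls inside the image and its RVA
// appears in the image's table of registered handlers.
bool is_registered_handler(const ImageInfo& image,
                           std::uint32_t handler_address) {
    if (handler_address < image.base ||
        handler_address - image.base >= image.size)
        return false;  // simplification: not in this image, so not trusted
    const std::uint32_t rva = handler_address - image.base;
    return std::binary_search(image.safe_handlers.begin(),
                              image.safe_handlers.end(), rva);
}
```

With a table like this, a corrupted handler pointer such as 0xcccccccc fails the lookup before any control transfer happens.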
Second, Windows Vista SP1 and Windows Server 2008 introduced another system-wide protection mechanism called SEHOP (Structured Exception Handling Overwrite Protection). By default, SEHOP is disabled on client operating systems (Windows Vista, Windows 7, Windows 8) and enabled on server operating systems (Windows Server 2008 and Windows Server 2012). SEHOP is a fairly simple defense: when it’s on, every thread has a dummy exception registration record at the end of the chain (it is the first record registered when the thread starts). When an exception occurs, the exception handling code in ntdll verifies that the current exception handler chain terminates with that dummy exception registration record. If it doesn’t, the system assumes the exception handler chain has been tampered with, and terminates the process.
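The chain walk can be sketched in a few lines. This is a simulation under stated assumptions, not the real implementation: the stack is modeled as a map from record address to record, chain_is_intact is a made-up name, and the real check also validates things like record alignment and stack bounds that are left out here:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

struct Record {
    std::uint32_t next;     // address of the next record, or kEndOfChain
    std::uint32_t handler;  // address of the handler routine
};

// Simulated stack memory: record address -> record contents.
using SimulatedStack = std::unordered_map<std::uint32_t, Record>;

constexpr std::uint32_t kEndOfChain = 0xFFFFFFFFu;

// Walks the chain from `head` and returns true only if every link points at
// a known record and the last record's handler is the expected final handler.
bool chain_is_intact(const SimulatedStack& stack, std::uint32_t head,
                     std::uint32_t final_handler) {
    std::uint32_t current = head;
    // A valid chain visits each record at most once, so more steps than
    // records means the chain loops back on itself.
    for (std::size_t steps = 0; steps <= stack.size(); ++steps) {
        const auto it = stack.find(current);
        if (it == stack.end())
            return false;  // link points somewhere that is not a record
        if (it->second.next == kEndOfChain)
            return it->second.handler == final_handler;
        current = it->second.next;
    }
    return false;
}
```

Overwriting any record with 0xcccccccc breaks the link to the next record, so the walk never reaches the dummy record and the check fails.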
When either of these defenses is turned on, the situation described above is simply impossible. The control transfer to 0xcccccccc as if it were an exception handler would be prevented and the process would abruptly terminate. Therefore, we can conclude that the system on which this crash occurred is not up to par with modern security guidance: the application should have been compiled with /SAFESEH, and the system should have had SEHOP enabled. By the way, starting with Windows 7, you can configure SEHOP for an individual process without affecting the rest of the system using the Image File Execution Options registry key.
As an illustration, here’s what happens when the binary is compiled with /SAFESEH (note that the /NXCOMPAT flag, which enables software DEP, is also required):
0:000> kn
# ChildEBP RetAddr
00 008aee54 779c625f ntdll!NtWaitForMultipleObjects+0xc
01 008af2b8 779c5e38 ntdll!RtlReportExceptionEx+0x3eb
02 008af314 779e81bf ntdll!RtlReportException+0x9b
03 008af394 7798b2e3 ntdll!RtlInvalidHandlerDetected+0x4e
04 008af3ec 7797734a ntdll!RtlIsValidHandler+0x13f1a
05 008af484 7794c6bb ntdll!RtlDispatchException+0xfc
06 008af484 01284ef5 ntdll!KiUserExceptionDispatcher+0xf
07 008af8d4 cccccccc NestedExceptions2!trigger_exception+0x65
WARNING: Frame IP not in any known module. Following frames may be wrong.
08 008af9a8 012858c9 0xcccccccc
09 008af9f8 01285a0d NestedExceptions2!__tmainCRTStartup+0x199
0a 008afa00 76dc919f NestedExceptions2!mainCRTStartup+0xd
0b 008afa0c 77960bbb kernel32!BaseThreadInitThunk+0xe
0c 008afa50 77960b91 ntdll!__RtlUserThreadStart+0x20
0d 008afa60 00000000 ntdll!_RtlUserThreadStart+0x1b

0:000> !analyze -v
... <removed for brevity>
DEFAULT_BUCKET_ID: APPLICATION_FAULT
PROCESS_NAME: NestedExceptions2.exe
ERROR_CODE: (NTSTATUS) 0xc00001a5 - An invalid exception handler routine has been detected.
EXCEPTION_CODE: (NTSTATUS) 0xc00001a5 - An invalid exception handler routine has been detected.
... <removed for brevity>
Importantly, there is no infinite loop of nested exceptions. The process is immediately terminated and Windows Error Reporting is invoked.
On the other hand, when SEHOP is enabled for that particular process, it prevents the invalid exception handler from executing. In a typical WER dump file, the result appears as though the original exception, which was supposed to be handled normally, ends up unhandled:
0:000> !analyze -v
...
FAULTING_IP:
NestedExceptions2!trigger_exception+65
002c51d5 c705000000002a000000 mov dword ptr ds:[0],2Ah
EXCEPTION_RECORD: ffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 002c51d5 (NestedExceptions2!trigger_exception+0x00000065)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000008
NumberParameters: 2
Parameter[0]: 00000001
Parameter[1]: 00000000
Attempt to write to address 00000000
...
BUGCHECK_STR: APPLICATION_FAULT_NULL_POINTER_WRITE_SEHOP
PRIMARY_PROBLEM_CLASS: NULL_POINTER_WRITE_SEHOP
DEFAULT_BUCKET_ID: NULL_POINTER_WRITE_SEHOP
...
FAILURE_BUCKET_ID: NULL_POINTER_WRITE_SEHOP_c0000005_NestedExceptions2.exe!trigger_exception
BUCKET_ID: APPLICATION_FAULT_NULL_POINTER_WRITE_SEHOP_nestedexceptions2!trigger_exception+65
ANALYSIS_SOURCE: UM
FAILURE_ID_HASH_STRING: um:null_pointer_write_sehop_c0000005_nestedexceptions2.exe!trigger_exception
...
But there is a good hint that SEHOP was involved: the word “SEHOP” appears multiple times in the analysis. I’m not sure if there’s a better way of identifying SEHOP’s involvement, but that’s good enough for me.
Two takeaways from these cases. First, when dealing with an infinite loop of nested exceptions, don’t panic. You need to identify the original exception that started the chain, and then look for a repeating pattern. An exception filter or handler at some point in the chain must have failed, and a series of control transfers leads back to the same exception filter.
Second, structured exception handling is a vulnerable mechanism, especially in 32-bit applications. Make sure you’re using all the protection afforded by your compiler and operating system: SafeSEH, DEP, and SEHOP.
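For the first point, spotting the repeating pattern by eye gets tedious when the stack has thousands of frames. Here is a rough sketch of how the search could be automated over a list of frame names, innermost frame first, as they appear in the k output. The names LeadingCycle and find_leading_cycle are invented for this example, and it is a heuristic rather than a debugger feature:

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct LeadingCycle {
    std::size_t period;            // frames per repetition, 0 if none found
    std::size_t first_tail_frame;  // index of the first frame after the repeats
};

// Finds the shortest group of frames that repeats at least twice at the top
// of the stack. frames[0] is the innermost frame.
LeadingCycle find_leading_cycle(const std::vector<std::string>& frames) {
    const std::size_t n = frames.size();
    for (std::size_t period = 1; period * 2 <= n; ++period) {
        // Count how far the stack keeps matching itself shifted by `period`.
        std::size_t matched = 0;
        while (matched + period < n &&
               frames[matched] == frames[matched + period])
            ++matched;
        if (matched >= period)  // at least two full repetitions
            return {period, matched + period};
    }
    return {0, 0};
}
```

The frame at first_tail_frame is where the repetition stops, which is a good place to start looking for the original exception. In the first case above, that frame would be the trigger_exception call that raised the initial access violation.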
I am posting short links and updates on Twitter as well as on this blog. You can follow me: @goldshtn