A while ago I was explaining runtime mechanisms like the stack and the heap to some folks. (As an aside, I’m writing a debugger course on “Advanced .NET Debugging with WinDbg with SOS”, which is an ongoing project. Time will tell when it’s ready to hit the streets.) Since the context was functional programming, where recursion is a typical substitute (or fuel, if you will) for loops, an obvious topic for discussion was the possibility of hitting a stack overflow. Armed with my favorite editor, Notepad.exe, and the C# command-line compiler, I quickly entered the following sample to show “looping with recursion” and how disaster can strike:
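static void Main() { Rec(0); }

static void Rec(int n)
{
    if (n % 1024 == 0)
        Console.WriteLine(n); // only print every 1024th value
    Rec(n + 1);               // unbounded recursion: "looping" by calling ourselves
}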
The modulo-based condition in there avoids excessive slowdowns from Console.WriteLine, which is rather slow due to the way the Win32 console output system works. To my initial surprise, the overflow didn’t come anywhere in sight and the application kept running happily, printing ever-increasing multiples of 1024.
I had rather expected the process to be torn down by a StackOverflowException.
So, what’s going on here? Though I realized pretty quickly what the root cause of this unexpectedly good behavior is, I’ll walk the reader through the thought process used to “debug” the application’s code.
I made a call, didn’t I?
The first thing to check is that we really are making a recursive call in our Rec method. Obviously ildasm is the way to go to inspect that kind of stuff, and its output indeed shows a call instruction to Rec, which we did expect.
In fact, the statement made above – “which we did expect” – is debatable. Couldn’t the compiler just turn the call into a jump right to the start of the method after messing around a bit with the local argument slot that holds argument value n? That way we wouldn’t have to make a call and the code would still work as expected. Essentially what we’re saying here is that the compiler could have turned the recursive call into a loop construct. And indeed, some compilers do exactly that. For example, consider the following F# sample:
let rec Rec n =
    if n % 1024 = 0 then
        printfn "%d" n
    Rec (n + 1)
Notice the explicit indication of the recursive nature of a function by means of the “rec” keyword. After compiling this piece of code using fsc.exe, the following code is shown in Reflector (decompiling to C# syntax) for the Rec function:
The mechanics of the printf call are irrelevant. What matters is the code that’s executed after the n++ statement, which isn’t a recursive call to Rec itself. Instead, the compiler has figured out a loop can be used. Hence, no StackOverflowException will result.
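To make the shape of that transformation concrete, here is a hand-written C# sketch of what "recursion turned into a loop" looks like for Rec. It is not the Reflector output. The class name LoopSketch, the method name RecAsLoop, the limit parameter and the returned count are all assumptions added for illustration, so that the sketch terminates and its behavior can be checked; the original Rec runs forever.

```csharp
using System;

static class LoopSketch
{
    // Approximation of the loop a compiler can derive from the recursive Rec.
    // 'limit' and the returned count are hypothetical additions for this sketch.
    public static int RecAsLoop(int n, int limit)
    {
        int printed = 0;
        while (n < limit)
        {
            if (n % 1024 == 0)
            {
                Console.WriteLine(n);
                printed++;
            }
            n++; // takes the place of the recursive call Rec(n + 1)
        }
        return printed;
    }
}
```

Calling LoopSketch.RecAsLoop(0, 4096) prints 0, 1024, 2048 and 3072 without ever growing the stack, since the argument slot n is simply updated in place.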
Back to the C# sample though. What protected the code from overflowing the stack? Let’s investigate further, but first … some background.
One optimization that can be carried out for recursive functions is to spot tail calls and turn them into looping constructs or, at a lower level, jumps. A tail call is basically a call after which the caller does nothing but return, so the current stack frame is no longer needed once the call has been made. For example, our simple sample can benefit from tail call optimization since the Rec method doesn’t do anything anymore after returning from the recursive Rec call:
static void Rec(int n)
{
    if (n % 1024 == 0)
        Console.WriteLine(n);
    Rec(n + 1); // tail call: nothing left to do after it returns
}
This kind of optimization – as carried out by F# in the sample shown earlier – can’t always take place. For example, consider the following definition of a factorial method:
static int Fac(int n)
{
    if (n == 0)
        return 1;
    return n * Fac(n - 1);
}
The above has quite a few issues such as the inability to deal with negative values and obviously the arithmetic overflow disaster that will strike when the supplied “n” parameter is too large for the resulting factorial to fit in an Int32. The BigInteger type introduced in .NET 4 (and not in .NET 3.5 as originally planned) would be a better fit for this kind of computation, but let’s ignore this fact for now.
A more relevant issue in the context of our discussion is the code’s use of recursion where a regular loop would suffice, but now I’m making a value judgment of imperative control flow constructs versus a more functional style of using recursion. What’s true nonetheless is that the code above is not immediately amenable to tail call optimization. To see why, rewrite the code as follows:
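static int Fac(int n)
{
    if (n == 0)
        return 1;
    int t = Fac(n - 1); // recursive call is not the last thing we do
    return n * t;       // n is still needed after the call returns
}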
See what’s going on? After returning from the recursive call to Fac, we still need to have access to the value of “n” in the current call frame. As a result, we can’t reuse the current stack frame when making the recursive call. Implementing the above in F# (just for the sake of it) and decompiling it, shows the following code:
The culprit keeping us from employing tail call optimization is the multiplication instruction needed after the return from the recursive call to Fac. (Note: the second operand to the multiplication was pushed onto the evaluation stack in IL_0005; in fact IL_0006 could also have been a dup instruction.) C# code will be slightly different but achieve the same computation (luckily!).
Sometimes it’s possible to make a function amenable to tail call optimization by carrying out a manual rewrite. In the case of the factorial method, we can employ the following trick:
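static int Fac(int n)
{
    return Fac_(n, 1);
}
static int Fac_(int n, int res)
{
    if (n == 0)
        return res; // the accumulator holds the final result
    return Fac_(n - 1, n * res); // tail call: multiplication happens before the call
}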
Here, we’re not only decrementing n in every recursive call, we’re also carrying the running product along in the res parameter. In my post Jumping the trampoline in C# – Stack-friendly recursion, I explained this principle in the “Don’t stand on my tail!” section. The F# equivalent of the code, shown below, results in tail call optimization once more:
let rec Fac_ n res =
    if n = 0 then
        res
    else
        Fac_ (n - 1) (n * res)
let Fac n =
    Fac_ n 1
The compilation result is shown below:
You can clearly see the reuse of local argument slots.
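In C# terms, that reuse of argument slots amounts to something like the loop below. This is a hand-written sketch of the idea, not the decompiler's actual output; the names FacSketch and FacLoop are made up for illustration. Like the original Fac, it makes no attempt to handle negative input or arithmetic overflow.

```csharp
static class FacSketch
{
    public static int Fac(int n)
    {
        return FacLoop(n, 1);
    }

    // Loop form of the accumulator-passing Fac_: instead of making a call
    // with new argument values, the existing argument slots are overwritten.
    static int FacLoop(int n, int res)
    {
        while (n != 0)
        {
            res = n * res; // computed with the old n, as in Fac_(n - 1, n * res)
            n = n - 1;
        }
        return res;
    }
}
```

Note the order of the two assignments: the product must be computed before n is decremented, mirroring the fact that both arguments of the recursive call are evaluated against the current value of n.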
A smart JIT
All of this doesn’t yet explain why the original C# code works just fine, even though our look at the generated IL code in the second section of this post did reveal that the call instruction really is there. One more party is involved in getting our much beloved piece of C# code to run on the bare metal of the machine: the JIT compiler.
In fact, as soon as I saw the demo not working as intended, it clicked that I should go and check this possibility. Why? Well, the C# compiler doesn’t optimize tail calls into loops, nor does it emit the “tail.” prefix on call instructions. The one and only remaining party is the JIT compiler. And indeed, since I’m running on x64 and am using the command-line compiler, the 64-bit JIT compiler is in play, and it is more aggressive about performing tail call optimizations.
Let’s explain a few things about the previous paragraph. First of all, why does the use of the command-line compiler matter? Won’t the same result pop up if I used a Console Application project in Visual Studio? Not quite, if you’re using Visual Studio 2010 that is. One of the decisions made in the last release is to mark executable IL assemblies (managed .exe files) as 32-bit only. That doesn’t mean the image contains 32-bit instructions (in fact, the C# compiler never emits raw assembler); all it does is tell the JIT to only emit 32-bit assembler at runtime, hence resulting in a WOW64 process on 64-bit Windows. The reasons for this are explained in Rick Byers’ blog post on the subject. In our case, we’re running the C# compiler without the /platform:x86 flag, which is now passed by the default settings of a Visual Studio 2010 executable (not library!) project, and therefore end up with an “AnyCPU” assembly. The corflags.exe tool can be used to verify this claim:
In Visual Studio 2010, a new Console Application project will have the 32-bit only flag set by default. Again, reasons for this decision are brought up in Rick’s post on the subject.
Indeed, when running the 32-bit only assembly, a StackOverflowException results. An alternative way to tweak the flags of a managed assembly is by using corflags.exe itself, as shown below:
It turns out that when the 64-bit JIT is involved, i.e. when the AnyCPU platform target is set (the default for the csc.exe compiler) and the process runs on 64-bit Windows, tail call optimization is carried out for our piece of code. A whole bunch of conditions under which tail calls can be optimized by the various JIT flavors can be found on David Broman’s blog. Grant Richins has been blogging about improvements made in .NET 4 (which don’t really apply to our particular sample). One important change in .NET 4 is the fact that the 64-bit JIT now honors the “tail.” prefix on call instructions, which is essential to the success of functional style languages like F# (indeed, F#’s compiler actually has a tailcalls flag, which is on by default due to the language’s nature).
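Besides inspecting the assembly with corflags.exe, a program can also report from the inside whether it ended up in a 32-bit or a 64-bit process, and hence which JIT flavor compiled it. The following is a minimal sketch using the pointer size; the class name JitFlavor and the method name Describe are hypothetical. It only tells you the bitness of the process and says nothing about whether a particular call was actually optimized.

```csharp
using System;

static class JitFlavor
{
    // IntPtr.Size is 8 in a 64-bit process and 4 in a 32-bit (e.g. WOW64) process.
    public static string Describe()
    {
        return IntPtr.Size == 8 ? "64-bit" : "32-bit";
    }
}
```

Printing JitFlavor.Describe() at the start of Main makes it easy to correlate the observed behavior (endless run versus StackOverflowException) with the platform target the executable was built for.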
Seeing the 64-bit JIT’s work in action
In order to show the reader the generated x64 code for our recursive Rec method definition, we’ll switch gears and open up WinDbg, leveraging the SOS debugger extension. Obviously this requires one to install the Debugging Tools for Windows. Also note that this section’s title refers to x64. For x86 users, the same experiment can be carried out, revealing the x86 instructions generated without the tail call optimization, hence explaining the overflow observed on 32-bit executions.
Loading the ovf.exe sample (making sure the 32-bit only flag is not set!) under the WinDbg debugger – using windbg.exe ovf.exe – brings us to the first loader breakpoint as shown below. In order to load the Son Of Strike (SOS) debugger extension, set a module load breakpoint for clrjit.dll (which puts us in a convenient spot where the CLR has been sufficiently loaded to use SOS successfully). When that breakpoint hits, the extension can be loaded using .loadby sos clr:
Next, we need to set a breakpoint on the Rec method. In my case, the assembly’s file name is ovf.exe, the class is Program and the method is Rec, requiring me to enter the following commands:
The !bpmd extension command is used to set a breakpoint based on a MethodDesc, a structure used by the CLR to describe a method. Since the method hasn’t been JIT compiled yet, and hence no physical address for the executable code is available yet, a pending breakpoint is added. Now we let the debugger go and end up hitting the breakpoint, which got set automatically when the JIT compiler took care of compiling the method (since it came “in sight” for execution, i.e. because of Main’s call into it). Using the !U command (for unassemble) we can now see the generated code:
Notice the presence of code like InitializeStdOutError, which is the result of inlining the Console.WriteLine method’s code. What’s going on here with regard to the tail call behavior is the replacement of a call instruction with a jump straight to the beginning of the generated code. The rest of the code can be deciphered with a bit of x86/x64 knowledge. For one thing, you can recognize the 1024 value (used for our modulo arithmetic) in 3FF, which is 1023, i.e. 1024 minus 1. The modulo check stretches over a few instructions that basically apply a mask to the value to see whether any of the low 10 bits is non-zero. If so, the value is not divisible by 1024; otherwise, it is. Based on this test (whose value gets stored in eax), a jump is made or not, either going through the path of calling Console.WriteLine or not.
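The masking idea can be expressed in C# as well. The sketch below only illustrates the arithmetic identity behind the 3FF constant; it is not a transcription of the instructions the JIT emits, and the class and method names are made up for illustration.

```csharp
static class ModuloMask
{
    // The check as written in the sample.
    public static bool DivisibleByModulo(int n)
    {
        return n % 1024 == 0;
    }

    // Equivalent check using a mask: 0x3FF is 1023, i.e. the low 10 bits set.
    // A value is a multiple of 1024 exactly when all of those bits are zero.
    public static bool DivisibleByMask(int n)
    {
        return (n & 0x3FF) == 0;
    }
}
```

Both methods agree for every int, including negative values, because in two's complement a number is a multiple of 1024 exactly when its low 10 bits are all zero.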
Contrasting with the x86 assembler being used
In the x86 setting, we’ll see different code. To show this, let’s use a Console Application in Visual Studio 2010, whose default platform target is – as mentioned earlier – 32-bit. In order to load SOS from inside the Immediate Window, enable the native debugger through the project settings:
Using similar motions as before, we can load the SOS extension upon hitting a breakpoint. Instead of using !bpmd, we can use !name2ee to resolve the JITTED Code Address for the given symbol, in this case the Program.Rec method:
Inspecting the generated code, one will encounter the following call instruction to the same method. This is the regular recursive call without any tail call optimization carried out. Obviously this will cause a StackOverflowException to occur. Also notice from the output below that the Console.WriteLine method call didn’t get inlined in this particular x86 case.
Revisiting the tail. instruction prefix
As referred to before, the IL instruction set has a tail. prefix for call instructions. Before .NET 4, this was merely a hint to the JIT compiler. For x86, it was (and still is) a request of the IL generator to the JIT compiler to perform a tail call. For x64, prior to CLR 4.0, this request was not always granted. For our x86 case, we can have a go at inserting the tail. prefix for the recursive call in the code generated by the C# compiler (which doesn’t emit this instruction by itself as explained before). Using ildasm’s /out parameter, you can export the ovf.exe IL code to a text file. Notice the COR flags have been set to “32-bit required” using either the x86 platform target flag on csc.exe or by using corflags /32bit+:
Now tweak the code of Rec as shown below. After a tail call instruction, no further code should execute other than a ret. If this rule isn’t obeyed, the CLR will throw an exception signaling an invalid program. Hence we remove the nop instruction that resulted from a non-optimized build (Debug build or csc.exe use without the /o+ flag). To turn the call into a tail call, we add the “tail.” prefix. Don’t forget the space after the dot though:
The round trip through ILDASM and ILASM, with the manual tweak made in Notepad as described above, is shown here:
With this change in place, ovf.exe will keep on running without overflowing the stack. Looking at the generated code through the debugger, one would see a jmp instruction instead of a call, explaining the fixed behavior.
Tail calls are the bread and butter of iterative programs written in a functional style. As such, the CLR has evolved to support tail call optimization in the JIT when the tail. prefix is present, e.g. as emitted by the F# compiler when needed (though the F# compiler may also turn the recursion into a loop at the IL level, as we saw earlier). One thing to know is that on x64, the JIT is more aggressive about detecting and carrying out tail recursive calls (since it has a good value proposition with regard to “runtime intelligence cost” versus “speed-up factor”). For more information, I strongly recommend having a look at the CLR team’s blog: Tail Call Improvements in .NET Framework 4.