How Rust Implements Tagged Unions (Part 2)

In this second and final part, we take a deeper look at how Rust implements tagged unions and how it compares with tagged unions in C.

The movw x86 Instruction

What does movw mean? And what about -32(%rbp)?

It turns out x86 assembly language isn’t that hard to follow once you learn the basic syntax. For a quick introduction, see my article from 2016: Learning to Read x86 Assembly Language. Intel, the company that built the microprocessor inside my Mac, defines the mov instruction to mean “move.” (Note: the instructions that rustc --emit asm generates aren’t written in Intel x86 syntax, but in GAS (AT&T) syntax, where the source operand comes first and the destination comes second.)

Here’s a diagram showing what the first movw instruction moves:

It turns out that movw stands for “move a word.” In x86 terminology, a word is 16 bits, or 2 bytes. There are a few different variations on move: movb, movw, movl, and movq, which move 1 byte, 2 bytes, 4 bytes, and 8 bytes, respectively.
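To make the suffix sizes concrete, here is a small Rust sketch (not from the original article) that pairs each GAS suffix with the Rust unsigned integer type of the same width. The helper name operand_size is hypothetical and exists only for illustration.

```rust
use std::mem::size_of;

/// Bytes moved by a GAS `mov` instruction with the given size suffix.
/// `operand_size` is an illustrative helper, not part of any library.
fn operand_size(suffix: char) -> Option<usize> {
    match suffix {
        'b' => Some(size_of::<u8>()),  // movb: 1 byte
        'w' => Some(size_of::<u16>()), // movw: 2 bytes (a "word")
        'l' => Some(size_of::<u32>()), // movl: 4 bytes (a "long")
        'q' => Some(size_of::<u64>()), // movq: 8 bytes (a "quad")
        _ => None,                     // not a size suffix covered here
    }
}

fn main() {
    for suffix in ['b', 'w', 'l', 'q'] {
        println!("mov{} moves {:?} byte(s)", suffix, operand_size(suffix));
    }
}
```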

Next, the $ notation indicates a literal value – in this case, zero: $0. Now we can see the first instruction above is moving 2 bytes containing the value zero. Similarly, the second instruction is moving 2 bytes containing the value 1234:

The rbp Register

But where are these movw instructions moving these values to? To understand that, we need to understand the odd -32(%rbp) syntax on the right side of the instructions. The % sign indicates a register inside my Mac’s microprocessor, in this case, the “base pointer” register. So “bp” means “base pointer.” And the “r” prefix in “rbp” means the move instruction is using all 8 bytes (64 bits) of this register’s value.

The -32(%rbp) notation calculates a memory address for the instruction using the contents of the %rbp register – in this case, the address of where to move the data to. The expression -32(%rbp) in English means: “Take the 64-bit memory address value from the base pointer register, and subtract 32 from it.”
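The address arithmetic can be sketched in a few lines of Rust. This is only a model of what the processor does, not code from the article: the function name effective_address and the base pointer value are made up for illustration.

```rust
/// Models the GAS operand `disp(%rbp)`: the register's value plus a signed
/// displacement. `effective_address` is a hypothetical helper.
fn effective_address(base: u64, displacement: i64) -> u64 {
    // Casting a negative i64 to u64 and adding with wraparound is the same
    // as subtracting its magnitude in two's complement arithmetic.
    base.wrapping_add(displacement as u64)
}

fn main() {
    // Hypothetical value for %rbp; a real one depends on the running process.
    let rbp: u64 = 0x7ffe_e000_1000;
    let target = effective_address(rbp, -32);
    println!("-32(%rbp) refers to address {:#x}", target);
}
```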

Compiled Rust programs – all programs really – that run on the x86 platform store values for local variables on the stack, using the base pointer register in this fashion. The base pointer, as its name indicates, stores the base address of my program’s current stack frame. Each local variable in my code, for example, a_number, is saved somewhere on the stack. If you’re not familiar with the concept of a stack, think of it as a convenient place for quickly saving and retrieving values while your program is running.

How Rust Saves an Integer Enum Variant

Taking a step back for a moment, here’s what we’ve learned so far. When I save an enum value containing an integer, Rust saves two values, 0 and 1234:

What does the 0 mean? Rust records a zero to indicate that a_number uses the NumOrStr::Num variant. In other words, a_number is a tagged union, and the zero value is the tag. We know the tag occupies 2 bytes because of the movw instruction above. The integer value itself, 1234, also takes 2 bytes because I declared it using Num(i16), and we saw Rust used a movw to save that also.
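The tag and payload can also be observed from Rust itself. The sketch below is an illustration under one loud assumption: it adds #[repr(u16)] to the enum, which guarantees a 16-bit tag at the start of the value. The enum in this article uses Rust's default representation, whose layout is unspecified and may differ between compiler versions, so treat this as a model of what the assembly shows rather than a description of every build. The helper names tag_of and num_payload_raw are hypothetical.

```rust
// Assumption: #[repr(u16)] pins the tag to a 16-bit integer at offset 0 so
// the raw reads below are well defined. The default representation makes no
// such promise.
#[allow(dead_code)]
#[repr(u16)]
enum NumOrStr {
    Num(i16),    // tag 0
    Str(String), // tag 1
}

/// Reads the 16-bit tag stored at the start of the enum value.
fn tag_of(value: &NumOrStr) -> u16 {
    // Sound only because of #[repr(u16)]: the tag is the first field.
    unsafe { *(value as *const NumOrStr).cast::<u16>() }
}

/// Reads the 16-bit word right after the tag, but only when the tag says
/// the value is the Num variant.
fn num_payload_raw(value: &NumOrStr) -> Option<i16> {
    if tag_of(value) != 0 {
        return None;
    }
    // With #[repr(u16)], the i16 payload of Num sits at byte offset 2.
    Some(unsafe { *(value as *const NumOrStr).cast::<i16>().add(1) })
}

fn main() {
    let a_number = NumOrStr::Num(1234);
    println!("tag = {}", tag_of(&a_number));
    println!("payload = {:?}", num_payload_raw(&a_number));
}
```

Checking the tag before reading the payload mirrors what safe Rust does for you with match and if let.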

How Rust Saves a String Enum Variant

But what about the other variant, the string? When I save a string in NumOrStr, what does Rust do? To find out, I’ll replace my main function from above with this line of code:

Then I’ll compile it again using the --emit asm option. Now I find this assembly language code in the union.s file:

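Unfortunately, this code snippet is much more complex: it first calls String::from, passing a string literal, and stores the resulting string in the enum. It also refers to a function called drop_in_place, which does not save the string; it is the compiler-generated destructor code that frees the string when a_string goes out of scope. All of this is much harder to follow.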

Rather than trying to figure this out, I decided to debug my Rust sample program using LLDB and inspect the memory a_string occupies. I found 26 bytes of data representing the string variant (not counting any padding the compiler may insert for alignment), starting with a 16-bit word containing 1:

This is again the tag; in this case, 1 means a_string uses the NumOrStr::Str variant. Following this, I found a pointer to the string itself:

Pointers on a 64-bit microprocessor occupy 8 bytes and contain the memory address of something, in this case, my string, “This is a test.” After the pointer, I found two 64-bit values, each containing 15:

These are two attributes of the string: its capacity and length. By inspecting my process’s memory, I’ve started to learn a bit about how Rust manages memory for strings.
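You can see the same three attributes without a debugger by asking the String for them. This is a minimal sketch, not the LLDB session described above; the helper name string_parts is hypothetical. The standard library only promises that the capacity is at least the length, so the sketch does not assume the two are equal.

```rust
/// Returns the three values a String keeps track of: the address of its
/// heap buffer, its capacity in bytes, and its length in bytes.
/// `string_parts` is an illustrative helper.
fn string_parts(s: &String) -> (*const u8, usize, usize) {
    (s.as_ptr(), s.capacity(), s.len())
}

fn main() {
    let text = String::from("This is a test.");
    let (pointer, capacity, length) = string_parts(&text);
    println!("pointer:  {:p}", pointer);
    println!("capacity: {}", capacity);
    println!("length:   {}", length);
}
```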

But what’s important for me today is the first word, the value 1. Again, we see the same pattern. Rust saves an integer value, the tag, indicating which variant this instance of the enum uses. Then Rust saves the enum variant’s payload in the memory that follows:

Tagged Unions in Rust and C

Let’s review by declaring a tagged union in C and Rust:

In the C version, I have to include the tag explicitly in a surrounding struct. Rust handles this for me automatically, saving the tag value inside the enum alongside the enum’s value. The code looks very different, but as we saw above, the underlying representation is essentially the same: a tag followed by the payload.
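To show what the explicit C approach involves while staying in one language, here is a sketch that builds the tagged union by hand in Rust, using a union wrapped in a struct with a separate tag field. None of these names (Tag, Payload, ManualNumOrStr, describe) come from the article, and the string payload is simplified to a &'static str so the union stays trivially copyable.

```rust
// Hand-rolled tag, playing the role of the C enum that names the variants.
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(u16)]
enum Tag {
    Num = 0,
    Str = 1,
}

// Untagged union: both fields share the same memory, as in C.
#[derive(Clone, Copy)]
union Payload {
    num: i16,
    text: &'static str,
}

// The surrounding struct that pairs the tag with the union.
#[derive(Clone, Copy)]
struct ManualNumOrStr {
    tag: Tag,
    payload: Payload,
}

/// Reads the payload field that matches the tag. Nothing stops a caller
/// from reading the wrong field, which is why each read needs `unsafe`.
fn describe(value: &ManualNumOrStr) -> String {
    match value.tag {
        Tag::Num => format!("number {}", unsafe { value.payload.num }),
        Tag::Str => format!("string {}", unsafe { value.payload.text }),
    }
}

fn main() {
    let a_number = ManualNumOrStr { tag: Tag::Num, payload: Payload { num: 1234 } };
    println!("{}", describe(&a_number));
}
```

Keeping the tag and the payload in sync is entirely the programmer's job here, which is the bookkeeping a Rust enum does automatically.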

Using a tagged union looks somewhat similar in C and Rust:

But there are very important differences here! Using C, I need to remember to check the tag and to use the proper variant inside the union. The Rust compiler, on the other hand, checks the tag for me automatically and won’t allow me to access the wrong variant. The code inside of if let will never be executed unless the internal tag value matches the NumOrStr::Num variant.
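The compiler-checked access described above can be sketched as follows. The enum matches the variants named in this article; the helper number_in is hypothetical and only illustrates the if let pattern.

```rust
#[allow(dead_code)]
enum NumOrStr {
    Num(i16),
    Str(String),
}

/// Returns the integer only if `value` really is the Num variant.
/// The body of the `if let` cannot run for any other variant.
fn number_in(value: &NumOrStr) -> Option<i16> {
    if let NumOrStr::Num(n) = value {
        Some(*n)
    } else {
        None
    }
}

fn main() {
    let a_number = NumOrStr::Num(1234);
    let a_string = NumOrStr::Str(String::from("This is a test."));
    println!("{:?}", number_in(&a_number));
    println!("{:?}", number_in(&a_string));
}
```

There is no way in safe Rust to reach the i16 without going through a pattern like this, so the tag check cannot be forgotten.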

Under the hood, the two languages implement tagged unions the same way. But writing code in C and Rust is very different. C encourages me to write dangerous, crashing code, while Rust prevents me from writing dangerous code in the first place.

Published at DZone with permission of Pat Shaughnessy , DZone MVB. See the original article here.
