How Rust Implements Tagged Unions (Part 1)
In this two-part series, we look at how Rust tackles unions and, more specifically, tagged unions. In Part 1, we look at unions in C and Rust.
Rust describes itself as:
…a systems programming language that runs blazingly fast, prevents segfaults, and guarantees thread safety.
Of course, this is in contrast to C, a different systems programming language that encourages segfaults and makes no guarantees at all about thread safety. Rust improves on C in many ways, most famously with its innovative ownership model for managing memory.
Another, less obvious improvement Rust makes over C has to do with unions. The Rust compiler implements tagged unions, which prevent you from crashing your program by initializing a union with one variant and accessing it with another. Rust does have a union keyword, but reading a union’s fields requires unsafe code, and it exists mainly for interoperating with C; in everyday code, Rust instead uses enum to improve on both C enums and C unions at the same time.
Not sure what a tagged union is? Or why it’s an improvement over an old-fashioned C union? Today I’ll explain. First I’ll start with a quick review of C unions, how they work, and why they are dangerous. Then I’ll show you how Rust enums improve on them.
Unions are one of the most dangerous features of C. Here’s an example: a union named num_or_str that saves either a number or a character pointer, but not both (a union can contain any number of members; for simplicity, my example union has only two). For an instance of num_or_str, the C compiler allocates enough memory to hold the largest member of the union, but not both members at the same time. The integer is a short, which typically occupies 16 bits or two bytes, and the string is a char pointer, which takes 64 bits or 8 bytes on a typical 64-bit platform. The two options for what might be stored in the union, the number and str in this example, are known as variants.
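The union described above might be declared roughly as follows. This is only a sketch: the member name num is an assumption (the text names only str), and check_layout is a hypothetical helper added to make the memory layout visible.

```c
#include <assert.h>
#include <stddef.h>

/* Holds either a number or a string pointer, never both.
   The member name `num` is assumed; `str` comes from the text. */
union num_or_str {
    short num;  /* typically 2 bytes */
    char *str;  /* 8 bytes on a typical 64-bit platform */
};

/* Size of the whole union. */
size_t num_or_str_size(void) {
    return sizeof(union num_or_str);
}

/* Hypothetical helper: checks that the union is at least as large as
   each member and that both members start at the same address. */
void check_layout(void) {
    union num_or_str value;
    assert(sizeof value >= sizeof(short));
    assert(sizeof value >= sizeof(char *));
    assert((void *)&value.num == (void *)&value.str);
}
```

Because both members start at the same address, writing one overwrites the bytes of the other.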
Why C Unions Are Dangerous
Unions are dangerous because you, the C programmer, need to remember which variant you set in the union. If you save one type of value but then access the other, your program will read garbage and, if it treats that garbage as a pointer, will most likely crash.
For example, code that saves a number in a union variable called a_number and later reads it back as a number works fine.
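A minimal sketch of the safe case, assuming the number member is called num and using a hypothetical function name, save_and_read_number:

```c
#include <assert.h>

/* Same union as in the text; the member name `num` is assumed. */
union num_or_str {
    short num;
    char *str;
};

/* Saves a number in the union and reads the same member back. */
short save_and_read_number(short n) {
    union num_or_str a_number;
    a_number.num = n;            /* store the number variant */
    assert(a_number.num == n);   /* reading the member we wrote is safe */
    return a_number.num;
}
```

Calling save_and_read_number(4) returns 4, because the code reads the same variant it wrote.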
But if you forget a_number contains a number, and use a_number as a string instead, your program will most likely crash.
Notice the C compiler didn’t help me here at all. It didn’t display any sort of warning or error when I wrote a_number.str. It silently allowed me to write dangerous code; in fact, union syntax encouraged me to introduce a segmentation fault.
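To make the danger concrete, here is a hedged sketch of a function that treats the union as a string. The function name length_as_string and the member name num are assumptions. It compiles cleanly, yet it is only safe when the caller most recently stored a string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same union as in the text; the member name `num` is assumed. */
union num_or_str {
    short num;
    char *str;
};

/* Treats the union as a string and returns its length.
   The compiler accepts this no matter what was stored. If the caller
   last wrote value->num, then value->str is a garbage pointer and
   strlen() dereferences it: undefined behavior, usually a segfault. */
size_t length_as_string(const union num_or_str *value) {
    assert(value != NULL);
    return strlen(value->str);
}
```

Nothing in the type of the parameter tells the function, or the compiler, which variant is actually stored.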
Writing C code with unions is like driving very fast down a highway full of potholes. You might be the best driver in the world, but eventually, you’re going to hit one of the holes and crash.
C programmers have been writing code with unions for years – for decades in fact. How have they avoided this problem? There must be a safe way of writing C code with unions.
The most common and robust solution is to keep track of which union variant is valid using an integer value saved right next to the union in memory. This integer is known as a tag, and the combination of the tag and the union is a tagged union.
Here’s an example: I allocate some memory for the tag right before the union by wrapping both in a struct. C structs, unlike unions, allocate enough memory to store all of their members at once. Note: Using two bytes to save a small integer value is unnecessary. C programs often use only one byte or even represent the integer value using a bit mask inside the union’s values. But the principle remains the same.
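One way such a tagged union might be declared is sketched below. The struct name tagged_num_or_str, the field names tag and value, and the helper check_tagged_layout are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Same union as in the text; the member name `num` is assumed. */
union num_or_str {
    short num;
    char *str;
};

/* Hypothetical tagged union: a two-byte tag stored ahead of the union. */
struct tagged_num_or_str {
    short tag;               /* records which variant is valid */
    union num_or_str value;  /* the payload */
};

/* Hypothetical helper: checks that the tag comes first and that the
   struct has room for the tag and the union at the same time. */
void check_tagged_layout(void) {
    assert(offsetof(struct tagged_num_or_str, tag) == 0);
    assert(offsetof(struct tagged_num_or_str, value) >= sizeof(short));
    assert(sizeof(struct tagged_num_or_str)
           >= sizeof(short) + sizeof(union num_or_str));
}
```

The compiler may add padding between the tag and the union so the pointer member stays aligned, which is why the checks use >= rather than ==.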
Now when I save an integer in an instance of the union, I can also set the tag to the value 1, for example, which I decide will mean that a_number contains a number.
And if I want to save a string instead, I set the tag to 2, for example.
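Setting the tag together with the value could look like the sketch below. The tag values 1 and 2 come from the text; the constant names, the struct layout, and the setter functions are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Tag values from the text; the constant names are assumed. */
enum { TAG_NUMBER = 1, TAG_STRING = 2 };

union num_or_str {
    short num;  /* member name assumed */
    char *str;
};

struct tagged_num_or_str {
    short tag;
    union num_or_str value;
};

/* Stores a number and records that fact in the tag. */
void set_number(struct tagged_num_or_str *t, short n) {
    assert(t != NULL);
    t->tag = TAG_NUMBER;
    t->value.num = n;
}

/* Stores a string pointer and records that fact in the tag. */
void set_string(struct tagged_num_or_str *t, char *s) {
    assert(t != NULL);
    t->tag = TAG_STRING;
    t->value.str = s;
}
```

Routing every write through setters like these keeps the tag and the value in step, but nothing stops other code from assigning to the fields directly.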
Later, when I access the tagged union, I first check the tag before deciding which variant I can access.
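Checking the tag before touching the union might be sketched like this. The function describe, its return convention, and all type and constant names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Tag values from the text; the constant names are assumed. */
enum { TAG_NUMBER = 1, TAG_STRING = 2 };

union num_or_str {
    short num;  /* member name assumed */
    char *str;
};

struct tagged_num_or_str {
    short tag;
    union num_or_str value;
};

/* Writes a description of the stored value into buf.
   Returns 0 on success, or -1 if the tag is not one we recognize. */
int describe(const struct tagged_num_or_str *t, char *buf, size_t len) {
    assert(t != NULL && buf != NULL);
    if (t->tag == TAG_NUMBER) {
        /* Only the num member is valid here. */
        snprintf(buf, len, "number: %d", t->value.num);
        return 0;
    } else if (t->tag == TAG_STRING) {
        /* Only the str member is valid here. */
        snprintf(buf, len, "string: %s", t->value.str);
        return 0;
    }
    return -1;  /* unknown tag: touch neither member */
}
```

The final return handles tag values the code does not know about, a case the programmer has to remember to write.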
Of course, tagged unions are not foolproof. I invented the tag values 1 and 2 and wrote the code that checks for them. There’s nothing to prevent me from forgetting to save the tag value, saving the wrong tag value, or misinterpreting the tag value when I read it later. And, if I ever add new variants to the union, I have to add a new branch to every if statement in my app that checks the tags, handling the new value. Needless to say, the C compiler won’t help me find those if statements or check whether I’ve covered all the possible tag values.
I’m a forgetful and easily distracted person. I need a programming language that will keep me out of trouble. Even with tagged unions I’m sure I would write dangerous, crashing C code before long.
Tagged Unions in Rust
Rust implements tagged unions using the enum keyword. For example, to declare a Rust enum type equivalent to the C tagged union above, I write an enum with two variants: one that holds a number and one that holds a string.
The questions for today are: Why are enums equivalent to tagged unions in C? And: What should I draw on the right side? What would I see if I could find and examine an enum in the memory space of a running Rust process?
Saving a Rust Enum
To find out, let’s create an instance of the enum called a_number. Notice that instead of 4, I’ve saved a more recognizable value, 1234. Now, if I compile it with the --emit asm flag, Rust generates a file called union.s which contains the assembly language version of my program. If I open union.s and search for 1234, the integer value I saved above, I find the x86 assembly language instructions that initialize a_number. These show me exactly how Rust represents enums in memory, how Rust implements tagged unions.
The only problem is… I have no idea what this means!
Join us in the second and final part of this series to find out more details on how Rust handles tagged unions!
Published at DZone with permission of Pat Shaughnessy, DZone MVB.