Golang Race Detection
Time to catch those annoying race bugs using Go and ThreadSanitizer.
Race conditions are nasty bugs. They're hard to trace, reproduce, and isolate, often occur only under unusual circumstances, and lead to hard-to-debug issues. It's best to prevent them, or at least detect them early. Thankfully, there's a production-ready algorithm to tackle the challenge: ThreadSanitizer v2, a battle-proven library with compiler support in Go, Rust, C, and C++.
First, let's look at a typical race condition. We have a simple global count variable:
```go
var Cnt int
```
We count the number of events happening; whether we're counting HTTP requests, likes, or Tinder matches is irrelevant. What's relevant is how we do it. We call the function `Run`:

```go
func Run(amount int) {
	for i := 0; i < amount; i++ {
		Cnt++
	}
}
```
But that’s not all. We’re calling it from multiple goroutines:
```go
func main() {
	wg := &sync.WaitGroup{}
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			Run(1000)
		}()
	}
	wg.Wait()
	fmt.Println(Cnt)
}
```
We'd expect the result to be `10000`. Yet on a multicore, multithreaded system that's improbable. You're most likely to get results like `5234` on one run, `1243` on a second, or `1521` on a third, without any determinism or repeatability. This is a typical data race!
Let's not stop at this small example. Instead, let's look at some famous real-world race conditions with serious consequences.
Famous Examples
DirtyCow
This is a race condition in the Linux kernel. It involves a tricky interplay between memory pages and the `mmap` and `madvise` system calls, which allows a privilege-escalation exploit. That is, you could `mmap` a root-owned file copy-on-write, which is a valid operation (writes to the `mmap`-ed region should not be written to the underlying file), yet under certain conditions a write would propagate to the underlying file, despite us being an unprivileged user.
Therac-25
Another famous example is the Therac-25 radiotherapy machine. It had a race condition between the machine settings and the display settings. If the operator typed instructions quickly and then changed them quickly, the machine could end up delivering the maximum radiation dosage while displaying false information to the operator. This led to multiple accidents and deaths.
Definitions
Before continuing, let's briefly go over the required definitions:
Race condition — An undesirable situation that occurs when a device or system attempts to perform two or more operations at the same time, but, because of the nature of the device or system, the operations must be done in the proper sequence to be done correctly.
Because the race condition definition is broad, and deciding what counts as correct behavior requires domain knowledge, for the rest of this post we focus on the data race, a much narrower and more objective notion.
Data race — Concurrent access to a memory location, where at least one of the accesses is a write.
Concurrent access — Access where the event ordering isn't known; there's no happens-before relation between the events.
A happens-before relation requires some further explanation. Each goroutine has its own logical clock, incremented on each operation. Within each goroutine there's strict ordering: events happen sequentially (or at least we observe them as if they happened sequentially; compiler optimizations and out-of-order execution might interfere). Between goroutines, there's no ordering unless they synchronize.
In the following image, you can see this depicted visually; happens-before relations are highlighted with arrows. Keep in mind that the happens-before relation is transitive.
Common Mistakes
Let's observe a few common mistakes in Go programs. The examples have equivalents in other major languages; they're in Go simply because I quite like it.
Race on the Loop Counter
```go
func main() {
	var wg sync.WaitGroup
	wg.Add(5)
	for i := 0; i < 5; i++ {
		go func() {
			fmt.Println(i) // Not the 'i' you are looking for.
			wg.Done()
		}()
	}
	wg.Wait()
}
```
Try it out. You might get `0 4 4 4 4` as the output, depending on the event ordering. Certainly not the output you'd expect. (Note: since Go 1.22, the loop variable is scoped per iteration, so this particular bug no longer occurs in up-to-date Go.) The fix is quite simple:
```go
func main() {
	var wg sync.WaitGroup
	wg.Add(5)
	for i := 0; i < 5; i++ {
		go func(j int) {
			fmt.Println(j)
			wg.Done()
		}(i)
	}
	wg.Wait()
}
```
Unprotected Global
This one is similar to the introductory example:
```go
var Cnt int

func Inc(amount int) {
	Cnt += amount
}

func main() {
	go Inc(10)
	go Inc(100)
}
```
The solution is to synchronize access to the global `Cnt` variable, since increment operations aren't atomic (unless they come from the `sync/atomic` package).
Thus we use a mutex:
```go
var Cnt int
var m = &sync.Mutex{}

func Inc(amount int) {
	m.Lock()
	defer m.Unlock()
	Cnt += amount
}
```
or an atomic operation:
```go
var Cnt int64

func Inc(amount int64) {
	atomic.AddInt64(&Cnt, amount)
}
```
Violating the Go Memory Model
The Go memory model specifies what guarantees you get from the compiler. If you breach that contract, you're in for a bad time: the compiler is free to optimize away your code or do unpredictable things with it.
```go
var a string
var done bool

func setup() {
	a = "hello, world"
	done = true
}

func main() {
	go setup()
	for !done {
	}
	fmt.Print(a)
}
```
In this example, Go's memory model doesn't guarantee that the write to `done` in one goroutine is visible to other goroutines, since there's no synchronization between them. The compiler is also free to optimize `for !done {}` into a simpler construct that doesn't reload the `done` variable on each iteration. Furthermore, there are no guarantees that memory buffers between CPU cores and L1 caches will be flushed between concurrent reads of `done`. This is all due to the nasty complexity of out-of-order execution and the varying memory guarantees across architectures.
Synchronization Primitives
In Go, we have the following synchronization primitives:
- channels: sending and receiving are synchronization points
- the sync package (Mutex and friends)
- atomics (sync/atomic)
For further details, look these up; they aren't concepts unique to Go or to race detection. The important point is that they introduce happens-before ordering into the program code.
Detecting Race Conditions
In Go, this is simply done with the -race compiler flag. We can even run our tests with it and fail on any race condition:
```shell
go install -race
go build -race
go test -race
```
As mentioned in the beginning, it uses the ThreadSanitizer library under the hood. You should expect a runtime slowdown of 2-20x and memory consumption increased 5-10x. Other requirements are cgo enabled and a 64-bit operating system. It detects data races reliably, without false positives. It has uncovered 1,000+ bugs in Chrome, 100+ in Go's standard library, and more in other projects.
What the -race flag does is instrument your code with additional instructions. For example, this simple function:
```go
func Inc(x *int) {
	*x++
}
```
gets compiled into:
```
movq "".x+8(SP), AX
incq (AX)
```
However, upon turning on -race, you get a pile of assembly code, the interesting bits being:
```
...
call runtime.raceread(SB)
...
call runtime.racewrite(SB)
...
```
That is, the compiler adds read and write barriers for each concurrently reachable memory location. The compiler is smart enough not to instrument local variables, since instrumentation incurs quite a performance penalty.
Algorithm
First the bad news:
Determining if an arbitrary program contains potential data races is NP-hard.
First, how and when do we collect our data? Our choices are dynamic or static analysis. The main drawback of static analysis is the time required for proper source code annotation. The dynamic approach is used since it requires no programmer intervention beyond turning it on. Data can be collected on the fly or dumped somewhere for post-mortem analysis. ThreadSanitizer uses the on-the-fly approach for performance reasons; otherwise, we could pile up huge amounts of unnecessary data.
There are multiple approaches to dynamic race detection: pure happens-before, lockset-based, and hybrid models. ThreadSanitizer uses pure happens-before. Now we'll go over three algorithms, each a pure happens-before dynamic race detector, each improving upon the last. We'll see how ThreadSanitizer evolved from its humble origins and understand how it works:
Vector Clocks
First, let's explain the concept of vector clocks. Instead of each goroutine remembering only its own logical clock, it also remembers the logical clock of the last time it heard from each other goroutine.
Vector clocks form a partially ordered set. If two events' clocks have a strictly greater-than or less-than relation between them, there's a happens-before relation between the events. Otherwise, they are concurrent.
DJIT+
DJIT+ is an application of vector clocks to a pure happens-before data race detector.
We remember vector clocks for:
- each lock release, $L_m$
- the last memory read of location x, $R_x$
- the last memory write of location x, $W_x$
- each goroutine's own vector clock, $C_t$
Let’s see an example where there are no races:
Each row represents a single event as our race detector sees it. First, we write to location x from goroutine 0, and the write is remembered in $W_x$. Afterward, we release the lock, and goroutine 0's entry is updated in $L_m$, our lock-release vector clock. By acquiring the lock in goroutine 1, we take the element-wise maximum of the vector clocks $C_1$ and $L_m$, because we know the lock's acquire happens after the lock's release. We then perform the write in goroutine 1 and check whether our own vector clock is concurrent with $W_x$ and $R_x$. It's strictly ordered, thus everything is good.
And now the same example, but without goroutine synchronization: there is no lock acquire and release. When we compare goroutine 1's vector clock, it is concurrent with the last write. Thus, we have detected the data race. This is the most important concept to understand; the rest is optimizations.
FastTrack
FastTrack introduces the first optimization. For full details I recommend reading the original paper, as it's quite well written!
We observe the following property: if there are no data races on writes, the writes to a location are totally ordered. It's therefore sufficient to remember only the last write's goroutine and that goroutine's logical clock. Thus, we create the shadow word representing the last memory write, in the format <logical clock>@<goroutine id>.
For reads, it's a bit of a different story. It's perfectly valid to have multiple concurrent reads, as long as reads and writes are strictly ordered. Thus, we use the same shadow word representation while a single goroutine reads the data, and fall back to the expensive full vector clock in the concurrent-read scenario:
ThreadSanitizer v2
ThreadSanitizer further improves on FastTrack. Instead of keeping separate read and write states, it keeps a fixed-size pool of shadow words for each 8-byte word of application memory. That is, we approximate the full vector clock with partial shadow words. When the shadow word pool is full, we randomly evict an entry.
By introducing this trade-off, we trade false negatives for speed and memory. The fixed shadow word pool is memory-mapped, allowing cheap access with additions and bit shifts, compared to more expensive hash maps or variable-sized arrays.
Let's go over an example. Each shadow word is 8 bytes wide and consists of the goroutine's logical clock, the goroutine id, a write/read flag, and the position within the 8-byte word we're reading or writing.
Everything’s good so far. Let’s make another operation.
Now let's introduce a data race. Goroutine 3's vector clock entry for goroutine 1 is 5, which is smaller than the 10@1 shadow word written in the pool. Thus we're certain a data race happened. (We assume events for each goroutine arrive in order; otherwise, we couldn't be sure whether a 10@1 entry is ahead of or behind the current vector clock's entry for goroutine 1. This is enforced by the algorithm's design and implementation.)
Summary
To summarize: we covered the basics of logical and vector clocks, saw how they are used in a happens-before data race detector, and built up to ThreadSanitizer v2, which powers Go's -race flag.
We also observed the trade-offs in the algorithm's design. It traded a higher false negative rate for speed; however, it admits no false positives. This property builds trust: if the detector screams at us, we know for certain we have a race, not some weird edge case in the algorithm. No flaky race tests; only true positives are reported.
Keep in mind, though, that this is only a data race detector, and it's easy to circumvent. For example, the following code passes the data race detector despite being horribly wrong in a concurrent setting:
```go
var Cnt int64

func Run(amount int) {
	for i := 0; i < amount; i++ {
		val := atomic.LoadInt64(&Cnt)
		val++ // Incrementing Cnt isn't atomic; we just load and store the value atomically.
		atomic.StoreInt64(&Cnt, val)
	}
}
```
References
- Slides
- https://golang.org/doc/articles/race_detector.html
- Race detector options
- https://github.com/google/sanitizers/wiki/ThreadSanitizerAlgorithm
- The Go memory model
- "go test -race" Under the Hood, a talk by Kavya Joshi
- https://blog.golang.org/race-detector
- https://danluu.com/concurrency-bugs/
- ThreadSanitizer: Data Race Detection in Practice (Google paper)
- FastTrack: Efficient and Precise Dynamic Race Detection
- AddressSanitizer/ThreadSanitizer for Linux Kernel and userspace.
- MultiRace: Efficient On-the-fly Data Race Detection in Multithreaded C++ Programs (the DJIT+ algorithm)