Lock-Free Programming: From Primitives to Working Structures

Locking is not the only way to deal with concurrency. Lock-free programming approaches are on the opposite side. Let's dive into them.

Bartłomiej Żyliński

CORE ·

Jul. 22, 25 · Analysis

Likes (7)

Comment

Save

4.7K Views

Working with multiple threads is one of the most complex problems we may encounter in our daily work. When put against the wall of multithreading, most people right away reach out for blocking approaches. In Java, it takes the form of the synchronized keyword, or some other less painful mechanisms, like ReentrantLock. Locks are not the only option: Lock-free programming is also the way.

In this text, I will show problems, techniques, and best practices related to Lock-Free Programming. I will also provide a real-life example of how to implement a Lock-Free stack. Besides, I will share common patterns on moving from Lock-Free to Wait-Free.

Here you can find only the most interesting code samples. The full source code is available in my GitHub repository.

What Is Lock-Free Programming?

Guarantee: Lock-free programming algorithms guarantee system-wide progress. Within a finite number of steps, by all threads, at least one thread completes its operation.

Typical techniques: Atomic primitives CAS, LL/SC, FAA.

Pros:

Scales well under contention because there is no kernel blocking or context-switch overhead.
Eliminate deadlocks and livelocks

Cons:

Harder to design and reason about.
Higher cost under very low contention compared with a simple mutex.
Starvation is still possible — one unlucky thread might repeatedly fail while others succeed, but the program as a whole keeps moving forward.

What Is Wait-Free Programming?

Guarantee: Wait-free algorithm provides a per-thread progress bound. Every thread finishes its operation in a finite number of its own steps, regardless of what the other threads do.

Typical techniques: Per-thread operation descriptors and finite-step helper functions

Pros:

All from lock-free programming
Real-time friendly
Eliminate the starvation problem

Cons:

Significantly more memory and bookkeeping.
Often slower in the uncontended “happy path”
Harder (sometimes impossible) to create for complex data structures.

Blocking vs. Lock-Free vs. Wait-Free: The Progress Guarantees

Deadlock

Deadlock is probably the most common error in multithreading. Two (or more) threads each hold resources that the others need and wait forever for those resources to be released. Our threads will not be able to move forward with processing. Thus, we end with a standstill — a deadlock.

    Shell
   
   Thread A: lock(m1); lock(m2);
Thread B: lock(m2); lock(m1);

If A acquires m1 and B acquires m2, both block forever on the second lock().

Livelock

A livelock is motion without progress. Like two people in a hallway, sidestepping left and right in perfect sync and never getting past each other. Threads are active — spinning, retrying, sending messages, but the system never completes a useful operation. Usually caused by over-reactive collision‐avoidance or continual retries without an eventual terminal state.

Starvation

Starvation means a particular thread never gets the resources or CPU time it needs. Even despite the fact that the system as a whole keeps making progress. If other threads hit the same data structure with high frequency. Then one of the threads may have to retry many times because its compare-and-swap keeps failing. This high contention can cause the Thread to starve.

Problem	Lock-Free Programming	Wait-Free Programming
Deadlock	Impossible. No thread ever waits while holding a lock, so circular-wait conditions can’t arise.	Impossible. Same lock-free property plus bounded completion time for every thread.
Livelock	Prevented. At least one operation must complete in a finite number of steps, so the system can’t get stuck in endless mutual retries.	Prevented. Every operation finishes in a bounded number of steps, so the whole system and each thread move forward.
Starvation	Possible. System makes progress, but an unlucky thread may be perpetually overtaken by others.	Impossible. Each thread completes within a fixed bound, so no one can be starved.

Atomic Primitives That Power Non-Blocking Algorithms

Compare-and-Swap (CAS)

Compare-and-Swap (CAS) is probably the single most famous atomic instruction. In one atomic step, the processor:

Reads a memory word,
Checks whether that word still holds an expected value
Only if the comparison succeeds, replace it with a new value

That simple “check-and-set” sequence lets a thread act as if it briefly owned the variable without ever locking it.

Most, if not all, non-blocking data structures rely on this instruction mix with a loop to work correctly. You will see one of them below, in the form of LockFreeStack. However, there is a catch or two hidden here. First, when contention is rising, the loop starts taking more and more time to complete. The other is known as an A-B-A problem.

In Java, we have all the Atomic classes that have compareAndSwap methods. Additionally, you can try to emulate CAS with the use of the volatile keyword.

Load-Link / Store-Conditional (LL/SC)

One of the ways to address the A-B-A problem from above is the LC/SC instruction pair. It is a pair of two steps, atomic with respect to other threads. Effectively creating a “split CAS.”

Step	What happens
LL	Load the word at addr and place the CPU into a reservation state for that cache line
SC	If reservation valid, no intervening write by another core, stores new value, otherwise it does nothing and returns false

The main difference between CAS and LL/SC is the reservation mechanism. Thus, it dodges the ABA problem. Unfortunately, LL/SC is not a silver bullet and has drawbacks. The biggest one is that the Reservation can be lost for reasons other than contention. In such a case, we lose all the benefits of using this primitive.

In Java, the closest thing to LL/SC is AtomicStampedReference.

Fetch-and-Add (FAA)

Fetch-and-Add, or AtomicAdd, performs the following operations in a single atomic step:

Read the current value at addr
Adds a delta k
Stores the result back.
Returns the old value (some variants return the new one).

There are other variants of this primitive, like: Fetch-and-Subtract. FAA is used as a base in more complex techniques like Fetch-And-Store or Test-And-Set.

The main disadvantage of this primitive is what its name suggests. FAA only supports arithmetic transforms of the form old + k. To do anything more complex, you will still need CAS or LL/SC.

In Java, all Atomic classes have exposed getAndIncrement or incrementAndGet methods.

Double-Width CAS

This is an extension of classic CAS. Double Width CAS (or DCAS) extends the regular CAS idea to two adjacent machine words treated as a single 128-bit (or larger) unit. Both sub-words must match their expected values for the store to occur.

In pseudocode:

    C++
   
   cas2(expected_lo, expected_hi,
     new_lo, new

It allows you to store a pointer in the low word and a tag in the high word. Both words must match, so resurrecting an old pointer with the same address, but a different tag, fails. Thus solving the A-B-A problem.

DCAS enables part of lock-free data structures to swap pairs (pointer/counter or value/status) atomically. The main drawback is heavier micro-architectural cost than single-word CAS (locks 16-byte cache line).

Primitives Summary

Primitive	General form	Typical uses	Common pitfalls
CAS	`old ⇒ new`	Universal lock-free building block	ABA problem, retry cost under contention
DW-CAS	`(old_lo, old_hi) ⇒ (new_lo, new_hi)`	Pointer + tag pairs, ABA defense	Heavier locks 16-byte cache line
LL/SC	`LL; …; SC` split	Portable lock-free ops when CAS not present	Reservation loss → more retries
FAA	`new = old + k`	Counters, ticket locks, ref-counts	Cache-line ping-pong, limited to arithmetic ops

Besides these four primitives, we have more complex ones like Atomic Exchange / Test-and-Set (TAS), Fetch-and-store (FAS).

Lock-Free Programming In Java: Lock-Free Stack

Base

Simplest Lock Free Stack, with single atomic top pointer. Push and pop CAS loops operate on that pointer. It is a singly linked Stack; thus, the Node class only contains a next reference.

    Java
   
   import java.util.concurrent.atomic.AtomicReference;

private record Node<E>(E value, Node<E> next) {

}

private final AtomicReference<Node<E>> top = new AtomicReference

Lock-Free Push

Here we have a lock-free push. You can see the CAS loop wrap around the top pointer.

    Java
   
 

    /* ---------------- push ----------------------- */
public void push(E item) {
    Node<E> newNode;
    Node<E> oldTop;
    do {
        oldTop = top.get();
        newNode = new Node<>(item, oldTop);
    } while (!top.compareAndSet(oldTop, newNode));
  

What’s happening:

Read the current head pointer.
Build the replacement node – safe because nothing else can see newNode yet.
CAS attempt cas(oldTop, newNode).
- If no one has changed top in the meantime, the CAS succeeds, and the loop exits.
- If another thread slipped in, the CAS fails, we loop to re-snapshot the new top.

Lock-Free Pop

    Java
   
 

   public E pop() {
    Node<E> oldTop;
    Node<E> newTop;
    do {
        oldTop = top.get();
        if (oldTop == null) return null;
        newTop = oldTop.next();
    } while (!top.compareAndSet(oldTop, newTop));
    return oldTop.value();
}
  

What’s happening:

Read the current head pointer.
Null-check nothing to pop → return immediately, no CAS needed.
Assign a new top.
CAS from oldTop ➜ newTop
- If no one has changed top in the meantime, the CAS succeeds, we atomically detach oldTop.
- If another thread slipped in, the CAS fails, we loop to re-snapshot the new top.
Return — safe because oldTop is now private to this thread, GC will eventually reclaim it.

From Lock-Free Programming to Wait-Free Programming

I won't be presenting a complete Wait-Free implementation. It is long, complex, and explaining it correctly will probably double the size of this blog. However, there are a few good sources on building Wait-Free data structures.

Queues
- Scalable Synchronous Queue
- Wait-Free Queue
Stack
- A Wait-Free Stack
- Fetch-And-Add Wait-Free Stack
General

There are a couple of common points shared between all Wait-Free implementations:

Op Log — Each thread writes a small record describing what it wants to do.
Per-thread slot reservation → Thread grabs a unique index with FAA; writes its OpLogs there.
Phase (ticket) numbers → Every request gets a monotonically increasing “phase”.
Bounded Help loops → A helper will finish at most k foreign operations before returning to its own.

Summary

As you could see, both Lock-free and wait-free are not simple things. However, with a couple of good tools and techniques, they can be much easier.

Key Takeaways

Definitions

Lock-free programming → from all active threads, at least one makes progress in a finite number of steps.
Wait-free programming→ every operation by every thread completes in a bounded number of its own steps.

Atomic Primitives

CAS
LL/Sc
FAA
Double-width CAS

Best Practices in Implementing Lock-Free and Wait-Free Structures

Design first, code later → Draw the state diagram and identify every location that can change concurrently.
One atomic word to one logical invariant → Keep each CAS/LL-SC operating on the entire state you need to test.
Use DCAS → rather than splitting an invariant across two variables.
Write the fast path first → Attempt a cheap single-CAS path first; after N failures, publish a descriptor and switch to the helping (“slow”) path.
Bounded helping (Wait-Free) → Ensure a helper can finish at most k foreign operations before returning to its own.

Thank you for your time!

Central Authentication Service Requirements engineering Java (programming language) Lock (computer science)

Published at DZone with permission of Bartłomiej Żyliński. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

Trending