Architecting Sub-Microsecond HFT Systems With C++ and Zero-Copy IPC

Building sub-microsecond HFT dispatchers requires bypassing the operating system. Learn how to achieve zero-copy IPC using C++ lock-free structures and memory mapping.

Apr. 30, 26 · Analysis

If you spend enough time building backend services, you start to think 50 milliseconds is a "fast" response time. But when you transition into the architecture of high-frequency trading (HFT) systems, you quickly realize that standard software engineering paradigms are not just slow — they are fundamentally flawed for this domain.

In the HFT world, latency is measured in microseconds (and increasingly, nanoseconds). When building a market data dispatcher and execution engine, you are no longer just writing software; you are negotiating directly with the physics of the hardware and the limitations of the operating system.

This article breaks down the architectural foundations required to build a sub-microsecond trading dispatcher, focusing on hardware proximity, OS-level tuning, lock-free data structures in C++, and zero-copy inter-process communication (IPC).

Step Zero: The Physics of Latency and Hardware Colocation

You can write the most optimized C++ code on the planet, but if your server is sitting in an AWS region 500 miles away from the exchange's matching engine, you have already lost the trade.

Light travels through fiber optic cables at roughly 200,000 kilometers per second, about two-thirds of its speed in a vacuum. That means every 200 kilometers adds a hard, unbreakable physical penalty of about 1 millisecond one way, or 2 milliseconds round trip. In an environment where the lifespan of a profitable arbitrage opportunity is under 50 microseconds, geographical distance is fatal.

Colocation is mandatory. Your servers must sit inside the same data center as the exchange (e.g., CME in Aurora, or B3 in São Paulo). But physical proximity is just the baseline. Once the packet hits your NIC (Network Interface Card), the operating system becomes your biggest enemy.

Bypassing the Kernel and Thread Affinity

In a standard TCP/IP stack, the NIC raises an interrupt, the kernel processes the packet, and the application must cross from user space into kernel space and back with a system call to read it. This transition alone can cost 2 to 3 microseconds. HFT architectures rely on kernel-bypass technologies (like Solarflare's OpenOnload or DPDK) to read packets directly from the NIC hardware into user-space memory.

Furthermore, we cannot allow the OS scheduler to pause our critical threads to run background tasks. We use thread affinity (CPU pinning) to bind our hot-path threads to specific, isolated CPU cores. Pinning alone only restricts where our thread may run; the cores must also be isolated so the scheduler places nothing else on them (on Linux, for example, with the isolcpus boot parameter). With both in place, the L1 and L2 caches stay warm and the hot-path threads are not preempted.
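
As a rough illustration of the pinning step, the sketch below binds the calling thread to a single logical core. It assumes Linux with glibc, where pthread_setaffinity_np is available as a GNU extension; the helper name pin_current_thread_to_core is hypothetical. Isolating the core from the scheduler is a separate system-configuration step that this code does not perform.

```cpp
#include <pthread.h>
#include <sched.h>

// Restricts the calling thread to the single logical core `core_id`.
// Returns true on success, false if the id is out of range or the call fails
// (for example, when the core is not in the process's allowed set).
bool pin_current_thread_to_core(int core_id) noexcept {
    if (core_id < 0 || core_id >= CPU_SETSIZE) return false;
    cpu_set_t set;
    CPU_ZERO(&set);            // start from an empty CPU mask
    CPU_SET(core_id, &set);    // allow exactly one core
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}
```

A hot-path thread would typically call this once at startup, before it enters its polling loop.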

Language Selection: Why C++ Owns the Hot Path

There is a persistent debate about whether Java or C# can be used for HFT. While you can use off-heap memory and GC-tuning to achieve low latency in managed languages, C++ remains the undisputed king of the hot path. The reason is simple: determinism.

We don't just care about average latency; we care deeply about the 99.99th percentile (tail latency). A garbage collection pause of even 1 millisecond is a catastrophic failure in an active market. C++ provides RAII (resource acquisition is initialization) and precise control over memory layouts, allowing us to build predictable, zero-allocation pipelines.

Consider how we lay out memory for IPC sharing. To prevent compilers from adding hidden padding bytes (which would shift field offsets and ruin cross-language compatibility), we force strict 1-byte packing:

    C++
   
 

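#include <cstdint>

#pragma pack(push, 1)          // no padding between fields
struct TickEvent {
    std::int64_t timestamp;    // 8 bytes, offset 0
    char         type[4];      // 4 bytes, offset 8
    char         symbol[16];   // 16 bytes, offset 12
    double       price;        // 8 bytes, offset 28 (unaligned because of packing)
    std::int64_t qty;          // 8 bytes, offset 36
    std::int32_t side;         // 4 bytes, offset 44
};
#pragma pack(pop)              // restore the default packing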
  

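This means that when a higher-level strategy node (written in C# or Python) declares the same packed layout and reads this struct from memory, the byte offsets match exactly, with zero serialization overhead. The trade-off is that price lands at offset 28, which is not 8-byte aligned. x86 tolerates unaligned loads, but they can be slower when a field straddles a cache line.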
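
One way to keep that layout contract honest is to assert it at compile time. The sketch below uses a hypothetical mirror struct, PackedTick, with fixed-width types, plus a hypothetical helper read_price that fetches a field from raw bytes at its fixed offset, the way a reader in another language would. It assumes both sides run on the same little-endian machine.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

#pragma pack(push, 1)
struct PackedTick {            // hypothetical mirror of the tick layout
    std::int64_t timestamp;
    char         type[4];
    char         symbol[16];
    double       price;
    std::int64_t qty;
    std::int32_t side;
};
#pragma pack(pop)

// Fail the build if anyone changes the wire layout by accident.
static_assert(sizeof(PackedTick) == 48, "unexpected struct size");
static_assert(offsetof(PackedTick, type) == 8, "unexpected offset: type");
static_assert(offsetof(PackedTick, symbol) == 12, "unexpected offset: symbol");
static_assert(offsetof(PackedTick, price) == 28, "unexpected offset: price");
static_assert(offsetof(PackedTick, qty) == 36, "unexpected offset: qty");
static_assert(offsetof(PackedTick, side) == 44, "unexpected offset: side");

// Reads the price from a raw byte buffer at its fixed offset.
// memcpy avoids undefined behavior from an unaligned double access.
inline double read_price(const unsigned char* raw) noexcept {
    double p;
    std::memcpy(&p, raw + offsetof(PackedTick, price), sizeof(p));
    return p;
}
```

A reader in C# or Python would hard-code the same offsets, so a failed static_assert here is an early warning that every consumer needs updating.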

Concurrency Without Sleeping: The TTAS Spinlock

In standard multi-threading, when two threads want to access an order book, you use a std::mutex. If the lock is held, the OS puts the waiting thread to sleep. Waking it up later requires a context switch. As established, context switches are poison.

In our architecture, the hot-path threads never sleep. Instead, we use a Test and Test-and-Set (TTAS) spinlock implemented with atomic flags.

    C++
   
 

#include <atomic>
#include <immintrin.h>  // _mm_pause (x86)

class SpinLock {
    std::atomic<bool> locked_{false};
public:
    void lock() noexcept {
        for (;;) {
            // Test-and-set: try to take the lock, with acquire semantics on success
            if (!locked_.exchange(true, std::memory_order_acquire)) return;
            // Test: spin on a read-only load so the cache line stays shared
            // instead of bouncing between cores on every iteration
            while (locked_.load(std::memory_order_relaxed)) {
                _mm_pause(); // Hints the CPU that this is a spin-wait loop
            }
        }
    }
    void unlock() noexcept {
        locked_.store(false, std::memory_order_release);
    }
};
  

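The key detail here is _mm_pause(). Without it, the tight spin loop issues loads as fast as the pipeline allows, wasting power and, on hyper-threaded cores, starving the sibling thread of execution resources. The pause instruction hints to the CPU that this is a spin-wait loop: it throttles the loop, lowers power consumption, and avoids the memory-order mis-speculation penalty when the lock is finally released. The cache-coherency savings come from the TTAS pattern itself: waiters spin on a read of the flag and attempt the atomic write only when it appears free, so the cache line is not bounced between cores while the lock is held.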
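
For completeness, here is a small usage sketch. It assumes a compact TTAS lock (the names TtasLock, cpu_relax, and contended_sum are hypothetical) and adds a fallback for non-x86 targets, where _mm_pause is unavailable. Because the lock exposes lock() and unlock(), it works with std::lock_guard.

```cpp
#include <atomic>
#include <mutex>
#include <thread>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
inline void cpu_relax() noexcept { _mm_pause(); }
#else
// Assumption: on other architectures, fall back to yielding the time slice.
inline void cpu_relax() noexcept { std::this_thread::yield(); }
#endif

class TtasLock {
    std::atomic<bool> locked_{false};
public:
    void lock() noexcept {
        for (;;) {
            if (!locked_.exchange(true, std::memory_order_acquire)) return;
            while (locked_.load(std::memory_order_relaxed)) cpu_relax();
        }
    }
    void unlock() noexcept { locked_.store(false, std::memory_order_release); }
};

// Two threads each increment a shared counter `iterations` times under the lock.
inline long long contended_sum(int iterations) {
    TtasLock lock;
    long long counter = 0;
    auto work = [&] {
        for (int i = 0; i < iterations; ++i) {
            std::lock_guard<TtasLock> guard(lock);  // unlocks on scope exit
            ++counter;
        }
    };
    std::thread a(work);
    std::thread b(work);
    a.join();
    b.join();
    return counter;
}
```

If the lock is correct, no increments are lost. Keep in mind that a spinlock only pays off when critical sections are a handful of instructions long and each contending thread owns its core.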

Data Structure Sympathy: The FlatBook Approach

When junior developers are asked to build an Order Book (L3 data), they immediately reach for a Red-Black Tree (std::map in C++) because it offers $O(\log n)$ insertion and deletion. In theory, this is great. In reality, it is a cache-miss nightmare.

Tree nodes are allocated individually, so they end up scattered across the heap. Traversing the tree means chasing pointers across RAM, resulting in frequent cache misses. A read that goes all the way to main memory takes about 100 nanoseconds, an eternity on this time scale.

Instead, we use a FlatBook architecture: a contiguous std::vector of flat structs.

    C++
   
 

#include <cmath>
#include <vector>

struct PriceLevel {
    double price;
    long long qty;
};

struct FlatBook {
    std::vector<PriceLevel> levels;
    // ...
    void add(double price, long long qty) noexcept {
        // Linear scan over contiguous memory: sequential access is prefetch-friendly
        for (auto it = levels.begin(); it != levels.end(); ++it) {
            // Prices are doubles, so match within a small tolerance
            if (std::abs(it->price - price) < 0.0001) {
                it->qty += qty;                      // qty is a signed delta
                if (it->qty <= 0) levels.erase(it);  // level emptied: drop it
                return;                              // `it` is invalid after erase
            }
        }
        // ... insert logic
    }
};
  

Even though a linear scan is $O(n)$, an order book usually has only 10 to 20 active levels near the spread. At 16 bytes per PriceLevel, that is a few hundred bytes, or about five 64-byte cache lines. Because std::vector stores data contiguously, the scan is sequential, and the CPU's hardware prefetcher pulls in the next cache lines before the loop reaches them. A "slow" $O(n)$ algorithm operating entirely in the L1 cache will obliterate a "fast" $O(\log n)$ algorithm that has to hit main memory.
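
To show how the elided insert path might look, here is a minimal sketch of one side of a flat book. It is not the author's implementation: it swaps the double price for an integer tick count so that levels can be compared exactly, keeps bids sorted best-first, and uses hypothetical names (Level, FlatSide, apply).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Level {
    std::int64_t price_ticks;  // assumption: price as an integer number of ticks
    std::int64_t qty;
};

// One side of the book (bids), kept sorted with the highest price first.
class FlatSide {
    std::vector<Level> levels_;
public:
    // Reserve up front so the hot path does not allocate while under capacity.
    explicit FlatSide(std::size_t capacity) { levels_.reserve(capacity); }

    // Applies a signed quantity delta at a price level.
    // Creates the level if it is new, removes it once it is empty.
    void apply(std::int64_t price_ticks, std::int64_t delta) {
        auto it = levels_.begin();
        // Linear scan to the first level at or below the target price.
        while (it != levels_.end() && it->price_ticks > price_ticks) ++it;
        if (it != levels_.end() && it->price_ticks == price_ticks) {
            it->qty += delta;
            if (it->qty <= 0) levels_.erase(it);
        } else if (delta > 0) {
            levels_.insert(it, Level{price_ticks, delta});  // keeps the order sorted
        }
    }

    const std::vector<Level>& levels() const noexcept { return levels_; }
};
```

Keeping the vector sorted means the best price is always at the front, and most updates near the spread touch only the first cache line or two.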

The Ultimate Bridge: Zero-Copy IPC via Memory-Mapped Files

Eventually, the raw data processed by the C++ dispatcher must reach the algorithmic execution nodes. Using local TCP sockets or named pipes introduces kernel overhead.

The architectural solution is memory-mapped files. We instruct the OS to map a block of RAM directly into the virtual address space of our C++ process.

    C++
   
 

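#include <windows.h>

// Windows API example for creating a shared memory map.
// `sa` (SECURITY_ATTRIBUTES) and `AssetTicket` are defined elsewhere.
HANDLE hMapFile = CreateFileMappingA(
    INVALID_HANDLE_VALUE,   // back the mapping with the paging file, not a disk file
    &sa,                    // Security attributes for cross-session access
    PAGE_READWRITE,
    0,                      // high 32 bits of the mapping size
    sizeof(AssetTicket),    // low 32 bits of the mapping size
    "HFT_Ticket_SYMBOL"
);
if (hMapFile == NULL) { /* creation failed: inspect GetLastError() */ }

AssetTicket* sharedData = static_cast<AssetTicket*>(MapViewOfFile(
    hMapFile, FILE_MAP_ALL_ACCESS, 0, 0, sizeof(AssetTicket)
));
if (sharedData == nullptr) { /* mapping failed: inspect GetLastError() */ }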
  

By doing this, the C++ engine writes the parsed order book directly to this RAM block. The algorithmic node (which could be a separate process running in a different OS session) maps the exact same physical memory block.

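When the C++ engine updates sharedData->last_price, the strategy node sees the new value as soon as cache coherency propagates the write, typically within tens to a few hundred nanoseconds depending on where the two cores sit. There is no serialization, no JSON parsing, no socket buffer, and no OS intervention on the data path. The flip side is that nothing synchronizes the two processes either: for any update that spans more than one field, the reader needs a lock-free consistency check, such as a sequence counter, to avoid acting on a half-written struct. With that in place, it is pure, zero-copy inter-process communication at the speed of the memory hierarchy.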
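
Because a shared mapping has no built-in synchronization, a reader can observe an update halfway through. One common remedy is a single-writer sequence lock (seqlock), sketched below. The SharedQuote struct and the publish and try_read functions are hypothetical, not part of the article's AssetTicket. The sketch assumes exactly one writer and that these atomics are lock-free on the target platform, which cross-process use requires.

```cpp
#include <atomic>
#include <cstdint>

struct SharedQuote {                        // hypothetical shared payload
    std::atomic<std::uint64_t> seq{0};      // even: stable, odd: write in progress
    std::atomic<double>        last_price{0.0};
    std::atomic<std::int64_t>  last_qty{0};
};

// Writer side (single writer only): bump seq to odd, write, bump to even.
inline void publish(SharedQuote& q, double price, std::int64_t qty) noexcept {
    const std::uint64_t s = q.seq.load(std::memory_order_relaxed);
    q.seq.store(s + 1, std::memory_order_relaxed);        // mark write in progress
    std::atomic_thread_fence(std::memory_order_release);  // odd seq before the data
    q.last_price.store(price, std::memory_order_relaxed);
    q.last_qty.store(qty, std::memory_order_relaxed);
    q.seq.store(s + 2, std::memory_order_release);        // publish the new snapshot
}

// Reader side: returns true only if price and qty come from the same update.
inline bool try_read(const SharedQuote& q, double& price, std::int64_t& qty) noexcept {
    const std::uint64_t before = q.seq.load(std::memory_order_acquire);
    if (before & 1) return false;                          // writer is mid-update
    price = q.last_price.load(std::memory_order_relaxed);
    qty   = q.last_qty.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);   // data before the re-check
    return before == q.seq.load(std::memory_order_relaxed);
}
```

A reader that gets false simply retries. The writer never waits for readers, which keeps the hot path free of blocking.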

The Convergence of Software and Metal

Designing an HFT architecture forces you to unlearn many standard software engineering practices. You stop abstracting the hardware and start embracing it. You replace elegant heap-based data structures with contiguous arrays. You bypass the kernel, abandon the garbage collector, and count the nanoseconds between L1 cache hits.

In the end, sub-microsecond trading systems prove that software is not an abstract entity. At the absolute limits of performance, your code is just an extension of the silicon it runs on.

Opinions expressed by DZone contributors are their own.
