Real-Time Market Data Processing: Designing Systems for Low Latency and High Throughput
In financial markets, real-time data processing is critical for trading, risk management, and decision-making. Market data systems must ingest and process millions of updates per second while ensuring ultra-low latency. During my time at Bloomberg and Two Sigma, I worked on optimizing such systems for speed and reliability.
In this article, I'd like to explore key challenges in real-time market data processing, design strategies, and optimizations, with code snippets where applicable. I'll try my best to keep things succinct without going into too much detail!
High-Performance Data Ingestion
Market data systems need to handle a continuous stream of updates from multiple exchanges with minimal latency. Traditional approaches using TCP-based brokers like Kafka introduce too much overhead. Instead, many trading firms rely on UDP multicast for market data distribution.
UDP Multicast for Low Latency
A lightweight, high-performance approach uses UDP multicast to distribute data to multiple consumers efficiently.
The program below joins the specified multicast group, binds a socket to the port, and continuously receives and prints any data other hosts send to the group. So, it pretty much acts as a data receiver. Simple!
// High-performance UDP multicast receiver (C++)
#include <iostream>
#include <cstring>      // memset
#include <arpa/inet.h>  // inet_addr, htons, htonl
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>
#define PORT 5000
#define GROUP "239.255.0.1" // Multicast group addresses are in the range of 224.0.0.0 to 239.255.255.255
#define BUFFER_SIZE 1024
int main() {
    // Create a UDP socket
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) {
        std::cerr << "Error creating socket" << std::endl;
        return -1;
    }
    // Allow multiple receivers on the same host to bind the same port
    int reuse = 1;
    setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof(reuse));
    // Set up the address structure
    struct sockaddr_in addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(PORT);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    // Bind the socket to the address
    if (bind(sock, (struct sockaddr*)&addr, sizeof(addr)) < 0) {
        std::cerr << "Error binding socket" << std::endl;
        close(sock);
        return -1;
    }
    // Join the multicast group
    struct ip_mreq mreq;
    mreq.imr_multiaddr.s_addr = inet_addr(GROUP);
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    if (setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq)) < 0) {
        std::cerr << "Error joining multicast group" << std::endl;
        close(sock);
        return -1;
    }
    // Receive and print messages (read one byte less than the buffer
    // size so there is always room for the null terminator)
    char buffer[BUFFER_SIZE];
    while (true) {
        ssize_t bytes_received = recv(sock, buffer, BUFFER_SIZE - 1, 0);
        if (bytes_received < 0) {
            std::cerr << "Error receiving message" << std::endl;
            close(sock);
            return -1;
        }
        buffer[bytes_received] = '\0'; // null-terminate the payload
        std::cout << "Received: " << buffer << std::endl;
    }
    return 0;
}
Why UDP Multicast?
- Eliminates the need for multiple TCP connections (and their per-connection setup and state overhead)
- Reduces overhead compared to Kafka or RabbitMQ (no centralized broker, message queue, or ack mechanism)
- Used by exchanges like Nasdaq and CME for tick data distribution, making it the de facto standard among major real-time tick distributors
Optimizing for Ultra-Low Latency
Real-time systems must minimize processing delays at every stage.
Zero-Copy Data Processing
Copying data between buffers adds unnecessary overhead. Using mmap-based shared memory avoids these copies. The C++ program below demonstrates zero-copy shared memory using mmap to publish a market data update; the lock-free ring buffer in the next subsection handles message passing in multi-threaded systems. Together they show efficient, thread-safe ways to share data between processes and threads, making them suitable for high-performance applications.
// Zero-copy shared memory writer using mmap (C++); link with -lrt on Linux
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstring>
#include <iostream>
const char* SHM_NAME = "/market_data";
const size_t SHM_SIZE = 4096;
int main() {
    // Create (or open) the shared memory object and size it
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0666);
    if (fd < 0 || ftruncate(fd, SHM_SIZE) < 0) {
        std::cerr << "shm_open/ftruncate failed" << std::endl;
        return -1;
    }
    // Map it into our address space; writes land directly in the
    // shared mapping, with no intermediate copies
    void* addr = mmap(nullptr, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        std::cerr << "mmap failed" << std::endl;
        return -1;
    }
    strcpy((char*)addr, "New Market Data Update");
    std::cout << "Shared memory updated!" << std::endl;
    munmap(addr, SHM_SIZE);
    close(fd);
    return 0;
}
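For completeness, here is a minimal reader-side sketch (my addition, assuming the writer above has already run) that maps the same region read-only; the reader sees the writer's bytes without any copying:
// Zero-copy shared memory reader (C++); a minimal sketch assuming
// the writer above has already created and populated /market_data
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <iostream>
int main() {
    int fd = shm_open("/market_data", O_RDONLY, 0666);
    if (fd < 0) {
        std::cerr << "shm_open failed (run the writer first?)" << std::endl;
        return -1;
    }
    // Map read-only; no data is copied between the processes
    void* addr = mmap(nullptr, 4096, PROT_READ, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        std::cerr << "mmap failed" << std::endl;
        return -1;
    }
    std::cout << "Read: " << (const char*)addr << std::endl;
    munmap(addr, 4096);
    close(fd);
    return 0;
}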
Lock-Free Data Structures
In a multi-threaded system, traditional locks (std::mutex) cause contention. Instead, use lock-free ring buffers for message passing.
// Lock-free single-producer/single-consumer ring buffer (C++)
#include <atomic>
#include <array>
#include <cstddef>
template <typename T, size_t Size>
class LockFreeQueue {
    // Fixed-size inline storage; one slot always stays empty to
    // distinguish full from empty, so effective capacity is Size - 1
    std::array<T, Size> buffer;
    std::atomic<size_t> head{0}, tail{0};
public:
    bool enqueue(const T& item) {
        size_t currentTail = tail.load(std::memory_order_relaxed);
        if ((currentTail + 1) % Size == head.load(std::memory_order_acquire))
            return false; // Buffer full
        buffer[currentTail] = item;
        tail.store((currentTail + 1) % Size, std::memory_order_release);
        return true;
    }
    bool dequeue(T& item) {
        size_t currentHead = head.load(std::memory_order_relaxed);
        if (currentHead == tail.load(std::memory_order_acquire))
            return false; // Buffer empty
        item = buffer[currentHead];
        head.store((currentHead + 1) % Size, std::memory_order_release);
        return true;
    }
};
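A quick usage sketch, assuming the LockFreeQueue above is in scope; note that this design is safe only for exactly one producer thread and one consumer thread:
// Usage sketch: one producer, one consumer
#include <thread>
#include <iostream>
int main() {
    LockFreeQueue<int, 1024> queue;
    std::thread producer([&] {
        for (int i = 0; i < 1000; ++i)
            while (!queue.enqueue(i)) {} // spin until a slot frees up
    });
    std::thread consumer([&] {
        int value, count = 0;
        while (count < 1000)
            if (queue.dequeue(value)) ++count;
        std::cout << "Consumed " << count << " messages" << std::endl;
    });
    producer.join();
    consumer.join();
    return 0;
}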
Why Lock-Free?
- Avoids context-switching delays from mutexes
- Prevents thread contention in high-throughput scenarios
- Used in real-time trading engines and order book matching
However, one might ask: why not use lock-free structures everywhere, then? The answer is that lock-free code is often more complex and error-prone than traditional locking, requiring a deep understanding of concurrency, atomic operations, and memory models. That complexity invites subtle bugs and makes the code hard to develop and maintain, so it is best reserved for specific hot paths like the one above.
Efficient Failover for High Availability
To ensure high availability, real-time systems require efficient failover mechanisms. Traditional failover approaches can introduce latency, which is unacceptable in high-performance applications.
Hot Standby Replication
Instead of a cold failover, a hot standby replica can take over instantly:
- Primary node: Processes market data and publishes results.
- Replica node: Runs the same pipeline but discards output until failover (see the sketch below).
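Here's a minimal sketch of the replica's output gate, assuming a hypothetical is_primary flag that an external health monitor flips on failover:
// Hot standby output gate (illustrative sketch; the health-check
// and promotion logic are assumed to live elsewhere)
#include <atomic>
#include <iostream>
#include <string>
std::atomic<bool> is_primary{false}; // flipped to true on failover
void process_update(const std::string& update) {
    // Both primary and replica run the full pipeline, so the
    // replica's state and caches stay warm at all times
    std::string result = "processed: " + update;
    // Only the active node publishes downstream
    if (is_primary.load(std::memory_order_acquire)) {
        std::cout << result << std::endl;
    }
}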
Real-Time Data Sync
Sync market data logs in near real-time using tools like rsync, e.g.:
rsync -az --progress --delete /data/market_data/ standby:/data/market_data/
RDMA for Fast Failover
Utilize RDMA (Remote Direct Memory Access) to replicate state directly between hosts' memory. The libibverbs calls below are fragments; they assume a queue pair (qp) and work requests (send_wr, recv_wr) that have already been set up:
ibv_post_send(qp, &send_wr, &bad_wr); // Post a send work request over RDMA
ibv_post_recv(qp, &recv_wr, &bad_wr); // Post a receive work request
Key advantages of these failover strategies:
- No TCP overhead, as RDMA transfers memory directly.
- Failover occurs in microseconds instead of milliseconds.
- Used in cross-region data replication for financial exchanges.
By incorporating these strategies, you can achieve efficient failover and minimize downtime in your real-time system.
Scaling for High Throughput - A Note on Partitioning!
So generally, one trades not just one but many (thousands of?!) securities (stock/option tickers), and handling millions of messages per second requires intelligent partitioning and parallelism; a single machine just can't handle that load (we are still not there with quantum chips just yet :)).
To mitigate this, a common approach is to shard/partition by ticker symbol and assign dedicated hosts to each shard. This ensures fair load distribution; a minimal sharding sketch follows.
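Here's a minimal sketch of symbol-based sharding; the shard count and the hash-based mapping are illustrative assumptions, not a production layout:
// Symbol-based sharding sketch (C++); NUM_SHARDS and the hash-based
// host mapping are illustrative assumptions
#include <string>
#include <functional>
#include <iostream>
constexpr size_t NUM_SHARDS = 16; // hypothetical shard count
// Hash the ticker so the same symbol always lands on the same shard
size_t shard_for(const std::string& ticker) {
    return std::hash<std::string>{}(ticker) % NUM_SHARDS;
}
int main() {
    for (const auto& ticker : {"AAPL", "MSFT", "GOOG", "TSLA"})
        std::cout << ticker << " -> shard " << shard_for(ticker) << std::endl;
    return 0;
}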
Furthermore, batching can be leveraged for throughput optimization: sending data in batches reduces network and disk I/O.
// Batch writes to persistent storage (database.bulkInsert stands in
// for your storage client's bulk-write API)
std::vector<std::string> batch;
batch.reserve(100);
for (auto& msg : marketDataQueue) {
    batch.push_back(msg);
    if (batch.size() >= 100) {
        database.bulkInsert(batch);
        batch.clear();
    }
}
if (!batch.empty())
    database.bulkInsert(batch); // Flush the final partial batch
Other Lessons from High-Frequency Trading (HFT)
So far we have covered receiving and reading data with failover. For the actual processing of the data, however, we need to make sure the processor does no other job but the processing itself. I highlight this because CPUs have multiple cores, each core can run multiple threads, and the OS schedules many tasks across them. This leads to context switching, where the CPU switches between different tasks, causing delays and reducing overall performance.
To mitigate this issue, we can use CPU sets (also known as CPU affinity or CPU pinning) to dedicate specific CPU cores to our data processing task. By doing so, we ensure that those cores are not interrupted by other tasks, allowing them to focus solely on processing our data.
CPU sets are basically a way to bind a process or thread to a specific set of CPU cores, ensuring that it runs exclusively on those cores. This is particularly useful in high-performance computing applications, such as financial trading platforms, where low latency and high throughput are critical.
By using CPU sets, we can:
- Reduce context switching and minimize delays
- Increase cache locality and reduce memory access times
- Improve overall system performance and responsiveness
Techniques from HFT apply to any real-time system:
- Pinned CPU threads – Assigning threads to specific cores using taskset or thread-affinity APIs (see the sketch after this list).
- NUMA-aware memory allocation – Ensures local memory access.
- Prefetching cache lines – Using _mm_prefetch() in C++ to optimize CPU cache access.
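As a concrete illustration of thread pinning, here's a minimal sketch using pthread_setaffinity_np on Linux; the choice of core 2 is an arbitrary example, and on the command line, taskset -c 2 ./your_app achieves the same for a whole process:
// Pin the calling thread to a single core (Linux-specific sketch;
// core 2 is an arbitrary example)
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <iostream>
int main() {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(2, &cpuset); // run exclusively on core 2
    if (pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset) != 0) {
        std::cerr << "Failed to set CPU affinity" << std::endl;
        return -1;
    }
    std::cout << "Thread pinned to core 2" << std::endl;
    // ... hot-path processing runs here without migrating cores ...
    return 0;
}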
Beyond Finance: Broader Applications
The same real-time techniques apply to:
- Ad Tech (RTB Systems) – Auctioning ads within tens of milliseconds.
- Cybersecurity – Analyzing real-time network traffic for threats.
- IoT & Telemetry – Handling sensor data at ultra-high throughput.
Conclusion
Building real-time market data systems requires deep optimizations across networking, memory, serialization, and distributed computing. By applying techniques like zero-copy processing, lock-free structures, and RDMA networking, systems can achieve extreme performance.
Whether in finance, ad tech, or security, every microsecond matters.