Real-Time Market Data Processing: Designing Systems for Low Latency and High Throughput

This article explores key challenges in real-time market data processing, design strategies, and optimizations—with code snippets where applicable.

By Vishal Jain · Jul. 03, 25 · Analysis

In financial markets, real-time data processing is critical for trading, risk management, and decision-making. Market data systems must ingest and process millions of updates per second while ensuring ultra-low latency. During my time at Bloomberg and Two Sigma, I worked on optimizing such systems for speed and reliability.

In this article, I'd like to explore key challenges in real-time market data processing, design strategies, and optimizations, with code snippets where applicable. I'll try my best to keep things succinct without going into too much detail!

High-Performance Data Ingestion

Market data systems need to handle a continuous stream of updates from multiple exchanges with minimal latency. Traditional TCP-based brokers like Kafka introduce too much overhead for this path. Instead, many trading firms rely on UDP multicast for market data distribution.

UDP Multicast for Low Latency

A lightweight, high-performance approach uses UDP multicast to distribute data to multiple consumers efficiently.

The program below simply joins the specified multicast group, binds a socket, and continuously receives and prints any data sent to the group by other hosts on the network. So, it pretty much acts as a data receiver. Simple!

// High-performance UDP multicast receiver (C++)
#include <iostream>
#include <cstring>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#define PORT 5000
#define GROUP "239.255.0.1" // Multicast group addresses are in the range 224.0.0.0 to 239.255.255.255
#define BUFFER_SIZE 1024

int main() {
    // Create a UDP socket
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) {
        std::cerr << "Error creating socket" << std::endl;
        return -1;
    }

    // Set up the address structure (zero-initialized)
    struct sockaddr_in addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(PORT);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);

    // Bind the socket to the address
    if (bind(sock, (struct sockaddr*)&addr, sizeof(addr)) < 0) {
        std::cerr << "Error binding socket" << std::endl;
        return -1;
    }

    // Join the multicast group
    struct ip_mreq mreq;
    mreq.imr_multiaddr.s_addr = inet_addr(GROUP);
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    if (setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq)) < 0) {
        std::cerr << "Error joining multicast group" << std::endl;
        return -1;
    }

    // Receive and print messages
    char buffer[BUFFER_SIZE];
    while (true) {
        // Leave one byte of room for the null terminator
        ssize_t bytes_received = recv(sock, buffer, BUFFER_SIZE - 1, 0);
        if (bytes_received < 0) {
            std::cerr << "Error receiving message" << std::endl;
            return -1;
        }
        buffer[bytes_received] = '\0'; // null-terminate the string
        std::cout << "Received: " << buffer << std::endl;
    }
    return 0;
}
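
For completeness, here is a minimal sender sketch (a hypothetical feed source; in production this role is played by the exchange's feed handler). It targets the same group and port as the receiver above:

// Minimal UDP multicast sender (C++), matching the receiver above
#include <iostream>
#include <cstring>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) {
        std::cerr << "Error creating socket" << std::endl;
        return -1;
    }

    // Destination: the multicast group and port the receiver joined
    struct sockaddr_in addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(5000);
    addr.sin_addr.s_addr = inet_addr("239.255.0.1");

    const char* msg = "AAPL 189.42 1000"; // hypothetical tick payload
    if (sendto(sock, msg, strlen(msg), 0, (struct sockaddr*)&addr, sizeof(addr)) < 0) {
        std::cerr << "Error sending message" << std::endl;
        return -1;
    }
    close(sock);
    return 0;
}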


Why UDP Multicast?

  • Eliminates per-consumer TCP connections and the connection overhead they bring
  • Reduces overhead compared to Kafka or RabbitMQ (no centralized broker, message queue, or ack mechanism)
  • Used by exchanges like Nasdaq and CME for tick data distribution; it is the de facto standard among real-time tick distributors

Optimizing for Ultra-Low Latency

Real-time systems must minimize processing delays at every stage.

Zero-Copy Data Processing

Copying data between buffers adds unnecessary overhead. Using mmap-based shared memory avoids these copies. The C++ program below demonstrates zero-copy shared memory with mmap to publish a market data update; the section after it adds a lock-free ring buffer for message passing in multi-threaded systems. Together they show efficient, thread-safe ways to share data between processes and threads, making them suitable for high-performance applications.

// Zero-copy shared memory writer using mmap (C++)

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstring>
#include <iostream>

const char* SHM_NAME = "/market_data";

int main() {
    // Create (or open) the shared memory object and size it
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0666);
    if (fd < 0) {
        std::cerr << "shm_open failed" << std::endl;
        return -1;
    }
    if (ftruncate(fd, 4096) < 0) {
        std::cerr << "ftruncate failed" << std::endl;
        return -1;
    }

    // Map the object into our address space; writes land directly
    // in the shared page, with no intermediate copy
    void* addr = mmap(0, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        std::cerr << "mmap failed" << std::endl;
        return -1;
    }

    strcpy((char*)addr, "New Market Data Update");
    std::cout << "Shared memory updated!" << std::endl;

    munmap(addr, 4096);
    close(fd);
    return 0;
}
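
The consuming process simply maps the same region read-only; here is a minimal reader sketch that pairs with the writer above (on older Linux systems, link both programs with -lrt):

// Zero-copy shared memory reader using mmap (C++)
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <iostream>

int main() {
    int fd = shm_open("/market_data", O_RDONLY, 0666);
    if (fd < 0) {
        std::cerr << "shm_open failed (has the writer run?)" << std::endl;
        return -1;
    }

    void* addr = mmap(0, 4096, PROT_READ, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        std::cerr << "mmap failed" << std::endl;
        return -1;
    }

    // Read straight from the shared page: no copy into a private buffer
    std::cout << "Read: " << (const char*)addr << std::endl;

    munmap(addr, 4096);
    close(fd);
    return 0;
}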

Lock-Free Data Structures

In a multi-threaded system, traditional locks (std::mutex) cause contention. Instead, use lock-free ring buffers for message passing.

// Lock-free ring buffer (C++): safe for one producer thread and one consumer thread

#include <atomic>
#include <vector>

template <typename T, size_t Size>
class LockFreeQueue {
    std::vector<T> buffer = std::vector<T>(Size); // note: buffer{Size} would pick the initializer-list constructor for arithmetic T, creating a 1-element vector
    std::atomic<size_t> head{0}, tail{0};

public:
    bool enqueue(const T& item) {
        size_t currentTail = tail.load(std::memory_order_relaxed);
        if ((currentTail + 1) % Size == head.load(std::memory_order_acquire))
            return false; // Buffer full
        buffer[currentTail] = item;
        tail.store((currentTail + 1) % Size, std::memory_order_release);
        return true;
    }

    bool dequeue(T& item) {
        size_t currentHead = head.load(std::memory_order_relaxed);
        if (currentHead == tail.load(std::memory_order_acquire))
            return false; // Buffer empty
        item = buffer[currentHead];
        head.store((currentHead + 1) % Size, std::memory_order_release);
        return true;
    }
};
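
A quick usage sketch, assuming the LockFreeQueue template above is in scope (compile with g++ -pthread); note again that this queue is only safe with exactly one producer thread and one consumer thread:

// Usage sketch: one producer thread, one consumer thread
#include <iostream>
#include <thread>

int main() {
    LockFreeQueue<int, 1024> queue;

    std::thread producer([&] {
        for (int i = 0; i < 100; ++i)
            while (!queue.enqueue(i)) {} // spin until space frees up
    });

    std::thread consumer([&] {
        int value;
        for (int received = 0; received < 100;) {
            if (queue.dequeue(value))
                ++received; // a real consumer would process the tick here
        }
    });

    producer.join();
    consumer.join();
    std::cout << "All messages passed through the ring buffer" << std::endl;
    return 0;
}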


Why Lock-Free?

  • Avoids context-switching delays from mutexes
  • Prevents thread contention in high-throughput scenarios
  • Used in real-time trading engines and order book matching

However, one might ask: why don't we use lock-free structures everywhere, then? The answer is that they are often more complex and error-prone than traditional locking mechanisms, requiring a deep understanding of concurrency, atomic operations, and memory models. This complexity can lead to subtle bugs, making lock-free code difficult to develop and maintain. Hence, it is typically reserved for specific hot paths such as the ones above.

Efficient Failover for High Availability

To ensure high availability, real-time systems require efficient failover mechanisms. Traditional failover approaches can introduce latency, which is unacceptable in high-performance applications.

Hot Standby Replication

Instead of cold failover, a hot standby replica can take over instantly.

  • Primary node: Processes market data and publishes results.
  • Replica node: Runs the same pipeline but discards output until failover (see the sketch below).
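
A minimal sketch of the replica's output gate follows (the names here are hypothetical; in practice the failover signal would come from a heartbeat monitor or cluster manager rather than a local flag):

// Hot standby output gate (C++): a minimal, hypothetical sketch
#include <atomic>
#include <iostream>
#include <string>

std::atomic<bool> isPrimary{false}; // flipped to true when failover promotes this node

void onMarketData(const std::string& update) {
    // Both primary and replica run the full processing pipeline...
    std::string result = update; // (real processing would happen here)

    // ...but only the active node publishes downstream, so promotion is instant
    if (isPrimary.load(std::memory_order_acquire)) {
        std::cout << "Published: " << result << std::endl; // stand-in for a real publish
    }
}

int main() {
    onMarketData("AAPL 189.42"); // replica mode: processed, output discarded
    isPrimary.store(true, std::memory_order_release); // failover signal arrives
    onMarketData("AAPL 189.43"); // now primary: processed and published
    return 0;
}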

Real-Time Data Sync

Sync market data logs in real time using tools like rsync, for example:

rsync -az --progress --delete /data/market_data/ standby:/data/market_data/


RDMA for Fast Failover

Utilize RDMA (Remote Direct Memory Access) for fast and efficient failover:

// Fragment: assumes an initialized queue pair (qp) and prepared work requests
ibv_post_send(qp, &send_wr, &bad_wr); // Post a send work request over RDMA
ibv_post_recv(qp, &recv_wr, &bad_wr); // Post a receive work request


Key advantages of these failover strategies:

  • No TCP overhead, as RDMA transfers memory directly.
  • Failover occurs in microseconds instead of milliseconds.
  • Used in cross-region data replication for financial exchanges.

By incorporating these strategies, you can achieve efficient failover and minimize downtime in your real-time system.

Scaling for High Throughput - A Note on Partitioning!

So generally, one trades not just one but many (thousands of?!) securities (stock/option tickers), and hence handling millions of messages per second requires intelligent partitioning and parallelism; a single machine just can't handle that load (we are still not there with quantum chips just yet :)).

To mitigate this, one effective approach is to shard/partition by ticker symbol and assign dedicated hosts to each shard. This ensures fair load distribution; a sketch follows below. Furthermore, batching can also be leveraged for throughput optimization.
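
As an illustration of symbol-based sharding (the host names and the simple hash-mod scheme here are illustrative assumptions, not a production router):

// Shard market data by ticker symbol (C++): minimal sketch
#include <functional>
#include <iostream>
#include <string>
#include <vector>

int main() {
    // Hypothetical pool of processing hosts
    std::vector<std::string> hosts = {"md-host-0", "md-host-1", "md-host-2", "md-host-3"};

    // All updates for a given symbol hash to the same host,
    // so per-symbol ordering is preserved
    auto hostFor = [&](const std::string& symbol) {
        return hosts[std::hash<std::string>{}(symbol) % hosts.size()];
    };

    std::cout << "AAPL -> " << hostFor("AAPL") << std::endl;
    std::cout << "MSFT -> " << hostFor("MSFT") << std::endl;
    return 0;
}

A plain hash-mod keeps each symbol's stream on one host and preserves per-symbol ordering, though unusually hot symbols may still warrant dedicated hosts.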

Sending data in batches reduces network and disk I/O:

// Batch writes to persistent storage

std::vector<std::string> batch;

for (auto& msg : marketDataQueue) {
    batch.push_back(msg);
    if (batch.size() >= 100) {
        database.bulkInsert(batch);
        batch.clear();
    }
}

// Flush any remaining messages that didn't fill a full batch
if (!batch.empty()) {
    database.bulkInsert(batch);
    batch.clear();
}


Other Lessons from High-Frequency Trading (HFT)

So far, we have learned how to receive and read data with failover. However, for the actual processing of the data itself, we need to make sure the processor does no other job but the processing. I highlight this because CPUs have multiple cores, and each core can execute multiple threads. This leads to context switching, where the CPU switches between different tasks, causing delays and reducing overall performance.

To mitigate this issue, we can use CPU sets (also known as CPU affinity or CPU pinning) to dedicate specific CPU cores to our data processing task. By doing so, we ensure that the CPU cores are not interrupted by other tasks, allowing them to focus solely on processing our data.

CPU sets are basically a way to bind a process or thread to a specific set of CPU cores, ensuring that it runs exclusively on those cores. This can be particularly useful in high-performance computing applications, such as financial trading platforms, where low latency and high throughput are critical.

By using CPU sets, we can:

  • Reduce context switching and minimize delays
  • Increase cache locality and reduce memory access times
  • Improve overall system performance and responsiveness

Techniques from HFT apply to any real-time system:

  • Pinned CPU threads – Assigning threads to specific cores using taskset (see the sketch after this list).
  • NUMA-aware memory allocation – Ensures local memory access.
  • Prefetching cache lines – Using _mm_prefetch() in C++ to optimize CPU cache access.
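
Pinning can also be done programmatically; here is a minimal Linux-specific sketch using pthread_setaffinity_np (core 2 is an arbitrary choice):

// Pin the current thread to one CPU core (Linux; compile with g++ -pthread)
#include <pthread.h>
#include <sched.h>
#include <iostream>

int main() {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(2, &cpuset); // dedicate core 2 to this thread (arbitrary choice)

    int rc = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
    if (rc != 0) {
        std::cerr << "Failed to set CPU affinity: " << rc << std::endl;
        return -1;
    }

    std::cout << "Thread pinned to core 2" << std::endl;
    // ... run the hot market data processing loop here ...
    return 0;
}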

Beyond Finance: Broader Applications

Similar real-time techniques apply to:

  • Ad Tech (RTB Systems) – Auctioning ads within tens of milliseconds.
  • Cybersecurity – Analyzing real-time network traffic for threats.
  • IoT & Telemetry – Handling sensor data at ultra-high throughput.

Conclusion

Building real-time market data systems requires deep optimizations across networking, memory, serialization, and distributed computing. By applying techniques like zero-copy processing, lock-free structures, and RDMA networking, systems can achieve extreme performance.

Whether in finance, ad tech, or security, every microsecond matters.
