Real-Time Market Data Processing: Designing Systems for Low Latency and High Throughput
In financial markets, real-time data processing is critical for trading, risk management, and decision-making. Market data systems must ingest and process millions of updates per second while ensuring ultra-low latency. During my time at Bloomberg and Two Sigma, I worked on optimizing such systems for speed and reliability.
In this article, I'd like to explore key challenges in real-time market data processing, design strategies, and optimizations, with code snippets where applicable. I'll try my best to keep things succinct without going into too much detail!
High-Performance Data Ingestion
Market data systems need to handle a continuous stream of updates from multiple exchanges with minimal latency. Traditional approaches using TCP-based brokers like Kafka introduce too much overhead. Instead, many trading firms rely on UDP multicast for market data distribution.
UDP Multicast for Low Latency
A lightweight, high-performance approach uses UDP multicast to distribute data to multiple consumers efficiently.
The program below joins the specified multicast group, binds a socket to the port, and continuously receives and prints any data other hosts send to the group. So, it pretty much acts as a data receiver. Simple!
// High-performance UDP multicast receiver (C++)
#include <iostream>
#include <cstring>      // memset
#include <arpa/inet.h>  // inet_addr, htons, htonl
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>
#define PORT 5000
#define GROUP "239.255.0.1" // Multicast group addresses are in the range of 224.0.0.0 to 239.255.255.255
#define BUFFER_SIZE 1024
int main() {
    // Create a UDP socket
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) {
        std::cerr << "Error creating socket" << std::endl;
        return -1;
    }
    // Allow multiple receivers on the same host to bind the same port
    int reuse = 1;
    setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof(reuse));
    // Set up the address structure
    struct sockaddr_in addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(PORT);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    // Bind the socket to the address
    if (bind(sock, (struct sockaddr*)&addr, sizeof(addr)) < 0) {
        std::cerr << "Error binding socket" << std::endl;
        close(sock);
        return -1;
    }
    // Join the multicast group
    struct ip_mreq mreq;
    mreq.imr_multiaddr.s_addr = inet_addr(GROUP);
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    if (setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq)) < 0) {
        std::cerr << "Error joining multicast group" << std::endl;
        close(sock);
        return -1;
    }
    // Receive and print messages (read one byte less than the buffer
    // size so there is always room for the null terminator)
    char buffer[BUFFER_SIZE];
    while (true) {
        ssize_t bytes_received = recv(sock, buffer, BUFFER_SIZE - 1, 0);
        if (bytes_received < 0) {
            std::cerr << "Error receiving message" << std::endl;
            close(sock);
            return -1;
        }
        buffer[bytes_received] = '\0'; // null-terminate the payload
        std::cout << "Received: " << buffer << std::endl;
    }
    return 0;
}
Why UDP Multicast?
- Eliminates the need for multiple TCP connections (and their per-connection setup and state overhead)
- Reduces overhead compared to Kafka or RabbitMQ (no centralized broker, message queue, or ack mechanism)
- Used by exchanges like Nasdaq and CME for tick data distribution, making it the de facto standard among major real-time tick distributors
Optimizing for Ultra-Low Latency
Real-time systems must minimize processing delays at every stage.
Zero-Copy Data Processing
Copying data between buffers adds unnecessary overhead. Using mmap-based shared memory avoids these copies. The C++ program below demonstrates zero-copy shared memory using mmap to publish a market data update; the lock-free ring buffer in the next subsection handles message passing in multi-threaded systems. Together they show efficient, thread-safe ways to share data between processes and threads, making them suitable for high-performance applications.
// Zero-copy shared memory writer using mmap (C++); link with -lrt on Linux
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstring>
#include <iostream>
const char* SHM_NAME = "/market_data";
const size_t SHM_SIZE = 4096;
int main() {
    // Create (or open) the shared memory object and size it
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0666);
    if (fd < 0 || ftruncate(fd, SHM_SIZE) < 0) {
        std::cerr << "shm_open/ftruncate failed" << std::endl;
        return -1;
    }
    // Map it into our address space; writes land directly in the
    // shared mapping, with no intermediate copies
    void* addr = mmap(nullptr, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        std::cerr << "mmap failed" << std::endl;
        return -1;
    }
    strcpy((char*)addr, "New Market Data Update");
    std::cout << "Shared memory updated!" << std::endl;
    munmap(addr, SHM_SIZE);
    close(fd);
    return 0;
}
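For completeness, here is a minimal reader-side sketch (my addition, assuming the writer above has already run) that maps the same region read-only; the reader sees the writer's bytes without any copying:
// Zero-copy shared memory reader (C++); a minimal sketch assuming
// the writer above has already created and populated /market_data
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <iostream>
int main() {
    int fd = shm_open("/market_data", O_RDONLY, 0666);
    if (fd < 0) {
        std::cerr << "shm_open failed (run the writer first?)" << std::endl;
        return -1;
    }
    // Map read-only; no data is copied between the processes
    void* addr = mmap(nullptr, 4096, PROT_READ, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        std::cerr << "mmap failed" << std::endl;
        return -1;
    }
    std::cout << "Read: " << (const char*)addr << std::endl;
    munmap(addr, 4096);
    close(fd);
    return 0;
}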
Lock-Free Data Structures
In a multi-threaded system, traditional locks (std::mutex) cause contention. Instead, use lock-free ring buffers for message passing.
// Lock-free single-producer/single-consumer ring buffer (C++)
#include <atomic>
#include <array>
#include <cstddef>
template <typename T, size_t Size>
class LockFreeQueue {
    // Fixed-size inline storage; one slot always stays empty to
    // distinguish full from empty, so effective capacity is Size - 1
    std::array<T, Size> buffer;
    std::atomic<size_t> head{0}, tail{0};
public:
    bool enqueue(const T& item) {
        size_t currentTail = tail.load(std::memory_order_relaxed);
        if ((currentTail + 1) % Size == head.load(std::memory_order_acquire))
            return false; // Buffer full
        buffer[currentTail] = item;
        tail.store((currentTail + 1) % Size, std::memory_order_release);
        return true;
    }
    bool dequeue(T& item) {
        size_t currentHead = head.load(std::memory_order_relaxed);
        if (currentHead == tail.load(std::memory_order_acquire))
            return false; // Buffer empty
        item = buffer[currentHead];
        head.store((currentHead + 1) % Size, std::memory_order_release);
        return true;
    }
};
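A quick usage sketch, assuming the LockFreeQueue above is in scope; note that this design is safe only for exactly one producer thread and one consumer thread:
// Usage sketch: one producer, one consumer
#include <thread>
#include <iostream>
int main() {
    LockFreeQueue<int, 1024> queue;
    std::thread producer([&] {
        for (int i = 0; i < 1000; ++i)
            while (!queue.enqueue(i)) {} // spin until a slot frees up
    });
    std::thread consumer([&] {
        int value, count = 0;
        while (count < 1000)
            if (queue.dequeue(value)) ++count;
        std::cout << "Consumed " << count << " messages" << std::endl;
    });
    producer.join();
    consumer.join();
    return 0;
}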
Why Lock-Free?
- Avoids context-switching delays from mutexes
- Prevents thread contention in high-throughput scenarios
- Used in real-time trading engines and order book matching
However, one might ask: why not use lock-free structures everywhere, then? The answer is that lock-free code is often more complex and error-prone than traditional locking, requiring a deep understanding of concurrency, atomic operations, and memory models. That complexity invites subtle bugs and makes the code hard to develop and maintain, so it is best reserved for specific hot paths like the one above.
Efficient Failover for High Availability
To ensure high availability, real-time systems require efficient failover mechanisms. Traditional failover approaches can introduce latency, which is unacceptable in high-performance applications.
Hot Standby Replication
Instead of a cold failover, a hot standby replica can take over instantly:
- Primary node: Processes market data and publishes results.
- Replica node: Runs the same pipeline but discards output until failover (see the sketch below).
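Here's a minimal sketch of the replica's output gate, assuming a hypothetical is_primary flag that an external health monitor flips on failover:
// Hot standby output gate (illustrative sketch; the health-check
// and promotion logic are assumed to live elsewhere)
#include <atomic>
#include <iostream>
#include <string>
std::atomic<bool> is_primary{false}; // flipped to true on failover
void process_update(const std::string& update) {
    // Both primary and replica run the full pipeline, so the
    // replica's state and caches stay warm at all times
    std::string result = "processed: " + update;
    // Only the active node publishes downstream
    if (is_primary.load(std::memory_order_acquire)) {
        std::cout << result << std::endl;
    }
}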
Real-Time Data Sync
Sync market data logs in near real-time using tools like rsync, e.g.:
rsync -az --progress --delete /data/market_data/ standby:/data/market_data/
RDMA for Fast Failover
Utilize RDMA (Remote Direct Memory Access) to replicate state directly between hosts' memory. The libibverbs calls below are fragments; they assume a queue pair (qp) and work requests (send_wr, recv_wr) that have already been set up:
ibv_post_send(qp, &send_wr, &bad_wr); // Post a send work request over RDMA
ibv_post_recv(qp, &recv_wr, &bad_wr); // Post a receive work request
Key advantages of these failover strategies:
- No TCP overhead, as RDMA transfers memory directly.
- Failover occurs in microseconds instead of milliseconds.
- Used in cross-region data replication for financial exchanges.
By incorporating these strategies, you can achieve efficient failover and minimize downtime in your real-time system.
Scaling for High Throughput - A Note on Partitioning!
So generally, one trades not just one but many (thousands of?!) securities (stock/option tickers), and handling millions of messages per second requires intelligent partitioning and parallelism; a single machine just can't handle that load (we are still not there with quantum chips just yet :)).
To mitigate this, a common approach is to shard/partition by ticker symbol and assign dedicated hosts to each shard. This ensures fair load distribution; a minimal sharding sketch follows.
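Here's a minimal sketch of symbol-based sharding; the shard count and the hash-based mapping are illustrative assumptions, not a production layout:
// Symbol-based sharding sketch (C++); NUM_SHARDS and the hash-based
// host mapping are illustrative assumptions
#include <string>
#include <functional>
#include <iostream>
constexpr size_t NUM_SHARDS = 16; // hypothetical shard count
// Hash the ticker so the same symbol always lands on the same shard
size_t shard_for(const std::string& ticker) {
    return std::hash<std::string>{}(ticker) % NUM_SHARDS;
}
int main() {
    for (const auto& ticker : {"AAPL", "MSFT", "GOOG", "TSLA"})
        std::cout << ticker << " -> shard " << shard_for(ticker) << std::endl;
    return 0;
}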
Furthermore, batching can be leveraged for throughput optimization: sending data in batches reduces network and disk I/O.
// Batch writes to persistent storage (database.bulkInsert stands in
// for your storage client's bulk-write API)
std::vector<std::string> batch;
batch.reserve(100);
for (auto& msg : marketDataQueue) {
    batch.push_back(msg);
    if (batch.size() >= 100) {
        database.bulkInsert(batch);
        batch.clear();
    }
}
if (!batch.empty())
    database.bulkInsert(batch); // Flush the final partial batch
Other Lessons from High-Frequency Trading (HFT)
So far we have covered receiving and reading data with failover. For the actual processing of the data, however, we need to make sure the processor does no other job but the processing itself. I highlight this because CPUs have multiple cores, each core can run multiple threads, and the OS schedules many tasks across them. This leads to context switching, where the CPU switches between different tasks, causing delays and reducing overall performance.
To mitigate this issue, we can use CPU sets (also known as CPU affinity or CPU pinning) to dedicate specific CPU cores to our data processing task. By doing so, we ensure that those cores are not interrupted by other tasks, allowing them to focus solely on processing our data.
CPU sets are basically a way to bind a process or thread to a specific set of CPU cores, ensuring that it runs exclusively on those cores. This is particularly useful in high-performance computing applications, such as financial trading platforms, where low latency and high throughput are critical.
By using CPU sets, we can:
- Reduce context switching and minimize delays
- Increase cache locality and reduce memory access times
- Improve overall system performance and responsiveness
Techniques from HFT apply to any real-time system:
- Pinned CPU threads – Assigning threads to specific cores using taskset or thread-affinity APIs (see the sketch after this list).
- NUMA-aware memory allocation – Ensures local memory access.
- Prefetching cache lines – Using _mm_prefetch() in C++ to optimize CPU cache access.
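As a concrete illustration of thread pinning, here's a minimal sketch using pthread_setaffinity_np on Linux; the choice of core 2 is an arbitrary example, and on the command line, taskset -c 2 ./your_app achieves the same for a whole process:
// Pin the calling thread to a single core (Linux-specific sketch;
// core 2 is an arbitrary example)
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <iostream>
int main() {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(2, &cpuset); // run exclusively on core 2
    if (pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset) != 0) {
        std::cerr << "Failed to set CPU affinity" << std::endl;
        return -1;
    }
    std::cout << "Thread pinned to core 2" << std::endl;
    // ... hot-path processing runs here without migrating cores ...
    return 0;
}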
Beyond Finance: Broader Applications
The same real-time techniques apply to:
- Ad Tech (RTB Systems) – Auctioning ads within tens of milliseconds.
- Cybersecurity – Analyzing real-time network traffic for threats.
- IoT & Telemetry – Handling sensor data at ultra-high throughput.
Conclusion
Building real-time market data systems requires deep optimizations across networking, memory, serialization, and distributed computing. By applying techniques like zero-copy processing, lock-free structures, and RDMA networking, systems can achieve extreme performance.
Whether in finance, ad tech, or security, every microsecond matters.