Low Latency Microservices, a Retrospective

In this article, I share some of the important lessons we learned after five years of developing and supporting low latency microservices.

By Peter Lawrey · Dec. 11, 21 · Opinion

I wrote an article on low latency microservices almost five years ago now. In that time, my company has worked with a number of tier-one investment banks to implement and support those systems. What has changed in that time and what lessons have we learned?


Separation of Concerns Gives Better Testability

Microservices repeatedly demonstrated that testing and debugging business logic is much easier with simple, stand-alone components that have clear contracts between them.

While we started with unit tests, in 2017 we moved almost entirely to behavior-driven development of our microservices. Unit tests are still used for lower-level libraries and utilities. As our microservices are all based on the Kappa Architecture, all our behavior-driven tests are modeled as a series of events into and out of the service.

An input test might look like this:

YAML
 
---
oms: OMS1 # the service to receive this message
newOrder: {
  eventId: orderevent1,
  eventTime: 2017-04-27T07:26:40.9836487,
  triggerTime: 2017-05-05T12:56:40.7534873,
  instrument: USDCHF,
  settlementTime: 2017-04-26T09:46:40,
  market: FXALL_MID,
  orderId: orderid1,
  hedgerName: Hedger1,
  user: MidHedger,
  orderType: MARKET,
  side: SELL,
  quantity: 500E3,
  maxShowQuantity: 500E3,
  timeInForceType: GTC,
  timeInForceExpireTime: 2018-01-01T01:00:00
}
---
# more messages


The output looks very similar, as this is an Order Management Service whose job is to filter and track orders.

YAML
 
---
newOrder: {
  eventId: orderevent1,
  eventTime: 2017-04-27T07:26:40.9836487,
  triggerTime: 2017-05-05T12:56:40.7534873,
  instrument: USDCHF,
  settlementTime: 2017-04-26T09:46:40,
  market: FXALL_MID,
  orderId: orderid1,
  hedgerName: Hedger1,
  orderType: MARKET,
  quantity: 500E3,
  user: MidHedger,
  side: SELL,
  maxShowQuantity: 500E3,
  timeInForceType: GTC,
  timeInForceExpireTime: 2018-01-01T01:00:00
}
---
# more results


It is easy to build variations on these tests to explore everything that could go wrong and check how it is handled.

What We Needed to Add

Beyond implementing what we envisioned five years ago, there were some features we discovered we needed to add.

  • A deterministic clock

To ensure our services produced the same results every time, whether in tests or between production and any redundant system, we made time an input. This appears in our tests like this:

YAML
 
periodicUpdate: 2017-04-27T07:26:51
---


This ensured that all time-outs or events triggered by the clock could be tested, but it also ensured that each redundant system did the same thing at the same point and produced the same output.
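
As a rough sketch of the idea (in Java, with illustrative names rather than our actual API), the service never reads the system clock directly; it only sees time that arrives as an event, so replaying the same event stream always yields the same decisions:

Java
 
import java.time.Instant;

// Illustrative sketch only: the service reads time from a field that is
// advanced by input events, never from System.currentTimeMillis(), so
// replaying the same event stream always produces the same output.
public final class EventDrivenClock {
    private long currentTimeNanos; // advanced only by time events such as periodicUpdate

    /** Called when a periodicUpdate (or any timestamped) event is read. */
    public void onTimeUpdate(Instant eventTime) {
        long nanos = eventTime.getEpochSecond() * 1_000_000_000L + eventTime.getNano();
        if (nanos > currentTimeNanos)
            currentTimeNanos = nanos;
    }

    /** Business logic asks this clock for "now", e.g. to expire GTC orders. */
    public long currentTimeNanos() {
        return currentTimeNanos;
    }
}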

  • Nanosecond timestamps

We started with millisecond timestamps but quickly found we needed greater resolution, so we switched to microsecond timestamps; we now use nanosecond resolution timestamps.

We found that nanosecond timestamps were even more useful if we could also ensure some level of uniqueness, so we added support for making nanosecond timestamps unique on a single host in a low latency way; this takes well under 100 nanoseconds. A timestamp can then be used as a unique id for tracing events through a system (adding a pre-set host id if needed).
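
One simple way to picture host-local uniqueness is to nudge the clock forward by at least one nanosecond per call, as in this minimal sketch; it is only an illustration of the idea, not our actual implementation, and it assumes a coarse wall-clock reading is an acceptable base value:

Java
 
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch: a wall-clock reading in nanoseconds, nudged forward by
// at least 1 ns per call, so every timestamp issued on this host is unique
// and can double as an event id for tracing.
public final class UniqueNanoTime {
    private static final AtomicLong LAST = new AtomicLong();

    public static long uniqueNanos() {
        long wall = System.currentTimeMillis() * 1_000_000L; // coarse base value in ns
        while (true) {
            long last = LAST.get();
            long next = Math.max(wall, last + 1); // strictly increasing on this host
            if (LAST.compareAndSet(last, next))
                return next;
        }
    }
}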

  • Ability to store complex data as primitives

For testing purposes, all messages appear as text; however, for performance reasons, all data is written and read in a binary form. The typical latency of a persisted message between microservices is less than a microsecond, so how the objects are stored can make a big difference.

The use of String and LocalDateTime was very expensive for our use case. To reduce the impact of storing this information, we developed a number of strategies for:

  • Encoding Strings and dates in long fields (see the sketch after this list)
  • Object pooling Strings
  • Storing text in a mutable/reusable field
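
As a simplified illustration of the first strategy (not our actual encoding), a short ASCII identifier such as a currency pair can be packed into a single long, avoiding any object allocation:

Java
 
// Simplified illustration: pack an ASCII string of up to 8 characters into a
// single long, so a field such as the instrument "USDCHF" needs no object at all.
public final class ShortText {
    public static long encode(CharSequence s) {
        if (s.length() > 8)
            throw new IllegalArgumentException("up to 8 ASCII characters only");
        long packed = 0;
        for (int i = 0; i < s.length(); i++)
            packed = (packed << 8) | (s.charAt(i) & 0x7F);
        return packed;
    }

    public static String decode(long packed) {
        StringBuilder sb = new StringBuilder(8);
        for (; packed != 0; packed >>>= 8)
            sb.insert(0, (char) (packed & 0x7F));
        return sb.toString();
    }
}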

Ultimately, this led to the support of Trivially Copyable objects, a concept adopted from C++, where most (or all) of a Java object can be copied as a memory copy without any serialization logic. This allowed us to pass complex market data with around 50 fields between microservices in different processes in well under a microsecond most of the time.
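
The effect can be pictured with a plain ByteBuffer: when every field has a fixed offset and a primitive representation, the whole record can be moved as one block copy with no per-field serialization logic. This is a rough sketch of the concept rather than the actual implementation:

Java
 
import java.nio.ByteBuffer;

// Rough illustration of a "trivially copyable" layout: every field is a
// primitive at a fixed offset, so the whole record can be block-copied between
// memory, a queue and the wire without per-field serialization code.
final class QuoteRecord {
    static final int SIZE = 8 + 8 + 8 + 8; // instrument, timeNanos, bid, ask

    static void write(ByteBuffer buf, long instrument, long timeNanos,
                      double bid, double ask) {
        buf.putLong(0, instrument)
           .putLong(8, timeNanos)
           .putDouble(16, bid)
           .putDouble(24, ask);
    }

    static double readBid(ByteBuffer buf) {
        return buf.getDouble(16);
    }
}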

  • Ability to run microservices as a single process or a single thread

One problem with microservices is that they can be less productive to develop and integration-test. To solve this, we created a means of running multiple microservices in a single container/JVM, or even on a single thread. This gave our customers the flexibility to run parts of the system as a compound microservice in a test environment while still deploying individual microservices independently in production.
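
A minimal sketch of that idea, using illustrative interfaces rather than our actual API: each service is just an event handler, and a compound runner invokes several of them on the caller's thread, so an integration test can drive the whole pipeline deterministically in one process:

Java
 
import java.util.List;

// Illustrative sketch: each microservice is an event handler; a compound
// runner invokes several of them on the caller's thread, so the whole
// pipeline can run in one process (or one thread) for tests and debugging.
interface EventHandler {
    void onEvent(Object event);
}

final class CompoundService implements EventHandler {
    private final List<EventHandler> services;

    CompoundService(List<EventHandler> services) {
        this.services = services;
    }

    @Override
    public void onEvent(Object event) {
        // In production each service would read its own queue in its own
        // process; here they all run in order on the calling thread.
        for (EventHandler service : services)
            service.onEvent(event);
    }
}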

  • Simplifying restarts with idempotency

Idempotency allows an operation to be harmlessly attempted multiple times. Restarts are simplified by replaying any messages you are not sure were fully processed.

In low latency systems, transactionality needs to be as simple as possible. This suits most needs; however, when you do have a complex transaction, idempotency can significantly reduce the complexity of recovery.
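
A minimal sketch of how replay becomes harmless, assuming each message carries a strictly increasing event id (for example, the unique nanosecond timestamp above); this is an illustration, not our actual recovery code:

Java
 
// Illustrative sketch: replayed messages are detected by their event id and
// skipped, so replaying anything that might not have been fully processed
// is harmless.
final class IdempotentHandler {
    private long lastProcessedId = Long.MIN_VALUE; // persisted alongside outputs in practice

    void onEvent(long eventId, Runnable businessLogic) {
        if (eventId <= lastProcessedId)
            return; // already processed: the replay is a no-op
        businessLogic.run();
        lastProcessedId = eventId;
    }
}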

  • Multiple replay strategies

Having a complete, deterministic, persisted record of every message made restarts and failover simpler; however, we found that different use cases needed different replay strategies. We have learned a lot about real replayability/restartability requirements from customers' actual use cases and the trade-offs they make between service start time and accuracy.

One key feature is ensuring that upstream messages are replicated before downstream ones, to simplify rebuilding the current state.

What Became Less Important

Some of the features we thought would be vital five years ago turned out not to always be required, as clients used our technology in broader contexts.

  • Ultra-low garbage collection

Five years ago, we were entirely focused on producing no minor GC collections over a day of processing. However, many clients wanted the productivity gains we offered but didn't have such stringent performance requirements, and occasional garbage collections were not a concern. Building with microservices naturally supports having latency-critical and less latency-critical services in the same system.

This required a shift in some of our thinking, which had assumed GCs never normally occurred. For example, we significantly reduced our use of WeakReferences.

  • Low throughput data processing

The lowest latencies tend to be achieved at around 1% of peak capacity. At lower throughputs, your hardware tends to try to save power, increasing latency when an event does occur; at higher throughputs, you increase the chance of a message arriving while you are still processing previous ones, adding to latency.

Again, we saw a broadening of use cases to very low message rates, which showed up behavior that is only seen when the machine spends most of its time waiting. We added performance test tools to see how our code behaves when the CPU is running cold.

We also saw the need to support higher message throughputs for hours at a time (rather than seconds or minutes). For example, if we had a microservice that could process a million messages per second, we would previously test its latency at 1% of that rate, as this was considered the normal volume. It also wasn't possible to get high-performance, high-capacity drives that could sustain the full rate for hours without filling up. Today, we are testing the latency of systems sustaining bursts of one million messages per second for many hours at a time.

If you are looking for such a drive that you can test on a desktop, I suggest looking at the Corsair MP600 Pro series.

Make Your Infrastructure as Fast as Your Application Needs

Over the last five years, the requirements for core systems have become more stringent; however, as we need to integrate with existing systems, we have also seen the need to easily support systems that don't have the same requirements (and whose teams would rather the integration be easy and natural to work with).

For the more stringent systems, the latencies clients care about:

  • Have moved from the 99%ile (worst 1 in 100) to the 99.9%ile or 99.99%ile
  • Are measured wire to wire (on the network), with targets of around 20 microseconds at the 99.99%ile or 100 microseconds at the 99.9%ile
  • Come with a clearer idea of the worst-case latencies they need to see; many are looking for a worst case of low milliseconds end to end

At the same time, our clients need to integrate with existing systems, where all they want is for that integration to be as easy as possible.

Make the Message Format a Configuration Consideration

Five years ago, I imagined we would need to support all sorts of formats; however, a relatively small number turned out to be really useful:

  • An existing format: make it easy to use a corporate-standard format, with no need to use ours
  • A text format: YAML seems the best one, especially for readability and typed data
  • The binary form of YAML: a good trade-off between ease of use and performance
  • Typed JSON, for use over WebSockets to JavaScript-based UIs
  • Marshaling of fields as binary
  • Direct memory copy of objects (e.g., Trivially Copyable) to maximize speed
  • The FIX protocol

We regularly use a single DTO in multiple formats, depending on what is most appropriate, without the need to copy data between DTOs specialized for a given format.

Conclusion

For our clients' use cases, I believe most of the concerns that would push you to replace microservices with a monolith have been solved. However, having the option to run multiple microservices as if they were a monolith handles those cases where that is easier to work with, e.g., testing and debugging multiple services.


Opinions expressed by DZone contributors are their own.
