Resilience Lost in the Stack: How Abstraction Layers Silently Mask Distributed Systems’ Topology Awareness

This article examines distributed systems resilience, specifically the integration gap between application abstractions and high-availability infrastructure layers.

By Rithra Ravikumar · Jul. 03, 26 · Tutorial

Distributed coordination services (DCS) are the CPUs of distributed systems: they are what gives those systems their high availability (HA). When one is in your stack, you assume failover is handled. Services that operate in this layer include Apache ZooKeeper, etcd, and Redis Sentinel. ZooKeeper and etcd are built on consensus protocols (ZAB and Raft, respectively), and Sentinel relies on a quorum-based leader election to drive failover. The working assumption is that the DCS itself stays correct as long as a quorum of nodes is available.

Here, we explore one specific problem that makes this high availability conditional. Each individual layer keeps its promise, but as we move to higher-level abstractions, the intelligence silently dies. This article focuses on how topology awareness must be deliberately preserved as we move up the stack, and argues that, when using smart clients and drivers, we inherit the responsibility not to silence their intelligence.

We take a use case from the Java ecosystem. In a production-grade system, we have microservices distributed across multiple regions, each responsible for specific components. We use rate limiting to demonstrate the underlying problem. This is only one area where the problem manifests; other architectures built the same way can have the same problem under the hood.

Deconstructing the Stack

The use case: a production-grade rate limiter implemented with Bucket4j.

A distributed rate limiter needs to keep track of token buckets across multiple instances. Assume the application runs as 3 instances and we want to rate-limit requests to 5 req/sec in total. Enforcing this strict global limit across 3 instances requires a centralized state coordinator, so we introduce Redis to store the token bucket state centrally. This decouples the bucket state from the application instances; otherwise, each instance would hand out its own 5 tokens, for an effective limit of 15 req/sec.
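
The arithmetic behind that requirement can be sketched in a few lines. This is an illustration only: the class and method names are hypothetical, and the bucket is a plain in-memory counter for a single second with no refill, not Bucket4j. Three instances that each own a private 5-token bucket admit 15 requests, while three instances that share one bucket admit 5.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: a one-second snapshot of a token bucket, without refill.
class BucketMath {

    // Hypothetical bucket holding a fixed number of tokens for the current second.
    static final class SimpleBucket {
        private final AtomicInteger tokens;

        SimpleBucket(int capacity) {
            this.tokens = new AtomicInteger(capacity);
        }

        // Takes one token if any is left; returns false once the bucket is empty.
        boolean tryConsume() {
            while (true) {
                int current = tokens.get();
                if (current <= 0) {
                    return false;
                }
                if (tokens.compareAndSet(current, current - 1)) {
                    return true;
                }
            }
        }
    }

    // Sends requestsPerInstance requests through each instance's bucket and counts admissions.
    static int admitted(SimpleBucket[] bucketPerInstance, int requestsPerInstance) {
        int admitted = 0;
        for (SimpleBucket bucket : bucketPerInstance) {
            for (int i = 0; i < requestsPerInstance; i++) {
                if (bucket.tryConsume()) {
                    admitted++;
                }
            }
        }
        return admitted;
    }

    // Each of the three instances owns a private 5-token bucket.
    static int admittedWithLocalBuckets() {
        SimpleBucket[] local = { new SimpleBucket(5), new SimpleBucket(5), new SimpleBucket(5) };
        return admitted(local, 10);
    }

    // All three instances point at the same centrally stored 5-token bucket.
    static int admittedWithSharedBucket() {
        SimpleBucket shared = new SimpleBucket(5);
        SimpleBucket[] sameBucket = { shared, shared, shared };
        return admitted(sameBucket, 10);
    }

    public static void main(String[] args) {
        System.out.println("local buckets: " + admittedWithLocalBuckets());  // 15
        System.out.println("shared bucket: " + admittedWithSharedBucket());  // 5
    }
}
```

In the real system, the shared bucket lives in Redis rather than in process memory, which is exactly why the connection to Redis becomes critical to the limiter's correctness.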

Redis is an in-memory data store that is commonly used as a distributed cache. In Java, if you're using Redis, you're likely talking to it through Lettuce, Redisson, or Jedis. Lettuce is a widely adopted Redis client; here it acts as the asynchronous engine that bridges the Java application layer and the Redis infrastructure.

A sample boilerplate for the connection configuration would look something like:

Java
 
@Bean
public ProxyManager<String> proxyManager() {
    // masterId and nodes ("host:port" strings) are assumed to come from the application configuration.
    RedisURI.Builder builder = RedisURI.builder()
            .withSentinelMasterId(masterId)
            .withTimeout(Duration.ofSeconds(12));

    // Register every Sentinel endpoint with the URI builder.
    for (String node : nodes) {
        String[] hostPort = node.split(":");
        builder.withSentinel(hostPort[0], Integer.parseInt(hostPort[1]));
    }

    RedisClient redisClient = RedisClient.create(builder.build());
    RedisCodec<String, byte[]> bucket4jCodec = RedisCodec.of(StringCodec.UTF8, ByteArrayCodec.INSTANCE);
    // A plain StatefulRedisConnection: the connection type examined in the rest of this article.
    StatefulRedisConnection<String, byte[]> redisConnection = redisClient.connect(bucket4jCodec);

    return LettuceBasedProxyManager.builderFor(redisConnection)
            .withClientSideConfig(ClientSideConfig.getDefault())
            .build();
}


This code looks straightforward in terms of establishing the connection.

  • We build a RedisURI from the Sentinel master ID and the list of Sentinel nodes.
  • We feed this URI to the Lettuce RedisClient.
  • We open a StatefulRedisConnection, Lettuce's standard connection type, with a codec that matches Bucket4j's key and value types.
  • We configure a LettuceBasedProxyManager that uses this connection to interact with Redis.

This code compiles, passes all integration tests, and enforces the rate limit, yet it silently harbors a topology blindness that will cause the application to stall during a failover.

The Failure Mode: What It Looks Like

During a failover, when the Redis master suddenly drops, the Redis Sentinel infrastructure kicks in and promotes a healthy replica to master. It does so by holding a rapid quorum vote, and high availability is achieved. But on the application side, we experience the following:

Thread Stalls and Application Freeze

The app does not throw any obvious connection errors; instead, it freezes and waits for Redis to recover. The executing thread stalls and never recovers. Meanwhile, Bucket4j keeps receiving requests, and every token-bucket operation is routed to a broken Redis connection.

The Illusion of Network Failure

Internally, Lettuce has mechanisms to handle this type of failure: it transparently buffers and queues commands while trying to reconnect. But Bucket4j has no visibility into this and keeps waiting on Lettuce. Even a generous timeout on the application side does not help in this situation.

The Rate Limiter Paradox

While some threads are stalled, other parts of the application may continue. For instance, the endpoint might still receive requests and route them to Bucket4j, but if the resulting exception is swallowed, no actual rate limiting occurs. Bucket4j keeps talking to a broken connection and cannot reliably keep track of tokens.
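
One way to make this failure visible rather than silent is to wrap the limiter call in an explicit deadline and an explicit fallback decision. The sketch below uses only the JDK; `FailClosedLimiter`, `Decision`, `decide`, and the `CompletableFuture<Boolean>` that stands in for an asynchronous token check are hypothetical names, not Bucket4j or Lettuce APIs. It does not cure topology blindness. It only turns an indefinite wait or a swallowed exception into a deliberate fail-closed answer.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class FailClosedLimiter {

    enum Decision { ALLOW, RATE_LIMITED, BACKEND_UNAVAILABLE }

    // tokenCheck stands in for an asynchronous "try to consume one token" call.
    static Decision decide(CompletableFuture<Boolean> tokenCheck, long timeoutMillis) {
        try {
            boolean consumed = tokenCheck.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return consumed ? Decision.ALLOW : Decision.RATE_LIMITED;
        } catch (TimeoutException | ExecutionException e) {
            // Fail closed: a stalled or failed backend call never lets traffic through unchecked.
            return Decision.BACKEND_UNAVAILABLE;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Decision.BACKEND_UNAVAILABLE;
        }
    }

    public static void main(String[] args) {
        // A healthy backend answers immediately.
        System.out.println(decide(CompletableFuture.completedFuture(true), 100));
        // A future that never completes models a command queued on a dead connection.
        System.out.println(decide(new CompletableFuture<>(), 100));
    }
}
```

A caller could map BACKEND_UNAVAILABLE to an HTTP 503, which keeps the outage distinguishable from a genuine 429. Whether to fail closed or fail open is a product decision; the point is that the choice is made explicitly.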

The Wrapper Deficit

While Lettuce does have a StatefulRedisMasterReplicaConnection that comes with topology awareness, Bucket4j never exposes a wrapper for it. A Bucket4j user may or may not be aware of the connection types available in Lettuce. Engineers who are not aware of them naturally instantiate static connections and can easily overlook this. The result is code that seems to handle failovers but is completely devoid of master-replica awareness.

System architecture with the topology awareness blindspot


The Abstraction Blindspot

This is hard to catch precisely because every layer seems to work and does its job correctly. Redis Sentinel experiences a failover and successfully recovers. Lettuce, the client that interacts with Redis, has a StatefulRedisMasterReplicaConnection that is capable of knowing that this event has occurred. Bucket4j is responsible for token buckets and rate limiting, and it is also doing its job well.

The problem happens in the composition. This brings us to a broader principle: abstraction layers do not just simplify complexity; they can also silently suppress capabilities. This is exactly the point in the stack where HA awareness breaks.

Why did testing not expose this?

  • Unit tests rarely uncover these types of errors.
  • Load and performance tests focus on high throughput under normal conditions, verifying that the rate limiter functions correctly and handles throttling.
  • Health checks and readiness probes target the wrong layer: they confirm that Redis is available, not that the application's connection to it is usable.

The Solution Blueprint

To solve the thread stall and force the application layer to inherit the high availability of Redis, we have to preserve topology awareness at every layer of the stack. The fix requires choosing the exact Lettuce connection interface that tracks topology shifts while preserving the raw command execution engine.

Navigating Lettuce’s Topology Refresh Options

| Connection interface | Target Redis infrastructure | Use case |
|---|---|---|
| StatefulRedisConnection | Standalone node | General single-node use; blind to topology changes. |
| StatefulRedisClusterConnection | Sharded cluster | Data partitioning across many nodes. |
| StatefulRedisPubSubConnection | Messaging channels | Real-time pub/sub event listening. |
| StatefulRedisSentinelConnection | Sentinel nodes directly | Administrative tracking and master discovery. |
| StatefulRedisMasterReplicaConnection | Primary + replicas via Sentinel | Dynamic health tracking, automatic failover, and read/write splitting. |


Why We Choose StatefulRedisMasterReplicaConnection over StatefulRedisSentinelConnection

When working with Redis with the explicit goal of inheriting Sentinel's properties, the intuitive solution is to reach for StatefulRedisSentinelConnection. However, a closer look at the source code shows that StatefulRedisSentinelConnection extends only the basic StatefulConnection interface, and its async command set exposes only Sentinel APIs.

Java
 
public interface StatefulRedisSentinelConnection<K, V> extends StatefulConnection<K, V> {
    RedisSentinelAsyncCommands<K, V> async();
}

// Sneak peek inside RedisSentinelAsyncCommands:

RedisFuture<List<Map<K, V>>> slaves(K key);

RedisFuture<String> failover(K key);

RedisFuture<String> monitor(K key, String ip, int port, int quorum);

RedisFuture<Long> reset(K key);


Note that standard data manipulation commands such as GET, SET, and HGET are absent from this abstraction, as are the script evaluation commands that a wrapping client like Bucket4j needs in order to execute its Lua scripts. This shows that the interface was built to manage the Sentinel deployment, not to read and write application data.

On the other hand, StatefulRedisMasterReplicaConnection directly extends StatefulRedisConnection, inheriting the complete data manipulation layer.

Java
 
public interface StatefulRedisMasterReplicaConnection<K, V> extends StatefulRedisConnection<K, V> {

    void setReadFrom(ReadFrom readFrom);

    // Inherited from StatefulRedisConnection; exposes GET, SET, and the other data commands.
    RedisAsyncCommands<K, V> async();
}


By choosing a StatefulRedisMasterReplicaConnection, instantiated via a Sentinel-backed MasterReplica builder, we inherit:

  • Topology awareness: it hooks directly into the Sentinel Pub/Sub event stream to automatically reroute traffic.
  • Asynchronous engine preservation: it exposes the RedisAsyncCommands that Bucket4j needs to execute thread-safe token operations asynchronously.

The Wrapper Integration

To cleanly connect our new topology-aware connection with the rate limiter in our example (Bucket4j), we add a dedicated wrapper to the integration layer.

Java
 
public static <K> LettuceBasedProxyManagerBuilder<K> casBasedBuilder(
        StatefulRedisMasterReplicaConnection<K, byte[]> statefulRedisMasterReplicaConnection) {
    // Delegate to the existing builder that accepts Lettuce's async command interface.
    return casBasedBuilder(statefulRedisMasterReplicaConnection.async());
}


A word on CAS: Compare-and-swap (CAS) is a non-blocking pattern for updating data safely without heavy locks. A CAS-based builder produces a proxy manager that reads the token bucket value, does the math, and writes it back only if no other writer has changed it in the meantime. If the value did change, the operation is retried automatically.
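
The read, compute, write-if-unchanged loop can be sketched with the JDK's own atomic types. This is an in-memory analogy, not Bucket4j's Redis implementation; `CasTokenBucket` and its methods are hypothetical names chosen for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

class CasTokenBucket {
    private final AtomicLong tokens;

    CasTokenBucket(long initialTokens) {
        this.tokens = new AtomicLong(initialTokens);
    }

    // Tries to take 'requested' tokens without holding a lock.
    boolean tryConsume(long requested) {
        while (true) {
            long observed = tokens.get();                 // 1. read the current value
            if (observed < requested) {
                return false;                             // not enough tokens: reject
            }
            long updated = observed - requested;          // 2. do the math
            if (tokens.compareAndSet(observed, updated)) {
                return true;                              // 3. nobody changed it meanwhile: done
            }
            // 4. another thread won the race: loop and retry with a fresh read
        }
    }

    long available() {
        return tokens.get();
    }

    // Runs 'threads' workers that each attempt 'attemptsPerThread' single-token consumes.
    static long countSuccesses(CasTokenBucket bucket, int threads, int attemptsPerThread) {
        AtomicLong successes = new AtomicLong();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < attemptsPerThread; i++) {
                    if (bucket.tryConsume(1)) {
                        successes.incrementAndGet();
                    }
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return successes.get();
    }

    public static void main(String[] args) {
        CasTokenBucket bucket = new CasTokenBucket(100);
        // 8 threads x 50 attempts = 400 attempts against 100 tokens.
        System.out.println(countSuccesses(bucket, 8, 50)); // 100
        System.out.println(bucket.available());            // 0
    }
}
```

However the threads interleave, exactly 100 consumes succeed, because a write only lands when the value it was computed from is still current.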

Bucket4j exposes similar CAS-based builders, each of which expects a standard Lettuce asynchronous command interface. By adding the builder above, we enable Bucket4j’s proxy manager to accept a StatefulRedisMasterReplicaConnection.

Validation Through a Little Chaos Engineering 

The Test Stack

Since standard testing frameworks won’t expose this issue, we needed a real-world setup to simulate the production environment. The test stack included:

  • Docker and Docker Compose: To manage a multi-node Redis deployment (1 master, 2 replicas, 3 sentinels)
  • Java/Spring Boot: The host-side sample application integrating the Bucket4j logic to rate-limit an endpoint
  • Lettuce/Bucket4j: The libraries that we want to test
  • ApacheBench (ab): A command-line utility used for injecting the load

Test Setup

The following docker-compose.yml served as the baseline configuration for the Redis setup.

YAML
 
version: '3.8'

services:
  redis-master:
    image: redis:7-alpine
    container_name: redis-master
    # Network mode host maps directly to your machine's ports, bypassing docker bridge DNS
    network_mode: "host"
    command: redis-server --port 6379

  redis-replica:
    image: redis:7-alpine
    container_name: redis-replica
    network_mode: "host"
    # Since we are on host mode, the replica connects to localhost 6379 and binds its own engine to 6380
    command: >
      redis-server 
      --port 6380
      --replicaof 127.0.0.1 6379 
      --replica-announce-ip 127.0.0.1 
      --replica-announce-port 6380
    depends_on:
      - redis-master

  redis-sentinel:
    image: redis:7-alpine
    container_name: redis-sentinel
    network_mode: "host"
    command: >
      sh -c "
        echo 'port 26379' > /sentinel.conf &&
        echo 'sentinel monitor mymaster 127.0.0.1 6379 1' >> /sentinel.conf &&
        echo 'sentinel down-after-milliseconds mymaster 3000' >> /sentinel.conf &&
        echo 'sentinel failover-timeout mymaster 6000' >> /sentinel.conf &&
        redis-server /sentinel.conf --sentinel
      "
    depends_on:
      - redis-master
      - redis-replica


Here, we use a Redis master-replica configuration and a sentinel.conf that is generated inline when the container starts, which keeps the setup self-contained and portable.

Architectural Parameters

  • down-after-milliseconds mymaster 3000: Sentinel waits 3 seconds before deciding the master is unreachable. If the master does not return a valid reply within this window, Sentinel marks it “subjectively down” (SDOWN).
  • failover-timeout mymaster 6000: The time budget, in milliseconds, for a failover attempt, covering the promotion of a new master and the reconfiguration of the replicas. The failover itself starts once enough Sentinels to meet the quorum (1 in this setup) agree that the master is down, at which point it is marked “objectively down” (ODOWN).

Step 1: Validating the Infrastructure Baseline and Sanity

Before testing the different abstractions, we executed a standard failover to confirm that Redis Sentinel was working as expected and proceeded with leader election.

Shell
 
docker stop redis-master


The logs below show Sentinel performing the leader election:

Shell
 
redis-replica-1  | 1:S 17 May 2026 19:44:23.894 # Unable to connect to MASTER: Success
sentinel-1       | 9:X 17 May 2026 19:44:24.780 # +sdown master mymaster redis-master 6379
sentinel-1       | 9:X 17 May 2026 19:44:24.780 # +odown master mymaster redis-master 6379 #quorum 1/1
sentinel-1       | 9:X 17 May 2026 19:44:24.780 # +try-failover master mymaster redis-master 6379
sentinel-1       | 9:X 17 May 2026 19:44:24.785 # +vote-for-leader e1db4435d0770f294d0d13835729c5102cb5a4cd 1
sentinel-1       | 9:X 17 May 2026 19:44:24.785 # +elected-leader master mymaster redis-master 6379
sentinel-1       | 9:X 17 May 2026 19:44:24.785 # +failover-state-select-slave master mymaster redis-master 6379
sentinel-1       | 9:X 17 May 2026 19:44:24.850 # +selected-slave slave 172.18.0.3:6379 172.18.0.3 6379 @ mymaster redis-master 6379
sentinel-1       | 9:X 17 May 2026 19:44:24.850 * +failover-state-send-slaveof-noone slave 172.18.0.3:6379 172.18.0.3 6379 @ mymaster redis-master 6379


Step 2: The Naive Connection and Indefinite Freeze

We now use the StatefulRedisConnection and expose a test endpoint from our Spring Boot application. We send this endpoint 10,000 requests at a concurrency of 10, and mid-stream we kill the master node to force the election of a new master.

Java
 
@GetMapping("/test")
public ResponseEntity<String> handleRequest() {
    // Resolve the shared bucket stored in Redis under this key.
    Bucket bucket = proxyManager.builder().build("reproduction-key", () -> bucketConfiguration);

    // Under high concurrent load, this call freezes solid
    // the moment the master container is stopped.
    if (bucket.tryConsume(1)) {
        return ResponseEntity.ok("SUCCESS");
    } else {
        return ResponseEntity.status(HttpStatus.TOO_MANY_REQUESTS).body("RATE_LIMITED");
    }
}

Observation: The application logs revealed a blind reconnection loop. The client was stuck knocking on the door of the dead port 6379, oblivious to the new master on 6380. The infrastructure had healed by this point; however, the application remained frozen. Even when we increased the command timeout to 12 seconds, the results were the same. The ApacheBench test did not complete and timed out.

Below are the results:

Plain Text
 
rithraravikumar@Rithras-MacBook-Air redis-sentinel-lab % ab -n 10000 -c 10 http://localhost:8080/test

This is ApacheBench, Version 2.3 <$Revision: 1903618 $>

Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)

Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests

apr_pollset_poll: The timeout specified has expired (70007)

Total of 5460 requests completed


Step 3: The Master-Replica Connection Pivot

The final step was to use the StatefulRedisMasterReplicaConnection and observe whether the application bounced back. The code change is shown below:

Java
 
RedisClient redisClient = RedisClient.create(builder.build());
RedisCodec<String, byte[]> bucket4jCodec = RedisCodec.of(StringCodec.UTF8, ByteArrayCodec.INSTANCE);

RedisURI sentinelUri = RedisURI.builder()
        .withSentinelMasterId(masterId)     // Looks up "mymaster"
        .withSentinel("127.0.0.1", 26379)   // The host and port where Sentinel is listening
        .build();

StatefulRedisMasterReplicaConnection<String, byte[]> redisConnection = MasterReplica.connect(redisClient, bucket4jCodec, sentinelUri);

return LettuceBasedProxyManager.builderFor(redisConnection)
        .withClientSideConfig(ClientSideConfig.getDefault())
        .build();


Observation: With this topology-aware connection in place, we killed the master node at the 5,000-request mark. While there was a noticeable stall, the connection recognized that there was a new master and resumed processing.

Plain Text
 
rithraravikumar@Rithras-MacBook-Air redis-sentinel-lab % ab -n 10000 -c 10 http://localhost:8080/test

This is ApacheBench, Version 2.3 <$Revision: 1903618 $>

Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/

Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)

Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests

....

Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests

Finished 10000 requests

Time taken for tests:   24.522 seconds
Complete requests:      10000
Failed requests:        5983
   (Connect: 0, Receive: 0, Length: 5983, Exceptions: 0)

Non-2xx responses:      5983
Total transferred:      1814793 bytes
HTML transferred:       656334 bytes
Requests per second:    407.80 [#/sec] (mean)
Time per request:       24.522 [ms] (mean)
Time per request:       2.452 [ms] (mean, across all concurrent requests)
Transfer rate:          72.27 [Kbytes/sec] received
Connection Times (ms)

              min  mean[+/-sd] median   max

Connect:        0    0   2.5      0     251
Processing:     0   24 671.0      1   21235
Waiting:        0   24 671.0      1   21234
Total:          0   24 671.0      1   21235


Analyzing the Metrics

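The test recorded 5,983 failed requests, all of them non-2xx responses. Because ab counts every non-2xx response as a failure, this figure includes the 429 RATE_LIMITED responses that the endpoint returns by design, along with any requests rejected during the failover window, when Bucket4j failed fast instead of hanging. The longest request took 21.2 seconds (21,235 ms), most likely the stall while the connection discovered the new master.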
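
The headline figures in the report can be re-derived from a few of its raw numbers, which is a useful sanity check when reading ab output. A small sketch follows; the class and method names are hypothetical, and the constants are copied from the run above.

```java
class AbReportCheck {
    // Raw values copied from the ApacheBench report.
    static final int COMPLETE_REQUESTS = 10_000;
    static final int NON_2XX_RESPONSES = 5_983;
    static final double TIME_TAKEN_SECONDS = 24.522;
    static final int CONCURRENCY = 10; // the -c flag

    // Responses that came back with a 2xx status.
    static int successfulResponses() {
        return COMPLETE_REQUESTS - NON_2XX_RESPONSES;
    }

    // "Requests per second (mean)": completed requests divided by wall-clock time.
    static double requestsPerSecond() {
        return COMPLETE_REQUESTS / TIME_TAKEN_SECONDS;
    }

    // "Time per request (mean)": latency as seen by one of the concurrent clients, in ms.
    static double meanTimePerRequestMillis() {
        return CONCURRENCY * TIME_TAKEN_SECONDS * 1000.0 / COMPLETE_REQUESTS;
    }

    public static void main(String[] args) {
        System.out.println("2xx responses:     " + successfulResponses());
        System.out.println("requests/second:   " + requestsPerSecond());
        System.out.println("ms per request:    " + meanTimePerRequestMillis());
    }
}
```

The derived values line up with the report: 4,017 responses were 2xx, throughput rounds to 407.80 requests per second, and the mean time per request is 24.522 ms. A mean of about 24 ms next to a median of 1 ms and a maximum above 21 seconds shows that the average is dominated by the handful of requests caught in the failover stall.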

The critical difference was that the application bounced back.

The Bigger Lessons — Not Just Topology 

  • Framework abstractions can inadvertently mask underlying driver capabilities, turning a standard infrastructure-level database failover into an application-level thread stall.
  • When configuring stateful caching or synchronization wrappers, developers must ensure that connection pools explicitly utilize master-replica topology providers rather than static endpoints.
  • High-availability verification cannot rely on static environment tests; engineers must actively simulate node terminations under concurrent load using tools like ApacheBench to uncover edge-case race conditions.
  • Designing frameworks with extensible builder patterns allows downstream developers to inject custom infrastructure topologies without altering the core business logic of the library.
  • Generous timeouts during a network failure are not an effective strategy if your client driver is blind to network routing shifts, as it merely forces application threads to spend more time waiting on a dead address.
  • Under high-concurrency workloads, a client's internal command buffer can saturate very quickly, cascading into exhaustion of the application's thread pool.
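
The last two lessons can be put into rough numbers. The sketch below is back-of-the-envelope arithmetic under loudly stated assumptions: a hypothetical pool of 200 worker threads, a hypothetical arrival rate of 400 requests per second (close to the throughput measured above), and the 12-second command timeout from the configuration. The class and method names are illustrative.

```java
class PoolExhaustionEstimate {

    // Seconds until every worker thread is parked, assuming each arriving request
    // blocks one thread and no thread is released before the pool runs out.
    static double secondsUntilExhausted(int workerThreads, double arrivalsPerSecond) {
        if (workerThreads <= 0 || arrivalsPerSecond <= 0) {
            throw new IllegalArgumentException("both inputs must be positive");
        }
        return workerThreads / arrivalsPerSecond;
    }

    // Requests that arrive while one blocked call is still waiting for its timeout.
    static double arrivalsDuringTimeout(double arrivalsPerSecond, double timeoutSeconds) {
        return arrivalsPerSecond * timeoutSeconds;
    }

    public static void main(String[] args) {
        int workerThreads = 200;          // assumption: size of the request thread pool
        double arrivalsPerSecond = 400.0; // assumption: steady incoming traffic
        double timeoutSeconds = 12.0;     // command timeout used in this article

        System.out.println(secondsUntilExhausted(workerThreads, arrivalsPerSecond)); // 0.5
        System.out.println(arrivalsDuringTimeout(arrivalsPerSecond, timeoutSeconds)); // 4800.0
    }
}
```

Under these assumptions the pool is fully parked after half a second, while 4,800 requests arrive before the first timeout fires. Lengthening the timeout only widens that gap.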