Rate Limiting Beyond “N Requests/sec”: Adaptive Throttling for Spiky Workloads (Spring Cloud Gateway)

Build smarter Spring Cloud Gateway throttling — fair per-client limits, a global cap, and adaptive tuning — to survive spikes without meltdowns.

Varun Pandey

Feb. 04, 26 · Analysis


Most teams add rate limiting after an outage, not before one. I’ve done it both ways, and the “after” version usually looks like this: someone picks a number (say 500 rps), wires up a filter, and feels safer. Then the next incident happens anyway — because the problem wasn’t the number.

The real problems tend to be:

Burstiness (traffic arrives in clumps, not a smooth stream)
Retry amplification (timeouts → retries → more load → more timeouts)
Noisy neighbors (one client or tenant degrades everyone)
Capacity drift (your service is “fine” at 10:00 am and struggling at 10:10 am)
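To put a number on retry amplification, here is a back-of-the-envelope sketch. It assumes every failed attempt is retried, up to a fixed retry count, and that the failure rate stays constant across attempts (in a real incident it usually climbs as load climbs, so treat the result as a lower bound). The class and method names are mine, purely for illustration.

```java
/** Rough model of retry amplification; assumes a constant failure rate per attempt. */
final class RetryAmplification {
    /**
     * Offered load when every failed attempt is retried, up to maxRetries times:
     * base * (1 + f + f^2 + ... + f^maxRetries).
     */
    static double offeredLoad(double baseRps, double failureRate, int maxRetries) {
        double multiplier = 0.0;
        double attemptShare = 1.0; // fraction of original requests that reach this attempt
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            multiplier += attemptShare;
            attemptShare *= failureRate;
        }
        return baseRps * multiplier;
    }
}
```

With 500 req/sec of real demand, half of the attempts timing out, and three retries, the backend is offered 937.5 req/sec. The limit you picked was sized for the 500.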

This article is about building a rate-limiting system at the gateway — still simple enough to run in a normal Spring Cloud Gateway (SCG) setup — but smarter than a single static “N req/sec.”

What “Good” Looks Like

I judge a limiter by these outcomes, not by the algorithm name:

Stability: The system doesn’t spiral when traffic spikes.
Fairness: One client can’t starve others.
Predictability: Allowed requests don’t suffer runaway tail latency.
Graceful degradation: When capacity drops, you get a controlled brownout, not a full blackout.

If your limiter returns 429 but your downstream still melts, you didn’t actually protect anything — you just moved the pain around.

Token Bucket vs. Leaky Bucket

You’ll hear “token bucket vs. leaky bucket” a lot. Here’s the version that matters in real systems.

Token Bucket: Lets You Burst, Enforces an Average

A token bucket is like a wallet that refills at a steady rate. Each request spends a token. If you have tokens saved up, you can burst. If you don’t, you wait (or get a 429).

Why it’s popular at the gateway: Bursts are common and often harmless. A user clicking “refresh” three times shouldn’t be punished if your system can handle it.

Where it bites you: If you set the bucket too large, a burst can still shock a fragile downstream. Token bucket doesn’t magically make your database love bursts.
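If the wallet metaphor feels abstract, a minimal in-memory sketch makes the two knobs concrete. This is not how SCG's Redis-backed limiter is implemented; it is an illustrative single-JVM version, and the class name and the injected clock are my own assumptions to keep it testable.

```java
import java.util.function.LongSupplier;

/** Small in-memory token bucket; illustrative only, not SCG's Redis-backed limiter. */
final class TokenBucket {
    private final double refillPerSecond; // steady refill rate (the enforced average)
    private final double capacity;        // most tokens that can be saved up (the burst)
    private final LongSupplier nanoClock; // injected so the example can be tested
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(double refillPerSecond, double capacity, LongSupplier nanoClock) {
        this.refillPerSecond = refillPerSecond;
        this.capacity = capacity;
        this.nanoClock = nanoClock;
        this.tokens = capacity;           // start full, so an initial burst is allowed
        this.lastRefillNanos = nanoClock.getAsLong();
    }

    /** Returns true if the request may proceed; false means "reply 429". */
    synchronized boolean tryAcquire() {
        long now = nanoClock.getAsLong();
        double elapsedSeconds = (now - lastRefillNanos) / 1_000_000_000.0;
        // Refill for the time that has passed, but never beyond the bucket size.
        tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

With a refill of 50 per second and a capacity of 100, an idle client can fire 100 requests at once and is then held to 50 per second. The capacity is the part that can still shock a fragile downstream.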

Leaky Bucket: Smooths Output, But Queues Can Become a Slow Failure

The leaky bucket tries to output at a steady rate. If input is higher, requests pile up. Smoothing is useful — but queuing is not free. Queues add latency, and latency triggers retries. I’ve seen systems “look fine” on throughput charts while users time out because everything is stuck in a backlog.

My rule of thumb:

Use a token bucket at the edge (gateways) to handle normal burstiness.
Use smoothing closer to fragile dependencies only if you can bound the queue and the time spent waiting.
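What "bound the queue and the time spent waiting" can look like in code: a rough sketch of a smoothing queue that sheds work when the backlog is full and discards entries that have waited too long to be worth serving. The class name, the two limits, and the explicit timestamp parameters are assumptions for illustration, not part of SCG.

```java
import java.util.ArrayDeque;
import java.util.Optional;

/** Hypothetical smoothing queue that caps both backlog length and waiting time. */
final class BoundedWaitQueue<T> {
    private static final class Entry<T> {
        final T item;
        final long enqueuedAtMillis;
        Entry(T item, long enqueuedAtMillis) {
            this.item = item;
            this.enqueuedAtMillis = enqueuedAtMillis;
        }
    }

    private final ArrayDeque<Entry<T>> queue = new ArrayDeque<>();
    private final int maxDepth;
    private final long maxWaitMillis;

    BoundedWaitQueue(int maxDepth, long maxWaitMillis) {
        this.maxDepth = maxDepth;
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Returns false (shed right away) when the backlog is already full. */
    synchronized boolean offer(T item, long nowMillis) {
        if (queue.size() >= maxDepth) {
            return false;
        }
        queue.addLast(new Entry<>(item, nowMillis));
        return true;
    }

    /** Returns the next item still worth serving, discarding any that waited too long. */
    synchronized Optional<T> poll(long nowMillis) {
        Entry<T> entry;
        while ((entry = queue.pollFirst()) != null) {
            if (nowMillis - entry.enqueuedAtMillis <= maxWaitMillis) {
                return Optional.of(entry.item);
            }
            // Stale entry: the caller has probably timed out already, so skip it.
        }
        return Optional.empty();
    }
}
```

Set the maximum wait below the client timeout. Anything older than that is work for a caller who has already given up, and possibly already retried.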

Spring Cloud Gateway Setup: The Part You Can Implement Today

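SCG has a built-in RequestRateLimiter filter backed by Redis (RedisRateLimiter), which needs the spring-boot-starter-data-redis-reactive dependency on the classpath. It’s a token-bucket-style limiter and it’s easy to wire up: replenishRate is the number of tokens added per second, and burstCapacity is the bucket size, meaning the most requests a key can make in a single second.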

Snippet 1: Per-Client Fairness With a KeyResolver

application.yml

YAML

spring:
  # Spring Boot 3.x reads these keys under spring.data.redis; Boot 2.x uses spring.redis
  redis:
    host: localhost
    port: 6379

  cloud:
    gateway:
      default-filters:
        # Optional: tag every response so you can tell it came through the gateway
        - AddResponseHeader=X-Gateway, scg
      routes:
        - id: api_route
          uri: http://localhost:8081
          predicates:
            - Path=/api/**
          filters:
            - name: RequestRateLimiter
              args:
                redis-rate-limiter.replenishRate: 50   # tokens added per second, per key
                redis-rate-limiter.burstCapacity: 100  # bucket size, per key
                key-resolver: "#{@clientKeyResolver}"

KeyResolver (use a header, API key, or principal):

Java

import org.springframework.cloud.gateway.filter.ratelimit.KeyResolver;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Primary;
import reactor.core.publisher.Mono;

@Configuration
public class RateLimitKeys {

  @Bean
  @Primary // stays the default resolver if you register more KeyResolver beans later
  public KeyResolver clientKeyResolver() {
    return exchange -> {
      // Pick something stable and hard to spoof in your environment:
      // API key, JWT subject, client id, mTLS cert fingerprint, etc.
      String clientId = exchange.getRequest().getHeaders().getFirst("X-Client-Id");
      if (clientId == null || clientId.isBlank()) {
        // All unidentified callers share one bucket, so they throttle each other.
        clientId = "anonymous";
      }
      return Mono.just(clientId);
    };
  }
}

Snippet 2: Add a Global Ceiling (Second Limiter)

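This is the step a lot of teams miss. They build per-client limits and still overload downstream because all clients spike together. At 50 req/sec each, 20 busy clients already add up to 1,000 req/sec.

SCG lets you stack filters. Add a second limiter keyed to a constant string. One caveat worth testing: RedisRateLimiter keeps its replenishRate and burstCapacity in a map keyed by route ID, so two RequestRateLimiter filters on the same route may end up sharing a single set of numbers. Confirm under load that each tier enforces its own limits before relying on this layout.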

application.yml

YAML

spring:
  cloud:
    gateway:
      routes:
        - id: api_route
          uri: http://localhost:8081
          predicates:
            - Path=/api/**
          filters:
            # Per-client fairness
            - name: RequestRateLimiter
              args:
                redis-rate-limiter.replenishRate: 50
                redis-rate-limiter.burstCapacity: 100
                key-resolver: "#{@clientKeyResolver}"

            # Global system budget
            - name: RequestRateLimiter
              args:
                redis-rate-limiter.replenishRate: 500
                redis-rate-limiter.burstCapacity: 800
                key-resolver: "#{@globalKeyResolver}"

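Java

// Goes in RateLimitKeys. With two KeyResolver beans, mark one @Primary
// (e.g., clientKeyResolver) or the gateway cannot pick a default at startup.
@Bean
public KeyResolver globalKeyResolver() {
  // Constant key: every request draws from one shared bucket.
  return exchange -> Mono.just("global");
}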

The Part That Makes It “Beyond N Req/Sec”: Adaptive Throttling

Static limits assume your capacity is static. It isn’t.

Capacity changes because of deployments, garbage collection, a slow dependency, cache cold starts, noisy neighbors in your cluster, you name it. If your limiter doesn’t respond, the gateway will happily admit traffic at a rate your service can’t handle right now.

I’ve learned to keep the controller simple and stable. Avoid overfitting. Pick a few signals you trust and apply conservative adjustments.

What Signals Actually Work

You can start with two or three:

p95 latency, measured at the gateway or at a key downstream hop
5xx error rate
Saturation (thread pool queue depth, connection pool usage, CPU, or consumer lag)
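To make the signals concrete, here is a small sketch that turns a window of latency samples into a p95 and combines it with the other two signals into a single "bad interval" verdict. It uses the nearest-rank percentile definition. The thresholds are the same ones used in the controller pseudocode later in this article; the class and method names are assumptions for illustration. In practice you would more likely read these values from your metrics system than compute them by hand.

```java
import java.util.Arrays;

/** Illustrative helpers for turning raw measurements into a health verdict. */
final class HealthSignals {
    /** Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it. */
    static long percentile(long[] samplesMillis, double p) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples in window");
        }
        long[] sorted = samplesMillis.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }

    /** True when any one signal crosses its threshold (300 ms, 1% errors, 85% saturation). */
    static boolean isBad(long p95Millis, double errorRate5xx, double saturation) {
        return p95Millis > 300 || errorRate5xx > 0.01 || saturation > 0.85;
    }
}
```

One thing this makes obvious: with only a handful of samples per window, p95 is effectively the maximum, so a single slow request flips the verdict. Size the window so it holds enough requests for the percentile to mean something.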

How They Should Behave (This Matters More Than the Formula)

Tighten fast when things look bad (protect quickly).
Relax slowly when things look good (avoid oscillation).
Add hysteresis: don’t flip-flop based on one bad interval.

Snippet 3: Pseudocode Controller (Implementation-Agnostic)

Plain Text

Every 10 seconds:

  bad = (p95_latency_ms > 300) OR (error_rate_5xx > 1%) OR (saturation > 0.85)

  if bad:
      limit = limit * 0.85        # tighten quickly
      healthy_streak = 0          # any bad interval restarts the wait
  else:
      healthy_streak += 1
      if healthy_streak >= 6:     # hysteresis: a full healthy minute before relaxing
          limit = limit * 1.05    # relax slowly
          healthy_streak = 6      # pin the counter so it cannot grow without bound

  limit = clamp(limit, min=100, max=800)
  publish(limit)  -> gateway config source
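The same loop as a small Java class, in case you want something you can unit test before wiring it to real metrics. This is a sketch that mirrors the pseudocode's factors and thresholds; the class name is my own, and where the `bad` flag comes from and where the returned limit gets published are left to the caller.

```java
/** Sketch of the tighten-fast, relax-slow loop; factors mirror the pseudocode. */
final class AdaptiveLimit {
    private static final double TIGHTEN_FACTOR = 0.85;
    private static final double RELAX_FACTOR = 1.05;
    private static final int HEALTHY_INTERVALS_BEFORE_RELAX = 6;

    private final double min;
    private final double max;
    private double limit;
    private int healthyStreak;

    AdaptiveLimit(double initialLimit, double min, double max) {
        this.min = min;
        this.max = max;
        this.limit = Math.max(min, Math.min(max, initialLimit));
    }

    /** Call once per evaluation interval; returns the limit to publish. */
    synchronized double onInterval(boolean bad) {
        if (bad) {
            limit *= TIGHTEN_FACTOR;   // protect quickly
            healthyStreak = 0;         // any bad interval restarts the wait
        } else {
            healthyStreak++;
            if (healthyStreak >= HEALTHY_INTERVALS_BEFORE_RELAX) {
                limit *= RELAX_FACTOR; // recover in small steps
                healthyStreak = HEALTHY_INTERVALS_BEFORE_RELAX; // keep the counter bounded
            }
        }
        limit = Math.max(min, Math.min(max, limit));
        return limit;
    }
}
```

Starting from 500, one bad interval drops the limit to 425. It then stays at 425 for five healthy intervals and only moves to 446.25 on the sixth, which is the hysteresis doing its job.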
  

“publish(limit)” Without Turning Your System Into a Science Project

You have options. The fastest path is usually one of these:

Store the current limit in Redis and read it in a custom limiter
Use Spring Cloud Config + refresh (works, but can be heavy if you refresh too often)
Use a feature flag system if you already have one

If you want to keep SCG’s RedisRateLimiter but make it dynamic, you typically end up writing a small wrapper/custom filter so you can pull replenishRate and burstCapacity from a dynamic source. That’s a good second iteration; don’t start there if you’re still proving the concept.
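Whatever the source is, the limiter needs a cheap, consistent view of the current numbers on the request path. Below is a minimal in-process sketch of that holder. Everything in it is hypothetical: the class name, the idea of deriving burstCapacity from the rate with a fixed ratio (1.6 matches the 500/800 pair used earlier), and the rounding rules. With several gateway instances you would still need a shared store such as Redis behind it; this only shows the local side.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical in-process holder that a custom limiter could read on every request. */
final class DynamicRateConfig {
    /** Immutable pair, swapped atomically so readers never mix an old rate with a new burst. */
    static final class Snapshot {
        final int replenishRate;
        final int burstCapacity;
        Snapshot(int replenishRate, int burstCapacity) {
            this.replenishRate = replenishRate;
            this.burstCapacity = burstCapacity;
        }
    }

    private final double burstRatio; // assumption: burst scales with the rate
    private final AtomicReference<Snapshot> current;

    DynamicRateConfig(double initialLimit, double burstRatio) {
        this.burstRatio = burstRatio;
        this.current = new AtomicReference<>(toSnapshot(initialLimit));
    }

    /** Called by the controller after each evaluation interval. */
    void publish(double limit) {
        current.set(toSnapshot(limit));
    }

    /** Called by the limiter; a single volatile read, cheap enough for the request path. */
    Snapshot read() {
        return current.get();
    }

    private Snapshot toSnapshot(double limit) {
        int rate = Math.max(1, (int) Math.floor(limit));                 // never publish a zero rate
        int burst = Math.max(rate, (int) Math.round(rate * burstRatio)); // burst is at least the rate
        return new Snapshot(rate, burst);
    }
}
```

Flooring the rate errs on the side of admitting slightly less than the controller asked for, which is the safer direction when the system is already struggling.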

Testing It Like You Mean It (Not Just “It Returns 429”)

If you only test “send 1000 requests, see some 429,” you’ll miss the failure modes that matter.

Here’s the test plan I actually use:

Baseline: Steady traffic from multiple clients. Confirm p95 stays stable.
Noisy neighbor: One client ramps aggressively; others remain steady. Confirm fairness.
Aggregate spike: All clients ramp together. Confirm the global ceiling protects the downstream.
Degradation: Inject latency or reduce downstream capacity. Confirm adaptive throttling tightens quickly.
Recovery: Remove the injection. Confirm throttling relaxes slowly (no flapping).

If you do this once with a simple load tool (k6, Gatling, JMeter), you’ll learn more than debating bucket algorithms for a week.

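JavaScript

import http from "k6/http";
import { check, sleep } from "k6";

export const options = { vus: 20, duration: "30s" };

export default function () {
  // Even VUs act as clientA, odd VUs as clientB, so two rate-limit keys are exercised.
  const clientId = __VU % 2 === 0 ? "clientA" : "clientB";
  const res = http.get("http://localhost:8080/api/hello", {
    headers: { "X-Client-Id": clientId },
    tags: { client: clientId }, // lets you split results per client
  });
  // Record whether each request was admitted (200) or throttled (429).
  check(res, { admitted: (r) => r.status === 200, throttled: (r) => r.status === 429 });
  sleep(0.1);
}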

Closing Thought

Rate limiting is usually presented as a switch: on or off, limit or no limit. In practice, it’s closer to a control system that protects your service from the physics of traffic — bursts, retries, and shifting capacity. Spring Cloud Gateway gives you solid primitives. The “beyond N req/sec” part is combining them: fairness, global budgets, and an adaptive loop that reacts before users feel the outage.

