Your server goes down. There are 1000 clients already waiting. What happens when it comes back?
The Problem
Picture a server handling 200 requests per second. A dependency goes down for about 10 seconds; every in-flight request fails, and all clients start retrying.
When your server recovers, all 1000 clients retry at once. This is called the Thundering Herd problem, and your retry strategy determines whether your server recovers in seconds or drowns under a wave of redundant requests.
Most engineers know they should use Exponential Backoff, but fewer know it can still cause synchronized spikes. Even fewer have seen what Decorrelated Jitter actually does to the request distribution curve.
To test each strategy, I built a simulation: 1000 concurrent Go clients, a server with fixed capacity, and four retry strategies fighting for recovery.
The simulation tries to show a scenario that production systems face: fixed server capacity, a hard outage window, and clients that keep retrying until they succeed. Every request is tracked per-second, so you can see the thundering herd form, peak, and (hopefully) dissipate.
The Four Strategies
Each strategy is a function with the same signature: given the attempt number and the previous delay, return how long to sleep before the next retry.
type Strategy func(attempt int, prevDelay time.Duration) time.Duration
1. Constant Retry
Fixed delay, no backoff.
func constantRetry(_ int, _ time.Duration) time.Duration {
    return 1 * time.Millisecond
}
Formula: delay = 1ms (constant)
The worst-case retry pattern: a tiny fixed delay with no backoff at all. It's what happens when engineers write a quick retry loop without thinking about it.
2. Exponential Backoff
Double the delay on each attempt, starting at 100ms and capped at 10 seconds.
func exponentialBackoff(attempt int, _ time.Duration) time.Duration {
    d := time.Duration(float64(baseSleep) * math.Pow(2, float64(attempt)))
    if d > capSleep {
        d = capSleep
    }
    return d
}
Formula: delay = min(base * 2^attempt, cap)
It is better than constant retry, but there is a catch. All 1000 clients started at the same time, so they all hit attempt 1 at t=100ms, attempt 2 at t=300ms, attempt 3 at t=700ms.
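You can see the synchronization without running the full simulator. Here is a tiny standalone snippet (illustrative only, not part of the repo) that prints the cumulative retry schedule every client ends up sharing:

// Illustrative only: the retry schedule every client computes under
// plain exponential backoff (base = 100ms, no jitter).
package main

import (
    "fmt"
    "math"
    "time"
)

func main() {
    base := 100 * time.Millisecond
    elapsed := time.Duration(0)
    for attempt := 0; attempt < 5; attempt++ {
        delay := time.Duration(float64(base) * math.Pow(2, float64(attempt)))
        elapsed += delay
        // Every client computes this exact value, so retry n lands at
        // the same instant for all 1000 of them.
        fmt.Printf("retry %d at t=%v\n", attempt+1, elapsed)
    }
}
// Prints retries at 100ms, 300ms, 700ms, 1.5s, 3.1s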
3. Full Jitter
Randomize the delay between 0 and the exponential backoff value.
func fullJitter(attempt int, _ time.Duration) time.Duration {
    d := time.Duration(float64(baseSleep) * math.Pow(2, float64(attempt)))
    if d > capSleep {
        d = capSleep
    }
    return time.Duration(rand.Int63n(int64(d)))
}
Formula: delay = random(0, min(base * 2^attempt, cap))
This is one of the strategies analyzed in the AWS Architecture Blog, where it showed the fewest total calls. Because each client picks a random delay, they no longer retry at the same time. Instead of 1000 clients all hitting the server at once, they arrive spread out.
4. Decorrelated Jitter
Each delay is random between base and 3 * previous_delay.
func decorrelatedJitter(_ int, prev time.Duration) time.Duration {
    if prev < baseSleep {
        prev = baseSleep
    }
    d := time.Duration(rand.Int63n(int64(prev)*3-int64(baseSleep))) + baseSleep
    if d > capSleep {
        d = capSleep
    }
    return d
}
Formula: delay = random(base, prev * 3) (capped)
This one comes from the same AWS blog post. Instead of using the attempt number, each delay depends on the previous delay. Why 3? It is just a practical choice from AWS. The multiplier controls how fast delays grow. With 3, the average delay grows about 1.5x per retry (the midpoint of random(base, prev*3)). That's enough to spread clients out without making them wait too long. You could use 2 or 4 and it would still work, just with different tradeoffs.
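To sanity-check that growth rate, here's a small standalone snippet (separate from the simulator; the client and retry counts are arbitrary) that averages the delay per retry, so you can swap the multiplier and watch the curve change:

package main

import (
    "fmt"
    "math/rand"
    "time"
)

const (
    baseSleep = 100 * time.Millisecond
    capSleep  = 10 * time.Second
    mult      = 3 // try 2 or 4 to see slower/faster growth
)

func main() {
    const clients = 100000
    const retries = 6
    sums := make([]int64, retries)
    for c := 0; c < clients; c++ {
        prev := baseSleep
        for r := 0; r < retries; r++ {
            // delay = random(baseSleep, prev*mult), capped: same shape as decorrelatedJitter
            d := baseSleep + time.Duration(rand.Int63n(int64(prev)*mult-int64(baseSleep)))
            if d > capSleep {
                d = capSleep
            }
            sums[r] += int64(d)
            prev = d
        }
    }
    for r, s := range sums {
        fmt.Printf("retry %d: average delay %v\n", r+1, time.Duration(s/clients))
    }
}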
The Simulation Setup
The server counts how many requests it gets each second. If it's still in the outage window or already at capacity, it says no.
type Server struct {
    mu       sync.Mutex
    capacity int
    downFor  time.Duration
    start    time.Time
    requests map[int]int // second -> request count
    accepted map[int]int // second -> accepted count
}
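The Do method itself isn't shown in the post; a minimal sketch consistent with the description above (count every request, reject during the outage window or when the current second is at capacity) could look like this, though the repo's exact implementation may differ:

// Sketch only: one plausible Do() matching the description above.
func (s *Server) Do() bool {
    s.mu.Lock()
    defer s.mu.Unlock()

    sec := int(time.Since(s.start).Seconds())
    s.requests[sec]++ // every request counts toward the histogram, accepted or not

    // Still inside the outage window: reject.
    if time.Since(s.start) < s.downFor {
        return false
    }
    // This second is already at capacity: reject.
    if s.accepted[sec] >= s.capacity {
        return false
    }
    s.accepted[sec]++
    return true
}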
Each client is a goroutine that keeps retrying until it gets a success:
func clientLoop(srv *Server, strategy Strategy, metrics *Metrics) {
    start := time.Now()
    attempt := 0
    prevDelay := baseSleep
    for {
        if srv.Do() {
            metrics.Record(time.Since(start), attempt)
            return
        }
        delay := strategy(attempt, prevDelay)
        prevDelay = delay
        attempt++
        time.Sleep(delay)
    }
}
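The driver that launches the herd isn't shown either. A minimal sketch, using a hypothetical runSimulation helper, just starts 1000 goroutines at once and waits for all of them to succeed (the real main() also parses the -strategy flag and prints the histogram):

// Sketch: spin up all clients simultaneously and wait for every one to finish.
func runSimulation(srv *Server, strategy Strategy, metrics *Metrics) {
    var wg sync.WaitGroup
    for i := 0; i < 1000; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            clientLoop(srv, strategy, metrics) // blocks until this client succeeds
        }()
    }
    wg.Wait() // all 1000 clients eventually get through
}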
Parameters for all runs:
- 1000 clients, all starting simultaneously
- Server capacity: 200 req/s
- Outage duration: 10 seconds
- Server rejects requests during outage and when at capacity
We count every request per second and draw it as a bar chart in the terminal. You can literally see the thundering herd.
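The exact renderer isn't reproduced here, but the idea is just one bar per second, scaled down to fit a terminal; something along these lines (printHistogram and the scale factor are made up for illustration):

// Rough sketch of the per-second bar chart; not the repo's exact renderer.
func printHistogram(requests map[int]int, lastSecond int) {
    const scale = 100 // one '#' per 100 requests; tune to fit your terminal
    for sec := 0; sec <= lastSecond; sec++ {
        n := requests[sec]
        fmt.Printf("t=%2ds %7d %s\n", sec, n, strings.Repeat("#", n/scale))
    }
}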
Results
Constant Retry
Each client retries every 1ms. With 1000 clients, the server gets hundreds of thousands of requests per second, but it can only handle 200. Over 8 million requests are wasted to serve 1000 clients. And the flood doesn't stop when the server comes back; it keeps going for 4 more seconds. It only tapers off because each client that gets served stops retrying: fewer clients means fewer retries, and by second 14 all 1000 are done.
Exponential Backoff
Look at the histogram: spikes at around seconds 0, 1, 3, 6, 12, 22, 32, 42, and 52, with nothing in between. All 1000 clients double their delay at the same rate, so they all retry at the same time. The gaps between spikes grow, but the problem stays: every spike is a mini thundering herd.
Wasted requests dropped from 8 million to just 9,000. But the p99 latency is 52 seconds.
Full Jitter
No spikes. The histogram shows a smooth curve going down from ~5000 at second 0 to 13 at second 19. Each client picks a random delay, so they no longer hit the server all at once.
Decorrelated Jitter
Same smooth shape as full jitter, no spikes, but a bit more noise: 10,695 wasted requests vs 8,468, and 137 requests over capacity. Why? If a client randomly draws a short delay, the next delay is based on that short one, so it can keep retrying quickly for a while.
Full jitter won every metric here. But we're testing the worst case: all 1000 clients breaking at once. That rarely happens in production. Clients fail at different times, and decorrelated jitter handles that better: each client finds its own rhythm instead of everyone following the same retry schedule.
When to Use What
Constant retry: Never in production.
Exponential backoff without jitter: Only when you have very few clients; the synchronization problem doesn't manifest if there is no herd.
Full jitter: The default choice. Best results in this simulation and the simplest to implement.
Decorrelated jitter: Similar performance, but it works better in scenarios where clients don't all fail at the same time. Each client adapts its delay based on its own history, which handles staggered and rolling failures more naturally.
Try It Yourself
Clone the repo and run each strategy:
git clone https://github.com/RafaelPanisset/retry-strategies-simulator
cd retry-strategies-simulator
go run main.go -strategy=constant
go run main.go -strategy=backoff
go run main.go -strategy=jitter
go run main.go -strategy=decorrelated
Each run takes 15-60 seconds, depending on the strategy. The ASCII histograms are generated live from the simulation.
References
- Marc Brooker, "Exponential Backoff and Jitter"
- Marc Brooker, "Timeouts, retries, and backoff with jitter"
- Google Cloud, "Retry Strategy"




