Your server goes down. There are 1000 clients already waiting. What happens when it comes back?
The Problem
Picture a server handling 200 requests per second. A dependency goes down for about 10 seconds; every in-flight request fails, and all clients start retrying.
When your server recovers, all 1000 clients retry at once. This is called the Thundering Herd problem, and your retry strategy determines whether your server recovers in seconds or drowns under a wave of redundant requests.
Most engineers know they should use Exponential Backoff, but fewer know it can still cause synchronized spikes. Even fewer have seen what Decorrelated Jitter actually does to the request distribution curve.
To test each strategy, I built a simulation: 1000 concurrent Go clients, a server with fixed capacity, and four retry strategies fighting for recovery.
The simulation tries to show a scenario that production systems face: fixed server capacity, a hard outage window, and clients that keep retrying until they succeed. Every request is tracked per-second, so you can see the thundering herd form, peak, and (hopefully) dissipate.
The Four Strategies
Each strategy is a function with the same signature: given the attempt number and the previous delay, return how long to sleep before the next retry.
type Strategy func(attempt int, prevDelay time.Duration) time.Duration
1. Constant Retry
Fixed delay, no backoff.
func constantRetry(_ int, _ time.Duration) time.Duration {
    return 1 * time.Millisecond
}
Formula: delay = 1ms (constant)
The worst-case retry pattern: a tiny fixed delay with no backoff at all. It's what happens when engineers write a quick retry loop without thinking about it.
2. Exponential Backoff
Double the delay on each attempt, starting at 100ms and capped at 10 seconds.
func exponentialBackoff(attempt int, _ time.Duration) time.Duration {
    d := time.Duration(float64(baseSleep) * math.Pow(2, float64(attempt)))
    if d > capSleep {
        d = capSleep
    }
    return d
}
Formula: delay = min(base * 2^attempt, cap)
It is better than constant retry, but there is a catch. All 1000 clients started at the same time, so they all hit attempt 1 at t=100ms, attempt 2 at t=300ms, attempt 3 at t=700ms.
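You can see the synchronization without running the full simulator. Here is a tiny standalone snippet (illustrative only, not part of the repo) that prints the cumulative retry schedule every client ends up sharing:

// Illustrative only: the retry schedule every client computes under
// plain exponential backoff (base = 100ms, no jitter).
package main

import (
    "fmt"
    "math"
    "time"
)

func main() {
    base := 100 * time.Millisecond
    elapsed := time.Duration(0)
    for attempt := 0; attempt < 5; attempt++ {
        delay := time.Duration(float64(base) * math.Pow(2, float64(attempt)))
        elapsed += delay
        // Every client computes this exact value, so retry n lands at
        // the same instant for all 1000 of them.
        fmt.Printf("retry %d at t=%v\n", attempt+1, elapsed)
    }
}
// Prints retries at 100ms, 300ms, 700ms, 1.5s, 3.1s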
3. Full Jitter
Randomize the delay between 0 and the exponential backoff value.
func fullJitter(attempt int, _ time.Duration) time.Duration {
    d := time.Duration(float64(baseSleep) * math.Pow(2, float64(attempt)))
    if d > capSleep {
        d = capSleep
    }
    return time.Duration(rand.Int63n(int64(d)))
}
Formula: delay = random(0, min(base * 2^attempt, cap))
This is one of the strategies analyzed in the AWS Architecture Blog, where it showed the fewest total calls. Because each client picks a random delay, they no longer retry at the same time. Instead of 1000 clients all hitting the server at once, they arrive spread out.
4. Decorrelated Jitter
Each delay is random between base and 3 * previous_delay.
func decorrelatedJitter(_ int, prev time.Duration) time.Duration {
    if prev < baseSleep {
        prev = baseSleep
    }
    d := time.Duration(rand.Int63n(int64(prev)*3-int64(baseSleep))) + baseSleep
    if d > capSleep {
        d = capSleep
    }
    return d
}
Formula: delay = random(base, prev * 3) (capped)
This one comes from the same AWS blog post. Instead of using the attempt number, each delay depends on the previous delay. Why 3? It is just a practical choice from AWS. The multiplier controls how fast delays grow. With 3, the average delay grows about 1.5x per retry (the midpoint of random(base, prev*3)). That's enough to spread clients out without making them wait too long. You could use 2 or 4 and it would still work, just with different tradeoffs.
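To sanity-check that growth rate, here's a small standalone snippet (separate from the simulator; the client and retry counts are arbitrary) that averages the delay per retry, so you can swap the multiplier and watch the curve change:

package main

import (
    "fmt"
    "math/rand"
    "time"
)

const (
    baseSleep = 100 * time.Millisecond
    capSleep  = 10 * time.Second
    mult      = 3 // try 2 or 4 to see slower/faster growth
)

func main() {
    const clients = 100000
    const retries = 6
    sums := make([]int64, retries)
    for c := 0; c < clients; c++ {
        prev := baseSleep
        for r := 0; r < retries; r++ {
            // delay = random(baseSleep, prev*mult), capped: same shape as decorrelatedJitter
            d := baseSleep + time.Duration(rand.Int63n(int64(prev)*mult-int64(baseSleep)))
            if d > capSleep {
                d = capSleep
            }
            sums[r] += int64(d)
            prev = d
        }
    }
    for r, s := range sums {
        fmt.Printf("retry %d: average delay %v\n", r+1, time.Duration(s/clients))
    }
}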
The Simulation Setup
The server counts how many requests it gets each second. If it's still in the outage window or already at capacity, it says no.
type Server struct {
    mu       sync.Mutex
    capacity int
    downFor  time.Duration
    start    time.Time
    requests map[int]int // second -> request count
    accepted map[int]int // second -> accepted count
}
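The Do method itself isn't shown in the post; a minimal sketch consistent with the description above (count every request, reject during the outage window or when the current second is at capacity) could look like this, though the repo's exact implementation may differ:

// Sketch only: one plausible Do() matching the description above.
func (s *Server) Do() bool {
    s.mu.Lock()
    defer s.mu.Unlock()

    sec := int(time.Since(s.start).Seconds())
    s.requests[sec]++ // every request counts toward the histogram, accepted or not

    // Still inside the outage window: reject.
    if time.Since(s.start) < s.downFor {
        return false
    }
    // This second is already at capacity: reject.
    if s.accepted[sec] >= s.capacity {
        return false
    }
    s.accepted[sec]++
    return true
}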
Each client is a goroutine that keeps retrying until it gets a success:
func clientLoop(srv *Server, strategy Strategy, metrics *Metrics) {
    start := time.Now()
    attempt := 0
    prevDelay := baseSleep
    for {
        if srv.Do() {
            metrics.Record(time.Since(start), attempt)
            return
        }
        delay := strategy(attempt, prevDelay)
        prevDelay = delay
        attempt++
        time.Sleep(delay)
    }
}
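The driver that launches the herd isn't shown either. A minimal sketch, using a hypothetical runSimulation helper, just starts 1000 goroutines at once and waits for all of them to succeed (the real main() also parses the -strategy flag and prints the histogram):

// Sketch: spin up all clients simultaneously and wait for every one to finish.
func runSimulation(srv *Server, strategy Strategy, metrics *Metrics) {
    var wg sync.WaitGroup
    for i := 0; i < 1000; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            clientLoop(srv, strategy, metrics) // blocks until this client succeeds
        }()
    }
    wg.Wait() // all 1000 clients eventually get through
}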
Parameters for all runs:
- 1000 clients, all starting simultaneously
- Server capacity: 200 req/s
- Outage duration: 10 seconds
- Server rejects requests during outage and when at capacity
We count every request per second and draw it as a bar chart in the terminal. You can literally see the thundering herd.
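The exact renderer isn't reproduced here, but the idea is just one bar per second, scaled down to fit a terminal; something along these lines (printHistogram and the scale factor are made up for illustration):

// Rough sketch of the per-second bar chart; not the repo's exact renderer.
func printHistogram(requests map[int]int, lastSecond int) {
    const scale = 100 // one '#' per 100 requests; tune to fit your terminal
    for sec := 0; sec <= lastSecond; sec++ {
        n := requests[sec]
        fmt.Printf("t=%2ds %7d %s\n", sec, n, strings.Repeat("#", n/scale))
    }
}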
Results
Constant Retry
Each client retries every 1ms. With 1000 clients, the server gets hundreds of thousands of requests per second, but it can only handle 200. Over 8 million requests are wasted to serve 1000 clients. And the flood doesn't stop when the server comes back; it keeps going for 4 more seconds. It only tapers off because each client that gets served stops retrying: fewer clients means fewer retries, and by second 14 all 1000 are done.
Exponential Backoff
Look at the histogram: spikes at around seconds 0, 1, 3, 6, 12, 22, 32, 42, and 52, with nothing in between. All 1000 clients double their delay at the same rate, so they all retry at the same time. The gaps between spikes grow, but the problem stays: every spike is a mini thundering herd.
Wasted requests dropped from 8 million to just 9,000. But the p99 latency is 52 seconds.
Full Jitter
No spikes. The histogram shows a smooth curve going down from ~5000 at second 0 to 13 at second 19. Each client picks a random delay, so they no longer hit the server all at once.
Decorrelated Jitter
Same smooth shape as full jitter, no spikes, but a bit more noise: 10,695 wasted requests vs 8,468, and 137 requests over capacity. Why? If a client randomly draws a short delay, the next delay is based on that short one, so it can keep retrying quickly for a while.
Full jitter won every metric here. But we're testing the worst case: all 1000 clients breaking at once. That rarely happens in production. Clients fail at different times, and decorrelated jitter handles that better: each client finds its own rhythm instead of everyone following the same retry schedule.
When to Use What
Constant retry: Never in production.
Exponential backoff without jitter: Only when you have very few clients; the synchronization problem doesn't manifest if there is no herd.
Full jitter: The default choice. Best results in this simulation and the simplest to implement.
Decorrelated jitter: Similar performance, but it works better in scenarios where clients don't all fail at the same time. Each client adapts its delay based on its own history, which handles staggered and rolling failures more naturally.
Try It Yourself
Clone the repo and run each strategy:
git clone https://github.com/RafaelPanisset/retry-strategies-simulator
cd retry-strategies-simulator
go run main.go -strategy=constant
go run main.go -strategy=backoff
go run main.go -strategy=jitter
go run main.go -strategy=decorrelated
Each run takes 15-60 seconds, depending on the strategy. The ASCII histograms are generated live from the simulation.
References
- Marc Brooker, "Exponential Backoff and Jitter"
- Marc Brooker, "Timeouts, retries, and backoff with jitter"
- Google Cloud, "Retry Strategy"




