Byron Hsieh

Kafka Consumer Rebalancing: From Stop-the-World to Cooperative Protocol

Introduction

When learning about Kafka consumer groups, I discovered an important concept called rebalancing - the process where Kafka redistributes partitions among consumers when the group changes.

What I learned is that, under the default eager protocol, when a consumer joins or leaves a group, all consumers pause briefly during the partition reassignment. This behavior isn't a bug - it's a deliberate design choice, and Kafka actually offers two different strategies for handling it.

In this article, I'll explain the two rebalancing strategies (Eager and Cooperative), show real logs from my experiments, and discuss the trade-offs to help you choose the right strategy for your use case.

This guide is based on the excellent course "Apache Kafka Series - Learn Apache Kafka for Beginners v3".


Two Rebalancing Strategies

Kafka offers two fundamentally different approaches to rebalancing. Each has trade-offs depending on your use case.


Strategy 1: Eager Rebalance (Stop-the-World)

How It Works

  1. Trigger event occurs (consumer joins/leaves)
  2. ALL consumers stop consuming (stop-the-world event)
  3. All consumers give up their partition assignments
  4. Kafka reassigns partitions to all consumers
  5. Consumers resume processing with new assignments
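
You can observe this from the client side by attaching a ConsumerRebalanceListener. Here's a minimal sketch (the broker address, topic, and group id are placeholders from my setup): under an eager assignor, onPartitionsRevoked receives every partition the consumer owned on every rebalance; under a cooperative assignor it receives only the partitions that are actually moving.

import java.time.Duration;
import java.util.Collection;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RebalanceObserver {
    public static void main(String[] args) {
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
        properties.setProperty("group.id", "my-java-application");
        properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);
        consumer.subscribe(List.of("demo_java"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                // Eager: ALL owned partitions show up here on every rebalance.
                // Cooperative: only the partitions being moved away.
                System.out.println("Revoked: " + partitions);
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // Partitions newly added to this consumer's assignment.
                System.out.println("Assigned: " + partitions);
            }
        });

        while (true) {
            consumer.poll(Duration.ofMillis(500)).forEach(record -> { /* process */ });
        }
    }
}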

Characteristics

Simplicity:

  • ✅ Single-step process - easy to reason about
  • ✅ Clean state transitions - all consumers synchronized
  • ✅ Simpler implementation and debugging

Trade-offs:

  • ⚠️ Complete pause in processing during rebalance
  • ⚠️ All consumers affected, even if their partitions don't change
  • ⚠️ Local state/caches must be rebuilt after reassignment

When to Use

Eager rebalancing works well when:

  • Small consumer groups (2-5 consumers)
  • Infrequent scaling events
  • Rebalance duration is acceptable (typically seconds)
  • Simplicity is valued over minimal disruption
  • Consumers are stateless or have minimal state

When Does Rebalancing Trigger?

  • Consumer joins or leaves the group
  • Consumer crashes or becomes unresponsive
  • session.timeout.ms expires without a heartbeat from the consumer (see the config sketch below)
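
The last two triggers are governed by a handful of consumer timeouts. A quick sketch of the relevant properties (the values shown are illustrative, not recommendations):

// No heartbeat within this window -> the coordinator evicts the member
// and triggers a rebalance.
properties.setProperty("session.timeout.ms", "45000");
// How often heartbeats are sent; usually about 1/3 of the session timeout.
properties.setProperty("heartbeat.interval.ms", "15000");
// If poll() isn't called within this window, the consumer leaves the
// group, which also triggers a rebalance.
properties.setProperty("max.poll.interval.ms", "300000");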

Strategy 2: Cooperative Rebalance (Incremental)

How It Works

  1. Trigger event occurs
  2. Kafka identifies only the partitions that need to move
  3. Only affected consumers pause those specific partitions
  4. Other partitions continue processing uninterrupted
  5. May take multiple iterations to reach stable state

Characteristics

Minimal Disruption:

  • ✅ Only revoked partitions pause
  • ✅ Non-affected partitions keep consuming
  • ✅ Sticky assignment - partitions stay with consumers when possible
  • ✅ Lower latency impact

Trade-offs:

  • ⚠️ More complex - multiple rebalance steps
  • ⚠️ Harder to debug (multi-phase process)
  • ⚠️ Requires all consumers to support the protocol

When to Use

Cooperative rebalancing is beneficial when:

  • Large consumer groups (10+ consumers)
  • Frequent scaling events (auto-scaling, deployments)
  • Stateful consumers with large local caches
  • Processing interruption is costly
  • High-throughput systems where pauses impact SLAs

Example Scenario

Setup: 3 partitions, 2 consumers, then 1 new consumer joins

Eager Rebalance:

Before: Consumer 1: [P0, P1]    Consumer 2: [P2]
        ↓ [ALL STOP]
After:  Consumer 1: [P0]    Consumer 2: [P1]    Consumer 3: [P2]

All consumers stopped, all partitions reassigned.

Cooperative Rebalance:

Before: Consumer 1: [P0, P1]    Consumer 2: [P2]
        ↓ [Only P1 pauses]
After:  Consumer 1: [P0]    Consumer 2: [P2]    Consumer 3: [P1]

Only partition P1 moved, others kept consuming.


Partition Assignment Strategies

Kafka provides multiple assignment strategies via the partition.assignment.strategy config.

Eager Strategies (Stop-the-World)

1. RangeAssignor

  • Assigns partitions on per-topic basis
  • Can lead to imbalanced assignments
  • Old default strategy

2. RoundRobinAssignor

  • Distributes partitions evenly across consumers
  • Each consumer ends up with the same number of partitions, ±1
  • Better balance than RangeAssignor

3. StickyAssignor

  • Balanced like RoundRobin initially
  • Minimizes partition movements during rebalance
  • Still causes a stop-the-world event

Cooperative Strategy

4. CooperativeStickyAssignor

  • Uses cooperative rebalancing protocol
  • Minimizes partition movements
  • Consumers keep processing non-moved partitions
  • Preferred for large-scale, stateful systems

Default Configuration

Kafka 3.0+ Default

partition.assignment.strategy = [RangeAssignor, CooperativeStickyAssignor]

Why both?

  • Provides backward compatibility
  • Allows gradual migration from eager to cooperative
  • Group coordinator picks the first strategy supported by all members
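
In practice, the migration is done with two rolling restarts, because every member of the group must support the cooperative protocol before the eager assignor can be dropped. A sketch of the two steps, following the upgrade path described in the Kafka docs:

// Rolling restart 1: list BOTH assignors. The cooperative one is preferred,
// but RangeAssignor keeps mixed old/new members compatible during the bounce.
properties.setProperty("partition.assignment.strategy",
    CooperativeStickyAssignor.class.getName() + "," + RangeAssignor.class.getName());

// Rolling restart 2: once every member runs the config above, remove the
// eager assignor so the group switches fully to cooperative rebalancing.
properties.setProperty("partition.assignment.strategy",
    CooperativeStickyAssignor.class.getName());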

Other Components

  • Kafka Connect: Cooperative rebalance enabled by default
  • Kafka Streams: Uses StreamsPartitionAssignor (cooperative) by default

Implementing Cooperative Rebalancing

Configuration

Add this property to your consumer:

import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;

properties.setProperty("partition.assignment.strategy",
    CooperativeStickyAssignor.class.getName());

Before (the default, as printed in the consumer startup logs):

partition.assignment.strategy = [org.apache.kafka.clients.consumer.RangeAssignor,
                                  org.apache.kafka.clients.consumer.CooperativeStickyAssignor]

After (cooperative only):

partition.assignment.strategy = [org.apache.kafka.clients.consumer.CooperativeStickyAssignor]

Real Logs: Observing Cooperative Rebalance

I ran consumers with CooperativeStickyAssignor enabled and captured the logs during different scaling events.

Scenario 1: Single Consumer Starts

First consumer joins and gets all 3 partitions:

[main] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
  [Consumer clientId=consumer-my-java-application-1, groupId=my-java-application]
  Updating assignment with
        Assigned partitions:                       [demo_java-0, demo_java-1, demo_java-2]
        Current owned partitions:                  []
        Added partitions (assigned - owned):       [demo_java-0, demo_java-1, demo_java-2]
        Revoked partitions (owned - assigned):     []

State: Consumer 1 owns partitions 0, 1, 2


Scenario 2: Second Consumer Joins (Scale Up)

A new consumer joins the group. Watch how only partition 2 is revoked:

Consumer 1 - Revokes partition 2

[main] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
  [Consumer clientId=consumer-my-java-application-1, groupId=my-java-application]
  Updating assignment with
        Assigned partitions:                       [demo_java-0, demo_java-1]
        Current owned partitions:                  [demo_java-0, demo_java-1, demo_java-2]
        Added partitions (assigned - owned):       []
        Revoked partitions (owned - assigned):     [demo_java-2]  ← Only this one!

Key insight: Consumer 1 continues processing partitions 0 and 1 during this operation.

Consumer 1 - Assignment stabilizes

[main] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
  [Consumer clientId=consumer-my-java-application-1, groupId=my-java-application]
  Updating assignment with
        Assigned partitions:                       [demo_java-0, demo_java-1]
        Current owned partitions:                  [demo_java-0, demo_java-1]
        Added partitions (assigned - owned):       []
        Revoked partitions (owned - assigned):     []

Consumer 2 - Receives the revoked partition

[main] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
  [Consumer clientId=consumer-my-java-application-1, groupId=my-java-application]
  Updating assignment with
        Assigned partitions:                       [demo_java-2]
        Current owned partitions:                  []
        Added partitions (assigned - owned):       [demo_java-2]
        Revoked partitions (owned - assigned):     []

Result:

  • Consumer 1: partitions 0, 1 (kept processing throughout)
  • Consumer 2: partition 2 (received smoothly)
  • Only 1 partition moved!
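
If you'd rather not grep consumer logs, you can also confirm the final assignment with the Admin API. A minimal sketch (the broker address is a placeholder; the group id matches the logs above):

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.ConsumerGroupDescription;

public class GroupInspector {
    public static void main(String[] args) throws Exception {
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(properties)) {
            ConsumerGroupDescription description = admin
                .describeConsumerGroups(List.of("my-java-application"))
                .describedGroups().get("my-java-application").get();
            // Print each member's client id and its current partition assignment.
            description.members().forEach(member ->
                System.out.println(member.clientId() + " -> "
                    + member.assignment().topicPartitions()));
        }
    }
}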

Scenario 3: Consumer Leaves (Scale Down)

Consumer 2 shuts down, Consumer 1 picks up the orphaned partition:

[main] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
  [Consumer clientId=consumer-my-java-application-1, groupId=my-java-application]
  Updating assignment with
        Assigned partitions:                       [demo_java-0, demo_java-1, demo_java-2]
        Current owned partitions:                  [demo_java-0, demo_java-1]
        Added partitions (assigned - owned):       [demo_java-2]  ← Picked up orphan
        Revoked partitions (owned - assigned):     []

Result: Consumer 1 seamlessly adds partition 2 while continuing to process 0 and 1.


Choosing Between Strategies

Aspect                       | Eager (RangeAssignor)              | Cooperative (CooperativeStickyAssignor)
-----------------------------|------------------------------------|----------------------------------------
Partition Revocation         | ALL partitions                     | Only affected partitions
Consumption During Rebalance | STOPPED                            | CONTINUES on non-revoked partitions
Complexity                   | Simple (single step)               | Complex (multiple steps)
Debugging                    | Easier to trace                    | Multi-phase, harder to debug
Consumer Lag Impact          | Higher (all partitions pause)      | Lower (only moved partitions pause)
State Management             | All state reset                    | Partial state retention
Best For                     | Small groups, stateless consumers  | Large groups, stateful consumers
Good Fit                     | Infrequent changes, simple systems | Frequent scaling, high-throughput

Static Group Membership

Cooperative rebalancing is great, but what if you don't want any rebalance during brief restarts?

The Problem

By default:

  1. Consumer leaves → Loses member ID
  2. Consumer rejoins → Gets new member ID
  3. Rebalance triggered (even for brief restart)

The Solution: Static Members

Configure consumers with fixed IDs:

properties.setProperty("group.instance.id", "consumer-1");

Behavior

Consumer rejoins within session.timeout.ms:

  • ✅ Keeps same partition assignment
  • NO rebalance triggered

Consumer away longer than session.timeout.ms:

  • ❌ Rebalance triggered
  • ❌ Partitions reassigned

Use Cases

1. Kubernetes/Container Environments

  • Pod restarts don't trigger rebalance
  • Rolling updates happen smoothly

2. Local Cache/State Maintenance

  • Consumers maintain local state for their partitions
  • Avoid rebuilding cache on restart
  • Ensure partition affinity
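
For the Kubernetes case, you would typically derive the instance id from something stable like the pod name, and pair it with a session timeout long enough to cover a restart. A hedged sketch (the HOSTNAME variable and the timeout value are assumptions for illustration):

// Stable identity that survives restarts; in Kubernetes, HOSTNAME is the pod
// name, which is stable for StatefulSet pods. Use whatever is stable for you.
properties.setProperty("group.instance.id", System.getenv("HOSTNAME"));
// The restart must complete within this window, otherwise the member is
// evicted and a rebalance happens anyway (value is illustrative).
properties.setProperty("session.timeout.ms", "60000");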

Key Takeaways

  1. Rebalancing strategies are design choices - not one-size-fits-all
  2. Eager rebalancing is simpler but pauses all consumers
  3. Cooperative rebalancing minimizes disruption but adds complexity
  4. Choose based on your use case - group size, scaling frequency, state management
  5. Listing multiple strategies in the config allows backward compatibility and gradual migration
  6. Static group membership prevents rebalance during brief restarts
  7. Logs reveal the process - watch "Revoked partitions" to understand impact

Conclusion

Understanding rebalancing strategies was a turning point in my Kafka learning journey. Rather than one being "better," each strategy solves different problems.

Key insights:

  • Eager rebalancing works well for simple, small-scale systems where simplicity matters
  • Cooperative rebalancing shines in large-scale, stateful, high-throughput scenarios
  • The "best" strategy depends on your specific requirements and constraints
  • Static group membership complements both strategies for handling restarts
  • Real logs help you understand what's actually happening during rebalances

There's no universal "right" answer - choose the strategy that fits your system's characteristics and operational needs.


This article is part of my learning journey through Apache Kafka. If you found it helpful, please give it a like and follow for more Kafka tutorials!

Course Reference: Apache Kafka Series - Learn Apache Kafka for Beginners v3
