Byron Hsieh

Kafka Consumer Rebalancing: From Stop-the-World to Cooperative Protocol

Introduction

When learning about Kafka consumer groups, I discovered an important concept called rebalancing - the process where Kafka redistributes partitions among consumers when the group changes.

What I learned is that, under the default eager protocol, when a consumer joins or leaves a group, all consumers pause briefly during the partition reassignment. This behavior isn't a bug - it's a deliberate design choice, and Kafka actually offers two different strategies for handling it.

In this article, I'll explain the two rebalancing strategies (Eager and Cooperative), show real logs from my experiments, and discuss the trade-offs to help you choose the right strategy for your use case.

This guide is based on the excellent course "Apache Kafka Series - Learn Apache Kafka for Beginners v3".


Two Rebalancing Strategies

Kafka offers two fundamentally different approaches to rebalancing. Each has trade-offs depending on your use case.


Strategy 1: Eager Rebalance (Stop-the-World)

How It Works

  1. Trigger event occurs (consumer joins/leaves)
  2. ALL consumers stop consuming (stop-the-world event)
  3. All consumers give up their partition assignments
  4. Kafka reassigns partitions to all consumers
  5. Consumers resume processing with new assignments
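
You can observe this from the client side by attaching a ConsumerRebalanceListener. Here's a minimal sketch (the broker address, topic, and group id are placeholders from my setup): under an eager assignor, onPartitionsRevoked receives every partition the consumer owned on every rebalance; under a cooperative assignor it receives only the partitions that are actually moving.

import java.time.Duration;
import java.util.Collection;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RebalanceObserver {
    public static void main(String[] args) {
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
        properties.setProperty("group.id", "my-java-application");
        properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);
        consumer.subscribe(List.of("demo_java"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                // Eager: ALL owned partitions show up here on every rebalance.
                // Cooperative: only the partitions being moved away.
                System.out.println("Revoked: " + partitions);
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // Partitions newly added to this consumer's assignment.
                System.out.println("Assigned: " + partitions);
            }
        });

        while (true) {
            consumer.poll(Duration.ofMillis(500)).forEach(record -> { /* process */ });
        }
    }
}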

Characteristics

Simplicity:

  • ✅ Single-step process - easy to reason about
  • ✅ Clean state transitions - all consumers synchronized
  • ✅ Simpler implementation and debugging

Trade-offs:

  • ⚠️ Complete pause in processing during rebalance
  • ⚠️ All consumers affected, even if their partitions don't change
  • ⚠️ Local state/caches must be rebuilt after reassignment

When to Use

Eager rebalancing works well when:

  • Small consumer groups (2-5 consumers)
  • Infrequent scaling events
  • Rebalance duration is acceptable (typically seconds)
  • Simplicity is valued over minimal disruption
  • Consumers are stateless or have minimal state

When Does Rebalancing Trigger?

  • Consumer joins or leaves the group
  • Consumer crashes or becomes unresponsive
  • session.timeout.ms expires without a heartbeat from the consumer (see the config sketch below)
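
The last two triggers are governed by a handful of consumer timeouts. A quick sketch of the relevant properties (the values shown are illustrative, not recommendations):

// No heartbeat within this window -> the coordinator evicts the member
// and triggers a rebalance.
properties.setProperty("session.timeout.ms", "45000");
// How often heartbeats are sent; usually about 1/3 of the session timeout.
properties.setProperty("heartbeat.interval.ms", "15000");
// If poll() isn't called within this window, the consumer leaves the
// group, which also triggers a rebalance.
properties.setProperty("max.poll.interval.ms", "300000");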

Strategy 2: Cooperative Rebalance (Incremental)

How It Works

  1. Trigger event occurs
  2. Kafka identifies only the partitions that need to move
  3. Only affected consumers pause those specific partitions
  4. Other partitions continue processing uninterrupted
  5. May take multiple iterations to reach stable state

Characteristics

Minimal Disruption:

  • ✅ Only revoked partitions pause
  • ✅ Non-affected partitions keep consuming
  • ✅ Sticky assignment - partitions stay with consumers when possible
  • ✅ Lower latency impact

Trade-offs:

  • ⚠️ More complex - multiple rebalance steps
  • ⚠️ Harder to debug (multi-phase process)
  • ⚠️ Requires all consumers to support the protocol

When to Use

Cooperative rebalancing is beneficial when:

  • Large consumer groups (10+ consumers)
  • Frequent scaling events (auto-scaling, deployments)
  • Stateful consumers with large local caches
  • Processing interruption is costly
  • High-throughput systems where pauses impact SLAs

Example Scenario

Setup: 3 partitions, 2 consumers, then 1 new consumer joins

Eager Rebalance:

Before: Consumer 1: [P0, P1]    Consumer 2: [P2]
        ↓ [ALL STOP]
After:  Consumer 1: [P0]    Consumer 2: [P1]    Consumer 3: [P2]

All consumers stopped, all partitions reassigned.

Cooperative Rebalance:

Before: Consumer 1: [P0, P1]    Consumer 2: [P2]
        ↓ [Only P1 pauses]
After:  Consumer 1: [P0]    Consumer 2: [P2]    Consumer 3: [P1]

Only partition P1 moved, others kept consuming.


Partition Assignment Strategies

Kafka provides multiple assignment strategies via the partition.assignment.strategy config.

Eager Strategies (Stop-the-World)

1. RangeAssignor

  • Assigns partitions on per-topic basis
  • Can lead to imbalanced assignments
  • Old default strategy

2. RoundRobinAssignor

  • Distributes partitions evenly across consumers
  • Each consumer ends up with the same number of partitions, ±1
  • Better balance than RangeAssignor

3. StickyAssignor

  • Balanced like RoundRobin initially
  • Minimizes partition movements during rebalance
  • Still causes a stop-the-world event

Cooperative Strategy

4. CooperativeStickyAssignor

  • Uses cooperative rebalancing protocol
  • Minimizes partition movements
  • Consumers keep processing non-moved partitions
  • Preferred for large-scale, stateful systems

Default Configuration

Kafka 3.0+ Default

partition.assignment.strategy = [RangeAssignor, CooperativeStickyAssignor]

Why both?

  • Provides backward compatibility
  • Allows gradual migration from eager to cooperative
  • Group coordinator picks the first strategy supported by all members
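
In practice, the migration is done with two rolling restarts, because every member of the group must support the cooperative protocol before the eager assignor can be dropped. A sketch of the two steps, following the upgrade path described in the Kafka docs:

// Rolling restart 1: list BOTH assignors. The cooperative one is preferred,
// but RangeAssignor keeps mixed old/new members compatible during the bounce.
properties.setProperty("partition.assignment.strategy",
    CooperativeStickyAssignor.class.getName() + "," + RangeAssignor.class.getName());

// Rolling restart 2: once every member runs the config above, remove the
// eager assignor so the group switches fully to cooperative rebalancing.
properties.setProperty("partition.assignment.strategy",
    CooperativeStickyAssignor.class.getName());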

Other Components

  • Kafka Connect: Cooperative rebalance enabled by default
  • Kafka Streams: Uses StreamsPartitionAssignor (cooperative) by default

Implementing Cooperative Rebalancing

Configuration

Add this property to your consumer:

import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;

properties.setProperty("partition.assignment.strategy",
    CooperativeStickyAssignor.class.getName());

Before (the default, as printed in the consumer startup logs):

partition.assignment.strategy = [org.apache.kafka.clients.consumer.RangeAssignor,
                                  org.apache.kafka.clients.consumer.CooperativeStickyAssignor]

After (cooperative only):

partition.assignment.strategy = [org.apache.kafka.clients.consumer.CooperativeStickyAssignor]

Real Logs: Observing Cooperative Rebalance

I ran consumers with CooperativeStickyAssignor enabled and captured the logs during different scaling events.

Scenario 1: Single Consumer Starts

First consumer joins and gets all 3 partitions:

[main] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
  [Consumer clientId=consumer-my-java-application-1, groupId=my-java-application]
  Updating assignment with
        Assigned partitions:                       [demo_java-0, demo_java-1, demo_java-2]
        Current owned partitions:                  []
        Added partitions (assigned - owned):       [demo_java-0, demo_java-1, demo_java-2]
        Revoked partitions (owned - assigned):     []

State: Consumer 1 owns partitions 0, 1, 2


Scenario 2: Second Consumer Joins (Scale Up)

A new consumer joins the group. Watch how only partition 2 is revoked:

Consumer 1 - Revokes partition 2

[main] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
  [Consumer clientId=consumer-my-java-application-1, groupId=my-java-application]
  Updating assignment with
        Assigned partitions:                       [demo_java-0, demo_java-1]
        Current owned partitions:                  [demo_java-0, demo_java-1, demo_java-2]
        Added partitions (assigned - owned):       []
        Revoked partitions (owned - assigned):     [demo_java-2]  ← Only this one!

Key insight: Consumer 1 continues processing partitions 0 and 1 during this operation.

Consumer 1 - Assignment stabilizes

[main] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
  [Consumer clientId=consumer-my-java-application-1, groupId=my-java-application]
  Updating assignment with
        Assigned partitions:                       [demo_java-0, demo_java-1]
        Current owned partitions:                  [demo_java-0, demo_java-1]
        Added partitions (assigned - owned):       []
        Revoked partitions (owned - assigned):     []

Consumer 2 - Receives the revoked partition

[main] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
  [Consumer clientId=consumer-my-java-application-1, groupId=my-java-application]
  Updating assignment with
        Assigned partitions:                       [demo_java-2]
        Current owned partitions:                  []
        Added partitions (assigned - owned):       [demo_java-2]
        Revoked partitions (owned - assigned):     []

Result:

  • Consumer 1: partitions 0, 1 (kept processing throughout)
  • Consumer 2: partition 2 (received smoothly)
  • Only 1 partition moved!
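
If you'd rather not grep consumer logs, you can also confirm the final assignment with the Admin API. A minimal sketch (the broker address is a placeholder; the group id matches the logs above):

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.ConsumerGroupDescription;

public class GroupInspector {
    public static void main(String[] args) throws Exception {
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(properties)) {
            ConsumerGroupDescription description = admin
                .describeConsumerGroups(List.of("my-java-application"))
                .describedGroups().get("my-java-application").get();
            // Print each member's client id and its current partition assignment.
            description.members().forEach(member ->
                System.out.println(member.clientId() + " -> "
                    + member.assignment().topicPartitions()));
        }
    }
}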

Scenario 3: Consumer Leaves (Scale Down)

Consumer 2 shuts down, Consumer 1 picks up the orphaned partition:

[main] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
  [Consumer clientId=consumer-my-java-application-1, groupId=my-java-application]
  Updating assignment with
        Assigned partitions:                       [demo_java-0, demo_java-1, demo_java-2]
        Current owned partitions:                  [demo_java-0, demo_java-1]
        Added partitions (assigned - owned):       [demo_java-2]  ← Picked up orphan
        Revoked partitions (owned - assigned):     []

Result: Consumer 1 seamlessly adds partition 2 while continuing to process 0 and 1.


Choosing Between Strategies

Aspect                       | Eager (RangeAssignor)              | Cooperative (CooperativeStickyAssignor)
-----------------------------|------------------------------------|----------------------------------------
Partition Revocation         | ALL partitions                     | Only affected partitions
Consumption During Rebalance | STOPPED                            | CONTINUES on non-revoked partitions
Complexity                   | Simple (single step)               | Complex (multiple steps)
Debugging                    | Easier to trace                    | Multi-phase, harder to debug
Consumer Lag Impact          | Higher (all partitions pause)      | Lower (only moved partitions pause)
State Management             | All state reset                    | Partial state retention
Best For                     | Small groups, stateless consumers  | Large groups, stateful consumers
Good Fit                     | Infrequent changes, simple systems | Frequent scaling, high-throughput

Static Group Membership

Cooperative rebalancing is great, but what if you don't want any rebalance during brief restarts?

The Problem

By default:

  1. Consumer leaves → Loses member ID
  2. Consumer rejoins → Gets new member ID
  3. Rebalance triggered (even for brief restart)

The Solution: Static Members

Configure consumers with fixed IDs:

properties.setProperty("group.instance.id", "consumer-1");

Behavior

Consumer rejoins within session.timeout.ms:

  • ✅ Keeps same partition assignment
  • NO rebalance triggered

Consumer away longer than session.timeout.ms:

  • ❌ Rebalance triggered
  • ❌ Partitions reassigned

Use Cases

1. Kubernetes/Container Environments

  • Pod restarts don't trigger rebalance
  • Rolling updates happen smoothly

2. Local Cache/State Maintenance

  • Consumers maintain local state for their partitions
  • Avoid rebuilding cache on restart
  • Ensure partition affinity
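
For the Kubernetes case, you would typically derive the instance id from something stable like the pod name, and pair it with a session timeout long enough to cover a restart. A hedged sketch (the HOSTNAME variable and the timeout value are assumptions for illustration):

// Stable identity that survives restarts; in Kubernetes, HOSTNAME is the pod
// name, which is stable for StatefulSet pods. Use whatever is stable for you.
properties.setProperty("group.instance.id", System.getenv("HOSTNAME"));
// The restart must complete within this window, otherwise the member is
// evicted and a rebalance happens anyway (value is illustrative).
properties.setProperty("session.timeout.ms", "60000");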

Key Takeaways

  1. Rebalancing strategies are design choices - not one-size-fits-all
  2. Eager rebalancing is simpler but pauses all consumers
  3. Cooperative rebalancing minimizes disruption but adds complexity
  4. Choose based on your use case - group size, scaling frequency, state management
  5. Listing multiple strategies in the config allows backward compatibility and gradual migration
  6. Static group membership prevents rebalance during brief restarts
  7. Logs reveal the process - watch "Revoked partitions" to understand impact

Conclusion

Understanding rebalancing strategies was a turning point in my Kafka learning journey. Rather than one being "better," each strategy solves different problems.

Key insights:

  • Eager rebalancing works well for simple, small-scale systems where simplicity matters
  • Cooperative rebalancing shines in large-scale, stateful, high-throughput scenarios
  • The "best" strategy depends on your specific requirements and constraints
  • Static group membership complements both strategies for handling restarts
  • Real logs help you understand what's actually happening during rebalances

There's no universal "right" answer - choose the strategy that fits your system's characteristics and operational needs.


This article is part of my learning journey through Apache Kafka. If you found it helpful, please give it a like and follow for more Kafka tutorials!

Course Reference: Apache Kafka Series - Learn Apache Kafka for Beginners v3
