Sudharshan Madhavan

Building a Production-Grade RAFT Consensus Implementation

Introduction: Why I Built a RAFT Implementation

When I started my Master's degree in Software Engineering, I kept hearing the term "RAFT" in system design discussions. Databases use it. Kubernetes uses it. Google, Meta, and every company managing millions of requests rely on it. But I didn't understand it, not really.

So I decided to build one from scratch.

I built a production-grade RAFT consensus implementation that goes beyond theoretical understanding. This project demonstrates:

  • Complete RAFT protocol with all 10 invariants verified
  • Real-world concerns: persistence, concurrent RPC handling, recovery from failures
  • Production-thinking: thread safety, error handling, clean abstractions
  • 3-node distributed cluster that actually works
  • Real-time WebSocket visualization to see the algorithm in action
  • 100% test pass rate across 10 comprehensive resilience tests

But more importantly: I now understand not just RAFT, but why distributed systems are hard.


What is RAFT? The Consensus Algorithm Behind Modern Databases

Before jumping into my implementation, let me explain RAFT clearly, because if you don't understand it, the code won't make sense.

The Core Problem

Imagine you're building a database that needs to serve 100,000 users simultaneously. A single server won't cut it: it's both a bottleneck and a single point of failure. So you replicate data across 3 servers (A, B, C).

But now: Which server's version of the data is "correct"?

If you write to Server A, then Server A crashes before telling B and C, what happens to your write? Did it succeed or fail? If you read from B, do you get stale data? What if B and C disagree about what the data should be?

This is the consensus problem. You need all servers to agree on the order and content of changes, even when some servers fail, get slow, or disappear from the network.

Why RAFT Solves It

RAFT is a consensus algorithm that solves this by:

  1. Electing a leader: One server takes charge and decides the order of writes
  2. Replicating to followers: The leader sends each write to all other servers
  3. Waiting for majority: The write only counts as complete once a majority of servers confirm it
  4. Applying to state machine: All servers apply the same writes in the same order

This guarantees: Even if one server crashes, the others have the data and can elect a new leader.
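
In code, that whole cycle is surprisingly small. Here is a deliberately simplified Python sketch of one write cycle; the names (append_to_log, replicate, apply_to_state_machine) are illustrative placeholders, not the real implementation described later in this post:

# Simplified sketch of one RAFT write cycle (illustrative names, not the real API)
def handle_client_write(leader, followers, command):
    entry = leader.append_to_log(command)            # 1. the leader decides the order

    acks = 1                                          # the leader counts itself
    for follower in followers:
        if follower.replicate(entry):                 # 2. replicate to every follower
            acks += 1

    majority = (len(followers) + 1) // 2 + 1
    if acks >= majority:                              # 3. wait for a majority
        leader.apply_to_state_machine(entry)          # 4. apply in log order
        return "committed"
    return "not committed yet"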

Real-World Usage

  • Consul (HashiCorp): Service discovery, uses RAFT to keep all nodes in sync
  • Etcd (Kubernetes): Configuration store, RAFT ensures cluster state is consistent
  • TiDB: Distributed SQL database; RAFT keeps its multiple replicas in sync

The RAFT paper (2014) was revolutionary because it made consensus understandable compared to older algorithms like Paxos; its authors explicitly chose understandability over optimality as the primary design goal.


Why This Matters: RAFT's Role in Modern Distributed Systems

Let me connect this to business value, because this is why companies care.

Scaling to Millions of Requests

When Instagram wants to scale to 1 billion users, a single database server isn't enough. They need:

  • Replication: Store data on multiple servers (safety)
  • Consensus: Make sure all replicas agree (correctness)
  • Automatic failover: If one server dies, others take over (availability)

Without RAFT (or similar consensus), your system becomes:

  • Inconsistent: Different replicas have different data
  • Unreliable: Data loss when servers fail
  • Unpredictable: No guarantees about what data you read

With RAFT:

  • Consistent: All replicas agree on every write
  • Reliable: Data survives single server failures
  • Predictable: Guarantees about which writes are permanent

Why Google and Meta Care

Google runs Spanner (distributed database), Meta runs similar systems. Both need:

  • Consensus for metadata consistency
  • Automatic leader election when servers die
  • Log replication without data loss

My RAFT implementation handles all of this. When a node crashes, the cluster elects a new leader in ~5-10 seconds. No data is lost. Reads work from any node (if you want to read from a follower). This is what production systems need.


Architecture & Design Decisions

Let me show you how I designed this system and why each choice matters.

System Overview

Raft Architecture

Key Design Decisions

1. Persistent Logs on Disk

def _persist_log_entry(self, log_entry):
    """Persist a single log entry to the LOG file (append-only)"""
    log_index = len(self.raft_terms.logs) - 1   # index of the entry that was just appended
    with open(self.raft_terms.logs_file_path, 'a') as f:
        entry = {
            "index": log_index,
            "term": log_entry["term"],
            "command": log_entry["command"]
        }
        f.write(json.dumps(entry) + '\n')

Why? If a node crashes and restarts, it loads logs from disk. This is a RAFT requirement: "Servers must persist their current term and vote before responding to RPCs."

Without this: Node could vote twice in the same term (breaking consensus). With this: Node recovers safely.
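
The snippet above only covers log entries, but the same rule applies to current_term and voted_for. Here is a minimal sketch of what the _persist_state call (used later in the election code) could look like; the state_file_path attribute and the JSON layout are assumptions for illustration, not the project's actual format:

import json
import os

def _persist_state(self):
    """Persist current_term and voted_for BEFORE answering any RPC.
    Sketch only: the file path attribute and layout are illustrative."""
    state = {
        "current_term": self.raft_terms.current_term,
        "voted_for": self.raft_terms.voted_for,
    }
    tmp_path = self.raft_terms.state_file_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())      # make sure the bytes actually reach disk
    os.replace(tmp_path, self.raft_terms.state_file_path)   # atomic swap, no torn writes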

2. Thread-Safe State Management

Every RAFT state change is protected by a lock:

with self.lock:
    self.raft_terms.current_term += 1
    self.raft_terms.state = RaftState.candidate
    self.raft_terms.voted_for = self.raft_terms.id

Why? Multiple threads access RAFT state:

  • Election timeout thread: checks if election should start
  • Heartbeat thread: sends periodic messages
  • RPC handler threads: processes incoming requests

Without this: Race condition could corrupt state. With this: All updates are atomic.

Note: I release the lock before blocking operations (RPC calls, disk I/O). This prevents starvation: other threads can keep making progress while one thread waits on the network or disk.
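
The shape of that pattern, stripped to its essentials, looks like this (a generic sketch of the idea; _call_peer is a hypothetical helper, not a function from the repository):

# Snapshot shared state under the lock, do the slow work outside it
with self.lock:
    term = self.raft_terms.current_term
    peers = dict(self.raft_terms.peers)        # copy what the RPCs will need

# Blocking network calls happen WITHOUT the lock held
responses = [self._call_peer(peer, term) for peer in peers.values()]

with self.lock:
    # Fold the results back into shared state atomically
    for response in responses:
        if response and response["term"] > self.raft_terms.current_term:
            self.raft_terms.current_term = response["term"]
            self.raft_terms.state = RaftState.follower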

3. Log Replication with Consistency Check

This is the core of RAFT's safety:

# Rule 3: Log consistency check
if health_check_arguments.prev_log_index >= 0:
    if health_check_arguments.prev_log_index >= len(self.raft_terms.logs):
        # We're missing entries
        return {"success": False, "term": self.raft_terms.current_term}

    # obtain() (from rpyc) copies the remote/netref entry into a local dict before we read it
    prev_entry_local = obtain(self.raft_terms.logs[health_check_arguments.prev_log_index])
    prev_term = int(prev_entry_local["term"])

    if prev_term != health_check_arguments.prev_log_term:
        # Term mismatch -> delete conflicting entries
        self.raft_terms.logs = self.raft_terms.logs[:health_check_arguments.prev_log_index]
        return {"success": False, "term": self.raft_terms.current_term}

# Rule 4: Append new entries
if health_check_arguments.entries:
    insert_index = health_check_arguments.prev_log_index + 1
    self.raft_terms.logs = self.raft_terms.logs[:insert_index]
    for entry_ref in health_check_arguments.entries:
        self.raft_terms.logs.append(entry_ref)

Why this approach?

RAFT ensures all servers have the same log history by checking the "previous entry" before appending new ones. If a follower is behind, the leader will eventually find the match point and fill in the gaps.

Why it's safe:

  • If prev_log_index/term don't match: Follower rejects (leader will retry with earlier index)
  • Once match found: All future entries are guaranteed to match
  • Deleted entries were never committed (leader wouldn't have replicated them to majority)

Deep Dive: Implementation Highlights

Now let's look at the two phases of RAFT in action.

Phase 1: Leader Election (The Hard Part)

When a node doesn't hear from the leader for 5-10 seconds, it starts an election:

def request_vote(self):
    """
    PHASE 1: LEADER ELECTION

    1. Increment currentTerm
    2. Transition to CANDIDATE
    3. Vote for self
    4. Send RequestVote RPCs to all peers in PARALLEL
    5. Count votes
    6. If majority: become LEADER
    """

    with self.lock:
        # Step 1: Increment term and become candidate
        self.raft_terms.current_term += 1
        self.raft_terms.state = RaftState.candidate
        self.raft_terms.voted_for = self.raft_terms.id
        self.vote_count = 1

        # Step 2: Persist before sending RPCs (safety requirement)
        self._persist_state()

        # Step 3: Get election timeout
        self.reset_election_timer()

        # Step 4: Prepare vote request with log information
        vote_arguments = VoteArguments(
            candidate_id=self.raft_terms.id,
            current_term=self.raft_terms.current_term,
            last_log_index=self.raft_terms.last_log_index,
            last_log_term=self.raft_terms.last_log_term
        )
        peers = dict(self.raft_terms.peers)

    # Step 5: Send RPCs in parallel (outside lock to prevent starvation)
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(peers)) as executor:
        futures = []
        for peer_id, peer_value in peers.items():
            if peer_id == self.raft_terms.id:
                continue
            future = executor.submit(self._send_vote_request_to_peer, peer_id, peer_value, vote_arguments)
            futures.append(future)
        concurrent.futures.wait(futures)

    # Step 6: Count votes AFTER all RPCs complete
    with self.lock:
        total_servers = len(self.raft_terms.peers) + 1
        majority = (total_servers // 2) + 1

        if self.vote_count >= majority:
            self.raft_terms.state = RaftState.leader
            # Initialize leader state for log replication
            for peer_id in self.raft_terms.peers.keys():
                self.next_index[peer_id] = len(self.raft_terms.logs)
                self.match_index[peer_id] = -1

The Vote Decision Logic

When a node receives a vote request, it uses this logic:

def handle_request_vote(self, vote_arguments: VoteArguments):
    """
    PHASE 1: LEADER ELECTION - Vote decision

    Rules:
    1. Reject if candidate's term < currentTerm
    2. If candidate's term > currentTerm: update and become follower
    3. Grant vote if:
       - Haven't voted in this term OR already voted for this candidate
       - Candidate's log is at least as up-to-date as receiver's log
    """

    with self.lock:
        # Rule 1: Reject stale term
        if vote_arguments.current_term < self.raft_terms.current_term:
            return False

        # Rule 2: Higher term -> become follower
        if vote_arguments.current_term > self.raft_terms.current_term:
            self.raft_terms.current_term = vote_arguments.current_term
            self.raft_terms.state = RaftState.follower
            self.raft_terms.voted_for = -1

        # Rule 3: Can we vote?
        can_vote = (
            self.raft_terms.voted_for == -1 or 
            self.raft_terms.voted_for == vote_arguments.candidate_id
        )

        # Rule 4: Is candidate's log up-to-date?
        log_is_current = self._is_log_up_to_date(
            vote_arguments.last_log_term,
            vote_arguments.last_log_index
        )

        # Grant vote if both conditions met
        if can_vote and log_is_current:
            self.raft_terms.voted_for = vote_arguments.candidate_id
            self.reset_election_timer()
            return True

        return False

Why is the log check necessary?

Imagine:

  • Node A has 100 log entries
  • Node B has 50 log entries
  • Node C has 50 log entries

If B becomes leader, it will replicate its 50-entry log to everyone, overwriting the extra entries on A. If any of those entries had ever been committed, committed data would be silently lost. Disaster!

RAFT prevents this by checking: "Is your log at least as up-to-date as mine?" Specifically, it compares last_log_term first and, if the terms are equal, last_log_index. A's 100 entries (higher index at the same term) make its log more up-to-date than B's 50, so A refuses to vote for B.
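
The election code above calls _is_log_up_to_date without showing it. The comparison it has to make is fixed by the paper (compare the last terms first; if they're equal, the longer log wins), so a sketch looks roughly like this, even if the real method differs in details:

def _is_log_up_to_date(self, candidate_last_term, candidate_last_index):
    """Raft's election restriction: the candidate's log is at least as
    up-to-date as ours if its last term is higher, or the terms are equal
    and its log is at least as long. (Sketch of the check.)"""
    if candidate_last_term != self.raft_terms.last_log_term:
        return candidate_last_term > self.raft_terms.last_log_term
    return candidate_last_index >= self.raft_terms.last_log_index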


Phase 2: Log Replication (The Intricate Part)

This is where the magic happens. The leader continuously sends updates to followers.

The Replication Cycle

def send_health_checks(self):
    """
    LEADER ONLY: Send periodic heartbeats to followers

    Purpose:
    1. Replicate log entries (or empty for heartbeat)
    2. Prevent election timeouts (heartbeat resets timer)
    3. Update follower commitIndex
    """

    with self.lock:
        if self.raft_terms.state != RaftState.leader:
            return

        peers = dict(self.raft_terms.peers)

    # Send to all followers in parallel
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(peers)) as executor:
        for peer_id, peer_value in peers.items():
            if peer_id == self.raft_terms.id:
                continue
            executor.submit(self._send_health_check_to_peer, peer_id, peer_value)

Receiving and Appending Entries

When a follower receives an AppendEntries RPC:

def handle_health_check_request(self, health_check_arguments: HealthCheckArguments):
    """
    FOLLOWER: Process AppendEntries RPC from leader

    Steps:
    1. Reject if term < currentTerm
    2. Reset election timer (leader alive!)
    3. Check log consistency (prev_log_index/term match)
    4. Append new entries if consistent
    5. Update commitIndex from leader signal
    6. Apply committed entries to state machine
    """

    with self.lock:
        # Rule 1: Reject stale term
        if health_check_arguments.current_term < self.raft_terms.current_term:
            return {"success": False, "term": self.raft_terms.current_term}

        # Rule 2: Valid leader -> become follower and reset election timer
        if health_check_arguments.current_term >= self.raft_terms.current_term:
            self.raft_terms.current_term = health_check_arguments.current_term
            self.raft_terms.state = RaftState.follower
            self.reset_election_timer()

        # Rule 3: Log consistency check (THIS IS CRITICAL)
        if health_check_arguments.prev_log_index >= 0:
            if health_check_arguments.prev_log_index >= len(self.raft_terms.logs):
                # Follower is missing entries
                return {"success": False, "term": self.raft_terms.current_term}

            # Check if the entry at prev_log_index matches term
            prev_entry = self.raft_terms.logs[health_check_arguments.prev_log_index]
            if prev_entry["term"] != health_check_arguments.prev_log_term:
                # Mismatch! Delete conflicting entries
                self.raft_terms.logs = self.raft_terms.logs[:health_check_arguments.prev_log_index]
                self._truncate_log_file(health_check_arguments.prev_log_index)
                return {"success": False, "term": self.raft_terms.current_term}

        # Rule 4: Append new entries
        if health_check_arguments.entries:
            insert_index = health_check_arguments.prev_log_index + 1
            self.raft_terms.logs = self.raft_terms.logs[:insert_index]

            for entry in health_check_arguments.entries:
                self.raft_terms.logs.append(entry)
                self._persist_log_entry(entry)

            # Update metadata
            self.raft_terms.last_log_index = len(self.raft_terms.logs) - 1
            self.raft_terms.last_log_term = self.raft_terms.logs[-1]["term"]

        # Rule 5: Update commitIndex from leader
        if health_check_arguments.leader_commit > self.raft_terms.commit_index:
            old_commit = self.raft_terms.commit_index
            self.raft_terms.commit_index = min(
                health_check_arguments.leader_commit,
                len(self.raft_terms.logs) - 1
            )
            print(f"Updated commitIndex: {old_commit} -> {self.raft_terms.commit_index}")

            # Apply newly committed entries to state machine
            self._apply_committed_entries()

        return {"success": True, "term": self.raft_terms.current_term}

The State Machine Application (Where Data Becomes Real)

Here's the critical part, the moment entries actually get applied to the KV store:

def _apply_committed_entries(self):
    """
    Apply all committed but not yet applied log entries to state machine.
    RAFT Safety: Only apply when commitIndex > lastApplied
    """
    while self.raft_terms.last_applied < self.raft_terms.commit_index:
        self.raft_terms.last_applied += 1
        log_entry = self.raft_terms.logs[self.raft_terms.last_applied]

        print(f"Applying entry {self.raft_terms.last_applied}: '{log_entry['command']}'")

        # Use state machine applier to apply to KV store
        if self.state_machine_applier:
            result = self.state_machine_applier.apply(log_entry["command"])

The state_machine_applier (a bridge between RAFT and KV store) deserializes and applies commands:

def apply(self, command_str: str) -> Dict[str, Any]:
    """Apply a serialized command to the state machine (KV store)."""
    try:
        command = deserialize_command(command_str)
        cmd_type = command.get("type")

        if cmd_type == CommandType.SET.value:
            self.db.set_at(
                command["key"],
                command["field"],
                command["value"],
                command["timestamp"],
                command.get("ttl")
            )
        elif cmd_type == CommandType.DELETE.value:
            self.db.delete_at(
                command["key"],
                command["field"],
                command["timestamp"]
            )

        return {"success": True, "operation": cmd_type}
    except Exception as e:
        return {"success": False, "error": str(e)}

Understanding the Critical Variables: A Concrete Example

This is where things get confusing, so let me walk through a real scenario:

Scenario: Client writes "SET user1=Alice" to 3-node cluster (A=leader, B=follower, C=follower)

TIME 0: Initial State
────────────────────
All nodes:
  logs = []
  last_log_index = -1           # No entries in log
  last_log_term = 0             # No entries, so term is 0
  commit_index = -1             # Nothing committed yet
  last_applied = -1             # Nothing applied to KV store yet

TIME 1: Client writes to Leader A
────────────────────────────────
Leader A receives: "SET user1=Alice"

Action: Append to log
  logs = [
    {term: 5, command: "SET user1=Alice"}   ← INDEX 0
  ]
  last_log_index = 0               # UPDATED: Just added entry at index 0
  last_log_term = 5                # UPDATED: That entry is in term 5
  commit_index = -1                # NOT CHANGED: Still not committed!
  last_applied = -1                # NOT CHANGED: Still not applied!

  Entry is in log
  NOT committed (only leader has it)
  NOT applied to KV (waiting for replication)

Leader A's next_index[B] and next_index[C] are still 0
  (the next entry to send to each follower; they advance to 1 once B and C acknowledge index 0)

TIME 2: Leader A sends AppendEntries to Followers
──────────────────────────────────────────────────
Leader A creates AppendEntries RPC:
  {
    current_term: 5,
    leader_id: A,
    prev_log_index: -1,            # Entry BEFORE the new ones
    prev_log_term: 0,              # Term of that entry
    entries: [                      # New entries to append
      {term: 5, command: "SET user1=Alice"}
    ],
    leader_commit: -1               # Leader's current commitIndex
  }

Follower B receives:
  Term 5 >= B's term 5 (valid leader)
  Consistency check: prev_log_index=-1 matches (no previous entry)
  Append new entry to log

  logs = [
    {term: 5, command: "SET user1=Alice"}   ← INDEX 0
  ]
  last_log_index = 0               # UPDATED: Now has entry 0
  last_log_term = 5                # UPDATED: Entry is in term 5
  commit_index = -1                # NOT CHANGED: Waits for leader to say it's committed
  last_applied = -1                # NOT CHANGED: Waiting for commit

  Entry is in log
  NOT committed yet (leader hasn't confirmed majority)
  NOT applied to KV (waiting for commit)

Follower C receives: Same as B
  logs = [{term: 5, command: "SET user1=Alice"}]
  last_log_index = 0
  last_log_term = 5
  commit_index = -1
  last_applied = -1

TIME 3: Leader Receives ACKs from Followers
───────────────────────────────────────────
Leader A receives success from B and C.

Update match_index:
  match_index = {
    A: 0,    # I have index 0
    B: 0,    # B confirmed it has index 0
    C: 0     # C confirmed it has index 0
  }

Check for majority:
  total_servers = 3 (A, B, C)
  majority_needed = 2
  servers_with_index_0 = [A, B, C] = 3 servers

  3 >= 2 MAJORITY REACHED!

Leader A can NOW commit:
  commit_index = 0                 # UPDATED: Entry 0 is now committed
  last_applied = -1                # Still not applied (will apply next)

Entry is committed (can't be lost)
NOT applied to KV (will happen on next apply cycle)

TIME 4: Leader Applies Entry
────────────────────────────
Leader A checks: is last_applied (-1) < commit_index (0)?
  Yes! There are committed entries that haven't been applied yet.

Apply loop:
  while last_applied (-1) < commit_index (0):
    last_applied = 0              # UPDATED
    entry = logs[0] = {SET user1=Alice}

    state_machine_applier.apply(entry)
    → KV store: {user1: Alice}

  Entry applied to KV
  Data is now durable

STATE AFTER APPLICATION:
  logs = [{term: 5, command: "SET user1=Alice"}]
  last_log_index = 0
  last_log_term = 5
  commit_index = 0                 # Committed
  last_applied = 0                 # Applied to KV
  KV store = {user1: Alice}

TIME 5: Leader Tells Followers About Commit
────────────────────────────────────────────
In next heartbeat, Leader A sends:
  {
    current_term: 5,
    leader_id: A,
    prev_log_index: 0,
    prev_log_term: 5,
    entries: [],                   # Empty (just heartbeat)
    leader_commit: 0               # ← KEY: "Commit up to index 0!"
  }

Follower B receives:
  Check: leader_commit (0) > commit_index (-1)?
  Yes!

  commit_index = min(0, len(logs)-1) = 0  # UPDATED

  Apply loop:
    while last_applied (-1) < commit_index (0):
      last_applied = 0             # UPDATED
      entry = logs[0]
      state_machine_applier.apply(entry)
      → KV store: {user1: Alice}

  last_applied = 0                 # UPDATED

Follower C receives: Same as B
  commit_index = 0
  last_applied = 0
  KV store = {user1: Alice}

FINAL STATE (After Replication Complete):
──────────────────────────────────────────
Leader A:
  logs = [{term: 5, command: "SET user1=Alice"}]
  last_log_index = 0
  last_log_term = 5
  commit_index = 0                 ← COMMITTED
  last_applied = 0                 ← APPLIED
  KV = {user1: Alice}

Follower B:
  logs = [{term: 5, command: "SET user1=Alice"}]
  last_log_index = 0
  last_log_term = 5
  commit_index = 0                 ← COMMITTED
  last_applied = 0                 ← APPLIED
  KV = {user1: Alice}

Follower C:
  logs = [{term: 5, command: "SET user1=Alice"}]
  last_log_index = 0
  last_log_term = 5
  commit_index = 0                 ← COMMITTED
  last_applied = 0                 ← APPLIED
  KV = {user1: Alice}

ALL 3 NODES HAVE IDENTICAL STATE

Why These Variables Matter

  • last_log_index: "How many entries have I received?" Changes when an entry is appended to the log. Critical because it tells followers and candidates whether they're behind.
  • last_log_term: "What term was my last entry?" Changes when an entry is appended to the log. Critical because the log consistency check depends on it.
  • commit_index: "Up to which index can I trust?" Changes when the leader counts a majority of replicas. Critical because applying an uncommitted entry that later gets lost breaks consistency.
  • last_applied: "Up to which index have I applied to the KV store?" Changes when committed entries are applied. Critical because it must eventually catch up to commit_index, or the KV state diverges.

The Safety Guarantee:

  • An entry is only applied when last_applied reaches it
  • This only happens when commit_index reaches it
  • commit_index only advances when majority replicates
  • So: Only entries that can't be lost are applied
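
The leader-side half of that chain, how commit_index is allowed to advance, can also be written as the paper's standard rule: commit the highest index that a majority has replicated, restricted to entries from the current term. This is a sketch using the next_index/match_index names from above; my implementation counts ACKs explicitly instead (that code appears later in this post):

def _advance_commit_index(self):
    """Sketch of the standard Raft commit rule: pick the highest index N such
    that a majority of match_index values are >= N and logs[N] is from the
    current term, then apply everything up to N."""
    total_servers = len(self.raft_terms.peers) + 1
    majority = (total_servers // 2) + 1

    for n in range(len(self.raft_terms.logs) - 1, self.raft_terms.commit_index, -1):
        replicas = 1  # the leader itself
        replicas += sum(
            1 for peer_id, idx in self.match_index.items()
            if peer_id != self.raft_terms.id and idx >= n
        )
        if replicas >= majority and self.raft_terms.logs[n]["term"] == self.raft_terms.current_term:
            self.raft_terms.commit_index = n
            self._apply_committed_entries()
            break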

Health Checks & Fault Tolerance

RAFT handles failures through two mechanisms:

1. Election Timeout (Detects Dead Leaders)

def _ticker(self):
    """
    Election timeout thread

    If heartbeat arrives before timeout: reset and wait again
    If timeout fires: start election (become candidate)
    """
    while not self.killed():
        self.election_timer_event.clear()

        # Randomize 5000-10000 ms to prevent ties
        new_timeout_ms = random.uniform(5000, 10000)
        self.raft_terms.election_timer = new_timeout_ms / 1000.0

        # Wait for heartbeat OR timeout
        heartbeat_received = self.election_timer_event.wait(
            timeout=self.raft_terms.election_timer
        )

        # If heartbeat arrived, reset and retry
        if heartbeat_received:
            continue

        # Timeout fired -> start election
        self.request_vote()

Why randomization? Without it, all followers time out at the same moment and you get split votes. With randomization, one node times out first, requests votes before the others even become candidates, and wins the election.
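
You can convince yourself with a tiny simulation: draw a random timeout for each follower and measure how often the two earliest ones land within, say, 100 ms of each other (100 ms here is an arbitrary stand-in for how long an election round takes; it is not a number from my implementation):

import random

def near_tie_probability(nodes=3, low_ms=5000, high_ms=10000, window_ms=100, trials=100_000):
    """Estimate how often the two earliest election timeouts fire within window_ms."""
    ties = 0
    for _ in range(trials):
        timeouts = sorted(random.uniform(low_ms, high_ms) for _ in range(nodes))
        if timeouts[1] - timeouts[0] < window_ms:
            ties += 1
    return ties / trials

print(near_tie_probability())   # only a few percent with a 5000 ms spread and a 100 ms window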

2. Heartbeat Timeout (Followers Detect Stale Leader)

Every 3 seconds, the leader sends heartbeats:

def _heartbeat_ticker(self):
    """
    Heartbeat thread (leader only)
    Sends AppendEntries RPCs every 3000ms
    """
    while not self.killed():
        if self.raft_terms.state == RaftState.leader:
            self.send_health_checks()
        self.heartbeat_timer_event.wait(timeout=3)  # 3s interval (for clear visualization)

If a leader crashes, followers don't receive heartbeats and election timeout fires. New leader elected within ~5-10 seconds.


Handling Real-World Failures

Building RAFT taught me that the algorithm's real strength isn't in the happy path; it's in how it handles failures that would take down other systems. Let me walk through the four most critical failure scenarios and show exactly how my implementation prevents them.


Failure #1: Split Brain - Multiple Leaders Elected Simultaneously

The Problem

Imagine this nightmare scenario:

  • You have 5 nodes in your cluster
  • Due to network latency, nodes A and B think C is dead
  • They start elections simultaneously
  • Both get 2 votes (A votes for itself, B votes for itself)
  • Both think they're leaders
  • Now your system has two sources of truth

Writes go to Leader A, other writes go to Leader B. Data diverges. Consistency breaks. Database corrupts.

How My Code Prevents It

The answer is majority voting. A node can only become leader if it gets votes from MORE than half the cluster.

def request_vote(self):
    """
    PHASE 1: LEADER ELECTION

    Only candidates with majority votes become leaders
    """

    with self.lock:
        # Step 1: Increment term and become candidate
        self.raft_terms.current_term += 1
        self.raft_terms.state = RaftState.candidate
        self.raft_terms.voted_for = self.raft_terms.id
        self.vote_count = 1  # Vote for self

        # ... send vote requests to all peers ...

    # Step 2: Count votes AFTER all RPCs complete
    with self.lock:
        total_servers = len(self.raft_terms.peers) + 1
        majority = (total_servers // 2) + 1  # ← KEY: Strict majority needed

        print(f"Election result: {self.vote_count}/{total_servers} votes (need {majority})")

        # Step 3: Only become leader if majority
        if self.vote_count >= majority and self.raft_terms.state == RaftState.candidate:
            self.raft_terms.state = RaftState.leader

Why this works:

In a 5-node cluster:

  • Majority needed = (5 // 2) + 1 = 3 votes
  • If Node A gets 2 votes (itself + one other): NOT leader
  • If Node B gets 2 votes (itself + one other): NOT leader
  • Maximum: Both get 2 votes each. Neither reaches 3. No split brain

In a 3-node cluster:

  • Majority needed = (3 // 2) + 1 = 2 votes
  • Only one node can get 2+ votes (the other node can't get 2 if the first got 2)
  • Exactly one leader per term guaranteed
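
Another way to see it: two candidates winning in the same term would need two disjoint majorities (each node votes at most once per term), and two strict majorities of the same cluster always overlap. A tiny check makes that concrete:

def two_disjoint_majorities_possible(n):
    """Could two candidates each collect a strict majority of n voters
    without sharing a single voter? (Each node votes once per term.)"""
    majority = n // 2 + 1
    return 2 * majority <= n      # they would need 2 * majority distinct voters

for n in (3, 5, 7):
    print(n, two_disjoint_majorities_possible(n))   # False, False, False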

Real-World Impact

Without this: Instagram's database could have two leaders writing conflicting data. With this: Mathematically impossible.


Failure #2: Node Crashes Before Sending Requests to Followers

The Problem

A client writes to the leader:

  • Leader appends to its log
  • Leader returns success to client ("your write is saved!")
  • Leader crashes before sending to followers
  • Followers never get the entry
  • New leader elected
  • Entry is gone

Client thinks data is saved, but it's lost.

How My Code Prevents It

The answer is don't commit (and don't acknowledge to client) until majority replicates.

def append_log_entries(self, command):
    """
    CLIENT WRITE: Only leader accepts writes
    But leader doesn't acknowledge until replicated to majority
    """

    with self.lock:
        # Only leader accepts
        if self.raft_terms.state != RaftState.leader:
            return {
                "success": False,
                "error": "NOT_LEADER",
                "message": "Redirect to actual leader"
            }

        # Append to leader's log (safe, it's the leader)
        log_entry = {
            "term": self.raft_terms.current_term,
            "command": str(command)
        }
        self.raft_terms.logs.append(log_entry)
        log_index = len(self.raft_terms.logs) - 1

        # Persist to disk (safety requirement)
        self._persist_log_entry(log_entry)

        # Update metadata
        self.raft_terms.last_log_index = len(self.raft_terms.logs) - 1
        self.raft_terms.last_log_term = self.raft_terms.current_term

        print(f"Leader appended entry at index {log_index}")

    # Return to client: entry is in log, waiting for replication
    return {
        "success": True,
        "index": log_index,
        "term": self.raft_terms.current_term,
        "message": "Appended to leader log, waiting for replication"
    }

Client gets success, but this means "appended to leader's log, replication in progress."

Then the leader sends to followers in parallel:

def send_health_checks(self):
    """
    LEADER ONLY: Send AppendEntries RPCs to all followers
    This replicates log entries to majority
    """

    with self.lock:
        if self.raft_terms.state != RaftState.leader:
            return
        peers = dict(self.raft_terms.peers)

    # Send to all followers IN PARALLEL
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(peers)) as executor:
        for peer_id, peer_value in peers.items():
            if peer_id == self.raft_terms.id:
                continue

            # Each follower gets a separate RPC thread
            executor.submit(self._send_health_check_to_peer, peer_id, peer_value)

When majority ACKs:

# After followers respond...
with self.lock:
    # Update match_index for each peer that ACKed
    # Example: match_index = {A: 5, B: 5, C: 5}

    # Count how many nodes have replicated up to index 5
    replicated_count = sum(
        1 for peer_id in self.raft_terms.peers
        if self.match_index.get(peer_id, -1) >= log_index
    )
    replicated_count += 1  # Count self

    # Check if majority has it
    total_servers = len(self.raft_terms.peers) + 1
    majority = (total_servers // 2) + 1

    if replicated_count >= majority:
        # NOW it's safe to commit
        self.raft_terms.commit_index = log_index
        self._apply_committed_entries()  # Apply to KV store

Why this works:

  • Entry added to leader's log
  • Leader sends to all followers in parallel
  • Wait for majority ACKs
  • Only then advance commit_index
  • If leader crashes before majority ACKs: Entry is lost, but client wasn't told "persisted"
  • If leader crashes after majority ACKs: New leader has entry (majority has it), replicates it

Guarantee: If client gets success with commit_index >= index, data survives any single failure.

Real-World Impact

Without this: Bank deposits might be lost even though customer got confirmation. With this: Confirmation is only sent when data can't be lost.


Failure #3: Log Divergence - After Crashes, Followers Have Conflicting Logs

The Problem

Network partition scenario:

  • Leader A gets 2 writes: [entry1, entry2]
  • Followers B and C don't get entry2 (partition)
  • Network heals
  • Now logs diverge:
    • A: [entry1, entry2]
    • B: [entry1]
    • C: [entry1]

Who's right? How do they reconcile?

How My Code Prevents It

The answer is log consistency check with prev_log_index and prev_log_term. Leader sends entries with context.

def handle_health_check_request(self, health_check_arguments: HealthCheckArguments):
    """
    FOLLOWER: Process AppendEntries from leader

    CRITICAL: Check that previous entry matches before appending new ones
    This prevents log divergence
    """

    with self.lock:
        # Rule 1: Reject stale leader
        if health_check_arguments.current_term < self.raft_terms.current_term:
            return {"success": False, "term": self.raft_terms.current_term}

        # Rule 2: Reset election timer (valid leader)
        if health_check_arguments.current_term >= self.raft_terms.current_term:
            self.raft_terms.current_term = health_check_arguments.current_term
            self.raft_terms.state = RaftState.follower
            self.reset_election_timer()

        # Rule 3: LOG CONSISTENCY CHECK ← KEY
        # Check if we have the "previous" entry that leader is building on
        if health_check_arguments.prev_log_index >= 0:

            # Do we have enough entries?
            if health_check_arguments.prev_log_index >= len(self.raft_terms.logs):
                print(f"Log too short: have {len(self.raft_terms.logs)}, "
                      f"need {health_check_arguments.prev_log_index + 1}")
                return {"success": False, "term": self.raft_terms.current_term}

            # Does the previous entry's term match?
            prev_entry = self.raft_terms.logs[health_check_arguments.prev_log_index]
            if prev_entry["term"] != health_check_arguments.prev_log_term:
                # Term mismatch! Our logs diverge
                print(f"Log conflict at index {health_check_arguments.prev_log_index}")

                # ← KEY: Truncate diverging entries
                self.raft_terms.logs = self.raft_terms.logs[:health_check_arguments.prev_log_index]
                self._truncate_log_file(health_check_arguments.prev_log_index)

                return {"success": False, "term": self.raft_terms.current_term}

        # Rule 4: Logs match! Append new entries
        if health_check_arguments.entries:
            insert_index = health_check_arguments.prev_log_index + 1
            self.raft_terms.logs = self.raft_terms.logs[:insert_index]

            for entry in health_check_arguments.entries:
                self.raft_terms.logs.append(entry)
                self._persist_log_entry(entry)

            print(f"Follower appended {len(health_check_arguments.entries)} entries")

        return {"success": True, "term": self.raft_terms.current_term}

How it resolves conflicts:

Scenario: Follower has [entry1_term2, entry2_term2] but leader says prev_log_index=0, prev_log_term=3

Leader says: "Before appending new entries, check your entry at index 0"
Follower checks: "My entry at index 0 has term=2"
Leader claims: "term=3"
Mismatch! Follower realizes: "My logs diverged from leader"
Follower truncates: logs = [] (delete entry2)
Follower responds: "No, previous entry doesn't match"
Leader retries: "OK, try with earlier index"
Eventually leader finds match point (or follower has nothing)
Leader then replicates new entries
All logs converge

Why this works:

  • Leader knows its own log history exactly
  • Sends prev_log_index and prev_log_term with each RPC
  • Follower checks before appending
  • If mismatch: follower deletes conflicting entries
  • Leader backs up and retries with earlier indices
  • Eventually logs converge (all followers have leader's log)
  • Guarantee: All servers eventually have identical logs
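
The leader-side half of this retry loop (_send_health_check_to_peer) isn't shown above. Here is a sketch of what it has to do when a follower rejects; the HealthCheckArguments constructor, the peer_conn.handle_health_check_request call, and the reply field names are assumptions about the RPC surface, not verbatim code from the repository:

def _send_health_check_to_peer(self, peer_id, peer_conn):
    """Sketch of the leader's per-follower replication step, including the
    back-off on rejection. Names and the RPC call shape are illustrative."""
    with self.lock:
        next_idx = self.next_index.get(peer_id, len(self.raft_terms.logs))
        prev_index = next_idx - 1
        prev_term = self.raft_terms.logs[prev_index]["term"] if prev_index >= 0 else 0
        entries = list(self.raft_terms.logs[next_idx:])
        args = HealthCheckArguments(
            current_term=self.raft_terms.current_term,
            leader_id=self.raft_terms.id,
            prev_log_index=prev_index,
            prev_log_term=prev_term,
            entries=entries,
            leader_commit=self.raft_terms.commit_index,
        )

    reply = peer_conn.handle_health_check_request(args)   # RPC, done outside the lock

    with self.lock:
        if reply["term"] > self.raft_terms.current_term:
            # A newer term exists somewhere: step down immediately
            self.raft_terms.current_term = reply["term"]
            self.raft_terms.state = RaftState.follower
            return
        if reply["success"]:
            self.match_index[peer_id] = prev_index + len(entries)
            self.next_index[peer_id] = self.match_index[peer_id] + 1
        else:
            # Follower rejected the consistency check: back up one entry
            # and let the next heartbeat retry with an earlier index
            self.next_index[peer_id] = max(0, next_idx - 1)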

Real-World Impact

Without this: After network partition, some nodes have different data. With this: Logs converge automatically, consistency restored.


Failure #4: Uncommitted Data Loss - Node Crashes Before Applying to KV Store

The Problem

Data gets committed (majority replicated), but:

  • Follower receives it
  • Follower crashes before applying to KV store
  • Follower restarts
  • Did it recover the committed entry? Or is it lost?

If lost: KV store on this node diverges from others. Inconsistency.

How My Code Prevents It

The answer is persist logs to disk before applying, and recover logs on restart.

def _persist_log_entry(self, log_entry):
    """Persist a single log entry to the LOG file (append-only)"""
    try:
        log_index = len(self.raft_terms.logs) - 1

        with open(self.raft_terms.logs_file_path, 'a') as f:
            entry = {
                "index": log_index,
                "term": log_entry["term"],
                "command": log_entry["command"]
            }
            f.write(json.dumps(entry) + '\n')  # ← Persisted to disk

    except Exception as e:
        print(f"Error persisting log entry: {e}")

When node restarts:

def _load_persistent_state(self):
    """Load persistent state from disk"""
    if not os.path.exists(self.raft_terms.logs_file_path):
        return

    # Load logs from disk
    with open(self.raft_terms.logs_file_path, 'r') as f:
        for line in f:
            entry = json.loads(line)
            self.raft_terms.logs.append({
                "term": entry["term"],
                "command": entry["command"]
            })

    # Recover metadata
    if self.raft_terms.logs:
        self.raft_terms.last_log_index = len(self.raft_terms.logs) - 1
        self.raft_terms.last_log_term = self.raft_terms.logs[-1]["term"]

Then recover the state machine:

def _recover_state_machine_from_log(self):
    """
    After loading logs from disk, re-apply all committed entries
    to state machine (KV store)
    """
    # Re-apply all committed entries
    while self.raft_terms.last_applied < self.raft_terms.commit_index:
        self.raft_terms.last_applied += 1
        log_entry = self.raft_terms.logs[self.raft_terms.last_applied]

        # Re-apply to KV store
        if self.state_machine_applier:
            self.state_machine_applier.apply(log_entry["command"])

Why this works:

Scenario: Follower crashes with 10 committed entries

Before crash:
  logs (disk): [entry1, entry2, ..., entry10]
  KV store: {data from applied entries}
  commit_index: 9 (committed up to entry9)
  last_applied: 7 (only applied up to entry7)

Node crashes

Node restarts:
  Load logs from disk: logs = [entry1, entry2, ..., entry10]
  Re-establish commit_index = 9 (from disk if persisted, otherwise re-learned from the leader's next heartbeat)
  last_applied = 7

  Recovery loop:
    While last_applied (7) < commit_index (9):
      last_applied = 8
      Apply logs[8] to KV store
      last_applied = 9
      Apply logs[9] to KV store

  KV store now has entries 1-9 applied
  Matches the other nodes

Guarantee: Even if node crashes mid-way through applying, it recovers and completes on restart.

Real-World Impact

Without this: After node crash, some replicas have different state. With this: All replicas recover to identical state automatically.


Results: What I Built

The 3-Node Cluster

3-Node Cluster Configuration

Raft Dashboard

Production Features Implemented

KV Store with Advanced Features:

# Basic operations
set(key="user1", field="name", value="Alice", timestamp=1234567890, ttl=3600)
get(key="user1", field="name", timestamp=1234567890)
delete(key="user1", field="email", timestamp=1234567890)

# TTL (Time-To-Live) Support
# Fields expire after TTL seconds
def is_alive(self, timestamp: int) -> bool:
    if self.ttl is None:
        return True
    return timestamp <= self.created_at + self.ttl

# Scan operations
scan(key="user1", timestamp=1234567890)  # Get all fields for a key
scan_by_prefix(key="user1", prefix="email_", timestamp=1234567890)  # Prefix matching

Real-Time Dashboard:

  • Shows 3-node cluster status
  • Live node state (LEADER/FOLLOWER)
  • Term and vote information
  • Log entries with commit status
  • Elections in progress
  • Heartbeat activity
  • State machine changes

WebSocket Visualization:

# Real-time events broadcast to connected browsers
await ws_manager.broadcast_heartbeat(...)       # Leader sending heartbeats
await ws_manager.broadcast_election_result(...) # New leader elected
await ws_manager.broadcast_log_entry(...)       # Entry appended
await ws_manager.broadcast_entries_committed(...) # Majority replicated
await ws_manager.broadcast_kv_store_update(...)  # Applied to KV

Getting Started: Run This Locally

Want to see RAFT in action on your machine?

Prerequisites

# Python 3.8+
python --version

# Install dependencies
python3 -m venv raft-venv
source raft-venv/bin/activate
pip install rpyc fastapi uvicorn python-multipart

All dependencies from requirements.txt:

rpyc           # Remote procedure calls
fastapi        # API server
uvicorn        # ASGI server
python-multipart  # Form handling

Step 1: Start the RAFT Cluster

# In terminal 1
python3 start_cluster.py
# In terminal 2
cd client/raft-visualization
npm run dev

This starts:

  • 3-node RAFT cluster (A, B, C)
  • FastAPI server on http://localhost:8765
  • WebSocket server on ws://localhost:8765/ws

You'll see:

RAFT Cluster with WebSocket Server
WebSocket Port: 8765
React should connect to: ws://localhost:8765/ws - http://localhost:5173/

RAFT Cluster Startup - Phase 1: Initialize RPC Servers
...
RAFT Cluster Startup - Phase 2: Begin Leader Election
...
Node A elected as LEADER in term 1

Step 2: Write Data (Make Requests)

# In a new terminal, use curl to write data

# Set a key-value pair
curl -X POST http://localhost:8765/kv-store \
  -H "Content-Type: application/json" \
  -d '{
    "type": "set",
    "command": "SET",
    "key": "user1",
    "field": "name",
    "value": "Alice"
  }'

Response:
{
  "success": true,
  "message": "KV store entry submitted for replication",
  "key": "user1",
  "field": "name",
  "value": "Alice",
  "timestamp": 1704067890000000,
  "log_index": 0,
  "term": 1
}

# Read it back
curl "http://localhost:8765/kv-store?key=user1&field=name"

Response:
{
  "success": true,
  "message": "Data retrived successfully",
  "value": "Alice"
}

# Delete
curl -X DELETE "http://localhost:8765/kv-store?key=user1&field=name"

Step 3: Visualize in Real-Time

Connect the included React dashboard to ws://localhost:8765/ws and watch:

  • Heartbeats flowing (every 3 seconds)
  • Elections triggering (when leader dies)
  • Log entries replicating (client writes)
  • State machine applying (entries commit)
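
If you'd rather watch the event stream from a terminal instead of the React dashboard, a few lines of Python work too. This sketch assumes the server simply broadcasts JSON text frames on ws://localhost:8765/ws and uses the third-party websockets package (pip install websockets), which is not part of the project's requirements.txt:

import asyncio
import websockets   # pip install websockets

async def watch_raft_events(url="ws://localhost:8765/ws"):
    """Print every event the cluster broadcasts (heartbeats, elections, commits)."""
    async with websockets.connect(url) as ws:
        async for message in ws:
            print(message)

if __name__ == "__main__":
    asyncio.run(watch_raft_events())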

Key Learnings & Lessons

After building this, here's what I discovered:

1. Consensus is Harder Than It Looks

The RAFT paper is 13 pages. I thought: "How hard can it be?"

Very hard. The subtleties:

  • Log matching property: Followers must check prev_log_index and prev_log_term before appending. Without this, logs diverge.
  • Commit safety: Only apply when majority replicates. One node applying uncommitted data = consistency violation.
  • Tie-breaking: Randomized election timeouts prevent split votes. This one line of randomness (5000-10000ms) is critical.

2. Small Bugs Cascade into Catastrophic Failures

I had a bug once: not resetting the election timer when becoming follower. Result?

  • Node gets a heartbeat from the leader
  • Becomes a follower
  • But doesn't reset its election timer
  • The old countdown keeps running and fires anyway
  • Node starts an unnecessary election
  • Leadership thrashes back and forth between the old and the new leader

This bug showed me: RAFT is elegant because it prevents these cascades. The algorithm is designed so small mistakes don't compound.

3. Threading Requires Discipline

I released locks before blocking operations (RPC calls, disk I/O). Why?

If I held the lock during RPC:

  • Node sends RequestVote to peer
  • Peer is slow (network latency)
  • Lock is held for seconds
  • Other threads (heartbeat, RPC handlers) starve
  • Everything slows down

By releasing the lock:

  • Lock held for microseconds (just state modification)
  • RPC happens without lock
  • Other threads proceed
  • System stays responsive

This pattern (critical section, then blocking operation, then critical section again) is subtle but essential.

4. Persistence Prevents Disasters

I persist to disk before:

  • Voting in elections (can't vote twice)
  • Appending entries (can't lose data on crash)

Without this: Node crashes, restarts, votes again in same term = election breaks.

5. Real-Time Visualization is a Superpower

Building the WebSocket dashboard made RAFT visible. Watching:

  • Leader send heartbeats
  • Followers append entries
  • CommitIndex advance
  • State machine apply

...made the abstract algorithm concrete. Problems jumped out immediately.


Resources & References

If you want to dive deeper into RAFT and distributed systems, here are the resources that shaped this project:

My Implementation

  • Complete RAFT Implementation: GitHub Repository

    • Full source code for the 3-node cluster, KV store, tests, and WebSocket visualization. Feel free to fork, study, or build upon it.
  • Video Walkthrough: Click here to view a short video of my implementation

    • A 1-minute demo showing the cluster in action: leader election, log replication, applying to state machine

Learning Materials

  • MIT 6.824: Distributed Systems (YouTube Playlist): Watch the course
    • This course was my primary inspiration. Professor Robert Morris explains distributed systems with clarity and depth. If you're serious about understanding consensus, replication, and fault tolerance, this is mandatory viewing.

Technical References

  • The RAFT Consensus Algorithm Paper: Read the paper
    • The original paper by Diego Ongaro and John Ousterhout. It's remarkably readable compared to other consensus algorithms. The visualizations alone are worth studying.

Let's Discuss

Got questions about RAFT? Spotted an edge case I missed? Want to discuss consensus algorithms, distributed systems, or database architecture?

I'd love to hear from you. Drop a comment below or reach out on LinkedIn.


One last thing: Building RAFT taught me that learning never truly ends in software engineering. There's always a deeper level to understand, a new failure scenario to handle, another optimization to explore.

So here's to the next challenge. See you there.
