How I built a horizontally scalable chat server in Go.
It's not about sending messages fast.
It's about doing three things at once:
— deliver in real time
— preserve order + durability
— scale without sticky sessions
Project: Signal-Flow
GitHub: https://github.com/abhiram-karanth-core/signal-flow
Demo: https://global-chat-app-web.vercel.app/
Here's how I broke it down
The key insight: stop treating a chat message as one thing.
Every message has a lifecycle with very different requirements at each stage. Collapsing them into one system is where most chat servers break down at scale.
The lifecycle:
- User sends a message (Next.js)
- Go WebSocket server receives it
- Made durable (Kafka)
- Fanned out to other users (Redis)
- Written to queryable history (Postgres)
Each step needs a different system. Each system does exactly one job.
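To make the lifecycle concrete, here's a rough sketch of the envelope that moves through every stage. Field names and JSON tags are illustrative, not the actual Signal-Flow schema.

```go
package chat

import "time"

// Message is the envelope that travels through every stage of the pipeline:
// WebSocket ingress -> Kafka -> Redis fan-out / Postgres history.
// Field names here are illustrative, not the actual Signal-Flow schema.
type Message struct {
	ID        string    `json:"id"`         // unique message ID
	RoomID    string    `json:"room_id"`    // partition key: ordering is per room
	SenderID  string    `json:"sender_id"`  // user who sent it
	Body      string    `json:"body"`       // message text
	CreatedAt time.Time `json:"created_at"` // server receive timestamp
}
```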
The first big lesson: real-time delivery and durable storage have opposing needs.
Real-time = sub-millisecond latency
Durable = safety, ordering, replayability
Forcing one system to do both means compromising on both. So don't.
Kafka is the source of truth. Every message hits Kafka first — before fan-out, before anything.
But Kafka doesn't deliver messages to users. It exists so the system can survive failures and recover state.
Correctness lives here. Latency does not.
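A minimal sketch of the Kafka-first rule using segmentio/kafka-go (the client, topic, and broker config in the repo may differ). The handler acks the sender only once this write succeeds:

```go
package chat

import (
	"context"
	"encoding/json"

	"github.com/segmentio/kafka-go"
)

// Producer wraps a Kafka writer. The "chat-messages" topic and broker
// address are assumptions for this sketch.
type Producer struct {
	writer *kafka.Writer
}

func NewProducer() *Producer {
	return &Producer{writer: &kafka.Writer{
		Addr:         kafka.TCP("localhost:9092"),
		Topic:        "chat-messages",
		Balancer:     &kafka.Hash{}, // same key -> same partition
		RequiredAcks: kafka.RequireAll,
	}}
}

// Publish makes the message durable before anything else happens.
// Fan-out and history writes are downstream consumers of this topic.
func (p *Producer) Publish(ctx context.Context, m Message) error {
	payload, err := json.Marshal(m)
	if err != nil {
		return err
	}
	return p.writer.WriteMessages(ctx, kafka.Message{
		Key:   []byte(m.RoomID), // keyed by room: ordering holds within a room
		Value: payload,
	})
}
```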
Why partition by room?
Message ordering matters — but only within a room, not globally.
Kafka guarantees ordering per partition, so messages are partitioned by room_id.
Each room maps deterministically to a single partition, preserving strict ordering for that room while allowing horizontal throughput across rooms.
Per partition → ~5k msg/sec
N partitions → ~N × 5k msg/sec across independent rooms
This keeps chat semantics correct and scales throughput without global bottlenecks.
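The mapping itself is just deterministic hashing. A rough illustration of the idea (in practice the Kafka client's balancer does this, not application code):

```go
package chat

import "hash/fnv"

// partitionFor shows the idea behind room-keyed partitioning: the same
// room_id always hashes to the same partition, so per-room ordering holds
// while different rooms spread across partitions.
func partitionFor(roomID string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(roomID))
	return int(h.Sum32() % uint32(numPartitions))
}
```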
Redis makes it feel instant.
It sits on the hot path, handling room-based fan-out so that every WebSocket connection subscribed to a room receives new messages, regardless of which Go server instance it's connected to.
Redis is intentionally ephemeral. If it drops a message: Kafka still has it. Clients recover on reconnect. History stays correct.
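A sketch of the fan-out path with go-redis pub/sub and gorilla/websocket (the channel naming and both libraries are assumptions): publish to a per-room channel, and let each connection forward its room's channel.

```go
package chat

import (
	"context"

	"github.com/gorilla/websocket"
	"github.com/redis/go-redis/v9"
)

// fanOut publishes a message payload to the per-room Redis channel.
// The "room:<id>" naming is an assumption for this sketch.
func fanOut(ctx context.Context, rdb *redis.Client, roomID string, payload []byte) error {
	return rdb.Publish(ctx, "room:"+roomID, payload).Err()
}

// forwardRoom subscribes one WebSocket connection to its room's channel
// and forwards messages as they arrive. If Redis drops a message, the
// client recovers by re-fetching history from Postgres on reconnect.
func forwardRoom(ctx context.Context, rdb *redis.Client, roomID string, conn *websocket.Conn) error {
	sub := rdb.Subscribe(ctx, "room:"+roomID)
	defer sub.Close()

	for msg := range sub.Channel() {
		if err := conn.WriteMessage(websocket.TextMessage, []byte(msg.Payload)); err != nil {
			return err // connection gone; the client will reconnect
		}
	}
	return nil
}
```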
Postgres makes it usable.
Kafka consumers write to Postgres asynchronously. This means:
— DB slowness never blocks ingestion
— State can be rebuilt from Kafka replay
— Frontend reloads always return consistent history
Postgres is the materialized view users actually interact with.
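A sketch of the history writer: its own consumer group, offsets committed only after the insert, idempotent writes so replay is safe. Topic, group, table, and column names are assumptions; Message is the envelope from the lifecycle sketch above.

```go
package chat

import (
	"context"
	"database/sql"
	"encoding/json"
	"log"

	"github.com/segmentio/kafka-go"
)

// runHistoryWriter consumes the chat topic in its own consumer group and
// writes each message to Postgres. The offset is committed only after the
// insert succeeds, so state can always be rebuilt by replaying the topic.
func runHistoryWriter(ctx context.Context, db *sql.DB) error {
	reader := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		Topic:   "chat-messages",
		GroupID: "postgres-writer", // independent of the Redis fan-out group
	})
	defer reader.Close()

	for {
		m, err := reader.FetchMessage(ctx)
		if err != nil {
			return err
		}
		var msg Message
		if err := json.Unmarshal(m.Value, &msg); err != nil {
			log.Printf("skipping malformed message: %v", err)
		} else {
			// ON CONFLICT makes the write idempotent, so replays are safe.
			_, err = db.ExecContext(ctx,
				`INSERT INTO messages (id, room_id, sender_id, body, created_at)
				 VALUES ($1, $2, $3, $4, $5)
				 ON CONFLICT (id) DO NOTHING`,
				msg.ID, msg.RoomID, msg.SenderID, msg.Body, msg.CreatedAt)
			if err != nil {
				return err // don't commit the offset; retry after restart
			}
		}
		if err := reader.CommitMessages(ctx, m); err != nil {
			return err
		}
	}
}
```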
The most important architectural decision: heavy decoupling between consumers.
Kafka → Redis (real-time fan-out)
Kafka → Postgres (durable writes)
Producer publishes once. Doesn't care who consumes. Each consumer fails and restarts independently.
No cascading failures.
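In code terms, the real-time path is just another consumer of the same topic with its own GroupID, so its offsets and failures are independent of the Postgres writer. A sketch, reusing the fanOut helper from above:

```go
package chat

import (
	"context"
	"encoding/json"
	"log"

	"github.com/redis/go-redis/v9"
	"github.com/segmentio/kafka-go"
)

// runFanOut is the real-time consumer. It shares the topic with the
// Postgres writer but uses a different GroupID, so the two consumers keep
// independent offsets and fail independently.
func runFanOut(ctx context.Context, rdb *redis.Client) error {
	reader := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		Topic:   "chat-messages",
		GroupID: "redis-fanout", // separate group from "postgres-writer"
	})
	defer reader.Close()

	for {
		m, err := reader.ReadMessage(ctx) // auto-commits; losing one here is acceptable
		if err != nil {
			return err
		}
		var msg Message
		if err := json.Unmarshal(m.Value, &msg); err != nil {
			log.Printf("skipping malformed message: %v", err)
			continue
		}
		if err := fanOut(ctx, rdb, msg.RoomID, m.Value); err != nil {
			log.Printf("fan-out failed (clients recover via history): %v", err)
		}
	}
}
```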
Why it scales horizontally:
— Go servers are stateless (no sticky sessions)
— Real-time delivery is decoupled from durability
— Each component has one job
— Any component can rebuild from Kafka replay
Scaling becomes an ops concern. Not an app rewrite.
The frontend never sees any of this.
WebSockets → real-time updates via Redis
HTTP APIs → message history from Postgres
Kafka stays internal. Clean boundary.
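In sketch form, the whole client-facing surface is two routes: one upgrading to a WebSocket fed by Redis, one reading history from Postgres. Routes, query params, and schema here are assumptions:

```go
package chat

import (
	"database/sql"
	"encoding/json"
	"net/http"

	"github.com/gorilla/websocket"
	"github.com/redis/go-redis/v9"
)

var upgrader = websocket.Upgrader{} // CheckOrigin left at the default for brevity

// routes wires up the two client-facing surfaces. Kafka never appears here.
func routes(rdb *redis.Client, db *sql.DB) *http.ServeMux {
	mux := http.NewServeMux()

	// Real-time: upgrade to WebSocket and forward the room's Redis channel.
	mux.HandleFunc("/ws", func(w http.ResponseWriter, r *http.Request) {
		roomID := r.URL.Query().Get("room")
		conn, err := upgrader.Upgrade(w, r, nil)
		if err != nil {
			return
		}
		defer conn.Close()
		_ = forwardRoom(r.Context(), rdb, roomID, conn) // from the fan-out sketch above
	})

	// History: plain HTTP query against Postgres.
	mux.HandleFunc("/history", func(w http.ResponseWriter, r *http.Request) {
		roomID := r.URL.Query().Get("room")
		rows, err := db.QueryContext(r.Context(),
			`SELECT id, sender_id, body, created_at
			 FROM messages WHERE room_id = $1
			 ORDER BY created_at DESC LIMIT 50`, roomID)
		if err != nil {
			http.Error(w, "query failed", http.StatusInternalServerError)
			return
		}
		defer rows.Close()

		var history []Message
		for rows.Next() {
			var m Message
			if err := rows.Scan(&m.ID, &m.SenderID, &m.Body, &m.CreatedAt); err != nil {
				http.Error(w, "scan failed", http.StatusInternalServerError)
				return
			}
			m.RoomID = roomID
			history = append(history, m)
		}
		json.NewEncoder(w).Encode(history)
	})

	return mux
}
```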
Don't build a chat server. Build three systems that each do one thing well:
Kafka → correctness
Redis → latency
Postgres → usability
Everything else is trade-offs.