Josef Smetanka


Why your system can be 100% up and still completely broken

Your monitoring is green.
99.99% uptime.
Health checks passing.
No alerts.

Then support starts forwarding screenshots from users:

“I paid, but my order says cancelled.”
“The price changed after checkout.”
“It said in stock. Then refund.”

Welcome to a harsh truth engineers eventually learn:

Uptime measures server liveness.
Users care about state correctness.

And those are very different things.


The illusion of “up”

Most systems monitor process health:

```
HTTP 200 OK
```

But a distributed system can respond perfectly while being completely wrong.

Examples:

  • API returns 200 with stale data
  • Writes succeed but never reach downstream systems
  • Auth works, but data permissions are wrong
  • Checkout returns success, but payment never captured
  • Stock shows available, but orders already consumed it elsewhere

The system is alive.
The truth inside it is dead.
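
Here is the gap in miniature. A minimal sketch, with hypothetical `check_liveness` / `check_freshness` functions: the first is what most health endpoints test, the second is closer to what users actually depend on.

```python
import time

def check_liveness() -> bool:
    # "Is the process alive and able to answer?" -- this is all uptime sees.
    return True  # if this code runs at all, the process is "up"

def check_freshness(last_successful_sync_ts: float, max_staleness_s: float = 60.0) -> bool:
    # "Is the data this process serves still recent enough to trust?"
    return (time.time() - last_successful_sync_ts) <= max_staleness_s

# A service that last synced 10 minutes ago is perfectly "up"...
last_sync = time.time() - 600
print("liveness:", check_liveness())             # True  -> monitoring stays green
print("freshness:", check_freshness(last_sync))  # False -> users get stale data
```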


The real failure class: state drift

Most “it's up but broken” incidents are not crashes.
They’re state divergence problems.

Systems look healthy because:

  • CPU OK
  • DB reachable
  • Services responding

But internally:

  • caches out of sync
  • queues lagging
  • partial writes
  • retries overwriting newer state
  • external APIs delayed
  • eventual consistency biting you

Your monitoring says “system operational”.
Reality says “state is no longer trustworthy.”

That’s not downtime.
That’s silent correctness failure — much worse.
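
One practical way to catch drift early is a periodic reconciliation check. A minimal sketch, using in-memory dicts to stand in for the source-of-truth database and a downstream read model (names and data shapes are hypothetical):

```python
def find_drift(source_of_truth: dict, read_model: dict) -> dict:
    """Return keys whose values disagree (or are missing) between the two stores."""
    drift = {}
    for key, truth in source_of_truth.items():
        if read_model.get(key) != truth:
            drift[key] = {"expected": truth, "observed": read_model.get(key)}
    return drift

orders_db    = {"order-1": "PAID", "order-2": "CANCELLED", "order-3": "PAID"}
orders_cache = {"order-1": "PAID", "order-2": "PAID"}  # stale value, missing entry

print(find_drift(orders_db, orders_cache))
# {'order-2': {'expected': 'CANCELLED', 'observed': 'PAID'},
#  'order-3': {'expected': 'PAID', 'observed': None}}
```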


Why uptime is the wrong mental model

Uptime answers:

“Is the machine alive?”

Users ask:

“Did the system do the correct thing?”

Those are different layers:

| Layer   | What uptime measures        | What users experience |
| ------- | --------------------------- | --------------------- |
| Infra   | Processes running           | Irrelevant            |
| Network | Requests succeed            | Still irrelevant      |
| App     | Endpoint returns            | Still not enough      |
| State   | Correct data & side-effects | This is what matters  |

Most outages today are not infrastructure failures.
They are correctness failures in distributed state.


What you should actually measure

1. Correctness SLIs

Not just response success — result validity.

  • Did the order actually get created?
  • Did payment get captured?
  • Did inventory decrement once, not twice?
  • Is user data consistent across services?

If the side-effect didn’t happen, the request was a failure — even if it returned 200.
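
A minimal sketch of what a correctness SLI could look like, assuming you can verify each side-effect against the system that owns it (the fields below are hypothetical stand-ins for those lookups):

```python
from dataclasses import dataclass

@dataclass
class CheckoutRequest:
    order_id: str
    returned_http_200: bool
    payment_captured: bool  # verified against the payment provider
    order_persisted: bool   # verified against the orders table

def correctness_sli(requests: list) -> float:
    """Fraction of requests where the promised side-effects really happened."""
    if not requests:
        return 1.0
    correct = sum(
        1 for r in requests
        if r.returned_http_200 and r.payment_captured and r.order_persisted
    )
    return correct / len(requests)

batch = [
    CheckoutRequest("o1", True, True, True),
    CheckoutRequest("o2", True, False, True),   # 200 OK, but payment never captured
    CheckoutRequest("o3", True, True, False),   # 200 OK, but order never written
]
availability = sum(r.returned_http_200 for r in batch) / len(batch)
print(f"availability SLI: {availability:.2f}")            # 1.00 -- looks perfect
print(f"correctness  SLI: {correctness_sli(batch):.2f}")  # 0.33 -- the real story
```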


2. End-to-end invariants

Every system has truths that must always hold:

  • Stock never negative
  • Order cannot be paid twice
  • One user = one identity
  • Total debits = total credits

A broken invariant is worse than downtime.
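
A minimal sketch of an invariant check you might run as a periodic job, using in-memory data in place of real queries (data shapes and names are hypothetical):

```python
inventory = {"sku-1": 4, "sku-2": -1}                     # stock must never go negative
payments  = [("order-9", "pay-1"), ("order-9", "pay-2")]  # an order is paid at most once

def violated_invariants() -> list:
    violations = []
    for sku, qty in inventory.items():
        if qty < 0:
            violations.append(f"negative stock: {sku}={qty}")
    paid_orders = set()
    for order_id, _payment_id in payments:
        if order_id in paid_orders:
            violations.append(f"order paid twice: {order_id}")
        paid_orders.add(order_id)
    return violations

# This should page someone, even though nothing crashed and no endpoint errored.
print(violated_invariants())
# ['negative stock: sku-2=-1', 'order paid twice: order-9']
```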


3. User-journey success rate

Not “endpoint success”.
Journey success:

```
Login → Browse → Add to cart → Pay → Order confirmed
```

If this drops from 98% to 85%, you're broken.
Even if uptime is 100%.
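
A minimal sketch of a journey-success metric, assuming you can group events by session (event names and data are hypothetical). It measures, of the sessions that tried to pay, how many reached a confirmed order: abandonment doesn't count against you, but breakage does.

```python
JOURNEY_END = "order_confirmed"

sessions = {
    "s1": ["login", "browse", "add_to_cart", "pay", "order_confirmed"],
    "s2": ["login", "browse", "add_to_cart", "pay"],  # paid, never confirmed: broken
    "s3": ["login", "browse"],                        # abandoned early: not broken
}

def journey_success_rate(sessions: dict) -> float:
    """Of the sessions that tried to pay, how many reached a confirmed order?"""
    attempted = [events for events in sessions.values() if "pay" in events]
    confirmed = [events for events in attempted if JOURNEY_END in events]
    return len(confirmed) / len(attempted) if attempted else 1.0

print(f"journey success: {journey_success_rate(sessions):.0%}")  # 50%
```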


4. Lag and staleness metrics

Distributed systems rot from delay:

  • queue depth
  • replication lag
  • cache age
  • sync delay between services

Lag is future inconsistency waiting to explode.
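
A minimal sketch of lag and staleness checks against thresholds, with hard-coded numbers standing in for real measurements (metric names and thresholds are illustrative, not recommendations):

```python
THRESHOLDS = {
    "queue_depth_msgs":  10_000,  # messages waiting in the work queue
    "replication_lag_s": 5.0,     # primary -> replica delay
    "cache_age_s":       300.0,   # seconds since the cache was last refreshed
}

measurements = {
    "queue_depth_msgs":  42_000,
    "replication_lag_s": 1.2,
    "cache_age_s":       900.0,
}

for metric, value in measurements.items():
    status = "OK" if value <= THRESHOLDS[metric] else "LAGGING"
    print(f"{metric:>20}: {value:>10.1f}  [{status}]")
# Queue depth and cache age are over threshold: nothing is "down",
# but inconsistency is already building up.
```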


The mindset shift

Stop asking:

“Is the system up?”

Start asking:

“Is the system still telling the truth?”

Because modern outages look like this:

  • No errors
  • No crashes
  • No alerts

Just:

  • refunds
  • data mismatches
  • user confusion
  • support chaos

The worst failures are quiet.


The bottom line

A system that responds but lies is worse than a system that’s down.

Downtime is visible.
Incorrect state is invisible — until money, trust, or data integrity is gone.

Uptime is liveness.
Users care about correctness.

Those are not the same metric.


What’s the worst “everything green, everything wrong” incident you’ve seen?
