Your monitoring is green.
99.99% uptime.
Health checks passing.
No alerts.
Then support starts forwarding screenshots from users:
“I paid, but my order says cancelled.”
“The price changed after checkout.”
“It said in stock. Then refund.”
Welcome to a harsh truth engineers eventually learn:
Uptime measures server liveness.
Users care about state correctness.
And those are very different things.
The illusion of “up”
Most systems monitor process health:
HTTP 200 OK
But a distributed system can respond perfectly while being completely wrong.
Examples:
- API returns 200 with stale data
- Writes succeed but never reach downstream systems
- Auth works, but data permissions are wrong
- Checkout returns success, but payment never captured
- Stock shows available, but orders already consumed it elsewhere
The system is alive.
The truth inside it is dead.
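Here's that gap in code. A minimal sketch, assuming hypothetical `/health`, `/orders/{id}`, and `/payments/{id}` endpoints: the liveness check is what most monitoring does, the correctness probe is what users actually need.

```python
# Liveness vs. correctness. The endpoints below are hypothetical stand-ins
# for your own services.
import requests

def liveness_check(base_url: str) -> bool:
    # What most monitoring asks: "did the process answer?"
    return requests.get(f"{base_url}/health", timeout=2).status_code == 200

def correctness_probe(base_url: str, order_id: str) -> bool:
    # What users ask: "is the state actually right?"
    order = requests.get(f"{base_url}/orders/{order_id}", timeout=2).json()
    payment = requests.get(f"{base_url}/payments/{order_id}", timeout=2).json()
    # A 200 is not enough: the order and the payment record must agree.
    return order["status"] == "paid" and payment["captured_amount"] == order["total"]
```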
The real failure class: state drift
Most “it's up but broken” incidents are not crashes.
They’re state divergence problems.
Systems look healthy because:
- CPU OK
- DB reachable
- Services responding
But internally:
- caches out of sync
- queues lagging
- partial writes
- retries overwriting newer state
- external APIs delayed
- eventual consistency biting you
Your monitoring says “system operational”.
Reality says “state is no longer trustworthy.”
That’s not downtime.
That’s silent correctness failure — much worse.
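Take one drift source from that list: retries overwriting newer state. Here is a minimal sketch of a version-checked write, assuming a DB-API style connection and a hypothetical `orders` table with a `version` column. A stale retry becomes a countable no-op instead of a silent overwrite.

```python
def apply_status_update(conn, order_id: str, new_status: str, expected_version: int) -> bool:
    # Optimistic concurrency: the write only lands if the row still has the version
    # this (possibly retried) request originally read.
    cur = conn.cursor()
    cur.execute(
        "UPDATE orders SET status = %s, version = version + 1 "
        "WHERE id = %s AND version = %s",
        (new_status, order_id, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1  # False: the retry lost the race. Count it, don't blindly re-apply it.
```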
Why uptime is the wrong mental model
Uptime answers:
“Is the machine alive?”
Users ask:
“Did the system do the correct thing?”
Those are different layers:
| Layer | What uptime measures | What users experience |
|---|---|---|
| Infra | Processes running | Irrelevant |
| Network | Requests succeed | Still irrelevant |
| App | Endpoint returns | Still not enough |
| State | Correct data & side-effects | This is what matters |
Most outages today are not infrastructure failures.
They are correctness failures in distributed state.
What you should actually measure
1. Correctness SLIs
Not just response success — result validity.
- Did the order actually get created?
- Did payment get captured?
- Did inventory decrement once, not twice?
- Is user data consistent across services?
If the side-effect didn’t happen, the request was a failure — even if it returned 200.
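A minimal sketch of that idea, with the order reader, payment reader, and metrics sink passed in as callables (placeholders for whatever your stack actually uses):

```python
from typing import Callable, Optional

def checkout_correctness_sli(
    order_id: str,
    get_order: Callable[[str], Optional[dict]],    # reads the order service's view
    get_payment: Callable[[str], Optional[dict]],  # reads what the payment provider captured
    emit_metric: Callable[[str, int], None],       # your metrics sink
) -> bool:
    order = get_order(order_id)
    payment = get_payment(order_id)
    correct = (
        order is not None
        and order.get("status") == "confirmed"
        and payment is not None
        and payment.get("captured") is True
        and payment.get("amount") == order.get("total")
    )
    # The SLI counts requests whose side-effects actually happened,
    # not requests that merely returned 200.
    emit_metric("checkout.correct", 1 if correct else 0)
    return correct
```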
2. End-to-end invariants
Every system has truths that must always hold:
- Stock never negative
- Order cannot be paid twice
- One user = one identity
- Total debits = total credits
A broken invariant is worse than downtime.
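One way to watch them: a periodic invariant checker. A minimal sketch, assuming a DB-API style connection and hypothetical table names. Run it on a schedule and page when any count is non-zero.

```python
INVARIANT_QUERIES = {
    "stock_never_negative": "SELECT COUNT(*) FROM inventory WHERE quantity < 0",
    "no_double_payment": (
        "SELECT COUNT(*) FROM ("
        "  SELECT order_id FROM payments WHERE captured"
        "  GROUP BY order_id HAVING COUNT(*) > 1"
        ") AS dupes"
    ),
    "debits_equal_credits": "SELECT COUNT(*) FROM ledger_totals WHERE debits <> credits",
}

def check_invariants(conn) -> dict[str, int]:
    # Returns a violation count per invariant; 0 means the invariant still holds.
    cur = conn.cursor()
    violations = {}
    for name, query in INVARIANT_QUERIES.items():
        cur.execute(query)
        violations[name] = cur.fetchone()[0]
    return violations
```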
3. User-journey success rate
Not “endpoint success”.
Journey success:
Login → Browse → Add to cart → Pay → Order confirmed
If this drops from 98% to 85%, you're broken.
Even if uptime is 100%.
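A minimal sketch of measuring that, assuming hypothetical journey events with `session_id` and `step` fields:

```python
JOURNEY = ["login", "browse", "add_to_cart", "pay", "order_confirmed"]

def journey_success_rate(events: list[dict]) -> float:
    """Fraction of sessions that started the journey and reached the final step."""
    steps_by_session: dict[str, set[str]] = {}
    for event in events:
        steps_by_session.setdefault(event["session_id"], set()).add(event["step"])
    started = [steps for steps in steps_by_session.values() if JOURNEY[0] in steps]
    if not started:
        return 1.0  # no traffic: nothing to judge yet
    completed = [steps for steps in started if JOURNEY[-1] in steps]
    return len(completed) / len(started)
```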
4. Lag and staleness metrics
Distributed systems rot from delay:
- queue depth
- replication lag
- cache age
- sync delay between services
Lag is future inconsistency waiting to explode.
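A minimal sketch of treating staleness as a first-class metric. The probes and the metrics sink are hypothetical callables; wire them to whatever queue, database, and cache you actually run.

```python
import time
from typing import Callable

def report_staleness(
    emit_metric: Callable[[str, float], None],  # your metrics sink
    queue_depth: Callable[[str], float],        # hypothetical probe: messages waiting in a queue
    replication_lag_s: Callable[[], float],     # hypothetical probe: primary-to-replica delay
    cache_written_at: Callable[[str], float],   # hypothetical probe: unix time a cache key was filled
    last_sync_at: Callable[[str, str], float],  # hypothetical probe: unix time two services last agreed
) -> None:
    # Each of these numbers is future inconsistency; alert on thresholds, not just errors.
    emit_metric("queue.depth", queue_depth("orders"))
    emit_metric("replica.lag_seconds", replication_lag_s())
    emit_metric("cache.age_seconds", time.time() - cache_written_at("pricing"))
    emit_metric("sync.delay_seconds", time.time() - last_sync_at("inventory", "storefront"))
```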
The mindset shift
Stop asking:
“Is the system up?”
Start asking:
“Is the system still telling the truth?”
Because modern outages look like this:
- No errors
- No crashes
- No alerts
Just:
- refunds
- data mismatches
- user confusion
- support chaos
The worst failures are quiet.
The bottom line
A system that responds but lies is worse than a system that’s down.
Downtime is visible.
Incorrect state is invisible — until money, trust, or data integrity is gone.
Uptime is liveness.
Users care about correctness.
Those are not the same metric.
What’s the worst “everything green, everything wrong” incident you’ve seen?