Josef Smetanka


Why your system can be 100% up and still completely broken

Your monitoring is green.
99.99% uptime.
Health checks passing.
No alerts.

Then support starts forwarding screenshots from users:

“I paid, but my order says cancelled.”
“The price changed after checkout.”
“It said in stock. Then refund.”

Welcome to a harsh truth engineers eventually learn:

Uptime measures server liveness.
Users care about state correctness.

And those are very different things.


The illusion of “up”

Most systems monitor process health:

```
HTTP 200 OK
```

But a distributed system can respond perfectly while being completely wrong.

Examples:

  • API returns 200 with stale data
  • Writes succeed but never reach downstream systems
  • Auth works, but data permissions are wrong
  • Checkout returns success, but payment never captured
  • Stock shows available, but orders already consumed it elsewhere

The system is alive.
The truth inside it is dead.
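
Here is the gap in miniature. A minimal sketch, with hypothetical `check_liveness` / `check_freshness` functions: the first is what most health endpoints test, the second is closer to what users actually depend on.

```python
import time

def check_liveness() -> bool:
    # "Is the process alive and able to answer?" -- this is all uptime sees.
    return True  # if this code runs at all, the process is "up"

def check_freshness(last_successful_sync_ts: float, max_staleness_s: float = 60.0) -> bool:
    # "Is the data this process serves still recent enough to trust?"
    return (time.time() - last_successful_sync_ts) <= max_staleness_s

# A service that last synced 10 minutes ago is perfectly "up"...
last_sync = time.time() - 600
print("liveness:", check_liveness())             # True  -> monitoring stays green
print("freshness:", check_freshness(last_sync))  # False -> users get stale data
```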


The real failure class: state drift

Most “it's up but broken” incidents are not crashes.
They’re state divergence problems.

Systems look healthy because:

  • CPU OK
  • DB reachable
  • Services responding

But internally:

  • caches out of sync
  • queues lagging
  • partial writes
  • retries overwriting newer state
  • external APIs delayed
  • eventual consistency biting you

Your monitoring says “system operational”.
Reality says “state is no longer trustworthy.”

That’s not downtime.
That’s silent correctness failure — much worse.
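
One practical way to catch drift early is a periodic reconciliation check. A minimal sketch, using in-memory dicts to stand in for the source-of-truth database and a downstream read model (names and data shapes are hypothetical):

```python
def find_drift(source_of_truth: dict, read_model: dict) -> dict:
    """Return keys whose values disagree (or are missing) between the two stores."""
    drift = {}
    for key, truth in source_of_truth.items():
        if read_model.get(key) != truth:
            drift[key] = {"expected": truth, "observed": read_model.get(key)}
    return drift

orders_db    = {"order-1": "PAID", "order-2": "CANCELLED", "order-3": "PAID"}
orders_cache = {"order-1": "PAID", "order-2": "PAID"}  # stale value, missing entry

print(find_drift(orders_db, orders_cache))
# {'order-2': {'expected': 'CANCELLED', 'observed': 'PAID'},
#  'order-3': {'expected': 'PAID', 'observed': None}}
```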


Why uptime is the wrong mental model

Uptime answers:

“Is the machine alive?”

Users ask:

“Did the system do the correct thing?”

Those are different layers:

| Layer   | What uptime measures        | What users experience |
| ------- | --------------------------- | --------------------- |
| Infra   | Processes running           | Irrelevant            |
| Network | Requests succeed            | Still irrelevant      |
| App     | Endpoint returns            | Still not enough      |
| State   | Correct data & side-effects | This is what matters  |

Most outages today are not infrastructure failures.
They are correctness failures in distributed state.


What you should actually measure

1. Correctness SLIs

Not just response success — result validity.

  • Did the order actually get created?
  • Did payment get captured?
  • Did inventory decrement once, not twice?
  • Is user data consistent across services?

If the side-effect didn’t happen, the request was a failure — even if it returned 200.
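
A minimal sketch of what a correctness SLI could look like, assuming you can verify each side-effect against the system that owns it (the fields below are hypothetical stand-ins for those lookups):

```python
from dataclasses import dataclass

@dataclass
class CheckoutRequest:
    order_id: str
    returned_http_200: bool
    payment_captured: bool  # verified against the payment provider
    order_persisted: bool   # verified against the orders table

def correctness_sli(requests: list) -> float:
    """Fraction of requests where the promised side-effects really happened."""
    if not requests:
        return 1.0
    correct = sum(
        1 for r in requests
        if r.returned_http_200 and r.payment_captured and r.order_persisted
    )
    return correct / len(requests)

batch = [
    CheckoutRequest("o1", True, True, True),
    CheckoutRequest("o2", True, False, True),   # 200 OK, but payment never captured
    CheckoutRequest("o3", True, True, False),   # 200 OK, but order never written
]
availability = sum(r.returned_http_200 for r in batch) / len(batch)
print(f"availability SLI: {availability:.2f}")            # 1.00 -- looks perfect
print(f"correctness  SLI: {correctness_sli(batch):.2f}")  # 0.33 -- the real story
```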


2. End-to-end invariants

Every system has truths that must always hold:

  • Stock never negative
  • Order cannot be paid twice
  • One user = one identity
  • Total debits = total credits

A broken invariant is worse than downtime.
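
A minimal sketch of an invariant check you might run as a periodic job, using in-memory data in place of real queries (data shapes and names are hypothetical):

```python
inventory = {"sku-1": 4, "sku-2": -1}                     # stock must never go negative
payments  = [("order-9", "pay-1"), ("order-9", "pay-2")]  # an order is paid at most once

def violated_invariants() -> list:
    violations = []
    for sku, qty in inventory.items():
        if qty < 0:
            violations.append(f"negative stock: {sku}={qty}")
    paid_orders = set()
    for order_id, _payment_id in payments:
        if order_id in paid_orders:
            violations.append(f"order paid twice: {order_id}")
        paid_orders.add(order_id)
    return violations

# This should page someone, even though nothing crashed and no endpoint errored.
print(violated_invariants())
# ['negative stock: sku-2=-1', 'order paid twice: order-9']
```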


3. User-journey success rate

Not “endpoint success”.
Journey success:

```
Login → Browse → Add to cart → Pay → Order confirmed
```

If this drops from 98% to 85%, you're broken.
Even if uptime is 100%.
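
A minimal sketch of a journey-success metric, assuming you can group events by session (event names and data are hypothetical). It measures, of the sessions that tried to pay, how many reached a confirmed order: abandonment doesn't count against you, but breakage does.

```python
JOURNEY_END = "order_confirmed"

sessions = {
    "s1": ["login", "browse", "add_to_cart", "pay", "order_confirmed"],
    "s2": ["login", "browse", "add_to_cart", "pay"],  # paid, never confirmed: broken
    "s3": ["login", "browse"],                        # abandoned early: not broken
}

def journey_success_rate(sessions: dict) -> float:
    """Of the sessions that tried to pay, how many reached a confirmed order?"""
    attempted = [events for events in sessions.values() if "pay" in events]
    confirmed = [events for events in attempted if JOURNEY_END in events]
    return len(confirmed) / len(attempted) if attempted else 1.0

print(f"journey success: {journey_success_rate(sessions):.0%}")  # 50%
```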


4. Lag and staleness metrics

Distributed systems rot from delay:

  • queue depth
  • replication lag
  • cache age
  • sync delay between services

Lag is future inconsistency waiting to explode.
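
A minimal sketch of lag and staleness checks against thresholds, with hard-coded numbers standing in for real measurements (metric names and thresholds are illustrative, not recommendations):

```python
THRESHOLDS = {
    "queue_depth_msgs":  10_000,  # messages waiting in the work queue
    "replication_lag_s": 5.0,     # primary -> replica delay
    "cache_age_s":       300.0,   # seconds since the cache was last refreshed
}

measurements = {
    "queue_depth_msgs":  42_000,
    "replication_lag_s": 1.2,
    "cache_age_s":       900.0,
}

for metric, value in measurements.items():
    status = "OK" if value <= THRESHOLDS[metric] else "LAGGING"
    print(f"{metric:>20}: {value:>10.1f}  [{status}]")
# Queue depth and cache age are over threshold: nothing is "down",
# but inconsistency is already building up.
```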


The mindset shift

Stop asking:

“Is the system up?”

Start asking:

“Is the system still telling the truth?”

Because modern outages look like this:

  • No errors
  • No crashes
  • No alerts

Just:

  • refunds
  • data mismatches
  • user confusion
  • support chaos

The worst failures are quiet.


The bottom line

A system that responds but lies is worse than a system that’s down.

Downtime is visible.
Incorrect state is invisible — until money, trust, or data integrity is gone.

Uptime is liveness.
Users care about correctness.

Those are not the same metric.


What’s the worst “everything green, everything wrong” incident you’ve seen?
