DEV Community

# reliability

General discussions on building and maintaining reliable software systems.

Posts

Reliability vs Uptime: Why Availability Fails at Scale

5
Comments 1
3 min read
A Measurable Snapchat Proxy Validation Mini Lab You Can Run This Week

6 min read
API Error Handling: Patterns That Actually Work

9 min read
Why your system can be 100% up and still completely broken

3
Comments 1
2 min read
Part 5 — Cost, Latency, and Failure Are the Design

1 min read
I Lost 25% of a Form by Scrolling Up. Nobody Noticed.

2
3 min read
Architecture Under Load #2 - Scalability, Performance, and Reliability Don’t Break the Same Way

6
3 min read
Vibe Coding vs AI-Driven Development: The Contracts Problem (and GS-TDD)

1
4 min read
Autosave works. Until it doesn’t.

4 min read
The Future of SRE: Why AI is the "Force Multiplier" Your Infrastructure Needs

3 min read
The Samurai Server: Why "Heroic" Systems Always Die

4 min read
How when AWS was down, we were not

18
Comments 2
37 min read
This PHD from AWS Might Save Your Weekend!

1
Comments 1
5 min read
The Real State of Helm Chart Reliability (2025): Hidden Risks in 100+ Open‑Source Charts

23 min read
Architecting for system reliability and scalability demands clean foundational code.

4 min read
When Everything Is Instrumented, and You Still Don't Know What's Broken

2 min read
Why Top Developers Prioritize Failure Management

4 min read
WTF is Site Reliability Engineering?

1
3 min read
Designing AI Applications: Principles from Distributed Systems Applicable in a New AI World

8 min read
Building Durable Cloud Control Systems with Temporal

5 min read
Unleashing Resilience: 15+ Essential Chaos Engineering Tools for Robust Systems

6 min read
Nomadic Infrastructure Design for AI Workloads

15 min read
We're making our availability metrics public

3 min read
Microservices Reliability Playbook, Part 6 - Multi-Service Patterns

4 min read
Microservices Reliability Playbook, Part 7 - Call Patterns

6 min read