Chapter 3: The Things That Actually Break Real Systems


Reliability books spend chapters describing what should happen. This one is different.

This chapter describes what actually happens. The gap between those two things kills more systems than any architectural flaw.

The hard truths

Availability and control-path failures

1. Identity failures can become system-wide outages

Not optional services. Everything.

When identity breaks in a critical path, large portions of the system can become unavailable to users and operators.

You cannot improvise a fix for identity failure in the middle of an incident. Without a fallback built in advance, you cannot degrade gracefully, and retries do not help. The authentication layer is upstream of everything else.

The trap: Most teams invest in compute redundancy, database replication, multi-region failover. Then they route all identity through a single control-plane path to a third-party provider with an SLA, no failover, and a customer support queue.

What happens in real incidents:

Your identity provider experiences a transient token issue
Your caching layer misses the update
For 7 minutes, every login fails globally
Nobody can access anything
You have no fallback

This pattern is not theoretical. It happens to enterprise-scale systems regularly.
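
One possible shape for a fallback is to keep serving the last known-good token signing keys for a bounded grace period when the identity provider cannot be reached. The sketch below is a minimal illustration under that assumption. The names `StaleTolerantKeyCache`, `fetch_keys`, `ttl`, and `stale_grace` are hypothetical and do not correspond to any provider's API.

```python
import time
from typing import Callable, Dict, Optional


class StaleTolerantKeyCache:
    """Caches signing keys and serves stale ones for a bounded window if refresh fails.

    Hypothetical sketch: fetch_keys stands in for whatever call retrieves
    signing keys from your identity provider.
    """

    def __init__(self, fetch_keys: Callable[[], Dict[str, str]],
                 ttl: float = 300.0, stale_grace: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch_keys = fetch_keys
        self._ttl = ttl
        self._stale_grace = stale_grace
        self._clock = clock
        self._keys: Optional[Dict[str, str]] = None
        self._fetched_at = 0.0

    def get_keys(self) -> Dict[str, str]:
        now = self._clock()
        age = now - self._fetched_at
        if self._keys is not None and age < self._ttl:
            return self._keys  # still fresh, no call to the provider
        try:
            self._keys = self._fetch_keys()
            self._fetched_at = now
            return self._keys
        except Exception:
            # Provider unreachable: reuse the last good keys, but only
            # inside the grace window. After that, fail loudly.
            if self._keys is not None and age < self._ttl + self._stale_grace:
                return self._keys
            raise
```

The trade-off is explicit: during the grace window, a key the provider has revoked is still honored. Whether that is acceptable is a risk decision, not a coding detail.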

2. Latency is failure

Most teams distinguish between:

Service up (good)
Service down (bad)

They skip the middle: Service slow (worse).

When a service gets slow enough:

Users retry
Load increases
System gets slower
Retry storms start
System collapses
Now it looks like a crash

It started as latency.
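
The retry loop above is partly under the client's control. A minimal sketch, assuming clients retry through one shared helper: cap the number of attempts and spread retries out with exponential backoff and random jitter so that slow responses do not trigger synchronized waves of retries. The function name and parameters are hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(op: Callable[[], T], max_attempts: int = 4,
                      base_delay: float = 0.1, max_delay: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random) -> T:
    """Retry op with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure, do not retry forever
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)  # full jitter: random delay in [0, cap)
    raise RuntimeError("max_attempts must be at least 1")
```

Backoff alone does not stop a storm if every caller retries the maximum number of times. It only buys the slow service room to recover.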

Why latency is worse than outage:

An outage is visible immediately
Latency can look like normal variance until it is not
Customers assume their connection is the problem, not the service
Retry storms escalate the problem before you notice

The common miss: You monitor P99 latency at 500ms, feel fine, and do not notice that P99.9 is 30 seconds until the platform is cascading.
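
To make the percentile gap concrete, here is a small sketch using a nearest-rank percentile on synthetic data. The latency numbers are invented for illustration and are not measurements from any real system.

```python
import math
from typing import List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


# Synthetic data (made up for illustration): 10,000 requests,
# 9,950 finishing in 0.5 s and 50 stuck for 30 s.
latencies = [0.5] * 9950 + [30.0] * 50

p99 = percentile(latencies, 99.0)    # 0.5 seconds: looks healthy
p999 = percentile(latencies, 99.9)   # 30.0 seconds: 1 in 200 requests is stuck
print(f"P99={p99}s  P99.9={p999}s")
```

In this sample the P99 dashboard stays green while half a percent of requests wait 30 seconds, which is exactly the population most likely to retry.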

3. Partial failure is the norm, not the exception

You already know this intellectually. You do not internalize it operationally.

Assume that right now, in production:

One dependency is 200ms slower than baseline
One region sees higher error rates
One thin critical path is more fragile than you think
One monitoring blind spot exists
One cache is stale

These are not failures. They are the default operating conditions of a complex system.

The mistake: Designing for the happy path and hoping the sad path never happens at scale. It will. It is.
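
One common way to design for the sad path is a circuit breaker: after repeated failures, stop calling the struggling dependency for a cool-down period and serve a degraded answer instead. This is a minimal single-threaded sketch, not a production implementation. The class name, thresholds, and the `fallback` callable are assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """After repeated failures, skip the dependency and serve a fallback until a cool-down passes."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, op: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                return fallback()  # open: do not touch the struggling dependency
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = op()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result
```

The fallback is the real design work. A breaker with nothing useful to fall back to only turns a slow failure into a fast one.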

Detection and response failures

4. Monitoring fails when you need it most

The reality is brutal:

When the system is under extreme stress and you need observability most, your observability itself becomes unreliable.

Why:

Monitoring traffic adds load during overload
Telemetry systems become the bottleneck
Dashboards become unreachable
Logs are dropped because the logging system cannot keep up
Alarms fire but you cannot query why

What you have left: Your runbooks, your heuristics, and the 30 lines of code you wrote 8 months ago that nobody remembers.

Real incident pattern: System starts degrading → you go to check dashboards → dashboards are slow → you cannot tell what is happening → you guess → you make it worse.
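
One way to keep telemetry from adding to the overload is to shed it deliberately and count what was shed, rather than letting the pipeline back up. The sketch below assumes an in-process buffer in front of whatever ships your logs. `BoundedTelemetryBuffer` and its methods are hypothetical names, not part of any logging library.

```python
from collections import deque
from typing import Deque, List


class BoundedTelemetryBuffer:
    """Holds at most `capacity` events; extra events are dropped and counted."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._events: Deque[str] = deque()
        self.dropped = 0  # how many events were shed since start

    def record(self, event: str) -> bool:
        """Return True if the event was buffered, False if it was dropped."""
        if len(self._events) >= self._capacity:
            self.dropped += 1
            return False
        self._events.append(event)
        return True

    def drain(self) -> List[str]:
        """Hand the buffered events to the shipper and empty the buffer."""
        batch = list(self._events)
        self._events.clear()
        return batch
```

The drop counter matters as much as the buffer. It tells you after the incident how much of the picture you were missing.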

5. Humans slow down recovery

Every decision in an incident takes time:

“Should we roll back?” (5 minutes)
“Who can approve this?” (10 minutes)
“Is this the right fix?” (15 minutes)
“Are we sure we should do this?” (10 minutes)

In a high-stress incident, that is 40 minutes of delay for what feels like a fast response.

The multiplier: If multiple people need to coordinate, approval chains break, or ownership is unclear, you add hours.

What you cannot see: The mental overhead. During a bad incident, smart engineers make bad decisions. Not because they are not smart. Because their cognitive load is maxed.

This is not soft skills. This is infrastructure. If your decision-making process requires synchronous consensus during an incident, your incident response time is not 4 hours. It is 4 hours plus however long it takes to reach consensus.

Most teams do not have a documented decision cascade. “The VP will decide” is not a decision cascade during a 2 AM incident.

6. Failover is harder than you think

You have tested regional failover. Good.

You have not tested it:

While your database is under replication load
While your identity provider is degraded
While your support channel is overloaded
While you are under attack or cost pressure
While your team is split between time zones
While your team is tired from the previous incident

The gap: Lab tests look great. Production failover is messier. You have dependencies on the failing region you forgot about. Your DNS TTL was set to an hour. Your failover procedure assumes a component that is currently offline. Your backup is older than you thought.

What actually happens: Failover starts. You discover issues. You stop. You troubleshoot. You restart. 45 minutes in, you have still not fully failed over.

Correctness and dependency failures

7. Data corruption is worse than downtime

A system that is down for 2 hours is a problem.

A system that silently corrupts data for 2 hours is a career-ending problem.

Why teams miss this:

They focus on uptime metrics
They assume databases maintain consistency
They think transactions solve it
They do not test recovery from partial writes

Where it comes from:

Incomplete writes during failover
Events processed out of order
Duplicate writes that should be idempotent but are not
Cascading updates that leave data in an inconsistent state
Logical corruption that looks fine to infrastructure

The consequence: You restore the service quickly. You spend weeks validating that data is actually correct and untangling duplicates or reversals.
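
Duplicate writes are the easiest of these to illustrate. A minimal sketch, assuming each write carries a caller-supplied idempotency key: the store remembers which keys it has applied and ignores repeats. `Ledger` is an in-memory stand-in invented for this example. A real system would enforce the key with a unique constraint in the same transaction as the write.

```python
from typing import Dict


class Ledger:
    """In-memory stand-in for a datastore that applies each keyed write at most once."""

    def __init__(self) -> None:
        self.balance = 0
        self._applied: Dict[str, int] = {}  # idempotency key -> amount applied

    def apply_credit(self, idempotency_key: str, amount: int) -> bool:
        """Apply the credit once per key. Return False for a duplicate delivery."""
        if idempotency_key in self._applied:
            if self._applied[idempotency_key] != amount:
                # Same key, different payload: a bug upstream, not a retry.
                raise ValueError("idempotency key reused with a different amount")
            return False
        self._applied[idempotency_key] = amount
        self.balance += amount
        return True


ledger = Ledger()
ledger.apply_credit("order-1001", 100)
ledger.apply_credit("order-1001", 100)  # retried after a timeout: ignored
print(ledger.balance)  # 100, not 200
```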

8. Dependency chains are longer than you think

You know you depend on a cloud provider. You also depend on:

Your DNS provider
Your CDN
Your payment processor
Your email provider
Your identity provider
Your logging platform
Your monitoring platform
Your incident communication tool

Counting the cloud provider, that is nine different SLAs and nine different failure modes. If all of them sit in your critical path and fail independently, the combined availability is the product of all of them.

If each is 99.9%, the combined availability is about 99.1%. If your own target is 99.9%, you are already below it before any of your code fails.

What teams do: Evaluate each SLA in isolation. Do not model the chain.
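
Modeling the chain can start as a few lines of arithmetic. The sketch below counts the cloud provider plus the eight providers listed above, and assumes every dependency is in the critical path and fails independently. Both assumptions are simplifications, and the 99.9% figures are the illustrative values from the text, not real SLAs.

```python
import math
from typing import Dict

# Illustrative availabilities only: 99.9% for each dependency.
dependencies: Dict[str, float] = {
    "cloud provider": 0.999,
    "DNS provider": 0.999,
    "CDN": 0.999,
    "payment processor": 0.999,
    "email provider": 0.999,
    "identity provider": 0.999,
    "logging platform": 0.999,
    "monitoring platform": 0.999,
    "incident communication tool": 0.999,
}


def serial_availability(availabilities: Dict[str, float]) -> float:
    """Availability of a chain in which every dependency must be up at once."""
    return math.prod(availabilities.values())


combined = serial_availability(dependencies)
minutes_per_30_days = 30 * 24 * 60
expected_downtime_minutes = (1 - combined) * minutes_per_30_days
print(f"combined: {combined:.4%}, about {expected_downtime_minutes:.0f} minutes per 30 days")
```

With these inputs the chain comes out at roughly 99.1%, or about six and a half hours of expected unavailability per 30 days, compared with about 43 minutes for a single 99.9% service.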

9. Capacity is a first-class failure domain

You can have perfect architecture and still hit a wall: quota exhaustion.

Region capacity exhausted (cannot spawn new instances)
Database connection pool exhausted (new requests queue, then time out)
DNS query rate limit hit (lookups fail)
API rate limit hit (external service cuts you off)
Disk quota hit (cannot write logs, cannot deploy)
Network bandwidth limit hit (traffic is shaped, service slows down)

None of these are architecture failures. All of them will take your system down.

The trap: Most teams discover these during growth or after a major incident when load spikes. By then, it is too late.
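
Finding these limits before a load spike can begin with a simple headroom check. A minimal sketch, assuming you can collect a (used, limit) pair for each resource: flag anything at or above a utilization threshold. The resource names, numbers, and the 80% threshold are made up for illustration.

```python
from typing import Dict, List, Tuple


def quotas_near_limit(usage: Dict[str, Tuple[float, float]],
                      threshold: float = 0.8) -> List[str]:
    """Return the names of resources whose used/limit ratio is at or above threshold."""
    flagged = []
    for name, (used, limit) in usage.items():
        # A non-positive limit is treated as already exhausted.
        if limit <= 0 or used / limit >= threshold:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical snapshot: (used, limit) per resource.
usage = {
    "db_connections": (180, 200),
    "disk_gb": (400, 1000),
    "api_calls_per_min": (950, 1000),
}

print(quotas_near_limit(usage))  # ['api_calls_per_min', 'db_connections']
```

The hard part is the inventory, not the arithmetic. Many of the limits in the list above are not exposed as metrics until you go looking for them.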

Economic and governance failures

10. Reliability work gets cut first

During budget cycles, reliability work gets cut because:

It produces no visible feature
The system is “working fine” right now
The cost savings seem safe

Then an incident happens six months later that costs more than the annual reliability budget.

The meta-problem: Organizations are optimized to cut costs immediately and pay for incidents eventually. That is the incentive structure. No amount of reliability technology fixes misaligned incentives.

What this means

You cannot engineer your way around these hard truths. You must:

Accept that identity is critical infrastructure. Do not route it through single points of failure. Have a fallback. Test it.
Monitor latency, not just uptime. P99.9 matters. Tail latencies are where failure hides.
Design for partial failure as the default. Not as an edge case. As the baseline operating condition.
Do not trust your monitoring during incidents. Have runbooks that work without dashboards. Have decision cascades that work without consensus.
Document the decision chain before you need it. Who decides during an incident? What is the escalation path? What is non-negotiable?
Test failover under load. Not in the lab. Inject real incidents and measure how long recovery actually takes.
Validate data correctness. Measure corruption indicators (duplicates, out-of-order events, stale records) alongside availability.
Map the full dependency chain. Calculate the combined availability. Plan what happens if one node fails.
Treat capacity as architecture. Not as an operations detail. Model it. Plan for it. Test it.
Make reliability funding non-discretionary. Tie it to risk, not to budget cycles. If you cut it, the risk goes up. Make that explicit.

Chapter index

Chapter	Topic
Chapter 1	Opening thesis: reliability as economic decision
Chapter 2	Incentives and organizational failure
Chapter 3	The things that actually break
Chapter 4	The financial model
Chapter 5	Provider failures and status page reality
Chapter 6	Partial failures and degraded-state design
Chapter 7	Hidden cost of observability tooling
Chapter 8	Trade-offs: on-call, FinOps, and human cost
Chapter 9	Governance system
Chapter 10	Execution and the next quarter
Chapter 12	Reliability pricing and the SaaS margin trap
Appendix	Operating artifacts and policy templates
Chapter 13	Maturity and organizational adoption

I work at Microsoft. The views expressed here are my own and based solely on publicly available information. This content is for educational purposes and does not represent official Microsoft guidance or commitments.