Chapter 7b: How You See (and Miss) Reality

Your team has never had a major outage.

Therefore, your system is reliable.

That is not logic. It is survivorship bias.

This chapter is about the illusions that create false confidence, because you are most vulnerable when you are most confident.

Illusion 1: The SLA theater

Your provider has a 99.9% SLA. You have a 99.5% SLA. You are good.

What you are actually measuring:

Availability = “is the system responding?”
Uptime = “did we hit our SLA?”

What you are not measuring:

Correctness (is the data right?)
Latency (are responses fast enough?)
Cascade failures (when one service fails, how many go down?)
Partial failures (some users down, others up)

Your SLA can be perfect while your customers experience continuous problems.

Real example:

Your SLA: “API responds in < 5 seconds”
Your provider SLA: “99.9% availability”
Your system is up 99.9% of the time
But when a downstream service is slow, your API times out after 5 seconds
You return a 503 and the customer experience is bad
But your availability metrics still show you meeting the SLA, because you returned a response (albeit an error)

The theater is not lying. It is measuring something that does not correlate with reliability.

The SLA measurement problem

Most SLAs measure:

“Did the service respond?”

They do not measure:

“Did the response contain correct data?”
“Was the response fast enough to be useful?”
“Did the response trigger a cascade failure elsewhere?”

You can be 99.9% “available” while being 50% useful.
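
The gap between "responded" and "useful" can be made concrete with a small sketch. Everything here is hypothetical: the `Request` record, the 2-second latency budget, and the sample traffic are assumptions chosen to reproduce the 99.9% versus 50% figures above, and `data_correct` assumes you have some separate correctness check to feed it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    status: Optional[int]   # HTTP status, or None if no response ever arrived
    latency_s: float        # seconds until the response (or until the client gave up)
    data_correct: bool      # outcome of a separate correctness check (assumed to exist)


def availability(requests):
    """'Did the service respond?' Any status code counts, including errors."""
    responded = sum(1 for r in requests if r.status is not None)
    return responded / len(requests)


def usefulness(requests, latency_budget_s=2.0):
    """Fraction of requests that succeeded, fast enough, with correct data."""
    useful = sum(
        1 for r in requests
        if r.status is not None
        and 200 <= r.status < 300
        and r.latency_s <= latency_budget_s
        and r.data_correct
    )
    return useful / len(requests)


# Hypothetical sample of 1000 requests.
sample = (
    [Request(200, 0.3, True)] * 500      # good
    + [Request(200, 4.0, True)] * 200    # correct, but too slow to be useful
    + [Request(200, 0.3, False)] * 100   # fast, but wrong data
    + [Request(503, 5.0, False)] * 199   # downstream timeout, returned an error
    + [Request(None, 30.0, False)] * 1   # no response at all
)

print(f"availability: {availability(sample):.1%}")   # 99.9%
print(f"usefulness:   {usefulness(sample):.1%}")     # 50.0%
```

Both numbers come from the same traffic. Only the definition of "good" changed.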

The SLA ceiling problem

Your SLA promises 99.9% uptime. Your infrastructure provider has the same SLA.

Combined: 99.9% * 99.9% = 99.8001%.

Your SLA promises more than is mathematically possible whenever your service depends on theirs.

The rescue fantasy: “We will have redundancy.”

Great. You have two providers, both 99.9%.

If they are independent: 1 - (0.001 * 0.001) = 99.9999%.

They are not independent. They are both in the cloud. They are both using the same DNS provider. They are both relying on the same BGP routes. During the big outage, both go down together.

You thought you had redundancy. You had correlation you did not measure.
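
The arithmetic in this section can be checked with a few lines. This is a simplified sketch: the function names are made up for illustration, and the "shared dependency" case assumes the correlation takes the form of one common component (for example a shared DNS provider) sitting in series in front of both providers. Real correlated failures are messier than that.

```python
def serial(*availabilities):
    """All components must be up at once: multiply the availabilities."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def redundant_independent(a, b):
    """Either of two providers is enough, assuming their failures are independent."""
    return 1 - (1 - a) * (1 - b)


def redundant_with_shared_dependency(a, b, shared):
    """Two redundant providers that both rely on one shared component.

    The shared component is in series with the redundant pair, so it caps the result.
    """
    return serial(shared, redundant_independent(a, b))


print(f"{serial(0.999, 0.999):.4%}")                                   # 99.8001%
print(f"{redundant_independent(0.999, 0.999):.4%}")                    # 99.9999%
print(f"{redundant_with_shared_dependency(0.999, 0.999, 0.999):.4%}")  # 99.8999%
```

With a single shared 99.9% dependency, the redundant pair ends up no better than one provider on its own.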

Illusion 2: The untested recovery

You have a disaster recovery plan.

You have never actually run the full recovery.

You have done “tabletop exercises” where you read through the steps. You have not done a full “production failover” where you actually switch to the DR environment while keeping production running.

Why this matters:

Recovery steps are outdated (people who wrote them left the company)
Recovery steps are incomplete (one crucial dependency was forgotten)
Recovery steps take 3 hours instead of the 30 minutes you planned
Recovery steps fail at step 7 of 15 and nobody knows what to do next

You do not know if your DR plan works until you test it.

You probably will not test it until you need it.

You find out it does not work at that point.

Illusion 3: Alert fatigue masquerading as rigor

You have 200 alerts configured.

You get paged 3 times a week on average.

Therefore, your monitoring is good.

What is actually happening:

197 alerts are tuned so loosely that they fire on normal operation
You get paged constantly and stop responding to pages
The 3 real issues slip through while you are dismissing false alerts

Alert fatigue is the state where you have so many alerts that you ignore pages.

The outcome: the one time it matters, you miss it because you have trained yourself to ignore pages.

The monitoring threshold problem

You set alert thresholds to avoid false positives:

CPU > 80% (alert on high utilization)
Errors > 1% (alert on high error rate)
Latency p95 > 2 seconds (alert on slow responses)

These are reasonable-looking thresholds.

However, what if:

Your system peaks at 75% CPU normally. 80% is actually fine. You never alert.
Your error rate is 0.8% during normal operation. 1% is not an anomaly, it is Tuesday. You never alert.
Your latency p95 is 1.8 seconds. 2 seconds is not a problem. You never alert.

Meanwhile:

At 77% CPU, your system starts dropping requests (not 80%, but 77%)
At 0.85% error rate, you have a repeating pattern of failures
At 1.9 second p95, you are one slow query away from cascade failure

Your alerts are tuned for “loud failures” (crash, go offline). They miss “slow degradation” (gradually getting worse until catastrophe).
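
One way to catch slow degradation is to alert on the trend as well as on the absolute value. The sketch below is a minimal illustration, not a recommended detector: the window size, the 15% tolerance, and the latency series are all hypothetical, and production systems would normally use the anomaly or rate-of-change features of their monitoring tool instead.

```python
from statistics import mean


def static_alert(samples, threshold):
    """Classic alert: fire if any sample crosses the fixed threshold."""
    return any(s > threshold for s in samples)


def degradation_alert(samples, window=5, max_increase=0.15):
    """Fire if the mean of the latest window is more than `max_increase`
    (as a fraction) above the mean of the earliest window."""
    if len(samples) < 2 * window:
        return False  # not enough history to compare two windows
    old = mean(samples[:window])
    new = mean(samples[-window:])
    return (new - old) / old > max_increase


# Hypothetical daily p95 latency in seconds: creeping upward, never crossing 2.0.
p95 = [1.50, 1.52, 1.49, 1.51, 1.53,
       1.60, 1.66, 1.71, 1.75, 1.80,
       1.84, 1.88, 1.90, 1.93, 1.95]

print(static_alert(p95, threshold=2.0))  # False: the 2-second alert stays silent
print(degradation_alert(p95))            # True: p95 has risen about 26% over the period
```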

There is a second trap inside the same illusion: metrics age.

The dashboard can stay green because the system is healthy, or because the dashboard still measures what mattered six months ago. A threshold that was useful before a traffic shift, architecture change, or product change can quietly become theater. The graph still moves. The signal is gone.

This is why stale observability is not neutral. It actively creates false confidence.

Illusion 4: The confidence of the untested

Your system has not failed in 18 months.

You have 99.99% uptime in production.

Therefore, your system is rock solid.

What might actually be true:

You have not had a major incident in 18 months because you have not changed anything
Traffic has been below the failure threshold
The right failure mode just has not triggered yet
You have been lucky

The most reliable indicator of future reliability is not past uptime. It is past incidents, and what was done about them.

Systems that have had incidents, debugged them, and fixed them are usually more reliable than systems that have never failed.

Why? Because they have found and fixed failure modes. Systems that have never failed are just untested failure modes waiting to happen.

The untested failure mode

Your system handles partial failures gracefully.

You know this because:

You read the code
You did a code review
It looks right

You have never tested it because that would require:

Breaking your database connection
Running production traffic
Verifying the system fell back correctly
Verifying the fallback did not cause other failures

So you have not tested it. You have just assumed it works.

When a real partial failure happens, it turns out:

The fallback code has a bug
The fallback path was not deployed to all instances
The fallback creates a different problem (cascade failures)

You now have a major incident.
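
Exercising a fallback path does not have to start in production. The sketch below injects a database failure into a tiny in-memory stand-in and checks that the fallback behaves as assumed. `FakeDatabase`, `UserService`, and the "stale cache" fallback are all hypothetical; they illustrate the shape of such a test, not any particular system's design.

```python
class FakeDatabase:
    """Stand-in for a real database client that can be told to fail."""

    def __init__(self):
        self.healthy = True
        self.rows = {"user:1": "Ada"}

    def get(self, key):
        if not self.healthy:
            raise ConnectionError("database unreachable (injected fault)")
        return self.rows[key]


class UserService:
    """Reads from the database and falls back to a last-known-good cache."""

    def __init__(self, db):
        self.db = db
        self.cache = {}

    def get_user(self, key):
        try:
            value = self.db.get(key)
            self.cache[key] = value               # remember the last good answer
            return value, "fresh"
        except ConnectionError:
            if key in self.cache:
                return self.cache[key], "stale"   # degraded but usable
            raise                                 # nothing to fall back to


def test_fallback_when_database_fails():
    db = FakeDatabase()
    service = UserService(db)

    assert service.get_user("user:1") == ("Ada", "fresh")   # warm the cache

    db.healthy = False                                      # break the connection
    assert service.get_user("user:1") == ("Ada", "stale")   # fallback path runs

    db.rows["user:2"] = "Grace"
    try:
        service.get_user("user:2")                          # never cached
    except ConnectionError:
        pass                                                # expected: fail loudly, no wrong answer
    else:
        raise AssertionError("expected ConnectionError for an uncached key")


test_fallback_when_database_fails()
print("fallback path exercised")
```

A test like this only covers the first item in the list above (a bug in the fallback code). Deployment gaps and cascade effects still need experiments against real environments.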

Illusion 5: The confidence of explicit knowledge

You know exactly how your system works.

You have:

Architecture documentation
Runbooks
Disaster recovery procedures
Team knowledge

What you might not know:

How three different teams coordinate during an incident
What happens when both the database and the cache fail (are they in the same data center?)
Whether that service you depend on will route traffic to a degraded instance or reject entirely
Whether your rollback procedure actually works for the last 10 releases

The gap between “we have documented this” and “we actually know this” is huge.

Most teams have documented their system and never tested the documentation against reality.

Illusion 6: The efficiency that causes fragility

You have been running lean.

You eliminated:

Redundant services (too expensive)
Failover capacity (we can just scale up)
Regional redundancy (DNS failover is fast enough)
Standing observation (automated monitoring catches everything)

Your system is efficient.

It is also brittle.

Efficiency and resilience are in tension:

Efficiency: minimize resources, eliminate waste, optimize utilization
Resilience: build redundancy, maintain fallbacks, keep extra capacity

You cannot have both at full strength simultaneously.

Most teams optimize for efficiency in peacetime and get destroyed when a failure comes.

What you should be doing

1. Measure usefulness, not just availability

Your SLA should include:

Response time (p95, p99)
Correctness (data verification)
Cascade detection (did one failure cause others?)
Partial failure behavior (some users down vs all users down)

Not just: “did we respond?”

2. Test your DR plan quarterly

Not tabletop. Full failover.

Fail over to DR environment
Run production traffic against DR
Verify nothing breaks
Fail back to primary
Verify everything still works

Document what broke. Fix it. Test again next quarter.
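
A drill is easier to repeat if the steps are scripted and timed. The following is a minimal sketch of such a harness, assuming each step can be wrapped in a function that returns True on success. The step functions here are placeholders (plain lambdas), the 30-minute target is an example, and `run_drill` is a hypothetical name.

```python
import time


def run_drill(steps, target_seconds):
    """Run recovery steps in order, timing each one; stop at the first failure.

    `steps` is a list of (name, callable) pairs. Each callable returns True on success.
    """
    results = []
    started = time.monotonic()
    for name, action in steps:
        step_start = time.monotonic()
        try:
            ok = bool(action())
        except Exception as exc:
            ok = False
            name = f"{name} ({exc})"   # keep the error next to the step that raised it
        results.append((name, ok, time.monotonic() - step_start))
        if not ok:
            break                      # later steps are not attempted
    total = time.monotonic() - started
    completed = len(results) == len(steps) and all(ok for _, ok, _ in results)
    return {
        "completed": completed,
        "within_target": completed and total <= target_seconds,
        "total_seconds": total,
        "steps": results,
    }


# Placeholder steps; in a real drill each lambda would call actual tooling.
steps = [
    ("fail over to DR environment", lambda: True),
    ("run production traffic against DR", lambda: True),
    ("verify nothing breaks", lambda: False),          # simulated failure
    ("fail back to primary", lambda: True),
    ("verify everything still works", lambda: True),
]

report = run_drill(steps, target_seconds=30 * 60)
for name, ok, seconds in report["steps"]:
    print(f"{'ok  ' if ok else 'FAIL'} {name} ({seconds:.1f}s)")
print("completed:", report["completed"])
```

The report records where the drill stopped and how long each step took, which is the material for "document what broke".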

3. Actively tune alert thresholds

Once a quarter:

Look at your alert history
Calculate false positive rate
For alerts with > 10% false positive rate, retune or remove
For critical alerts with zero firings in 3 months, investigate why (is the failure not happening or is the alert broken?)

The goal is not zero false positives. It is to understand which alerts matter.
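
The quarterly review above can be sketched as a small script over alert history. The `AlertStats` record and the example alerts are hypothetical, and it assumes someone has already labelled each firing as actionable or not, which is usually the hard part.

```python
from dataclasses import dataclass


@dataclass
class AlertStats:
    name: str
    critical: bool
    firings: int           # times the alert fired in the review window
    false_positives: int   # firings judged not actionable


def review(alerts, max_false_positive_rate=0.10):
    """Map each alert name to the action suggested by the review rules."""
    actions = {}
    for a in alerts:
        if a.firings == 0:
            # Silence is ambiguous for critical alerts: healthy system or broken alert?
            actions[a.name] = "investigate: never fired" if a.critical else "keep"
            continue
        rate = a.false_positives / a.firings
        if rate > max_false_positive_rate:
            actions[a.name] = f"retune or remove ({rate:.0%} false positives)"
        else:
            actions[a.name] = "keep"
    return actions


# Hypothetical three months of history.
history = [
    AlertStats("disk-usage-warning", critical=False, firings=50, false_positives=45),
    AlertStats("checkout-error-rate", critical=True, firings=4, false_positives=0),
    AlertStats("database-unreachable", critical=True, firings=0, false_positives=0),
]

for name, action in review(history).items():
    print(f"{name}: {action}")
```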

4. Run chaos engineering to find untested failure modes

Regularly:

Fail the database and watch what happens
Fail the cache and watch what happens
Fail a regional provider and watch what happens
Slow down a critical dependency and watch what happens
Drop 5% of network packets and watch what happens

Document what breaks. Fix it. Run the experiment again in 6 months.
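
As a toy version of the "drop 5% of network packets" experiment, the sketch below wraps a call so that a fraction of invocations fail, then measures how a caller with and without a retry copes. It is an in-process simulation with made-up names (`with_drops`, `call_with_retry`), not a substitute for injecting faults at the network layer with real chaos tooling.

```python
import random


class DroppedCall(Exception):
    """Raised when the injected fault drops a call."""


def with_drops(func, drop_rate, rng):
    """Wrap func so that roughly `drop_rate` of calls fail, imitating packet loss."""
    def wrapper(*args, **kwargs):
        if rng.random() < drop_rate:
            raise DroppedCall("injected drop")
        return func(*args, **kwargs)
    return wrapper


def call_with_retry(func, attempts):
    """Try func up to `attempts` times; return True if any attempt succeeds."""
    for _ in range(attempts):
        try:
            func()
            return True
        except DroppedCall:
            continue
    return False


def success_rate(attempts, drop_rate=0.05, calls=10_000, seed=7):
    """Fraction of logical calls that succeed under injected drops."""
    rng = random.Random(seed)   # seeded so the experiment is repeatable
    flaky = with_drops(lambda: "ok", drop_rate, rng)
    succeeded = sum(call_with_retry(flaky, attempts) for _ in range(calls))
    return succeeded / calls


print(f"no retry:  {success_rate(attempts=1):.2%}")
print(f"one retry: {success_rate(attempts=2):.2%}")
```

The interesting result is rarely the success rate itself. It is what else changes: retries add load, and that extra load is one of the ways a small fault turns into a cascade.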

5. Explicitly model the efficiency vs resilience tradeoff

For each system, decide:

Are we optimizing for cost?
Or are we optimizing for reliability?

You cannot have both. Make it explicit:

Tier 1 (mission-critical): we will pay for redundancy
Tier 2 (standard): we will trade some efficiency for resilience
Tier 3 (non-critical): we will optimize for cost

Then architect accordingly. Do not end up with Tier 3 efficiency and Tier 1 SLA.
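
Making the tiers explicit also makes mismatches checkable. The sketch below compares what a service promises with what its tier is built to support. The policy numbers (replica counts, maximum SLA per tier) are purely illustrative assumptions, as are the field names; the point is only that the check can be mechanical.

```python
# Hypothetical minimum requirements per tier; the numbers are illustrative only.
TIER_POLICY = {
    1: {"min_replicas": 3, "multi_region": True,  "max_sla": 0.9999},
    2: {"min_replicas": 2, "multi_region": False, "max_sla": 0.999},
    3: {"min_replicas": 1, "multi_region": False, "max_sla": 0.99},
}


def check_service(name, tier, replicas, multi_region, promised_sla):
    """Return a list of mismatches between what is promised and what is built."""
    policy = TIER_POLICY[tier]
    problems = []
    if replicas < policy["min_replicas"]:
        problems.append(
            f"{name}: {replicas} replica(s), tier {tier} needs {policy['min_replicas']}"
        )
    if policy["multi_region"] and not multi_region:
        problems.append(f"{name}: tier {tier} requires multi-region")
    if promised_sla > policy["max_sla"]:
        problems.append(
            f"{name}: promises {promised_sla:.2%}, "
            f"tier {tier} supports at most {policy['max_sla']:.2%}"
        )
    return problems


# A Tier 3 build carrying a Tier 1 promise.
for problem in check_service("reports", tier=3, replicas=1,
                             multi_region=False, promised_sla=0.9999):
    print(problem)
# reports: promises 99.99%, tier 3 supports at most 99.00%
```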

6. Know what you do not know

For each critical system:

List 5 things you are not sure about
Plan to test them
Schedule a day to do it

This is not an audit. It is a structured discovery of unknown unknowns.

7. Define reliability in terms your business understands

Do not tell the business: “We have 99.9% uptime.”

Tell them: “Customers can complete transactions 99.9% of the time.”

Or: “Customers see current data 99.8% of the time.”

Or: “User queries return in < 2 seconds 99% of the time.”

These metrics are related to SLA, but they are in business terms.

The uncomfortable truth

The teams that have the most confidence in their reliability are usually the ones who are most vulnerable.

They have not had an incident recently. They think their system is solid.

What they do not know about are the failure modes they have never tested.

The day someone finds one of those, their confidence becomes a liability.

Key architecture principle

Confidence is the absence of testing, not the result of it.

Teams that maintain high reliability do not do so by assuming their systems work. They do so by actively finding and fixing failures before they become problems.

The team that says “we are so reliable we never test” is the team that is about to have a catastrophic outage.

Chapter index

Chapter	Topic
Chapter 1	Opening thesis: reliability as economic decision
Chapter 2	Incentives and organizational failure
Chapter 3	The things that actually break
Shared Responsibility	Shared responsibility and accountability vacuum
Chapter 4	The financial model
Chapter 5	Provider failures and status page reality
Chapter 6	Partial failures and degraded-state design
Chapter 5 (Alt)	Identity as a Tier-0 failure domain
Chapter 6 (Alt)	Silent outages and data corruption
Chapter 7 (Alt)	Reliability illusions and false confidence
Chapter 7	Hidden cost of observability tooling
Chapter 8	Trade-offs: on-call, FinOps, and human cost
Chapter 9	Governance system
Chapter 10	Execution and the next quarter
Chapter 12	Reliability pricing and the SaaS margin trap
Appendix	Operating artifacts and policy templates
Chapter 13	Maturity and organizational adoption

I work at Microsoft. The views expressed here are my own and based solely on publicly available information. This content is for educational purposes and does not represent official Microsoft guidance or commitments.