Your team has never had a major outage.
Therefore, your system is reliable.
That is not logic. It is survivorship bias.
This chapter is about the illusions that create false confidence, because you are most vulnerable when you feel most confident.
Illusion 1: The SLA theater
Your provider offers a 99.9% SLA. You promise a 99.5% SLA. You are good.
What you are actually measuring:
- Availability = “is the system responding?”
- Uptime = “did we hit our SLA?”
What you are not measuring:
- Correctness (is the data right?)
- Latency (are responses fast enough?)
- Cascade failures (when one service fails, how many go down?)
- Partial failures (some users down, others up)
Your SLA can be perfect while your customers experience continuous problems.
Real example:
- Your SLA: “API responds in < 5 seconds”
- Your provider SLA: “99.9% availability”
- Your system is up 99.9% of the time
- But when a downstream service is slow, your API times out after 5 seconds
- You return a 503; the latency target is missed and the customer experience is bad
- But your availability metrics show you are meeting the SLA, because you are returning a response (albeit an error)
The theater is not lying. It is measuring something that does not correlate with reliability.
The SLA measurement problem
Most SLAs measure:
- “Did the service respond?”
They do not measure:
- “Did the response contain correct data?”
- “Was the response fast enough to be useful?”
- “Did the response trigger a cascade failure elsewhere?”
You can be 99.9% “available” while being 50% useful.
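The gap between "responded" and "useful" can be made concrete. The sketch below is illustrative only: the `Request` record, its `correct` flag (standing in for whatever data verification you run), the 2-second latency budget, and the synthetic traffic are all assumptions, not something a particular monitoring tool provides.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Request:
    status: Optional[int]  # HTTP status code, or None if no response arrived
    latency_s: float       # observed response time in seconds
    correct: bool          # hypothetical result of a data-correctness check


def availability(requests: Sequence[Request]) -> float:
    """Fraction of requests that received any response, errors included."""
    return sum(r.status is not None for r in requests) / len(requests)


def usefulness(requests: Sequence[Request], latency_budget_s: float = 2.0) -> float:
    """Fraction of requests that succeeded, were fast enough, and were correct."""
    useful = sum(
        r.status is not None
        and 200 <= r.status < 300
        and r.latency_s <= latency_budget_s
        and r.correct
        for r in requests
    )
    return useful / len(requests)


# Synthetic traffic: 500 good responses, 499 timeouts answered with 503, 1 dropped.
sample = (
    [Request(200, 0.3, True)] * 500
    + [Request(503, 5.0, False)] * 499
    + [Request(None, 5.0, False)]
)

print(f"availability: {availability(sample):.1%}")  # 99.9%
print(f"usefulness:   {usefulness(sample):.1%}")    # 50.0%
```

Both numbers come from the same traffic. Only the definition of "success" changed.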
The SLA ceiling problem
Your SLA promises 99.9% uptime. Your infrastructure provider has the same SLA.
Combined: 99.9% * 99.9% = 99.8001%.
Your promised SLA is higher than what is mathematically achievable if every request depends on that provider.
The rescue fantasy: “We will have redundancy.”
Great. You have two providers, both 99.9%.
If they are independent: 1 - (0.001 * 0.001) = 0.999999, or 99.9999%.
They are not independent. They are both in the cloud. They are both using the same DNS provider. They are both relying on the same BGP routes. During the big outage, both go down together.
You thought you had redundancy. You had correlation you did not measure.
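The arithmetic above can be checked with a few lines of Python. This is a minimal sketch under simplifying assumptions: availabilities are treated as plain probabilities, and correlation is modeled in the crudest possible way, as one shared dependency (say, a common DNS provider) sitting in series with the redundant pair. The function names are made up for illustration.

```python
def serial(*availabilities: float) -> float:
    """Everything must be up: multiply the availabilities."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel_independent(*availabilities: float) -> float:
    """At least one must be up, assuming failures are fully independent."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


def parallel_with_shared_dependency(shared: float, *availabilities: float) -> float:
    """Redundant providers that all sit behind one shared dependency."""
    return serial(shared, parallel_independent(*availabilities))


print(f"{serial(0.999, 0.999):.6f}")                # 0.998001
print(f"{parallel_independent(0.999, 0.999):.6f}")  # 0.999999
print(f"{parallel_with_shared_dependency(0.999, 0.999, 0.999):.6f}")  # 0.998999
```

Under this model, the shared dependency caps the result: two "independent" providers behind one 99.9% dependency come out slightly below 99.9%, not at 99.9999%.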
Illusion 2: The untested recovery
You have a disaster recovery plan.
You have never actually run the full recovery.
You have done “tabletop exercises” where you read through the steps. You have not done a full “production failover” where you actually switch to the DR environment while keeping production running.
Why this matters:
- Recovery steps are outdated (people who wrote them left the company)
- Recovery steps are incomplete (one crucial dependency was forgotten)
- Recovery steps take 3 hours instead of the 30 minutes you planned
- Recovery steps fail at step 7 of 15 and nobody knows what to do next
You do not know if your DR plan works until you test it.
You probably will not test it until you need it.
You find out it does not work at that point.
Illusion 3: Alert fatigue masquerading as rigor
You have 200 alerts configured.
You get paged 3 times a week on average.
Therefore, your monitoring is good.
What is actually happening:
- 197 alerts are tuned so loosely that they fire during normal operation
- Notifications arrive constantly, so you stop responding to them
- The 3 real issues slip through while you are dismissing false alerts
Alert fatigue is the state where you have so many alerts that you ignore pages.
The outcome: the one time it matters, you miss it because you have trained yourself to ignore pages.
The monitoring threshold problem
You set alert thresholds to avoid false positives:
- CPU > 80% (alert on high utilization)
- Errors > 1% (alert on high error rate)
- Latency p95 > 2 seconds (alert on slow responses)
These are reasonable-looking thresholds.
However, what if:
- Your system peaks at 75% CPU normally. 80% is actually fine. You never alert.
- Your error rate is 0.8% during normal operation. 1% is not an anomaly, it is Tuesday. You never alert.
- Your latency p95 is 1.8 seconds. 2 seconds is not a problem. You never alert.
Meanwhile:
- At 77% CPU, your system starts dropping requests (not 80%, but 77%)
- At 0.85% error rate, you have a repeating pattern of failures
- At 1.9 second p95, you are one slow query away from cascade failure
Your alerts are tuned for “loud failures” (crash, go offline). They miss “slow degradation” (gradually getting worse until catastrophe).
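One way to catch slow degradation is to compare recent behavior with a baseline instead of a fixed limit. The sketch below is an assumption-laden illustration, not a recommended detector: `drift_alert`, the 5% tolerance, and the sample values are invented for this example, and a real system would need to handle seasonality and noisy baselines.

```python
from statistics import mean


def static_alert(recent, threshold):
    """Classic threshold check: fire only if a recent sample exceeds the limit."""
    return max(recent) > threshold


def drift_alert(baseline, recent, tolerance=0.05):
    """Fire when the recent mean exceeds the baseline mean by more than `tolerance`.

    Assumes a non-zero baseline mean; `tolerance` is a relative fraction.
    """
    base = mean(baseline)
    return (mean(recent) - base) / base > tolerance


# Error rate in percent; the values mirror the 0.8% -> 0.85% example in the text.
baseline_error_rate = [0.80, 0.79, 0.81, 0.80]
recent_error_rate = [0.85, 0.86, 0.84, 0.85]

print(static_alert(recent_error_rate, threshold=1.0))       # False: 1% is never crossed
print(drift_alert(baseline_error_rate, recent_error_rate))  # True: about 6% above baseline
```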
There is a second trap inside the same illusion: metrics age.
The dashboard can stay green because the system is healthy, or because the dashboard still measures what mattered six months ago. A threshold that was useful before a traffic shift, architecture change, or product change can quietly become theater. The graph still moves. The signal is gone.
For this reason, stale observability is not neutral. It actively creates false confidence.
Illusion 4: The confidence of the untested
Your system has not failed in 18 months.
You have 99.99% uptime in production.
Therefore, your system is rock solid.
What might actually be true:
- You have not had a major incident in 18 months because you have not changed anything
- Traffic has been below the failure threshold
- The right failure mode just has not triggered yet
- You have been lucky
The most reliable indicator of future reliability is not past uptime. It is past incidents.
Systems that have had incidents, debugged them, and fixed them are usually more reliable than systems that have never failed.
Why? Because they have found and fixed failure modes. Systems that have never failed are full of untested failure modes waiting to trigger.
The untested failure mode
Your system handles partial failures gracefully.
You know this because:
- You read the code
- You did a code review
- It looks right
You have never tested it because that would require:
- Breaking your database connection
- Running production traffic
- Verifying the system fell back correctly
- Verifying the fallback did not cause other failures
So you have not tested it. You have just assumed it works.
When a real partial failure happens, it turns out:
- The fallback code has a bug
- The fallback path was not deployed to all instances
- The fallback creates a different problem (cascade failures)
You now have a major incident.
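A fallback path can be exercised long before production by injecting the failure on purpose. The following is a toy sketch: `FakeDatabase`, `read_with_fallback`, and the "stale-cache" labeling are hypothetical names, and a real test would run against your actual client and cache rather than a dictionary.

```python
class DatabaseDown(Exception):
    """Raised by the fake database while a failure is being injected."""


class FakeDatabase:
    def __init__(self, rows):
        self.rows = dict(rows)
        self.down = False  # flip to True to simulate a broken connection

    def get(self, key):
        if self.down:
            raise DatabaseDown("injected failure")
        return self.rows[key]


def read_with_fallback(key, db, cache):
    """Return (value, source); fall back to the cache if the database is down."""
    try:
        value = db.get(key)
        cache[key] = value  # keep the cache warm on every successful read
        return value, "primary"
    except DatabaseDown:
        if key in cache:
            return cache[key], "stale-cache"
        return None, "unavailable"


db = FakeDatabase({"u1": "alice", "u2": "bob"})
cache = {}

healthy = read_with_fallback("u1", db, cache)
db.down = True                                   # inject the failure
degraded = read_with_fallback("u1", db, cache)   # was cached before the failure
cold_miss = read_with_fallback("u2", db, cache)  # never cached

print(healthy, degraded, cold_miss)
```

The third case is the one code review tends to miss: the fallback only helps for keys that were read before the failure.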
Illusion 5: The confidence of explicit knowledge
You know exactly how your system works.
You have:
- Architecture documentation
- Runbooks
- Disaster recovery procedures
- Team knowledge
What you might not know:
- How three different teams coordinate during an incident
- What happens when both the database and the cache fail (are they in the same data center?)
- Whether that service you depend on will route traffic to a degraded instance or reject entirely
- Whether your rollback procedure actually works for the last 10 releases
The gap between “we have documented this” and “we actually know this” is huge.
Most teams have documented their system and never tested the documentation against reality.
Illusion 6: The efficiency that causes fragility
You have been running lean.
You eliminated:
- Redundant services (too expensive)
- Failover capacity (we can just scale up)
- Regional redundancy (DNS failover is fast enough)
- Standing observation (automated monitoring catches everything)
Your system is efficient.
It is also brittle.
Efficiency and resilience are in tension:
- Efficiency: minimize resources, eliminate waste, optimize utilization
- Resilience: build redundancy, maintain fallbacks, keep extra capacity
You cannot have both at full strength simultaneously.
Most teams optimize for efficiency in peacetime and get destroyed when a failure comes.
What you should be doing
1. Measure usefulness, not just availability
Your SLA should include:
- Response time (p95, p99)
- Correctness (data verification)
- Cascade detection (did one failure cause others?)
- Partial failure behavior (some users down vs all users down)
Not just: “did we respond?”
2. Test your DR plan quarterly
Not tabletop. Full failover.
- Fail over to DR environment
- Run production traffic against DR
- Verify nothing breaks
- Fail back to primary
- Verify everything still works
Document what broke. Fix it. Test again next quarter.
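A drill is easier to learn from if each step is timed and the first failure is recorded. The harness below is a hedged sketch: `run_drill` and `DrillReport` are hypothetical, and the step callables are placeholders for whatever your real failover tooling does. One step fails on purpose to show what the report looks like.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class DrillReport:
    completed: List[Tuple[str, float]] = field(default_factory=list)  # (step, seconds)
    failed_step: Optional[str] = None
    error: Optional[str] = None

    @property
    def total_seconds(self) -> float:
        return sum(seconds for _, seconds in self.completed)


def run_drill(steps: List[Tuple[str, Callable[[], None]]]) -> DrillReport:
    """Run steps in order, timing each one and stopping at the first failure."""
    report = DrillReport()
    for name, action in steps:
        started = time.monotonic()
        try:
            action()
        except Exception as exc:  # record what broke instead of crashing the drill
            report.failed_step = name
            report.error = repr(exc)
            break
        report.completed.append((name, time.monotonic() - started))
    return report


def verify_dr_health() -> None:
    # Placeholder check that fails on purpose for this demonstration.
    raise RuntimeError("replica is 40 minutes behind primary")


steps = [
    ("fail over to DR", lambda: None),         # placeholder for the real action
    ("run production traffic", lambda: None),  # placeholder
    ("verify nothing breaks", verify_dr_health),
    ("fail back to primary", lambda: None),    # never reached in this run
]

report = run_drill(steps)
print("failed at:", report.failed_step, report.error)
print("completed:", [name for name, _ in report.completed])
```

Comparing `total_seconds` across quarters shows whether the 30-minute estimate in the plan is still realistic.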
3. Actively tune alert thresholds
Once a quarter:
- Look at your alert history
- Calculate false positive rate
- For alerts with > 10% false positive rate, retune or remove
- For critical alerts with zero firings in 3 months, investigate why (is the failure not happening or is the alert broken?)
The goal is not zero false positives. It is to understand which alerts matter.
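The quarterly review can be partly automated. This sketch assumes you can export alert history as `(alert_name, was_actionable)` pairs and that you keep a list of alerts you consider critical. Both inputs, and the `review_alerts` function itself, are hypothetical.

```python
from collections import defaultdict


def review_alerts(firings, critical_alerts, max_fp_rate=0.10):
    """Return (alerts to retune, critical alerts that never fired).

    firings: iterable of (alert_name, was_actionable) pairs from alert history.
    critical_alerts: names of alerts considered critical.
    """
    counts = defaultdict(lambda: [0, 0])  # name -> [total firings, false positives]
    for name, was_actionable in firings:
        counts[name][0] += 1
        if not was_actionable:
            counts[name][1] += 1

    retune = sorted(
        name for name, (total, false_pos) in counts.items()
        if false_pos / total > max_fp_rate
    )
    silent_critical = sorted(name for name in critical_alerts if name not in counts)
    return retune, silent_critical


# Made-up history for one quarter.
history = (
    [("cpu_high", False)] * 9 + [("cpu_high", True)]  # 90% false positives
    + [("error_rate", True)] * 10                     # always actionable
)
retune, silent_critical = review_alerts(history, {"error_rate", "disk_full"})

print("retune or remove:", retune)                     # ['cpu_high']
print("critical but never fired:", silent_critical)    # ['disk_full']
```

A silent critical alert is not proof of a problem. It is a prompt to check whether the alert can still fire at all.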
4. Run chaos engineering to find untested failure modes
Regularly:
- Fail the database and watch what happens
- Fail the cache and watch what happens
- Fail a regional provider and watch what happens
- Slow down a critical dependency and watch what happens
- Drop 5% of network packets and watch what happens
Document what breaks. Fix it. Run the experiment again in 6 months.
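At the smallest scale, a fault-injection experiment is a wrapper around a dependency call. The sketch below is not a chaos engineering tool. `with_faults`, `InjectedFault`, and `lookup_price` are invented for illustration, and the injection happens in-process rather than at the network layer.

```python
import random
import time


class InjectedFault(Exception):
    """Raised when the experiment deliberately drops a call."""


def with_faults(func, drop_rate=0.05, delay_s=0.0, rng=None, sleep=time.sleep):
    """Wrap `func` so calls are delayed by `delay_s` and dropped with `drop_rate`."""
    if rng is None:
        rng = random.Random()

    def wrapper(*args, **kwargs):
        if delay_s > 0:
            sleep(delay_s)  # simulate a slow dependency
        if rng.random() < drop_rate:
            raise InjectedFault(f"dropped call to {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def lookup_price(sku):
    # Stand-in for a real call to a critical dependency.
    return {"sku": sku, "price_cents": 1999}


# Drop roughly 5% of calls; a seeded generator keeps the experiment repeatable.
flaky_lookup = with_faults(lookup_price, drop_rate=0.05, rng=random.Random(42))

outcomes = {"ok": 0, "dropped": 0}
for _ in range(1000):
    try:
        flaky_lookup("A-1")
        outcomes["ok"] += 1
    except InjectedFault:
        outcomes["dropped"] += 1

print(outcomes)
```

The interesting part is not the wrapper but what the caller does with `InjectedFault`: retry, fall back, or fail the whole request.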
5. Explicitly model the efficiency vs resilience tradeoff
For each system, decide:
- Are we optimizing for cost?
- Or are we optimizing for reliability?
You cannot have both. Make it explicit:
- Tier 1 (mission-critical): we will pay for redundancy
- Tier 2 (standard): we will trade some efficiency for resilience
- Tier 3 (non-critical): we will optimize for cost
Then architect accordingly. Do not end up with Tier 3 efficiency and Tier 1 SLA.
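Making the tradeoff explicit also makes it checkable. In the sketch below, the tier policy values (SLA ceilings and replica counts) are placeholders chosen for illustration, not recommendations, and `Service` and `violations` are hypothetical names.

```python
from dataclasses import dataclass

# Illustrative policy only: these ceilings and replica counts are assumptions.
TIER_POLICY = {
    1: {"max_sla": 0.9999, "min_replicas": 3},
    2: {"max_sla": 0.999, "min_replicas": 2},
    3: {"max_sla": 0.99, "min_replicas": 1},
}


@dataclass(frozen=True)
class Service:
    name: str
    tier: int
    promised_sla: float
    replicas: int


def violations(service: Service) -> list:
    """List the ways a service's promises exceed what its tier pays for."""
    policy = TIER_POLICY[service.tier]
    problems = []
    if service.promised_sla > policy["max_sla"]:
        problems.append(
            f"{service.name}: promises {service.promised_sla:.2%} "
            f"but tier {service.tier} supports at most {policy['max_sla']:.2%}"
        )
    if service.replicas < policy["min_replicas"]:
        problems.append(
            f"{service.name}: runs {service.replicas} replica(s) "
            f"but tier {service.tier} requires {policy['min_replicas']}"
        )
    return problems


checkout = Service("checkout", tier=1, promised_sla=0.9999, replicas=3)
reports = Service("reports", tier=3, promised_sla=0.9999, replicas=1)

for svc in (checkout, reports):
    print(svc.name, violations(svc) or "consistent")
```

Here `reports` is the mismatch the text warns about: Tier 3 spending with a Tier 1 promise.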
6. Know what you do not know
For each critical system:
- List 5 things you are not sure about
- Plan to test them
- Schedule a day to do it
This is not an audit. It is a structured discovery of unknown unknowns.
7. Define reliability in terms your business understands
Do not tell the business: “We have 99.9% uptime.”
Tell them: “Customers can complete transactions 99.9% of the time.”
Or: “Customers see current data 99.8% of the time.”
Or: “User queries return in < 2 seconds 99% of the time.”
These metrics relate to the SLA, but they are expressed in business terms.
The uncomfortable truth
The teams that have the most confidence in their reliability are usually the ones who are most vulnerable.
They have not had an incident recently. They think their system is solid.
What they do not know about are the failure modes they have never tested.
The day someone finds one of those, their confidence becomes a liability.
Key architecture principle
Confidence is the absence of testing, not the result of it.
Teams that maintain high reliability do not do so by assuming their systems work. They do so by actively finding and fixing failures before they become problems.
The team that says “we are so reliable we never test” is the team that is about to have a catastrophic outage.
I work at Microsoft. The views expressed here are my own and based solely on publicly available information. This content is for educational purposes and does not represent official Microsoft guidance or commitments.