2026-06-06

The Reliability Survival Guide - Complete Table of Contents

The complete structure of the Reliability Survival Guide. Choose your reading path based on your role, or read sequentially for the full philosophy.

Tags: cloud-architecture, reliability, guide

The Reliability Survival Guide

A philosophy for building systems that survive.

This guide contains everything you need to understand reliability economics, design for failure, respond to incidents, and build operational excellence.

Book overview: Reliability Survival Guide


Quick Start: Choose Your Path

Not sure where to begin? Pick your role:

Role | Start Here | Time
On-Call Engineer | Appendix A: Crisis Cards, Appendix C: Playbooks, Chapter 0 | 2-3 hrs
Architect | Chapter 1, Chapter 3, Chapters 5-7 | 4-5 hrs
CTO / Leader | Chapter 1, Chapter 2, Chapter 9 | 5-6 hrs
SRE / Platform | Full reading path | 6-7 hrs
Team Lead | Full reading path | 3-4 hrs

→ See Reading Paths for detailed role-based guides.


PART 1: The Economics Foundation

Why you invest in reliability at all.

Chapter 1: Reliability is an Economic Decision

Start here. Reliability is not virtuous or aspirational. It is an investment with a clear cost-benefit analysis. This chapter shows you how to run that analysis.

  • Why reliability costs money
  • How downtime loses money
  • The break-even threshold
  • When to invest, when to skip
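As a rough illustration of the break-even idea listed above, the sketch below compares the yearly cost of a reliability investment with the downtime loss it is expected to prevent. The function names and all figures are hypothetical and not taken from the chapter; it assumes a flat loss per hour of outage, which real systems rarely have.

```python
def avoided_loss_per_year(outage_hours_avoided: float, loss_per_hour: float) -> float:
    """Downtime cost the investment is expected to prevent each year."""
    return outage_hours_avoided * loss_per_hour


def break_even_hours(investment_per_year: float, loss_per_hour: float) -> float:
    """Hours of outage per year the investment must prevent to pay for itself."""
    if loss_per_hour <= 0:
        raise ValueError("loss_per_hour must be positive")
    return investment_per_year / loss_per_hour


def should_invest(
    investment_per_year: float,
    outage_hours_avoided: float,
    loss_per_hour: float,
) -> bool:
    """Invest only when the avoided loss exceeds the yearly cost."""
    avoided = avoided_loss_per_year(outage_hours_avoided, loss_per_hour)
    return avoided > investment_per_year


if __name__ == "__main__":
    # Hypothetical inputs: 120,000 per year for redundancy,
    # 20,000 lost per hour of outage.
    print(break_even_hours(120_000, 20_000))
    print(should_invest(120_000, outage_hours_avoided=8, loss_per_hour=20_000))
    print(should_invest(120_000, outage_hours_avoided=4, loss_per_hour=20_000))
```

With these made-up inputs the investment pays off only if it prevents more than six hours of outage per year, which is the "invest or skip" question in its simplest form.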

Chapter 2: Systems Fail According to Incentives

Your team is not building unreliable systems. Your incentives are rewarding unreliable systems. This chapter reveals the structural problems.

  • How incentive misalignment breeds failure
  • Common organizational anti-patterns
  • What to measure
  • How to align for reliability

Chapter 4: The Reliability Equation—A Financial Model

Put numbers to reliability. Calculate ROI.

  • Revenue per minute of uptime
  • Cost of different SLAs
  • Multi-region economics
  • When to take the bet
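The topics above can be made concrete with a small calculation. This is a minimal sketch, not the chapter's model: it assumes revenue is spread evenly over a 365-day year and that each minute of downtime loses a full minute of revenue. The function names and the example revenue figure are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def revenue_per_minute(annual_revenue: float) -> float:
    """Average revenue earned per minute of uptime (even-spread assumption)."""
    return annual_revenue / MINUTES_PER_YEAR


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Downtime budget per year implied by an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def expected_downtime_cost(availability_pct: float, annual_revenue: float) -> float:
    """Revenue at risk per year if the whole downtime budget is consumed."""
    return allowed_downtime_minutes(availability_pct) * revenue_per_minute(annual_revenue)


def upgrade_roi(
    from_pct: float, to_pct: float, annual_revenue: float, upgrade_cost: float
) -> float:
    """Simple ROI of moving to a stricter target: (avoided loss - cost) / cost."""
    avoided = expected_downtime_cost(from_pct, annual_revenue) - expected_downtime_cost(
        to_pct, annual_revenue
    )
    return (avoided - upgrade_cost) / upgrade_cost


if __name__ == "__main__":
    revenue = 52_560_000  # hypothetical: works out to 100 per minute
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        cost = expected_downtime_cost(target, revenue)
        print(f"{target}%: {minutes:.2f} min/year, {cost:.0f} at risk")
```

A negative `upgrade_roi` is the numeric form of "do not take the bet": the stricter target costs more than the downtime it removes.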

PART 2: What Breaks and Why

The failure modes you will actually encounter.

Chapter 3: The Things That Actually Break

Not theories. Real incidents. Real outages.

  • Silent failures
  • Cascade failures
  • Hidden dependencies
  • Provider failures

Chapter 5b: Identity - The System Kill Switch

Identity failures disable everything downstream. Most teams miss identity as a primary failure domain.

  • Why identity is Tier-0
  • Token refresh cascades
  • Fallback architectures
  • Detection and monitoring
  • 2-hour degraded mode implementation

Chapter 6b: Silent Outages - When Data Corruption Looks Like Success

Your system returns 200 OK. Your data is corrupted.

  • Six silent failure patterns
  • 50+ detection queries (PostgreSQL, MySQL, DynamoDB, Redis, SQS)
  • Consistency verification
  • Real incident examples
  • Monitoring frequency models (Tier 1/2/3)
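To give a flavor of what a detection query for a silent failure might look like, here is a minimal sketch of one consistency check: finding orders whose stored total no longer matches the sum of their line items. It is not one of the chapter's queries. The `orders` and `order_items` tables are hypothetical, and an in-memory SQLite database stands in for the production stores named above so the example runs anywhere.

```python
import sqlite3

# Orders whose stored total disagrees with the total recomputed from line items.
DETECTION_QUERY = """
SELECT o.id,
       o.total_cents,
       COALESCE(SUM(i.price_cents * i.quantity), 0) AS computed_cents
FROM orders AS o
LEFT JOIN order_items AS i ON i.order_id = o.id
GROUP BY o.id, o.total_cents
HAVING o.total_cents <> COALESCE(SUM(i.price_cents * i.quantity), 0)
ORDER BY o.id
"""


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny hypothetical schema with one healthy and two corrupted orders."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, total_cents INTEGER NOT NULL);
        CREATE TABLE order_items (
            order_id INTEGER NOT NULL REFERENCES orders(id),
            price_cents INTEGER NOT NULL,
            quantity INTEGER NOT NULL
        );
        INSERT INTO orders VALUES (1, 3000), (2, 5000), (3, 700);
        INSERT INTO order_items VALUES (1, 1000, 1), (1, 1000, 2), (2, 2000, 1);
        """
    )
    return conn


def find_mismatched_orders(conn: sqlite3.Connection) -> list[tuple[int, int, int]]:
    """Return (order id, stored total, recomputed total) for inconsistent orders."""
    return conn.execute(DETECTION_QUERY).fetchall()


if __name__ == "__main__":
    for order_id, stored, computed in find_mismatched_orders(build_demo_db()):
        print(f"order {order_id}: stored={stored} computed={computed}")
```

Every write in this demo would have returned success, yet orders 2 and 3 are wrong. A check like this only helps if it runs on a schedule and alerts when the result set is not empty.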

Chapter 7a: How You See (and Miss) Reality

Your monitoring is incomplete. Here is why and what to do.

  • Observability vs. reliability
  • Common blind spots
  • Metric design patterns
  • Alarm thresholds

Chapter 7b: Change - The Failure You Deploy Yourself

Most outages are caused by change. Deployments, config updates, migrations. Design for it.

  • Safe deployment checklist (3-phase, 40+ checks)
  • Rollback decision matrix (by deployment age vs. metrics)
  • Deployment kill switches (blue-green, canary, rolling, feature flag)
  • 4 deployment patterns with examples
  • Detection metrics that matter
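A rollback decision matrix keyed on deployment age and metric health could be encoded roughly as follows. This is a sketch under stated assumptions, not the chapter's matrix: the age buckets (30 minutes, 24 hours), the three health levels, and the chosen actions are all hypothetical placeholders to be replaced with your own values.

```python
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    CRITICAL = "critical"


class Action(Enum):
    KEEP = "keep the deployment"
    ROLLBACK = "roll back"
    INVESTIGATE = "investigate before acting"
    ROLL_FORWARD = "roll forward with a fix"


def age_bucket(minutes_since_deploy: float) -> str:
    """Classify deployment age using hypothetical thresholds."""
    if minutes_since_deploy < 0:
        raise ValueError("minutes_since_deploy must not be negative")
    if minutes_since_deploy < 30:
        return "fresh"
    if minutes_since_deploy < 24 * 60:
        return "recent"
    return "old"


# Rows are age buckets, columns are metric health (illustrative values only).
# Assumption: the older a deployment, the more state has built on top of it,
# so a blind rollback becomes riskier.
MATRIX = {
    ("fresh", Health.DEGRADED): Action.ROLLBACK,
    ("fresh", Health.CRITICAL): Action.ROLLBACK,
    ("recent", Health.DEGRADED): Action.INVESTIGATE,
    ("recent", Health.CRITICAL): Action.ROLLBACK,
    ("old", Health.DEGRADED): Action.INVESTIGATE,
    ("old", Health.CRITICAL): Action.ROLL_FORWARD,
}


def decide(minutes_since_deploy: float, health: Health) -> Action:
    """Look up the action for a deployment of the given age and health."""
    if health is Health.HEALTHY:
        return Action.KEEP
    return MATRIX[(age_bucket(minutes_since_deploy), health)]


if __name__ == "__main__":
    print(decide(10, Health.CRITICAL).value)
    print(decide(3 * 24 * 60, Health.CRITICAL).value)
```

Keeping the matrix as data rather than nested conditionals means it can be reviewed and changed without touching the lookup logic.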

PART 3: Incidents and Crises

When things go wrong, use these immediately.

CRISIS REFERENCE: Chapter 0 – The First 24 Hours

Your North Star for incident response.

Use this when the pager goes off.

  • 2-minute triage tree (yes/no questions)
  • Minutes 0-10 procedures
  • Incident roles and authority (IC, Tech Lead, Comms)
  • Decision matrices (rollback, failover)
  • Escalation procedures
  • Common playbooks
  • Recovery validation checklist

ADVANCED: Chapter 11 – Incident Triage & Response Protocols

Advanced incident response patterns.

  • OODA loop application (Observe → Orient → Decide → Act)
  • Medical triage classification (RED/YELLOW/GREEN/BLACK)
  • Escalation criteria
  • Detailed decision matrices with postmortem examples
  • Role clarity in practice
  • Communication patterns

APPENDIX A: Crisis Reference Cards

Print these. Laminate them. Put them on your desk.

11 one-page quick-reference cards for war room use:

  1. Incident triage tree
  2. Escalation ladder
  3. Deployment decision tree
  4. Service failure triage
  5. Latency triage
  6. Traffic loss diagnosis
  7. Data corruption response
  8. Cascade failure handling
  9-11. Role cards (IC, Tech Lead, Comms)

Plus symptom index and recovery validation checklist.


APPENDIX C: Field Playbooks—Scenario-Specific Response

Five specific failure scenarios with step-by-step procedures.

  1. Identity System Down → degraded mode activation, recovery validation
  2. Database Replication Lag → blockage detection, kill long queries
  3. Cascade Failure → identify root cause, isolate, fix bottom-up
  4. Bad Deployment → rollback decision matrix, execution, validation
  5. Provider Regional Outage → failover decision, DNS cutover, gradual failback

PART 4: Building Operational Excellence

How to make reliability automatic, not aspirational.

Chapter 8: Tradeoffs—On-Call Burden, FinOps, and When to Invest

The trade-offs you have to make.

  • On-call costs
  • FinOps and reliability alignment
  • Sunk cost fallacies
  • When NOT to invest

Chapter 9: Governance, ADRs, and Risk Ledgers

How to make reliability decisions stick.

  • Architecture decision records (ADRs)
  • Risk audit logs
  • Early warning indicators
  • Governance that scales

Chapter 10: Quarterly Execution

How to weave reliability into your sprint planning.

  • Quarterly planning framework
  • Reliability work prioritization
  • Metrics that matter
  • Tracking and review

APPENDIX B: Operational Artifacts & Templates

Copy-paste ready templates for building your operational runbooks.

  1. Service Runbook – Quick facts, overview, diagnosis trees, incident procedures, escalation
  2. Dependency Map – Inbound/outbound services, failure modes, cascades, deployment implications
  3. SPOF Inventory – Current single points of failure, redundancy status, ROI analysis
  4. Silent Failure Detection Checklist – 7 SQL queries, daily schedule, alert conditions
  5. Post-Incident Recovery Validation – 10-min technical, 30-min data integrity, 60-min operational checks
  6. Economics Decision Card – ROI matrix for infrastructure investments
  7. Escalation Contact Card – Print & laminate, all contacts and procedures

READING ORDER BY ROLE

For On-Call Engineers:

  1. Appendix A: Crisis Cards (30 min)
  2. Appendix C: Field Playbooks (1 hour)
  3. Chapter 0: First 24 Hours (1.5 hours)

Total: 2-3 hours to master incident response.


For Architects:

  1. Chapter 1: Economics (1 hour)
  2. Chapter 3: Things That Actually Break (45 min)
  3. Chapter 5b: Identity (1 hour)
  4. Chapter 6b: Silent Outages (1.5 hours)
  5. Chapter 7b: Change (1.5 hours)

Total: about 5 hours 45 minutes to design reliable systems.


For CTOs & Leaders:

  1. Chapter 1: Economics (1 hour)
  2. Chapter 2: Incentives (45 min)
  3. Chapter 4: Financial Model (1 hour)
  4. Chapter 9: Governance (1 hour)
  5. Chapter 10: Quarterly Execution (45 min)

Total: about 4.5 hours to set organizational strategy.


For SRE Leads & Platform Engineers:

Read the full book over 1-2 weeks. Focus on Appendices A/B/C for operational artifacts.


For Team Leads:

  1. Chapter 1: Economics
  2. Chapter 2: Incentives
  3. Chapter 3: Things That Break
  4. Chapter 9: Governance
  5. Chapter 10: Quarterly Execution

Total: 3-4 hours to coach your team.


COMPLETE CHAPTER LIST

# | Title | File | Purpose
0 | The First 24 Hours - Incident Triage | 000036 | Immediate action procedures
1 | Reliability Is an Economic Decision | 000017 | Foundational thesis
2 | Systems Fail According to Incentives | 000019 | Organizational alignment
3 | The Things That Actually Break | 000031 | Real failure modes
3b | Shared Responsibility, Accountability Vacuum | 000020 | Accountability structure
4 | The Reliability Equation | 000021 | Financial modeling
5a | Provider Failures as System Constraints | 000022 | External dependency risk
5b | Identity - The System Kill Switch | 000032 | Tier-0 architecture
6a | Partial Failure and Control Plane Failures | 000023 | Failure behavior
6b | Silent Outages - Data Corruption | 000033 | Detection patterns
7a | How You See (and Miss) Reality | 000034 | Observability blind spots
7b | Change - The Failure You Deploy | 000035 | Safe deployment
7c | The Hidden Cost of Reliability Tooling | 000024 | Tooling economics
8 | Reliability Trade-offs | 000025 | Cost, burnout, and reliability
9 | Reliability Governance | 000026 | Organizational systems
10 | Reliability Execution | 000027 | Planning and tracking
11 | Incident Triage and Response Protocols | 000037 | Advanced procedures
12 | Reliability Pricing and the SaaS Margin Trap | 000029 | Commercial model
13 | Reliability Maturity and Organizational Adoption | 000030 | Adoption strategy

APPENDICES

Name | File | Purpose
A | Crisis Reference Cards | War room quick reference
B | Operational Artifacts | Templates for your services
C | Field Playbooks | Step-by-step procedures

SUPPLEMENTARY GUIDES

Name | Purpose
Reading Paths | Role-based entry points and sequencing
Table of Contents | This document

How to Use This Guide

If you have 30 minutes:

If you have 1 hour:

If you have 3 hours:

  • Follow your role’s reading path (see above)

If you have a week:

  • Read the entire guide sequentially
  • Your reliability thinking will fundamentally change

Getting Started

  1. Pick your role (on-call, architect, CTO, SRE, team lead)
  2. Go to Reading Paths
  3. Follow the suggested order
  4. Share with your team

The Reliability Survival Guide © 2026 Zach Olinski. All rights reserved.