Library
Writing
Long-form essays, operating frameworks, and the full Reliability Survival Guide in reading order.
Book
The Reliability Survival Guide
Core chapters, crisis appendices, glossary, and reading paths grouped in one place.
The Reliability Survival Guide - Complete Table of Contents
The complete structure of the Reliability Survival Guide. Choose your reading path based on your role, or read sequentially for the full philosophy.
Chapter 0: The First 24 Hours—Incident Triage and Immediate Response
When your system is down. Crisis decision trees, role definitions, and the exact sequence that determines whether you recover in minutes or hours. Read this before your pager goes off.
The Reliability Survival Guide
How to keep your systems alive when everything is working against you. A field guide for architects and leaders.
Chapter 2: Systems Fail According to Incentives
Shows why many outages begin in planning cycles, team goals, and ownership splits rather than in infrastructure itself.
Chapter 3: The Things That Actually Break Real Systems
A brutal look at what actually kills production reliability. Not theory. Not best practices. The specific failure modes that leave engineers explaining to executives at 2 AM.
Chapter 3: Shared Responsibility Is an Accountability Vacuum
Explains why the cloud shared responsibility model is operationally weak unless teams explicitly model who owns failure above and below the provider boundary.
Chapter 4: The Reliability Equation and the Financial Model
Defines the reliability stack, the SLO-RTO-RPO-BR model, failure domains, and error budget as a financial construct.
Chapter 5: Provider Failures as System Constraints
Your system is bounded by external systems that fail in predictable ways. This chapter shows why treating provider failures as external constraints—not aberrations—is the foundation of resilient architecture.
Chapter 6: Partial Failure and Control Plane Failures
Shows why degraded behavior is the default cloud state and why control plane impairments create business outages even when workloads look healthy.
Chapter 5: Identity – The System Kill Switch
Identity failures disable everything downstream. Yet most teams treat identity as infrastructure and third-party SLAs as sufficient. This chapter shows why identity must be a primary failure domain with explicit resilience architecture.
Chapter 6: Silent Outages—When Data Corruption Looks Like Success
Not all failures are loud. The most dangerous failures are silent: data corruption, inconsistency, and partial writes that leave your system appearing operational while integrity degrades. This chapter shows how to detect and prevent the failures that do not wake you up at 3 AM.
Chapter 7b: How You See (and Miss) Reality
Before diving deeper into failure domains, this chapter deliberately shakes the reader's confidence. It deconstructs the illusions that produce false assurance: SLA theater, untested recovery, alert fatigue, and the belief that documentation equals knowledge. Each illusion is a structural vulnerability disguised as safety.
Chapter 7: Change – The Failure You Deploy Yourself
Most outages are not caused by infrastructure failures. They are caused by change: deployments, configuration updates, and operational decisions made under pressure. This chapter shows why change is the primary failure domain and how to design for safe change instead of hoping for perfect decisions.
Chapter 7: The Hidden Cost of Reliability Tooling
Explains the observability cost ceiling, redundancy economics, and why second-order costs often limit reliability before architecture does.
Chapter 8: Reliability Trade-offs, FinOps Tension, and On-Call Economics
Explains where reliability decisions fail in practice when spend, human sustainability, and operational capability are forced to compete.
Chapter 9: Reliability Governance, ADRs, Debt, and Leading Indicators
Turns reliability from an aspiration into a governed system using tiering, ADRs, debt ledgers, review triggers, and leading indicators.
Chapter 10: Reliability Execution and the Next Quarter
Closing chapter of the Reliability Survival Guide. Converts doctrine into a reliability week, a quarterly operating plan, and an executive conversation model.
Chapter 11: Incident Triage and Response Protocols
Advanced incident response patterns. How to make decisions under pressure when information is incomplete. Decision frameworks grounded in real postmortems showing what works and what fails.
Appendix: Reliability Operating Artifacts and Policy Templates
Drop-in templates for SLO and SLI specifications, error budget policy, tiering criteria, CUJ measurement, ADRs, debt ledger, provider incident playbook, and board scorecard.
Chapter 12: Reliability Pricing and the SaaS Margin Trap
When the cost of reliability upgrades eats the margin they were meant to protect, the upgrade is not a solution. This chapter provides the break-even churn model, tiered reliability packaging, and the go-to-market conversation required to price reliability without destroying the business.
Chapter 13: Reliability Maturity and Organizational Adoption
A practical playbook for implementing the reliability system in real organizations. Shows how to start, what maturity looks like at each stage, how to handle resistance, and how to sequence investments without organizational chaos.
Appendix A: Crisis Reference Cards
One-page guides for incident response, sized to print and laminate. Print these. Put them in your war room. Use them at 2 AM when your system is failing.
Appendix B: Operational Artifacts and Templates
Copy-paste-ready templates for building your operational runbooks, dependency maps, SPOF inventories, detection queries, and recovery checklists. Use these as starting points for your service.
Appendix C: Field Playbooks – Scenario-Specific Response
Five critical failure scenarios with step-by-step playbooks. When your system fails in a specific way, start here for the exact actions to take in order.
Reading Paths – Start Here Based on Your Role
Not everyone needs to read the entire book. Pick your role. Follow the reading path. Get the insights that matter for your job.
Glossary: Reliability Terms and Definitions
Quick reference for technical terms used throughout the Reliability Survival Guide. Look up unfamiliar words here.