Library

Writing

Long-form essays, operating frameworks, and the full Reliability Survival Guide in reading order.

Book

The Reliability Survival Guide

Core chapters, crisis appendices, glossary, and reading paths grouped in one place.

2026-06-06 BOOK

The Reliability Survival Guide - Complete Table of Contents

The complete structure of the Reliability Survival Guide. Choose your reading path based on your role, or read sequentially for the full philosophy.

2026-06-06 BOOK

Chapter 0: The First 24 Hours—Incident Triage and Immediate Response

For when your system is down: crisis decision trees, role definitions, and the exact sequence that determines whether you recover in minutes or hours. Read this before your pager goes off.

2026-06-06 BOOK

The Reliability Survival Guide

How to keep your systems alive when everything is working against you. A field guide for architects and leaders.

2026-06-06 BOOK

Chapter 2: Systems Fail According to Incentives

Shows why many outages begin in planning cycles, team goals, and ownership splits rather than in infrastructure itself.

2026-06-06 BOOK

Chapter 3: The Things That Actually Break Real Systems

A brutal look at what actually kills production reliability. Not theory. Not best practices. The specific failure modes that leave engineers explaining to executives at 2 AM.

2026-06-06 BOOK

Chapter 3: Shared Responsibility Is an Accountability Vacuum

Explains why the cloud shared responsibility model is operationally weak unless teams explicitly model who owns failure above and below the provider boundary.

2026-06-06 BOOK

Chapter 4: The Reliability Equation and the Financial Model

Defines the reliability stack, the SLO-RTO-RPO-BR model, failure domains, and the error budget as a financial construct.

2026-06-06 BOOK

Chapter 5: Provider Failures as System Constraints

Your system is bounded by external systems that fail in predictable ways. This chapter shows why treating provider failures as external constraints—not aberrations—is the foundation of resilient architecture.

2026-06-06 BOOK

Chapter 6: Partial Failure and Control Plane Failures

Shows why degraded behavior is the default cloud state and why control plane impairments create business outages even when workloads look healthy.

2026-06-06 BOOK

Chapter 5: Identity – The System Kill Switch

Identity failures disable everything downstream. Yet most teams treat identity as infrastructure and third-party SLAs as sufficient. This chapter shows why identity must be a primary failure domain with explicit resilience architecture.

2026-06-06 BOOK

Chapter 6: Silent Outages—When Data Corruption Looks Like Success

Not all failures are loud. The most dangerous failures are silent: data corruption, inconsistency, and partial writes that leave your system appearing operational while integrity degrades. This chapter shows how to detect and prevent the failures that do not wake you up at 3 AM.

2026-06-06 BOOK

Chapter 7b: How You See (and Miss) Reality

Before diving deeper into failure domains, this chapter challenges the reader's confidence. It deconstructs the illusions that create false assurance: SLA theater, untested recovery, alert fatigue, and the belief that documentation equals knowledge. Each illusion is a structural vulnerability disguised as safety.

2026-06-06 BOOK

Chapter 7: Change – The Failure You Deploy Yourself

Most outages are not caused by infrastructure failures. They are caused by change: deployments, configuration updates, and operational decisions made under pressure. This chapter shows why change is the primary failure domain and how to design for safe change instead of hoping for perfect decisions.

2026-06-06 BOOK

Chapter 7: The Hidden Cost of Reliability Tooling

Explains the observability cost ceiling, redundancy economics, and why second-order costs often limit reliability before architecture does.

2026-06-06 BOOK

Chapter 8: Reliability Trade-offs, FinOps Tension, and On-Call Economics

Explains where reliability decisions fail in practice when spend, human sustainability, and operational capability are forced to compete.

2026-06-06 BOOK

Chapter 9: Reliability Governance, ADRs, Debt, and Leading Indicators

Turns reliability from an aspiration into a governed system using tiering, ADRs, debt ledgers, review triggers, and leading indicators.

2026-06-06 BOOK

Chapter 10: Reliability Execution and the Next Quarter

Closing chapter of the Reliability Survival Guide. Converts doctrine into a reliability week, a quarterly operating plan, and an executive conversation model.

2026-06-06 BOOK

Chapter 11: Incident Triage and Response Protocols

Advanced incident response patterns: how to make decisions under pressure when information is incomplete, with decision frameworks grounded in real postmortems that show what works and what fails.

2026-06-06 BOOK

Appendix: Reliability Operating Artifacts and Policy Templates

Drop-in templates for SLO and SLI specifications, error budget policy, tiering criteria, CUJ measurement, ADRs, debt ledger, provider incident playbook, and board scorecard.

2026-06-06 BOOK

Chapter 12: Reliability Pricing and the SaaS Margin Trap

When the cost of reliability upgrades eats the margin they were meant to protect, the upgrade is not a solution. This chapter provides the break-even churn model, tiered reliability packaging, and the go-to-market conversation required to price reliability without destroying the business.

2026-06-06 BOOK

Chapter 13: Reliability Maturity and Organizational Adoption

A practical playbook for implementing the reliability system in real organizations. Shows how to start, what maturity looks like at each stage, how to handle resistance, and how to sequence investments without organizational chaos.

2026-06-06 BOOK

Appendix A: Crisis Reference Cards

One-page incident response guides, ready to laminate. Print these. Put them in your war room. Use them at 2 AM when your system is failing.

2026-06-06 BOOK

Appendix B: Operational Artifacts and Templates

Copy-paste-ready templates for building your operational runbooks, dependency maps, SPOF inventories, detection queries, and recovery checklists. Use these as starting points for your service.

2026-06-06 BOOK

Appendix C: Field Playbooks – Scenario-Specific Response

Five critical failure scenarios with step-by-step playbooks. When your system fails in a specific way, start here for the exact actions to take in order.

2026-06-06 BOOK

Reading Paths – Start Here Based on Your Role

Not everyone needs to read the entire book. Pick your role. Follow the reading path. Get the insights that matter for your job.

2026-06-06 BOOK

Glossary: Reliability Terms and Definitions

Quick reference for technical terms used throughout the Reliability Survival Guide. Look up unfamiliar words here.