Chapter 9: Reliability Governance, ADRs, Debt, and Leading Indicators

If reliability is real, it has to be governable.

Good intentions are not enough. Teams need artifacts that preserve trade-offs, expose deferred risk, and make leadership confront the consequences of what was funded and what was not.

This chapter is the book's operating system: it covers the artifacts that keep reliability decisions recorded, funded, and revisited.

Figure: Reliability governance loop

Tiering is where governance starts

Not every service should receive the same reliability posture.

What matters is whether the business importance, resilience controls, and funding model match.

Tiering does that work. It forces the company to ask which products protect revenue, trust, or operations and which protect only convenience, and then to fund them accordingly.
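
One way to make the match between importance, controls, and funding checkable is a small script run against the tier list. The sketch below is illustrative only: the tier numbers, the control names in REQUIRED_CONTROLS, and the funded flag are assumptions, not a prescribed taxonomy.

```python
from dataclasses import dataclass, field

# Hypothetical mapping of tier -> controls that tier is expected to have.
# Tier numbers and control names are illustrative assumptions.
REQUIRED_CONTROLS = {
    1: {"multi_region", "on_call_rotation", "tested_restore"},
    2: {"on_call_rotation", "tested_restore"},
    3: {"tested_restore"},
}


@dataclass
class Service:
    name: str
    tier: int
    controls: set[str] = field(default_factory=set)
    funded: bool = False  # assumed flag: does a budget line cover its controls?


def tier_gaps(service: Service) -> list[str]:
    """Return readable mismatches between a service's tier and its posture."""
    gaps = []
    missing = REQUIRED_CONTROLS[service.tier] - service.controls
    for control in sorted(missing):
        gaps.append(f"missing control: {control}")
    if service.tier == 1 and not service.funded:
        gaps.append("tier 1 service has no dedicated reliability funding")
    return gaps


# Example: a tier 1 service with only an on-call rotation and no funding.
checkout = Service("checkout", tier=1, controls={"on_call_rotation"})
checkout_gaps = tier_gaps(checkout)
```

An empty list means the tier, the controls, and the funding agree. Anything else is a candidate for either a tier change or a funding conversation.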

ADRs make the trade-off durable

Reliability-significant decisions should not live only in planning memory.

The practical rule is simple: extend the existing ADR system with reliability fields rather than inventing a parallel process.

If a service targets 99.9% availability instead of 99.99%, someone should be able to explain:

why that target was chosen
what higher or lower targets were rejected
what resilience premium was accepted
what conditions should trigger a review later

If the team cannot do that in one page, the target is probably aspirational rather than governed.
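
The four questions above map naturally onto a handful of extra fields on an existing ADR record. The sketch below is a minimal illustration, assuming a hypothetical ReliabilityADR shape and a 30-day window; the field names are not taken from any particular ADR tool. The helper converts an availability target into the downtime it permits, which is often the quickest way to make the difference between two targets concrete.

```python
from dataclasses import dataclass


@dataclass
class ReliabilityADR:
    # Hypothetical field names; adapt them to the ADR template already in use.
    service: str
    target_availability: float          # e.g. 0.999
    rationale: str                      # why that target was chosen
    rejected_targets: dict[float, str]  # alternatives and why they were rejected
    resilience_premium: str             # cost accepted to hold the target
    review_triggers: list[str]          # conditions that should reopen the decision


def allowed_downtime_minutes(availability: float, window_days: int = 30) -> float:
    """Downtime budget implied by an availability target over a window."""
    return (1.0 - availability) * window_days * 24 * 60


adr = ReliabilityADR(
    service="checkout",
    target_availability=0.999,
    rationale="Illustrative placeholder: matches the tier's funded controls.",
    rejected_targets={0.9999: "Illustrative placeholder: premium not funded."},
    resilience_premium="Illustrative placeholder: second region on standby.",
    review_triggers=["tier change", "repeated budget exhaustion"],
)
budget = allowed_downtime_minutes(adr.target_availability)
```

Over a 30-day window, 99.9% permits roughly 43.2 minutes of downtime and 99.99% roughly 4.3 minutes. Stating the gap in minutes makes the rejected alternatives easier to defend.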

Reliability debt deserves its own ledger

Feature debt and security debt are often tracked explicitly. Reliability debt is usually remembered informally until an outage turns it into a board topic.

That is too late.

Deferred resilience work should be recorded with:

description
service tier
amplified failure mode
estimated outage cost if triggered
estimated remediation cost
named owner
committed review quarter

Each debt item should also include:

likelihood score from 1 to 5
impact score from 1 to 5
severity score = likelihood × impact
escalation tier based on severity

That is how cost reduction stops looking risk-neutral when it is not.
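
A ledger entry with these fields can be sketched as a small record type. The field names below mirror the lists above; the escalation thresholds (15 and 8) and the tier labels are assumptions chosen for illustration, and each organization would set its own.

```python
from dataclasses import dataclass


@dataclass
class DebtItem:
    description: str
    service_tier: int
    amplified_failure_mode: str
    outage_cost_estimate: float       # estimated outage cost if triggered
    remediation_cost_estimate: float  # estimated cost to fix
    owner: str
    review_quarter: str               # e.g. "2025-Q3"
    likelihood: int                   # 1 to 5
    impact: int                       # 1 to 5

    def __post_init__(self) -> None:
        # Reject scores outside the 1-to-5 scale so severities stay comparable.
        for name in ("likelihood", "impact"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

    @property
    def severity(self) -> int:
        """Severity score = likelihood x impact, ranging from 1 to 25."""
        return self.likelihood * self.impact

    @property
    def escalation_tier(self) -> str:
        # Thresholds are illustrative assumptions, not a standard.
        if self.severity >= 15:
            return "executive"
        if self.severity >= 8:
            return "director"
        return "team"


item = DebtItem(
    description="Single-zone database with untested failover",
    service_tier=1,
    amplified_failure_mode="zone outage becomes full service outage",
    outage_cost_estimate=250_000.0,
    remediation_cost_estimate=40_000.0,
    owner="payments-platform",
    review_quarter="2025-Q3",
    likelihood=4,
    impact=5,
)
```

Keeping the outage cost and the remediation cost side by side on the same record is the point: the comparison is visible whenever the item is reviewed.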

When total Tier 1 debt severity rises for two consecutive monthly reviews, escalate funding and remediation decisions to the executive reliability forum.
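
The escalation rule can be expressed as a check over the history of monthly totals. This sketch assumes "rises for two consecutive monthly reviews" means each of the last two reviews showed a strictly higher Tier 1 total than the review before it; the function name is hypothetical.

```python
def should_escalate(monthly_tier1_totals: list[int]) -> bool:
    """True when total Tier 1 severity rose at each of the last two reviews.

    Expects totals in chronological order, one per monthly review.
    Two consecutive rises need three data points.
    """
    if len(monthly_tier1_totals) < 3:
        return False
    earlier, previous, latest = monthly_tier1_totals[-3:]
    return earlier < previous < latest
```

For example, totals of 40, 46, and 52 would trigger escalation, while 40, 46, and 46 would not, because the second month held flat.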

Leading indicators matter more than incident summaries

Lagging indicators describe what already happened. Leading indicators improve the odds that the next incident is smaller or avoided.

The most useful ones are usually:

error budget burn rate
change failure rate
mean time to detect
mean time to mitigate
runbook freshness
alert noise ratio
simulation pass rate
backup operator coverage
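
Two of these indicators reduce to simple ratios. The sketch below uses the common definition of burn rate as the observed error rate divided by the error rate the target allows, so a value of 1.0 consumes the budget exactly over the target window. The function names and the request-count inputs are assumptions for illustration.

```python
def burn_rate(failed_requests: int, total_requests: int, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    1.0 means the error budget is being spent exactly on schedule;
    values above 1.0 mean it will run out before the window ends.
    """
    if total_requests == 0:
        return 0.0
    observed_error_rate = failed_requests / total_requests
    allowed_error_rate = 1.0 - slo
    return observed_error_rate / allowed_error_rate


def change_failure_rate(failed_changes: int, total_changes: int) -> float:
    """Share of production changes that caused a failure needing remediation."""
    if total_changes == 0:
        return 0.0
    return failed_changes / total_changes


# Example: 50 failures in 10,000 requests against a 99.9% target.
current_burn = burn_rate(50, 10_000, slo=0.999)
current_cfr = change_failure_rate(3, 40)
```

In this example the service is spending its budget about five times faster than the target allows, and 7.5% of changes failed. Neither number describes an incident, but both indicate where the next one is likely to come from.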

Practical model

Artifact	What it does
tiering table	connects business importance to funded controls
reliability ADR	records why the target exists
debt ledger	makes deferred risk visible
leading-indicator dashboard	surfaces drift before incidents compound

What to do this quarter

Refresh the service tier list.
Write one reliability ADR for the most consequential target in the estate.
Create the first five entries in a reliability debt ledger.
Review leading indicators monthly, rather than relying only on quarterly incident summaries.

Bottom line

Reliability becomes a system only when the organization can fund it, record it, revisit it, and measure whether it is drifting before production tells the truth.

Chapter bridge

Chapter 10 closes the guide with a practical quarterly execution rhythm.