Chapter 11: Incident Triage and Response Protocols

← Chapter 10: Execution and the Next Quarter | Chapter 12: Reliability Pricing →

Note: This chapter uses technical terms. See the Glossary for definitions.

The Problem This Chapter Solves

Chapter 0 teaches you the immediate response. This chapter teaches you why the decisions work and what goes wrong when they fail.

Fast recovery is not luck. It is decision quality under pressure.

Most reliability books assume you have time to think. Real incidents do not give you that. This chapter is about making good decisions in minutes with incomplete information.

Part 1: Decision-Making Under Pressure

The OODA Loop: Why Speed Beats Perfection

John Boyd, a U.S. Air Force fighter pilot and military strategist, developed this model. It applies to incident response too:

1. OBSERVE: What is happening right now?
2. ORIENT: What do I know about situations like this?
3. DECIDE: What is my next action?
4. ACT: Do it
5. Loop back to OBSERVE: What happened?

The team that cycles through this loop fastest recovers fastest. Speed matters more than perfect decisions.

Why:

A fast decision that is 80% right works better than a perfect decision that comes 5 minutes too late
You learn from observing results. Waiting for perfect information wastes learning time
Situations change. If you wait 10 minutes to decide, the situation has changed

How to apply:

Observe for 90 seconds
Make the best decision you can
Act on it
Observe what happened
Adjust if needed

What fails:

Spending 15 minutes investigating before acting (too slow)
Acting without observing results (you do not learn)
Making one decision and refusing to change it (you ignore new information)
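
The observe, decide, act, adjust cycle above can be sketched as a small loop. This is a minimal illustration, not a real incident tool: `ooda_loop`, its three callables, and `max_cycles` are hypothetical names, and the simulated incident (every action halves the error percentage) is an assumption made only so the example runs.

```python
from typing import Callable, List, Optional

def ooda_loop(
    observe: Callable[[], int],
    decide: Callable[[int], Optional[str]],
    act: Callable[[str], None],
    max_cycles: int = 5,
) -> List[str]:
    """Cycle observe -> decide -> act until decide() returns None."""
    actions_taken: List[str] = []
    for _ in range(max_cycles):
        error_pct = observe()        # OBSERVE: what is happening right now?
        action = decide(error_pct)   # ORIENT + DECIDE: best next action, or None
        if action is None:           # nothing left to do, stop cycling
            break
        act(action)                  # ACT, then loop back and observe the result
        actions_taken.append(action)
    return actions_taken


# Simulated incident (assumption): every action halves the error percentage.
state = {"error_pct": 40}
actions = ooda_loop(
    observe=lambda: state["error_pct"],
    decide=lambda pct: None if pct < 6 else "restart_worker",
    act=lambda action: state.update(error_pct=state["error_pct"] // 2),
)
print(actions, state["error_pct"])
```

The point of the sketch is the shape: each pass re-reads the situation before deciding again, so a wrong action is corrected on the next cycle instead of being defended.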

The Medical Triage Model

Emergency responders at a mass-casualty scene classify each patient in about 60 seconds into one of four categories:

RED (Immediate):     Life-threatening, treat now
YELLOW (Urgent):     Serious, treat soon
GREEN (Stable):      Injured but stable, treat later
BLACK (Expectant):   Unlikely to survive

This is NOT permanent. As information arrives, patients move between categories.

Why this works:

Fast decision, not perfect
Resources go to the highest impact cases first
Improves overall outcome

Apply to incidents:

RED (Immediate):     Customers losing money right now, all hands
YELLOW (Urgent):     Customers notice degradation, escalate
GREEN (Stable):      Only internal users affected, monitor
BLACK (Unknown):     Might not be real, needs investigation

Not every incident is RED. Treating everything as RED creates alert fatigue and slows decisions.
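
One way to make the incident categories concrete is a small classifier. This is a sketch under stated assumptions: the three boolean inputs and the order in which they are checked are illustrative choices, not rules from this chapter.

```python
from enum import Enum

class Severity(Enum):
    RED = "immediate: all hands"
    YELLOW = "urgent: escalate"
    GREEN = "stable: monitor"
    BLACK = "unknown: needs investigation"

def triage(confirmed: bool, losing_money: bool, customers_notice: bool) -> Severity:
    """Classify an incident from three quick yes/no observations."""
    if not confirmed:          # signal might not be real yet
        return Severity.BLACK
    if losing_money:           # customers losing money right now
        return Severity.RED
    if customers_notice:       # visible degradation
        return Severity.YELLOW
    return Severity.GREEN      # only internal users affected

print(triage(confirmed=True, losing_money=False, customers_notice=True))
```

Because the classification is cheap, it can be re-run every time new information arrives, which mirrors how patients move between categories.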

When to Override Your Runbook

Runbooks cover roughly 90% of failures. The other 10% require judgment.

Example: Your runbook says “Restart the database.” Usually correct. But what if:

Restart takes 10 minutes and recovery is happening naturally in 2 minutes?
Restart will lose in-flight transactions?
Restart will trigger replication re-sync that cascades the problem?

Runbooks cannot account for all context. That is why incident commanders exist.

Rule:

Runbook says X
You observe Y
Conditions unusual? → Override runbook, report to team lead
Conditions normal? → Execute runbook as written

Part 2: The Escalation Decision Criteria

Not all incidents need escalation. Escalating unnecessarily burns trust and slows decision-making.

Escalate when:

1. Customer revenue impact is actively happening
   Example: "Orders not processing" → Escalate immediately

2. Decision is above your authority
   Example: "Should we fail over to backup provider?" → Escalate
   
3. Recovery is taking longer than expected
   Example: 15 minutes in with no progress → Escalate
   
4. You are unsure and the stakes are high
   Example: "Should we rollback this schema change?" → Escalate
   
5. Multiple teams are affected
   Example: "API down because database is down" → Escalate

6. The decision is non-reversible or risky
   Example: "Should we delete the corrupted data?" → Escalate

Do NOT escalate when:

1. You are just scared
   Example: "The error rate is high but I do not know what to do"
   → Instead: Run the decision tree, make a decision, report result

2. It is your job to handle this severity
   Example: "Single service down" and you are the oncall
   → Instead: Follow the runbook, manage the incident

3. You want someone else to make a hard decision
   Example: "I fixed it but something else broke, should we continue?"
   → Instead: Make the decision based on customer impact
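
The six "escalate when" criteria can be written down as a checklist function, which makes it harder to escalate out of fear alone. This is a minimal sketch: `IncidentState`, its field names, and the 15-minute default for `stall_limit_min` (taken from the example in criterion 3) are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class IncidentState:
    revenue_impact: bool = False
    above_my_authority: bool = False
    minutes_without_progress: int = 0
    unsure_and_high_stakes: bool = False
    teams_affected: int = 1
    irreversible_action: bool = False

def escalation_reasons(s: IncidentState, stall_limit_min: int = 15) -> List[str]:
    """Return the criteria that apply; an empty list means handle it yourself."""
    checks = [
        (s.revenue_impact, "customer revenue impact"),
        (s.above_my_authority, "decision above my authority"),
        (s.minutes_without_progress >= stall_limit_min, "recovery stalled"),
        (s.unsure_and_high_stakes, "unsure and stakes are high"),
        (s.teams_affected > 1, "multiple teams affected"),
        (s.irreversible_action, "non-reversible or risky decision"),
    ]
    return [reason for applies, reason in checks if applies]

print(escalation_reasons(IncidentState(minutes_without_progress=15, teams_affected=2)))
```

Returning the reasons, not just a yes/no, gives you the first sentence of the escalation message for free.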

Part 3: The Rollback Decision in Detail

Rollback seems simple. It is actually dangerous.

Why Rollback Is Risky

Scenario 1: Other services depend on your new code

Timeline:
  14:32 — You deploy new code
  14:35 — Another team deploys code that depends on your new code
  14:38 — Your code breaks
  14:40 — You rollback your code to the old version
  
Result: The other team's code is calling functions that no longer exist. 
Their code breaks too. Cascade failure.

Scenario 2: You changed the database structure

Timeline:
  14:32 — You deploy code + database change (remove a column)
  14:35 — Everything looks good
  14:38 — New code breaks due to a different bug
  14:40 — You rollback the code
  
Result: Old code expects that column to exist. Writes fail.
Database is changed but code expects old structure.
Rollback did not fix the problem.

Scenario 3: Rollback takes longer than fixing

The bug fix: 5 minutes to write, test, deploy
Rollback: 12 minutes (rebuild, canary test, verify)
Decision: Fix the bug, do not rollback

The Rollback Decision Matrix (Detailed)

Use this matrix. It forces you to think through the actual risks.

QUESTION 1: How recent was the deployment?
├─ < 2 minutes
│  └─ Error rate spiked? → Rollback immediately (the timing is too close to be a coincidence)
│
├─ 2-10 minutes
│  └─ Error rate is very high? → Probably rollback
│  └─ Error rate is moderate? → Investigate 2 more minutes
│  └─ Specific service failing? → Could be coincidental, investigate
│
├─ 10-30 minutes
│  └─ Unlikely to be the cause (something else happened)
│
└─ > 30 minutes
   └─ Unlikely unless there is a delayed trigger (queue backlog, etc.)

QUESTION 2: What was deployed?
├─ Code only? → Rollback is safe (no state change)
│
├─ Database schema?
│  └─ New column? → Rollback safe (schema change is backward compatible)
│  └─ Column deleted? → Rollback risky (old code will fail on deleted column)
│  └─ Type changed? → Rollback risky (old queries might fail)
│
└─ Configuration/infrastructure?
   └─ Feature flag? → Rollback is safe (toggle it off)
   └─ Scale change? → Rollback is safe
   └─ Traffic routing? → Rollback is safe

QUESTION 3: Are other services depending on the new behavior?
├─ YES → Rollback will cascade the failure
│  └─ Do NOT rollback, fix forward
│
├─ MAYBE → Check git logs for recent deploys by dependent services
│  └─ If deployed AFTER this change → Rollback will cascade
│  └─ If deployed BEFORE this change → Safe to rollback
│
└─ NO (you checked) → Rollback is safe

QUESTION 4: Will rollback cause data loss?
├─ YES → Do NOT rollback (recovery is better than data loss)
│
└─ NO → Rollback is safe

FINAL DECISION:
├─ Safe rollback + error rate bad + deployment recent?
│  └─ ROLLBACK IMMEDIATELY
│
├─ Rollback is risky OR unclear?
│  └─ Escalate to VP, let them decide
│
├─ Rollback will take > 30 minutes?
│  └─ Fix forward if fix is faster
│  └─ Roll back if fix is unknown
│
└─ Still unsure?
   └─ Escalate, do not guess
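
A compressed version of the matrix can be expressed as a function. This is a sketch, not the full matrix: the `Deploy` fields, the change-type strings, and the 10-minute cutoff for "recent" are assumptions, and the order of the checks (data loss first, then dependents, then risky or unclear cases) is one reasonable reading of the questions above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Deploy:
    minutes_since_deploy: float
    error_rate_bad: bool
    change_type: str  # "code", "schema_add_column", "schema_drop_column", "schema_type_change", "config"
    dependents_deployed_after: Optional[bool]  # None = nobody has checked yet
    causes_data_loss: bool

RISKY_CHANGES = {"schema_drop_column", "schema_type_change"}

def rollback_decision(d: Deploy) -> str:
    if d.causes_data_loss:                      # Question 4
        return "do not roll back: fix forward"
    if d.dependents_deployed_after:             # Question 3, YES
        return "do not roll back: fix forward"
    if d.dependents_deployed_after is None or d.change_type in RISKY_CHANGES:
        return "escalate"                       # risky or unclear
    if d.error_rate_bad and d.minutes_since_deploy <= 10:
        return "roll back"                      # safe + bad error rate + recent
    return "investigate"                        # deploy is probably not the cause

print(rollback_decision(Deploy(3, True, "code", False, False)))
```

Note that "nobody has checked the dependents" is treated as unclear and routes to escalation, not to rollback.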

Real Postmortem: Why Rollback Backfired

Incident: PaymentService erroring 10% of requests

Timeline:

14:32 — Deployment A: PaymentService v2.1 (code change)
14:35 — Deployment B: ReceiptService v1.4 (depends on PaymentService new API)
14:38 — Error rate on PaymentService spikes to 10%
14:39 — Oncall engineer: “Recent deployment, rollback immediately”
14:41 — Rollback of PaymentService to v2.0
14:43 — ReceiptService calling non-existent endpoint, cascading failure
14:50 — ReceiptService now at 50% error rate
14:55 — Realized the cascade, rolled back ReceiptService too

What should have happened:

14:40 — Check: “Did any service depend on this deployment?” (YES)
14:41 — Check git logs: ReceiptService deployed 3 minutes AFTER PaymentService
14:42 — Decision: Do NOT rollback PaymentService, fix forward instead
14:45 — Found bug in PaymentService, deployed fix
14:46 — Error rate back to baseline

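Time lost due to the wrong rollback decision: at least 9 minutes (resolution at 14:55 or later instead of 14:46), plus an avoidable ReceiptService outage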
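
The dependency check that was skipped in this incident is mechanical enough to script. The sketch below assumes you can get a deploy log as (time, service) pairs and a dependency map; `dependents_deployed_after`, `deploy_log`, and `depends_on` are hypothetical names, and times are assumed to fall on the same day.

```python
from datetime import datetime
from typing import Dict, List, Tuple

def dependents_deployed_after(
    deploy_log: List[Tuple[str, str]],
    changed_service: str,
    depends_on: Dict[str, List[str]],
) -> List[str]:
    """Return services that depend on changed_service and deployed after it."""
    latest: Dict[str, datetime] = {}
    for hhmm, service in deploy_log:
        ts = datetime.strptime(hhmm, "%H:%M")
        latest[service] = max(ts, latest.get(service, ts))  # keep the newest deploy
    changed_at = latest[changed_service]
    return [
        svc for svc, deps in depends_on.items()
        if changed_service in deps and svc in latest and latest[svc] > changed_at
    ]

# Data from the postmortem timeline; the dependency map is an assumption.
deploy_log = [("14:32", "PaymentService"), ("14:35", "ReceiptService")]
depends_on = {"ReceiptService": ["PaymentService"]}
at_risk = dependents_deployed_after(deploy_log, "PaymentService", depends_on)
print(at_risk)  # ['ReceiptService']
```

A non-empty result means a rollback is likely to cascade, which is the signal the oncall engineer needed at 14:39.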

Part 4: The Failover Decision in Detail

Failover is less risky than rollback but still has hidden dangers.

When to Failover

QUESTION 1: Is primary actually down (unreachable)?
├─ YES → Failover immediately
├─ MAYBE → Not responding to pings but might recover?
│  └─ Wait 30 seconds, then decide
└─ UNSURE → Escalate

QUESTION 2: Is failover faster than recovery?
├─ Primary recovery < 2 minutes?
│  └─ Wait for recovery (less risky)
├─ Primary recovery 2-5 minutes?
│  └─ Failover (prevents further customer impact while primary comes up)
└─ Primary recovery > 5 minutes?
   └─ Failover (faster)

QUESTION 3: Will failover cause issues?
├─ Replication lag is high (> 5 minutes)?
│  └─ Failover will cause data loss, escalate
├─ Failover has never been tested?
│  └─ Escalate (unfamiliar procedure, might make things worse)
├─ Failover time is unknown or long (> 10 minutes)?
│  └─ Wait for primary recovery (failover may be slower than recovery)
└─ Failover is well-tested and fast (< 2 minutes)?
   └─ Safe to failover

QUESTION 4: Is the issue on the primary or upstream?
├─ Primary is fine but traffic not reaching it?
│  └─ Do NOT failover (issue is upstream, failover will not help)
│
├─ Primary is actually broken?
│  └─ Failover OK if primary is gone
│
└─ UNSURE?
   └─ Escalate
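
The four failover questions can be collapsed into one function. This is a sketch under assumptions: the `FailoverInputs` field names are hypothetical, `None` stands for "unsure" or "unknown", and the order in which the questions are evaluated is one reasonable choice, not a rule from this chapter.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FailoverInputs:
    primary_down: Optional[bool]        # None = unsure
    issue_is_upstream: Optional[bool]   # None = unsure
    primary_recovery_min: float         # estimated minutes until primary recovers
    replication_lag_min: float
    failover_tested: bool
    failover_min: Optional[float]       # None = unknown

def failover_decision(f: FailoverInputs) -> str:
    if f.primary_down is None or f.issue_is_upstream is None:
        return "escalate"                        # Questions 1 and 4: unsure
    if f.issue_is_upstream:
        return "do not fail over"                # Question 4: failover will not help
    if not f.primary_down:
        return "wait 30 seconds and re-check"    # Question 1: might recover
    if f.primary_recovery_min < 2:
        return "wait for recovery"               # Question 2
    if f.replication_lag_min > 5 or not f.failover_tested:
        return "escalate"                        # Question 3: data loss or untested
    if f.failover_min is None or f.failover_min > 10:
        return "wait for recovery"               # Question 3: failover too slow
    return "fail over"

print(failover_decision(FailoverInputs(True, False, 8, 0.5, True, 1.5)))
```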

Post-Failover Validation

Do NOT assume failover succeeded. Validate:

After failover completes:
1. [ ] New primary responding to traffic
2. [ ] No data loss (validate recent writes exist)
3. [ ] Replication lag acceptable (if replicated)
4. [ ] Dependent services working (test a customer journey)
5. [ ] Old primary actually down (do not try to use it)

If any check fails:
└─ Fail back to the original primary if it has recovered (more likely to work than continuing on a broken secondary)
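
The checklist above lends itself to a small runner that refuses to report success until every check has passed. This is a sketch: `validate_failover` is a hypothetical helper, and the lambdas below stand in for real probes (health endpoints, a read of a recent write, a synthetic customer journey) that you would supply.

```python
from typing import Callable, Dict, List

def validate_failover(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of those that failed."""
    failed: List[str] = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False          # a probe that crashes counts as a failed check
        if not ok:
            failed.append(name)
    return failed

# Placeholder probes (assumptions); replace with real checks.
failed = validate_failover({
    "new primary responding": lambda: True,
    "recent writes present": lambda: True,
    "replication lag acceptable": lambda: False,
    "customer journey works": lambda: True,
    "old primary is down": lambda: True,
})
print(failed)  # ['replication lag acceptable']
```

Running all checks instead of stopping at the first failure gives the incident commander the full picture in one pass.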

Part 5: Communication During Incident (The Underestimated Skill)

Most incident guides skip communication. It is actually the second-most impactful decision after the technical fix.

The Status Page Update Timing

Chapter 0 already gives you the exact minute-by-minute status page template. Use that script during the incident.

This chapter focuses on why that script works:

Cadence beats detail. Customers need predictable updates more than deep technical detail.
Specific action beats vague intent. “Restarting database” builds more trust than “investigating”.
Honest uncertainty beats false certainty. “We do not know yet” is better than a wrong ETA.
Validation beats celebration. Do not declare all clear until data integrity and customer journeys are verified.

If you need the literal wording in the middle of an outage, use the quick template in Chapter 0.

The Customer Impact Honesty Rule

Lying about impact delays customer response, triggers unnecessary escalations, and destroys trust.

Honest communication during incident:
├─ Quantify impact: "Affecting X% of users on feature Y"
├─ Acknowledge severity: "We know this is critical, our team is working"
├─ Give time horizon: "We estimate 15 more minutes" or "We do not know yet"
└─ Update before silence: "No change in status, but still working"

Impact honesty speeds up customer response and preserves trust.
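
The honesty rule can be baked into a message builder so that a missing ETA produces an honest "we do not know yet" instead of a guess. This is a sketch: `status_update` and its parameters are hypothetical, and the wording is assembled from the examples above, not from the Chapter 0 template.

```python
from typing import Optional

def status_update(pct_users: int, feature: str, action: str,
                  eta_minutes: Optional[int] = None) -> str:
    """Build a status message that quantifies impact and never invents an ETA."""
    eta = (f"We estimate {eta_minutes} more minutes."
           if eta_minutes is not None
           else "We do not know yet how long this will take.")
    return (f"Affecting {pct_users}% of users on {feature}. "
            f"We know this is critical, our team is working. "
            f"Current action: {action}. {eta}")

print(status_update(10, "checkout", "restarting database"))
```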

Part 6: Role Clarity (The Source of Most Failures)

When roles are unclear, incidents take 2-3x longer to resolve.

Who Decides What?

INCIDENT COMMANDER:
├─ Decides: Rollback vs. fix forward
├─ Decides: Failover or wait for recovery
├─ Decides: Escalate or continue
├─ Has authority over: Engineering team during incident
├─ Does NOT have authority over: Business/product decisions during incident
└─ Does NOT decide: Root cause (that is post-incident)

TECH LEAD:
├─ Decides: Technical triage (is this cascade or single service?)
├─ Decides: Which team member investigates what
├─ Executes: IC's decisions (rollback, restart, failover)
├─ Reports to: IC every 3 minutes
├─ Escalates: If IC decision appears technically wrong
└─ Does NOT decide: Business impact or next-steps strategy

COMMUNICATIONS LEAD:
├─ Decides: What and when to communicate
├─ Decides: Whether to escalate to executives
├─ Reports to: IC on customer panic level
├─ Does NOT decide: Technical response
└─ Does NOT have authority over: Engineering choices (that is IC's role)

EXECUTIVE (if involved):
├─ Decides: Business response (offer credits? pause billing? etc.)
├─ Decides: Customer communication timing
├─ Receives: Regular briefing from IC
├─ Does NOT have authority over: Technical decisions (that is IC's)
└─ Role: Listen, provide context, execute business decisions
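
Some teams find it useful to keep the authority map as data so it can be checked, printed at the start of an incident, or reviewed in a retrospective. The sketch below encodes a subset of the roles above; the key names and the `may_decide` helper are hypothetical.

```python
# Who owns which decision during an incident (subset of the role table above).
DECISION_OWNER = {
    "rollback_vs_fix_forward": "incident_commander",
    "failover_or_wait": "incident_commander",
    "escalate_or_continue": "incident_commander",
    "technical_triage": "tech_lead",
    "investigation_assignments": "tech_lead",
    "what_and_when_to_communicate": "communications_lead",
    "business_response": "executive",
}

def may_decide(role: str, decision: str) -> bool:
    """True only if this role owns the decision; unknown decisions have no owner."""
    return DECISION_OWNER.get(decision) == role

print(may_decide("executive", "failover_or_wait"))  # False
```

Each decision has exactly one owner in this map, which is the property that prevents two people from making the same call.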

Common Failure: Unclear Authority

Scenario from real incident:

14:40 — IC says: "Let's rollback"
14:41 — Tech lead says: "Wait, I think I see the issue, give me 2 minutes"
14:42 — VP joins: "What is the ETA?"
14:43 — IC says: "Depends on tech lead's assessment"
14:44 — VP says: "I think we should fail over"
14:45 — Tech lead: "But failover will cause cascade..."
14:46 — Everyone talking over each other, no decision made
14:48 — Customer impact worsening, now 50% error rate

What went wrong:
- IC did not decide (deferred to tech lead)
- VP tried to override IC (two bosses)
- No one had clear authority to make the call

How to fix it:
14:40 — IC: "Tech lead, what is your assessment?"
14:41 — Tech lead: "I see the issue, need 2 minutes"
14:42 — IC: "You have 2 minutes, then we rollback"
14:43 — Tech lead: "I need 4 minutes for safety"
14:44 — IC: "OK, 4 minutes. VP, I will keep you updated"
14:45 — Tech lead fixes issue
14:46 — Error rate recovering

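Clear authority did not make the clock run faster. The first timeline spent 8 minutes (14:40 to 14:48) without a decision while the error rate climbed to 50%; the second reached recovery in 6 minutes (14:40 to 14:46). The difference is that the same minutes produced a decision and a successful outcome.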

Part 7: When Things Go Wrong (Incident Adjustments)

Incidents rarely follow the script. You will make decisions that are wrong.

When that happens:

Course Correction Sequence

STEP 1: Observe that the decision is not working
Example: "We rolled back, error rate still bad"

STEP 2: Do NOT wait for full validation (that takes time)
Instead: Make a corrective decision immediately

STEP 3: Report the course correction to the team
Example: "Rollback did not help, we are now fixing forward"

STEP 4: Escalate if needed
If the new decision is riskier than the original, escalate to VP

STEP 5: Continue monitoring
Watch for second-order effects from the correction

TIME SAVED BY QUICK COURSE CORRECTION: 15-30 minutes

Incident Post-Mortems: Learning from Wrong Decisions

Bad decisions during incidents are not failures. They are learning opportunities.

The question to ask is not “Why did we make the wrong decision?” (pressure, incomplete information).

The question is “What structure would have prevented this decision?”

Example:

Incident: Rolled back, which cascaded failure to dependent service

Post-Mortem questions:
✗ "Why did the IC decide to rollback?" (They had incomplete information)
✓ "What runbook or decision matrix would have surfaced the cascade risk?"

Outcome: Add pre-rollback check to decision matrix:
"Before rollback, verify that no service depends on new behavior"

Part 8: Building Judgment (Incident Experience Over Time)

The best incident commanders are not the smartest engineers. They are the engineers who have responded to enough incidents to recognize patterns.

Judgment is learned. Here is how to get there:

Incident Retrospectives (the Learning Loop)

IMMEDIATELY AFTER INCIDENT (30 minutes):
└─ Postmortem meeting with full team
   ├─ What happened (timeline, symptoms, decisions)
   ├─ What worked well (specific decisions, not "good communication")
   ├─ What could be better (runbooks, alerts, decision criteria)
   └─ Follow-ups (implement what could be better)

48 HOURS LATER:
└─ Follow-up on action items
   ├─ Which were completed? (usually: 50-70%)
   ├─ Which are still needed?
   └─ Which are blocked?

QUARTERLY REVIEW:
└─ Categorize the incident type
   ├─ "This was our 3rd cascade failure in 6 months"
   ├─ "This was our 1st silent data corruption"
   └─ "This pattern keeps happening, needs architectural fix"
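
The quarterly categorization is a counting exercise, so it can be sketched in a few lines. The category labels, the sample data, and the threshold of 3 are assumptions for illustration; use whatever taxonomy your postmortems already record.

```python
from collections import Counter
from typing import Dict, List

def recurring_patterns(categories: List[str], threshold: int = 3) -> Dict[str, int]:
    """Return incident categories seen at least `threshold` times."""
    counts = Counter(categories)
    return {cat: n for cat, n in counts.items() if n >= threshold}

# Hypothetical six months of incident categories.
last_six_months = [
    "cascade failure", "bad deploy", "cascade failure",
    "silent data corruption", "cascade failure", "bad deploy",
]
print(recurring_patterns(last_six_months))  # {'cascade failure': 3}
```

Anything that crosses the threshold is a candidate for an architectural fix instead of another runbook entry.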

Building the Decision Library

Over time, you will see patterns. Document them.

After 10 incidents on your team, you have probably seen:
├─ 2 rollback decisions (1 good, 1 bad)
├─ 1-2 failover decisions
├─ 2-3 cascade failures
├─ 1 data corruption incident
└─ 3-4 "unknown" incidents that turned out to be X

After 50 incidents, your IC can recognize and decide faster because
they have lived through similar situations.

This is not genius. It is pattern recognition through experience.

Summary: Decisions That Matter

This chapter teaches three things:

Decision frameworks (what to decide, what the options are)
Decision discipline (making decisions under pressure without panicking)
Decision learning (getting better at decisions over time)

The teams that recover fastest are not lucky. They are trained in decision-making under uncertainty.

That training is learnable. Use the frameworks in this chapter. Practice with the playbooks. Learn from your own incidents.

Over time, you will build the judgment to make good decisions when information is incomplete and time is running out.

That judgment is worth more than any runbook.
