Chapter 0: The First 24 Hours
Incident Triage and Immediate Response
Your system is failing. You have minutes, not hours, to make the right calls.
This chapter is not theory. It is procedure. Read it now, before your pager goes off at 2 AM. Print Appendix A and laminate it.
Note: This chapter uses technical terms. See the Glossary for definitions of unfamiliar words.
The Central Thesis
Incident response fails for one reason: teams skip triage and jump straight to finding the root cause.
Finding the root cause is important. But that is not what the first 10 minutes are for.
The first 10 minutes are for stopping the bleeding. Stop the immediate damage.
The next hour is for understanding what happened. Figure out what went wrong.
Everything after that is for preventing it again. Make sure it does not happen next time.
This chapter teaches the first 10 minutes.
The Architecture of a Good Incident Response
Every incident follows the same sequence. The sequence never changes. Only the details change.
MINUTES 0-2: TRIAGE (what is actually broken?)
MINUTES 2-10: IMMEDIATE ACTION (stop the problem from spreading)
MINUTES 10-30: STABILIZATION (get the system partially working)
MINUTES 30-60: ROOT CAUSE HYPOTHESIS (why did this happen?)
HOUR 1-4: FULL RECOVERY (everything working again)
HOUR 4+: VALIDATION (is it actually safe?)
Teams that follow this sequence recover faster. Teams that skip to finding the root cause often make things worse.
The sequence feels wrong when you are panicking. Your first instinct is to find the problem immediately. But fighting that instinct is the only way to recover fast.
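The timeline above can be kept as a small lookup so that tooling (a chat bot, an incident timer) can remind responders which phase they should be in. This is a minimal sketch: the boundaries come from the list above, while the name `phase_for` and the data layout are illustrative assumptions.

```python
# Phase boundaries from the timeline above, as (start_minute, phase_name).
PHASES = [
    (0, "TRIAGE"),
    (2, "IMMEDIATE ACTION"),
    (10, "STABILIZATION"),
    (30, "ROOT CAUSE HYPOTHESIS"),
    (60, "FULL RECOVERY"),
    (240, "VALIDATION"),
]


def phase_for(minutes_elapsed):
    """Return the phase an incident should be in after `minutes_elapsed`."""
    if minutes_elapsed < 0:
        raise ValueError("minutes_elapsed must be non-negative")
    current = PHASES[0][1]
    for start, name in PHASES:
        # Keep the last phase whose start time has already passed.
        if minutes_elapsed >= start:
            current = name
    return current
```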
Part 1: The Triage Decision Tree (First 2 Minutes)
The decision tree below answers: What kind of failure is this?
Use it step-by-step. Ask yes/no questions. You make better decisions based on facts, not emotion.
What does “cascade” mean? When one service breaks, it causes other services to break, which breaks more services. Like dominoes falling. See Glossary: Cascade
┌─────────────────────────────────────────────────────────┐
│ YOUR SYSTEM IS DOWN: FOLLOW THIS TREE (2 MINUTES) │
└─────────────────────────────────────────────────────────┘
QUESTION 1: Is the problem visible?
├─ Can you see dashboards/metrics?
│ ├─ NO → MONITORING IS DOWN
│ │ └─ Escalate immediately (Chapter 11, escalation criteria)
│ │ └─ Cannot triage without visibility
│ │ └─ Page infrastructure lead + monitoring oncall
│ │
│ └─ YES → Continue to Question 2
QUESTION 2: What is the geographic scope?
├─ Are ALL regions/zones affected?
│ ├─ YES → PROVIDER REGIONAL EVENT (or fundamental control-plane failure)
│ │ └─ Go to "Provider Event Triage" (below)
│ │ └─ Escalate immediately
│ │
│ └─ NO → Single region/zone or single service affected
│ └─ Go to Question 3
QUESTION 3: Did you deploy something in the last 5 minutes?
├─ YES → DEPLOYMENT FAILURE (likely)
│ └─ Go to "Deployment Triage" (below)
│ └─ Page the deployer's team lead
│
└─ NO → Continue to Question 4
QUESTION 4: What is the visible symptom?
├─ "Error rate spiking" / "Timeouts"
│ ├─ Is one service down (dependent services error)?
│ │ ├─ YES → Dependency failure
│ │ │ └─ Go to "Dependency Triage" (below)
│ │ │ └─ Page the failed service's team lead
│ │ │
│ │ └─ NO → Cascade failure
│ │ └─ Go to "Cascade Failure Triage" (below)
│ │ └─ Escalate immediately, multiple team leads
│
├─ "Latency increasing"
│ └─ Go to "Latency Triage" (below)
│ └─ Likely database or queue saturation
│
├─ "No traffic getting through"
│ ├─ Did we deploy? → Rollback decision
│ └─ Recent changes to traffic routing/LB? → Undo
│
└─ "Data looks wrong"
  └─ DO NOT ROLLBACK YET (Chapter 6)
  └─ Go to "Silent Failure Triage" (below)
  └─ Page database team, escalate before any remediation
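The four questions can also be written as a function, which is useful for drills or for a chat-bot prompt. This is a sketch only: the `Observations` fields and the symptom strings are hypothetical names, and the function simply walks the questions in the order the tree gives them.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    dashboards_visible: bool
    all_regions_affected: bool
    deployed_last_5_min: bool
    symptom: str  # assumed labels: "errors", "latency", "no_traffic", "bad_data"
    single_service_down: bool = False


def classify(obs):
    """Walk Questions 1-4 in order and return the triage section to use."""
    if not obs.dashboards_visible:
        return "MONITORING IS DOWN"
    if obs.all_regions_affected:
        return "Provider Event Triage"
    if obs.deployed_last_5_min:
        return "Deployment Triage"
    if obs.symptom == "errors":
        if obs.single_service_down:
            return "Dependency Triage"
        return "Cascade Failure Triage"
    if obs.symptom == "latency":
        return "Latency Triage"
    if obs.symptom == "no_traffic":
        return "Check recent deploys and routing/LB changes"
    if obs.symptom == "bad_data":
        return "Silent Failure Triage"
    raise ValueError(f"unknown symptom: {obs.symptom!r}")
```

Because a recent deploy is checked before symptoms, wrong data right after a deploy lands in Deployment Triage. The "do not rollback yet" rule for data problems still applies in that case.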
Provider Event Triage
Your cloud provider (AWS, Azure, GCP) is having an outage.
Quick questions:
1. Is your system actually down, or is the provider just slow?
- Provider status page says “everything is fine” but you are down? → Your problem, not theirs
- Provider status page confirms incident? → Wait or switch to backup
2. Do you have a backup in another region? (Cross-region failover means traffic switches to a backup automatically)
- YES → Switch to backup NOW (decision takes < 5 minutes)
- NO → Wait for provider to fix it
3. How long until the provider says it will be fixed?
- < 10 minutes → Wait
- > 10 minutes → Switch to backup or reduced mode
- Unknown → Assume 30 minutes, start switching to backup
Deployment Triage
You deployed new code in the last 5 minutes and the system broke.
What does “canary” mean? A small test. You send new code to 1% of users first, then 10%, then 100%. If something breaks, only 1% of users see it.
Quick questions:
1. How long did the deploy take?
- < 2 minutes → Probably safe, keep watching
- 2-5 minutes → Watch carefully
- > 5 minutes → Something is stuck, check deployment logs
2. Did the system break immediately or did it get worse slowly?
- Immediately after deploy (< 1 minute) → Rollback immediately
- Gradually (1-5 minutes after) → Could be related, consider rollback
- Hours later → Probably not related to this deploy
3. Are you running a canary (new code on only some servers)?
- YES → Route users away from the new servers
- NO → Rollback all servers to the old version
Decision:
- System is broken because of this deploy? → Rollback (undo the deploy)
- Not sure if this deploy caused it? → Watch for 2 minutes, then rollback if it gets worse
- System is stable? → Keep the new code, investigate elsewhere
Dependency Triage
One service is down or unhealthy.
Identify the dependency:
1. Which service is erroring?
2. Is it a known critical dependency?
- Auth / Identity → Chapter 5, escalate immediately
- Database → Chapter 6, escalate immediately
- Cache → Can degrade (fall back to DB), monitor
- Queue → Can degrade (queue locally), monitor
Action:
├─ Can the dependency be restarted?
│ ├─ Restart time < 30 seconds → Restart, monitor
│ └─ Restart time > 30 seconds → Failover if available
│
├─ Is this a known issue?
│ └─ Check runbook: scripts/runbooks/[service-name].md
│
└─ Is it a provider issue?
  └─ Check provider status page
  └─ Escalate if provider confirmed
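The dependency rules above fit in a few lines of code. A minimal sketch, assuming short labels for the dependency kinds. The final `else` branch covers a case the text does not specify (a slow restart with no failover available) and is an assumption.

```python
ESCALATE_NOW = {"auth", "database"}  # escalate immediately
CAN_DEGRADE = {"cache", "queue"}     # can degrade, then monitor


def dependency_plan(kind, restart_seconds, failover_available):
    """Return ordered steps for a failed dependency of the given kind."""
    steps = []
    if kind in ESCALATE_NOW:
        steps.append("escalate immediately")
    elif kind in CAN_DEGRADE:
        steps.append("degrade and monitor")
    if restart_seconds < 30:
        steps.append("restart and monitor")
    elif failover_available:
        steps.append("failover")
    else:
        # Not covered by the text: assumed fallback when no failover exists.
        steps.append("check runbook and provider status")
    return steps
```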
Cascade Failure Triage
Multiple services are failing, and it is hard to see which one started it.
IMMEDIATE ACTION (first 2 minutes):
1. Stop looking for root cause. Stop that.
2. Instead: Identify the lowest-level service that started failing.
- Check deployment times (which was deployed first?)
- Check error rates (which has highest error rate?)
- Check timestamps (which started failing first?)
3. Once identified: apply dependency triage to that service
RULE FOR CASCADE FAILURES:
Always fix from the lowest level upward. Do not try to fix the top of the cascade.
Example:
- API → Failing
- API depends on: Queue
- Queue → Failing
- Queue depends on: Network
Fix network first, then queue, then API. Do not start with the API.
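If you keep a dependency map, the bottom-up rule is a depth-first walk that lists dependencies before their callers. This is a sketch under assumptions: `deps` is a hypothetical map of service name to the services it calls, and `failing` is the set of services currently unhealthy.

```python
from typing import Dict, List, Set


def fix_order(root: str, deps: Dict[str, List[str]], failing: Set[str]) -> List[str]:
    """Return failing services reachable from `root`, dependencies first."""
    order: List[str] = []
    seen: Set[str] = set()

    def visit(service: str) -> None:
        if service in seen:
            return
        seen.add(service)
        # Visit everything this service depends on before the service itself.
        for dep in deps.get(service, []):
            visit(dep)
        if service in failing:
            order.append(service)

    visit(root)
    return order
```

For the example above, `fix_order("api", {"api": ["queue"], "queue": ["network"]}, {"api", "queue", "network"})` returns `["network", "queue", "api"]`.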
Latency Triage
Everything is responding, but slowly.
Quick questions:
1. Is the latency coming from the database?
- Check: SELECT query latency increasing? → Database issue
- Fix: Kill slow queries, add an index, scale the DB
- Go to: Chapter 6 (silent failures) and Chapter 7b (partial failures)
2. Is the latency coming from the queue?
- Check: Messages backing up? → Queue issue
- Fix: Scale queue consumers, increase partition count
- Go to: Chapter 5
3. Is the latency coming from the network?
- Check: All calls slow (not just one service)?
- Check: Regional breakdown (one region slow?)
- Go to: Provider event triage, network team
4. Is this cascading from one slow service?
- Check: Callers timing out waiting for slow service?
- Fix: Either speed up the service or circuit-break it
- Go to: Chapter 7b (partial failures)
ACTION:
├─ Immediate: Kill slow queries, drain queues, restart slow services
├─ 5 minutes: Assess if this is cascading (route around it)
└─ 30 minutes: Fix root cause
Silent Failure Triage (Data Corruption)
Your system is responding. But data looks wrong.
STOP. Do not rollback yet. Data corruption requires careful handling.
What does “replication lag” mean? Your backup database is slightly behind the main database. Changes take a few seconds to copy over.
Quick questions:
1. When did this start?
- Right after a deployment? → The new code changed data incorrectly
- Slowly over hours? → Changes were not copied to backups (replication lag)
- Discovered now, but old data? → How long ago did this start?
2. How much data is affected?
- A few records (< 100)? → Can be fixed manually
- Many records (> 1000)? → Need management approval before fixing
3. Can you safely rollback (undo the deploy) without losing good data that was written after the corruption?
- YES → Rollback is an option
- NO → Do not rollback
Decision:
- Rollback is safe? → Still check with the data team first
- Rollback is not safe? → Go to Chapter 6 (Silent Outages)
- Not sure? → Call management before doing anything
What to do NOW:
- Stop the bad data from spreading (stop writing)
- Measure how much data is affected (scope)
- Call the data team (do not fix this alone)
- Plan recovery (fix the data, do not just undo)
Part 2: Immediate Actions (Minutes 2-10)
Once you have classified the failure, take immediate action. Do not investigate further. Do not wait for root cause.
Action 1: Establish Command & Control (Minute 2)
Declare an incident. This triggers the response machinery.
PagerDuty / incident platform:
├─ Create incident: [brief description of triage result]
├─ Set severity: Critical (customer impacted) or Warning (no customer impact yet)
├─ Assign Incident Commander (senior engineer who can make decisions)
├─ Assign Communications Lead (notifies status page and customers)
└─ Assign Tech Lead (executes the response)
If unsure about severity: declare Critical. Downgrade later if needed.
Action 2: Stop the Failure from Spreading (Minutes 3-5)
If you identified a failing dependency, stop using it to prevent more failures.
What does “circuit-break” mean? When a dependency is broken, stop sending requests to it and give callers a fallback instead. This prevents one broken piece from breaking everything.
What to do (in order):
1. Can you stop using the broken service?
- YES → Stop using it immediately
- NO → Go to step 2
2. Can you serve users with reduced features (degraded mode)?
- Example: Use cached data instead of real-time data
- Example: Show “Service Temporarily Unavailable” instead of crashing
- YES → Switch to degraded mode
- NO → Go to step 3
3. Can you restart the broken service?
- YES → Restart it
- NO → Switch to backup if you have one
4. If none of the above work: Accept that you are down. Escalate. Tell customers when you expect recovery.
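Steps 1 and 2 are what a circuit breaker automates: after repeated failures it stops calling the dependency and serves a fallback. The class below is a minimal sketch, not a production library. The thresholds, the class name, and the injectable `clock` are all assumptions made for illustration.

```python
import time


class CircuitBreaker:
    """Stop calling a dependency after repeated failures; serve a fallback."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means closed: calls are allowed

    def call(self, func, fallback):
        # While open, skip the dependency until the cool-down has passed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # cool-down over: allow one trial call
            self.failures = self.max_failures - 1  # one more failure re-opens
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

A caller would wrap each request, for example `breaker.call(fetch_live_prices, read_cached_prices)`, where both function names are hypothetical.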
Action 3: Communicate Immediately (Minute 5)
Status page update. Not “we are investigating.” Not “we do not know yet.”
Template:
"We are aware of reports of [symptom]. Our team is triaging and [specific action: restarting X / rolling back Y / failing over to Z]. Estimated time to restoration: [honest estimate or 'unknown']. More updates every [5 min / 10 min]."
Timing:
- First update: Before minute 5
- Follow-up updates: Every 5 minutes if unresolved, every 10 if recovering
- Under-promise, over-deliver on timing
Do NOT:
- Say "we fixed it" until it is actually fixed
- Explain root cause until you understand it
- Predict recovery until you have made progress
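Filling the template by hand under pressure invites mistakes, so it helps to generate the message. A minimal sketch: the function name and parameters are illustrative, and an ETA that is not supplied is stated as "unknown" rather than guessed.

```python
def status_update(symptom, action, eta=None, next_update_min=5):
    """Fill the status page template; a missing ETA is reported as 'unknown'."""
    eta_text = eta if eta else "unknown"
    return (
        f"We are aware of reports of {symptom}. "
        f"Our team is triaging and {action}. "
        f"Estimated time to restoration: {eta_text}. "
        f"More updates every {next_update_min} min."
    )
```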
Action 4: Execute the Immediate Fix (Minute 5-10)
Based on your triage:
If deployment:
├─ Rollback deployment? (use decision matrix in Chapter 7b)
└─ Monitor error rate for 3 minutes post-rollback
If dependency failed:
├─ Restart it?
├─ Failover to replica?
└─ Circuit break, route around?
If cascade failure:
├─ Identify lowest-level failing service
└─ Restart/rollback that service first
If latency:
├─ Kill slow queries (do not wait for completion)
├─ Drain the queue (process locally if needed)
└─ Restart the slow service
If data corruption:
├─ Isolate corrupted data
├─ Stop writing to corrupted table
└─ Do NOT rollback without authorization
Part 3: Incident Roles (Who Decides What?)
The following roles exist for one reason: clarity under pressure.
Without clarity, teams second-guess each other. Decisions slow down. Recovery takes longer.
Role assignments are not permanent positions. They are assigned per-incident.
The Incident Commander
Authority: Final decision on major actions (rollback, failover, declare “all clear”)
Responsibilities:
- Calls the triage decision
- Decides between rollback vs. fix-forward
- Escalates when needed
- Declares the incident resolved
What they need:
- Full visibility into customer impact
- Tech lead’s assessment of what is happening
- Comms lead’s feedback on customer panic level
What they do NOT do:
- Execute technical fixes (that is Tech Lead’s job)
- Write status updates (that is Comms Lead’s job)
- Investigate root cause during the incident (that happens after)
The Tech Lead
Authority: Technical decision-making during the incident
Responsibilities:
- Executes IC’s decisions
- Owns technical assessment (is this data corruption? Cascading?)
- Directs engineering team (who investigates what)
- Reports progress to IC every 3 minutes
What they need:
- Access to systems and logs
- Authority to execute decisions without approval (kill queries, restart services)
- Clear direction from IC on what to do
What they do NOT do:
- Make the big decision (rollback or wait?) — IC does that
- Write status updates — Comms does that
- Investigate root cause during the incident — that is post-incident work
The Communications Lead
Authority: What customers hear and when
Responsibilities:
- Initial status page update within 2 minutes of being assigned (before minute 5 of the incident)
- Updates every 5-10 minutes (depending on severity)
- Alert escalation (customers, leadership) if recovery takes long
- Post-incident communication (what we learned, what is changing)
What they need:
- Updates from IC on progress
- Honest assessment of recovery timeline
- Authority to post without waiting for legal or approval (speed matters)
What they do NOT do:
- Make technical decisions
- Investigate technical details
- Commit to specific fixes or timelines without IC approval
Part 4: The Decision Matrices (Should We Do This?)
Three decisions determine incident outcomes.
Decision 1: Should We Rollback?
Use this matrix. Do not use your gut.
| Condition | Decision |
|---|---|
| Deployed < 2 min ago AND error rate spiked immediately | YES, rollback immediately |
| Deployed < 5 min ago AND error rate high | PROBABLY, but verify |
| Deployed > 15 min ago | Unlikely to be the cause unless coincidental timing |
| Error rate is bad but canary traffic not affected | NO, rollback would hurt more than help |
| Database schema changed in deployment | NO, rollback will cause data loss |
| Other services depend on new behavior | NO, rollback will cascade to them |
| Error rate after rollback is still bad | Rollback is not the fix |
| You are unsure | Escalate to VP before rolling back |
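
The rollback matrix can be encoded as ordered rules. The matrix does not say which row wins when several apply, so this sketch assumes the "NO" rows (schema change, dependent services, unaffected canary) are checked first. All parameter names are hypothetical.

```python
def rollback_decision(minutes_since_deploy, spiked_immediately, error_rate_high,
                      canary_unaffected=False, schema_changed=False,
                      others_depend_on_new_behavior=False):
    """Return the matrix row that applies, checking blocking rows first."""
    # Assumption: rows that forbid rollback take precedence over timing rows.
    if schema_changed:
        return "NO: rollback will cause data loss"
    if others_depend_on_new_behavior:
        return "NO: rollback will cascade to dependent services"
    if error_rate_high and canary_unaffected:
        return "NO: rollback would hurt more than help"
    if minutes_since_deploy < 2 and spiked_immediately:
        return "YES: rollback immediately"
    if minutes_since_deploy < 5 and error_rate_high:
        return "PROBABLY: verify first"
    if minutes_since_deploy > 15:
        return "UNLIKELY: deploy is probably not the cause"
    return "UNSURE: escalate before rolling back"
```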
Decision 2: Should We Failover?
Use this matrix.
| Condition | Decision |
|---|---|
| Primary is completely down (unreachable) | YES, failover immediately |
| Primary is running but degraded | Monitor for 2 min, see if it recovers |
| Failover time < 1 minute | Faster to failover than wait |
| Failover time > 5 minutes | See if primary recovers first |
| Failover will trigger replication lag issues | Wait, see if primary recovers |
| Data loss if we failover | DO NOT FAILOVER without VP approval |
| Failover has never been tested | DO NOT FAILOVER without VP approval |
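
The failover matrix can be sketched the same way. The rule order is again an assumption: the two rows that require VP approval are checked before anything else, because they are the ones that are hardest to undo.

```python
def failover_decision(primary_reachable, failover_seconds, data_loss=False,
                      tested=True, causes_replication_lag=False):
    """Return the failover matrix row that applies, approval rows first."""
    if data_loss or not tested:
        return "DO NOT FAILOVER without VP approval"
    if not primary_reachable:
        return "YES: failover immediately"
    if causes_replication_lag:
        return "WAIT: see if primary recovers"
    if failover_seconds < 60:
        return "YES: failover is faster than waiting"
    if failover_seconds > 300:
        return "WAIT: see if primary recovers"
    return "MONITOR: watch primary for 2 minutes"
```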
Decision 3: Should We Fix Forward (Code Change)?
Use this matrix.
| Condition | Decision |
|---|---|
| Root cause is identified and fix is simple (< 5 min code change) | YES, fix forward |
| Root cause is unknown | NO, either rollback or degrade |
| Fix requires database migration | NO, too risky, too slow |
| Fix requires changes to 3+ services | NO, too risky |
| Fix has never been tested in production | NO, too risky |
| Customer impact continues while you develop fix | Rollback/failover is faster |
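
Fix-forward is allowed only when no row of the matrix says NO, so a sketch can collect every blocking reason and report them together. The parameter names are hypothetical; `tested` stands for the opposite of the matrix's "never been tested" row.

```python
def can_fix_forward(root_cause_known, fix_minutes, needs_migration,
                    services_touched, tested):
    """Return (ok, reasons); ok is True only when no matrix row blocks the fix."""
    reasons = []
    if not root_cause_known:
        reasons.append("root cause is unknown")
    if fix_minutes >= 5:
        reasons.append("fix is not a simple (< 5 min) change")
    if needs_migration:
        reasons.append("fix requires a database migration")
    if services_touched >= 3:
        reasons.append("fix touches 3+ services")
    if not tested:
        reasons.append("fix has never been tested")
    return (not reasons, reasons)
```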
Part 5: The Escalation Ladder (When to Wake Up More People)
The goal is to escalate as little as possible while still solving the problem.
MINUTE 0-2: Oncall engineer runs triage
MINUTE 2-5: If triage shows:
├─ Regional provider event → Escalate immediately
├─ All services cascading → Escalate immediately
├─ Identity system down → Escalate immediately
├─ Data corruption → Escalate immediately (before action)
└─ Single service → Oncall may handle alone
MINUTE 5-30: If not resolved:
├─ After 5 min with no progress → Escalate
├─ Customer impact visible → Escalate
├─ Multiple teams needed → Escalate
└─ Still investigating → May not need escalation yet
MINUTE 30+: If still not resolved:
├─ Director+ escalation (board visibility)
├─ Executive communication
└─ Legal/compliance notification (if data impacted)
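The ladder reduces to one yes/no question that an oncall tool could ask every few minutes. A sketch, assuming short classification labels that are not part of the original text:

```python
# Triage results that the ladder escalates immediately (labels are assumed).
IMMEDIATE = {"provider_event", "cascade", "identity_down", "data_corruption"}


def should_escalate(classification, minutes_elapsed, making_progress,
                    customer_impact=False, multiple_teams_needed=False):
    """Apply the escalation ladder and return True when more people are needed."""
    if classification in IMMEDIATE:
        return True
    if minutes_elapsed >= 30:
        return True  # still unresolved at 30 minutes: Director+ escalation
    if minutes_elapsed >= 5:
        return (not making_progress) or customer_impact or multiple_teams_needed
    return False  # single service, first 5 minutes: oncall may handle alone
```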
Part 6: Common Failure Modes and Their Playbooks
Playbook A: Identity System Down
Identity service is unreachable.
TRIAGE (1 min):
- Is it completely down? (all requests fail?)
- Or degraded? (token refresh slow?)
If completely down:
1. Can you restart it? → Restart, monitor 3 min
2. Is it a certificate issue? → Renew cert, restart
3. Is replication broken? → Failover to replica
4. Is the provider down? → Wait for provider recovery
If recovery > 15 minutes:
1. Activate manual approval workflow (pre-built)
2. Serve cached tokens (approved user list)
3. New users: manual approval only
4. Cannot exceed 1 hour in this mode
VALIDATION:
- New tokens issued successfully
- Recent logins work
- No support queue explosion
Playbook B: Database Replication Lag
Primary and replica out of sync.
TRIAGE (1 min):
- Replication lag < 5 seconds? → Normal, monitor
- Replication lag 5-30 seconds? → High, consider failover
- Replication lag > 5 minutes? → Critical, escalate
If lag > 30 seconds:
1. Is primary actually failing? → Failover to replica
2. Is network slow? → Check link health
3. Is write volume high? → Wait for it to drain
4. Is replication broken? → Escalate to DB team
VALIDATION:
- Lag returns to < 5 seconds
- Reads from replica succeed
- No data loss
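The lag thresholds can be turned into an alert classifier. The triage list leaves the band between 30 seconds and 5 minutes unnamed, so this sketch labels it "act" (run the "lag > 30 seconds" steps). That label is an interpretation, not something the playbook states.

```python
def classify_lag(lag_seconds):
    """Map replication lag in seconds to a playbook band."""
    if lag_seconds < 0:
        raise ValueError("lag_seconds must be non-negative")
    if lag_seconds < 5:
        return "normal"    # monitor
    if lag_seconds <= 30:
        return "high"      # consider failover
    if lag_seconds <= 300:
        return "act"       # assumed band: run the "lag > 30 seconds" steps
    return "critical"      # over 5 minutes: escalate
```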
Playbook C: Cascade Failure (Multiple Services Down)
Multiple services failing.
TRIAGE (2 minutes):
1. Identify the lowest-level failing service
2. Identify what that service depends on
3. Check if the dependency is down
Example:
- API is failing
- API calls Queue
- Queue is failing
- Fix Queue, then API will recover
RULE: Always fix from bottom up, not top down.
ACTION:
1. Fix the lowest-level dependency
2. Monitor for propagation (API should recover)
3. If API still bad → restart API
4. Do NOT try to fix API while its dependency is broken
Playbook D: Bad Deployment
You deployed something, now everything is failing.
DECISION MATRIX (see Part 4, Decision 1):
├─ Deployed < 2 min ago? → Rollback immediately
├─ Deployed < 5 min ago? → Consider rollback
├─ Deployed > 15 min ago? → Probably not the cause
└─ Error rate after rollback still bad? → Not the cause
ROLLBACK EXECUTION:
1. Run: scripts/rollback.sh [deployment-id]
2. Monitor error rate (expect 3-min stabilization)
3. If still bad → Was not the deployment, investigate further
VALIDATION:
- Error rate returns to baseline
- Recent operations succeed
- No downstream impacts
Playbook E: Data Corruption Detected
System is responding, but data looks wrong.
DO NOT ROLLBACK YET.
TRIAGE (1 minute):
1. When did corruption start?
2. How many records affected?
3. Is corruption spreading (new records being created wrong)?
If spreading:
1. Stop writes to affected table
2. Page database team immediately
3. Do not make recovery decision without them
If localized:
1. Audit scope (how many records actually bad?)
2. Isolate corrupted records
3. Plan reconciliation (not rollback)
4. Validate recovery before announcing
RULE: Recovery is more important than speed. Do not lose good data trying to fix bad data.
Part 7: When You Do Not Know What to Do
This will happen. You will encounter a failure that does not match your playbooks.
When that happens:
Do not guess. Escalate.
If you are unsure:
1. Declare the incident (at the severity you think it is)
2. Page the person who knows (if possible)
3. Tell them exactly what you see (symptoms, not your theory)
4. Ask: "Based on what you are hearing, is this critical?"
5. Let them lead if it is above your authority
Time spent escalating (2 minutes) is faster than time spent guessing (30 minutes).
Part 8: Post-Incident Checklist (First 60 Minutes After Recovery)
Once the system has recovered, you are not done. Validate before you declare “all clear.”
Hour 0-1: Technical Validation
- [ ] Replication lag < 5 seconds
- [ ] Error rate < baseline + 0.1%
- [ ] All recent customer operations succeeded
- [ ] No new issues emerging in logs
- [ ] Database integrity checks passed
- [ ] All regions/zones responding
- [ ] Monitoring fully operational
Hour 1-4: Operational Validation
- [ ] Support team: no new anomalies
- [ ] Product team: feature flags working
- [ ] Finance: no stuck transactions
- [ ] Security team: no signs of exploitation
- [ ] Internal teams: no cascading issues discovered
Decision: Is It Safe to Declare “All Clear”?
If all validation checks pass:
1. Tech lead: "Technical systems nominal"
2. IC: "Incident resolved, declaring all clear"
3. Comms lead: Notify customers
4. Leadership: Brief leadership
If validation shows problems:
1. Declare incident re-open
2. Apply appropriate playbook
3. Do not declare all clear until validation passes
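The all-clear decision is a gate: every check must pass, and any failure should be named. A minimal sketch with hypothetical check names. `error_rate_ok` encodes the "baseline + 0.1%" rule from the technical checklist, with both rates expressed in percent.

```python
def error_rate_ok(current_pct, baseline_pct):
    """True when the error rate is below baseline + 0.1 percentage points."""
    return current_pct < baseline_pct + 0.1


def all_clear(checks):
    """Return (ok, failed_names) for a dict of check name -> passed (bool)."""
    failed = [name for name, passed in checks.items() if not passed]
    return (not failed, failed)
```

Example: `all_clear({"replication lag": True, "error rate": error_rate_ok(0.4, 0.2)})` reports the error-rate check as failed, so the incident stays open.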
Hour 4+: Schedule the Postmortem
When: Within 48 hours of the incident being resolved (not 2 weeks later)
Duration: 90 minutes
Attendees: Everyone who was in the incident + stakeholders
Purpose: Understand what happened, not who is to blame
Output: Document decision points and follow-up actions
Summary: The Discipline That Saves Time
This chapter teaches discipline. Discipline feels slower when you are panicking. It is actually faster.
Undisciplined incident response:
- 2 minutes: Start investigating everything
- 15 minutes: Still not sure what the problem is
- 30 minutes: Make a guess, rollback, things get worse
- 60 minutes: Finally isolate the real problem
- 90 minutes: Fix it
Disciplined incident response (using this chapter):
- 2 minutes: Run decision tree, classify the problem
- 5 minutes: Execute immediate action (rollback, restart, failover)
- 10 minutes: Know if it worked, escalate if not
- 30 minutes: Either fixed or escalated to the right people
- 60 minutes: Validated and declared
Print Appendix A. Laminate it. Use it. This chapter saves time by removing the time you would spend panicking.