友田 陽大
Observability & SRE
アーキテクチャ設計
AWS
TypeScript
サーバーレス

A practical guide to incident response 2026: designing Incident Commander, Runbooks, postmortems, and on-call the SRE way

This article explains how to build a team that holds up under production failures, staying faithful to Google's published SRE guidance. It covers the Incident Commander model, SEV1–4 severity design, a detect→mitigate→verify→communicate Runbook template, blameless postmortems, on-call hygiene (reducing toil and alert fatigue), and MTTD/MTTR and error budgets, with real code and templates that treat operations as part of the design.

Published
Reading time
20 min read
Author
友田 陽大

"When a failure occurs, we'll manage somehow on the spot." That is not operations; it is relying on luck. Having single-handedly designed and implemented a production system that moves money, and led its reliability and observability layer, I can say this with confidence: the quality of incident response is decided not after a failure occurs but by the roles, procedures, and criteria agreed on before it occurs.

This article stays faithful to Google's published SRE guidance (Managing Incidents / Postmortem Culture / Incident Response (Workbook) / Being On-Call), goes deeper than those sources on the judgment calls of when, how, and why, and turns the result into templates and code you can use in the field tomorrow. The examples come from operations I actually built for a serverless payment platform in the environmental sector (CloudWatch Alarms + Slack notifications + structured logs, and DR with AWS Backup / Vault Lock / PITR).

The scope of this article is what comes after setting up monitoring. What to measure (metrics, traces, logs) is covered in my article on the pillars of observability. This piece builds on that and designs how people and the organization move when something fails. If observability is visualization, incident response is deciding who acts, and how, on what has been made visible.

0. The big picture: incident response is made of "five parts"

Ad-hoc response breaks down because you judge from zero every time. A team strong in production prepares the following five parts in advance. Confusing them causes accidents.

| Part | Role | Chapter of this article |
| --- | --- | --- |
| Chain of command | Who decides, who acts, who communicates | §1 (Incident Commander) |
| Severity criteria | A yardstick to instantly decide "is this an incident?" | §2 (SEV1–4) |
| Runbook | The fixed procedure of detect→mitigate→verify→communicate | §3 |
| Detect→respond connection | The wiring where an alert points to a Runbook | §4 |
| Postmortem | Turns a failure into the organization's learning asset | §5 |

Two things keep these parts working: on-call operation (§6) and the metrics that drive improvement (§7). The order matters. Decide roles, decide criteria, write procedures, connect them to alerts, and learn when it's over. The rest of the article follows this flow.


1. The Incident Commander model: stop on-the-spot response, move by roles

Multiple people look at the same logs and each starts restarting or deploying on their own — this is the worst production response. Google SRE, in Managing Incidents, adopts the firefighting-derived Incident Command System (ICS). The core is the clear separation of responsibility that "everybody involved in the incident knows their role and doesn't stray onto someone else's turf."

1-1. The four roles

The SRE Workbook expresses good incident management as 3C (Coordinate / Communicate / Control). What handles these are the following roles.

| Role | In one phrase | Does | Doesn't do |
| --- | --- | --- | --- |
| Incident Commander (IC) | Command tower | Grasps the overall situation, assigns roles, makes decisions, removes obstacles. Holds any role that has not been delegated | Touching the system directly (the IC keeps hands off the keyboard) |
| Operations Lead (Ops) | The only executor | The actual work of mitigation and repair. "the only group modifying the system during an incident" | Overall coordination, external communication |
| Communications Lead (Comms) | The external face | Periodic updates to stakeholders, keeping the incident documents accurate | System changes |
| Planning Lead | Rear support | Long-term issues, filing bugs, preparing handoffs, tracking deviations from normal | Direct mitigation work |

In a small team (or solo), it is correct for the IC to hold every role at first. What matters is not headcount but that everyone knows which hat they are wearing right now. Delegate and separate the roles as the team grows.

1-2. The only truth: the Live Incident Document and the Command Post

The foundation on which ICS functions is two "single truths."

  • Recognized Command Post: a designated communication channel where stakeholders gather. On Slack, create a dedicated channel per incident. Nobody then loses time asking where the conversation is happening.
  • Live Incident State Document: central enough that the official guidance calls keeping it the IC's "most important responsibility is to keep a living incident document." Keep it somewhere several people can edit at once (Google Docs, etc.).

This document becomes, as-is, the material for the postmortem in §5 later. Precisely because it's written in real time, the timeline remains accurate. This is the decisive difference from "recalling and writing later."

1-3. Handoffs with an "explicit handover"

In a long failure, command changes hands. The official guidance insists on ruling out ambiguous changeovers. Say it explicitly, "You're now the incident commander, okay?", and the outgoing IC leaves only after the other person confirms. A handover that was only "sort of" done creates a command vacuum, which leads either to duplicated response or to nobody responding.

Judgment axis: when to activate ICS? The Workbook's lesson is clear. "Declare early when in doubt." In one case study, the fact that the team "did not declare an incident when problems first appeared" delayed resolution. The cost of declaring is roughly that of creating one Slack channel. Declaring too late costs far more.


2. Severity: have a yardstick that lets you "instantly decide" with SEV1–4

Debating "is this urgent?" on the spot every time makes the judgment itself a failure. Define severity in advance and create a state where the person who detects it can apply it in 5 seconds. Severity is for uniquely deciding "the speed of response, the people involved, the scope of communication."

| Severity | Definition (impact) | Initial-response target | Escalation | Example |
| --- | --- | --- | --- | --- |
| SEV1 | Full outage / data loss / payment double-charge or miss | Immediate (page) | IC appointed immediately + related team leads + management report | Payment webhooks all fail, all APIs 5xx from production DB connection exhaustion |
| SEV2 | Serious degradation of a major function (broad user impact) | Within minutes | Primary on-call + IC if needed | Login impossible in some regions, charging goes through but receipts not issued |
| SEV3 | Limited / workaround exists / degradation of a single function | Within business hours | Primary on-call handles, file a ticket if needed | Admin-panel CSV export fails, non-critical job delay |
| SEV4 | Almost no user impact / internal only | Next business day | Ticket only | Minor staging error, increased log noise |

The numbers (response targets) are derived by working backward from your team's SLO. For example, at 99.99% availability, the allowed downtime per quarter is about 13 minutes (0.01% of a 90-day quarter is 12.96 minutes). To meet a target at that level, Being On-Call gives example paging response times of 5 minutes for highly time-critical services and 30 minutes for less time-sensitive ones. Deriving from the SLO how many minutes may pass before a human starts working on a SEV1 is what separates this from ad-hoc response.
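The arithmetic behind that figure can be sketched as follows. This is only an illustration: the function names `allowedDowntimeMinutes` and `shareOfBudget` are hypothetical, and a quarter is assumed to be 90 days.

```typescript
// Allowed downtime (in minutes) for an availability SLO over a time window.
function allowedDowntimeMinutes(sloPercent: number, windowDays: number): number {
  if (sloPercent <= 0 || sloPercent >= 100) {
    throw new RangeError("sloPercent must be strictly between 0 and 100");
  }
  const windowMinutes = windowDays * 24 * 60;
  return (windowMinutes * (100 - sloPercent)) / 100;
}

// Fraction of the downtime budget consumed just by waiting for a human to respond.
function shareOfBudget(responseMinutes: number, budgetMinutes: number): number {
  return responseMinutes / budgetMinutes;
}

// Assumption: a quarter is approximated as 90 days.
const quarterlyBudget = allowedDowntimeMinutes(99.99, 90); // about 12.96 minutes
const fiveMinuteShare = shareOfBudget(5, quarterlyBudget); // roughly 0.39
```

Under these assumptions, a single 5-minute response delay already uses more than a third of the quarterly budget, which is why the SEV1 response target has to come from the SLO rather than from habit.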

The severity axis I designed for the payment platform was not the number of affected users but the consistency of money. A single wrong charge outranks a broken display. The severity yardstick should reflect whatever your product can least afford to break.
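To make the "apply it in 5 seconds" idea concrete, the table can be reduced to a small decision function. The sketch below is an assumption-laden illustration: the `ImpactSignals` fields and the `classifySeverity` name are hypothetical, and the rule order simply mirrors the table, with money consistency checked first.

```typescript
type Severity = "SEV1" | "SEV2" | "SEV3" | "SEV4";

// Hypothetical yes/no questions the person who detects the problem answers.
interface ImpactSignals {
  moneyConsistencyAtRisk: boolean; // double charge or missed charge
  dataLoss: boolean;
  fullOutage: boolean;
  majorFunctionDegraded: boolean; // broad user impact
  userImpact: boolean; // limited impact, or a workaround exists
}

// Checks the most severe conditions first, so the first match wins.
function classifySeverity(s: ImpactSignals): Severity {
  if (s.moneyConsistencyAtRisk || s.dataLoss || s.fullOutage) return "SEV1";
  if (s.majorFunctionDegraded) return "SEV2";
  if (s.userImpact) return "SEV3";
  return "SEV4";
}

// Example: charging works but receipts are not issued (the SEV2 row of the table).
const receiptsMissing = classifySeverity({
  moneyConsistencyAtRisk: false,
  dataLoss: false,
  fullOutage: false,
  majorFunctionDegraded: true,
  userImpact: true,
});
```

Ordering the checks from most to least severe means a single wrong charge is SEV1 even when almost no users are affected, which is the priority described above.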


3. Runbook design: detect → triage → mitigate → verify → communicate

A Runbook is "a procedure that even a just-woken on-call engineer can follow without thinking." A good Runbook requires no creativity. The Workbook's overriding principle is "First responders must prioritize mitigation above all else." Stopping the bleeding comes before root-cause analysis. The official priority is "Stop the bleeding, restore service, and preserve the evidence."

3-1. The skeleton of a Runbook (template)

Align all Runbooks to the same skeleton. When the format is aligned, the on-call person doesn't get lost on "where to look next."

# Runbook: <alert name / failure scenario>
Last updated: 2026-06-24 / Owner: <team> / Related SLO: <SLO name>

## 0. The symptom this Runbook covers
- Firing alert: <Alarm name / metric condition>
- Expected severity: SEV2 (escalates to SEV1 if the impact reaches all charging)

## 1. Detect — confirm what's happening
- Dashboard: <URL>
- Confirmation query/log: <the filter expression of structured logs>
- Judging "is it real": a single spike or continuous, what's the scope?

## 2. Triage — fix the severity and stand up an IC
- Judge the severity with the table in §2
- If SEV1/2: create the incident channel, declare an IC, open the Live Doc

## 3. Mitigate — first stop the bleeding (before root-cause analysis)
- [ ] First move (stops the impact fastest): e.g. turn the feature off with a feature flag / roll back the last deploy
- [ ] Second move: e.g. scale out / rate limit / fail over
- Write each step as "a command you can copy-paste and run"

## 4. Verify — is it really fixed
- The condition for judging recovery (the metric stays below the threshold for N minutes)
- Recovery of missed processing (reprocessing, compensating transaction)

## 5. Comms — who to tell what
- Initial-report template / frequency of periodic updates (SEV1 every 30 minutes, etc.) / recovery report
- Whether a status-page update is needed

## 6. Postmortem
- Whether a postmortem is needed (the criteria of §5) / link to the Live Doc

3-2. A real example: payment-webhook backlog stagnation (SEV2 → SEV1 candidate)

The webhook processing from the payment provider clogs, and charging events pile up unprocessed — one of the scariest symptoms in a system where money moves. The core of a concretized Runbook (mitigate to verify) becomes this.

## 3. Mitigate
- [ ] Confirm the queue depth (SQS ApproximateNumberOfMessages / DLQ count)
- [ ] Whether the consumer has crashed (ECS task count / exceptions in error logs)
- [ ] First move: scale out the consumer to drain the stagnation
- [ ] If it has crashed: roll back the last deploy (revert the causing commit)
- ⚠️ Never do this: delete all the events to clear the backlog
       → Because every event has an idempotency key, reprocessing everything is safe. Deleting loses charges.

## 4. Verify
- [ ] The queue depth returns to the normal value and stays for 10 minutes
- [ ] Re-inject the DLQ events (double-charging structurally doesn't occur with the idempotency key)
- [ ] Reconcile the charging ledger and the provider-side count (confirm zero difference)

What works here is idempotency. If you give the payment event a deterministic idempotency key and the processing side absorbs duplicates, then "reprocess all stagnated events" becomes a safe mitigation. Precisely because this design existed, my payment platform could maintain 0 double charges in production. A Runbook's mitigation is safe only when supported by the system's design (idempotency, ease of rollback). Even if the procedure alone is splendid, without the foundation it could become "reprocess = double charge."
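Why reprocessing is safe can be shown with a minimal sketch. It is not the platform's actual implementation: `PaymentEvent`, `idempotencyKey`, and `ChargeLedger` are hypothetical names, and the in-memory `Map` only stands in for a durable store in which the check-and-record step would have to be atomic.

```typescript
interface PaymentEvent {
  provider: string;
  eventId: string; // assumed stable across redeliveries of the same event
  amount: number;
}

// Deterministic: the same event always yields the same key.
function idempotencyKey(e: PaymentEvent): string {
  return `${e.provider}:${e.eventId}`;
}

class ChargeLedger {
  private readonly applied = new Map<string, number>();

  // Returns true if the charge was applied, false if it was absorbed as a duplicate.
  apply(e: PaymentEvent): boolean {
    const key = idempotencyKey(e);
    if (this.applied.has(key)) return false;
    this.applied.set(key, e.amount);
    return true;
  }

  total(): number {
    let sum = 0;
    this.applied.forEach((amount) => {
      sum += amount;
    });
    return sum;
  }
}

// Made-up backlog: process it once, then reprocess all of it as the Runbook allows.
const ledger = new ChargeLedger();
const backlog: PaymentEvent[] = [
  { provider: "psp", eventId: "evt_1", amount: 1000 },
  { provider: "psp", eventId: "evt_2", amount: 500 },
];
backlog.forEach((e) => ledger.apply(e));
const appliedOnReplay = backlog.filter((e) => ledger.apply(e)).length;
```

In this sketch the replay applies nothing and the ledger total is unchanged, which is the property that makes "reprocess everything" a safe mitigation step.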

3-3. One more example: DB connection exhaustion (SEV1)

What's easily overlooked in a serverless configuration is DB connection exhaustion. When Lambda's concurrency spikes and each execution grabs a connection, it hits the limit and all APIs become 5xx.

## 3. Mitigate
- [ ] Whether RDS's DatabaseConnections metric is pinned at the limit
- [ ] First move: rate-limit the runaway caller / temporarily narrow the concurrency
- [ ] Second move: force-close idle connections, route via a connection proxy (RDS Proxy)
- Permanent fix (not mitigation; belongs in the postmortem's action items):
     connection-pool limit design, a permanent connection proxy designed for serverless

## 4. Verify
- [ ] The connection count returns to the safe zone, and the error rate returns within SLO and stays for 15 minutes

A Runbook is "perishable." A stale Runbook is more dangerous than none. Always record an owner and a last-updated date on each Runbook, and update it after every postmortem. §5 and §6 are the mechanisms that keep this cycle running.


4. The detect→respond connection: an alert always points to a Runbook

Even if you set up observability (what and how to measure with OpenTelemetry, SRE implementation in an AWS environment), if the alert and the Runbook aren't connected, the on-call engineer freezes at "so, what do I do?" the moment the page fires. There is one iron rule: every paging alert links to its corresponding Runbook.

Being On-Call warns of alert fatigue. When low-priority alerts flood, it causes "serious alerts to be treated with less attention than necessary." The aim is a "1:1 alert/incident ratio" — make only alerts that satisfy "a human should definitely move when it rings" into paging.

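// Define a CloudWatch alarm that carries its Runbook (CDK / TypeScript)
// Intent: every alert has a Runbook URL, so detection and response are structurally connected
// Fragment: runs inside a CDK Stack/Construct (`this`); `webhookQueueDepthMetric` (a metric)
// and `oncallTopic` (an SNS topic) are assumed to be defined elsewhere in that construct.
import { Alarm, ComparisonOperator, TreatMissingData } from "aws-cdk-lib/aws-cloudwatch";
import { SnsAction } from "aws-cdk-lib/aws-cloudwatch-actions";

const webhookBacklog = new Alarm(this, "WebhookBacklogHigh", {
  metric: webhookQueueDepthMetric,
  threshold: 1_000,           // the level at which a human should act, derived from the normal baseline
  evaluationPeriods: 3,       // don't fire on a single spike (false positives cause alert fatigue)
  comparisonOperator: ComparisonOperator.GREATER_THAN_THRESHOLD,
  treatMissingData: TreatMissingData.NOT_BREACHING,
  // ★ Embed the Runbook and severity in the alarm description: one click from notification to procedure.
  alarmDescription: [
    "SEV2: payment webhook backlog. Escalate to SEV1 if it spreads to all charging.",
    "Runbook: https://runbooks.internal/payment-webhook-backlog",
  ].join("\n"),
});
webhookBacklog.addAlarmAction(new SnsAction(oncallTopic)); // → to Slack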

In the Slack notification payload, always include three things: the severity, the Runbook link, and the first move. The notification body becomes "the starting point of thinking," and the on-call engineer can act without searching. This is the wiring I actually built for the payment platform: CloudWatch Alarm → SNS → Lambda → Slack, with structured-log context attached. It turns "you can't respond to what you can't see" into a single line running from detection to procedure.
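The formatting step of that notification can be sketched as below. This is an illustration under assumptions rather than the actual Lambda: `PagingAlert` and `buildSlackPayload` are hypothetical names, the first-move text is made up, and the only external fact relied on is that a Slack incoming webhook accepts a JSON body with a `text` field.

```typescript
type Severity = "SEV1" | "SEV2" | "SEV3" | "SEV4";

// Hypothetical shape of what the notifier knows about a paging alert.
interface PagingAlert {
  alarmName: string;
  severity: Severity;
  runbookUrl: string;
  firstMove: string;
}

// Builds the body for a Slack incoming webhook.
// Throws when the Runbook link or first move is missing, so an alert
// without them cannot be wired up silently.
function buildSlackPayload(alert: PagingAlert): { text: string } {
  if (alert.runbookUrl.trim() === "") {
    throw new Error(`${alert.alarmName}: runbookUrl is required`);
  }
  if (alert.firstMove.trim() === "") {
    throw new Error(`${alert.alarmName}: firstMove is required`);
  }
  const lines = [
    `[${alert.severity}] ${alert.alarmName}`,
    `Runbook: ${alert.runbookUrl}`,
    `First move: ${alert.firstMove}`,
  ];
  return { text: lines.join("\n") };
}

const payload = buildSlackPayload({
  alarmName: "WebhookBacklogHigh",
  severity: "SEV2",
  runbookUrl: "https://runbooks.internal/payment-webhook-backlog",
  firstMove: "Scale out the webhook consumer",
});
```

Failing loudly on a missing Runbook link is a design choice in this sketch: it enforces the iron rule at build or deploy time instead of at 3 a.m.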


5. The blameless postmortem: turn a failure into "the organization's learning"

The ROI of failure response is recovered in the postmortem. Postmortem Culture defines a postmortem as "a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring."

5-1. The true meaning of "blameless"

Blameless is not about "let's be kind." The official defines it as "assumes that everyone involved in an incident had good intentions and did the right thing with the information they had." Seek the cause in the system/process, not the individual. The reason is practical — "Removing blame from a postmortem gives people the confidence to escalate issues without fear." A team that hunts for culprits hides the next failure. This is fatal to reliability.

5-2. When to write it (decide the criteria in advance)

The official's most important message: "It is important to define postmortem criteria before an incident occurs." The criteria I use are as follows.

✅ Write a postmortem (if even one applies)
   ・A user-impacting outage/degradation exceeded the threshold (SEV1/SEV2 are mandatory in principle)
   ・Data loss of any kind occurred
   ・On-call did a manual intervention (rollback, traffic detour, etc.)
   ・The time to resolution exceeded the threshold
   ・Monitoring couldn't detect it and a human noticed (= a monitoring bug. The most important class)

❌ No need to write
   ・SEV4 internal-only / no impact (a ticket suffices)

The key is including "monitoring couldn't detect it and a human noticed" as a criterion. This is a defect in observability itself, and leaving it means you can't notice next time.
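Because the criteria are fixed in advance, they can be evaluated mechanically when an incident closes. The following is a minimal sketch of that idea. The `IncidentFacts` fields and function names are hypothetical, and the thresholds themselves are assumed to have been applied upstream.

```typescript
type Severity = "SEV1" | "SEV2" | "SEV3" | "SEV4";

// Hypothetical facts recorded when the incident is closed.
interface IncidentFacts {
  severity: Severity;
  userImpactExceededThreshold: boolean;
  dataLoss: boolean;
  manualIntervention: boolean; // rollback, traffic detour, etc.
  resolutionExceededThreshold: boolean;
  detectedByHuman: boolean; // monitoring did not catch it
}

// Returns every criterion that applies; one is enough to require a postmortem.
function postmortemReasons(f: IncidentFacts): string[] {
  const reasons: string[] = [];
  if (f.severity === "SEV1" || f.severity === "SEV2") reasons.push("SEV1/SEV2 is mandatory in principle");
  if (f.userImpactExceededThreshold) reasons.push("user impact exceeded the threshold");
  if (f.dataLoss) reasons.push("data loss occurred");
  if (f.manualIntervention) reasons.push("on-call intervened manually");
  if (f.resolutionExceededThreshold) reasons.push("time to resolution exceeded the threshold");
  if (f.detectedByHuman) reasons.push("monitoring missed it and a human noticed");
  return reasons;
}

function needsPostmortem(f: IncidentFacts): boolean {
  return postmortemReasons(f).length > 0;
}
```

Returning the reasons rather than a bare boolean lets the postmortem open with why it is being written, including the "monitoring missed it" case that is easy to wave away.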

5-3. Postmortem template (fill-in-the-blank)

# Postmortem: <title> (SEV<N>)
Status: Draft / In Review / Final
Author: <name>  Reviewer: <name>  Date: 2026-06-24

## Summary (3 lines)
What, for how long, who was impacted.

## Impact
- Duration: <detected HH:MM → recovered HH:MM>
- User impact: <in numbers. measured, not guessed>
- Business impact: <e.g.: N charging delays. Data consistency maintained/lost>

## Timeline (transcribed from the Live Doc)
- HH:MM Alert fired (note auto / manual detection)
- HH:MM IC declared, mitigation started
- HH:MM Mitigation complete, verification started
- HH:MM Recovery confirmed

## Root Cause
"Why" 5 times. Write both the technical factor + the process factor.
※ Don't make an individual the subject (not "A deleted it" but "there was no guard preventing deletion")

## Detection
How you noticed. If it wasn't auto-detection, that itself is an action item.

## What went well / what went poorly in the response
- Went well: ~
- Went poorly: ~
- Got lucky: ~ ← next, into a mechanism that doesn't rely on luck

## Action items (by the criteria of §5-4)
| # | Content | Type (prevent/mitigate/process) | Owner | Due | Ticket |
| - | ------- | ------------------------------- | ----- | --- | ------ |

Explicitly writing "where we got lucky" is a good habit derived from the official. The point that happened to be saved this time is exactly the place that should be changed into a mechanism that doesn't rely on luck next time.

5-4. Good action items vs. bad action items

A postmortem's value is almost decided by the quality of the action items. A vague reflection changes nothing.

| Viewpoint | ❌ Bad action item | ✅ Good action item |
| --- | --- | --- |
| Specificity | "be careful," "thoroughly pay attention" | "add a Runbook for auto-scaling to the webhook-backlog alert" |
| Owner/due | who does it by when is unknown | owner = @yuta, due = 7/15, ticket = OPS-123 |
| Root/symptom | ends at "this time we restarted by hand" | "permanently install an RDS Proxy to prevent connection exhaustion, and alert on the limit" |
| Verifiability | the judgment of completion is vague | "reproduce it in a failure drill and confirm the auto-mitigation works" |
| Subject | blames the individual ("○○ is to blame") | fixes the system/process ("add a guard") |

A good action item is SMART (specific, measurable, assigned, realistic, time-bound). It only counts once it is filed as a ticket, given a due date, and tracked to completion. A postmortem whose action items are never done is just an essay.
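Part of this check can be automated before a postmortem is marked Final. The sketch below is one possible lint, not an established tool: `ActionItem`, `actionItemProblems`, and the vague-phrase list are illustrative assumptions, and the year in the example due date is assumed.

```typescript
interface ActionItem {
  content: string;
  type: "prevent" | "mitigate" | "process";
  owner: string;
  due: string; // expected as YYYY-MM-DD
  ticket: string;
}

// Illustrative list only; extend it with the phrases your team overuses.
const VAGUE_PHRASES = ["be careful", "pay attention"];

// Returns the problems found; an empty array means the item passes this lint.
function actionItemProblems(item: ActionItem): string[] {
  const problems: string[] = [];
  const lower = item.content.toLowerCase();
  if (VAGUE_PHRASES.some((phrase) => lower.indexOf(phrase) !== -1)) {
    problems.push("content is vague");
  }
  if (item.owner.trim() === "") problems.push("no owner");
  if (!/^\d{4}-\d{2}-\d{2}$/.test(item.due)) problems.push("due date missing or not YYYY-MM-DD");
  if (item.ticket.trim() === "") problems.push("no ticket");
  return problems;
}

const goodItem: ActionItem = {
  content: "Permanently install an RDS Proxy and alert on the connection limit",
  type: "prevent",
  owner: "@yuta",
  due: "2026-07-15", // the year is an assumption for this example
  ticket: "OPS-123",
};
const badItem: ActionItem = { content: "Be careful next time", type: "process", owner: "", due: "", ticket: "" };
const goodProblems = actionItemProblems(goodItem);
const badProblems = actionItemProblems(badItem);
```

A lint like this only catches missing fields and obvious filler. Whether an item addresses the root cause still needs a human reviewer.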

The official also lists devices to cultivate the culture: Postmortem of the month (company-wide sharing of good cases), Wheel of Misfortune (role-play training of past failures), and visible praise for excellent postmortems. A state where "the person who wrote up a failure is rewarded" prevents concealment.


6. On-call hygiene: design a sustainable on-call system

However good your Runbooks and postmortems are, none of it works if the on-call engineer is exhausted. Being On-Call argues for keeping on-call healthy in both "quantity" and "quality."

6-1. The upper limit of quantity: structurally prevent burnout

The official's concrete numbers are clear.

  • Secure at least 50% engineering time. Of the rest, "no more than 25% can be spent on-call." To meet this, a single-site 24/7 needs at least 8 people.
  • One incident response (root-cause analysis, repair, writing the postmortem, bug fix) takes about 6 hours. Therefore the sustainable upper limit is "2 per 12-hour on-call shift."

The reason these numbers matter is that you can judge "too much on-call" by data, not feeling. If 3 or more ring every shift, that's not a matter of human guts but a problem of the alert design or reliability itself.
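Both numbers can be reproduced, and checked against your own paging history, with a few lines. This is a sketch under stated assumptions: the function names are hypothetical, and the team-size calculation assumes a primary and a secondary are both on call at any time, which is one way to arrive at 8.

```typescript
const SHIFT_HOURS = 12;
const HOURS_PER_INCIDENT = 6; // rough cost of one incident, per the figure above

// Sustainable number of incidents in one shift: 12 / 6 = 2.
const maxIncidentsPerShift = Math.floor(SHIFT_HOURS / HOURS_PER_INCIDENT);

// Returns the indexes of shifts that exceeded the sustainable limit.
function overloadedShifts(incidentsPerShift: number[]): number[] {
  const overloaded: number[] = [];
  incidentsPerShift.forEach((count, index) => {
    if (count > maxIncidentsPerShift) overloaded.push(index);
  });
  return overloaded;
}

// Minimum team size so that nobody spends more than maxOnCallFraction of their time on call.
function minEngineers(maxOnCallFraction: number, onCallAtOnce: number): number {
  return Math.ceil(onCallAtOnce / maxOnCallFraction);
}

// Assumption: primary + secondary (2 people) on call at once, 25% cap per engineer.
const singleSiteMinimum = minEngineers(0.25, 2); // 8

// Made-up history: incidents counted per 12-hour shift over one week.
const overloaded = overloadedShifts([0, 1, 3, 2, 0, 4, 1]);
```

In the made-up week, shifts 2 and 5 exceed the limit. A recurring pattern like that is the data-based signal to revisit alert design or reliability.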

6-2. Escalation policy: when in doubt, escalate

Define primary and secondary. The secondary is a "fall-through for the pages the primary on-call misses." A clear escalation path and "well-defined incident-management procedures" lower stress and prevent judgment errors in a state where stress hormones "impair cognitive functions." Don't ask a tired human for complex improvisation — this is the point of the design.

# escalation-policy.yaml: an example of expressing escalation as configuration
service: payment-platform
severity_routing:
  SEV1:
    page: primary_oncall
    escalate_after: 5m   # if the primary doesn't respond
    then: [secondary_oncall, team_lead]
    notify: [eng_manager, status_page]
  SEV2:
    page: primary_oncall
    escalate_after: 15m
    then: [secondary_oncall]
  SEV3:
    notify: oncall_channel  # no paging (handled within business hours)
oncall:
  rotation: weekly
  handoff: "Every Monday 10:00: hand over unresolved incidents and Runbook updates"
  compensation: "time-off-in-lieu (per the official guidance: compensate on-call with pay or time off)"
6-3. Toil reduction: eradicating manual work is itself cost reduction

Toil = operational work that's manual, repetitive, and automatable. Toil is a hard-to-see cost. If you restart by hand and aggregate logs by hand every time, you're paying a continuous subscription called labor cost. If a Runbook's procedure is a list of copy-paste commands, it's an automation candidate. The official even allows a team that's fallen into operational overload to temporarily "give back the pager" to development. Toil reduction is not a fringe benefit but the main thrust of cost optimization and reliability improvement.

Under-on-call is also a problem. The official recommends "every engineer to be on-call at least once or twice a quarter." Without on-call experience, you can't move when it matters.


7. MTTD / MTTR and error budgets: decide what to improve with numbers

If you talk about "did the response get better" by feeling, improvement stops. Decide where to invest with the following metrics.

| Metric | Meaning | If this is bad… | Effective measure |
| --- | --- | --- | --- |
| MTTD (mean time to detect) | failure occurs → detection | Holes in monitoring. Manual discovery becomes the norm | Strengthen observability, review alert thresholds |
| MTTR (mean time to recover) | detection → recovery | Mitigation is slow / person-dependent | Develop Runbooks, auto-mitigation, ease rollback |

If MTTD is large, invest in observability; if MTTR is large, invest in Runbooks and automation — the metrics name the investment target for you. Note that these, being "means," are weak to outliers. Since they're dragged by a few SEV1s, look at them per severity or also use the median/percentiles.
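The effect of an outlier on the mean is easy to see in a small calculation. The sketch below uses made-up numbers, and `IncidentRecord` and `summarize` are hypothetical names. It computes MTTD, MTTR, and the median time to recover for one severity.

```typescript
interface IncidentRecord {
  severity: string;
  minutesToDetect: number; // failure occurs → detection
  minutesToRecover: number; // detection → recovery
}

function mean(values: number[]): number {
  if (values.length === 0) throw new RangeError("no values");
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function median(values: number[]): number {
  if (values.length === 0) throw new RangeError("no values");
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Summarizes one severity at a time so a few SEV1s do not distort the rest.
function summarize(records: IncidentRecord[], severity: string) {
  const subset = records.filter((r) => r.severity === severity);
  const ttd = subset.map((r) => r.minutesToDetect);
  const ttr = subset.map((r) => r.minutesToRecover);
  return { mttd: mean(ttd), mttr: mean(ttr), medianTtr: median(ttr) };
}

// Made-up data: one long SEV2 drags the mean far above the typical case.
const history: IncidentRecord[] = [
  { severity: "SEV2", minutesToDetect: 2, minutesToRecover: 10 },
  { severity: "SEV2", minutesToDetect: 4, minutesToRecover: 20 },
  { severity: "SEV2", minutesToDetect: 6, minutesToRecover: 300 },
  { severity: "SEV3", minutesToDetect: 30, minutesToRecover: 60 },
];
const sev2Summary = summarize(history, "SEV2");
```

For this data the SEV2 MTTR is 110 minutes while the median is 20, so reporting only the mean would misdescribe how a typical SEV2 actually goes.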

Use only values actually measured on your own system. A borrowed MTTR or a made-up uptime figure is lost, together with your credibility, at the first failure. All I disclose are verifiable outcomes that I achieved through design, such as 0 double charges in production.

What connects these metrics to the trigger for action is the error budget. If the SLO is 99.9%, the remaining 0.1% is the allowed error, that is, the budget. When the budget is used up, stop shipping new features and invest in reliability. It is a mechanism that decides "when to enter response mode" by data rather than politics. With it, the tug-of-war of "failure response vs. feature development" turns from an emotional argument into a formula.
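The formula can be written down directly. This sketch assumes a request-based SLO (good requests divided by total requests over the window) and uses hypothetical function names and made-up traffic numbers.

```typescript
// Fraction of the error budget still unspent: 1 means untouched, 0 or less means used up.
function errorBudgetRemaining(sloPercent: number, totalRequests: number, failedRequests: number): number {
  const allowedFailures = (totalRequests * (100 - sloPercent)) / 100;
  if (allowedFailures <= 0) throw new RangeError("this SLO leaves no error budget");
  return 1 - failedRequests / allowedFailures;
}

// The policy decision: once the budget is gone, features stop and reliability work starts.
function shouldFreezeFeatures(remaining: number): boolean {
  return remaining <= 0;
}

// Made-up window: 1,000,000 requests at a 99.9% SLO allows about 1,000 failures.
const remainingAfter250 = errorBudgetRemaining(99.9, 1_000_000, 250); // about 0.75
const remainingAfter1200 = errorBudgetRemaining(99.9, 1_000_000, 1_200); // about -0.2
const freezeNow = shouldFreezeFeatures(remainingAfter1200);
```

With 250 failures about three quarters of the budget is left, and with 1,200 it is overspent, so the freeze decision follows from the numbers alone.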


8. Common pitfalls

Let me list, with countermeasures, the failures I repeatedly see in the field.

  • ❌ Everyone looks at logs without deciding roles → multiple people deploy/restart simultaneously and widen the wound. ✅ Stand up an IC and unify system changes to Ops (§1).
  • ❌ Start from root-cause analysis → stopping the bleeding is delayed and the damage grows. ✅ Mitigate first. The cause belongs in the postmortem.
  • ❌ The alert has no Runbook → the on-call person freezes the moment it rings. ✅ Embed a Runbook link and the first move in every paging alert (§4).
  • ❌ Ring too many alerts → miss serious alerts from alert fatigue. ✅ Make only "a human definitely moves when it rings" into paging (1:1 alert/incident).
  • ❌ Hunt for culprits in the postmortem → the next failure is concealed. ✅ Thoroughly blameless. The subject is the system/process (§5).
  • ❌ Don't file action items → the same failure recurs. ✅ Write them SMART and track with due date, owner, ticket.
  • ❌ Paste secrets in the incident channel → in the rush to recover, you paste an API key, token, or customer PII into Slack or the Live Doc. ✅ Hygiene in incident communication too. Don't paste credentials but show them by reference name, and minimize PII. Write on the premise that the channel/document will be widely shared later.
  • ❌ The Runbook is old → expand the damage with false procedures. ✅ Make the owner and last-updated date mandatory, and update on each postmortem.
  • ❌ The on-call rings continuously every shift → burnout & turnover. ✅ When it exceeds "2 per 12 hours," take it as a problem of reliability or alert design (§6).

What works cross-cuttingly is four non-functional requirements. Observability (you can't respond to what you can't see), reliability (idempotency and ease of rollback make mitigation safe), cost (toil is a continuous subscription; erase it with automation), and security (the hygiene of incident communication: don't paste secrets). Incident response is the place where these four intersect.


Summary: operations can be "designed"

A team that is strong against production failures is strong not through grit but through advance design. To close, here are the key points in five lines.

  1. Separate roles with the Incident Commander model. The IC decides, only Ops touches the system, Comms communicates. Make the Live Doc and the Command Post "the only truth." When in doubt, declare early.
  2. Have a yardstick that lets you instantly decide severity with SEV1–4. The response target is derived backward from the SLO.
  3. A Runbook is the fixed form of detect→triage→mitigate→verify→communicate. Stop the bleeding before root-cause analysis. The procedure is safe only when the design supports it with idempotency and rollback.
  4. Turn failures into a learning asset with blameless postmortems. Define criteria in advance, and track action items as SMART. The subject is the system.
  5. Decide where to improve with data, via on-call hygiene (2 per 12 hours, the 50/25% rule, toil reduction) and MTTD/MTTR / error budgets.

"Fast, cheap, and safe with one person × generative AI (Claude Code)" — I take on everything end-to-end, from the application to the infrastructure, and even operation design. The source of this article's knowledge is the serverless payment platform where I led the payment-reliability layer and achieved 0 double charges in production. For consultation on development with operations included, including incident response, Runbook development, and on-call design, please use contact.


友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.
