# A practical guide to incident response 2026: designing Incident Commander, Runbooks, postmortems, and on-call the SRE way

> How to build a team that holds up under production failures, grounded in Google SRE's published guidance. It covers the Incident Commander model, SEV1–4 severity design, a detect→mitigate→verify→communicate Runbook template, blameless postmortems, on-call hygiene (reducing toil and alert fatigue), and MTTD/MTTR and error budgets, with real code and templates for designing systems with operations included.

- Published: 2026-06-24
- Author: 友田 陽大
- Tags: Architecture design, AWS, TypeScript, Serverless
- URL: https://tomodahinata.com/en/blog/incident-response-runbook-postmortem-oncall-sre-guide
- Category: Observability & SRE
- Pillar guide: https://tomodahinata.com/en/blog/opentelemetry-observability-production-tracing-metrics-logs

## Key points

- The quality of incident response is decided not after a failure occurs but by the roles, procedures, and criteria decided in advance.
- With the Incident Commander model, separate roles, let only Ops touch the system, and declare early when in doubt.
- A Runbook follows a fixed form of detect→triage→mitigate→verify→communicate, and stopping the bleeding comes before root-cause analysis.
- A blameless postmortem has its criteria defined in advance and tracks SMART action items.
- Decide where to improve with data, via on-call hygiene (at most 2 incidents per 12-hour shift, the 50/25% rule, toil reduction) and MTTD/MTTR and error budgets.

---

"When a failure occurs, we'll manage somehow on the spot" is **not operations; it is relying on luck.** Having single-handedly designed and implemented a production system that moves money, and having led its **reliability and observability layer,** I can say this with confidence: the quality of incident response is decided not after a failure occurs but by **the roles, procedures, and criteria decided before it occurs.**

This article stays faithful to Google SRE's published guidance ([Managing Incidents](https://sre.google/sre-book/managing-incidents/) / [Postmortem Culture](https://sre.google/sre-book/postmortem-culture/) / [Incident Response (Workbook)](https://sre.google/workbook/incident-response/) / [Being On-Call](https://sre.google/sre-book/being-on-call/)), but goes **deeper than the source on the judgment calls of when, how, and why,** and turns that guidance into templates and code you can use in the field starting tomorrow. The examples come from the operations I actually built for a [serverless payment platform in the environmental field](/case-studies/payment-platform-reliability) (CloudWatch Alarms + Slack notifications + structured logs, DR with AWS Backup / Vault Lock / PITR).

> The scope of this article: **"beyond setting up monitoring."** I covered what to measure (metrics, traces, logs) in the [observability-pillar article](/blog/opentelemetry-observability-production-tracing-metrics-logs). This piece builds on that and designs **how people and the organization move when something fails.** If observability is "visualization," incident response is "who acts, and how, on what was made visible."

## 0. The big picture: incident response is made of "five parts"

Ad-hoc response breaks down because you judge from zero every time. A team strong in production prepares the following five parts **in advance.** Confusing them causes accidents.

| Part | Role | Chapter of this article |
| --- | --- | --- |
| **Chain of command** | Who decides, who acts, who communicates | §1 (Incident Commander) |
| **Severity criteria** | A yardstick to instantly decide "is this an incident?" | §2 (SEV1–4) |
| **Runbook** | The fixed procedure of detect→mitigate→verify→communicate | §3 |
| **Detect→respond connection** | The wiring where an alert points to a Runbook | §4 |
| **Postmortem** | Turns a failure into the organization's learning asset | §5 |

In addition, the foundation that **keeps these turning** is on-call operation (§6) and the metrics that drive improvement (§7). The order has meaning. **Decide roles, decide criteria, write procedures, connect to alerts, and learn when it's over.** Below, I proceed in this flow.

---

## 1. The Incident Commander model: stop on-the-spot response, move by roles

Multiple people look at the same logs and each starts restarting or deploying on their own — this is **the worst production response.** Google SRE, in [Managing Incidents](https://sre.google/sre-book/managing-incidents/), adopts the firefighting-derived **Incident Command System (ICS).** The core is the **clear separation of responsibility** that "**everybody involved in the incident knows their role and doesn't stray onto someone else's turf**."

### 1-1. The four roles

The [SRE Workbook](https://sre.google/workbook/incident-response/) expresses good incident management as **3C (Coordinate / Communicate / Control).** What handles these are the following roles.

| Role | In one phrase | Does | Doesn't do |
| --- | --- | --- | --- |
| **Incident Commander (IC)** | The command center | Keeps the overall picture, assigns roles, makes decisions, removes obstacles. **The IC also covers** any role not delegated | Change the system hands-on |
| **Operations Lead (Ops)** | The only executor | The actual work of mitigation/repair. "**the only group modifying the system during an incident**" | Overall coordination, external communication |
| **Communications Lead (Comms)** | The external face | Periodic updates to stakeholders, maintaining the accuracy of documents | System changes |
| **Planning Lead** | Rear support | Long-term issues, filing bugs, preparing handoff, tracking deviations from normal | Direct mitigation work |

In a small team (or solo), having **the IC cover every role** is the right starting point. What matters isn't the headcount but that everyone knows "**which hat am I wearing right now.**" As the team grows, delegate and separate the roles.
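
The "which hat am I wearing" rule can be written down explicitly. The following is a minimal sketch rather than anything from the SRE material: `RoleAssignment`, `holderOf`, and `mayModifySystem` are hypothetical names, and the only rules encoded are the two from the table above (an undelegated role falls back to the IC, and only the Ops holder changes the system).

```ts
// Hypothetical model of the four roles from the table above.
type IncidentRole = "ic" | "ops" | "comms" | "planning";

interface RoleAssignment {
  ic: string;        // an IC is always required
  ops?: string;      // undefined means "not delegated yet"
  comms?: string;
  planning?: string;
}

// Any role that has not been delegated falls back to the IC.
function holderOf(role: IncidentRole, roles: RoleAssignment): string {
  return roles[role] ?? roles.ic;
}

// Only whoever currently holds the Ops hat may change the system.
function mayModifySystem(person: string, roles: RoleAssignment): boolean {
  return holderOf("ops", roles) === person;
}
```

In a solo incident the IC holds the Ops hat and may act. Once Ops is delegated, the same check starts returning `false` for the IC, which is the separation the model asks for.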

### 1-2. The single source of truth: the Live Incident Document and the Command Post

ICS rests on two single sources of truth.

- **Recognized Command Post**: a **designated communication channel** where stakeholders gather. In Slack, create a dedicated channel, so nobody spends time working out where the conversation is happening.
- **Live Incident State Document**: so central that the SRE book says the IC's "**most important responsibility is to keep a living incident document**." Place it where multiple people can edit simultaneously (Google Docs, etc.).

This document becomes, as-is, **the material for the postmortem** in §5 later. Precisely because it's written in real time, the timeline remains accurate. This is the decisive difference from "recalling and writing later."

### 1-3. Handoffs with an "explicit handover"

In a long incident, command changes hands. The SRE book insists on **ruling out ambiguous changeovers.** Say it explicitly, "**You're now the incident commander, okay?**", and have the outgoing IC leave **only after the successor confirms.** A handoff that was "sort of done" creates a command vacuum, which leads to either duplicated response or neglect.
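
The explicit handover is a two-step protocol: propose, then confirm. As a minimal sketch (the `CommandState` shape and both function names are hypothetical, not taken from the SRE book), command does not move until the successor acknowledges:

```ts
interface CommandState {
  commander: string;
  pendingHandoffTo?: string; // set while waiting for the successor to confirm
}

// Step 1: the current IC names a successor. Command does NOT move yet.
function proposeHandoff(state: CommandState, successor: string): CommandState {
  return { ...state, pendingHandoffTo: successor };
}

// Step 2: command moves only when the named successor confirms.
function confirmHandoff(state: CommandState, confirmedBy: string): CommandState {
  if (state.pendingHandoffTo !== confirmedBy) {
    throw new Error(`No pending handoff to ${confirmedBy}`);
  }
  return { commander: confirmedBy };
}
```

Because `proposeHandoff` leaves `commander` unchanged, there is never a moment when nobody is in command.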

> **Judgment axis: when to activate ICS?**
> The Workbook's lesson is clear. "**Declare early when in doubt.**" In one of its case studies, the team "**did not declare an incident when problems first appeared**," which delayed resolution. Declaring costs about as much as creating one Slack channel. **Under-declaring is overwhelmingly more expensive.**

---

## 2. Severity: have a yardstick that lets you "instantly decide" with SEV1–4

Debating "is this urgent?" on the spot every time makes **the judgment itself a failure.** Define severity in advance and create a state where the person who detects it can **apply it in 5 seconds.** Severity is for uniquely deciding "the speed of response, the people involved, the scope of communication."

| Severity | Definition (impact) | Initial-response target | Escalation | Example |
| --- | --- | --- | --- | --- |
| **SEV1** | Full outage / data loss / **double charge or missed charge** | Immediate (page) | IC appointed immediately + related team leads + management report | All payment webhooks fail; all APIs return 5xx because production DB connections are exhausted |
| **SEV2** | Serious degradation of a major function (broad user impact) | Within minutes | Primary on-call + IC if needed | Login impossible in some regions, charging goes through but receipts not issued |
| **SEV3** | Limited / workaround exists / degradation of a single function | Within business hours | Primary on-call handles, file a ticket if needed | Admin-panel CSV export fails, non-critical job delay |
| **SEV4** | Almost no user impact / internal only | Next business day | Ticket only | Minor staging error, increased log noise |

The numbers (response targets) are **derived backward from your team's SLO.** For example, at 99.99% availability, the allowed downtime per quarter is about 13 minutes (0.01% of the roughly 129,600 minutes in 90 days). To meet that level, [Being On-Call](https://sre.google/sre-book/being-on-call/) gives example "**paging response times**" of 5 minutes for user-facing or highly time-critical services and 30 minutes for less time-sensitive ones. **Deriving "within how many minutes does a human start working on our SEV1" from the SLO arithmetic** is what separates this from ad hoc response.
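
The arithmetic behind that 13-minute figure is short enough to keep next to the severity table. This is a sketch under one stated assumption, a 90-day quarter; the function names are mine:

```ts
// Minutes of downtime that an availability SLO allows over a window.
function allowedDowntimeMinutes(sloPercent: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * (1 - sloPercent / 100);
}

// Share of that allowance consumed before a human even starts responding.
function responseShareOfBudget(responseMinutes: number, budgetMinutes: number): number {
  return responseMinutes / budgetMinutes;
}

// Assumption: a quarter is treated as 90 days.
const quarterlyBudget = allowedDowntimeMinutes(99.99, 90);         // about 12.96 minutes
const fiveMinuteShare = responseShareOfBudget(5, quarterlyBudget); // about 0.39
```

At four nines, a 5-minute paging response already spends well over a third of the quarter's allowance, which is why the SEV1 response target cannot be chosen independently of the SLO.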

> The severity axis I designed for the payment platform was not the number of users but **"the consistency of money."** **A single wrong charge outranks** a broken display. The severity yardstick should reflect the one thing that product must never break.
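
A yardstick that can be applied in 5 seconds is, in effect, a short decision function. The sketch below encodes the SEV1–4 table above; the `ImpactSignals` field names are hypothetical, and a real classifier would use whatever signals your own definitions rest on.

```ts
type Severity = "SEV1" | "SEV2" | "SEV3" | "SEV4";

// Hypothetical yes/no questions a responder can answer at a glance.
interface ImpactSignals {
  fullOutage: boolean;
  dataLoss: boolean;
  paymentIntegrityAtRisk: boolean; // double charge or missed charge
  majorFunctionDegraded: boolean;  // broad user impact
  userFacing: boolean;             // any user impact at all
}

// Checks run from most to least severe, so the first match wins.
function classifySeverity(s: ImpactSignals): Severity {
  if (s.fullOutage || s.dataLoss || s.paymentIntegrityAtRisk) return "SEV1";
  if (s.majorFunctionDegraded) return "SEV2";
  if (s.userFacing) return "SEV3";
  return "SEV4";
}
```

Note that `paymentIntegrityAtRisk` alone is enough for SEV1, regardless of how many users are affected. That is the "consistency of money" axis expressed as code.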

---

## 3. Runbook design: detect → triage → mitigate → verify → communicate

A Runbook is "**a procedure that even a just-woken on-call engineer can follow without thinking.**" A good Runbook requires no creativity. The Workbook's overriding principle is "**First responders must prioritize mitigation above all else.**" **Stopping the bleeding comes before root-cause analysis.** Google SRE states the priority as "**Stop the bleeding, restore service, and preserve the evidence.**"

### 3-1. The skeleton of a Runbook (template)

Align all Runbooks to the same skeleton. When the format is aligned, the on-call person doesn't get lost on "where to look next."

```markdown
# Runbook: <alert name / failure scenario>
Last updated: 2026-06-24 / Owner: <team> / Related SLO: <SLO name>

## 0. The symptom this Runbook covers
- Firing alert: <Alarm name / metric condition>
- Expected severity: SEV2 (escalates to SEV1 if the impact reaches all charging)

## 1. Detect — confirm what's happening
- Dashboard: <URL>
- Confirmation query/log: <the filter expression of structured logs>
- Judging "is it real": a single spike or continuous, what's the scope?

## 2. Triage — fix the severity and stand up an IC
- Judge the severity with the table in §2
- If SEV1/2: create the incident channel, declare an IC, open the Live Doc

## 3. Mitigate — first stop the bleeding (before root-cause analysis)
- [ ] First move (stop the impact fastest), e.g. turn the feature off with a feature flag / roll back the last deploy
- [ ] Second move, e.g. scale out / rate limit / fail over
- Write each step as "a command you can copy-paste and run"

## 4. Verify — is it really fixed
- The condition for judging recovery (the metric stays below the threshold for N minutes)
- Recovery of missed processing (reprocessing, compensating transaction)

## 5. Comms — who to tell what
- Initial-report template / frequency of periodic updates (SEV1 every 30 minutes, etc.) / recovery report
- Whether a status-page update is needed

## 6. Postmortem
- Whether a postmortem is needed (the criteria of §5) / link to the Live Doc
```

### 3-2. A real example: payment-webhook backlog stagnation (SEV2 → SEV1 candidate)

Webhook processing from the payment provider clogs up, and charging events pile up unprocessed. This is **one of the scariest symptoms in a system that moves money.** The core of the concrete Runbook (Mitigate through Verify) looks like this.

```markdown
## 3. Mitigate
- [ ] Confirm the queue depth (SQS ApproximateNumberOfMessages / DLQ count)
- [ ] Check whether the consumer has crashed (ECS task count / exceptions in error logs)
- [ ] First move: scale out the consumer to drain the backlog
- [ ] If it has crashed: roll back the last deploy (revert the offending commit)
- ⚠️ Never "delete all" events to clear the backlog
       → Because every event has an idempotency key, "reprocess everything" is safe. Deleting loses charges.

## 4. Verify
- [ ] The queue depth returns to the normal value and stays for 10 minutes
- [ ] Re-inject the DLQ events (double-charging structurally doesn't occur with the idempotency key)
- [ ] Reconcile the charging ledger and the provider-side count (confirm zero difference)
```

What makes this work is **idempotency.** If each payment event carries a deterministic idempotency key and the processing side absorbs duplicates, then "reprocess all backlogged events" becomes a **safe mitigation.** Because this design was in place, my payment platform maintained **0 double charges in production.** A Runbook's mitigation is **safe only when the system's design (idempotency, ease of rollback) supports it.** However polished the procedure, without that foundation "reprocess" can mean "double charge."
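
To make the "absorbs duplicates" part concrete, here is a minimal in-memory sketch. It is not the platform's actual implementation: `ChargeLedger` and `reprocessAll` are hypothetical names, and a production system would enforce the key with a durable store (for example a unique constraint or a conditional write) rather than a `Map`.

```ts
interface ChargeEvent {
  idempotencyKey: string; // deterministic, e.g. derived from the provider's event id
  amount: number;
}

class ChargeLedger {
  private readonly applied = new Map<string, number>();

  // Returns true if the charge was applied, false if it was a duplicate.
  apply(event: ChargeEvent): boolean {
    if (this.applied.has(event.idempotencyKey)) return false;
    this.applied.set(event.idempotencyKey, event.amount);
    return true;
  }

  total(): number {
    let sum = 0;
    this.applied.forEach((amount) => { sum += amount; });
    return sum;
  }
}

// "Reprocess everything": safe because duplicates are absorbed by the key.
function reprocessAll(ledger: ChargeLedger, events: ChargeEvent[]): number {
  let newlyApplied = 0;
  for (const event of events) {
    if (ledger.apply(event)) newlyApplied++;
  }
  return newlyApplied;
}
```

Running `reprocessAll` twice over the same backlog leaves `total()` unchanged the second time, which is the property the Runbook's "re-inject the DLQ events" step depends on.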

### 3-3. One more example: DB connection exhaustion (SEV1)

An easily overlooked failure in a serverless configuration is **DB connection exhaustion.** When Lambda concurrency spikes and each execution grabs its own connection, the database hits its connection limit and **all APIs return 5xx.**

```markdown
## 3. Mitigate
- [ ] Check whether RDS's DatabaseConnections metric is pinned at the limit
- [ ] First move: rate-limit the runaway caller / temporarily lower the concurrency
- [ ] Second move: force-close idle connections, route via a connection proxy (RDS Proxy)
- Permanent fix (not mitigation; goes into the postmortem's action items):
     connection-pool limit design, a permanent proxy given the serverless setup

## 4. Verify
- [ ] The connection count returns to the safe zone, and the error rate returns within SLO and stays for 15 minutes
```
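
Both Verify sections above use the same shape of condition: the metric is back in the safe zone and stays there for N minutes. A minimal sketch of that check follows, assuming one sample per minute with the oldest first (the function name and that sampling assumption are mine):

```ts
// True only if the most recent `holdMinutes` samples are ALL below the threshold.
// Assumption: `perMinuteValues` holds one sample per minute, oldest first.
function hasRecovered(
  perMinuteValues: readonly number[],
  threshold: number,
  holdMinutes: number,
): boolean {
  if (holdMinutes <= 0 || perMinuteValues.length < holdMinutes) return false;
  const recent = perMinuteValues.slice(-holdMinutes);
  return recent.every((value) => value < threshold);
}
```

Requiring the full hold window, rather than a single good sample, avoids declaring recovery during a brief dip in the middle of an ongoing incident.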

> **A Runbook is "perishable."** An outdated Runbook is more dangerous than none. Always record **an owner and a last-updated date** on each Runbook, and update it with every postmortem. §5 and §6 are the mechanisms that keep this cycle turning.
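
The owner and last-updated rule is easy to check mechanically, for example in CI over the Runbook headers. The sketch below is illustrative only: `RunbookMeta` is a hypothetical shape, and the maximum age is a parameter because the article does not prescribe a number.

```ts
interface RunbookMeta {
  title: string;
  owner?: string;
  lastUpdated?: string; // ISO date such as "2026-06-24"
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns a list of problems; an empty list means the Runbook passes.
function runbookProblems(meta: RunbookMeta, today: Date, maxAgeDays: number): string[] {
  const problems: string[] = [];
  if (!meta.owner) problems.push("missing owner");
  if (!meta.lastUpdated) {
    problems.push("missing last-updated date");
    return problems;
  }
  const updated = new Date(meta.lastUpdated);
  if (isNaN(updated.getTime())) {
    problems.push("unparseable last-updated date");
    return problems;
  }
  const ageDays = Math.floor((today.getTime() - updated.getTime()) / MS_PER_DAY);
  if (ageDays > maxAgeDays) problems.push(`stale: last updated ${ageDays} days ago`);
  return problems;
}
```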

---

## 4. The detect→respond connection: an alert always points to a Runbook

Even if you set up observability ([what and how to measure with OpenTelemetry](/blog/opentelemetry-observability-production-tracing-metrics-logs), [SRE implementation in an AWS environment](/blog/aws-observability-opentelemetry-sre-ecs)), **if the alert and the Runbook aren't connected,** the on-call person freezes at "so, what do I do?" the moment it rings. There's one iron rule — **every paging alert has a link to the corresponding Runbook.**

[Being On-Call](https://sre.google/sre-book/being-on-call/) warns of **alert fatigue.** When low-priority alerts flood, it causes "**serious alerts to be treated with less attention than necessary.**" The aim is a "**1:1 alert/incident ratio**" — make only alerts that satisfy "**a human should definitely move when it rings**" into paging.

```ts
// Define a CloudWatch alarm "with a Runbook attached" (CDK / TypeScript)
// Intent: every alert carries a Runbook URL, so detection and response are structurally connected
// Assumes this runs inside a Construct, with webhookQueueDepthMetric and oncallTopic defined elsewhere
import { Alarm, ComparisonOperator, TreatMissingData } from "aws-cdk-lib/aws-cloudwatch";
import { SnsAction } from "aws-cdk-lib/aws-cloudwatch-actions";

const webhookBacklog = new Alarm(this, "WebhookBacklogHigh", {
  metric: webhookQueueDepthMetric,
  threshold: 1_000,           // the "a human should act" threshold, derived from the normal value
  evaluationPeriods: 3,       // don't fire on a single spike (false positives cause alert fatigue)
  comparisonOperator: ComparisonOperator.GREATER_THAN_THRESHOLD,
  treatMissingData: TreatMissingData.NOT_BREACHING,
  // ★ Embed the Runbook and severity in the alarm description: one click from notification to procedure.
  alarmDescription: [
    "SEV2: payment webhook backlog. Escalate to SEV1 if it spreads to all charging.",
    "Runbook: https://runbooks.internal/payment-webhook-backlog",
  ].join("\n"),
});
webhookBacklog.addAlarmAction(new SnsAction(oncallTopic)); // → on to Slack
```

In the Slack notification payload too, always include **three things: severity, Runbook link, and the first move.** The notification body becomes the starting point for thinking, and the on-call engineer can act without searching. This is exactly the wiring I built for the payment platform: **CloudWatch Alarm → SNS → Lambda → Slack,** with the context from structured logs attached. It turns "**you can't respond to what you can't see**" into a single line that runs from detection to procedure.
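
The formatting step inside that Lambda can be sketched as a pure function. This is an illustration, not the platform's code. `AlarmSnsMessage` lists only a few of the fields CloudWatch publishes to SNS, the `FirstMove:` line is a hypothetical convention added to the alarm description alongside the `Runbook:` line, and posting the resulting text to Slack is left out.

```ts
// Subset of the JSON CloudWatch publishes to SNS on an alarm state change.
interface AlarmSnsMessage {
  AlarmName: string;
  AlarmDescription: string | null;
  NewStateValue: string;
  NewStateReason: string;
}

interface OncallNotice {
  severity: string;
  runbookUrl: string;
  firstMove: string;
  text: string; // message body to send to the on-call channel
}

function buildOncallNotice(msg: AlarmSnsMessage): OncallNotice {
  const description = msg.AlarmDescription ?? "";
  // The first SEVn mentioned is taken as the alarm's severity.
  const severity = /\bSEV[1-4]\b/.exec(description)?.[0] ?? "SEV?";
  const runbookUrl = /^Runbook:\s*(\S+)/m.exec(description)?.[1] ?? "(no runbook linked)";
  // Hypothetical convention: a "FirstMove: ..." line in the description.
  const firstMove = /^FirstMove:\s*(.+)$/m.exec(description)?.[1] ?? "(see runbook)";
  const text = [
    `[${severity}] ${msg.AlarmName} is ${msg.NewStateValue}`,
    `Runbook: ${runbookUrl}`,
    `First move: ${firstMove}`,
    `Reason: ${msg.NewStateReason}`,
  ].join("\n");
  return { severity, runbookUrl, firstMove, text };
}
```

The fallbacks are deliberately visible. A notice that reads `(no runbook linked)` is itself a signal that the alarm violates the rule that every paging alert points to a Runbook.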

---

## 5. The blameless postmortem: turn a failure into "the organization's learning"

The ROI of failure response is recovered in the postmortem. [Postmortem Culture](https://sre.google/sre-book/postmortem-culture/) defines a postmortem as "**a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring.**"

### 5-1. The true meaning of "blameless"

Blameless does **not** mean "let's be nice to each other." The SRE book says a blameless postmortem "**assumes that everyone involved in an incident had good intentions and did the right thing with the information they had.**" Look for the cause in **the system and process, not the individual.** The reason is practical: "**Removing blame from a postmortem gives people the confidence to escalate issues without fear.**" **A team that hunts for culprits hides the next failure,** and that is fatal to reliability.

### 5-2. When to write it (decide the criteria in advance)

The chapter's most important message: "**It is important to define postmortem criteria before an incident occurs.**" The criteria I use are as follows.

```text
✅ Write a postmortem (if even one applies)
   ・A user-impacting outage/degradation exceeded the threshold (SEV1/SEV2 are mandatory in principle)
   ・Data loss of any kind occurred
   ・On-call did a manual intervention (rollback, traffic detour, etc.)
   ・The time to resolution exceeded the threshold
   ・Monitoring couldn't detect it and a human noticed (= a monitoring bug. The most important class)

❌ No need to write
   ・SEV4 internal-only / no impact (a ticket suffices)
```

The key is including "monitoring couldn't detect it and a human noticed" as a criterion. This is **a defect in observability itself,** and leaving it means you can't notice next time.
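
Because the criteria are fixed in advance, they can be evaluated mechanically when an incident closes. A minimal sketch of the checklist above follows (the `IncidentFacts` field names are hypothetical, and the thresholds themselves are assumed to be evaluated upstream):

```ts
type Severity = "SEV1" | "SEV2" | "SEV3" | "SEV4";

interface IncidentFacts {
  severity: Severity;
  userImpactOverThreshold: boolean;
  dataLoss: boolean;
  manualIntervention: boolean;      // rollback, traffic detour, etc.
  resolutionOverThreshold: boolean;
  detectedByHuman: boolean;         // monitoring missed it
}

// Returns every criterion that applies; one is enough to require a postmortem.
function postmortemReasons(f: IncidentFacts): string[] {
  const reasons: string[] = [];
  if (f.severity === "SEV1" || f.severity === "SEV2") reasons.push("SEV1/SEV2 is mandatory in principle");
  if (f.userImpactOverThreshold) reasons.push("user impact exceeded the threshold");
  if (f.dataLoss) reasons.push("data loss occurred");
  if (f.manualIntervention) reasons.push("on-call intervened manually");
  if (f.resolutionOverThreshold) reasons.push("time to resolution exceeded the threshold");
  if (f.detectedByHuman) reasons.push("monitoring did not detect it");
  return reasons;
}

function needsPostmortem(f: IncidentFacts): boolean {
  return postmortemReasons(f).length > 0;
}
```

Returning the reasons, not just a boolean, gives the postmortem author a starting list for the Detection and Impact sections.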

### 5-3. Postmortem template (fill-in-the-blank)

```markdown
# Postmortem: <title> (SEV<N>)
Status: Draft / In Review / Final
Author: <name>  Reviewer: <name>  Date: 2026-06-24

## Summary (3 lines)
What, for how long, who was impacted.

## Impact
- Duration: <detected HH:MM – recovered HH:MM>
- User impact: <in numbers. measured, not guessed>
- Business impact: <e.g.: N charging delays. Data consistency maintained/lost>

## Timeline (transcribed from the Live Doc)
- HH:MM Alert fired (note auto / manual detection)
- HH:MM IC declared, mitigation started
- HH:MM Mitigation complete, verification started
- HH:MM Recovery confirmed

## Root Cause
Ask "why" five times. Write both the technical factor and the process factor.
Note: don't make an individual the subject (not "A deleted it" but "there was no guard preventing deletion").

## Detection
How you noticed. If it wasn't auto-detection, that itself is an action item.

## What went well / what went poorly in the response
- Went well: ~
- Went poorly: ~
- Got lucky: ~ (next: turn this into a mechanism that doesn't rely on luck)

## Action items (by the criteria of §5-4)
| # | Content | Type (prevent/mitigate/process) | Owner | Due | Ticket |
| - | ------- | ------------------------------- | ----- | --- | ------ |
```

Explicitly writing down "where we got lucky" is a good habit taken from Google's own postmortem format. The point where you happened to be saved this time is exactly what should be **turned into a mechanism that doesn't rely on luck next time.**

### 5-4. Good action items vs. bad action items

A postmortem's value is decided almost entirely by **the quality of its action items.** Vague reflection changes nothing.

| Viewpoint | ❌ Bad action item | ✅ Good action item |
| --- | --- | --- |
| Specificity | "be careful," "thoroughly pay attention" | "add a Runbook for auto-scaling to the webhook-backlog alert" |
| Owner/due | who does it by when is unknown | owner = @yuta, due = 7/15, ticket = OPS-123 |
| Root/symptom | ends at "this time we restarted by hand" | "permanently install an RDS Proxy to prevent connection exhaustion, and alert on the limit" |
| Verifiability | the judgment of completion is vague | "reproduce it in a failure drill and confirm the auto-mitigation works" |
| Subject | blames the individual ("○○ is to blame") | fixes the system/process ("add a guard") |

A good action item is **SMART** (specific, measurable, assigned, realistic, time-bound). It only counts once it is **filed, given a due date, and tracked.** A postmortem whose action items never get done is just an essay.
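
Part of the table above can be enforced at review time. The following is a rough sketch with hypothetical names: the vague-phrase list is only an example, and a check like this catches missing owners and tickets, not whether the item really addresses the root cause.

```ts
interface ActionItem {
  content: string;
  kind: "prevent" | "mitigate" | "process";
  owner?: string;
  due?: string;
  ticket?: string;
}

// Example phrases only; extend with whatever your team tends to write.
const VAGUE_PHRASES = ["be careful", "pay attention"];

// Returns what is missing; an empty list means the item is trackable.
function actionItemGaps(item: ActionItem): string[] {
  const gaps: string[] = [];
  const lowered = item.content.toLowerCase();
  if (VAGUE_PHRASES.some((phrase) => lowered.indexOf(phrase) !== -1)) {
    gaps.push("content is vague");
  }
  if (!item.owner) gaps.push("no owner");
  if (!item.due) gaps.push("no due date");
  if (!item.ticket) gaps.push("no ticket");
  return gaps;
}
```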

> The chapter also lists ways to cultivate the culture: **Postmortem of the month** (company-wide sharing of good cases), **Wheel of Misfortune** (role-play training on past failures), and **visible praise** for excellent postmortems. When the person who wrote up a failure is rewarded, nobody has a reason to conceal one.

---

## 6. On-call hygiene: design a sustainable on-call system

However good your Runbooks and postmortems are, **none of it works if the on-call engineer is exhausted.** [Being On-Call](https://sre.google/sre-book/being-on-call/) argues for keeping on-call healthy in both "**quantity**" and "**quality**."

### 6-1. The upper limit of quantity: structurally prevent burnout

The chapter gives concrete numbers.

- **Reserve at least 50% of time for engineering.** Of the remainder, "**no more than 25% can be spent on-call.**" With a primary and a secondary always on call, a single-site 24/7 rotation therefore needs **at least 8 people.**
- Handling one incident (root-cause analysis, remediation, writing the postmortem, fixing bugs) takes about 6 hours on average. The sustainable upper limit is therefore "**2 per 12-hour on-call shift.**"

These numbers matter because **they let you judge "too much on-call" by data rather than by feel.** If three or more incidents hit every shift, the problem is not the engineers' stamina but **the alert design or the reliability itself.**
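
Both numbers fall out of simple division, so they are easy to recompute for your own rotation. The sketch below assumes, as the SRE book does for the 8-person figure, two people on call at all times (primary and secondary); the function names are mine.

```ts
// Smallest rotation that keeps each engineer's on-call share at or under the cap.
function minRotationSize(concurrentOnCall: number, maxOnCallFraction: number): number {
  return Math.ceil(concurrentOnCall / maxOnCallFraction);
}

// How many incidents fit into one shift at a given handling cost.
function maxIncidentsPerShift(shiftHours: number, hoursPerIncident: number): number {
  return Math.floor(shiftHours / hoursPerIncident);
}

function isShiftOverloaded(incidents: number, shiftHours = 12, hoursPerIncident = 6): boolean {
  return incidents > maxIncidentsPerShift(shiftHours, hoursPerIncident);
}

const singleSiteMinimum = minRotationSize(2, 0.25); // 8
const perShiftLimit = maxIncidentsPerShift(12, 6);  // 2
```

Feeding your measured incidents per shift into `isShiftOverloaded` turns "on-call feels heavy" into a yes/no signal you can track over time.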

### 6-2. Escalation policy: when in doubt, escalate

Define primary and secondary. The secondary is a "**fall-through for the pages the primary on-call misses.**" A clear escalation path and "**well-defined incident-management procedures**" lower stress and prevent judgment errors in a state where stress hormones "**impair cognitive functions.**" **Don't ask a tired human for complex improvisation** — this is the point of the design.

```yaml
# escalation-policy.yaml: an example of expressing escalation as configuration
service: payment-platform
severity_routing:
  SEV1:
    page: primary_oncall
    escalate_after: 5m   # if the primary does not respond
    then: [secondary_oncall, team_lead]
    notify: [eng_manager, status_page]
  SEV2:
    page: primary_oncall
    escalate_after: 15m
    then: [secondary_oncall]
  SEV3:
    notify: oncall_channel  # no paging (handled within business hours)
oncall:
  rotation: weekly
  handoff: "Every Monday 10:00: hand over unresolved incidents and Runbook updates"
  compensation: "time-off-in-lieu (per the SRE book: compensate on-call with pay or time off)"
```
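
How a paging tool might interpret that policy can be sketched in a few lines. This is an assumption about semantics, not a real tool's behavior: the YAML is hand-mirrored into a typed object, and `escalate_after` is read as "once this many minutes pass without acknowledgement, also page the `then` list."

```ts
type RoutedSeverity = "SEV1" | "SEV2" | "SEV3";

interface SeverityRoute {
  page?: string;                 // absent means "do not page"
  escalateAfterMinutes?: number;
  then?: string[];
  notify?: string[];
}

// Hand-mirrored from escalation-policy.yaml above.
const severityRouting: Record<RoutedSeverity, SeverityRoute> = {
  SEV1: {
    page: "primary_oncall",
    escalateAfterMinutes: 5,
    then: ["secondary_oncall", "team_lead"],
    notify: ["eng_manager", "status_page"],
  },
  SEV2: { page: "primary_oncall", escalateAfterMinutes: 15, then: ["secondary_oncall"] },
  SEV3: { notify: ["oncall_channel"] },
};

// Everyone who should have been paged after `minutesUnacknowledged` minutes.
function whoToPage(severity: RoutedSeverity, minutesUnacknowledged: number): string[] {
  const route = severityRouting[severity];
  if (!route.page) return [];
  const targets = [route.page];
  if (
    route.escalateAfterMinutes !== undefined &&
    minutesUnacknowledged >= route.escalateAfterMinutes
  ) {
    targets.push(...(route.then ?? []));
  }
  return targets;
}
```

Keeping the decision in data plus one small function means a tired engineer never has to decide at 3 a.m. whether it is time to wake the secondary.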

### 6-3. Toil reduction: eradicating manual work is itself cost reduction

**Toil** is operational work that is manual, repetitive, and automatable. It is a hard-to-see cost: if you restart by hand and aggregate logs by hand every time, you are paying **a recurring subscription in labor cost.** If a Runbook's procedure is a list of copy-paste commands, it is an **automation candidate.** The SRE book even allows a team in operational overload to temporarily "**give back the pager**" to the development team. Toil reduction is not a fringe benefit but **the main thrust of cost optimization and reliability improvement.**

> **Too little** on-call is also a problem. The chapter recommends sizing teams to allow "**every engineer to be on-call at least once or twice a quarter.**" Without on-call experience, you can't act when it matters.

---

## 7. MTTD / MTTR and error budgets: decide what to improve with numbers

If you talk about "did the response get better" by feeling, improvement stops. Decide **where to invest** with the following metrics.

| Metric | Meaning | If this is bad… | Effective measure |
| --- | --- | --- | --- |
| **MTTD** (mean time to detect) | failure occurs → detection | Holes in monitoring. Manual discovery becomes the norm | Strengthen observability, review alert thresholds |
| **MTTR** (mean time to recover) | detection → recovery | Mitigation is slow / person-dependent | Develop Runbooks, auto-mitigation, ease rollback |

**If MTTD is large, invest in observability; if MTTR is large, invest in Runbooks and automation.** The metrics name the investment target for you. Note that both are means and therefore sensitive to outliers. A few SEV1s can dominate them, so **look at them per severity** or also track the median and percentiles.
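
A per-severity summary with both mean and median takes only a few lines. The sketch below uses the definitions from the table (MTTD is occurrence to detection, MTTR is detection to recovery). The `IncidentTimeline` shape is hypothetical, and timestamps are plain minute counts to keep the arithmetic visible.

```ts
type Severity = "SEV1" | "SEV2" | "SEV3" | "SEV4";

interface IncidentTimeline {
  severity: Severity;
  startedAtMin: number;   // when the failure began
  detectedAtMin: number;  // when it was detected
  recoveredAtMin: number; // when recovery was confirmed
}

function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

function median(xs: number[]): number {
  const sorted = xs.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

interface ResponseStats {
  count: number;
  mttdMinutes: number;
  mttrMinutes: number;
  medianRecoverMinutes: number;
}

// Returns undefined when there are no incidents of that severity.
function statsFor(incidents: IncidentTimeline[], severity: Severity): ResponseStats | undefined {
  const matching = incidents.filter((i) => i.severity === severity);
  if (matching.length === 0) return undefined;
  const detect = matching.map((i) => i.detectedAtMin - i.startedAtMin);
  const recover = matching.map((i) => i.recoveredAtMin - i.detectedAtMin);
  return {
    count: matching.length,
    mttdMinutes: mean(detect),
    mttrMinutes: mean(recover),
    medianRecoverMinutes: median(recover),
  };
}
```

With recovery times of 10, 20, and 300 minutes, the mean is 110 while the median is 20. One long incident moves the mean by an order of magnitude, which is exactly why the median belongs next to it.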

> Use **only values actually measured on your own system.** A borrowed MTTR or a fabricated uptime figure evaporates, along with your credibility, at the first failure. What I can disclose is limited to verifiable outcomes I achieved through design, such as **0 double charges in production.**

What connects these metrics to **the trigger for response** is the **error budget.** If the SLO is 99.9%, the remaining 0.1% is the allowed error, i.e. the budget. When the budget is used up, stop shipping new features and invest in reliability. It is **a mechanism that decides "when to enter response mode" by data rather than politics.** With it, the tug-of-war between failure response and feature development changes from an emotional argument into arithmetic.
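
As a worked example of that arithmetic, here is a request-based sketch. Counting the budget in failed requests is my assumption, since a time-based budget works the same way, and the names are hypothetical.

```ts
interface ErrorBudgetStatus {
  allowedFailures: number;
  remaining: number;
  exhausted: boolean; // true means: stop features, invest in reliability
}

// Request-based error budget for one SLO window.
function errorBudget(
  sloPercent: number,
  totalRequests: number,
  failedRequests: number,
): ErrorBudgetStatus {
  const allowedFailures = totalRequests * (1 - sloPercent / 100);
  const remaining = allowedFailures - failedRequests;
  return { allowedFailures, remaining, exhausted: remaining <= 0 };
}

// 99.9% over 1,000,000 requests allows about 1,000 failures.
const midWindow = errorBudget(99.9, 1_000_000, 400);    // about 600 left
const overBudget = errorBudget(99.9, 1_000_000, 1_200); // exhausted
```

The `exhausted` flag is the whole point: it is the agreed, pre-negotiated condition under which feature work pauses, so nobody has to argue the case during the incident.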

---

## 8. Common pitfalls

Let me list, with countermeasures, the failures I repeatedly see in the field.

- **❌ Everyone looks at logs without deciding roles** → multiple people deploy/restart simultaneously and widen the wound. **✅ Stand up an IC and unify system changes to Ops** (§1).
- **❌ Start from root-cause analysis** → stopping the bleeding is delayed and the damage grows. **✅ Mitigate first.** The cause belongs in the postmortem.
- **❌ The alert has no Runbook** → the on-call person freezes the moment it rings. **✅ Embed a Runbook link and the first move in every paging alert** (§4).
- **❌ Too many alerts fire** → serious alerts get missed through alert fatigue. **✅ Page only on alerts where "a human must act when it rings"** (1:1 alert/incident).
- **❌ Hunt for culprits in the postmortem** → the next failure is concealed. **✅ Thoroughly blameless. The subject is the system/process** (§5).
- **❌ Don't file action items** → the same failure recurs. **✅ Write them SMART and track with due date, owner, ticket.**
- **❌ Paste secrets in the incident channel** → in the rush to recover, you paste an API key, token, or customer PII into Slack or the Live Doc. **✅ Hygiene in incident communication too. Don't paste credentials but show them by reference name, and minimize PII.** Write on the premise that the channel/document will be widely shared later.
- **❌ The Runbook is outdated** → wrong procedures make the damage worse. **✅ Make the owner and last-updated date mandatory, and update on each postmortem.**
- **❌ The pager goes off continuously every shift** → burnout and turnover. **✅ When it exceeds "2 per 12-hour shift," treat it as a problem of reliability or alert design** (§6).

Four non-functional requirements cut across all of this: **observability** (you can't respond to what you can't see), **reliability** (idempotency and ease of rollback make mitigation safe), **cost** (toil is a recurring subscription; eliminate it with automation), and **security** (the hygiene of incident communication: don't paste secrets). Incident response is where these four intersect.

---

## Summary: operations can be "designed"

A team that is strong against production failures gets that way not through grit but through **advance design.** To close, the key points in five lines.

1. **Separate roles with the Incident Commander model.** The IC decides, only Ops touches the system, Comms communicates. Make the Live Doc and the Command Post the single source of truth. When in doubt, **declare early.**
2. **Have a yardstick that lets you instantly decide severity with SEV1–4.** The response target is **derived backward from the SLO.**
3. **A Runbook follows the fixed form** of detect→triage→mitigate→verify→communicate. **Stop the bleeding before root-cause analysis.** The procedure is safe only when the design supports it with idempotency and rollback.
4. **Turn failures into a learning asset with blameless postmortems.** Define criteria in advance, and track **SMART** action items. The subject is the system.
5. Decide where to improve with data, via **on-call hygiene** (at most 2 incidents per 12-hour shift, the 50/25% rule, toil reduction) and **MTTD/MTTR and error budgets.**

"Fast, cheap, and safe with one person × generative AI (Claude Code)" — I take on everything end-to-end, from the application to the infrastructure, and **even operation design.** The source of this article's knowledge is the [serverless payment platform](/case-studies/payment-platform-reliability) where I led the payment-reliability layer and achieved **0 double charges in production.** For consultation on **development with operations included,** including incident response, Runbook development, and on-call design, please use [contact](/contact).