# Turning GuardDuty Findings into Automated Incident Response (SOAR) with EventBridge: A Production-Design Overview in Terraform / Step Functions / Python

> A production-design guide to turning GuardDuty findings into automated incident response (SOAR) via EventBridge → Step Functions. Numeric severity matching, notify broadly / contain narrowly with an allowlist / human approval for destructive actions, EC2/IAM/S3/EKS playbooks, plus idempotency, DLQ, and least privilege—all explained in real code.

- Published: 2026-06-27
- Author: 友田 陽大
- Tags: Security, AWS, GuardDuty, EventBridge, Incident Response
- URL: https://tomodahinata.com/en/blog/aws-guardduty-eventbridge-automated-remediation-incident-response-guide
- Category: Amazon GuardDuty in production
- Pillar guide: https://tomodahinata.com/en/blog/aws-guardduty-threat-detection-multi-account-terraform-eventbridge-guide

## Key points

- All GuardDuty findings flow to EventBridge (source=aws.guardduty / detail-type=GuardDuty Finding). Since detail.severity is numeric, you can mechanically pick up 'High and above only' with {"numeric":[">=",7]}
- The latency trap: new findings arrive in about 5 minutes, but the delivery frequency of recurrences (updates) defaults to 6 hours. If you build automated response, set finding_publishing_frequency to FIFTEEN_MINUTES
- The iron rule of response design is 'notify broadly, contain narrowly with an allowlist, and interpose human approval for destructive operations.' Since EventBridge is at-least-once, make the finding id the idempotency key and start from reversible containment
- Orchestration is Step Functions: enrich → decide → contain → notify → ticket. Destructive steps wait for human approval with a task token. Split EC2/IAM/S3/EKS into per-resource-type playbooks
- Production quality is decided by the plumbing: DLQ + retries, a least-privilege responder role, structured logs (no sensitive values), path verification with create-sample-findings, and a manual-trigger entry point via a Security Hub custom action

---

"We can detect with GuardDuty. So **after** detecting, who does what?"—this is the question that always comes next after threat detection, and the one many organizations get stuck on.

[Designing threat detection in production with GuardDuty](/blog/aws-guardduty-threat-detection-multi-account-terraform-eventbridge-guide) lines up findings on a dashboard. But **a finding is a "red card," not a "response."** Even if `UnauthorizedAccess:EC2/MaliciousIPCaller.Custom` appears at 2 a.m., a human notices it, opens the console, identifies the instance, creates a quarantine SG, and swaps it—and in that interval the attacker exfiltrates data and moves laterally. **The only path to shrinking MTTR (mean time to respond) is running "plumbing" between detection and response.**

This article is a production-design guide to that plumbing—**receiving GuardDuty findings with Amazon EventBridge and turning them into automated incident response (so-called SOAR: Security Orchestration, Automation and Response).** Whereas the pillar article handled the EventBridge entrance (severity matching + isolation with a single Lambda), this article runs through what comes **after**—**multi-stage orchestration with Step Functions, per-resource-type playbooks, a human-approval gate, idempotency, DLQ, and a least-privilege responder role**—in real Terraform / ASL / Python code. As source material, I'll weave in my experience implementing IAM, observability, and DR across a [serverless payment platform](/case-studies/payment-platform-reliability) on multi-account AWS, where I **guaranteed "zero double charges" with idempotency on a foundation handling real money**—that thinking of "keeping side effects to once in an at-least-once world" is exactly the same as designing automated response.

> **The rules of this article**: specs, event structures, severity numeric bands, and delivery frequency are based on the **AWS official documentation (as of June 2026)**. Finding types, recommended responses, and APIs get revised, so always confirm the latest official info before going to production. And the iron rule of building automated response—**GuardDuty is detection, not defense, and automated response is just one part of defense-in-depth.** Automation should start from things that satisfy these four conditions: **① idempotent (one operation even if it arrives twice in an at-least-once world), ② scope it down (a type allowlist), ③ reversible, and ④ interpose human approval for destructive operations.** The biggest risk of automated response is dropping production with one false positive.

---

## 0. Mental model: automated response is "deciding plumbing," not "punching automation"

Before starting the design, let me pin the definition down in one line.

> **Automated incident response = "deciding plumbing" that takes a finding (a structured threat notification) as input, enriches it with context, decides the response level (decide), and executes safe containment (contain), notification (notify), and ticketing (ticket) idempotently and reversibly. Destructive remediation interposes human approval.**

From here, three consequences run through all of the design.

1. **"Punch everything" is an accident.** The one thing you must never do in automated response is automatically terminate instances or delete keys on every finding. GuardDuty has false positives too (Chapter 5 covers how to suppress them). If a single false positive takes down a production instance, your own automation will cause outages more often than attackers do. So the design principle is **"notify broadly, contain narrowly, interpose a human for destruction."**
2. **EventBridge is at-least-once. So idempotency is a premise.** EventBridge can start the response again for the same finding. Just like preventing double charges on a payment foundation, **the response must also be "one operation's worth of side effect even if it arrives twice."** The idempotency key is the finding's `id` (+ `updatedAt`). This is the backbone of this article.
3. **Detection speed and response speed are separate.** GuardDuty delivers a new finding in about 5 minutes, but **the delivery of recurrences (updates) is delayed by up to 6 hours by default.** To raise the reaction speed of automated response, set `finding_publishing_frequency = FIFTEEN_MINUTES`. Not knowing this lands you in the trap of "I automated it but the response is slow" (detailed in Chapter 3).

Grasp these three, and you see that automated-response design decomposes into three: **"① which findings to flow where (EventBridge routing) → ② how to decide and remediate (Step Functions + playbooks) → ③ how to run it without breaking (idempotency, DLQ, least privilege, observability)."** Let's build them in order.

---

## 1. Overall architecture: five verbs (enrich → decide → contain → notify → ticket)

Here is the map of the finished system first. Events flow down from GuardDuty.

```text
GuardDuty finding
   │  (source=aws.guardduty / detail-type="GuardDuty Finding" / detail.severity = numeric)
   ▼
EventBridge rule (numeric severity matching + type routing + archive exclusion)
   ├─▶ [always] SNS → Slack / email       … notify broadly (fast human triage)
   └─▶ Step Functions state machine        … respond selectively (deciding plumbing)
          1. ENRICH   enrich the finding (tags, owner, related resources, known IoCs)
          2. DECIDE   decide "notify-only / contain / approval-required" by severity × type × allowlist
          3. CONTAIN  per-resource-type playbook Lambda (idempotent, reversible)
          4. APPROVE  for destructive operations only, wait for human approval via a task token (Slack/email)
          5. NOTIFY+TICKET  notify the response result and file a ticket in Jira/ServiceNow
          └─ on failure → DLQ (SQS) + retry. Don't drop anything
```

Also provide a manual-trigger entry point. With a **Security Hub custom action**, an analyst selects a finding in the console and presses a "Quarantine" button, which fires an EventBridge event that starts the same Step Functions state machine (Chapter 8). **Consolidating "responses that happen automatically" and "responses a human pulls the trigger for" into the same plumbing** is the trick to keeping operations simple.

The table below lists each component and the reason for choosing it.

| Role | Service | Why this |
| --- | --- | --- |
| Detection source | GuardDuty | All findings flow automatically to EventBridge |
| Routing | EventBridge rule | Numeric severity matching, type-based branching, archive exclusion—serverless |
| Broad notification | SNS (→ Slack/email) | Can add destinations later. The starting point of human triage |
| Orchestration | **Step Functions** | Multi-stage, conditional branching, human-approval waiting—as a visualized state machine |
| Individual remediation | Lambda (per type) | **Split EC2/IAM/S3/EKS playbooks by SRP.** Can narrow least privilege per type |
| Reliability | SQS (DLQ) + retry_policy | At-least-once + **don't drop anything** against transient failures |
| Manual trigger | Security Hub custom action | An analyst starts the same plumbing from the console |

> **Why not a single giant Lambda (SRP)**: "do everything in one Lambda" makes the IAM role a giant permission holding `ec2:*`, `iam:*`, and `s3:*`, destroying least privilege. **Splitting decide and contain, and dividing contain per resource type**, lets you narrow each Lambda's permission to "EC2 only" or "IAM only." Even if that Lambda is compromised, you can localize the damage—this is a textbook dividend of least privilege.

---

## 2. EventBridge routing design: severity numeric matching, type branching, archive exclusion

All GuardDuty findings arrive at EventBridge in near real time. The shape of the arriving event is this (the official event schema).

```json
{
  "version": "0",
  "id": "cd2d702e-ab31-411b-9344-793ce56b1bc7",
  "detail-type": "GuardDuty Finding",
  "source": "aws.guardduty",
  "account": "111122223333",
  "region": "ap-northeast-1",
  "time": "2026-06-27T02:00:00Z",
  "detail": {
    "id": "1ab23c...",
    "type": "UnauthorizedAccess:EC2/MaliciousIPCaller.Custom",
    "severity": 7.5,
    "accountId": "111122223333",
    "region": "ap-northeast-1",
    "updatedAt": "2026-06-27T02:00:00.000Z",
    "service": { "archived": false },
    "resource": { "resourceType": "Instance", "instanceDetails": { "instanceId": "i-0abc..." } }
  }
}
```

There are three crucial design points.

### 2.1 severity is a "number"—so you can cut it mechanically by threshold

`detail.severity` is not a string but a **number.** EventBridge's **numeric matching** can express "only High and above (≥ 7) to Step Functions" in one line. The severity numeric bands (official) are these.

| Severity | Numeric band | Treatment in automated response |
| --- | --- | --- |
| **Critical** | **9.0 – 10.0** | An attack sequence may be in progress. Top priority for containment + immediate escalation |
| **High** | **7.0 – 8.9** | A resource is compromised and abuse is in progress. Consider containment (depending on the allowlist) |
| **Medium** | **4.0 – 6.9** | Suspicious activity that deviates from normal behavior. **Notify and let a human investigate** (no auto-containment) |
| **Low** | **1.0 – 3.9** | An attempt not reaching compromise (port scans, etc.). Record only |

> **`AttackSequence:*` is always Critical**: the "attack sequence" findings that [Extended Threat Detection](/blog/aws-guardduty-extended-threat-detection-attack-sequence-findings-guide) correlates over a 24-hour window from weak signals are, by their nature, all classified as **Critical (9.0–10.0).** Things like `AttackSequence:S3/...` and `AttackSequence:EKS/CompromisedCluster`. **Placing `AttackSequence:*` as the top-priority trigger** in automated response is the standard (because the context arrives as a line, it has the highest response value).

### 2.2 Don't flow archived findings

As official behavior, **findings auto-archived by a Suppression Rule are not sent to EventBridge** (for manually-archived ones, occurrences after archiving are sent according to the frequency). Still, to doubly reject noise, adding a filter on the rule side that picks up only those where `detail.service.archived` is `false` makes it robust.
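The two filters described so far (numeric severity and the archived flag) are easy to mirror locally, which helps when unit-testing routing decisions without deploying a rule. The following is a minimal sketch; the function names `severity_band` and `matches_rule` are hypothetical, and the predicate only approximates the intent of the rule rather than reproducing EventBridge's matching engine.

```python
def severity_band(severity: float) -> str:
    """Map a GuardDuty numeric severity to its band name (bands from the table above)."""
    if not 1.0 <= severity <= 10.0:
        raise ValueError(f"severity out of range: {severity}")
    if severity >= 9.0:
        return "Critical"
    if severity >= 7.0:
        return "High"
    if severity >= 4.0:
        return "Medium"
    return "Low"


def matches_rule(event: dict) -> bool:
    """Local approximation of the routing intent: High and above, not archived."""
    detail = event.get("detail", {})
    return (
        event.get("source") == "aws.guardduty"
        and event.get("detail-type") == "GuardDuty Finding"
        and float(detail.get("severity", 0)) >= 7
        and detail.get("service", {}).get("archived") is False
    )
```

Feeding recorded events through `matches_rule` in a unit test is a cheap way to catch a threshold typo before it reaches production.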

### 2.3 The rule in Terraform: numeric matching + type routing + DLQ + retry

The following wiring routes findings that are "High and above (≥7) and not archived" to Step Functions while also notifying broadly via SNS.

```hcl
# ── Capture GuardDuty findings that are High or above (severity >= 7) and not archived ──
resource "aws_cloudwatch_event_rule" "gd_high" {
  name        = "guardduty-high-to-soar"
  description = "Route GuardDuty findings (severity >= 7, not archived) to the SOAR pipeline"
  event_pattern = jsonencode({
    source        = ["aws.guardduty"]
    "detail-type" = ["GuardDuty Finding"]
    detail = {
      severity = [{ numeric = [">=", 7] }] # Numeric match: High (7.0-8.9) and Critical (9.0-10.0)
      service  = { archived = [false] }    # Suppressed (auto-archived) findings never arrive anyway; reject them a second time
    }
  })
}

# ① For humans: always send to SNS (→ Slack/email). Notify broadly.
resource "aws_cloudwatch_event_target" "to_sns" {
  rule      = aws_cloudwatch_event_rule.gd_high.name
  target_id = "notify-sns"
  arn       = aws_sns_topic.security_alerts.arn
}

# ② For machines: send to the orchestrating Step Functions state machine.
resource "aws_cloudwatch_event_target" "to_sfn" {
  rule      = aws_cloudwatch_event_rule.gd_high.name
  target_id = "soar-state-machine"
  arn       = aws_sfn_state_machine.responder.arn
  role_arn  = aws_iam_role.eventbridge_invoke_sfn.arn # Role that lets EventBridge start the state machine

  # Retry and DLQ to survive transient failures (don't drop anything).
  retry_policy {
    maximum_event_age_in_seconds = 3600 # Retry for up to 1 hour (a response that is too late is meaningless)
    maximum_retry_attempts       = 4
  }
  dead_letter_config { arn = aws_sqs_queue.soar_dlq.arn }
}

# DLQ that collects events whose delivery failed. Always alert when it is not empty.
resource "aws_sqs_queue" "soar_dlq" {
  name                      = "soar-eventbridge-dlq"
  message_retention_seconds = 1209600 # 14 days. Maximize the time available for investigation
}
```

> **DLQ is not "build it and done"**: a message piling up in the DLQ = a finding existed for which automated response couldn't start. This is synonymous with a missed detection—the most dangerous silence. Make `ApproximateNumberOfMessagesVisible > 0` a CloudWatch alarm, and instantly notify on the very fact that the DLQ is not empty. In reliability design, "don't swallow failures" is the essence.
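As a rough illustration of the "DLQ not empty" alarm, the sketch below builds the keyword arguments for CloudWatch's `put_metric_alarm`. The helper name `dlq_alarm_params`, the alarm name suffix, and the 60-second period are assumptions for illustration; the metric name and the `> 0` threshold come from the text above.

```python
def dlq_alarm_params(queue_name: str, alarm_topic_arn: str) -> dict:
    """Build put_metric_alarm kwargs that fire as soon as the DLQ holds any message."""
    return {
        "AlarmName": f"{queue_name}-not-empty",  # hypothetical naming convention
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 60,  # assumed evaluation window in seconds
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # an idle queue reports no data points
        "AlarmActions": [alarm_topic_arn],
    }
```

With boto3 this would be applied as `boto3.client("cloudwatch").put_metric_alarm(**dlq_alarm_params(...))`; in a Terraform-managed stack the same settings would more naturally live in an alarm resource next to the queue.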

---

## 3. The latency trap: new is about 5 minutes, updates default to 6 hours

If you build automated response, this is a **must-read** pitfall. The delivery frequency to EventBridge behaves differently by the kind of finding (official).

- **A new finding (the first generation of a unique finding ID)**: delivered in **near real time (about 5 minutes).** This frequency can't be changed.
- **A recurrence of an existing finding (subsequent occurrences)**: aggregates recurrences of the same type into one event at a set interval and delivers it. That interval is `finding_publishing_frequency`, choosable from **`FIFTEEN_MINUTES` / `ONE_HOUR` / `SIX_HOURS` (default).** **The default is 6 hours.**

In other words—**"the first detection is fast, but the follow-up is delayed by up to 6 hours."** Waiting 6 hours for the follow-up of a finding whose attack is ongoing dilutes the meaning of automated response. So if you build automated response, **set `FIFTEEN_MINUTES`.**

```hcl
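# Set the EventBridge delivery frequency for recurring findings to 15 minutes on the detector.
# Note: only the (delegated) administrator account can change this frequency.
#       Member accounts can't change it for themselves; the administrator's setting propagates to all members.
resource "aws_guardduty_detector" "this" {
  enable                       = true
  finding_publishing_frequency = "FIFTEEN_MINUTES" # Raise the reaction speed of automated response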
}
```

> **Implication for multi-account**: setting this frequency in the [delegated administrator](/blog/aws-guardduty-multi-account-organizations-delegated-administrator-terraform-guide) account **applies the same frequency to all member accounts.** Members can't change it individually. You can govern the whole organization's automated-response speed with one setting in the delegated administrator—this is both an advantage and a caveat that the change's blast radius is wide.
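Because the frequency is a single setting with a wide blast radius, it can be worth checking for drift. The sketch below lists detectors whose publishing frequency is not the target; it assumes a boto3 GuardDuty client is passed in, and the function name `detectors_off_target` is hypothetical.

```python
from typing import Any

TARGET_FREQUENCY = "FIFTEEN_MINUTES"


def detectors_off_target(guardduty: Any, target: str = TARGET_FREQUENCY) -> dict[str, str]:
    """Return {detector_id: current_frequency} for detectors not set to the target."""
    drifted: dict[str, str] = {}
    for detector_id in guardduty.list_detectors()["DetectorIds"]:
        current = guardduty.get_detector(DetectorId=detector_id)["FindingPublishingFrequency"]
        if current != target:
            drifted[detector_id] = current
    return drifted
```

Called as `detectors_off_target(boto3.client("guardduty"))` in each region, an empty result means the setting is in place there.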

---

## 4. Principles of response design: notify broadly, contain narrowly with an allowlist, interpose a human for destruction

This is the heart of SOAR design. Split into three layers, **make the aggression of automation "proportional to risk."**

| Layer | What | On which findings | Reversibility |
| --- | --- | --- | --- |
| **NOTIFY** | Enrich and notify to SNS → Slack/email | **All High and above** (broadly) | — |
| **CONTAIN** | Quarantine-SG swap, public-access block, etc. | **Type allowlist ∩ High and above** (narrowly) | **Reversible operations only** |
| **DESTROY** | Disable keys, terminate instances | Same as above + **after human approval** | Hard to revert → so interpose a human |

Putting the design into words:

- **Notify (notify) broadly**: so humans can triage fast, flow all High and above to notification. Notification is low-cost and the risk of missing things is high, so "broadly" is correct.
- **Contain (contain) narrowly, with an allowlist**: limit auto-action to things with **few false positives and reversible operations.** Only findings on the "type allowlist" are targets of auto-containment. Types not on the allowlist, even if High, stop at notify + ticket.
- **Destructive operations (destroy) interpose human approval**: disabling keys or terminating instances is hard to revert if the activity later turns out to be legitimate. So **automate up to "awaiting approval" and don't execute until a human presses a button** (Step Functions' task token, §6.1).

> **Idempotency is a precondition**: any of the above operations **can be started twice** by EventBridge's at-least-once delivery. Containment must be "the same state whether quarantined once or twice." The idempotency key is the finding's **`id`.** Further, looking at "is this a newer `updatedAt` for the same finding?" lets you reject old resends. This is exactly the same as the thinking on a payment foundation—**charge once even if you receive the same payment event twice**—make `id` the idempotency key, record processed, and make the second a no-op.
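The claim rule itself fits in a few lines. The following is an in-memory model only, meant to show the semantics (new finding or newer `updatedAt` wins, everything else is a duplicate); the name `claim` and the dict-based store are assumptions, and a real implementation needs an atomic conditional write in a shared store.

```python
def claim(store: dict[str, str], finding_id: str, updated_at: str) -> bool:
    """Return True only for a new finding or a strictly newer updatedAt.

    ISO-8601 UTC timestamps in the same format compare correctly as plain strings.
    """
    previous = store.get(finding_id)
    if previous is not None and previous >= updated_at:
        return False  # same or older version already processed: no-op
    store[finding_id] = updated_at
    return True
```

A duplicate delivery and a stale resend both return `False`, so the caller can skip any side effect for them.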

---

## 5. Per-resource-type playbooks: EC2 / IAM / S3 / EKS

The contents of containment **differ completely by resource type.** Branch on the finding's `detail.resource.resourceType` (`Instance` / `AccessKey` / `S3Bucket` / `Kubernetes` ...) and hand off to the per-type playbook. Each playbook follows the official "Remediating findings" guidance.

### 5.1 EC2 (`Instance`) → quarantine SG + cutting off the IAM role

The official EC2 remediation is clear: **create a dedicated Isolation security group** (with no inbound or outbound rule that allows `0.0.0.0/0 (0-65535)`), **associate** it with the instance, and **remove all other SGs.** Automated response codifies this.

- **Create the quarantine SG (tightening egress too) in advance**, and at detection time just swap it (don't create the SG at runtime = fast, idempotent).
- **Existing tracked connections aren't severed by an SG change** (official note). Note that only future traffic is blocked. If immediate cutoff is needed, also use a NACL.
- If a **strong IAM role is attached** to the instance, prepare for credential exfiltration (`InstanceCredentialExfiltration`): consider swapping the role's policy for a "deny-only" one or revoking its sessions. This leans destructive, so it belongs behind the approval gate.
- **Don't auto-terminate**—take a snapshot for investigation and leave termination to human judgment.
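The SG swap from the list above could look roughly like this. It is a sketch under stated assumptions: the quarantine SG already exists, a boto3 EC2 client is passed in, and the tag key `soar:quarantine-original-sgs` is a hypothetical name used both as the idempotency guard and as the record needed to undo the swap.

```python
import json
from typing import Any

QUARANTINE_TAG = "soar:quarantine-original-sgs"  # hypothetical tag key


def quarantine_instance(ec2: Any, instance_id: str, quarantine_sg_id: str) -> str:
    """Swap every ENI of the instance onto the quarantine SG, idempotently."""
    instance = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"][0]["Instances"][0]
    tagged = any(t["Key"] == QUARANTINE_TAG for t in instance.get("Tags", []))
    current = {
        eni["NetworkInterfaceId"]: [g["GroupId"] for g in eni["Groups"]]
        for eni in instance["NetworkInterfaces"]
    }
    pending = [eni_id for eni_id, groups in current.items() if groups != [quarantine_sg_id]]
    if tagged and not pending:
        return "already-quarantined"  # second delivery: no-op

    if not tagged:
        # Record the original SGs first so the swap stays reversible.
        value = json.dumps(current)
        if len(value) > 256:  # EC2 tag values are capped at 256 characters
            raise ValueError("original SG map too large for a tag; store it elsewhere")
        ec2.create_tags(Resources=[instance_id], Tags=[{"Key": QUARANTINE_TAG, "Value": value}])

    for eni_id in pending:
        ec2.modify_network_interface_attribute(NetworkInterfaceId=eni_id, Groups=[quarantine_sg_id])
    return "quarantined"
```

Checking both the tag and the actual ENI state means a run that was interrupted between tagging and swapping finishes the swap on retry instead of being skipped.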

### 5.2 IAM (`AccessKey`) → disabling/rotation is after approval

The official credentials remediation is "identify the involved IAM entity and APIs → confirm permissions → confirm legitimate use → rotate keys if leaked." Disabling a key is a **hard-to-revert destructive operation**, so in automated response:

- **Distinguish `AKIA` (long-term keys) and `ASIA` (STS temporary keys).** `ASIA` is short-lived and can't be disabled from the console (the fix is on the permission-stripping / session-revocation side).
- **Disabling a long-term key (`UpdateAccessKey` → `Inactive`) is after human approval.** Deletion (`DeleteAccessKey`) is even more cautious.
- In parallel, another option is to **temporarily attach a broad explicit Deny** to the suspected-leaked principal to stop the damage (it's reversible, so the approval bar can be lower).
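The `AKIA` / `ASIA` distinction and the approval requirement can be encoded as a small guard in front of the actual API call. This is a sketch: the function names and return strings are hypothetical, the `approved` flag stands in for whatever the approval gate returns, and a boto3 IAM client is assumed to be passed in.

```python
from typing import Any


def classify_access_key(access_key_id: str) -> str:
    """Classify a key by its prefix: AKIA = long-term, ASIA = STS temporary."""
    prefix = access_key_id[:4]
    if prefix == "AKIA":
        return "long-term"
    if prefix == "ASIA":
        return "temporary"
    return "unknown"


def deactivate_long_term_key(iam: Any, user_name: str, access_key_id: str, approved: bool) -> str:
    """Set a long-term key to Inactive, but only after human approval."""
    if classify_access_key(access_key_id) != "long-term":
        return "skipped-not-long-term"  # temporary keys need session revocation instead
    if not approved:
        return "awaiting-approval"
    iam.update_access_key(UserName=user_name, AccessKeyId=access_key_id, Status="Inactive")
    return "deactivated"
```

Setting the status to `Inactive` rather than deleting the key keeps the step undoable if the activity turns out to be legitimate.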

### 5.3 S3 (`S3Bucket`) → public-access block, policy review

The official S3 remediation is "identify the involved bucket, caller, and APIs → judge whether the access is legitimate → tighten with Block Public Access, etc." If `ANONYMOUS_PRINCIPAL` appears, it's a sign **the bucket is public.**

- **Enable S3 Block Public Access** (the four account/bucket-level settings). This is relatively safe and reversible, so easy to put on the allowlist.
- If an overly-loose bucket policy/ACL is detected, **notify + ticket** (auto-rewriting the policy has high misresponse risk).
- Evaluate the presence of sensitive data with **Macie**—attach "what leaked" to the ticket.
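Of the S3 steps above, enabling Block Public Access is the one that lends itself to automation. A minimal sketch at the bucket level follows, assuming a boto3 S3 client is passed in; the function name `block_public_access` is hypothetical.

```python
from typing import Any

# The four Block Public Access settings, all enabled.
BLOCK_ALL = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}


def block_public_access(s3: Any, bucket: str) -> dict:
    """Enable all four Block Public Access settings on one bucket.

    The call is a full overwrite, so running it twice leaves the same state.
    """
    s3.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration=dict(BLOCK_ALL),
    )
    return dict(BLOCK_ALL)
```

To keep the step reversible in practice, the previous configuration would need to be captured (for example into the ticket) before overwriting it; that part is left out of this sketch.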

### 5.4 EKS (`Kubernetes`) → node cordon / pod isolation

Containment against EKS compromise (a compromised pod, abuse of a privileged service-account token) is done at the Kubernetes layer.

- **Cordon** the target node (stop new scheduling) and apply **isolation via a network policy** (egress cutoff) to the compromised pod.
- **Rotate the token** of the suspected-compromised service account and tighten the related IRSA role.
- Since these are `kubectl`-equivalent API operations, you need a design that gives the responder Lambda **least-privilege RBAC to the cluster** (here too is a reason to divide the role per type).
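For the network-policy isolation mentioned above, the manifest is simple enough to generate in code. The sketch below builds a deny-all `NetworkPolicy` as a plain dict; the function name and the default policy name `soar-quarantine` are hypothetical, and it assumes the cluster's CNI actually enforces NetworkPolicy.

```python
def isolation_network_policy(namespace: str, pod_labels: dict[str, str],
                             name: str = "soar-quarantine") -> dict:
    """Build a NetworkPolicy that denies all ingress and egress for the selected pods.

    Listing both policy types while defining no ingress/egress rules means
    no traffic is allowed in either direction for pods matching the selector.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": dict(pod_labels)},
            "policyTypes": ["Ingress", "Egress"],
        },
    }
```

Deleting the policy lifts the isolation, which is what makes this step a reversible containment rather than a destructive one.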

> **Common principle**: every playbook follows the order "**first reversible containment (isolation, public-access block, cordon) → destructive remediation (key deletion, termination) after approval.**" Buy time with containment, and let a human decide on destruction—this is the structure that balances "speed" and "safety."

---

## 6. Orchestration: Step Functions (ASL) + an enrich/route Lambda (Python)

Build the multi-stage decide-and-remediate flow with **Step Functions** rather than a tangle of `if` statements in a single Lambda. The reason is clear: **state is visualized, the human-approval "wait" can be held, each step can be retried or caught individually, and the failure point is visible at a glance.**

### 6.1 The state machine (ASL sketch)

```json
{
  "Comment": "GuardDuty finding -> automated incident response (SOAR)",
  "StartAt": "Enrich",
  "States": {
    "Enrich": {
      "Comment": "Enrich the finding and decide the response level (notify/contain/approve)",
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "${EnrichRouterFunctionArn}",
        "Payload.$": "$"
      },
      "ResultSelector": { "decision.$": "$.Payload.decision", "finding.$": "$.Payload.finding" },
      "Retry": [{ "ErrorEquals": ["States.TaskFailed"], "IntervalSeconds": 2, "MaxAttempts": 3, "BackoffRate": 2.0 }],
      "Next": "RouteByDecision"
    },
    "RouteByDecision": {
      "Comment": "Branch on the decision. notify-only goes straight to notification",
      "Type": "Choice",
      "Choices": [
        { "Variable": "$.decision.mode", "StringEquals": "contain", "Next": "RouteByResourceType" },
        { "Variable": "$.decision.mode", "StringEquals": "approve", "Next": "WaitForHumanApproval" }
      ],
      "Default": "NotifyAndTicket"
    },
    "RouteByResourceType": {
      "Comment": "Dispatch to the per-resource-type playbook. Containment is limited to idempotent, reversible actions",
      "Type": "Choice",
      "Choices": [
        { "Variable": "$.decision.resourceType", "StringEquals": "Instance",   "Next": "ContainEC2" },
        { "Variable": "$.decision.resourceType", "StringEquals": "S3Bucket",   "Next": "ContainS3" },
        { "Variable": "$.decision.resourceType", "StringEquals": "Kubernetes", "Next": "ContainEKS" }
      ],
      "Default": "NotifyAndTicket"
    },
    "ContainEC2": { "Type": "Task", "Resource": "${ContainEc2FunctionArn}", "ResultPath": "$.containment", "Next": "NotifyAndTicket",
      "Catch": [{ "ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "NotifyAndTicket" }] },
    "ContainS3":  { "Type": "Task", "Resource": "${ContainS3FunctionArn}",  "ResultPath": "$.containment", "Next": "NotifyAndTicket",
      "Catch": [{ "ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "NotifyAndTicket" }] },
    "ContainEKS": { "Type": "Task", "Resource": "${ContainEksFunctionArn}", "ResultPath": "$.containment", "Next": "NotifyAndTicket",
      "Catch": [{ "ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "NotifyAndTicket" }] },

    "WaitForHumanApproval": {
      "Comment": "Destructive operations (key deactivation, termination) wait for human approval via a task token. Nothing runs until approved",
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
      "Parameters": {
        "FunctionName": "${RequestApprovalFunctionArn}",
        "Payload": { "finding.$": "$.finding", "taskToken.$": "$$.Task.Token" }
      },
      "TimeoutSeconds": 3600,
      "ResultPath": "$.approval",
      "Next": "ExecuteDestructive",
      "Catch": [{ "ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "NotifyAndTicket" }]
    },
    "ExecuteDestructive": {
      "Comment": "Run the destructive action (key deactivation, etc.) only when approved",
      "Type": "Task",
      "Resource": "${ExecuteDestructiveFunctionArn}",
      "ResultPath": "$.remediation",
      "Next": "NotifyAndTicket",
      "Catch": [{ "ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "NotifyAndTicket" }]
    },

    "NotifyAndTicket": {
      "Comment": "Notify the result, file a ticket, and end (every path passes through here)",
      "Type": "Task",
      "Resource": "${NotifyTicketFunctionArn}",
      "End": true
    }
  }
}
```

Key design points:

- **Always merge into `NotifyAndTicket` with `Catch`**: even if containment fails, always emit a notification and a ticket that a human will notice (don't create silence).
- **"Wait" for human approval with `waitForTaskToken`**: a Standard workflow is billed per state transition, so the execution can sit paused at no extra charge until the task token is returned. The approval Lambda sends an approval link to Slack/email, and when a human clicks it, the execution resumes via `SendTaskSuccess` (use this gate for destructive operations only).
- **Apply `Retry` only to Enrich, which is safe to re-run**: the containment tasks are made idempotent inside each Lambda (below), so a double start on the state-machine side is also safe. Note that a retry arriving after the idempotency claim has been written is treated as a duplicate and falls back to notify-only, which errs on the safe side.
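The callback side of the approval gate could be sketched as follows. It assumes a boto3 Step Functions client is passed in and that the task token reaches this code through the approval link; the function name `resolve_approval`, the output shape, and the error name `ApprovalRejected` are hypothetical.

```python
import json
from typing import Any


def resolve_approval(sfn: Any, task_token: str, approved: bool, approver: str) -> str:
    """Resume the paused execution with the human's decision."""
    if approved:
        sfn.send_task_success(
            taskToken=task_token,
            output=json.dumps({"approved": True, "approver": approver}),
        )
        return "approved"
    # A rejection surfaces as a task failure on the waiting state.
    sfn.send_task_failure(
        taskToken=task_token,
        error="ApprovalRejected",
        cause=f"rejected by {approver}",
    )
    return "rejected"
```

Whether a rejection ends up at the notification step depends on the `Catch` configured on the waiting state, so it is worth testing both branches end to end.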

### 6.2 The enrich + route dispatcher (Python, idempotent)

A **decision-only** dispatcher, separate from the pillar article's "isolate with a single Lambda." It **has no side effects**, just enriching the finding and returning a `decision` (mode and resourceType). Easy to test, and it gets by with least privilege (read-only + DynamoDB only).

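```python
"""Enrich/route dispatcher that enriches a GuardDuty finding and decides the response level.

Design principles:
  - No side effects: a near-pure Lambda that only returns a decision. Containment is delegated to separate Lambdas.
  - Idempotent: finding.id is the idempotency key, and updatedAt rejects stale resends (DynamoDB conditional write).
  - Narrow scope: auto-containment only for types on CONTAIN_ALLOWLIST with severity High or above.
  - Destructive goes to approval: types on DESTRUCTIVE_ALLOWLIST return 'approve' and are never auto-executed.
  - Observable: structured logs. No sensitive values (credentials, PII).
"""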
from __future__ import annotations

import json
import logging
import os
import time
from typing import Any, Final, TypedDict

import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger()
logger.setLevel(logging.INFO)

ddb = boto3.client("dynamodb")
IDEMPOTENCY_TABLE: Final[str] = os.environ["IDEMPOTENCY_TABLE"]
IDEMPOTENCY_TTL_SECONDS: Final[int] = 7 * 24 * 3600  # Auto-expire after 7 days (deleted via the TTL attribute)

# Finding types allowed for automatic containment (idempotent, reversible).
CONTAIN_ALLOWLIST: Final[frozenset[str]] = frozenset({
    "UnauthorizedAccess:EC2/MaliciousIPCaller.Custom",
    "CryptoCurrency:EC2/BitcoinTool.B!DNS",
    "Backdoor:EC2/C&CActivity.B!DNS",
    "Policy:S3/BucketPublicAccessGranted",
    "Policy:S3/BucketAnonymousAccessGranted",
})
# Destructive operations (key deactivation, etc.): types that require human approval. Never auto-executed.
DESTRUCTIVE_ALLOWLIST: Final[frozenset[str]] = frozenset({
    "UnauthorizedAccess:IAMUser/InstanceCredentialExfiltration.OutsideAWS",
    "CredentialAccess:IAMUser/AnomalousBehavior",
})


class Decision(TypedDict):
    mode: str          # "notify" | "contain" | "approve"
    resourceType: str  # "Instance" | "S3Bucket" | "Kubernetes" | "AccessKey" | ...
    reason: str


def handler(event: dict[str, Any], _context: object) -> dict[str, Any]:
    detail = event["detail"]
    finding_id: str = detail["id"]
    updated_at: str = detail.get("updatedAt", detail.get("createdAt", ""))
    finding_type: str = detail["type"]
    severity: float = float(detail["severity"])
    resource_type: str = detail.get("resource", {}).get("resourceType", "Unknown")

    log = {"finding_id": finding_id, "type": finding_type,
           "severity": severity, "resourceType": resource_type}

    # ── Idempotency guard ──
    # Never process the same finding_id twice with the same (or an older) updatedAt.
    # Re-entry via at-least-once delivery or a Step Functions retry still yields one side effect.
    if not _claim_finding(finding_id, updated_at):
        logger.info(json.dumps({**log, "decision": "duplicate-skip"}))
        return {"finding": _safe_finding(detail), "decision": Decision(
            mode="notify", resourceType=resource_type, reason="duplicate (already processed)")}

    # ── 判断 ──
    if finding_type in DESTRUCTIVE_ALLOWLIST and severity >= 7.0:
        decision = Decision(mode="approve", resourceType=resource_type,
                            reason="destructive remediation requires human approval")
    elif finding_type in CONTAIN_ALLOWLIST and severity >= 7.0:
        decision = Decision(mode="contain", resourceType=resource_type,
                            reason="allowlisted reversible containment")
    else:
        # Not on an allowlist, Medium or below, or unknown type -> notify + ticket only.
        decision = Decision(mode="notify", resourceType=resource_type,
                            reason="notify-only (not in allowlist or below threshold)")

    logger.info(json.dumps({**log, "decision": decision["mode"], "reason": decision["reason"]}))
    return {"finding": _safe_finding(detail), "decision": decision}


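def _claim_finding(finding_id: str, updated_at: str) -> bool:
    """Claim the idempotency token. Return True only for a new finding or a newer updatedAt.

    The conditional write overwrites only when the finding is unprocessed or the
    received updatedAt is newer. If the same or a newer version has already been
    processed, return False (= duplicate).
    """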
    now = int(time.time())
    try:
        ddb.put_item(
            TableName=IDEMPOTENCY_TABLE,
            Item={
                "finding_id": {"S": finding_id},
                "updated_at": {"S": updated_at},
                "ttl": {"N": str(now + IDEMPOTENCY_TTL_SECONDS)},
            },
            # Write only if unregistered, or registered but this updated_at is newer.
            ConditionExpression="attribute_not_exists(finding_id) OR updated_at < :u",
            ExpressionAttributeValues={":u": {"S": updated_at}},
        )
        return True
    except ClientError as exc:
        if exc.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # Same or older resend -> skip
        raise


def _safe_finding(detail: dict[str, Any]) -> dict[str, Any]:
    """Drop potentially sensitive fields from the finding passed downstream (keep them out of logs and notifications)."""
    keep = ("id", "type", "severity", "accountId", "region", "title", "updatedAt")
    slim = {k: detail.get(k) for k in keep}
    res = detail.get("resource", {})
    slim["resourceType"] = res.get("resourceType")
    slim["instanceId"] = res.get("instanceDetails", {}).get("instanceId")
    slim["bucketName"] = (res.get("s3BucketDetails") or [{}])[0].get("name")
    return slim
```

The design choices in this code, made explicit:

- **Idempotency is guaranteed by DynamoDB's conditional write**: keyed on `finding_id`, write only when "unprocessed or a newer `updatedAt`." From the second time on, it becomes `duplicate-skip` via `ConditionalCheckFailedException` and containment doesn't run. The record auto-expires via TTL. This is the same form of defense as payment's "process the same event twice but charge once."
- **Separating decide and remediate (SRP)**: this Lambda needs neither `ec2:*` nor `iam:*`. Its permission is **only put and read to DynamoDB.** The strong remediation permissions are given only to the per-type containment Lambdas.
- **Don't flow sensitive data downstream (observability × security)**: with `_safe_finding`, narrow to only the necessary fields so credentials and PII don't mix into notifications or logs.

### 6.3 Make the containment Lambda idempotent too

The per-type containment Lambda (e.g., EC2) is **itself idempotent** like the pillar article's isolation logic (guard with a quarantine tag, and the second time it's a no-op with `already-quarantined`). **Holding "the dispatcher's idempotency" and "the containment's idempotency" doubly** makes it safe wherever a resend/retry occurs. This is multi-layer idempotency defense.

---

## 7. Least privilege, observability, testing: production quality is decided by the plumbing

### 7.1 The responder role is "per type, minimal"

Automated-response IAM is an attractive target for attackers: whoever compromises it can isolate resources and even delete keys. So **split the roles per type and keep each Lambda's permissions minimal.** An example for the EC2 containment Lambda:

```hcl
# Execution role for the EC2 containment Lambda: only the actions needed to isolate an instance.
data "aws_iam_policy_document" "contain_ec2" {
  statement {
    sid       = "DescribeForQuarantine"
    actions   = ["ec2:DescribeInstances"]
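    resources = ["*"] # Describe can't be scoped per ARN, so compensate with a condition
    condition {
      test     = "StringEquals"
      variable = "aws:RequestedRegion"
      values   = [var.region]
    }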
  }
  statement {
    sid    = "QuarantineActions"
    actions = [
      "ec2:ModifyNetworkInterfaceAttribute", # reattach the ENI to the isolation SG
      "ec2:CreateTags",                       # quarantine tag used as the idempotency guard
    ]
    resources = ["*"]
    condition {
      test     = "StringEquals"
      variable = "aws:RequestedRegion"
      values   = [var.region] # restrict to the target region
    }
  }
  # Important: grant no terminate / delete actions. Destructive operations belong to a separate role, and only after approval.
}
```

> **Make the EventBridge → Step Functions role minimal too**: narrow `resources` so EventBridge can start **only that one state machine.** A role that lets EventBridge start any state machine is over-privileged.

### 7.2 Observability: structured logs, console deep links, zero sensitive data

- Emit `finding_id` / `type` / `severity` / `decision` / `action` in **structured logs (JSON)** so you can later mechanically trace "which finding got what response."
- Put a **console deep link** in notifications so a human can reach the finding in one click (`https://console.aws.amazon.com/guardduty/home?region=<region>#/findings?search=id%3D<id>`).
- **Don't emit sensitive values (credentials, PII, raw resource details) in logs or notifications** (§6.2's `_safe_finding`). Make observability and security compatible.
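
The first two points can be sketched in a few lines of standard-library Python. The helper names `log_line` and `finding_console_url` are hypothetical, and the URL simply follows the pattern quoted above. The log record is built from a fixed field list, so any other key in the finding is left out by construction.

```python
import json
from urllib.parse import quote


def finding_console_url(region: str, finding_id: str) -> str:
    """Build the console deep link using the URL pattern quoted in the text."""
    return (
        f"https://console.aws.amazon.com/guardduty/home?region={quote(region)}"
        f"#/findings?search=id%3D{quote(finding_id)}"
    )


def log_line(finding: dict, decision: str, action: str) -> str:
    """Serialize one decision as a single JSON log line with a fixed field set."""
    record = {
        "finding_id": finding.get("id"),
        "type": finding.get("type"),
        "severity": finding.get("severity"),
        "decision": decision,
        "action": action,
    }
    return json.dumps(record, sort_keys=True)


finding = {
    "id": "abc123",
    "type": "UnauthorizedAccess:EC2/MaliciousIPCaller.Custom",
    "severity": 8,
    "secret": "never-logged",  # illustrative extra field; not copied into the record
}
print(log_line(finding, "contain", "quarantine-sg-swap"))
print(finding_console_url("ap-northeast-1", "abc123"))
```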

### 7.3 Testing: verify the path without waiting for a real attack

Here's where the "build the verification path first" principle comes in. GuardDuty can **intentionally generate sample findings.** Use this to run the EventBridge → Step Functions → containment path before a real attack.

```bash
# Fire a sample of each finding type and verify that the SOAR pipeline behaves as expected.
aws guardduty create-sample-findings \
  --detector-id "$DETECTOR_ID" \
  --finding-types "UnauthorizedAccess:EC2/MaliciousIPCaller.Custom" \
                  "Policy:S3/BucketPublicAccessGranted" \
                  "UnauthorizedAccess:IAMUser/InstanceCredentialExfiltration.OutsideAWS"
```

The verification hierarchy:

1. **Unit tests**: cover `handler`'s decision logic exhaustively with a sample finding JSON as input. Stub the `boto3` client and hit all the `contain` / `approve` / `notify` / `duplicate-skip` branches (fast because there are no side effects).
2. **Idempotency test**: input the same finding twice and confirm the second becomes `duplicate-skip`.
3. **Path test (E2E)**: send a real event through with `create-sample-findings` and confirm in the Step Functions execution history that each state transitioned as expected. **Start with the containment Lambda set to `DRY_RUN=true`**, so that production traffic is decided on but not acted on, and verify the decisions before enabling remediation.

> **Phased rollout**: `DRY_RUN` mode is a safety valve against the biggest fear of putting automation into production: a wrong response. First run in DRY_RUN for several weeks and observe which findings would have received which response had automation been on, confirm that there are zero wrong responses, and only then turn remediation on. This is the same "observe, then enforce" thinking as [raising the WAF from Count to Block](/blog/waf-defense-in-depth-aws-waf-cloud-armor-owasp-guide).
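
One way to wire such a switch is sketched below. The function names `is_dry_run` and `run_action` are hypothetical, and treating an unset `DRY_RUN` as dry run is a design assumption of this sketch rather than something the article prescribes. The point is that the decision path runs in both modes, and only the side effect is gated.

```python
import os
from typing import Callable, Mapping, Optional


def is_dry_run(env: Optional[Mapping[str, str]] = None) -> bool:
    """Dry run unless DRY_RUN is explicitly "false" (fail-safe default, an assumption)."""
    source = os.environ if env is None else env
    return source.get("DRY_RUN", "true").strip().lower() != "false"


def run_action(
    name: str,
    action: Callable[[], str],
    env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Report what would happen in dry run; execute the action only when enforcing."""
    if is_dry_run(env):
        return {"action": name, "mode": "dry-run", "result": "skipped"}
    return {"action": name, "mode": "enforce", "result": action()}


calls = []


def swap_sg() -> str:
    """Placeholder for the real containment side effect."""
    calls.append("swap")
    return "quarantined"


observed = run_action("quarantine-sg-swap", swap_sg, env={"DRY_RUN": "true"})
enforced = run_action("quarantine-sg-swap", swap_sg, env={"DRY_RUN": "false"})
print(observed["result"], enforced["result"], calls)  # skipped quarantined ['swap']
```

Logging the returned record in both modes gives the "what would have happened" trail that the observation period relies on.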

---

## 8. The manual-trigger entry point: a Security Hub custom action

For responses you're scared to fully automate (key deletion, production termination) and for types not on the auto-allowlist, provide an **entry point where an analyst can pull the trigger from the console.** That entry point is a **Security Hub custom action.**

The mechanism: create a custom action in Security Hub (e.g., `Quarantine`), and when an analyst selects a finding in the console and executes that action, **an EventBridge event fires.** The event has the following shape (per the official documentation).

```json
{
  "version": "0",
  "detail-type": "Security Hub Findings - Custom Action",
  "source": "aws.securityhub",
  "account": "111122223333",
  "region": "ap-northeast-1",
  "resources": [
    "arn:aws:securityhub:ap-northeast-1:111122223333:action/custom/quarantine"
  ],
  "detail": {
    "actionName": "quarantine",
    "actionDescription": "Quarantine the affected resource",
    "findings": [ { "...": "one finding in ASFF format" } ]
  }
}
```

The EventBridge rule that catches this event **matches on the custom action's ARN (`resources`)** and starts the same Step Functions.

```hcl
# Pick up the manual Security Hub custom action ("Quarantine") and route it into the same SOAR plumbing.
resource "aws_cloudwatch_event_rule" "sh_custom_action" {
  name = "securityhub-quarantine-custom-action"
  event_pattern = jsonencode({
    source        = ["aws.securityhub"]
    "detail-type" = ["Security Hub Findings - Custom Action"]
    resources     = ["arn:aws:securityhub:${var.region}:${var.account_id}:action/custom/quarantine"]
  })
}

resource "aws_cloudwatch_event_target" "sh_to_sfn" {
  rule     = aws_cloudwatch_event_rule.sh_custom_action.name
  arn      = aws_sfn_state_machine.responder.arn
  role_arn = aws_iam_role.eventbridge_invoke_sfn.arn
}
```

> **Design implication**: this **consolidates "responses that happen automatically (GuardDuty → EventBridge)" and "responses a human triggers (Security Hub → EventBridge)" into the same Step Functions.** Because there is only one pipeline, the logic is not duplicated (no DRY violation) and testing happens in one place. If you aggregate GuardDuty findings into Security Hub, you can also offer **the same "Quarantine button"** for other detection sources (Macie, Inspector, third parties).
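
Sharing one state machine means the manual event has to be mapped onto whatever input the playbooks expect, because custom-action events carry ASFF findings rather than the native GuardDuty `detail`. The sketch below shows one possible normalizer. The function name `normalize_custom_action` and the output keys are hypothetical; the input fields it reads (`Id`, `AwsAccountId`, `Severity.Label`, `Resources[].Type`, `Resources[].Id`) are standard ASFF attributes.

```python
from typing import Any


def normalize_custom_action(event: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten a Security Hub custom-action event into one request per finding."""
    detail = event.get("detail", {})
    requests = []
    for finding in detail.get("findings", []):
        resources = finding.get("Resources") or [{}]
        requests.append(
            {
                "trigger": "manual",  # lets the state machine tell manual from automatic runs
                "actionName": detail.get("actionName"),
                "findingId": finding.get("Id"),
                "accountId": finding.get("AwsAccountId"),
                "severityLabel": (finding.get("Severity") or {}).get("Label"),
                "resourceType": resources[0].get("Type"),
                "resourceId": resources[0].get("Id"),
            }
        )
    return requests


sample_event = {
    "detail-type": "Security Hub Findings - Custom Action",
    "source": "aws.securityhub",
    "detail": {
        "actionName": "quarantine",
        "findings": [
            {
                "Id": "example-finding-id",
                "AwsAccountId": "111122223333",
                "Severity": {"Label": "HIGH"},
                "Resources": [
                    {
                        "Type": "AwsEc2Instance",
                        "Id": "arn:aws:ec2:ap-northeast-1:111122223333:instance/i-0123456789abcdef0",
                    }
                ],
            }
        ],
    },
}
requests = normalize_custom_action(sample_event)
print(requests[0]["trigger"], requests[0]["resourceType"])  # manual AwsEc2Instance
```

Because an analyst can select several findings at once, the function returns a list, and the state machine (or a Map state) would handle each request separately.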

---

## 9. Summary: a GuardDuty automated-incident-response cheat sheet

A quick reference for when you're unsure.

- **What you build**: "deciding plumbing" that turns findings into **enrich → decide → contain → notify → ticket.** GuardDuty is detection; the value of response is decided by **MTTR.**
- **EventBridge routing**: `source=aws.guardduty` / `detail-type="GuardDuty Finding"`. **`detail.severity` is numeric** → pick up High and above mechanically with `{"numeric":[">=",7]}`. `AttackSequence:*` is always Critical = the top-priority trigger. Findings auto-archived by suppression don't reach EventBridge.
- **The latency trap**: new findings arrive in about 5 minutes, but **recurrences default to 6 hours.** For automated response, set `finding_publishing_frequency = FIFTEEN_MINUTES` (the delegated administrator sets it, and it applies to all member accounts).
- **Principles of response design**: **notify broadly (all High and above), contain narrowly with a type allowlist (reversible operations only), interpose human approval for destructive operations.**
- **Idempotency**: EventBridge is **at-least-once.** Make the finding's **`id` the idempotency key**, and reject old resends with `updatedAt` (conditional write). Make the containment Lambda idempotent as well, as a second layer. It is the same pattern as "zero double charges" in payments.
- **Playbooks (per type)**: EC2 → quarantine-SG swap (existing connections aren't severed by the SG). IAM AccessKey → disabling/revocation is **after approval** (distinguish `AKIA`/`ASIA`). S3 → Block Public Access. EKS → cordon + pod isolation.
- **Orchestration**: **Step Functions.** Wait for human approval with `waitForTaskToken`, and always merge into NotifyAndTicket with `Catch` (don't create silence).
- **Production quality**: **alert on the DLQ not being empty**, the responder role is **least-privilege per type**, structured logs + deep links, zero sensitive data, **path verification with `create-sample-findings`**, and verify decision-making first in production with `DRY_RUN`.
- **Manual trigger**: an analyst starts the same plumbing with a **Security Hub custom action** (`detail-type="Security Hub Findings - Custom Action"`).

Whether you end GuardDuty at a "red dashboard" or make it "a mechanism where detection turns into immediate, safe response" is decided by whether you can build **plumbing that turns findings into idempotent, scope-narrowed, reversible automated response.** The greatest leverage is in the downstream plumbing, more than in detection itself.

On the multi-account [serverless payment platform](/case-studies/payment-platform-reliability), I **implemented IAM, observability, and DR across a foundation handling real money, carbon credits, and regional currencies**, and guaranteed "correctness" not by operational carefulness but by **the structure and idempotency of code**—the thinking of "charge once even if you receive the same payment event twice" in an at-least-once world transfers directly to automated incident response. I design GuardDuty automated response with the same philosophy—**① mechanically separate severity and type with EventBridge, ② visualize the "deciding plumbing" with Step Functions, and ③ make the response idempotent with the finding id as the key, allowlisted, reversible, and approval-gated.** I build the plumbing that turns detection into action all the way to a form that rides on operations. Combined with the operational design of incident response itself ([runbooks, on-call, postmortems](/blog/incident-response-runbook-postmortem-oncall-sre-guide)), you get a structure where automated response and human judgment mesh.

**How far should you entrust GuardDuty to automated response, where should the human-approval gate sit, and how do you design it to be idempotent, least-privilege, and reversible? From EventBridge routing through Step Functions orchestration, per-resource-type playbooks, testing, and phased rollout, I can support you quickly and safely as a one-person team working with generative AI (Claude Code).** Feel free to reach out, even from the requirements-organizing stage.

---

### References (official documentation)

- [Processing GuardDuty findings with Amazon EventBridge](https://docs.aws.amazon.com/guardduty/latest/ug/guardduty_findings_cloudwatch.html) — the finding event schema (source/detail-type/detail), delivery frequency (new is near real-time, recurrences 15min/1hour/6hours default), numeric severity matching, the relationship of suppression archiving and EventBridge
- [Remediating detected GuardDuty security findings](https://docs.aws.amazon.com/guardduty/latest/ug/guardduty_remediate.html) — recommended responses per resource type (EC2/S3/AccessKey/ECS/EKS/RDS/Lambda)
- [Remediating a potentially compromised EC2 instance](https://docs.aws.amazon.com/guardduty/latest/ug/compromised-ec2.html) — Isolation SG (don't allow 0.0.0.0/0), remove other SGs, existing connections aren't severed by the SG
- [Remediating potentially compromised AWS credentials](https://docs.aws.amazon.com/guardduty/latest/ug/compromised-creds.html) — distinguishing `AKIA`/`ASIA`, confirming permissions, confirming legitimate use, rotation
- [Remediating a potentially compromised S3 bucket](https://docs.aws.amazon.com/guardduty/latest/ug/compromised-s3.html) — `ANONYMOUS_PRINCIPAL`, S3 Block Public Access, bucket-policy review
- [Severity levels of GuardDuty findings](https://docs.aws.amazon.com/guardduty/latest/ug/guardduty_findings-severity.html) — the definitions of Critical 9.0–10.0 / High 7.0–8.9 / Medium 4.0–6.9 / Low 1.0–3.9
- [Using EventBridge for automated response and remediation (Security Hub)](https://docs.aws.amazon.com/securityhub/latest/userguide/securityhub-cloudwatch-events.html) — Security Hub → EventBridge, automatic/custom actions, starting Step Functions
- [EventBridge event formats for Security Hub CSPM](https://docs.aws.amazon.com/securityhub/latest/userguide/securityhub-cwe-event-formats.html) — the event format of `Security Hub Findings - Custom Action` (actionName / findings / custom action ARN)