# Building Idempotent Async Processing with SQS + Lambda + EventBridge: Duplicate, Ordering, and DLQ Design on the At-Least-Once Premise

> An implementation guide for designing AWS serverless, event-driven async processing (SQS+Lambda+EventBridge) at production quality. Explained with real code: idempotent consumers due to at-least-once delivery, visibility timeout, DLQ and reprocessing, FIFO ordering/deduplication, and partial batch failure (ReportBatchItemFailures).

- Published: 2026-06-24
- Author: 友田 陽大
- Tags: AWS, SQS, サーバーレス, 冪等性, アーキテクチャ設計
- URL: https://tomodahinata.com/en/blog/aws-sqs-lambda-eventbridge-idempotent-async-processing-guide
- Category: Reliability, async & real-time
- Pillar guide: https://tomodahinata.com/en/blog/transactional-outbox-pattern-reliable-event-publishing-guide

## Key points

- SQS is at-least-once delivery and duplicates are by spec; building an idempotent consumer is the starting point of all design
- With ReportBatchItemFailures, return only the failed ids in batchItemFailures, and don't throw an exception upward to fail the whole batch
- Idempotency is implemented with an idempotency key + a conditional insert via attribute_not_exists + TTL, or with Powertools' @idempotent
- If ordering or deduplication is critical, choose FIFO, but because redelivery on visibility-timeout expiry still happens, idempotency is needed even with FIFO
- Isolate poison messages to a DLQ with maxReceiveCount, fix the cause and then redrive, and alert immediately on a DLQ count of 1

---

"I want to offload heavy processing to a queue and do it asynchronously" — as a requirement it's one line. But the moment you try to put it into production, the things to decide explode at once. **What do you do if the same message comes twice? What if processing drags on and the visibility timeout expires? Where do you offload a "poison message" that fails no matter how many times you try? How do you protect processing that breaks if ordering collapses (balance updates, inventory allocation)?**

This article is an implementation guide for assembling AWS serverless, event-driven async processing into **production quality** with **SQS + Lambda + EventBridge + DLQ**. As the subject matter, I'll weave in design decisions from an environmental / carbon-credit / regional-currency multi-tenant payment platform I built as a core developer (one of three principal developers) ([the reliability design of a serverless payment platform](/case-studies/payment-platform-reliability)). It's a real example where I assembled employee-card **monthly batch billing and CO2 aggregation** as **order-guaranteed, idempotent, automatically-reprocessed via SQS FIFO + a dead-letter queue (DLQ)**, achieving **0 double charges / balance inconsistencies during production operation**.

> **The rule of this article**: The specs, parameter names, and defaults of SQS / Lambda / EventBridge / Powertools are based on the **AWS official documentation (as of June 2026)**. Because quotas and defaults get revised, always check the latest values on each official page before going to production. And the most important premise: **SQS is "at-least-once" delivery. The same message arriving twice is not an "anomaly" but the "spec."** So building the consumer to be **idempotent** is the starting point of all design.

---

## 0. Mental model: a queue = "buffer + retry," so idempotency is the iron rule

First, let me pin down the five mental models that run through this article. Get these to sink in, and all the implementation that follows is just their consequences.

- **A queue = a buffer + retry mechanism**. It temporally decouples the sender (producer) and the processor (consumer), and the message doesn't disappear even if the processor crashes. On failure, it's redelivered.
- **At-least-once = duplicates are a premise**. The official docs state clearly, for standard queues, "at-least-once message delivery," and for the Lambda event source, "process each event at least once, and duplicate processing of records can occur." So build it so that "the result is the same even if the same message comes twice." This is **idempotent**.
- **The visibility timeout = the grace period that prevents double delivery during processing**. When you receive a message, it becomes invisible to other consumers during that time. If you don't delete it within the deadline, it becomes visible again and is reprocessed.
- **The DLQ = the isolation destination for "messages that fail no matter how many times you try."** It removes a message that failed `maxReceiveCount` times from the main flow and offloads it to a place where it can be investigated and reprocessed. A safety valve so a poison message doesn't clog the whole processing line.
- **Idempotency is "the result being unique," ordering guarantee is "the order of processing."** These two are different things. Standard queues are best-effort ordering, FIFO queues are strict ordering. Choose by requirement (Section 4).

> This site has a separate production guide for a **resident-worker-pool type (Celery + Redis)** async task queue ([Celery + Redis async task queue production guide](/blog/celery-redis-production-async-task-queue-guide)). **That one is the "keep workers resident and run them" type, this article is the "serverless, event-driven" type**, and they're complementary. If you don't want a resident process / want auto-following of spikes / want to lean operations toward managed, this article's configuration; if you want to grasp fine control of workers and broker selection, that one suits.

---

## 1. The big picture: what the four parts handle

Serverless event-driven async processing is a combination of four parts with different roles. Understand it split by SRP (single responsibility), and you won't get lost about what to write where.

| Part | Role (single responsibility) | Behavior on failure |
| --- | --- | --- |
| **SQS (standard / FIFO)** | Buffering and redelivery of messages. At-least-once delivery | Retained until deleted. Redelivered on visibility-timeout expiry |
| **Lambda (event source mapping)** | Polls the queue and synchronously invokes the function in batches | A failed batch reappears after the visibility timeout. Retry & backoff |
| **DLQ (SQS)** | Isolates poison messages exceeding `maxReceiveCount` | Removes from the main flow, awaits investigation / reprocessing (redrive) |
| **EventBridge** | Event routing (bus) / scheduled invocation (Scheduler) | Can set retries and max retention time for schedule failures too |

There are two streams of data flow.

1. **On-demand**: some event (API reception, S3 upload, etc.) → enqueue a message to SQS → Lambda processes it.
2. **Scheduled batch**: EventBridge Scheduler launches on cron → enumerates targets and fans out to SQS → Lambda processes in parallel. This article's payment platform's "monthly batch billing" is this one.

In the official docs, EventBridge has two ways to process and deliver events — **event buses (a router) and pipes (point-to-point integration)** — and further provides **EventBridge Scheduler** (scheduled / one-time invocation via cron / rate expressions). This article's "monthly batch launch" uses Scheduler's cron.

---

## 2. Building an idempotent SQS consumer (the heart of this article)

In an at-least-once world, **an idempotent consumer is the very foundation of reliability**. Let's build it in order.

### 2.1 First, the naive form: returning partial batch failures

Lambda polls SQS and passes **multiple messages (a batch) per invocation**. The official default is "up to 10 messages." Here there's a landmine every beginner steps on.

> **The official warning**: if the function throws an error during batch processing, **by default the whole batch, including already-succeeded messages**, reappears after the visibility timeout. As a result, you can process the same message many times.

That is, "1 of 10 failed → all redone even though 9 had succeeded." What prevents this is **partial batch failure (partial batch response)**. Specify `ReportBatchItemFailures` in the event source mapping's `FunctionResponseTypes`, and the function returns **only the IDs of the failed messages**.

```python
"""SQS consumer: 部分バッチ失敗を返す最小形。
成功したメッセージは削除させ、失敗した id だけを再表示させる。"""

def lambda_handler(event, context):
    batch_item_failures = []

    for record in event["Records"]:
        try:
            process_one(record)  # ここが本体（次節で冪等にする）
        except Exception:
            # 失敗した messageId だけを itemIdentifier として積む
            batch_item_failures.append({"itemIdentifier": record["messageId"]})

    # この形を返すと、ここに載った id だけが再表示される
    return {"batchItemFailures": batch_item_failures}
```

The shape of the returned JSON is strictly defined by the official docs. Each element of `batchItemFailures` must have an `itemIdentifier` (= the failed `messageId`). For example, if `id2` and `id4` of `id1`–`id5` failed, you return it like this.

```json
{
  "batchItemFailures": [
    { "itemIdentifier": "id2" },
    { "itemIdentifier": "id4" }
  ]
}
```

**The conditions for being treated as success and the conditions for total failure** are also stated clearly by the official docs. Miss these and you get "reprocess everything" or "failures swallowed," so let me pin them down in a table.

| The function's return value | Lambda's interpretation |
| --- | --- |
| An empty `batchItemFailures` list / null | The whole batch is **succeeded** (all deleted) |
| Enumerate failed ids in `itemIdentifier` | Only those ids **reappear**, the rest are deleted |
| **Throw an exception as-is** | The **whole batch fails** (all reappear) |
| Invalid JSON / an empty string or null `itemIdentifier` / a nonexistent id | Treated as **whole-batch failure** |

> **A design pressure point**: if you use partial batch failure, **catch exceptions inside the handler and push them onto `batchItemFailures`**. Carelessly throw an exception all the way up, and the partial-failure specification is nullified and becomes whole-batch failure.

### 2.2 Idempotency ①: idempotency key + conditional write (the self-built royal road)

Partial batch failure reduces "useless reprocessing of succeeded items," but **it can't make redelivery itself zero**. So make the processing body (`process_one`) idempotent. This is the royal road I actually took on the payment platform.

Use the **idempotency key issued by the client** as a unique identifier, and at the entrance of processing attempt a **conditional insert (`attribute_not_exists`)**. If the same key already exists, regard it as "already processed" and skip — this alone stops double execution in principle.

```python
"""冪等化①: 冪等性キーの条件付き挿入で二重実行を阻止する。
DynamoDB の attribute_not_exists 条件で『初回だけ書ける』を実現する。
キーは『冪等性キーをソートキーに連結』し、TTL で自動失効させる。"""
import time
import boto3
from botocore.exceptions import ClientError

_ddb = boto3.resource("dynamodb")
_table = _ddb.Table("idempotency")
TTL_SECONDS = 90 * 24 * 60 * 60  # 既定90日で自動失効

class AlreadyProcessed(Exception):
    """同じ冪等性キーで処理済み。スキップしてよい（冪等）。"""

def claim_idempotency(tenant_id: str, idempotency_key: str) -> None:
    """初回だけ書き込みに成功する。2回目以降は ConditionalCheckFailed。"""
    # パーティション=テナント、ソートキー=冪等性キーを連結（マルチテナント分離）
    sort_key = f"charge#{idempotency_key}"
    try:
        _table.put_item(
            Item={
                "pk": tenant_id,
                "sk": sort_key,
                "ttl": int(time.time()) + TTL_SECONDS,  # 自動失効
            },
            # 「まだ無いときだけ書く」= 条件付き挿入。これが冪等性の要
            ConditionExpression="attribute_not_exists(pk) AND attribute_not_exists(sk)",
        )
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            raise AlreadyProcessed(sort_key) from e  # 意味論的失敗 → 再試行しない
        raise  # それ以外（スロットリング等）は上位へ
```

There are three design points.

1. **The idempotency key is client-issued.** Generate it server-side and "resend = a different key," so you can't reject duplicates. The iron rule is that the issuer decides "this operation is this key" and carries **the same key even on resend**.
2. **Concatenate the key into the sort key**, and make the partition key the tenant, to reconcile multi-tenant data isolation and the idempotency check in one table.
3. **Auto-expire with TTL (default 90 days).** You don't need to hold idempotency records forever. Take a sufficiently long window while having TTL automatically sweep them, to hold cost and table size down (cost efficiency).

### 2.3 Idempotency ②: AWS Lambda Powertools' `@idempotent` (the recommended shortcut)

The self-built conditional write "works," but **reinventing it in every project is a DRY violation**. AWS officially provides the **Lambda Powertools idempotency utility**, which lets you implement the above "INPROGRESS lock → COMPLETE cache → TTL expiry" with a single decorator. For new builds, considering this first is the standard.

```python
"""冪等化②: Powertools の @idempotent_function。
DynamoDB を永続層に、ペイロードのハッシュを冪等キーにして二重実行を防ぐ。
INPROGRESS で同時実行をロックし、完了後は結果をキャッシュして返す。"""
import os
from aws_lambda_powertools.utilities.idempotency import (
    DynamoDBPersistenceLayer,
    IdempotencyConfig,
    idempotent_function,
)
from aws_lambda_powertools.utilities.typing import LambdaContext

persistence = DynamoDBPersistenceLayer(table_name=os.environ["IDEMPOTENCY_TABLE"])

# event_key_jmespath で「冪等キーにする部分」を指定。
# ここでは body 内の冪等性キーだけをキーにする（メタデータの差で誤判定しないため）。
config = IdempotencyConfig(
    event_key_jmespath='powertools_json(body)."idempotency_key"',
    expires_after_seconds=90 * 24 * 60 * 60,  # 既定3600秒 → 90日に延長
)

@idempotent_function(
    data_keyword_argument="record",
    config=config,
    persistence_store=persistence,
)
def process_one(record: dict):
    # 同じ idempotency_key の再配信ではここは実行されず、前回結果が返る
    charge_employee_card(record)

def lambda_handler(event, context: LambdaContext):
    config.register_lambda_context(context)  # 残時間でタイムアウト保護
    batch_item_failures = []
    for r in event["Records"]:
        try:
            process_one(record=r)  # idempotent_function はキーワード引数必須
        except Exception:
            batch_item_failures.append({"itemIdentifier": r["messageId"]})
    return {"batchItemFailures": batch_item_failures}
```

Let me pin down the official behavior.

- The default `expires_after_seconds` is **3600 seconds (1 hour)**. If you need a longer reprocessing window, extend it explicitly.
- With `event_key_jmespath`, specify **which part of the payload to use as the idempotency key** via JMESPath. If not specified, it uses the whole event as the key. **Using a part where metadata (receive count, etc.) is mixed in as the key misidentifies the same processing as something different**, so the point is to specify only the essential identifier.
- **Concurrent execution** of the same payload makes the second one get an `IdempotencyAlreadyInProgressError` via the `INPROGRESS` lock (preventing swallowing of contention).
- The hash is MD5 by default. The idempotency key is assembled from "the function name + the hash of the payload."

> **Self-built vs. Powertools usage**: if you want to grasp **domain-specific key design** like multi-tenant isolation or "idempotency key + sort-key concatenation," self-built (2.2). If the standard "the same input returns the same result" is enough, Powertools (2.3). On the payment platform I chose the former, because **I wanted to guarantee the tenant boundary and key design myself**. If the requirement is ordinary, don't reinvent the wheel.

### 2.4 Failures you may retry vs. failures you must not

As important as idempotency is the **line between "what you retry and what you fail immediately."** Be sloppy here and you'll either melt cost and latency with useless retries, or endlessly spin a non-retryable failure. The next table generalizes the criteria I took on the payment platform.

| Kind of failure | Example | Retry? | Reason |
| --- | --- | --- | --- |
| **Transient contention / throttling** | `TransactionConflict`, `ProvisionedThroughputExceeded`, 5xx, network drop | Yes (exponential backoff + jitter) | Can succeed given time. Safe if the operation is idempotent |
| **Semantic failure (invalid input)** | `ConditionalCheckFailed` (already processed), 4xx, validation violation | **No (immediate propagation)** | The result is the same no matter how many times. Retry is useless and harmful |
| **Poison message** | A record that becomes an exception no matter how many times processed | No → **isolate to DLQ** | Don't clog the main flow. Isolate and investigate the cause |

On the payment platform, **only on transient contention (`TransactionConflict`)** did I retry "with a base 50ms × 2^n exponential backoff + jitter (±50%), up to 3 times," and **didn't retry semantic failures (`ConditionalCheckFailed`) but propagated them immediately**. In code it's like this.

```python
"""リトライ対象を限定する指数バックオフ＋ジッター。
一時的競合だけを再試行し、意味論的失敗は即座に上げる（fail fast）。"""
import random
import time
from botocore.exceptions import ClientError

RETRYABLE = {"TransactionConflict", "ThrottlingException",
             "ProvisionedThroughputExceededException"}

def with_backoff(fn, *, max_attempts: int = 3, base_ms: float = 50.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ClientError as e:
            code = e.response["Error"]["Code"]
            if code not in RETRYABLE:
                raise  # 意味論的失敗は再試行しても無駄 → 即伝播
            if attempt == max_attempts:
                raise
            # 基点 50ms × 2^(n-1)、ジッター ±50% で群衆突入（thundering herd）を散らす
            backoff = base_ms * (2 ** (attempt - 1))
            jitter = backoff * random.uniform(-0.5, 0.5)
            time.sleep((backoff + jitter) / 1000.0)
```

The jitter (±50%) is to scatter the situation where multiple consumers **retry all at once at the same timing and collide again** (thundering herd). Make the retry interval constant, and failures synchronize and ripple. **Just shifting them randomly** smooths the peaks of contention.

---

## 3. The DLQ: isolating and reprocessing poison messages

Even with an idempotent consumer assembled, **messages that fail no matter how many times you try** appear. Schema mismatches, the disappearance of a reference, unexpected input — let these retry endlessly in the main flow and even healthy messages get delayed. This is the **poison message** problem, and the solution is the **DLQ**.

### 3.1 How the DLQ works: `maxReceiveCount` and `RedrivePolicy`

The official definition is simple. Set a **redrive policy** on the source queue and specify `maxReceiveCount` (the number of times a consumer can receive a message before it's moved to the DLQ). If it's received `maxReceiveCount` times and not deleted, the message moves to the DLQ.

```yaml
# SAM / CloudFormation: 本流キュー + DLQ + redrive policy
Resources:
  ChargeDLQ:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: charge-dlq.fifo
      FifoQueue: true
      # 公式の推奨: DLQ の保持期間は本流より長く取る
      MessageRetentionPeriod: 1209600  # 14日（最大）

  ChargeQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: charge.fifo
      FifoQueue: true
      ContentBasedDeduplication: false  # 明示的な dedup id を使う（後述）
      VisibilityTimeout: 180            # consumer の最大処理時間に合わせる（第5章）
      RedrivePolicy:
        deadLetterTargetArn: !GetAtt ChargeDLQ.Arn
        maxReceiveCount: 5              # 5回失敗したら DLQ へ隔離
```

Three operational pressure points the official docs state clearly.

- **The DLQ must be in the same AWS account and Region as the source queue**, and **the same type** (a FIFO DLQ for FIFO, a standard DLQ for standard).
- **Take the DLQ's retention period longer than the main flow's** as a best practice. Since standard queues expire based on "the original enqueue timestamp," a message that ate up time in the main flow can be short-lived in the DLQ (the official example: 1 day in the main flow → with a 4-day DLQ retention, it disappears in the DLQ in 3 days).
- **Don't make `maxReceiveCount` too low.** With `1` it's "immediate DLQ on one failure." Even a transient failure gets isolated, so leave sufficient retry room (the official example is `> 3`).

### 3.2 Cautions when using a DLQ with FIFO

The official docs attach a warning to using a DLQ with FIFO. Because **in processing where order has context (an EDL of editing instructions, etc.), if order collapses in the DLQ, the meaning breaks**. For the payment platform's monthly billing, because **order is closed per message group (tenant × month)**, I designed it so that even if a specific message falls into the DLQ, there's no fatal impact on other processing within the group. If you use FIFO + DLQ, always identify "**which group is in trouble if order breaks by going to the DLQ**."

### 3.3 Reprocessing (redrive): fix the cause, then return it

Messages accumulated in the DLQ can be returned to the source queue (or another destination) with a **dead-letter queue redrive**. The operational flow is this.

1. **Isolate**: automatically to the DLQ on exceeding `maxReceiveCount`.
2. **Investigate**: match the DLQ message body against the logs (the exception stack) and identify why it failed. "Whether the consumer was given enough processing time" is also a verification point the official docs raise.
3. **Fix**: fix the code / data / reference mismatch.
4. **Reprocess (redrive)**: fix the cause, then redrive DLQ → source queue. **With an idempotent consumer, even partially-succeeded processing isn't double-executed**, so you can return it with peace of mind.

> **Idempotency and the DLQ work as a set.** Redrive to a non-idempotent consumer and you get the accident of "half-succeeded processing running again." The reason **automatic reprocessing ran safely** on the payment platform is that the idempotency of Section 2 was the foundation.

---

## 4. Standard vs. FIFO: how to design ordering and deduplication

SQS has two types, and **the guarantees of ordering and duplication are fundamentally different**. The choice here decides the subsequent reliability design.

| Aspect | Standard queue | FIFO queue (First-In-First-Out) |
| --- | --- | --- |
| Delivery guarantee | **at-least-once** (duplicates possible) | **exactly-once processing** (no duplicates let in) |
| Ordering | Best-effort (can reorder) | **Strict order** (per `MessageGroupId`) |
| Deduplication | None (idempotency on the consumer side) | **5-minute deduplication** (`MessageDeduplicationId`) |
| Throughput | Nearly unlimited | Lower than standard (a high-throughput mode exists) |
| Main use | High-throughput, order-agnostic processing | Billing, inventory, ledger, etc., where order and duplication are critical |

### 4.1 Don't mix up FIFO's two IDs

If you use FIFO, **absolutely don't confuse** the two IDs of differing natures.

- **`MessageGroupId`**: **the unit of ordering**. Messages with the same group ID are strictly order-processed. For billing, I made "tenant × target month" the group, parallelizing billing across tenants while keeping order within the same tenant.
- **`MessageDeduplicationId`**: **the unit of deduplication**. The official docs say that even if you retry `SendMessage` within the **5-minute deduplication interval**, with the same dedup id SQS doesn't let the duplicate in.

```python
"""FIFO への送信: グループ ID で順序を、dedup id で重複排除を制御する。"""
import boto3

sqs = boto3.client("sqs")

def enqueue_monthly_charge(tenant_id: str, year_month: str, payload: str,
                           idempotency_key: str) -> None:
    sqs.send_message(
        QueueUrl=CHARGE_FIFO_URL,
        MessageBody=payload,
        # 順序の単位: テナント×月。これが同じものは厳密順序で処理される
        MessageGroupId=f"{tenant_id}#{year_month}",
        # 重複排除の単位: 冪等性キー。5分窓内の再送は SQS が弾く
        MessageDeduplicationId=idempotency_key,
    )
```

According to the official docs, there are two ways to set deduplication. Either enable **content-based deduplication** (use the SHA-256 hash of the message body as the dedup id; it looks at the body but not the attributes), or **pass `MessageDeduplicationId` explicitly**. For billing, I made it **pass the idempotency key explicitly as the dedup id**. Because even if the body is identical, a "different billing" gets a different key, so it expresses the intent more accurately than content-based.

> **Correcting an important misconception**: FIFO's "5-minute deduplication" prevents **sender-side duplicates (a retry of the same `SendMessage`)**, and **does not eliminate consumer-side at-least-once redelivery**. Reprocessing due to visibility-timeout expiry happens even with FIFO. So **idempotency of the consumer is needed even with FIFO**. "I made it FIFO so I don't need idempotency" is wrong.

### 4.2 Standard or FIFO, which to choose

The decision criterion is simple.

- **If the result doesn't break even when order breaks, standard.** Notification sending, thumbnail generation, log aggregation, etc. Take throughput and cheapness.
- **If breaking either order or duplication is fatal, FIFO.** Billing, balance updates, inventory allocation, ledgers. The payment platform's monthly billing was FIFO without hesitation.

From the KISS standpoint, you should ask "won't standard + an idempotent consumer suffice first," but **using standard for processing where order affects the result is not KISS but mere cutting corners**. If the requirement demands order, choose FIFO and correctly take on the complexity.

---

## 5. The visibility timeout: decide it in relation to processing time

The visibility timeout is **a value the designer must always decide explicitly**. Leave it to the default and you'll have accidents.

### 5.1 Default and limit (the official numbers)

- **The default is 30 seconds.**
- The count starts the instant it's received. If not deleted within the deadline, it reappears.
- You can **dynamically extend / shorten** it with `ChangeMessageVisibility`. Set `VisibilityTimeout` to `0` to immediately release to other consumers.
- **The limit is "12 hours from the first receive."** Even if you extend, this 12 hours is **not reset**. If you need processing exceeding this, use Step Functions or split the task, the official docs say.
- A standard queue's **in-flight limit is about 120,000**. Exceed it and `OverLimit` error (with long polling, not an error but returning no new messages).

### 5.2 How to decide: processing time + buffer

The official best practice is "**match it to the maximum time it normally takes to process + delete a message**." Too short and it's **redelivered during processing → double processing** (useless duplication / cost), too long and **retry on failure is delayed**.

In practice, the way to decide is this.

1. Measure the consumer's **P99 processing time** (e.g. up to 60 seconds per message).
2. **Add a buffer on top** (about ×3. e.g. 180 seconds). For Lambda, satisfy **function timeout ≤ visibility timeout**. The reverse means it's redelivered while still processing.
3. If processing time is unreadable / variable, **periodically extend with a heartbeat `ChangeMessageVisibility`**. But be conscious of the 12-hour limit.

```yaml
# 可視性タイムアウトは「Lambda 関数タイムアウト × バッファ」で決める
ChargeQueue:
  Type: AWS::SQS::Queue
  Properties:
    VisibilityTimeout: 180   # 関数タイムアウト60秒の3倍。再配信前に余裕を持たせる
```

> **A Lambda-specific note**: according to the official docs, on a **function-code-origin error**, Lambda stops processing and narrows the parallelism, and the message reappears after the visibility timeout. On a **throttling-origin** one, it retries until the message's timestamp exceeds the visibility timeout, and once exceeded it **drops** it. That is, the visibility timeout is also a dial for "how much retry to allow." Enable `ReportBatchItemFailures` and Lambda **doesn't narrow polling** on function failure, so you can avoid partial failures affecting the processing rate.

---

## 6. Launching monthly batches with EventBridge

The payment platform's "monthly batch billing" is launched on schedule with **EventBridge Scheduler's cron**. Scheduler is a serverless scheduler that can set, in addition to repeat patterns of cron / rate expressions, one-time launches, flexible time windows, and even **a retry limit and a max retention time on failure**.

```yaml
# 毎月1日 03:00(UTC) に「月次課金ディスパッチャ」Lambda を起動する
MonthlyChargeSchedule:
  Type: AWS::Scheduler::Schedule
  Properties:
    Name: monthly-charge-dispatch
    ScheduleExpression: cron(0 3 1 * ? *)   # 毎月1日 03:00 UTC
    ScheduleExpressionTimezone: Asia/Tokyo
    FlexibleTimeWindow:
      Mode: "OFF"
    Target:
      Arn: !GetAtt ChargeDispatcherFunction.Arn
      RoleArn: !GetAtt SchedulerInvokeRole.Arn
      RetryPolicy:
        MaximumRetryAttempts: 3            # 失敗時のリトライ上限
        MaximumEventAgeInSeconds: 3600     # 失敗イベントの最大保持時間
```

The dispatcher Lambda itself **does no heavy processing** (SRP). What it does is only "enumerate target tenants × employees and fan out to SQS FIFO." The actual billing is processed on the consumer side, idempotently and with order guarantee.

```python
"""月次課金ディスパッチャ: 列挙して FIFO に投入するだけ（処理はしない）。
重い課金処理は冪等な consumer に委ね、ここは『起動とファンアウト』に専念する。"""
def lambda_handler(event, context):
    year_month = current_billing_month()  # 例: "2026-06"
    for tenant_id, employee_id, amount in iter_billable(year_month):
        # 冪等性キー = ディスパッチャが決定的に生成（再実行で同じキーになる）
        idempotency_key = f"{tenant_id}:{employee_id}:{year_month}"
        enqueue_monthly_charge(
            tenant_id=tenant_id,
            year_month=year_month,
            payload=build_charge_payload(employee_id, amount),
            idempotency_key=idempotency_key,
        )
```

Here **another layer of idempotency** is at work. Because the idempotency key is generated **deterministically** as `tenant:employee:year_month`, even if the scheduler **double-launches for some reason, double billing can be rejected doubly by FIFO's dedup and the consumer's idempotency**. It's a design that absorbs downstream, on the premise that "the scheduler is not exactly-once."

> **Why interpose SQS**: a configuration where the dispatcher hits the billing API directly is also possible. But then "if it crashes midway, the rest isn't processed" and "you can't absorb the billing API's rate limit." By interposing SQS, you obtain **buffering (rate control), retry, and partial-failure isolation (DLQ)** in a managed way. This is taking both ETC (ease of change) and reliability.

---

## 7. Production operations: observability and cost

Even if the design is correct, **you can't operate it if you can't see it**. Let me summarize the observability and cost pressure points that actually paid off on the payment platform.

### 7.1 What to monitor

At minimum, set alarms on the following CloudWatch metrics.

| Metric | What it means | The alarm's intent |
| --- | --- | --- |
| `ApproximateNumberOfMessagesVisible` (queue depth) | Stagnation of unprocessed messages | Detect that processing isn't keeping up / consumer stopped |
| `ApproximateAgeOfOldestMessage` | The stagnation time of the oldest message | A sign of clogging / a poison message |
| **The DLQ's `ApproximateNumberOfMessagesVisible`** | The count of isolated poison messages | **Alert immediately if even 1 enters** (needs investigation) |
| `NumberOfMessagesDeleted` | The deletion (= successful processing) count | **If it drops to 0**, suspect a missed return of partial batch failures |

The official docs also state clearly that "`NumberOfMessagesDeleted` dropping to 0 / `ApproximateAgeOfOldestMessage` surging is a sign the function isn't correctly returning failed messages." On the payment platform, I assembled **20+ CloudWatch alarms + composite alarms** and ran an operation that **notifies Slack with structured logs according to severity**. In particular, **alerting immediately if even 1 enters the DLQ** is the lifeline for catching poison messages early.

### 7.2 What to leave in the logs

To trace idempotency and reprocessing, leave **metadata, not the body (PII)**, in structured logs.

- `messageId` / idempotency key / `MessageGroupId` / `ApproximateReceiveCount` (receive count)
- Failure type (transient contention / semantic failure / poison message) and retry count
- The reason if it went to the DLQ (exception class + summary)

`ApproximateReceiveCount` is in the SQS event's `attributes`, so "how many times this message was redelivered" is clear at a glance. **When the 3rd time onward starts increasing, it's a signal to suspect a poison message or insufficient visibility timeout.**

### 7.3 Cost pressure points

- **SQS request billing**: reduce empty polling with long polling (large `WaitTimeSeconds`) to lower request count = cost.
- **Process in batches**: process up to 10 per invocation to reduce both the Lambda invocation count and the SQS API count (cost efficiency).
- **Sweep idempotency records with TTL**: auto-expire with DynamoDB's TTL (default 90 days) to hold down wasted storage and RCU/WCU.
- **Set `maxReceiveCount` appropriately**: too low and even non-poison messages fall into the DLQ, increasing reprocessing effort (= labor cost); too high and useless retries increase billing.

---

## 8. Summary: an idempotent async-processing cheat sheet

Finally, a quick-reference table for when you're unsure.

- **Major premise**: SQS is at-least-once. **Duplicates are by spec.** Building the consumer to be idempotent is the starting point of everything.
- **Partial batch failure**: `FunctionResponseTypes=ReportBatchItemFailures` + return failed ids in `batchItemFailures` / `itemIdentifier`. Don't throw an exception upward.
- **Idempotency**: self-built, idempotency key + a conditional insert via `attribute_not_exists` + TTL. New build, Powertools' `@idempotent` / `@idempotent_function`.
- **The retry line**: retry only transient contention with exponential backoff + jitter. Propagate semantic failures immediately.
- **DLQ**: isolate poison messages with `RedrivePolicy`'s `maxReceiveCount` (don't make it too low). Take the retention period longer than the main flow. Fix the cause, then redrive.
- **Standard vs. FIFO**: if order or duplication is critical, FIFO (order with `MessageGroupId`, 5-minute deduplication with `MessageDeduplicationId`). **Idempotency of the consumer is needed even with FIFO.**
- **Visibility timeout**: default 30 seconds. Decide by P99 processing time × buffer. Lambda function timeout ≤ visibility timeout. The limit is 12 hours from the first receive.
- **Scheduled launch**: EventBridge Scheduler's cron. The dispatcher concentrates on enumerate & fan-out, and delegates heavy processing to the idempotent consumer.
- **Observability**: set alarms on queue depth, oldest-message stagnation, **the DLQ count (alert immediately on 1)**, and the deletion count.

Serverless async processing looks like "just offloading to a queue," but it's **the work of designing idempotency, ordering, retry, and isolation on the premise of at-least-once delivery**. On an environmental-field multi-tenant payment platform, I assembled employee-card monthly batch billing and CO2 aggregation as **order-guaranteed, idempotent, automatically-reprocessed via SQS FIFO + DLQ**, hardened it with conditional insert of the idempotency key, limitation of retry targets, and observability via 20+ alarms, and achieved **0 double charges / balance inconsistencies during production operation**.

**"I want to make this async processing of mine free of double execution, without breaking order, and able to recover automatically even if it crashes" — from that design through implementation and operation, I can accompany you end to end at the speed of one person × generative AI (Claude Code).** Even from the requirements-organizing stage, feel free to consult me. If a resident-worker-pool type suits the case, see also the [Celery + Redis async task queue production guide](/blog/celery-redis-production-async-task-queue-guide).

---

### Reference (official documentation)

- [What is Amazon Simple Queue Service?](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/welcome.html) — the at-least-once of standard queues / the exactly-once processing of FIFO, the message lifecycle
- [Amazon SQS visibility timeout](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html) — default 30 seconds / max 12 hours, `ChangeMessageVisibility`, the in-flight limit
- [Exactly-once processing in Amazon SQS](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/FIFO-queues-exactly-once-processing.html) — `MessageDeduplicationId`, 5-minute deduplication, content-based deduplication
- [Using dead-letter queues in Amazon SQS](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html) — `RedrivePolicy` / `maxReceiveCount` / redrive, cautions for FIFO + DLQ
- [Using Lambda with Amazon SQS](https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html) — event source mapping, batches, FIFO events, `MessageGroupId` / `MessageDeduplicationId`
- [Handling errors for an SQS event source in Lambda](https://docs.aws.amazon.com/lambda/latest/dg/services-sqs-errorhandling.html) — `ReportBatchItemFailures` / `batchItemFailures` / `itemIdentifier`, the success/failure conditions, backoff
- [What Is Amazon EventBridge?](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html) — event buses / pipes, EventBridge Scheduler (cron / rate expressions)
- [Idempotency - Powertools for AWS Lambda (Python)](https://docs.aws.amazon.com/powertools/python/latest/utilities/idempotency/) — `@idempotent` / `@idempotent_function`, `IdempotencyConfig`, `DynamoDBPersistenceLayer`, `event_key_jmespath`, `expires_after_seconds`