Skip to main content
友田 陽大
Reliability, async & real-time
AWS
SQS
サーバーレス
冪等性
アーキテクチャ設計

Building Idempotent Async Processing with SQS + Lambda + EventBridge: Duplicate, Ordering, and DLQ Design on the At-Least-Once Premise

An implementation guide for designing AWS serverless, event-driven async processing (SQS+Lambda+EventBridge) at production quality. Explained with real code: idempotent consumers due to at-least-once delivery, visibility timeout, DLQ and reprocessing, FIFO ordering/deduplication, and partial batch failure (ReportBatchItemFailures).

Published
Reading time
25 min read
Author
友田 陽大
Share

"I want to offload heavy processing to a queue and do it asynchronously" — as a requirement it's one line. But the moment you try to put it into production, the things to decide explode at once. What do you do if the same message comes twice? What if processing drags on and the visibility timeout expires? Where do you offload a "poison message" that fails no matter how many times you try? How do you protect processing that breaks if ordering collapses (balance updates, inventory allocation)?

This article is an implementation guide for assembling AWS serverless, event-driven async processing into production quality with SQS + Lambda + EventBridge + DLQ. As the subject matter, I'll weave in design decisions from an environmental / carbon-credit / regional-currency multi-tenant payment platform I built as a core developer (one of three principal developers) (the reliability design of a serverless payment platform). It's a real example where I assembled employee-card monthly batch billing and CO2 aggregation as order-guaranteed, idempotent, automatically-reprocessed via SQS FIFO + a dead-letter queue (DLQ), achieving 0 double charges / balance inconsistencies during production operation.

The rule of this article: The specs, parameter names, and defaults of SQS / Lambda / EventBridge / Powertools are based on the AWS official documentation (as of June 2026). Because quotas and defaults get revised, always check the latest values on each official page before going to production. And the most important premise: SQS is "at-least-once" delivery. The same message arriving twice is not an "anomaly" but the "spec." So building the consumer to be idempotent is the starting point of all design.


0. Mental model: a queue = "buffer + retry," so idempotency is the iron rule

First, let me pin down the five mental models that run through this article. Get these to sink in, and all the implementation that follows is just their consequences.

  • A queue = a buffer + retry mechanism. It temporally decouples the sender (producer) and the processor (consumer), and the message doesn't disappear even if the processor crashes. On failure, it's redelivered.
  • At-least-once = duplicates are a premise. The official docs state clearly, for standard queues, "at-least-once message delivery," and for the Lambda event source, "process each event at least once, and duplicate processing of records can occur." So build it so that "the result is the same even if the same message comes twice." This is idempotent.
  • The visibility timeout = the grace period that prevents double delivery during processing. When you receive a message, it becomes invisible to other consumers during that time. If you don't delete it within the deadline, it becomes visible again and is reprocessed.
  • The DLQ = the isolation destination for "messages that fail no matter how many times you try." It removes a message that failed maxReceiveCount times from the main flow and offloads it to a place where it can be investigated and reprocessed. A safety valve so a poison message doesn't clog the whole processing line.
  • Idempotency is "the result being unique," ordering guarantee is "the order of processing." These two are different things. Standard queues are best-effort ordering, FIFO queues are strict ordering. Choose by requirement (Section 4).

This site has a separate production guide for a resident-worker-pool type (Celery + Redis) async task queue (Celery + Redis async task queue production guide). That one is the "keep workers resident and run them" type, this article is the "serverless, event-driven" type, and they're complementary. If you don't want a resident process / want auto-following of spikes / want to lean operations toward managed, this article's configuration; if you want to grasp fine control of workers and broker selection, that one suits.


1. The big picture: what the four parts handle

Serverless event-driven async processing is a combination of four parts with different roles. Understand it split by SRP (single responsibility), and you won't get lost about what to write where.

PartRole (single responsibility)Behavior on failure
SQS (standard / FIFO)Buffering and redelivery of messages. At-least-once deliveryRetained until deleted. Redelivered on visibility-timeout expiry
Lambda (event source mapping)Polls the queue and synchronously invokes the function in batchesA failed batch reappears after the visibility timeout. Retry & backoff
DLQ (SQS)Isolates poison messages exceeding maxReceiveCountRemoves from the main flow, awaits investigation / reprocessing (redrive)
EventBridgeEvent routing (bus) / scheduled invocation (Scheduler)Can set retries and max retention time for schedule failures too

There are two streams of data flow.

  1. On-demand: some event (API reception, S3 upload, etc.) → enqueue a message to SQS → Lambda processes it.
  2. Scheduled batch: EventBridge Scheduler launches on cron → enumerates targets and fans out to SQS → Lambda processes in parallel. This article's payment platform's "monthly batch billing" is this one.

In the official docs, EventBridge has two ways to process and deliver events — event buses (a router) and pipes (point-to-point integration) — and further provides EventBridge Scheduler (scheduled / one-time invocation via cron / rate expressions). This article's "monthly batch launch" uses Scheduler's cron.


2. Building an idempotent SQS consumer (the heart of this article)

In an at-least-once world, an idempotent consumer is the very foundation of reliability. Let's build it in order.

2.1 First, the naive form: returning partial batch failures

Lambda polls SQS and passes multiple messages (a batch) per invocation. The official default is "up to 10 messages." Here there's a landmine every beginner steps on.

The official warning: if the function throws an error during batch processing, by default the whole batch, including already-succeeded messages, reappears after the visibility timeout. As a result, you can process the same message many times.

That is, "1 of 10 failed → all redone even though 9 had succeeded." What prevents this is partial batch failure (partial batch response). Specify ReportBatchItemFailures in the event source mapping's FunctionResponseTypes, and the function returns only the IDs of the failed messages.

"""SQS consumer: 部分バッチ失敗を返す最小形。
成功したメッセージは削除させ、失敗した id だけを再表示させる。"""

def lambda_handler(event, context):
    batch_item_failures = []

    for record in event["Records"]:
        try:
            process_one(record)  # ここが本体(次節で冪等にする)
        except Exception:
            # 失敗した messageId だけを itemIdentifier として積む
            batch_item_failures.append({"itemIdentifier": record["messageId"]})

    # この形を返すと、ここに載った id だけが再表示される
    return {"batchItemFailures": batch_item_failures}

The shape of the returned JSON is strictly defined by the official docs. Each element of batchItemFailures must have an itemIdentifier (= the failed messageId). For example, if id2 and id4 of id1id5 failed, you return it like this.

{
  "batchItemFailures": [
    { "itemIdentifier": "id2" },
    { "itemIdentifier": "id4" }
  ]
}

The conditions for being treated as success and the conditions for total failure are also stated clearly by the official docs. Miss these and you get "reprocess everything" or "failures swallowed," so let me pin them down in a table.

The function's return valueLambda's interpretation
An empty batchItemFailures list / nullThe whole batch is succeeded (all deleted)
Enumerate failed ids in itemIdentifierOnly those ids reappear, the rest are deleted
Throw an exception as-isThe whole batch fails (all reappear)
Invalid JSON / an empty string or null itemIdentifier / a nonexistent idTreated as whole-batch failure

A design pressure point: if you use partial batch failure, catch exceptions inside the handler and push them onto batchItemFailures. Carelessly throw an exception all the way up, and the partial-failure specification is nullified and becomes whole-batch failure.

2.2 Idempotency ①: idempotency key + conditional write (the self-built royal road)

Partial batch failure reduces "useless reprocessing of succeeded items," but it can't make redelivery itself zero. So make the processing body (process_one) idempotent. This is the royal road I actually took on the payment platform.

Use the idempotency key issued by the client as a unique identifier, and at the entrance of processing attempt a conditional insert (attribute_not_exists). If the same key already exists, regard it as "already processed" and skip — this alone stops double execution in principle.

"""冪等化①: 冪等性キーの条件付き挿入で二重実行を阻止する。
DynamoDB の attribute_not_exists 条件で『初回だけ書ける』を実現する。
キーは『冪等性キーをソートキーに連結』し、TTL で自動失効させる。"""
import time
import boto3
from botocore.exceptions import ClientError

_ddb = boto3.resource("dynamodb")
_table = _ddb.Table("idempotency")
TTL_SECONDS = 90 * 24 * 60 * 60  # 既定90日で自動失効

class AlreadyProcessed(Exception):
    """同じ冪等性キーで処理済み。スキップしてよい(冪等)。"""

def claim_idempotency(tenant_id: str, idempotency_key: str) -> None:
    """初回だけ書き込みに成功する。2回目以降は ConditionalCheckFailed。"""
    # パーティション=テナント、ソートキー=冪等性キーを連結(マルチテナント分離)
    sort_key = f"charge#{idempotency_key}"
    try:
        _table.put_item(
            Item={
                "pk": tenant_id,
                "sk": sort_key,
                "ttl": int(time.time()) + TTL_SECONDS,  # 自動失効
            },
            # 「まだ無いときだけ書く」= 条件付き挿入。これが冪等性の要
            ConditionExpression="attribute_not_exists(pk) AND attribute_not_exists(sk)",
        )
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            raise AlreadyProcessed(sort_key) from e  # 意味論的失敗 → 再試行しない
        raise  # それ以外(スロットリング等)は上位へ

There are three design points.

  1. The idempotency key is client-issued. Generate it server-side and "resend = a different key," so you can't reject duplicates. The iron rule is that the issuer decides "this operation is this key" and carries the same key even on resend.
  2. Concatenate the key into the sort key, and make the partition key the tenant, to reconcile multi-tenant data isolation and the idempotency check in one table.
  3. Auto-expire with TTL (default 90 days). You don't need to hold idempotency records forever. Take a sufficiently long window while having TTL automatically sweep them, to hold cost and table size down (cost efficiency).

The self-built conditional write "works," but reinventing it in every project is a DRY violation. AWS officially provides the Lambda Powertools idempotency utility, which lets you implement the above "INPROGRESS lock → COMPLETE cache → TTL expiry" with a single decorator. For new builds, considering this first is the standard.

"""冪等化②: Powertools の @idempotent_function。
DynamoDB を永続層に、ペイロードのハッシュを冪等キーにして二重実行を防ぐ。
INPROGRESS で同時実行をロックし、完了後は結果をキャッシュして返す。"""
import os
from aws_lambda_powertools.utilities.idempotency import (
    DynamoDBPersistenceLayer,
    IdempotencyConfig,
    idempotent_function,
)
from aws_lambda_powertools.utilities.typing import LambdaContext

persistence = DynamoDBPersistenceLayer(table_name=os.environ["IDEMPOTENCY_TABLE"])

# event_key_jmespath で「冪等キーにする部分」を指定。
# ここでは body 内の冪等性キーだけをキーにする(メタデータの差で誤判定しないため)。
config = IdempotencyConfig(
    event_key_jmespath='powertools_json(body)."idempotency_key"',
    expires_after_seconds=90 * 24 * 60 * 60,  # 既定3600秒 → 90日に延長
)

@idempotent_function(
    data_keyword_argument="record",
    config=config,
    persistence_store=persistence,
)
def process_one(record: dict):
    # 同じ idempotency_key の再配信ではここは実行されず、前回結果が返る
    charge_employee_card(record)

def lambda_handler(event, context: LambdaContext):
    config.register_lambda_context(context)  # 残時間でタイムアウト保護
    batch_item_failures = []
    for r in event["Records"]:
        try:
            process_one(record=r)  # idempotent_function はキーワード引数必須
        except Exception:
            batch_item_failures.append({"itemIdentifier": r["messageId"]})
    return {"batchItemFailures": batch_item_failures}

Let me pin down the official behavior.

  • The default expires_after_seconds is 3600 seconds (1 hour). If you need a longer reprocessing window, extend it explicitly.
  • With event_key_jmespath, specify which part of the payload to use as the idempotency key via JMESPath. If not specified, it uses the whole event as the key. Using a part where metadata (receive count, etc.) is mixed in as the key misidentifies the same processing as something different, so the point is to specify only the essential identifier.
  • Concurrent execution of the same payload makes the second one get an IdempotencyAlreadyInProgressError via the INPROGRESS lock (preventing swallowing of contention).
  • The hash is MD5 by default. The idempotency key is assembled from "the function name + the hash of the payload."

Self-built vs. Powertools usage: if you want to grasp domain-specific key design like multi-tenant isolation or "idempotency key + sort-key concatenation," self-built (2.2). If the standard "the same input returns the same result" is enough, Powertools (2.3). On the payment platform I chose the former, because I wanted to guarantee the tenant boundary and key design myself. If the requirement is ordinary, don't reinvent the wheel.

2.4 Failures you may retry vs. failures you must not

As important as idempotency is the line between "what you retry and what you fail immediately." Be sloppy here and you'll either melt cost and latency with useless retries, or endlessly spin a non-retryable failure. The next table generalizes the criteria I took on the payment platform.

Kind of failureExampleRetry?Reason
Transient contention / throttlingTransactionConflict, ProvisionedThroughputExceeded, 5xx, network dropYes (exponential backoff + jitter)Can succeed given time. Safe if the operation is idempotent
Semantic failure (invalid input)ConditionalCheckFailed (already processed), 4xx, validation violationNo (immediate propagation)The result is the same no matter how many times. Retry is useless and harmful
Poison messageA record that becomes an exception no matter how many times processedNo → isolate to DLQDon't clog the main flow. Isolate and investigate the cause

On the payment platform, only on transient contention (TransactionConflict) did I retry "with a base 50ms × 2^n exponential backoff + jitter (±50%), up to 3 times," and didn't retry semantic failures (ConditionalCheckFailed) but propagated them immediately. In code it's like this.

"""リトライ対象を限定する指数バックオフ+ジッター。
一時的競合だけを再試行し、意味論的失敗は即座に上げる(fail fast)。"""
import random
import time
from botocore.exceptions import ClientError

RETRYABLE = {"TransactionConflict", "ThrottlingException",
             "ProvisionedThroughputExceededException"}

def with_backoff(fn, *, max_attempts: int = 3, base_ms: float = 50.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ClientError as e:
            code = e.response["Error"]["Code"]
            if code not in RETRYABLE:
                raise  # 意味論的失敗は再試行しても無駄 → 即伝播
            if attempt == max_attempts:
                raise
            # 基点 50ms × 2^(n-1)、ジッター ±50% で群衆突入(thundering herd)を散らす
            backoff = base_ms * (2 ** (attempt - 1))
            jitter = backoff * random.uniform(-0.5, 0.5)
            time.sleep((backoff + jitter) / 1000.0)

The jitter (±50%) is to scatter the situation where multiple consumers retry all at once at the same timing and collide again (thundering herd). Make the retry interval constant, and failures synchronize and ripple. Just shifting them randomly smooths the peaks of contention.


3. The DLQ: isolating and reprocessing poison messages

Even with an idempotent consumer assembled, messages that fail no matter how many times you try appear. Schema mismatches, the disappearance of a reference, unexpected input — let these retry endlessly in the main flow and even healthy messages get delayed. This is the poison message problem, and the solution is the DLQ.

3.1 How the DLQ works: maxReceiveCount and RedrivePolicy

The official definition is simple. Set a redrive policy on the source queue and specify maxReceiveCount (the number of times a consumer can receive a message before it's moved to the DLQ). If it's received maxReceiveCount times and not deleted, the message moves to the DLQ.

# SAM / CloudFormation: 本流キュー + DLQ + redrive policy
Resources:
  ChargeDLQ:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: charge-dlq.fifo
      FifoQueue: true
      # 公式の推奨: DLQ の保持期間は本流より長く取る
      MessageRetentionPeriod: 1209600  # 14日(最大)

  ChargeQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: charge.fifo
      FifoQueue: true
      ContentBasedDeduplication: false  # 明示的な dedup id を使う(後述)
      VisibilityTimeout: 180            # consumer の最大処理時間に合わせる(第5章)
      RedrivePolicy:
        deadLetterTargetArn: !GetAtt ChargeDLQ.Arn
        maxReceiveCount: 5              # 5回失敗したら DLQ へ隔離

Three operational pressure points the official docs state clearly.

  • The DLQ must be in the same AWS account and Region as the source queue, and the same type (a FIFO DLQ for FIFO, a standard DLQ for standard).
  • Take the DLQ's retention period longer than the main flow's as a best practice. Since standard queues expire based on "the original enqueue timestamp," a message that ate up time in the main flow can be short-lived in the DLQ (the official example: 1 day in the main flow → with a 4-day DLQ retention, it disappears in the DLQ in 3 days).
  • Don't make maxReceiveCount too low. With 1 it's "immediate DLQ on one failure." Even a transient failure gets isolated, so leave sufficient retry room (the official example is > 3).

3.2 Cautions when using a DLQ with FIFO

The official docs attach a warning to using a DLQ with FIFO. Because in processing where order has context (an EDL of editing instructions, etc.), if order collapses in the DLQ, the meaning breaks. For the payment platform's monthly billing, because order is closed per message group (tenant × month), I designed it so that even if a specific message falls into the DLQ, there's no fatal impact on other processing within the group. If you use FIFO + DLQ, always identify "which group is in trouble if order breaks by going to the DLQ."

3.3 Reprocessing (redrive): fix the cause, then return it

Messages accumulated in the DLQ can be returned to the source queue (or another destination) with a dead-letter queue redrive. The operational flow is this.

  1. Isolate: automatically to the DLQ on exceeding maxReceiveCount.
  2. Investigate: match the DLQ message body against the logs (the exception stack) and identify why it failed. "Whether the consumer was given enough processing time" is also a verification point the official docs raise.
  3. Fix: fix the code / data / reference mismatch.
  4. Reprocess (redrive): fix the cause, then redrive DLQ → source queue. With an idempotent consumer, even partially-succeeded processing isn't double-executed, so you can return it with peace of mind.

Idempotency and the DLQ work as a set. Redrive to a non-idempotent consumer and you get the accident of "half-succeeded processing running again." The reason automatic reprocessing ran safely on the payment platform is that the idempotency of Section 2 was the foundation.


4. Standard vs. FIFO: how to design ordering and deduplication

SQS has two types, and the guarantees of ordering and duplication are fundamentally different. The choice here decides the subsequent reliability design.

AspectStandard queueFIFO queue (First-In-First-Out)
Delivery guaranteeat-least-once (duplicates possible)exactly-once processing (no duplicates let in)
OrderingBest-effort (can reorder)Strict order (per MessageGroupId)
DeduplicationNone (idempotency on the consumer side)5-minute deduplication (MessageDeduplicationId)
ThroughputNearly unlimitedLower than standard (a high-throughput mode exists)
Main useHigh-throughput, order-agnostic processingBilling, inventory, ledger, etc., where order and duplication are critical

4.1 Don't mix up FIFO's two IDs

If you use FIFO, absolutely don't confuse the two IDs of differing natures.

  • MessageGroupId: the unit of ordering. Messages with the same group ID are strictly order-processed. For billing, I made "tenant × target month" the group, parallelizing billing across tenants while keeping order within the same tenant.
  • MessageDeduplicationId: the unit of deduplication. The official docs say that even if you retry SendMessage within the 5-minute deduplication interval, with the same dedup id SQS doesn't let the duplicate in.
"""FIFO への送信: グループ ID で順序を、dedup id で重複排除を制御する。"""
import boto3

sqs = boto3.client("sqs")

def enqueue_monthly_charge(tenant_id: str, year_month: str, payload: str,
                           idempotency_key: str) -> None:
    sqs.send_message(
        QueueUrl=CHARGE_FIFO_URL,
        MessageBody=payload,
        # 順序の単位: テナント×月。これが同じものは厳密順序で処理される
        MessageGroupId=f"{tenant_id}#{year_month}",
        # 重複排除の単位: 冪等性キー。5分窓内の再送は SQS が弾く
        MessageDeduplicationId=idempotency_key,
    )

According to the official docs, there are two ways to set deduplication. Either enable content-based deduplication (use the SHA-256 hash of the message body as the dedup id; it looks at the body but not the attributes), or pass MessageDeduplicationId explicitly. For billing, I made it pass the idempotency key explicitly as the dedup id. Because even if the body is identical, a "different billing" gets a different key, so it expresses the intent more accurately than content-based.

Correcting an important misconception: FIFO's "5-minute deduplication" prevents sender-side duplicates (a retry of the same SendMessage), and does not eliminate consumer-side at-least-once redelivery. Reprocessing due to visibility-timeout expiry happens even with FIFO. So idempotency of the consumer is needed even with FIFO. "I made it FIFO so I don't need idempotency" is wrong.

4.2 Standard or FIFO, which to choose

The decision criterion is simple.

  • If the result doesn't break even when order breaks, standard. Notification sending, thumbnail generation, log aggregation, etc. Take throughput and cheapness.
  • If breaking either order or duplication is fatal, FIFO. Billing, balance updates, inventory allocation, ledgers. The payment platform's monthly billing was FIFO without hesitation.

From the KISS standpoint, you should ask "won't standard + an idempotent consumer suffice first," but using standard for processing where order affects the result is not KISS but mere cutting corners. If the requirement demands order, choose FIFO and correctly take on the complexity.


5. The visibility timeout: decide it in relation to processing time

The visibility timeout is a value the designer must always decide explicitly. Leave it to the default and you'll have accidents.

5.1 Default and limit (the official numbers)

  • The default is 30 seconds.
  • The count starts the instant it's received. If not deleted within the deadline, it reappears.
  • You can dynamically extend / shorten it with ChangeMessageVisibility. Set VisibilityTimeout to 0 to immediately release to other consumers.
  • The limit is "12 hours from the first receive." Even if you extend, this 12 hours is not reset. If you need processing exceeding this, use Step Functions or split the task, the official docs say.
  • A standard queue's in-flight limit is about 120,000. Exceed it and OverLimit error (with long polling, not an error but returning no new messages).

5.2 How to decide: processing time + buffer

The official best practice is "match it to the maximum time it normally takes to process + delete a message." Too short and it's redelivered during processing → double processing (useless duplication / cost), too long and retry on failure is delayed.

In practice, the way to decide is this.

  1. Measure the consumer's P99 processing time (e.g. up to 60 seconds per message).
  2. Add a buffer on top (about ×3. e.g. 180 seconds). For Lambda, satisfy function timeout ≤ visibility timeout. The reverse means it's redelivered while still processing.
  3. If processing time is unreadable / variable, periodically extend with a heartbeat ChangeMessageVisibility. But be conscious of the 12-hour limit.
# 可視性タイムアウトは「Lambda 関数タイムアウト × バッファ」で決める
ChargeQueue:
  Type: AWS::SQS::Queue
  Properties:
    VisibilityTimeout: 180   # 関数タイムアウト60秒の3倍。再配信前に余裕を持たせる

A Lambda-specific note: according to the official docs, on a function-code-origin error, Lambda stops processing and narrows the parallelism, and the message reappears after the visibility timeout. On a throttling-origin one, it retries until the message's timestamp exceeds the visibility timeout, and once exceeded it drops it. That is, the visibility timeout is also a dial for "how much retry to allow." Enable ReportBatchItemFailures and Lambda doesn't narrow polling on function failure, so you can avoid partial failures affecting the processing rate.


6. Launching monthly batches with EventBridge

The payment platform's "monthly batch billing" is launched on schedule with EventBridge Scheduler's cron. Scheduler is a serverless scheduler that can set, in addition to repeat patterns of cron / rate expressions, one-time launches, flexible time windows, and even a retry limit and a max retention time on failure.

# 毎月1日 03:00(UTC) に「月次課金ディスパッチャ」Lambda を起動する
MonthlyChargeSchedule:
  Type: AWS::Scheduler::Schedule
  Properties:
    Name: monthly-charge-dispatch
    ScheduleExpression: cron(0 3 1 * ? *)   # 毎月1日 03:00 UTC
    ScheduleExpressionTimezone: Asia/Tokyo
    FlexibleTimeWindow:
      Mode: "OFF"
    Target:
      Arn: !GetAtt ChargeDispatcherFunction.Arn
      RoleArn: !GetAtt SchedulerInvokeRole.Arn
      RetryPolicy:
        MaximumRetryAttempts: 3            # 失敗時のリトライ上限
        MaximumEventAgeInSeconds: 3600     # 失敗イベントの最大保持時間

The dispatcher Lambda itself does no heavy processing (SRP). What it does is only "enumerate target tenants × employees and fan out to SQS FIFO." The actual billing is processed on the consumer side, idempotently and with order guarantee.

"""月次課金ディスパッチャ: 列挙して FIFO に投入するだけ(処理はしない)。
重い課金処理は冪等な consumer に委ね、ここは『起動とファンアウト』に専念する。"""
def lambda_handler(event, context):
    year_month = current_billing_month()  # 例: "2026-06"
    for tenant_id, employee_id, amount in iter_billable(year_month):
        # 冪等性キー = ディスパッチャが決定的に生成(再実行で同じキーになる)
        idempotency_key = f"{tenant_id}:{employee_id}:{year_month}"
        enqueue_monthly_charge(
            tenant_id=tenant_id,
            year_month=year_month,
            payload=build_charge_payload(employee_id, amount),
            idempotency_key=idempotency_key,
        )

Here another layer of idempotency is at work. Because the idempotency key is generated deterministically as tenant:employee:year_month, even if the scheduler double-launches for some reason, double billing can be rejected doubly by FIFO's dedup and the consumer's idempotency. It's a design that absorbs downstream, on the premise that "the scheduler is not exactly-once."

Why interpose SQS: a configuration where the dispatcher hits the billing API directly is also possible. But then "if it crashes midway, the rest isn't processed" and "you can't absorb the billing API's rate limit." By interposing SQS, you obtain buffering (rate control), retry, and partial-failure isolation (DLQ) in a managed way. This is taking both ETC (ease of change) and reliability.


7. Production operations: observability and cost

Even if the design is correct, you can't operate it if you can't see it. Let me summarize the observability and cost pressure points that actually paid off on the payment platform.

7.1 What to monitor

At minimum, set alarms on the following CloudWatch metrics.

MetricWhat it meansThe alarm's intent
ApproximateNumberOfMessagesVisible (queue depth)Stagnation of unprocessed messagesDetect that processing isn't keeping up / consumer stopped
ApproximateAgeOfOldestMessageThe stagnation time of the oldest messageA sign of clogging / a poison message
The DLQ's ApproximateNumberOfMessagesVisibleThe count of isolated poison messagesAlert immediately if even 1 enters (needs investigation)
NumberOfMessagesDeletedThe deletion (= successful processing) countIf it drops to 0, suspect a missed return of partial batch failures

The official docs also state clearly that "NumberOfMessagesDeleted dropping to 0 / ApproximateAgeOfOldestMessage surging is a sign the function isn't correctly returning failed messages." On the payment platform, I assembled 20+ CloudWatch alarms + composite alarms and ran an operation that notifies Slack with structured logs according to severity. In particular, alerting immediately if even 1 enters the DLQ is the lifeline for catching poison messages early.

7.2 What to leave in the logs

To trace idempotency and reprocessing, leave metadata, not the body (PII), in structured logs.

  • messageId / idempotency key / MessageGroupId / ApproximateReceiveCount (receive count)
  • Failure type (transient contention / semantic failure / poison message) and retry count
  • The reason if it went to the DLQ (exception class + summary)

ApproximateReceiveCount is in the SQS event's attributes, so "how many times this message was redelivered" is clear at a glance. When the 3rd time onward starts increasing, it's a signal to suspect a poison message or insufficient visibility timeout.

7.3 Cost pressure points

  • SQS request billing: reduce empty polling with long polling (large WaitTimeSeconds) to lower request count = cost.
  • Process in batches: process up to 10 per invocation to reduce both the Lambda invocation count and the SQS API count (cost efficiency).
  • Sweep idempotency records with TTL: auto-expire with DynamoDB's TTL (default 90 days) to hold down wasted storage and RCU/WCU.
  • Set maxReceiveCount appropriately: too low and even non-poison messages fall into the DLQ, increasing reprocessing effort (= labor cost); too high and useless retries increase billing.

8. Summary: an idempotent async-processing cheat sheet

Finally, a quick-reference table for when you're unsure.

  • Major premise: SQS is at-least-once. Duplicates are by spec. Building the consumer to be idempotent is the starting point of everything.
  • Partial batch failure: FunctionResponseTypes=ReportBatchItemFailures + return failed ids in batchItemFailures / itemIdentifier. Don't throw an exception upward.
  • Idempotency: self-built, idempotency key + a conditional insert via attribute_not_exists + TTL. New build, Powertools' @idempotent / @idempotent_function.
  • The retry line: retry only transient contention with exponential backoff + jitter. Propagate semantic failures immediately.
  • DLQ: isolate poison messages with RedrivePolicy's maxReceiveCount (don't make it too low). Take the retention period longer than the main flow. Fix the cause, then redrive.
  • Standard vs. FIFO: if order or duplication is critical, FIFO (order with MessageGroupId, 5-minute deduplication with MessageDeduplicationId). Idempotency of the consumer is needed even with FIFO.
  • Visibility timeout: default 30 seconds. Decide by P99 processing time × buffer. Lambda function timeout ≤ visibility timeout. The limit is 12 hours from the first receive.
  • Scheduled launch: EventBridge Scheduler's cron. The dispatcher concentrates on enumerate & fan-out, and delegates heavy processing to the idempotent consumer.
  • Observability: set alarms on queue depth, oldest-message stagnation, the DLQ count (alert immediately on 1), and the deletion count.

Serverless async processing looks like "just offloading to a queue," but it's the work of designing idempotency, ordering, retry, and isolation on the premise of at-least-once delivery. On an environmental-field multi-tenant payment platform, I assembled employee-card monthly batch billing and CO2 aggregation as order-guaranteed, idempotent, automatically-reprocessed via SQS FIFO + DLQ, hardened it with conditional insert of the idempotency key, limitation of retry targets, and observability via 20+ alarms, and achieved 0 double charges / balance inconsistencies during production operation.

"I want to make this async processing of mine free of double execution, without breaking order, and able to recover automatically even if it crashes" — from that design through implementation and operation, I can accompany you end to end at the speed of one person × generative AI (Claude Code). Even from the requirements-organizing stage, feel free to consult me. If a resident-worker-pool type suits the case, see also the Celery + Redis async task queue production guide.


Reference (official documentation)

友田

友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.

Got a challenge?

From design to implementation and operations — solo × generative AI

Implementation like this article's, end to end from requirements to production. Start with a free 30-minute technical consult and tell me about your situation.

Available for both project-based (contract) and advisory engagements. Start with a free 30-minute consult.

Also worth reading