# DynamoDB Streams × Event-Driven Architecture / CDC Complete Guide (2026 Edition): Safely Propagating Change Data with Lambda and EventBridge Pipes

> We explain — faithfully to the AWS official specs — the production design of capturing change data (CDC) with DynamoDB Streams and safely propagating it downstream with Lambda / EventBridge Pipes. From view types, 24-hour retention, and per-item ordering guarantees to BatchSize/ParallelizationFactor/BisectBatchOnFunctionError/DLQ, idempotent consumers, materialized views, search-index sync, Outbox integration, and fan-out, summarized in real TypeScript / Terraform code.

- Published: 2026-06-25
- Author: 友田 陽大
- Tags: AWS, DynamoDB, イベント駆動, 冪等性, サーバーレス, TypeScript, Terraform, アーキテクチャ設計
- URL: https://tomodahinata.com/en/blog/dynamodb-streams-event-driven-architecture-cdc-lambda-eventbridge-guide
- Category: DynamoDB
- Pillar guide: https://tomodahinata.com/en/blog/dynamodb-single-table-design-reliability-idempotency-patterns

## Key points

- Streams is a CDC that retains item-level changes in time series for 24 hours. The ordering guarantee is 'within the same item only,' and total order across the whole table or between partitions is not guaranteed
- A Lambda event source mapping is 'at least once.' As AWS explicitly states, design on the premise that duplicates come, and always make the consumer idempotent
- One poison message clogs the whole shard for up to a day. Isolate it with BisectBatchOnFunctionError, MaximumRetryAttempts, MaximumRecordAgeInSeconds, an on-failure DLQ, and ReportBatchItemFailures
- If retention, consumer count, or order is insufficient, to Kinesis Data Streams. But accept the trade-off that Kinesis is out-of-order and has duplicates (judge with ApproximateCreationDateTime)
- Per-use production patterns: materialized views/aggregation, search-index sync to OpenSearch etc., fan-out with EventBridge Pipes, Transactional Outbox integration. Beware that TTL deletes also appear in the Stream

---

The reliability of an event-driven architecture isn't born from optimistic premises. It's born from a single pessimistic premise.

**"Events arrive out of order, arrive at least once, and fail"** — taking this as given, design with an **idempotent consumer.** This is the only correct answer to making CDC (Change Data Capture) with DynamoDB Streams function in production. Streams is a powerful primitive for distributing "all changes to a table, without missing, in time series, near-real-time," but its way of distributing has clear guarantees and clear limits. Overestimate the guarantees, and production breaks with double reflection, broken order, and shard stoppage from poison messages.

This article is a systematization of only the **design of safely propagating Streams CDC downstream**, based on my experience designing and leading the reliability layer of an **AWS-serverless (Lambda + DynamoDB) multi-tenant payment platform** and maintaining **0 double charges in production.** The basics of data modeling and idempotency/conditional writes are left to the sister article [DynamoDB Single-Table Design & Production Reliability Patterns Complete Guide](/blog/dynamodb-single-table-design-reliability-idempotency-patterns), and pricing and performance to [DynamoDB Capacity, Cost, and Performance Design Complete Guide](/blog/dynamodb-capacity-cost-performance-on-demand-vs-provisioned-guide). Complementary to those, this article narrows to **"how to capture a change, how to convey it, and how to make it not break."**

All specs and limits are cross-checked against the **AWS official documentation (as of June 2026).**

---

## 1. How DynamoDB Streams works: accurately grasp the foundation of CDC

### What is Streams

DynamoDB Streams is, to say it exactly as the official definition, a feature that "**captures item-level changes in a table in time series and stores them in a log for up to 24 hours.**" An application can access this log and read **the before/after data near-real-time.** This is exactly a textbook implementation of CDC.

> Whenever an application creates, updates, or deletes items in the table, DynamoDB Streams writes a stream record with the primary key attributes of the items that were modified.

What's important is that Streams operates **asynchronously to the table's write path.** The official explicitly states "because the stream operates asynchronously, enabling it doesn't affect the table's performance." That capturing CDC itself doesn't increase write latency — this is the reason Streams can be used as a transaction log.

### View type (StreamViewType): what to flow

"Which image, before/after, to put in the stream" is decided per table with `StreamViewType`. The official's 4 kinds:

| StreamViewType | What's on the stream record |
| --- | --- |
| `KEYS_ONLY` | Only the key attributes of the changed item |
| `NEW_IMAGE` | The whole item after the change |
| `OLD_IMAGE` | The whole item before the change |
| `NEW_AND_OLD_IMAGES` | Both before and after the change |

The design guideline is clear. **For materialized-view sync and search-index sync, `NEW_AND_OLD_IMAGES`** is needed. To accurately delete the old index entry (take the diff) you need "the before image," and to reflect the new value you need "the after image." On the other hand, if you only notify another system of a delete event, `KEYS_ONLY` is enough. As an official caveat, **once `StreamViewType` is set, it can't be changed.** To change it, you need to disable the stream and recreate it (= a new stream).

### 24-hour retention, shards, ordering guarantees

Let me accurately grasp the 3 official specs that govern CDC design decisions.

**(1) Retention is 24 hours.** "All data in DynamoDB Streams is subject to a 24-hour lifetime. Data older than 24 hours can be deleted (trimmed) at any time." That is, **Streams is not a persistent log for replay.** If the consumer stops for 24+ hours, the changes during that time are **lost forever.** If long-term retention/replay is a requirement, consider the Kinesis Data Streams described later.

**(2) Shards.** Stream records are grouped into **shards**, and each shard is a container of multiple records. Shards are auto-generated/split as needed, and **as writes to the parent table increase, shards split** to enable parallel processing. Shards have a parent-child relationship (lineage), and order is kept by **processing the parent shard then the child shard** (with Lambda, this management is automatic).

**(3) The ordering guarantee is "per item" only.** This is the most important and most misunderstood point. The official guarantees are 2:

> + Each stream record appears exactly once in the stream.
> + For each item that is modified in a DynamoDB table, the stream records appear in the same sequence as the actual modifications to the item.

Broken down — **what's guaranteed in order is only "the sequence of changes to the same item (the same primary key)."** Between different items, let alone the total order across the whole table, is **not guaranteed.** So don't expect "order A's update will always arrive before another order B's update." The order `created → paid` for the same `OrderId` is kept, but a design depending on order across them breaks.

> **The "exactly once" trap**: the above "exactly once" is a story of **the record within the stream**, not that **the consumer (Lambda) is invoked exactly once.** As described later, Lambda delivery is at-least-once. Confuse this and you skimp on idempotency and have an accident.

Finally, 2 detailed rules that take effect in practice:

- **A write that changes nothing generates no record.** "If `PutItem`/`UpdateItem` didn't actually change the item, Streams doesn't write a record." That an idempotent resend (overwriting with the same value) doesn't dirty CDC is thanks to this spec.
- **TTL deletes also appear in the Stream.** Automatic deletion by expiry also flows to the stream as a `REMOVE` event (details in Section 8). "It should have quietly disappeared by TTL" appears in CDC, so consideration is needed in a design that conveys deletes downstream.

---

## 2. DynamoDB Streams vs Kinesis Data Streams: choosing the CDC delivery route

There are 2 paths to capture DynamoDB changes. Plain DynamoDB Streams, and **Kinesis Data Streams for DynamoDB** (replicating the table's changes to a Kinesis data stream). Many workloads are fine with Streams, but when **retention, consumer count, or throughput** exceeds the requirements, Kinesis becomes necessary.

A comparison based on the official descriptions:

| Viewpoint | DynamoDB Streams | Kinesis Data Streams for DynamoDB |
| --- | --- | --- |
| Data retention | **Fixed 24 hours** | Default 24 hours, extendable to **up to 365 days** (longer-term replay) |
| Ordering guarantee | **Guaranteed per item** (same-item changes are in order) | **Can be out of order** (judge order with `ApproximateCreationDateTime`) |
| Duplicates | "Each record appears exactly once in the stream" | **The same item's notification can appear multiple times** (premise of duplicates) |
| Consumer count | **Up to 2 processes per shard** (up to 2 Lambda functions) | Simultaneous delivery to 2+ downstream apps with **enhanced fan-out** |
| Main consumers | Lambda trigger, Kinesis Adapter (KCL) | KCL, Lambda, Firehose, Managed Service for Apache Flink |
| Association | 1 stream per table | **1 table → 1 Kinesis data stream** |
| Billing | (Via Lambda) no `GetRecords` billing | CDC-unit billing + Kinesis-side billing |

The official frankly writes the Kinesis-side trade-offs.

> The Kinesis data stream records might appear in a different order than when the item changes occurred. The same item notifications might also appear more than once in the stream.

That is, **if you need long-term retention, many fan-outs, or large-scale analysis, Kinesis**, but accept that **the ordering guarantee weakens and duplicates increase**, and absorb it with order judgment by `ApproximateCreationDateTime` and idempotency. Conversely, **if it's standard event-driven, low-latency, and few consumers, DynamoDB Streams + Lambda** is simple, cheap, and fast — that's the line-drawing.

> **My selection criterion**: in the payment platform, downstream propagation that needs consistency (the balance view, ledger, audit) I built with **DynamoDB Streams + Lambda**, with a design that reliably digests within 24 hours (monitoring `IteratorAge` with the later observability). On the other hand, areas that need long-term replay or an analysis pipeline I offload to the Kinesis side — that's the distinction. "When unsure, Streams; when stuck on retention/fan-out, Kinesis" is the practical starting point.

---

## 3. Lambda-trigger design: correctly configure the event source mapping

The most common consumer of a Stream is a Lambda trigger. Lambda polls the stream via a resource called an **event source mapping (ESM).** According to the official, **Lambda polls each shard 4 times per second**, and when there are records, it **invokes the function synchronously** and waits for the result.

### Batching, parallelism, and starting position

| Parameter | Default | Range/cap | Role |
| --- | --- | --- | --- |
| `BatchSize` | 100 | Max **10,000** | The cap on records passed to one invocation |
| `MaximumBatchingWindowInSeconds` | 0 | Max 300 (5 min) | The time to wait until the batch fills. Suppresses invocation count at low traffic |
| `ParallelizationFactor` | 1 | Max **10** | The number of parallel batches processing 1 shard at once |
| `StartingPosition` | Required | `TRIM_HORIZON` / `LATEST` | Where to start reading |

3 important official points:

1. **The batch payload cap is 6 MB.** Lambda invokes on any of "reaching a full batch," "the batching window expiring," or "the payload reaching 6 MB."
2. **Raising `ParallelizationFactor` doesn't break order.** As the official explicitly states, even raising parallelism, **Lambda guarantees per-item (partition + sort key) ordered processing.** When `IteratorAge` is high and processing can't keep up, you can raise throughput while keeping the same item's order.
3. **`LATEST` can drop records.** Because polling start at ESM creation/update is eventually consistent and takes a few minutes, with `LATEST` you **may drop events during that time.** "If you don't permit drops, specify `TRIM_HORIZON`," the official explicitly states.

### Error handling: to not stop the shard with a poison message

This is the heart of production design. **When a synchronous invocation fails, Lambda retries that batch until it succeeds or the record expires.** Leave it naively and one "poison pill" **blocks the whole shard for up to a day** (official: "with the default settings, a malformed record can block the processing of that shard for up to a day"). The 4 levers to prevent this:

| Setting | Default | Role |
| --- | --- | --- |
| `BisectBatchOnFunctionError` | false | On failure, split the batch in 2 to isolate the malformed record. **The split doesn't consume the retry quota** |
| `MaximumRetryAttempts` | -1 (infinite) | The retry cap. Min 0, max 10,000. -1 retries until the record expires |
| `MaximumRecordAgeInSeconds` | -1 (infinite) | The max age of a record. Min -1, max 604,800 (7 days). Discards old records |
| `DestinationConfig` (on-failure) | None | Where to send the discarded record's metadata (standard SQS / standard SNS / S3 / Kafka) |

The official's organization of failure behavior:

- **Pre-invocation failure** (throttling, etc.): retry until the record expires or exceeds `MaximumRecordAgeInSeconds`.
- **In-invocation failure** (the function returns an error): retry until the record expires, exceeds `MaximumRecordAgeInSeconds`, or reaches `MaximumRetryAttempts`. Enable `BisectBatchOnFunctionError`, and it **splits the failing batch in 2 to carve out the malformed record and avoid a timeout.**
- If it fails even after exhausting error handling, Lambda **discards that record and proceeds to the next batch.** That's exactly why **an on-failure destination (DLQ) is needed.**

But there's an important constraint. **What enters the on-failure destination is only "the failing batch's metadata," and the record body isn't included** (only the S3 destination includes it whole in `payload`). What comes to SQS/SNS is the shard ID and the start/end sequence numbers.

> The actual records aren't included, so you must process this record and retrieve them from the stream before they expire and are lost.

That is, you can leave "evidence" in the DLQ but the "content" must be picked back up by yourself within 24 hours. So **realistic reprocessing isolates per item with ReportBatchItemFailures (Section 5)**, and keeps the DLQ as the last safety net — that's the standard play.

### Filter criteria: don't invoke the function with unneeded events

With the ESM's `FilterCriteria`, you can **narrow events before passing to Lambda.** In CDC, you often want to process "only a specific status change" or "only TTL deletes," and a filter can reduce the function's invocation count itself (official: you can define up to 5 patterns). For example, a filter to pick up only TTL deletes:

```json
{
  "Filters": [
    {
      "Pattern": "{\"userIdentity\":{\"type\":[\"Service\"],\"principalId\":[\"dynamodb.amazonaws.com\"]}}"
    }
  ]
}
```

On a table like "of 2 million changes/hour, TTL deletes are 5%," a filter **cuts invocations to 100k/hour, reducing billing and wasted processing at the same time.** A filter is the first breakwater before flowing downstream.

---

## 4. The idempotent consumer: design on the premise of at-least-once

This is the core of this article's philosophy. **A Lambda event source mapping is "at least once" delivery**, and duplicates happen by spec. The official warning is clear.

> Lambda event source mappings process each event at least once, and duplicate processing of records can occur. To avoid potential issues related to duplicate events, we strongly recommend that you make your function code idempotent.

The paths where duplicates arise are structural — retries, batch reprocessing, resend past the `ReportBatchItemFailures` checkpoint, boundaries due to `ParallelizationFactor`. So the only correct answer is not "duplicates won't come" but "**duplicates will come. Build it so the result doesn't change even if they come.**"

The standard play for idempotency is a **processed-marker.** Since each stream record has a unique `eventID` (or `SequenceNumber`), make this the idempotency key and reject with a conditional write so that "only a record seen for the first time runs the side effect."

```ts
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import {
  DynamoDBDocumentClient,
  PutCommand,
} from "@aws-sdk/lib-dynamodb";
import { ConditionalCheckFailedException } from "@aws-sdk/client-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE = process.env.PROCESSED_TABLE!;

/**
 * 「このストリームレコードを初めて処理するか」を原子的に判定する。
 * eventID は各レコードで一意。attribute_not_exists で二重処理を弾く。
 * 2回目以降は ConditionalCheckFailedException となり false を返す。
 */
async function claimOnce(eventId: string): Promise<boolean> {
  try {
    await ddb.send(
      new PutCommand({
        TableName: TABLE,
        Item: {
          PK: `EVT#${eventId}`,
          processedAt: new Date().toISOString(),
          // 処理済みマーカーを永久に残さない（容量・コスト抑制）。
          // Streams 保持(24h)より十分長い TTL にして再送ウィンドウを覆う。
          expiresAt: Math.floor(Date.now() / 1000) + 60 * 60 * 24 * 7,
        },
        // まだ処理していないレコードのときだけ「確保」できる
        ConditionExpression: "attribute_not_exists(PK)",
      }),
    );
    return true; // このレコードは初見 → 副作用を実行してよい
  } catch (err) {
    if (err instanceof ConditionalCheckFailedException) {
      return false; // 既に処理済み（重複）→ スキップ
    }
    throw err;
  }
}
```

The point is to **not protect idempotency with an app lock or "SELECT-whether-processed first."** If a concurrent invocation slips in between the SELECT and the PUT, it breaks. `attribute_not_exists` is evaluated as **the same atomic operation as the write**, so it doesn't break even under contention. The details of conditional writes are summarized in the [Single-Table Design & Idempotency Guide](/blog/dynamodb-single-table-design-reliability-idempotency-patterns), and idempotent async processing with SQS/EventBridge in [Designing Idempotent Async Processing with SQS × Lambda × EventBridge](/blog/aws-sqs-lambda-eventbridge-idempotent-async-processing-guide).

> **Caution**: split the "claim" of the processed-marker and the "actual side effect" into separate writes, and a window opens where, if it crashes after the claim and before the side effect, the side effect never runs. Ideally, **write the marker in the same transaction (`TransactWriteItems`) as the side effect.** Make the marker TTL a length that reliably covers the 24 hours during which a Streams resend can occur (e.g., 7 days).

---

## 5. Partial batch failure (ReportBatchItemFailures): don't roll back the successes

Alongside idempotency, another production-essential technique is this. **The default Lambda checkpoints up to the highest sequence number only when the batch fully succeeds**, and otherwise **fails the whole batch** and retries it wholesale. Even 99 of 100 succeeding and 1 failing reprocesses the 99 — an instant accident if not idempotent, and even if idempotent, a wasteful retry.

Enable `ReportBatchItemFailures`, and you can make it a **partial success by returning only the failed records' sequence numbers.** The official's required response syntax:

```json
{
  "batchItemFailures": [
    { "itemIdentifier": "<SequenceNumber>" }
  ]
}
```

The official behavior:

- When there are multiple in `batchItemFailures`, Lambda **checkpoints at the smallest sequence number** and retries from there.
- **Return values regarded as full success**: an empty `batchItemFailures` list, a null list, an empty/null `EventResponse`.
- **Return values regarded as full failure**: an empty-string `itemIdentifier`, a null `itemIdentifier`, an invalid key name.
- To enable, include **`ReportBatchItemFailures`** in the ESM's `FunctionResponseTypes`.

Given this, the following is a TypeScript handler that bundles **idempotency (Section 4) + partial batch failure + type safety** into one. It uses `aws-lambda`'s types and processes after `unmarshall`ing the DynamoDB JSON back to a plain object.

```ts
import type {
  DynamoDBStreamEvent,
  DynamoDBStreamHandler,
  DynamoDBBatchResponse,
  DynamoDBBatchItemFailure,
  DynamoDBRecord,
} from "aws-lambda";
import { unmarshall } from "@aws-sdk/util-dynamodb";
import type { AttributeValue } from "@aws-sdk/client-dynamodb";

/**
 * 1レコードを処理する。冪等＆副作用はこの中だけに閉じ込める（SRP）。
 * 失敗時は throw する。呼び出し側がシーケンス番号を失敗として記録する。
 */
async function handleRecord(record: DynamoDBRecord): Promise<void> {
  // 変更前/変更後イメージは StreamViewType に依存。NEW_AND_OLD_IMAGES 前提。
  const keys = record.dynamodb?.Keys;
  const newImage = record.dynamodb?.NewImage;
  const oldImage = record.dynamodb?.OldImage;

  // DynamoDB JSON（{ "S": "..." }）→ 素の JS オブジェクトへ。
  const key = keys ? unmarshall(keys as Record<string, AttributeValue>) : undefined;
  const next = newImage ? unmarshall(newImage as Record<string, AttributeValue>) : undefined;
  const prev = oldImage ? unmarshall(oldImage as Record<string, AttributeValue>) : undefined;

  // eventID を冪等キーにして二重処理を弾く（セクション4）。
  if (!record.eventID || !(await claimOnce(record.eventID))) return;

  switch (record.eventName) {
    case "INSERT":
    case "MODIFY":
      await projectUpsert(key, next, prev); // マテビュー/検索インデックスへ反映
      break;
    case "REMOVE":
      await projectDelete(key, prev); // 下流から削除
      break;
  }
}

export const handler: DynamoDBStreamHandler = async (
  event: DynamoDBStreamEvent,
): Promise<DynamoDBBatchResponse> => {
  const batchItemFailures: DynamoDBBatchItemFailure[] = [];

  // 同一アイテムの順序を壊さないため、レコードは「直列」で処理する。
  // 失敗したレコード以降が再試行されるよう、最小シーケンス番号を返す。
  for (const record of event.Records) {
    try {
      await handleRecord(record);
    } catch (err) {
      // この1件を失敗として報告。Lambda はここから再チェックポイントする。
      batchItemFailures.push({ itemIdentifier: record.dynamodb?.SequenceNumber ?? "" });
      // 以降のレコードは次回まとめて再試行されるので、ここで打ち切る。
      break;
    }
  }

  return { batchItemFailures };
};

// 下流への反映は別モジュールに分離（マテビュー・検索同期は SRP で独立）
declare function projectUpsert(key: unknown, next: unknown, prev: unknown): Promise<void>;
declare function projectDelete(key: unknown, prev: unknown): Promise<void>;
```

Let me add 2 design decisions. First, **in stream processing, don't throw records in parallel but process them serially and abort at the first failure** is safe. Because the checkpoint is the smallest sequence number, having it retry from the failure onward in bulk meshes more straightforwardly with order and retries. Second, `BisectBatchOnFunctionError` and `ReportBatchItemFailures` can be used together, and according to the official, **when both are enabled, it splits the batch in 2 at the returned sequence number and retries only the rest.** You can't completely eliminate duplicates (official: "even for a successful record, the possibility of retry remains"), so **the last bastion is always idempotency.**

---

## 6. Practical patterns: materialized views, search indexes, fan-out, Outbox

On top of the foundation so far (ordering guarantee, at-least-once, idempotency, partial failure), let me put the 4 propagation patterns often used in production.

### (1) Materialized views / aggregation

DynamoDB is bad at JOIN and GROUP BY. So "a view shaped into a read form" being **made in advance at write time via Streams** is the royal road of CDC. For example, aggregate "the order total per user" and "the count per status" into another item (or another table) on receiving the order's `INSERT`/`MODIFY`/`REMOVE`. Since you can take the **diff** (old status → new status) with `NEW_AND_OLD_IMAGES`, you can correctly increment/decrement the counter. Update the aggregation with a **conditional atomic update**, and prevent double-counting with an idempotency key (don't use the atomic counter `ADD` alone for balance-type since it's not idempotent — details in the [Single-Table Design Guide](/blog/dynamodb-single-table-design-reliability-idempotency-patterns)).

### (2) Search-index sync to OpenSearch etc.

If you need full-text search or faceted search, keep DynamoDB as the source of truth and **asynchronously sync to OpenSearch etc. with Streams.** Upsert the document on `INSERT`/`MODIFY`, delete it on `REMOVE` (including TTL deletes). Here too `NEW_AND_OLD_IMAGES` takes effect — because you can identify "the old index entry that should be deleted" with the old image. Since the sync is eventually consistent, design the UX on the premise of "don't expect a search hit right after a write."

### (3) Fan-out with EventBridge Pipes

If you want to connect "Streams → an arbitrary target" without writing Lambda, **Amazon EventBridge Pipes** is a strong candidate. According to the official, Pipes is a managed integration that **connects a source and a target point-to-point and can interpose filtering and (optional) enrichment**, and **supports DynamoDB Streams as a source.**

> It reduces the need for specialized knowledge and integration code when developing event-driven architectures.

Pipes consists of 4 stages (**source → filtering → enrichment → target**), and the target can be an EventBridge event bus, Step Functions, SQS, SNS, Lambda, API Destinations, etc. **Fan-out (one change to many downstream)** is the standard play of flowing Streams → Pipes → **an EventBridge event bus** and distributing to multiple rules/targets on the bus (an event bus is, the official says, "suited to many-to-many routing"). "Distribute CDC without writing glue code" → Pipes, "need complex conditional branching or custom idempotency control" → Lambda — use by distinction.

### (4) Transactional Outbox integration

What prevents "updated the DB but failed to publish the event (or vice versa)" in event publishing to an external system is the **Transactional Outbox pattern.** The idea is simple — **write the business data and the event (an outbox record) in the same transaction, and publish later only what's confirmed.** This meshes cleanly in DynamoDB — write the business item and the outbox item atomically with `TransactWriteItems`, and **on the outbox = the business change is confirmed** is guaranteed. And **using Streams as that outbox's relay (publisher)** makes only confirmed events flow downstream without polling. The property that Streams is "the table's confirmed-change log" directly matches Outbox's publishing guarantee. The detailed implementation is summarized in [Designing "Reliable Event Publishing" with the Transactional Outbox Pattern](/blog/transactional-outbox-pattern-reliable-event-publishing-guide).

---

## 7. Terraform: the production event-source-mapping definition

Let me drop the design decisions so far (batch, parallelism, bisect, DLQ, partial failure, filter) into reproducible IaC. It's a production configuration subscribing to a DynamoDB Stream with `aws_lambda_event_source_mapping`.

```hcl
# 失敗バッチのメタデータを退避する DLQ（標準 SQS）
resource "aws_sqs_queue" "stream_dlq" {
  name                      = "app-stream-dlq"
  message_retention_seconds = 1209600 # 14日。Streams の 24h より長く保持して調査時間を確保
}

resource "aws_lambda_event_source_mapping" "ddb_stream" {
  event_source_arn  = aws_dynamodb_table.app.stream_arn
  function_name     = aws_lambda_function.projector.arn
  starting_position = "TRIM_HORIZON" # LATEST は作成直後の取りこぼしリスクがある

  # --- バッチング・並列度 ---
  batch_size                         = 100 # 既定。下流の処理コストに合わせて調整
  maximum_batching_window_in_seconds = 5   # 低トラフィック時に呼び出し回数を抑える
  parallelization_factor             = 10  # 同一アイテム順序は維持したままスループットを上げる

  # --- 毒メッセージ対策（1件でシャードを止めない） ---
  bisect_batch_on_function_error = true # 失敗バッチを二分割して不正レコードを隔離
  maximum_retry_attempts         = 5    # 無限リトライ(-1)で滞留させない
  maximum_record_age_in_seconds  = 3600 # 1時間より古いレコードは破棄して前進

  # --- 部分的バッチ失敗：成功レコードを巻き戻さない ---
  function_response_types = ["ReportBatchItemFailures"]

  # --- 破棄レコードの退避先（最後の安全網。本体は含まれずメタデータのみ） ---
  destination_config {
    on_failure {
      destination_arn = aws_sqs_queue.stream_dlq.arn
    }
  }

  # --- フィルタ：下流に要るイベントだけを関数へ（例：status が paid になった変更のみ） ---
  filter_criteria {
    filter {
      pattern = jsonencode({
        dynamodb = {
          NewImage = {
            status = { S = ["paid"] }
          }
        }
      })
    }
  }
}
```

The design crux:

- Always make `maximum_retry_attempts` and `maximum_record_age_in_seconds` **finite values.** The default `-1` (infinite) invites the worst case of a poison message blocking the shard for 24 hours.
- The DLQ holds **only metadata.** From the SQS message's shard ID + sequence number, separately prepare a reprocessing job that picks back up the target record from the stream **within 24 hours.**
- The filter is the first breakwater for "don't invoke the function with events downstream doesn't need." Narrow by CDC meaning, like `status = paid`.
- Give the Lambda function's IAM role only the minimum permissions of stream reading (`dynamodb:GetRecords`/`GetShardIterator`/`DescribeStream`/`ListStreams`) and DLQ sending (`sqs:SendMessage`) (the principle of least privilege).

---

## 8. Pitfalls and observability: notice before it clogs

### Common pitfalls (and countermeasures)

| Pitfall | Why it happens | Countermeasure |
| --- | --- | --- |
| A poison message stops the shard for a day | Infinitely retry the failing batch until expiry | `BisectBatchOnFunctionError` + finite `MaximumRetryAttempts`/`MaximumRecordAgeInSeconds` + DLQ |
| Double reflection from duplicate processing | Delivery is at-least-once | `eventID` idempotency key + conditional write (Section 4) |
| Breaks assuming total order | The ordering guarantee is per item only | A design not depending on order between different items. If needed, gather to the same PK |
| Reprocessing even the successful records | The default retries on whole-batch failure | Partial success with `ReportBatchItemFailures` (Section 5) |
| Change loss on consumer stop | Retention is 24 hours only | Monitor `IteratorAge`. If long-term retention is needed, Kinesis (up to 365 days) |
| Clogs with the 3rd reader per shard | Up to 2 processes per shard | Lambda is up to 2 functions per stream. Exceed it and read throttling |
| Mis-propagating a TTL delete as a "change" | TTL deletes also flow as `REMOVE` | Distinguish a service delete with `userIdentity` (below) |

The last TTL handling is easy to overlook, so let me make it explicit. TTL (expiry auto-deletion) **flows to the Stream as a `REMOVE` event** with a `userIdentity` field to distinguish it from a normal delete.

```json
{
  "userIdentity": {
    "type": "Service",
    "principalId": "dynamodb.amazonaws.com"
  }
}
```

If you want to handle "a delete by user operation" and "an expiry by TTL" separately downstream (e.g., expiry → archive, manual delete → audit log), branch on this `userIdentity`, or pass **only TTL deletes / only non-TTL deletes** to the function with the ESM's `FilterCriteria` (Section 3).

### Observability: make `IteratorAge` the lead

The most important metric in stream processing is **`IteratorAge`** (the age of the record the consumer is reading). It continuing to rise = **digestion isn't keeping up with generation** = if left, it exceeds the 24-hour retention and loses data — a danger sign. The minimum monitoring:

| Metric | What it shows | Action |
| --- | --- | --- |
| `IteratorAge` (Lambda) | Processing delay (lag) | Alert immediately on an upward trend. Increase `ParallelizationFactor`, lighten processing |
| `Errors` / `Throttles` (Lambda) | Function failure / throttling | Detect retry stagnation / concurrency-cap reach |
| The DLQ's `ApproximateNumberOfMessagesVisible` | Discarded failing batches | Investigate if even 1 comes (poison message) |
| `ReadThrottleEvents` (Streams/table) | Too many readers / insufficient capacity | Suspect exceeding 2 processes per shard |

In my payment platform, I set CloudWatch alarms on the rise of `IteratorAge` and arrival at the DLQ, notifying Slack immediately. **Not missing "the moment processing starts to lag" and "the moment even 1 is dropped"** is the key to reliably digesting CDC inside the 24-hour retention wall. The foundation of observability is contiguous with capacity design, so also see the [Capacity, Cost, and Performance Design Guide](/blog/dynamodb-capacity-cost-performance-on-demand-vs-provisioned-guide).

---

## FAQ

**Q1. What's DynamoDB Streams' retention period? Can I replay?**
Retention is **fixed 24 hours.** Records older than 24 hours can be trimmed at any time. Streams is not a long-term replay log, so if the consumer stops for 24+ hours, the changes during that time are lost. **If you need longer-term retention/replay, use Kinesis Data Streams** (default 24 hours, max 365 days).

**Q2. Is the stream's order guaranteed?**
What's guaranteed is only **the order of changes to the same item (the same primary key).** The official says "each item's changes appear in the actual order" and "each record appears exactly once in the stream," but **the total order between different items or across the whole table is not guaranteed.** Avoid a design depending on total order, and if needed, gather to the same PK.

**Q3. Which should I choose, Streams or Kinesis Data Streams?**
For standard event-driven, low-latency, few consumers (up to 2 functions/shard), **DynamoDB Streams + Lambda.** For long-term retention, many fan-outs, or large-scale analysis, **Kinesis.** But Kinesis **can be out of order and the same notification can appear multiple times**, so order judgment by `ApproximateCreationDateTime` and idempotency are the premise.

**Q4. Do duplicates come? How do I guarantee idempotency?**
They come. A Lambda event source mapping is officially **"at least once,"** and duplicate processing can occur. AWS also strongly recommends making the function code idempotent. The standard play is to make each record's unique `eventID` (or `SequenceNumber`) the idempotency key and reject double processing with an `attribute_not_exists` conditional write.

**Q5. How do I prevent one error from stopping the whole shard?**
Combine `BisectBatchOnFunctionError` (split the failing batch in 2 to isolate the malformed record), finite `MaximumRetryAttempts`/`MaximumRecordAgeInSeconds`, an on-failure destination (DLQ), and `ReportBatchItemFailures` (a partial success returning only the failed records' sequence numbers). Left at the default infinite retry, a poison message **blocks the shard for up to a day.**

**Q6. Do items deleted by TTL also flow to the Stream?**
They do. A TTL delete appears in the stream as a `REMOVE` event with `userIdentity.type = "Service"` and `userIdentity.principalId = "dynamodb.amazonaws.com"`. This lets you distinguish it from a delete by user operation, and you can also pass only TTL deletes / only non-TTL deletes to the function with the ESM's `FilterCriteria`.

---

## Closing: a pessimistic premise makes a system that runs optimistically

CDC with DynamoDB Streams is powerful but not unconditionally safe. It becomes safe only **when you place "out-of-order, at-least-once, fails" as the starting point of design and harden it with an idempotent consumer, partial batch failure, and poison-message isolation.**

- **Know the guarantees and limits accurately**: 24-hour retention, per-item order, in-stream exactly-once. But delivery is at-least-once.
- **Make idempotency the last bastion**: `eventID` + conditional write. Duplicates can't be erased; make the result not change.
- **Don't stop the shard**: bisect + finite retry + finite record age + DLQ + partial batch failure.
- **Choose the delivery route by requirement**: when unsure, Streams + Lambda; when stuck on retention/fan-out, Kinesis; for glue reduction, EventBridge Pipes.
- **Get ahead with `IteratorAge`**: reliably digest inside the 24-hour wall.

In a one-person × generative-AI (Claude Code) setup, I have thoroughly practiced an approach that **passes design judgments through human verification gates**, and implemented and operated this CDC-propagation design on a serverless payment platform with 0 double charges in production. I can accompany you from design review through implementation on event-driven reliability design centered on DynamoDB Streams (idempotent consumers, materialized-view/search sync, Outbox integration, poison-message isolation, observability).

**If you're struggling with serverless / event-driven reliability design, consult us via [Contact](/contact).** Let's start by sweeping out your current change-propagation paths and risk areas (order dependence, duplicate tolerance, poison messages) together.
