# The Transactional Outbox Pattern: Make the DB Update and Event Publishing Atomic, and Cut Off Lost Events and Double Publishing

> An implementation guide to the transactional outbox pattern that solves the dual-write problem of distributed systems. Write to the outbox in the same transaction as the business update, publish reliably with a relay (polling/CDC), and make downstream idempotent. We explain ordering guarantees, at-least-once, and reconciliation in real code.

- Published: 2026-06-24
- Author: 友田 陽大
- Tags: アーキテクチャ設計, AWS, 冪等性, 決済, メッセージング
- URL: https://tomodahinata.com/en/blog/transactional-outbox-pattern-reliable-event-publishing-guide
- Category: Reliability, async & real-time

## Key points

- A DB commit and a message publish are 2 writes to different systems and can't be made atomic, producing either lost events or ghost events
- Write the event to an outbox table in the same transaction as the business update, reducing the write targets to one to establish atomicity
- A separate process (the relay) handles publishing, achieving at-least-once with the order publish→mark
- For the relay, start with Polling (works on any SQL, no extra infrastructure); for low latency and large scale, choose CDC
- Downstream must be idempotent — absorb it by putting a deduplication marker in the same transaction as the consumer-side business update

---

"After updating the DB, publish that change as an event" — the requirement is one line. When an order is confirmed, to the notification service. When a payment is recorded, to the aggregation service. When inventory is reduced, to the shipping service. **Whether microservices or the async processing of a modular monolith, this "business update → event publishing" appears everywhere.**

And most implementations are **silently broken.** "INSERT into the DB, then publish to the queue" — this honest 2-line code, under network outages, process crashes, and retries, produces either a **lost event** or a **ghost event.** Production failures like "notifications sometimes don't arrive" or "occasionally charged twice" usually have their epicenter right here.

This article squarely solves that root cause — the **dual-write problem** — and is an implementation guide to establishing "the DB update and event publishing atomically" with the **transactional outbox pattern.** As the subject matter, I'll mix in design decisions I built into an invite-only B2B SaaS (lumber-distribution DX, Stripe Connect marketplace payments, [METI Minister's Award](/case-studies/lumber-industry-dx)), where I assembled billing reconciliation in a form that **prevents double charges and lost events even under network outages and retries.**

> **The rules of this article**: The pattern's definition and structure are based on the descriptions in **microservices.io (Chris Richardson) and AWS Prescriptive Guidance (as of June 2026).** Parts that depend on specific tools (Debezium / DynamoDB Streams, etc.) are explicitly marked as "examples." Since specs get revised, always check the latest descriptions in the [official documentation](#reference-official-documentation) before going to production. The code is shaped to be usable in real operation, but secrets are assumed to be in environment variables (no hardcoding).

---

## 0. Why "write to the DB, then publish" is broken

First, let's fix the mental model. Once this sinks in, the rest is just implementation.

The following code is the "honest" implementation written all over the world. **This is the dual-write problem itself.**

```python
# ❌ 壊れている：2つの別システムへの書き込みは原子的にできない
def place_order(order: Order) -> None:
    db.execute("INSERT INTO orders (...) VALUES (...)")  # ① DBへコミット
    db.commit()
    broker.publish("OrderPlaced", order.to_event())       # ② ブローカーへpublish
```

`db.commit()` and `broker.publish()` are writes to **completely different systems** (a relational DB and a message broker). No "single atomic operation" spanning both exists. So the following two breakdowns inevitably happen.

### Breakdown pattern A: lost event

Right after ① succeeds, before ② executes, the process crashes / the network is cut / the broker is temporarily down. **The order was recorded in the DB, but the event is never published.** The downstream service doesn't know about the change. To borrow AWS's words, "the downstream service will not be aware of the change, and the system can enter an inconsistent state."

### Breakdown pattern B: ghost event

The reverse also happens. The ② publish went through (or appeared to), but ①'s transaction rolls back for some reason, or the post-publish commit fails. **An event for a change that doesn't exist flows downstream, and a payment for a non-existent order runs.** AWS expresses this as "data could get corrupted."

### "Just swap the order" doesn't work either

"Then publish first, then write to the DB?" — this breaks too (it only worsens pattern B). "If publish fails, just roll back the DB" — you can't distinguish the case where publish **succeeded but the response didn't return** (timeout). Retry and you double-publish; give up and you lose the event. **As long as it spans two systems, no matter how you swap the write order, you can't get atomicity.**

### Why not adopt 2PC (two-phase commit)?

In theory, a distributed transaction (XA / 2PC) can bind the two. But microservices.io clearly rejects it. It says many brokers and DBs don't support 2PC, and it tightly couples the service to **both the DB and the broker.** With 2PC, the coordinator becomes a single point of failure, and both latency and availability worsen. **"It works, but you don't want it in production"** — this is the field's consensus.

> **The core**: a "DB commit" and a "message publish" are **two writes to different systems** and can't be bound atomically. If only one succeeds, you get inconsistency (a lost event or a ghost event). **The solution is to reduce the write targets to one** — write the event to the outbox table in "the same DB transaction," and have a separate process reliably publish it.

---

## 1. The pattern's definition: write the event to the DB too

Let me lay out the transactional outbox's solution exactly as in Chris Richardson's original.

> A service saves the message in the database as part of the transaction updating the business entity. **"A separate process then sends the messages to the message broker."** This avoids the dual-write problem of losing messages when sending outside the transaction boundary.

The structure is just two parts.

1. **The outbox table**: a table **in the same database** as the business DB that stores the events you want to publish. INSERT it in the **same transaction** as the business update.
2. **The message relay**: a **separate process** that reads unpublished events from the outbox table, publishes them to the broker, and marks them as published.

The key is **"reducing the write targets to one."** Because the business table and the outbox table are in the **same DB**, their updates can be committed atomically in **one local transaction.** If the transaction commits, the business change and the event **both** necessarily remain. If it rolls back, **neither** remains. In AWS's words, "Guarantees message delivery if and only if database transactions succeed."

The success/failure of the publish is no longer part of the transaction. **The responsibility of "reliably publishing someday" is delegated to the relay**, and since the relay deals with a persisted ledger called the outbox, it can resume no matter how many times it goes down.

```
┌─────────────────────────── Same transaction ──────────────────────────────┐
│  UPDATE orders SET status='PAID' ...                                       │
│  INSERT INTO outbox (event_type, payload, ...) VALUES ('OrderPaid', ...)   │
└──────────────────────────────── COMMIT ────────────────────────────────────┘
                                     │
                        (separate process = relay)
                                     ▼
                  Read outbox → publish to broker → update published_at
                                     │
                                     ▼
                          SQS / Kafka / EventBridge → downstream (idempotent consumer)
```

---

## 2. Designing the outbox table

Start with the schema. **No more, no less** — too little and you get stuck in operations, too much and maintenance cost rises (KISS / YAGNI).

```sql
CREATE TABLE outbox (
    id            BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, -- 単調増加 = 発行・順序の基準
    aggregate_type TEXT        NOT NULL,   -- 例: 'Order' / 'Payment'（集約の種類）
    aggregate_id   TEXT        NOT NULL,   -- 例: order_id（同一集約のイベント順序キー）
    event_type     TEXT        NOT NULL,   -- 例: 'OrderPaid'（消費側のルーティング）
    payload        JSONB       NOT NULL,   -- イベント本体（自己完結させる）
    created_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
    published_at   TIMESTAMPTZ              -- NULL = 未発行（リレーが埋める）
);

-- 未発行行だけを高速に拾う部分インデックス（テーブルが育ってもスキャンが軽い）
CREATE INDEX idx_outbox_unpublished
    ON outbox (id) WHERE published_at IS NULL;
```

Let me state each column's intent explicitly, as a design decision.

- **`id` (monotonically increasing)**: the pickup order for publishing and the basis for ordering guarantees. AWS also says it "preserves the order of messages by using timestamps and sequence numbers," and **the sequence number is the lifeline of ordering.**
- **`aggregate_id`**: the ordering key for "events for the same order keep their publish order" (Chapter 7). It corresponds to SQS FIFO's `MessageGroupId` or Kafka's partition key.
- **`payload` (self-contained)**: put in enough information that the consumer **can complete processing with this event alone.** A design of "put just the ID and make it come fetch the body later" creates contention because the DB state has changed by the time it comes to fetch. The principle is that **an event carries a snapshot of the moment it occurred.**
- **`published_at` (NULL = unpublished)**: the relay's progress ledger. With the **partial index** of `WHERE published_at IS NULL`, even if the table grows to millions of rows, you can lightly pick up just the tens of unpublished rows (performance).

> **Design decision (DRY)**: "Split the outbox per business table, or have one company-wide." I basically **consolidate into one outbox.** The relay only has to deal with one table, and I can keep the publishing logic, monitoring, and cleanup as a single concern (SRP). `aggregate_type` distinguishes the kinds.

---

## 3. Write the business update and outbox insert in the same transaction

This is the **heart of the pattern.** Put the business update and the outbox INSERT in **necessarily the same transaction.** To avoid being swept along by the framework's "implicit commit," make the transaction boundary explicit.

### Python (SQLAlchemy 2.x)

```python
"""業務更新と outbox 挿入を同一トランザクションで行う。
どちらか一方だけがコミットされる状態は構造的に発生し得ない。"""
from sqlalchemy.orm import Session

def mark_order_paid(session: Session, order_id: str, amount: int) -> None:
    with session.begin():  # この with を抜けるとき一括 COMMIT / 例外なら一括 ROLLBACK
        order = session.get(Order, order_id, with_for_update=True)
        order.status = "PAID"
        order.paid_amount = amount  # ① 業務変更

        session.add(OutboxEvent(                          # ② イベントを“同じTx”でoutboxへ
            aggregate_type="Order",
            aggregate_id=order_id,
            event_type="OrderPaid",
            payload={"order_id": order_id, "amount": amount},
        ))
    # ここに到達した時点で①②は両方コミット済み。publish はまだしていない（それはリレーの仕事）
```

### TypeScript (Prisma's interactive transaction)

```ts
// 業務更新と outbox 挿入を 1 つの $transaction に閉じる。
// publish はここで“やらない”——それがこのパターンの肝。
await prisma.$transaction(async (tx) => {
  await tx.order.update({
    where: { id: orderId },
    data: { status: "PAID", paidAmount: amount }, // ① 業務変更
  });

  await tx.outbox.create({                          // ② 同一Txでイベントを永続化
    data: {
      aggregateType: "Order",
      aggregateId: orderId,
      eventType: "OrderPaid",
      payload: { orderId, amount },
    },
  });
});
// $transaction を抜けた時点で①②は不可分にコミット済み
```

Notice that **nowhere in this function is there a `broker.publish()`.** That's the correct form. The application code's responsibility is up to "atomically inscribe the business change + the intent to publish into the DB." **The actual publishing is handled by a completely separated, separate process (the relay)** (SRP).

> **Anti-pattern (explicitly noted as a drawback by microservices.io)**: "Developers may forget to publish messages after database updates." That is, the accident of **forgetting to write the outbox INSERT.** The countermeasure is to enforce by types, in the repository layer that performs the business update, that "a state-transition method always returns a domain event → the caller stacks it into the outbox," or to consolidate the app layer into one place so the **structure makes a missing publish a subject of code review.**

### 3.1 Payload design: an event carries "the fact at the moment of occurrence"

What you put in `payload` is a design decision that takes effect later. The principle is **"a snapshot of the moment of occurrence that lets the consumer complete processing with this event alone."**

```json
{
  "event_type": "OrderPaid",
  "event_version": 1,
  "occurred_at": "2026-06-24T09:30:00Z",
  "order_id": "ord_123",
  "amount": 12000,
  "currency": "JPY"
}
```

- **`event_version`**: the event's shape will inevitably evolve. Put a version number in from the start, and the consumer can branch on "v1 is the old fields, v2 is the new fields," letting you **deploy the sender and receiver independently** (ETC: localizing changes). Adding it later is compatibility hell, so it's one of the few "look-aheads" you put in **from day 1.**
- **`occurred_at` (the time the event occurred)**: a separate concept from `created_at` (the time it was written to the outbox). It's the basis for the consumer to judge "when did the fact happen."
- **Avoid "put just the ID and make it come fetch later"**: by the time the consumer comes to fetch the body, the DB's state may have already moved on to the next. **An event is a past fact, not a reference to the current state** — this distinction structurally prevents contention and race conditions.

> **Related patterns (context of the sources)**: pushing this snapshot-type event to its limit leads to **Event Sourcing** (representing the state itself as a sequence of events), which AWS also lists in its applicability. And if you want to bind updates spanning multiple services, combine it with **Saga** (compensating transactions). The outbox is "the **foundation that reliably fires** Saga / Domain Events" — that microservices.io positions it as "Triggers the need: Saga, Domain Event" reflects this dependency. This article focuses purely on the **certainty of firing** and leaves Saga itself to another piece (YAGNI: what to solve now is the dual-write).

---

## 4. The message relay (1): the polling publisher

The first strategy for publishing the events accumulated in the outbox is the **Polling Publisher.** microservices.io's definition is simple: **"Publish messages by polling the database's outbox table."**

### Implementation (Python, polling relay)

```python
"""未発行イベントを id 昇順で取得 → publish → published_at を更新。
落ちても“未発行のまま”の行が残るので、再起動すれば必ず再開する（at-least-once）。"""
import time

BATCH = 100

def relay_once(session, broker) -> int:
    # ① 未発行を id 昇順（=発生順）で取得。行ロックで多重起動の競合を防ぐ
    rows = session.execute(text("""
        SELECT id, aggregate_id, event_type, payload
        FROM outbox
        WHERE published_at IS NULL
        ORDER BY id ASC
        LIMIT :batch
        FOR UPDATE SKIP LOCKED
    """), {"batch": BATCH}).all()

    for r in rows:
        # ② ブローカーへ発行。発行キー（順序）には aggregate_id を使う
        broker.publish(
            topic=r.event_type,
            key=r.aggregate_id,
            body=r.payload,
            # consumer 側の重複排除キー。outbox.id は決定的で再実行しても変わらない
            dedup_id=str(r.id),
        )
        # ③ 発行できたものから published_at を埋める
        session.execute(
            text("UPDATE outbox SET published_at = now() WHERE id = :id"),
            {"id": r.id},
        )
    session.commit()
    return len(rows)

def run_relay(session_factory, broker, interval: float = 1.0) -> None:
    while True:
        with session_factory() as session:
            published = relay_once(session, broker)
        if published == 0:
            time.sleep(interval)  # 空振り時だけ待つ（busy-loop を避ける）
```

### Why this relay is safe

Three design decisions create its reliability.

1. **`FOR UPDATE SKIP LOCKED`**: even if you make the relay redundant (multiple instances), two instances won't grab the same row at the same time. A row locked by one is skipped by the other. It achieves **horizontal scaling + elimination of a single point of failure** in one line.
2. **The order "publish → mark" produces at-least-once**: if publish succeeded but it crashes before the `UPDATE`, that row remains **unpublished** and is **re-published** next time. **Lost events are zero, and the cost is the possibility of double publishing** — this is at-least-once semantics. In the reverse order (mark first → publish), lost events occur, so **always** keep this order.
3. **`dedup_id=outbox.id`**: even if re-published, since `id` is deterministic and immutable, the consumer can absorb duplicates with "process the same `dedup_id` only once" (Chapter 6).

### The limits of polling, honestly

The drawbacks microservices.io lists are exactly the field's worries. **"Tricky to publish events in order"** — parallelizing breaks the order (the countermeasure is Chapter 7). Plus the **polling-interval trade-off**: shorten it and latency drops but DB load rises; lengthen it and vice versa. For many business systems, around 1 second is a safe starting point. The advantage is clear: **"Works with any SQL database"** — you can start today with no extra infrastructure.

---

## 5. The message relay (2): transaction log tailing (CDC)

The other strategy is **Transaction Log Tailing** (CDC = change data capture). microservices.io's definition is **"Tail the database transaction log and publish each message/event inserted into the outbox to the message broker."**

Where polling "SELECTs the table" from the app's perspective, CDC **tails the commit log (change history) itself** from the DB's perspective. Because it reads the primary record the DB writes, **there's neither polling lag nor query load on the DB.**

Representative implementations (**all as examples**):

- **Debezium** (example) reads **PostgreSQL's WAL** (logical replication) and flows it to Kafka.
- It tails **MySQL's binlog** similarly.
- **DynamoDB Streams** (example): captures item-level changes in a time-series stream, and Lambda forwards them to SQS / EventBridge. This is a configuration that AWS Prescriptive Guidance squarely recommends as "Using change data capture (CDC)."

Another advantage of CDC is the point AWS makes that it "saves the overhead of creating another table." With DynamoDB, if you embed the event in the item's attributes and flow it via Streams, there are cases where you can make **even the outbox table unnecessary.**

However, microservices.io also explicitly notes drawbacks: **"Requires database specific solutions"** and **"Tricky to avoid duplicate publishing."** Handling WAL and binlog differs per DB, and you need operational knowledge of the CDC platform (Debezium / Kafka Connect, etc.). The strong advantage **"Guaranteed to be accurate"** trades off against operational heaviness.

---

## 6. Choosing a relay strategy: Polling vs CDC

It's not "which is correct" but **choose by requirement.** Let me make a decision table.

| Decision axis | Polling Publisher | Transaction Log Tailing (CDC) |
| --- | --- | --- |
| **Ease of implementation** | ◎ Just SQL and a scheduler. You can start today | △ Requires building/operating a CDC platform (Debezium, etc.) |
| **Extra infrastructure** | Unneeded (just the existing DB) | Needed (connector / stream / Kafka, etc.) |
| **Publish latency** | △ Delayed by the polling interval | ◎ Nearly real-time right after commit |
| **DB load** | △ A SELECT runs periodically | ◎ Just tailing the log, zero query load |
| **Ordering guarantee** | △ Easily breaks when parallelized | ◎ The log = commit order, so easy to keep naturally |
| **DB dependence** | ◎ Works on any SQL DB | △ DB-specific, like WAL/binlog/Streams |
| **Throughput** | △ Polling-rate-limited | ◎ Strong at high throughput |
| **Suited for** | Small/medium scale, want to start fast, don't want more ops | Large scale, low latency required, already have a Kafka/CDC platform |

### My practical guideline (YAGNI)

**Starting with Polling first** is the correct order in most projects. The reason is "it works on any SQL DB, has zero extra infrastructure, and the logic is visible." CDC is powerful, but operating Debezium + Kafka Connect **becomes a product in itself.** **Putting in CDC at the stage of "not troubled by second-level latency" and "don't have Kafka yet" is investment in a problem you haven't solved** (YAGNI).

In fact, the billing reconciliation I worked on in the lumber-distribution DX I built in the form of a **reconciliation Lambda running on EventBridge's scheduled execution** — structurally a Polling-leaning relay + reconciliation — to "prevent double charges and lost events even under network outages and retries." In a payment system where **reliable, idempotent reflection** is the top priority over low latency, a simple, easy-to-monitor polling-type is strong. You can consider migrating to CDC once a phase comes where latency truly matters.

---

## 7. At-least-once and ordering guarantees: the downstream's prerequisites

What the outbox pattern guarantees is **at-least-once**, not exactly-once. Misunderstand this and you have an accident in production. Design downstream on the premise of **"published at least once, sometimes twice or more."**

### 7.1 The consumer must be idempotent (required)

AWS and microservices.io speak in unison. AWS says "we recommend that you make the consuming service idempotent by tracking the processed messages." Relay re-execution, SQS standard queue's duplicate delivery, CDC duplicates — **duplicates are structurally unavoidable**, so **absorb them on the consuming side.**

The most reliable is to **"put the deduplication marker in the same transaction as the business update."** Use the **mirror image** of the idea by which the outbox made the "publishing side" atomic, on the consuming side too.

```sql
-- 処理済みイベントの台帳（消費側）。主キーが重複排除そのもの
CREATE TABLE processed_events (
    event_id     TEXT PRIMARY KEY,        -- 発行側の outbox.id を dedup_id として運ぶ
    processed_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
```

```python
"""consumer：重複排除マーカーの INSERT と業務更新を“同一トランザクション”で。
すでに処理済みなら主キー衝突で弾く → 何度配信されても副作用は1回だけ。"""
def handle_order_paid(session, event) -> None:
    try:
        with session.begin():
            # ① まず重複排除マーカーを入れる（同じ event_id は2回目で必ず衝突）
            session.execute(
                text("INSERT INTO processed_events (event_id) VALUES (:id)"),
                {"id": event.dedup_id},
            )
            # ② 業務側の副作用（マーカーと同一Tx。ハンドラが落ちれば両方ロールバック）
            apply_payment_settlement(session, event.payload)
    except UniqueViolation:
        # すでに処理済み = 重複配信。何もせず正常終了（冪等）
        return
```

This "marker and business update in the same transaction" works because **if the handler crashes midway, the marker isn't written either.** No marker remaining = unprocessed, so on resend it's **correctly reprocessed.** "Commit just the marker first → fail in the business processing" causes the accident of becoming **permanently unprocessed**, so **always** make it the same transaction.

> **Related technique (the payment platform eco-pay)**: what I used in the payment platform is exactly this mirror image — a design that **includes the Stripe Webhook's idempotency marker within the same transaction as the business update.** If the handler fails, the marker isn't written either, so Stripe's resend **correctly reprocesses** it. Both "the publishing-side outbox" and "the consuming-side idempotency marker" are applications of the same principle: **"atomize the side effect and the ledger in the same transaction."**
>
> In a different project's (lumber-distribution DX) Stripe Webhook, I **deduplicated with DynamoDB's conditional write (`attribute_not_exists`)** and re-resolved the amount on the server side. Relational → primary-key collision, DynamoDB → conditional write — **"making a uniqueness constraint do the deduplication"** is the royal road regardless of language or DB.

> **For those who want to dig into idempotent-consumption design**: the side that idempotently consumes published events (SQS visibility timeout, DLQ, EventBridge routing, how to hold idempotency keys) is summarized with implementation patterns in [Idempotent Async Processing with SQS + Lambda + EventBridge](/blog/aws-sqs-lambda-eventbridge-idempotent-async-processing-guide). Read them complementarily as **this article = the side that reliably "publishes," that one = the side that reliably "consumes."**

### 7.2 Ordering guarantees: strongly, only where needed

AWS says "Send messages in the same order in which the service updates the database," but you usually don't need to keep the **total order of all events.** What you should keep is only the **"relative order of events for the same aggregate"** (e.g., `Created → Paid → Shipped` for the same order).

- **The basis for publish order**: pull the outbox in `id ASC` (sequence number = order of occurrence).
- **Maintaining intra-aggregate order**: make `aggregate_id` the publish key. With SQS FIFO it's `MessageGroupId`, with Kafka the same partition, and **the same key keeps its order.**
- **Compatibility with parallelization**: if you partition by `aggregate_id`, you can **publish different aggregates in parallel while serializing the same aggregate.** This is the standard play for getting both throughput and order.

"Flow all events into one queue in complete order" kills throughput and is usually an **excessive requirement** (YAGNI). If the consumer is idempotent and intra-aggregate order alone is kept, many businesses work correctly.

---

## 8. Production operation: drive lost events to zero "structurally"

Just putting in the pattern doesn't make it production quality. Let me line up, in priority order, the operational equipment I actually made effective in a payment system.

### 8.1 The reconciliation job — the last safety net

Even if the relay is perfect, over the long run of operations, "missed marks," "unknown bugs," and "external broker anomalies" can happen. So always place a **periodic reconciliation job** as an independent safety net.

```sql
-- “異常に古いのに未発行”の行を検出 = リレーが取りこぼしている兆候
SELECT id, aggregate_id, event_type, created_at
FROM outbox
WHERE published_at IS NULL
  AND created_at < now() - INTERVAL '5 minutes'
ORDER BY id ASC;
```

Run this on EventBridge's scheduled execution (example) and the like, and **recover and re-publish "rows unpublished beyond a certain time."** The lumber-distribution DX's billing reconciliation was made exactly into this form of a **reconciliation Lambda (EventBridge scheduled execution)** that periodically picks back up the lost events of network outages and retries. **A two-stage setup of a real-time relay (immediacy) and a periodic reconciliation (comprehensiveness)** structurally supports a payment system's "0 lost events."

### 8.2 Observability: turn the unpublished count into an alarm

The most important metrics to monitor are **"the oldest age of unpublished events" and "the unpublished count."**

- **Alert if the oldest unpublished `created_at` exceeds a threshold (e.g., 5 minutes)** → a sign the relay has stopped.
- **The unpublished count monotonically increasing** → the relay is short on throughput.
- Emit the relay's **publish count per cycle, failure count, and publish latency** in structured logs.

"Continuing to accumulate in the outbox" is the earliest signal that downstream is silently starving. **An outbox implementation with no alarm here is only half complete.**

### 8.3 Outbox cleanup (TTL / archive)

Leaving published rows makes the outbox monotonically bloat, and the relay's scan slows down too. **Periodically delete the published** (cost efficiency).

```sql
-- 発行済みかつ一定期間を過ぎた行を削除（監査が要るならアーカイブ後に）
DELETE FROM outbox
WHERE published_at IS NOT NULL
  AND published_at < now() - INTERVAL '7 days';
```

If you have audit requirements, archive to low-cost storage (e.g., S3) before deletion. With DynamoDB, auto-expiring with a **TTL attribute** is the standard play. Decide the retention period by working backward from "the window the reconciliation job reviews + the audit requirements."

### 8.4 Relay re-execution safety (restart, redundancy)

The relay must be correct **whenever it goes down and however many run.**

- **Restart tolerance**: all state is in the outbox's `published_at`. The process is stateless, so restart = resume from unpublished, and that's it.
- **Multi-launch tolerance**: prevent double-grabbing of rows with `FOR UPDATE SKIP LOCKED` (Chapter 4).
- **Poison-message countermeasure**: if publish repeatedly fails for a specific row, the whole thing clogs. Add an `attempts` column and design it so it **isolates after N failures (evacuate to a dead-letter row) and doesn't stop what follows** (reliability: a single row's failure doesn't stop the whole).

---

## 9. Summary: the outbox cheat sheet

A quick reference for when you're unsure.

- **The essence of the dual-write problem**: a "DB commit" and a "publish" are 2 writes to different systems. Can't be bound atomically → one fails and you get a **lost event or a ghost event.**
- **The solution**: write the event to the **outbox table** in the **same transaction as the business update** → the write targets become one and atomicity is established.
- **Publishing is a separate process (the relay)**: pick up `published_at IS NULL` → publish → mark. **The order publish→mark** creates at-least-once.
- **Relay strategy**: first **Polling** (works on any SQL, no extra infrastructure). For low latency, large scale, or existing Kafka, **CDC** (Debezium / DynamoDB Streams are examples).
- **Downstream must be idempotent**: a deduplication marker in the **same transaction as the consumer-side business update.** Make a uniqueness constraint (primary key / conditional write) do the deduplication.
- **Order**: total order is unnecessary. Keyed by `aggregate_id`, **serialize only the same aggregate** (SQS FIFO's MessageGroupId / Kafka partition).
- **Production equipment**: the reconciliation job (periodic recovery of lost events), the unpublished-count alarm, cleanup (TTL/archive), redundancy with `SKIP LOCKED`, poison-message isolation.

---

"Write to the DB, then publish" — the dual-write problem hidden in this one line, left alone, becomes a hard-to-reproduce, trust-eroding failure of **"notifications sometimes don't arrive" and "very rarely double-charged."** The outbox pattern eradicates that uncertainty with **just one structural change: "write in the same transaction."**

In an invite-only B2B SaaS (lumber-distribution DX, Stripe Connect marketplace payments, METI Minister's Award), I assembled billing reconciliation with a **transactional outbox + reconciliation Lambda** in a form that "prevents double charges and lost events even under network outages and retries," handled Webhooks with a separate Lambda **deduplicating with DynamoDB's conditional write (`attribute_not_exists`)**, and resolved the amount on the server side. **Reliably reflecting money-moving processing exactly once, even if the network is cut, the process crashes, and retries pile up** — from that design through implementation, operation, and monitoring, I accompany you fast and safely, armed with one person × generative AI (Claude Code).

**"I want to rebuild my own 'DB update → event publishing' with zero lost events and zero double publishing" — feel free to consult us, even from the requirements-organizing stage.**

---

### Reference (Official Documentation)

- [Pattern: Transactional outbox (microservices.io / Chris Richardson)](https://microservices.io/patterns/data/transactional-outbox.html) — the canonical definition of the pattern, the dual-write problem, benefits/drawbacks
- [Transactional outbox pattern (AWS Prescriptive Guidance)](https://docs.aws.amazon.com/prescriptive-guidance/latest/cloud-design-patterns/transactional-outbox.html) — intent / motivation / implementation (outbox table, CDC, SQS), at-least-once and idempotency
- [Pattern: Polling publisher (microservices.io)](https://microservices.io/patterns/data/polling-publisher.html) — the strategy of polling the outbox to publish, advantages and limits
- [Pattern: Transaction log tailing (microservices.io)](https://microservices.io/patterns/data/transaction-log-tailing.html) — the strategy of tailing the transaction log (WAL / binlog / DynamoDB Streams)
