# Designing 'zero double charges' in a serverless payment foundation — implementing idempotency, atomicity, and zero-downtime migration with DynamoDB

> The reliability-layer design of a serverless payment foundation that handles actual money. Based on real code, it explains the implementation patterns that prevent double charges with idempotency keys and conditional writes, guarantee balance consistency with DynamoDB transactions, and evolve the schema without stopping production with dual writes.

- Published: 2026-06-24
- Author: 友田 陽大
- Tags: AWS, DynamoDB, Python, Serverless, Architecture design
- URL: https://tomodahinata.com/en/blog/dynamodb-payment-reliability-idempotency-zero-downtime
- Category: Payments & billing
- Pillar guide: https://tomodahinata.com/en/blog/stripe-payments-production-guide-webhooks-idempotency-subscriptions

## Key points

- Protect correctness through the structure of the code, not through operational care. Idempotency, atomicity, and consistency are delegated to DynamoDB primitives.
- Idempotency is a client-issued key + attribute_not_exists + TTL. The crux is putting the marker insertion in the same TransactWriteItems as the payment itself.
- Atomicity replaces read-modify-write with ADD + ConditionExpression, so the database itself ensures the balance never goes negative, even under contention.
- Retry only on TransactionConflict. ConditionalCheckFailed propagates immediately, and exponential backoff + jitter prevents a thundering herd.
- Zero-downtime migration is an Expand→Migrate→Contract dual write. Atomic ADD combined with an idempotent backfill carried the migration through 13+ phases without stopping production for even one second.

---

A bug in a payment system either takes money directly from the user's wallet or causes a loss to the business. "Roughly works" doesn't cut it. **It must never be wrong: not by one yen, and not even once.**

I worked on a multi-tenant payment platform in the environmental / carbon-credit / local-currency field, built as an AWS serverless foundation. As the core engineer on a team of 3 main developers, I implemented across frontend and backend (4 backends + 4 frontends) as well as the shared foundation and infrastructure, spanning four surfaces: customer, merchant, admin, and storefront terminal (I authored about 60% of the repository's commits). This article focuses on the **payment-reliability layer** that I designed and led. Because the foundation handles real money, points, J-Credits (carbon credits), and local currency, this layer had to drive double charges and balance inconsistencies to zero and **keep evolving the data model without stopping production for even one second.** In practice, I have maintained **0 double charges / balance inconsistencies in production operation.**

In this article, I explain the design adopted in that reliability layer, based on the patterns of code actually in operation. There are 4 broad themes.

- **Idempotency** — converge a charge to once even if retries come
- **Atomicity** — don't break the balance even under contention
- **Retry** — absorb only "failures that may be retried"
- **Zero-downtime migration** — swap the data model while running

There's one philosophy running through it all. **Guarantee correctness by "the structure of code," not "operational carefulness."** Correctness protected by review or procedure documents breaks someday. Correctness protected by types, conditional expressions, and transactions doesn't break.

> Note: the code in the text is simplified by extracting the gist to convey the design intent. Tenant names, table names, specific values, etc., are abstracted.

## Premise: why "serverless × single table"

The foundation is a serverless configuration of **AWS Lambda + DynamoDB.** Because the payment logic is called commonly from multiple Lambdas — the customer app, merchant, admin, and storefront terminal — the business logic is consolidated into a **shared Lambda Layer** as the single source of truth (SSoT).

The reason I chose DynamoDB is that, instead of relational transactions, you get the 2 primitives actually needed for payments.

1. **Atomic numeric increment/decrement with `ADD`** — make "read, add, write back" one instruction
2. **Conditional write with `ConditionExpression`** — judge "subtract if the balance is enough" atomically on the DB side

Combine these two with **`TransactWriteItems` (commit up to 100 items all-or-nothing)** and you can delegate the "correctness" of payments not to application-code locks but to the DB's consistency guarantee. This is the foundation of the entire reliability layer.

The data is a single-table design (PK=`uuid`, SK=`type`), separating records by concern.

| Record (SK) | Role |
| --- | --- |
| `BALANCE#ECOPAY` | the main balance |
| `BALANCE#POINT` / `BALANCE#JCREDIT` | point / carbon-credit balance |
| `BALANCE#REGIONAL#{key}` | local-currency balance |
| `METRICS` | aggregate values like CO2 reduction |
| `card_op_idem#<operation>#<key>` | idempotency marker (with TTL) |

Separating the SK by concern pays off in both the "atomicity" and "zero-downtime migration" sections described later.

## 1. Idempotency: prevent double charges by "design"

### The problem

In a mobile-app payment, nothing guarantees that a request arrives exactly once. A 3G/4G connection timeout, an API Gateway retry, a Lambda re-execution: each of these produces a re-send of the same payment request.

What matters here is accepting that **a retry is not an anomaly but the normal path.** The goal is not to stop retries but to converge the charge to exactly once, no matter how many times the request arrives.

### The solution: client-issued key + `attribute_not_exists` + TTL

The client includes `idempotencyKey` (a UUID) in the payment request's body. The server side concatenates this into the sort key of the **idempotency marker** and inserts it with the `attribute_not_exists` condition.

```python
# Pure function that builds the idempotency marker as "data" (simplified)
# It performs no DB I/O at all; it only returns the TransactItem dict.
def build_idempotency_item(
    *,
    table: str,
    uuid: str,
    sk: str,                # e.g. "card_op_idem#topup#<key>"
    ttl_seconds: int,       # default: 90 days
    timestamp: int,
    extra: dict | None = None,
) -> dict:
    item = {
        "uuid": {"S": uuid},
        "type": {"S": sk},
        "processed_at": {"N": str(timestamp)},
        "ttl": {"N": str(timestamp + ttl_seconds)},
        **(extra or {}),
    }
    return {
        "Put": {
            "TableName": table,
            "Item": item,
            # If the same (uuid, type) already exists, the insert fails = double execution is blocked
            "ConditionExpression": "attribute_not_exists(#uuid)",
            "ExpressionAttributeNames": {"#uuid": "uuid"},
        }
    }
```

There are 3 design points.

- **It's a pure function.** This function never touches DynamoDB. It returns only the data describing how to write the marker, so it can be unit-tested without starting a database, and breaking changes to the SK format or the TTL calculation are caught by **golden-vector tests.**
- **It auto-expires with TTL.** DynamoDB TTL deletes the marker after a default of 90 days. Keeping idempotency-only records forever would make storage cost and table size grow linearly, so this design balances correctness against cost efficiency.
- **Put the marker insertion in the same transaction as the payment itself.** This is the crux. Bundle the idempotency marker's `Put` and the balance `ADD` into a **single `TransactWriteItems`.** A second request fails the marker insertion with `ConditionalCheckFailed`, and because the whole transaction is canceled atomically, the balance does not change at all.
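
As a minimal sketch of that last point, two pure builders can be composed into one item list before a single `transact_write_items` call. The function names `build_marker_put`, `build_balance_add`, and `build_topup_transaction`, the table name, and the choice of `BALANCE#ECOPAY` as the target record are hypothetical stand-ins for illustration, not the production code.

```python
from decimal import Decimal


def build_marker_put(table: str, uuid: str, sk: str, now: int, ttl_seconds: int) -> dict:
    """Idempotency marker: the Put fails if the same (uuid, type) already exists."""
    return {
        "Put": {
            "TableName": table,
            "Item": {
                "uuid": {"S": uuid},
                "type": {"S": sk},
                "processed_at": {"N": str(now)},
                "ttl": {"N": str(now + ttl_seconds)},
            },
            "ConditionExpression": "attribute_not_exists(#uuid)",
            "ExpressionAttributeNames": {"#uuid": "uuid"},
        }
    }


def build_balance_add(table: str, uuid: str, amount: Decimal, now: int) -> dict:
    """Atomic increment of a balance record (hypothetical key layout)."""
    return {
        "Update": {
            "TableName": table,
            "Key": {"uuid": {"S": uuid}, "type": {"S": "BALANCE#ECOPAY"}},
            "UpdateExpression": "ADD balance :delta SET updated_at = :ua",
            "ExpressionAttributeValues": {
                ":delta": {"N": str(amount)},
                ":ua": {"N": str(now)},
            },
        }
    }


def build_topup_transaction(
    table: str, uuid: str, idempotency_key: str, amount: Decimal, now: int
) -> list:
    """Marker and balance change travel in one all-or-nothing item list."""
    sk = f"card_op_idem#topup#{idempotency_key}"
    return [
        build_marker_put(table, uuid, sk, now, ttl_seconds=90 * 86400),
        build_balance_add(table, uuid, amount, now),
    ]


items = build_topup_transaction("payments", "cust-1", "key-123", Decimal("500"), 1_700_000_000)
# client.transact_write_items(TransactItems=items)
# A replay with the same key fails the marker Put, so the ADD is never applied.
```

Because the marker and the balance record are different items, they can sit in the same transaction; the composition itself stays a pure function that can be asserted on without a database.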

### Webhook idempotency: if it fails, "don't leave a marker either"

For a webhook from an external payment provider (Stripe), the same principle applies, but there is a trap.

The common implementation is "write the event ID's marker **before** running the handler," but with this, when the handler dies midway, only the marker remains. Even if Stripe correctly re-sends, it's misjudged as "processed" and **the payment is silently lost.**

So include the event ID's marker **in the same transaction as the handler's side effect.**

```python
# The Stripe event-ID-based idempotency marker is also returned as a TransactItem.
# If the handler fails, the marker is not written either → Stripe's re-send is processed correctly.
idem_item = build_event_idempotency_transact_item(
    table=table,
    event_id=event["id"],
    ttl_seconds=30 * 86400,   # Stripe re-sends for up to about 3 days; 30 days is a conservative setting for audits.
    timestamp=now,
)

client.transact_write_items(
    TransactItems=[idem_item, deduct_item, history_item],  # all-or-nothing
)
```

Because the marker and the side effect share the same fate, the limbo state of "the marker remained but the processing failed" **structurally cannot exist.** The webhook idempotency window is kept short at 30 days (Stripe's re-send limit of about 3 days plus an audit margin), which keeps cost down here too.

### Staged rollout and observability

Making the idempotency key mandatory all at once would break existing clients, so I rolled it out in three stages:

1. **Optional phase** — use the key if present. On absence, fire the `IdempotencyKeyMissing` metric
2. **Monitor the production absence rate with CloudWatch** — confirm with numbers that it converged to 0
3. **Mandatory phase** — reject key absence with HTTP 400

I also track replays (re-arrival of the same key) with the `IdempotencyReplay` metric, which makes abnormal re-send patterns detectable. The idea is to **decide when it's safe to make the key mandatory by metrics, not by intuition.**
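
The optional and mandatory phases can be sketched as one small function behind a flag. The function name `resolve_idempotency_key`, the `mandatory` flag, and the injected `emit_metric` callback are hypothetical; emitting the metric in the mandatory phase as well is an assumption of this sketch, made so the absence rate stays visible after the switch.

```python
from typing import Callable, Optional


def resolve_idempotency_key(
    body: dict,
    *,
    mandatory: bool,
    emit_metric: Callable[[str], None],
) -> tuple:
    """Return (http_status, key). Status 200 means 'continue processing'."""
    key: Optional[str] = body.get("idempotencyKey")
    if key:
        return 200, key
    # Key is absent: count it so the absence rate can be watched in CloudWatch.
    emit_metric("IdempotencyKeyMissing")
    if mandatory:
        return 400, None  # mandatory phase: reject the request
    return 200, None      # optional phase: proceed without replay protection


emitted: list = []
optional_result = resolve_idempotency_key({}, mandatory=False, emit_metric=emitted.append)
mandatory_result = resolve_idempotency_key({}, mandatory=True, emit_metric=emitted.append)
keyed_result = resolve_idempotency_key(
    {"idempotencyKey": "abc"}, mandatory=True, emit_metric=emitted.append
)
```

Keeping the phase behind a single flag means the switch to mandatory is a configuration change that can be made once the metric has stayed at zero.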

## 2. Atomicity: don't break the balance even under contention

### The problem

Payments, top-ups (charges), and refunds can hit the same card and the same customer at the same time. The naive implementation looks like this.

```python
# Anti-pattern: read-modify-write creates a race
balance = read_balance(card_id)          # ① read the balance
if balance < amount:                     # ② check
    raise InsufficientError
write_balance(card_id, balance - amount) # ③ write back
```

If another request cuts in between ① and ③, the check runs against a stale balance and **the balance effectively goes negative**: both requests pass, and more is spent than the account held. An app-side lock can prevent this, but a lock is a new single point of failure and a breeding ground for deadlocks.
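
The interleaving can be replayed deterministically in memory. This is only an illustration: the `store` dict and the two helper functions are hypothetical stand-ins for the balance record, and the two "requests" are simply interleaved by hand instead of running in threads.

```python
from decimal import Decimal

store = {"card-1": Decimal("100")}  # in-memory stand-in for the balance record


def read_balance(card_id: str) -> Decimal:
    return store[card_id]


def write_balance(card_id: str, value: Decimal) -> None:
    store[card_id] = value


amount = Decimal("80")

# Requests A and B both execute step 1 before either reaches step 3.
seen_by_a = read_balance("card-1")  # A reads 100
seen_by_b = read_balance("card-1")  # B reads 100

# Step 2 passes for both, because both checked the same stale value.
both_checks_pass = seen_by_a >= amount and seen_by_b >= amount

write_balance("card-1", seen_by_a - amount)  # A writes back 20
write_balance("card-1", seen_by_b - amount)  # B overwrites with 20

final_balance = store["card-1"]
total_spent = amount * 2  # 160 was paid out of a balance of 100
```

Both payments succeed even though only one was affordable. With the absolute write-back used here, the second write overwrites the first; with a relative decrement in the last step, the stored balance would end at -60 instead.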

### The solution: erase read-modify-write with `ADD` + `ConditionExpression`

In DynamoDB, you can fold "read and subtract" into one conditional `Update`.

```python
from decimal import Decimal

# Pure function that builds the balance update as "data" (simplified)
def build_deduct(table: str, card_id: str, amount: Decimal, now: int) -> dict:
    return {
        "Update": {
            "TableName": table,
            "Key": {"uuid": {"S": card_id}, "type": {"S": "card"}},
            # ADD is atomic: there is no read-then-write race.
            "UpdateExpression": "ADD balance :delta SET updated_at = :ua",
            # Committed only when the balance is sufficient. The check runs atomically on the DB side.
            "ConditionExpression": "balance >= :amount",
            "ExpressionAttributeValues": {
                ":delta":  {"N": str(-amount)},
                ":amount": {"N": str(amount)},
                ":ua":     {"N": str(now)},
            },
        }
    }
```

The "read the balance" step disappeared. DynamoDB does the conditional judgment atomically **at commit time**, so no matter how much it runs in parallel, the balance never goes negative.

An "adding side" constraint, such as a top-up cap, can be expressed the same way. For example, to guarantee `balance + amount ≤ cap` for a card top-up, rely on the fact that the condition is evaluated against the stored balance before the `ADD` is applied, and write it like this.

```text
ADD       balance :delta            # :delta = charge amount + bonus
CONDITION balance <= :max_allowed   # :max_allowed = cap − charge amount
```

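The `ADD` is applied only when `balance <= cap − amount` holds for the pre-update balance, so `balance + amount <= cap` is guaranteed atomically. Note that in this form the cap bounds the charge amount itself; a bonus included in `:delta` is added on top of it.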
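
Written as a pure builder in the same style as the deduction, the cap check might look like the sketch below. The function name `build_topup_with_cap`, its parameters, and the sample amounts are hypothetical.

```python
from decimal import Decimal


def build_topup_with_cap(
    table: str, card_id: str, amount: Decimal, bonus: Decimal, cap: Decimal, now: int
) -> dict:
    """Top-up that commits only if the pre-update balance leaves room for `amount`."""
    return {
        "Update": {
            "TableName": table,
            "Key": {"uuid": {"S": card_id}, "type": {"S": "card"}},
            "UpdateExpression": "ADD balance :delta SET updated_at = :ua",
            # Evaluated against the stored balance, before the ADD is applied.
            "ConditionExpression": "balance <= :max_allowed",
            "ExpressionAttributeValues": {
                ":delta": {"N": str(amount + bonus)},
                ":max_allowed": {"N": str(cap - amount)},
                ":ua": {"N": str(now)},
            },
        }
    }


item = build_topup_with_cap(
    "payments", "card-1", Decimal("3000"), Decimal("30"), Decimal("50000"), 1_700_000_000
)
values = item["Update"]["ExpressionAttributeValues"]
```

With a cap of 50000 and a charge of 3000, the transaction commits only while the stored balance is at most 47000.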

### Classify failure by "type"

When a conditional write fails, the reason isn't one. Insufficient balance, cap exceeded, or just contention — these are **different in the action the caller should take**, so don't swallow them vaguely. Classify the failure cause into a typed enum.

```python
from enum import StrEnum  # Python 3.11+

class BalanceTransactFailure(StrEnum):
    INSUFFICIENT = "insufficient"   # insufficient balance → HTTP 400 (retrying is pointless)
    CAP_EXCEEDED = "cap_exceeded"   # cap exceeded         → HTTP 400
    CONFLICT     = "conflict"       # pure contention      → retryable
    THROTTLED    = "throttled"      # capacity exceeded    → 503 / retryable
```

Treating `INSUFFICIENT` and `CONFLICT` as the same "failure" wastes latency and cost by uselessly retrying an insufficient-balance request. Classifying the cause lets you **separate failures that should be retried from those that shouldn't** (connecting to the next chapter).
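
One way to make that separation explicit is to derive the caller's action from the enum in a single place. This is a sketch under assumptions: `is_retryable` and `http_status_for` are hypothetical helpers, the enum is redeclared with `(str, Enum)` only so the example also runs on Python versions before 3.11, and the 409 returned for an exhausted `CONFLICT` is an assumption, since only the 400 and 503 mappings are stated above.

```python
from enum import Enum


class BalanceTransactFailure(str, Enum):
    INSUFFICIENT = "insufficient"
    CAP_EXCEEDED = "cap_exceeded"
    CONFLICT = "conflict"
    THROTTLED = "throttled"


_RETRYABLE = frozenset({BalanceTransactFailure.CONFLICT, BalanceTransactFailure.THROTTLED})


def is_retryable(failure: BalanceTransactFailure) -> bool:
    """Only contention and throttling can succeed on a later attempt."""
    return failure in _RETRYABLE


def http_status_for(failure: BalanceTransactFailure) -> int:
    """Status to return when the failure finally reaches the caller."""
    if failure in (BalanceTransactFailure.INSUFFICIENT, BalanceTransactFailure.CAP_EXCEEDED):
        return 400
    if failure is BalanceTransactFailure.THROTTLED:
        return 503
    return 409  # assumption of this sketch: status for CONFLICT after retries run out
```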

### All amounts are `Decimal`. Don't accept `float`

Amounts and the CO2 conversion (the conversion rate is `Decimal('0.01')`) are **all unified as `Decimal`.** `float` is a binary floating point, so it's the world of `0.1 + 0.2 != 0.3`. Allow this in payments and rounding errors accumulate with every transaction.

So at the stage of the low-level function that converts a value to a DynamoDB attribute, I **reject `float` at runtime.** By rejecting `float` both statically with type annotations (mypy strict) and dynamically in the conversion function, I doubly prevent the accident of "`float` accidentally mixing in."
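
A minimal sketch of such a runtime guard, assuming a hypothetical converter named `to_dynamodb_number`; rejecting `bool` as well is an extra precaution of this sketch, since `bool` is a subclass of `int`.

```python
from decimal import Decimal
from typing import Union


def to_dynamodb_number(value: Union[Decimal, int]) -> dict:
    """Convert an amount to a DynamoDB number attribute, refusing float."""
    # float is refused outright; bool is refused because it would pass an int check.
    if isinstance(value, (float, bool)) or not isinstance(value, (Decimal, int)):
        raise TypeError(f"amounts must be Decimal or int, got {type(value).__name__}")
    return {"N": str(value)}


rate = Decimal("0.01")
co2 = to_dynamodb_number(Decimal("1500") * rate)

try:
    to_dynamodb_number(0.1 + 0.2)
    float_rejected = False
except TypeError:
    float_rejected = True

binary_drift = (0.1 + 0.2) != 0.3                                     # float arithmetic drifts
decimal_exact = Decimal("0.1") + Decimal("0.2") == Decimal("0.3")     # Decimal does not
```

The static annotation documents the contract, and the `isinstance` check enforces it for values that arrive from untyped sources such as parsed JSON.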

## 3. Retry: absorb only failures that may be retried

`TransactWriteItems` can be temporarily canceled with `TransactionConflict` by optimistic concurrency control. Since this is just "they happened to try to write at the same time," waiting a bit and retrying succeeds.

On the other hand, `ConditionalCheckFailed` (insufficient balance, idempotency collision, cap exceeded) **doesn't change the result no matter how many times you retry.** Rather, retrying is harmful.

I implemented a retry helper that distinguishes these two.

```python
import random
import time

_RETRY_MAX_ATTEMPTS = 3
_RETRY_BASE_DELAY_MS = 50

def transact_write_with_retry(client, transact_items, *, max_retries=_RETRY_MAX_ATTEMPTS):
    for attempt in range(max_retries + 1):
        try:
            client.transact_write_items(TransactItems=transact_items)
            return
        except client.exceptions.TransactionCanceledException as exc:
            # ConditionalCheckFailed is a semantic failure → propagate immediately (no retry)
            if any_condition_failed(exc):
                raise
            # Cancellations other than TransactionConflict are not retried either
            if not is_transaction_conflict(exc):
                raise

            _emit_transaction_conflict_metric()   # emit to CloudWatch
            if attempt == max_retries:
                raise

            # Exponential backoff (50ms → 100ms → 200ms) + jitter (±50%)
            delay_s = (_RETRY_BASE_DELAY_MS * (2 ** attempt)) / 1000
            jitter_s = random.uniform(-0.5, 0.5) * delay_s
            time.sleep(delay_s + jitter_s)
```

The design points.

- **Retry only on `TransactionConflict`.** Since `ConditionalCheckFailed` directly relates to the semantics of idempotency collision and balance conditions, propagate it to the caller immediately. Not mixing "failures fixed by retry" and "failures not fixed" protects both latency and cost.
- **Exponential backoff + jitter.** Double the wait time with `50ms × 2^attempt`, and further add `±50%` jitter. Without jitter, multiple contended requests retry all at once at the same interval, causing a **thundering herd (avalanche).**
- **Keep the reason-code checks in a single source of truth.** Don't scatter string comparisons against `"TransactionConflict"` / `"ConditionalCheckFailed"` across the codebase; consolidate them in a shared module (`any_condition_failed` / `is_transaction_conflict`). Fix it in one place and every Lambda picks up the change: DRY in practice.
- **Contention is observable.** On every contention, fire the `EcoPay/Payments::TransactionConflict` metric, and a CloudWatch alarm (alerts on more than 5 in 5 minutes) detects a surge in concurrent contention. Don't swallow it with retries to "make it invisible"; the point is to **absorb while measuring.**
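
The two shared helpers could look roughly like the sketch below. It assumes the per-item reasons are available under `exc.response["CancellationReasons"]`, each with a `Code` field, which is how botocore surfaces them for `TransactionCanceledException`; the bodies here are an illustration, and `FakeCanceled` is a hypothetical test double.

```python
def _reason_codes(exc) -> list:
    """Collect the per-item codes from a TransactionCanceledException-like object."""
    reasons = getattr(exc, "response", {}).get("CancellationReasons", [])
    return [reason.get("Code") for reason in reasons]


def any_condition_failed(exc) -> bool:
    """True if at least one item in the transaction failed its ConditionExpression."""
    return "ConditionalCheckFailed" in _reason_codes(exc)


def is_transaction_conflict(exc) -> bool:
    """True if the cancellation was caused by a concurrent transaction."""
    return "TransactionConflict" in _reason_codes(exc)


class FakeCanceled(Exception):
    """Test double carrying the same `response` shape as the botocore error."""

    def __init__(self, codes):
        super().__init__("Transaction cancelled")
        self.response = {"CancellationReasons": [{"Code": code} for code in codes]}


replay = FakeCanceled(["ConditionalCheckFailed", "None"])   # marker Put rejected
contention = FakeCanceled(["None", "TransactionConflict"])  # concurrent writer
```

Because the helpers only read a dict, they can be unit-tested with a plain exception object and no AWS client.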

## 4. Zero-downtime migration: swap the engine while running

### The problem

The initial data model had one customer's balance, points, J-Credit, and profile all cohabiting in one huge record (the so-called **God Record**). Write conflicts on concurrent updates were likely, and separation of concerns wasn't done.

I wanted to migrate this to a new schema split by concern. However, **production payments can't be stopped for even one second.** Putting up a maintenance screen late at night and running a bulk batch migration was off the table from the start. A bulk migration carries too much risk: data can become inconsistent mid-migration, and rolling back after a failure is difficult.

### The solution: staged migration by dual writes (mirror writes)

What I adopted was a staged migration applying the idea of **Expand / Migrate / Contract** to payment data.

1. **Dual write (Expand)** — reflect new writes to **both** the old schema and the new schema
2. **Re-read / dedup (Migrate)** — switch reads to be new-schema-centric, and during the period where old/new are mixed, dedup and reconcile. In parallel, backfill old data into the new schema
3. **Remove old data (Contract)** — confirm unification to the new schema and stop writes to the old schema

I broke this into 13+ phases, designed so that **stopping after any single phase never breaks production.** Examples include "split the God Record into profile and balance (12-3)" and "split `BALANCE#POINT`/`BALANCE#JCREDIT` from the aggregate value `METRICS` (12-4)."

Every migration write is expressed as a **pure builder function** that returns an `Update` `TransactItem`.

```python
# Pure function that builds the balance mirror write as "data" (simplified)
def build_balance_mirror_items(table, customer_id, *, point_delta=0, now) -> list[dict]:
    items: list[dict] = []
    if point_delta:
        items.append({
            "Update": {
                "TableName": table,
                "Key": {"uuid": {"S": customer_id}, "type": {"S": "BALANCE#POINT"}},
                # ADD is an atomic increment (not idempotent by itself). The backfill applies each delta once, so the final state converges.
                "UpdateExpression": "ADD balance :delta SET updated_at = :ua",
                "ExpressionAttributeValues": {
                    ":delta": {"N": str(point_delta)},
                    ":ua": {"N": str(now)},
                },
            }
        })
    return items
```

The reason this approach is safe is the crux of migration.

- **`ADD` is atomic, but it is not idempotent by itself.** Applying "add +100 to the balance" twice adds 200, so replaying it blindly would corrupt the balance. The safety comes from **designing the backfill to be idempotent**: each new dual write is `ADD`'d exactly once, and the backfill is designed to transfer only unprocessed records, once each. Because `ADD` is atomic and order-independent, the final balance converges to a single value even when dual writes and backfill interleave in time.
- **Protect existing values with `if_not_exists`.** In the profile backfill, write like `created_at = if_not_exists(created_at, :ca)` so the backfill doesn't overwrite an already-existing value.
- **Distinguish deletion intent from `None`.** "Don't update this attribute (`None`)" and "delete this attribute" are different. To express the latter, prepare a dedicated `CLEAR` sentinel and **make deletion explicit** like `phone_number=CLEAR`. It eliminates the ambiguity of deleting with `None`.
- **Return an empty array if no write is needed.** When there's nothing to change, the builder returns `[]` and does no wasteful write (no-op).
- **Dedup on the read side.** During the migration period, both old-format and new-format transaction history can exist, so on read, dedup by the identity of `(PK, SK, timestamp, type)` so the migration looks transparent to the user.
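
The `if_not_exists`, `CLEAR`, and empty-list points can be combined in one profile builder. This is a sketch under assumptions: the function name `build_profile_backfill_items`, the `_Clear` class, and the `PROFILE` sort key are hypothetical, and only two attributes are handled to keep it short.

```python
from typing import Optional, Union


class _Clear:
    """Sentinel type: 'remove this attribute', as opposed to None = 'leave it alone'."""

    def __repr__(self) -> str:
        return "CLEAR"


CLEAR = _Clear()


def build_profile_backfill_items(
    table: str,
    customer_id: str,
    *,
    created_at: Optional[int] = None,
    phone_number: Union[str, _Clear, None] = None,
) -> list:
    set_parts, remove_parts, values = [], [], {}
    if created_at is not None:
        # if_not_exists keeps a value that is already stored.
        set_parts.append("created_at = if_not_exists(created_at, :ca)")
        values[":ca"] = {"N": str(created_at)}
    if phone_number is CLEAR:
        remove_parts.append("phone_number")  # explicit deletion
    elif phone_number is not None:
        set_parts.append("phone_number = :pn")
        values[":pn"] = {"S": phone_number}

    clauses = []
    if set_parts:
        clauses.append("SET " + ", ".join(set_parts))
    if remove_parts:
        clauses.append("REMOVE " + ", ".join(remove_parts))
    if not clauses:
        return []  # nothing to change: no write at all

    update = {
        "TableName": table,
        "Key": {"uuid": {"S": customer_id}, "type": {"S": "PROFILE"}},
        "UpdateExpression": " ".join(clauses),
    }
    if values:
        # DynamoDB rejects an empty ExpressionAttributeValues map, so omit it.
        update["ExpressionAttributeValues"] = values
    return [{"Update": update}]


noop = build_profile_backfill_items("payments", "cust-1")
cleared = build_profile_backfill_items("payments", "cust-1", phone_number=CLEAR)
backfill = build_profile_backfill_items("payments", "cust-1", created_at=1_600_000_000)
```

Using a dedicated sentinel object rather than a magic string means `phone_number is CLEAR` can never collide with a real phone-number value.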

As a result, **without stopping the running payments for even one second**, I completed the migration of 13+ phases non-stop from the God Record to a concern-separated schema.

## A cross-cutting foundation: testability, type safety, observability

Three designs are common to the code so far. These weren't things to "add later" but **preconditions** that make reliability hold.

**Testability (pure functions).** The idempotency marker, balance update, and migration builders are all pure functions with no DB I/O. Since you only need to verify "what `TransactItem` it returns" for the input, you can test fast — including boundary conditions — without starting DynamoDB or assembling mocks. Furthermore, I fix the wire format of SK, conditional expressions, and TTL with **golden vectors**, stopping the accident of "the storage format changing unnoticed" with regression tests.
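
A golden-vector test for a builder can be as small as the sketch below. The builder `build_marker` is a compact, hypothetical stand-in for the idempotency builder, and the table name, IDs, and timestamp are made-up sample inputs; the point is that the expected output is written out literally instead of being recomputed by the test.

```python
def build_marker(table: str, uuid: str, sk: str, now: int, ttl_seconds: int) -> dict:
    """Hypothetical pure builder under test (no DB access)."""
    return {
        "Put": {
            "TableName": table,
            "Item": {
                "uuid": {"S": uuid},
                "type": {"S": sk},
                "processed_at": {"N": str(now)},
                "ttl": {"N": str(now + ttl_seconds)},
            },
            "ConditionExpression": "attribute_not_exists(#uuid)",
            "ExpressionAttributeNames": {"#uuid": "uuid"},
        }
    }


# Golden vector: the exact wire format, fixed by hand.
# Any change to the SK format, the condition, or the TTL arithmetic breaks this test.
GOLDEN = {
    "Put": {
        "TableName": "payments",
        "Item": {
            "uuid": {"S": "cust-1"},
            "type": {"S": "card_op_idem#topup#key-123"},
            "processed_at": {"N": "1700000000"},
            "ttl": {"N": "1707776000"},  # 1700000000 + 90 days in seconds
        },
        "ConditionExpression": "attribute_not_exists(#uuid)",
        "ExpressionAttributeNames": {"#uuid": "uuid"},
    }
}


def test_marker_matches_golden_vector() -> None:
    built = build_marker(
        "payments", "cust-1", "card_op_idem#topup#key-123", 1_700_000_000, 90 * 86400
    )
    assert built == GOLDEN


test_marker_matches_golden_vector()
```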

**Type safety (mypy strict).** The shared Layer enforces function type annotations in mypy strict mode with `disallow_untyped_defs`. By eliminating `Any` from the payment logic, I can detect oversights at refactor time at a compile-equivalent stage. Types are the best — and the cheapest — test.

**Observability (structured logs + metrics).** I continuously observe the idempotency-key absence rate, replays, and transaction conflicts with structured logs and CloudWatch metrics. What matters is not leaving PII in logs. Mask emails and phone numbers, and **never output** PINs, tokens, passwords, and the like to logs. Observability and personal-information protection need to coexist.

## The principles running through the design

Finally, let me organize the principles running through this reliability layer.

| Principle | How it appears in this foundation |
| --- | --- |
| **Protect correctness by structure** | idempotency via `attribute_not_exists`, consistency via `ADD` + conditional expression, atomicity delegated to `TransactWriteItems` |
| **SSoT / DRY** | payment logic consolidated in the shared Layer. Reason-code judgment in one place too |
| **SRP** | separate "judgment (pure function)" and "execution (I/O)." The builder returns only data |
| **Classify failure** | distinguish retriable (CONFLICT/THROTTLED) and non-retriable (INSUFFICIENT/CAP_EXCEEDED) by type |
| **Cost efficiency** | idempotency markers auto-expire with TTL. Follow demand with serverless |
| **Observability** | not just absorb but measure. Judge by metrics |
| **Non-stop evolution (ETC: easier to change)** | swap without stopping via Expand→Migrate→Contract |

Payment reliability isn't a flashy feature. **"Nothing happened" is itself the achievement.** No double charges occur, the balance doesn't break, and no one notices even during a migration. I believe that guaranteeing this by design, not by chance, is the duty of a system entrusted with real money. In fact, I maintain 0 double charges / balance inconsistencies in production operation.

If you're troubled by the same kind of "reliability design of a foundation handling actual money/points," "idempotent, atomic data consistency in serverless," or "schema migration that doesn't stop production," I can help from the design stage.
