Designing 'zero double charges' in a serverless payment foundation — implementing idempotency, atomicity, and zero-downtime migration with DynamoDB

A bug in a payment system either takes money straight out of the user's wallet or costs the business money. "Roughly works" doesn't cut it. The result must never be wrong, not by one yen and not even once.

I work on a multi-tenant payment platform in the environmental, carbon-credit, and local-currency space, built on AWS serverless. As the core engineer in a team with three main developers, I worked across four backends and four frontends (customer, merchant, admin, and storefront terminal) as well as the shared foundation and infrastructure, and I authored about 60% of the repository's commits. This article focuses on the part I designed and led: the payment-reliability layer. The platform handles real money, points, J-Credits (carbon credits), and local currency, so double charges and balance inconsistencies had to be zero, and the data model had to keep evolving without stopping production for even one second. In production operation, it has so far had 0 double charges and 0 balance inconsistencies.

In this article, I explain the design adopted in that reliability layer, based on the patterns of code actually in operation. There are 4 broad themes.

Idempotency: a charge is applied exactly once, no matter how many retries arrive
Atomicity: the balance stays correct even under contention
Retry: absorb only the failures that are safe to retry
Zero-downtime migration: swap the data model while the system keeps running

One philosophy runs through all of it: guarantee correctness through the structure of the code, not through operational care. Correctness that depends on reviews or procedure documents breaks someday. Correctness enforced by types, conditional expressions, and transactions holds.

Note: the code in the text is simplified by extracting the gist to convey the design intent. Tenant names, table names, specific values, etc., are abstracted.

Premise: why "serverless × single table"

The foundation is a serverless configuration of AWS Lambda + DynamoDB. Because the payment logic is shared by multiple Lambdas (customer app, merchant, admin, and storefront terminal), the business logic is consolidated into a shared Lambda Layer as the single source of truth (SSoT).

I chose DynamoDB because, even without relational transactions, it provides the two primitives that payments actually need.

Atomic numeric increment/decrement with ADD: "read, add, write back" becomes a single instruction
Conditional write with ConditionExpression: "subtract only if the balance is sufficient" is decided atomically on the DB side

Combine these two with TransactWriteItems (which commits up to 100 items all-or-nothing) and the correctness of payments rests on the DB's consistency guarantees rather than on application-level locks. This is the foundation of the entire reliability layer.

The data is a single-table design (PK=uuid, SK=type), separating records by concern.

Record (SK)	Role
`BALANCE#ECOPAY`	the main balance
`BALANCE#POINT` / `BALANCE#JCREDIT`	point / carbon-credit balance
`BALANCE#REGIONAL#{key}`	local-currency balance
`METRICS`	aggregate values like CO2 reduction
`card_op_idem#<operation>#<key>`	idempotency marker (with TTL)

Separating SKs by concern pays off in both the atomicity and the zero-downtime migration sections below.

1. Idempotency: prevent double charges by "design"

The problem

In a mobile-app payment, nothing guarantees that a request arrives exactly once. A timeout on a 3G/4G connection, an API Gateway retry, a Lambda re-execution: each of these produces a re-send of the same payment request.

The key is to accept that a retry is not an anomaly but the normal path. The goal is not to stop retries but to make the charge converge to exactly one, no matter how many times the request arrives.

The solution: client-issued key + `attribute_not_exists` + TTL

The client includes idempotencyKey (a UUID) in the payment request's body. The server side concatenates this into the sort key of the idempotency marker and inserts it with the attribute_not_exists condition.

# Pure function that builds the idempotency marker as data (simplified).
# It performs no DB I/O; it only returns the TransactItem dict.
def build_idempotency_item(
    *,
    table: str,
    uuid: str,
    sk: str,                # e.g. "card_op_idem#topup#<key>"
    ttl_seconds: int,       # default: 90 days
    timestamp: int,
    extra: dict | None = None,
) -> dict:
    item = {
        "uuid": {"S": uuid},
        "type": {"S": sk},
        "processed_at": {"N": str(timestamp)},
        "ttl": {"N": str(timestamp + ttl_seconds)},
        **(extra or {}),
    }
    return {
        "Put": {
            "TableName": table,
            "Item": item,
            # If the same (uuid, type) already exists, the insert fails = double execution is blocked
            "ConditionExpression": "attribute_not_exists(#uuid)",
            "ExpressionAttributeNames": {"#uuid": "uuid"},
        }
    }

There are 3 design points.

It's a pure function. It never touches DynamoDB and returns only the data describing how to write the marker, so it can be unit-tested without starting a DB, and breaking changes to the SK format or the TTL calculation can be caught with golden-vector tests.
It auto-expires with TTL. DynamoDB TTL deletes the marker after a default of 90 days. Keeping records that exist solely for idempotency forever would make storage cost and table size grow linearly, so this design balances correctness against cost efficiency.
The marker insertion goes in the same transaction as the payment itself. This is the crux. The idempotency marker's Put and the balance ADD are bundled into a single TransactWriteItems call. A second request fails the marker's condition check with ConditionalCheckFailed, the whole transaction is canceled, and the balance does not move at all.
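
To make the third point concrete, here is a minimal in-memory sketch of the all-or-nothing behaviour. It is a toy stand-in for TransactWriteItems, not DynamoDB itself: `apply_transaction`, `topup`, and the shape of the op dictionaries are hypothetical names invented for this illustration.

```python
from decimal import Decimal


class ConditionFailed(Exception):
    """Toy stand-in for a ConditionalCheckFailed cancellation."""


def apply_transaction(table: dict, ops: list) -> None:
    """Commit every op or none: check all conditions first, then apply."""
    for op in ops:
        if op["kind"] == "put_if_absent" and op["key"] in table:
            raise ConditionFailed(op["key"])
    for op in ops:
        if op["kind"] == "put_if_absent":
            table[op["key"]] = op["item"]
        else:  # "add": increment, creating the counter at 0 if it is missing
            table[op["key"]] = table.get(op["key"], Decimal("0")) + op["delta"]


def topup(table: dict, customer: str, idem_key: str, amount: Decimal) -> bool:
    """Return True if the charge was applied, False if it was a replay."""
    ops = [
        {"kind": "put_if_absent",
         "key": (customer, f"card_op_idem#topup#{idem_key}"),
         "item": {"processed": True}},
        {"kind": "add", "key": (customer, "BALANCE#ECOPAY"), "delta": amount},
    ]
    try:
        apply_transaction(table, ops)
        return True
    except ConditionFailed:
        return False  # marker already present: the balance op never ran


table: dict = {}
first = topup(table, "cust-1", "key-abc", Decimal("100"))
replay = topup(table, "cust-1", "key-abc", Decimal("100"))
```

The replay is rejected by the marker's condition, and because the balance increment lives in the same transaction it is never applied a second time.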

Webhook idempotency: if it fails, "don't leave a marker either"

For a webhook from an external payment (Stripe), the same principle applies, but there's a trap here.

The common implementation is "write the event ID's marker before running the handler," but with this, when the handler dies midway, only the marker remains. Even if Stripe correctly re-sends, it's misjudged as "processed" and the payment is silently lost.

So include the event ID's marker in the same transaction as the handler's side effect.

# The Stripe event-ID idempotency marker is also returned as a TransactItem.
# If the handler fails, no marker is written, so Stripe's re-send is processed correctly.
idem_item = build_event_idempotency_transact_item(
    table=table,
    event_id=event["id"],
    ttl_seconds=30 * 86400,   # Stripe re-sends for up to about 3 days; 30 days is a conservative value for auditing.
    timestamp=now,
)

client.transact_write_items(
    TransactItems=[idem_item, deduct_item, history_item],  # all-or-nothing
)

Because the marker and the side effect share their fate, the limbo state of "the marker remained but the processing failed" structurally cannot exist. The webhook idempotency period is short at 30 days (Stripe's re-send limit of 3 days + audit margin), optimizing cost here too.
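
The trap and its fix can be reproduced in a few lines. This is a toy model, with a Python set and list standing in for the marker store and the ledger; the function names are made up for illustration and are not part of the production code.

```python
def marker_first(seen: set, ledger: list, event_id: str, crash: bool) -> str:
    """Anti-pattern: the marker is written before the side effect."""
    if event_id in seen:
        return "skipped"
    seen.add(event_id)
    if crash:
        raise RuntimeError("handler died midway")
    ledger.append(event_id)
    return "processed"


def marker_with_effect(seen: set, ledger: list, event_id: str, crash: bool) -> str:
    """Marker and side effect are committed together, or not at all."""
    if event_id in seen:
        return "skipped"
    if crash:
        raise RuntimeError("handler died midway")  # nothing was written
    seen.add(event_id)
    ledger.append(event_id)
    return "processed"


def deliver_twice(handler) -> tuple:
    """First delivery crashes, then the sender re-sends the same event."""
    seen, ledger = set(), []
    try:
        handler(seen, ledger, "evt_1", crash=True)
    except RuntimeError:
        pass
    second = handler(seen, ledger, "evt_1", crash=False)
    return second, ledger
```

With `marker_first`, the re-send is skipped and the ledger stays empty, which is exactly the silently lost payment. With `marker_with_effect`, the re-send is processed normally.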

Staged rollout and observability

Making the idempotency key mandatory all at once would break existing clients, so I rolled it out in three steps.

Optional phase: use the key if present; when it is absent, emit the IdempotencyKeyMissing metric
Monitoring: watch the production absence rate in CloudWatch and confirm with numbers that it has converged to 0
Mandatory phase: reject requests without a key with HTTP 400

Replays (re-arrival of the same key) are also tracked with the IdempotencyReplay metric, which makes abnormal re-send patterns detectable. The point is to decide when it is safe to make the key mandatory by metrics, not by intuition.
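
The rollout can be sketched as a single policy switch. `KeyPolicy` and `check_idempotency_key` are hypothetical names, and emitting the metric in the mandatory phase as well is my assumption for this sketch, not something the production code is known to do.

```python
from enum import Enum
from typing import Callable, Optional


class KeyPolicy(Enum):
    OPTIONAL = "optional"    # accept requests without a key, but count them
    MANDATORY = "mandatory"  # reject requests without a key


def check_idempotency_key(
    body: dict, policy: KeyPolicy, emit_metric: Callable[[str], None]
) -> Optional[int]:
    """Return an HTTP status to reject with, or None to continue."""
    if body.get("idempotencyKey"):
        return None
    emit_metric("IdempotencyKeyMissing")  # feeds the absence-rate dashboard
    return 400 if policy is KeyPolicy.MANDATORY else None
```

Moving from the optional to the mandatory phase is then a configuration change, made only after the metric shows the absence rate at 0.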

2. Atomicity: don't break the balance even under contention

The problem

Payments, charges, and refunds can hit the same card and the same customer simultaneously. The naive implementation looks like this.

# Anti-pattern: read-modify-write creates a race
balance = read_balance(card_id)          # ① read the balance
if balance < amount:                     # ② check
    raise InsufficientError
write_balance(card_id, balance - amount) # ③ write back

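If another transaction cuts in between ① and ③, the check runs against a stale balance: both requests pass, one write overwrites the other, and more is spent than the balance ever held (with relative decrements, the balance sinks negative). An app-side lock would prevent this, but a lock is a new single point of failure and a breeding ground for deadlocks.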

The solution: erase read-modify-write with `ADD` + `ConditionExpression`

In DynamoDB, you can fold "read and subtract" into one conditional Update.

from decimal import Decimal

# Pure function that builds the balance update as data (simplified)
def build_deduct(table: str, card_id: str, amount: Decimal, now: int) -> dict:
    return {
        "Update": {
            "TableName": table,
            "Key": {"uuid": {"S": card_id}, "type": {"S": "card"}},
            # ADD is atomic: there is no read-then-write race.
            "UpdateExpression": "ADD balance :delta SET updated_at = :ua",
            # Committed only when the balance is sufficient; the check is atomic on the DB side.
            "ConditionExpression": "balance >= :amount",
            "ExpressionAttributeValues": {
                ":delta":  {"N": str(-amount)},
                ":amount": {"N": str(amount)},
                ":ua":     {"N": str(now)},
            },
        }
    }

The "read the balance" step is gone. DynamoDB evaluates the condition and applies the update as one atomic operation on the item, so no matter how many requests run in parallel, the balance never goes negative.
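
The difference shows up even in a toy model with a forced worst-case interleaving. Both functions below are illustrative simulations in plain Python, not DynamoDB calls, and their names are invented for this sketch.

```python
def read_modify_write(start: int, amounts: list) -> tuple:
    """Worst case: every request reads before any request writes."""
    balance = start
    snapshots = [balance for _ in amounts]      # step 1 for all requests
    accepted = []
    for snapshot, amount in zip(snapshots, amounts):
        ok = snapshot >= amount                  # step 2 runs on a stale value
        if ok:
            balance = snapshot - amount          # step 3 overwrites the other write
        accepted.append(ok)
    return balance, accepted


def conditional_update(start: int, amounts: list) -> tuple:
    """Check and decrement are one indivisible step per request."""
    balance = start
    accepted = []
    for amount in amounts:
        ok = balance >= amount
        if ok:
            balance -= amount
        accepted.append(ok)
    return balance, accepted
```

Starting from a balance of 100 with two deductions of 80, `read_modify_write` accepts both (160 spent, stored balance 20), while `conditional_update` accepts the first and rejects the second.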

An "adding side" constraint such as a charge cap can be expressed with the same idea. The ConditionExpression is evaluated against the item as stored, before the ADD is applied, so to guarantee balance + amount ≤ cap for a card charge, compare the current balance with the cap minus the charge amount.

ADD       balance :delta            # :delta = charge amount + bonus
CONDITION balance <= :max_allowed   # :max_allowed = cap − charge amount

The ADD is applied only when balance <= cap − amount holds, so balance + amount <= cap is guaranteed atomically. Note that, as written, the bonus is added to the balance but is not counted against the cap.
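
As a sketch, the capped charge can be built as data in the same style as build_deduct. `build_topup_with_cap` and its parameters are hypothetical; the key layout is copied from the deduct example, and the sketch assumes the balance attribute already exists on the item (a comparison against a missing attribute does not pass).

```python
from decimal import Decimal


def build_topup_with_cap(
    table: str, card_id: str, amount: Decimal, bonus: Decimal, cap: Decimal, now: int
) -> dict:
    """Build a capped top-up as data; no DynamoDB call is made here."""
    return {
        "Update": {
            "TableName": table,
            "Key": {"uuid": {"S": card_id}, "type": {"S": "card"}},
            "UpdateExpression": "ADD balance :delta SET updated_at = :ua",
            # Evaluated against the stored balance, before ADD is applied.
            "ConditionExpression": "balance <= :max_allowed",
            "ExpressionAttributeValues": {
                ":delta": {"N": str(amount + bonus)},
                ":max_allowed": {"N": str(cap - amount)},
                ":ua": {"N": str(now)},
            },
        }
    }


item = build_topup_with_cap(
    "payments", "card-1", Decimal("3000"), Decimal("150"), Decimal("50000"), 1700000000
)
values = item["Update"]["ExpressionAttributeValues"]
```

For a cap of 50000 and a charge of 3000 with a bonus of 150, the item adds 3150 and requires the stored balance to be at most 47000.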

Classify failure by "type"

When a conditional write fails, there can be more than one reason. Insufficient balance, cap exceeded, or plain contention each call for a different action from the caller, so don't lump them together. Classify the failure cause into a typed enum.

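from enum import StrEnum  # Python 3.11+

class BalanceTransactFailure(StrEnum):
    INSUFFICIENT = "insufficient"   # insufficient balance -> HTTP 400 (retrying is pointless)
    CAP_EXCEEDED = "cap_exceeded"   # cap exceeded         -> HTTP 400
    CONFLICT     = "conflict"       # pure contention      -> retriable
    THROTTLED    = "throttled"      # capacity exceeded    -> 503 / retriable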

Treating INSUFFICIENT and CONFLICT as the same "failure" wastes latency and cost by uselessly retrying an insufficient-balance request. Classifying the cause lets you separate failures that should be retried from those that shouldn't (connecting to the next chapter).
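
One way to produce this enum is to map the per-item reason codes that DynamoDB attaches to a canceled transaction. The sketch below is an assumption about how such a mapping could look: `classify` and `RETRIABLE` are hypothetical names, the `operation` argument is my own device for telling the two condition failures apart, and the idempotency marker (whose failed condition would mean a replay) is deliberately left out.

```python
from enum import Enum
from typing import Optional


class BalanceTransactFailure(str, Enum):  # StrEnum on Python 3.11+
    INSUFFICIENT = "insufficient"
    CAP_EXCEEDED = "cap_exceeded"
    CONFLICT = "conflict"
    THROTTLED = "throttled"


RETRIABLE = frozenset(
    {BalanceTransactFailure.CONFLICT, BalanceTransactFailure.THROTTLED}
)


def classify(reason_codes: list, operation: str) -> Optional[BalanceTransactFailure]:
    """Map cancellation reason codes to one typed failure (None if unknown)."""
    if "ConditionalCheckFailed" in reason_codes:
        # Assumption: a failed condition on a deduct means insufficient funds,
        # and on any other operation (a top-up) it means the cap was hit.
        if operation == "deduct":
            return BalanceTransactFailure.INSUFFICIENT
        return BalanceTransactFailure.CAP_EXCEEDED
    if "TransactionConflict" in reason_codes:
        return BalanceTransactFailure.CONFLICT
    if "ThrottlingError" in reason_codes or "ProvisionedThroughputExceeded" in reason_codes:
        return BalanceTransactFailure.THROTTLED
    return None
```

The caller can then branch on `failure in RETRIABLE` instead of parsing error strings at every call site.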

All amounts are `Decimal`. Don't accept `float`

Amounts and the CO2 conversion (the conversion rate is Decimal('0.01')) are all handled as Decimal. float is binary floating point, where 0.1 + 0.2 != 0.3. Allow it into payments and rounding errors accumulate with every transaction.

So at the stage of the low-level function that converts a value to a DynamoDB attribute, I reject float at runtime. By rejecting float both statically with type annotations (mypy strict) and dynamically in the conversion function, I doubly prevent the accident of "float accidentally mixing in."
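
A minimal sketch of such a conversion function follows. `to_number_attr` is a hypothetical name, and accepting int alongside Decimal is an assumption made for this example.

```python
from decimal import Decimal


def to_number_attr(value) -> dict:
    """Convert an int or Decimal to a DynamoDB number attribute; reject float."""
    # bool is a subclass of int, so it has to be excluded explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, Decimal)):
        raise TypeError(
            f"amounts must be int or Decimal, got {type(value).__name__}"
        )
    return {"N": str(value)}
```

Because every builder goes through this one function, a float that slips past the type checker still fails loudly at runtime instead of being stored.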

3. Retry: absorb only failures that may be retried

TransactWriteItems can be canceled with TransactionConflict when another request is modifying the same item at the same moment. Since this only means that two writers happened to collide, waiting briefly and retrying usually succeeds.

On the other hand, ConditionalCheckFailed (insufficient balance, idempotency collision, cap exceeded) doesn't change the result no matter how many times you retry. Rather, retrying is harmful.

I implemented a retry helper that distinguishes these two.

import random
import time

_RETRY_MAX_ATTEMPTS = 3      # used as the number of retries after the first attempt
_RETRY_BASE_DELAY_MS = 50

def transact_write_with_retry(client, transact_items, *, max_retries=_RETRY_MAX_ATTEMPTS):
    for attempt in range(max_retries + 1):
        try:
            client.transact_write_items(TransactItems=transact_items)
            return
        except client.exceptions.TransactionCanceledException as exc:
            # ConditionalCheckFailed is a semantic failure -> propagate immediately (no retry)
            if any_condition_failed(exc):
                raise
            # Cancellations other than TransactionConflict are not retried either
            if not is_transaction_conflict(exc):
                raise

            _emit_transaction_conflict_metric()   # emit to CloudWatch
            if attempt == max_retries:
                raise

            # Exponential backoff (50ms -> 100ms -> 200ms) plus additive jitter (0 to +50%)
            delay_s = (_RETRY_BASE_DELAY_MS * (2 ** attempt)) / 1000
            jitter_s = random.uniform(0, delay_s * 0.5)
            time.sleep(delay_s + jitter_s)

The design points.

Retry only on TransactionConflict. Since ConditionalCheckFailed directly relates to the semantics of idempotency collision and balance conditions, propagate it to the caller immediately. Not mixing "failures fixed by retry" and "failures not fixed" protects both latency and cost.
Exponential backoff + jitter. The wait doubles as 50ms × 2^attempt, and up to 50% of random jitter is added on top. Without jitter, contended requests all retry at the same intervals and collide again, causing a thundering herd.
SSoT the reason-code judgment. Don't scatter the string judgment of "TransactionConflict" / "ConditionalCheckFailed" everywhere; consolidate it in a shared module (any_condition_failed / is_transaction_conflict). Fix one place and it reflects to all Lambdas — the practice of DRY.
Contention is observable. On every contention, fire the EcoPay/Payments::TransactionConflict metric, and a CloudWatch alarm (alerts on more than 5 in 5 minutes) detects a surge in concurrent contention. Don't swallow it with retries to "make it invisible"; the point is to absorb while measuring.
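
To make the backoff schedule tangible, the delay calculation can be pulled out into a function with an injectable random source. `backoff_delay_s` and the `rng` parameter are hypothetical; the formula itself is the one used in the retry helper above.

```python
import random


def backoff_delay_s(attempt: int, base_ms: int = 50, rng=random.uniform) -> float:
    """Delay in seconds before the retry that follows failed attempt `attempt`."""
    delay_s = (base_ms * (2 ** attempt)) / 1000
    return delay_s + rng(0, delay_s * 0.5)   # additive jitter, 0 to +50%


# (minimum, maximum) delay for each of the three retries, obtained by
# pinning the random source to the low and high end of its range.
bounds = [
    (
        backoff_delay_s(a, rng=lambda low, high: low),
        backoff_delay_s(a, rng=lambda low, high: high),
    )
    for a in range(3)
]
```

The three retries wait roughly 50 to 75 ms, 100 to 150 ms, and 200 to 300 ms, so even in the worst case the helper adds about half a second of sleep before giving up.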

4. Zero-downtime migration: swap the engine while running

The problem

The initial data model had one customer's balance, points, J-Credit, and profile all living together in one huge record (the so-called God Record). Concurrent updates were prone to write conflicts, and concerns were not separated.

I wanted to migrate this to a new schema split by concern, but production payments cannot be stopped for even one second. Putting up a maintenance screen late at night and running a bulk batch migration was off the table from the start: a bulk migration carries too high a risk of data inconsistency mid-migration and of a difficult rollback on failure.

The solution: staged migration by dual writes (mirror writes)

What I adopted was a staged migration applying the idea of Expand / Migrate / Contract to payment data.

Dual write (Expand): apply new writes to both the old schema and the new schema
Re-read / dedup (Migrate): switch reads to prefer the new schema, and dedup and reconcile while old and new data are mixed. In parallel, backfill old data into the new schema
Remove old data (Contract): confirm that everything is on the new schema, then stop writing to the old one

I broke this into 13+ phases, designed so that stopping after any single phase leaves production intact. Examples are "split the God Record into profile and balance (12-3)" and "split BALANCE#POINT/BALANCE#JCREDIT from the aggregate value METRICS (12-4)."
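
Collapsed to its essentials, the routing logic behind Expand / Migrate / Contract can be sketched as a phase switch. The four-value `Phase` enum and both functions are hypothetical simplifications; the real migration used 13+ finer-grained phases.

```python
from enum import IntEnum


class Phase(IntEnum):
    LEGACY = 0      # old schema only
    EXPAND = 1      # dual write, reads still served from the old schema
    MIGRATE = 2     # dual write, reads served from the new schema
    CONTRACT = 3    # new schema only


def write_targets(phase: Phase) -> list:
    """Schemas that a new write must be applied to in this phase."""
    targets = []
    if phase <= Phase.MIGRATE:
        targets.append("old")
    if phase >= Phase.EXPAND:
        targets.append("new")
    return targets


def read_source(phase: Phase) -> str:
    """Schema that reads are served from in this phase."""
    return "new" if phase >= Phase.MIGRATE else "old"
```

Every phase keeps at least one schema fully written and readable, which is what makes it safe to pause, or to step back one phase, at any point.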

I made all migration writes a form where a pure builder function returns an Update TransactItem.

# Pure function that builds the balance mirror write as data (simplified)
def build_balance_mirror_items(table, customer_id, *, point_delta=0, now) -> list[dict]:
    items: list[dict] = []
    if point_delta:
        items.append({
            "Update": {
                "TableName": table,
                "Key": {"uuid": {"S": customer_id}, "type": {"S": "BALANCE#POINT"}},
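                # ADD is an atomic, order-independent increment (not idempotent by itself):
                # as long as each delta is applied exactly once, the final state converges.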
                "UpdateExpression": "ADD balance :delta SET updated_at = :ua",
                "ExpressionAttributeValues": {
                    ":delta": {"N": str(point_delta)},
                    ":ua": {"N": str(now)},
                },
            }
        })
    return items

The reason this approach is safe is the crux of migration.

ADD is atomic and commutative, but not idempotent. Applying "add +100 to the balance" twice adds 200, so the idempotency has to come from the design around it: a new dual write is ADD'd exactly once, and the backfill transfers only unprocessed records, each exactly once. Because ADD is atomic and order-independent, the final balance converges to the same value even if dual writes and backfill interleave in time.
Protect existing values with if_not_exists. In the profile backfill, write like created_at = if_not_exists(created_at, :ca) so the backfill doesn't overwrite an already-existing value.
Distinguish deletion intent from None. "Don't update this attribute (None)" and "delete this attribute" are different. To express the latter, prepare a dedicated CLEAR sentinel and make deletion explicit like phone_number=CLEAR. It eliminates the ambiguity of deleting with None.
Return an empty array if no write is needed. When there's nothing to change, the builder returns [] and does no wasteful write (no-op).
Dedup on the read side. During the migration period, both old-format and new-format transaction history can exist, so on read, dedup by the identity of (PK, SK, timestamp, type) so the migration looks transparent to the user.
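
The if_not_exists, CLEAR, and empty-array points can be combined in one builder. The sketch below is an illustration only: `build_profile_backfill`, the `_Clear` class, and the "PROFILE" sort key are assumptions, not the production names.

```python
from typing import Optional, Union


class _Clear:
    """Sentinel type: 'remove this attribute', as opposed to None = 'leave it'."""


CLEAR = _Clear()


def build_profile_backfill(
    table: str,
    customer_id: str,
    *,
    created_at: Optional[int] = None,
    phone_number: Union[str, _Clear, None] = None,
) -> list:
    set_parts, remove_parts, values = [], [], {}
    if created_at is not None:
        # Never overwrite a value that is already stored.
        set_parts.append("created_at = if_not_exists(created_at, :ca)")
        values[":ca"] = {"N": str(created_at)}
    if phone_number is CLEAR:
        remove_parts.append("phone_number")
    elif phone_number is not None:
        set_parts.append("phone_number = :pn")
        values[":pn"] = {"S": phone_number}
    if not set_parts and not remove_parts:
        return []  # nothing to change: emit no write at all
    clauses = []
    if set_parts:
        clauses.append("SET " + ", ".join(set_parts))
    if remove_parts:
        clauses.append("REMOVE " + ", ".join(remove_parts))
    update = {
        "TableName": table,
        "Key": {"uuid": {"S": customer_id}, "type": {"S": "PROFILE"}},
        "UpdateExpression": " ".join(clauses),
    }
    if values:  # DynamoDB rejects an empty ExpressionAttributeValues map
        update["ExpressionAttributeValues"] = values
    return [{"Update": update}]
```

Passing `phone_number=None` leaves the attribute untouched, while `phone_number=CLEAR` produces an explicit REMOVE, so the two intents can no longer be confused.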

As a result, I completed all 13+ phases of the migration from the God Record to a concern-separated schema without stopping live payments for even one second.

A cross-cutting foundation: testability, type safety, observability

Three designs are common to the code so far. These weren't things to "add later" but preconditions that make reliability hold.

Testability (pure functions). The idempotency marker, balance update, and migration builders are all pure functions with no DB I/O. Since you only need to verify "what TransactItem it returns" for the input, you can test fast — including boundary conditions — without starting DynamoDB or assembling mocks. Furthermore, I fix the wire format of SK, conditional expressions, and TTL with golden vectors, stopping the accident of "the storage format changing unnoticed" with regression tests.
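
A golden-vector test in this spirit might look like the following. The two builders are hypothetical stand-ins written for this example (the SK format comes from the record table in the premise section, and the "payment" operation name is made up); the point is that the expected values are literal and never computed by the code under test.

```python
def build_idem_sk(operation: str, key: str) -> str:
    """Sort key of an idempotency marker."""
    return f"card_op_idem#{operation}#{key}"


def build_ttl(timestamp: int, ttl_seconds: int = 90 * 86400) -> int:
    """Epoch second at which DynamoDB TTL may delete the marker."""
    return timestamp + ttl_seconds


# Golden vectors: fixed inputs paired with the exact stored form they must produce.
GOLDEN_SK = [
    (("topup", "key-1"), "card_op_idem#topup#key-1"),
    (("payment", "0b6f"), "card_op_idem#payment#0b6f"),
]


def test_wire_format_is_stable() -> None:
    for args, expected in GOLDEN_SK:
        assert build_idem_sk(*args) == expected
    # 1_700_000_000 + 90 days (7_776_000 s)
    assert build_ttl(1_700_000_000) == 1_707_776_000
```

If someone changes the separator or the default TTL, this test fails before the change reaches a table that already holds markers in the old format.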

Type safety (mypy strict). The shared Layer enforces function type annotations in mypy strict mode with disallow_untyped_defs. By eliminating Any from the payment logic, I can detect oversights at refactor time at a compile-equivalent stage. Types are the best — and the cheapest — test.

Observability (structured logs + metrics). I continuously observe the idempotency-key absence rate, replays, and transaction conflicts with structured logs and CloudWatch metrics. What matters is not leaving PII in logs. Mask emails and phone numbers, and never output PINs, tokens, passwords, and the like to logs. Observability and personal-information protection need to coexist.
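
A minimal sketch of a log sanitizer along these lines follows. The function names, the field names, and the masking formats (first character of the email's local part, last four digits of the phone number) are assumptions chosen for illustration.

```python
import re


def mask_email(email: str) -> str:
    """Keep the first character of the local part and the domain."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        return "***"
    return f"{local[0]}***@{domain}"


def mask_phone(phone: str) -> str:
    """Keep only the last four digits."""
    digits = re.sub(r"\D", "", phone)
    return "***" + digits[-4:] if len(digits) >= 4 else "***"


SECRET_KEYS = frozenset({"pin", "token", "password"})


def sanitize(record: dict) -> dict:
    """Return a copy that is safe to hand to a structured logger."""
    safe = {}
    for key, value in record.items():
        if key in SECRET_KEYS:
            continue                      # never logged, not even masked
        if key == "email":
            value = mask_email(value)
        elif key == "phone_number":
            value = mask_phone(value)
        safe[key] = value
    return safe
```

Routing every log call through one sanitizer keeps the rule in a single place, in the same way the reason-code helpers do for retries.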

The principles running through the design

Finally, let me organize the principles running through this reliability layer.

Principle	How it appears in this foundation
Protect correctness by structure	idempotency via `attribute_not_exists`, consistency via `ADD` + conditional expression, atomicity delegated to `TransactWriteItems`
SSoT / DRY	payment logic consolidated in the shared Layer. Reason-code judgment in one place too
SRP	separate "judgment (pure function)" and "execution (I/O)." The builder returns only data
Classify failure	distinguish retriable (CONFLICT/THROTTLED) and non-retriable (INSUFFICIENT/CAP_EXCEEDED) by type
Cost efficiency	idempotency markers auto-expire with TTL. Follow demand with serverless
Observability	not just absorb but measure. Judge by metrics
Non-stop evolution (ETC)	swap without stopping via Expand→Migrate→Contract

Payment reliability isn't a flashy feature. "Nothing happened" itself is the achievement. No double charges happen, the balance doesn't break, no one notices even during migration — guaranteeing that by design, not by chance, is the duty of a system entrusted with actual money, I believe. In fact, I maintain 0 double charges / balance inconsistencies in production operation.

If you're troubled by the same kind of "reliability design of a foundation handling actual money/points," "idempotent, atomic data consistency in serverless," or "schema migration that doesn't stop production," I can help from the design stage.
