A bug in a payment system either takes money straight out of a user's wallet or costs the business money. "Roughly works" doesn't cut it: the amount can never be wrong, not by one yen, not even once.
I worked as the core engineer on a three-developer team building a multi-tenant payment platform for the environmental, carbon-credit, and local-currency space, on an AWS serverless foundation. I implemented across the frontends and backends (4 of each), the shared foundation, and the infrastructure for all 4 surfaces (customer, merchant, admin, and storefront terminal), and authored about 60% of the repository's commits. This article focuses on the part I designed and led: the payment-reliability layer. The platform handles real money, points, J-Credits (carbon credits), and local currency, so double charges and balance inconsistencies had to be zero, and the data model had to keep evolving without stopping production for even one second. In production operation, double charges and balance inconsistencies have in fact stayed at 0.
In this article, I explain the design adopted in that reliability layer, based on the patterns of code actually in operation. There are 4 broad themes.
- Idempotency — converge a charge to once even if retries come
- Atomicity — don't break the balance even under contention
- Retry — absorb only "failures that may be retried"
- Zero-downtime migration — swap the data model while running
There's one philosophy running through it all. Guarantee correctness by "the structure of code," not "operational carefulness." Correctness protected by review or procedure documents breaks someday. Correctness protected by types, conditional expressions, and transactions doesn't break.
Note: the code in the text is simplified by extracting the gist to convey the design intent. Tenant names, table names, specific values, etc., are abstracted.
Premise: why "serverless × single table"
The foundation is a serverless configuration of AWS Lambda + DynamoDB. Because the payment logic is called commonly from multiple Lambdas — the customer app, merchant, admin, and storefront terminal — the business logic is consolidated into a shared Lambda Layer as the single source of truth (SSoT).
The reason I chose DynamoDB is that, instead of relational transactions, you get the 2 primitives actually needed for payments.
- Atomic numeric increment/decrement with ADD: make "read, add, write back" one instruction
- Conditional write with ConditionExpression: judge "subtract if the balance is enough" atomically on the DB side
Combine these two with TransactWriteItems (commit up to 100 items all-or-nothing) and you can delegate the "correctness" of payments not to application-code locks but to the DB's consistency guarantee. This is the foundation of the entire reliability layer.
The data is a single-table design (PK=uuid, SK=type), separating records by concern.
| Record (SK) | Role |
|---|---|
| BALANCE#ECOPAY | the main balance |
| BALANCE#POINT / BALANCE#JCREDIT | point / carbon-credit balance |
| BALANCE#REGIONAL#{key} | local-currency balance |
| METRICS | aggregate values like CO2 reduction |
| card_op_idem#<operation>#<key> | idempotency marker (with TTL) |
This design of separating SK by concern works in both the "atomicity" and "zero-downtime migration" described later.
1. Idempotency: prevent double charges by "design"
The problem
In a mobile-app payment, there's no guarantee anywhere that the request arrives once. A 3G/4G line timeout, an API Gateway retry, a Lambda re-execution — all of these generate "re-sends of the same payment request."
The key is to accept that a retry is not an anomaly but the normal path. The goal is not to stop retries but to make the charge converge to exactly once, no matter how many times the request arrives.
The solution: client-issued key + attribute_not_exists + TTL
The client includes idempotencyKey (a UUID) in the payment request's body. The server side concatenates this into the sort key of the idempotency marker and inserts it with the attribute_not_exists condition.
# Pure function that builds the idempotency marker as data (simplified).
# It performs no DB I/O; it only returns the dict for one TransactItem.
def build_idempotency_item(
    *,
    table: str,
    uuid: str,
    sk: str,           # e.g. "card_op_idem#topup#<key>"
    ttl_seconds: int,  # default: 90 days
    timestamp: int,
    extra: dict | None = None,
) -> dict:
    item = {
        "uuid": {"S": uuid},
        "type": {"S": sk},
        "processed_at": {"N": str(timestamp)},
        "ttl": {"N": str(timestamp + ttl_seconds)},
        **(extra or {}),
    }
    return {
        "Put": {
            "TableName": table,
            "Item": item,
            # If the same (uuid, type) already exists the insert fails,
            # which blocks a second execution.
            "ConditionExpression": "attribute_not_exists(#uuid)",
            "ExpressionAttributeNames": {"#uuid": "uuid"},
        }
    }
There are 3 design points.
- It's a pure function. It never touches DynamoDB and returns only the data describing how to write the marker, so it can be unit-tested without starting a DB, and breaking changes to the SK format or TTL calculation can be caught with golden-vector tests.
- It auto-expires with TTL. The marker is auto-deleted by DynamoDB TTL after a default 90 days. Holding records solely for idempotency forever makes storage cost and table size swell linearly. It's a design to balance "correctness" and "cost efficiency."
- Put the marker insertion in the same transaction as the payment body. This is the crux. Bundle the idempotency marker's Put and the balance ADD into a single TransactWriteItems. A second request with the same key gets ConditionalCheckFailed on the marker insertion, and since the entire transaction is canceled atomically, the balance doesn't move at all.
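The third point can be sketched as follows. This is a minimal illustration, not the production code: the function name build_topup_transaction, the "card" sort key for the balance item, and the sample values are assumptions made for the example. The marker Put and the balance ADD are returned as one list, meant to be passed to a single TransactWriteItems call.

```python
from decimal import Decimal


def build_topup_transaction(
    *, table: str, card_id: str, idempotency_key: str, amount: Decimal, now: int
) -> list:
    """Return the TransactItems for one top-up: marker Put + balance ADD."""
    marker = {
        "Put": {
            "TableName": table,
            "Item": {
                "uuid": {"S": card_id},
                # SK layout follows the article's table; the values are examples.
                "type": {"S": f"card_op_idem#topup#{idempotency_key}"},
                "processed_at": {"N": str(now)},
                "ttl": {"N": str(now + 90 * 86400)},
            },
            # Fails if this (uuid, type) was already written, which cancels
            # the whole transaction, including the balance update below.
            "ConditionExpression": "attribute_not_exists(#uuid)",
            "ExpressionAttributeNames": {"#uuid": "uuid"},
        }
    }
    credit = {
        "Update": {
            "TableName": table,
            # Assumed key of the balance item (hypothetical for this sketch).
            "Key": {"uuid": {"S": card_id}, "type": {"S": "card"}},
            "UpdateExpression": "ADD balance :delta SET updated_at = :ua",
            "ExpressionAttributeValues": {
                ":delta": {"N": str(amount)},
                ":ua": {"N": str(now)},
            },
        }
    }
    return [marker, credit]


items = build_topup_transaction(
    table="payments", card_id="card-1", idempotency_key="k-123",
    amount=Decimal("500"), now=1_700_000_000,
)
```

With a boto3 DynamoDB client, the list would be sent as client.transact_write_items(TransactItems=items). The two entries target different items (different sort keys), which a single transaction requires.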
Webhook idempotency: if it fails, "don't leave a marker either"
For a webhook from an external payment (Stripe), the same principle applies, but there's a trap here.
The common implementation is "write the event ID's marker before running the handler," but with this, when the handler dies midway, only the marker remains. Even if Stripe correctly re-sends, it's misjudged as "processed" and the payment is silently lost.
So include the event ID's marker in the same transaction as the handler's side effect.
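# The marker keyed by the Stripe event ID is also returned as a TransactItem.
# If the handler fails, no marker is written, so Stripe's re-send is
# processed correctly. deduct_item and history_item are built elsewhere.
idem_item = build_event_idempotency_transact_item(
    table=table,
    event_id=event["id"],
    # Stripe re-sends for up to about 3 days; 30 days is a conservative
    # setting for audit purposes.
    ttl_seconds=30 * 86400,
    timestamp=now,
)
client.transact_write_items(
    TransactItems=[idem_item, deduct_item, history_item],  # all-or-nothing
)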
Because the marker and the side effect share their fate, the limbo state of "the marker remained but the processing failed" structurally cannot exist. The webhook idempotency period is short at 30 days (Stripe's re-send limit of 3 days + audit margin), optimizing cost here too.
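To make the fate-sharing argument concrete, here is a toy in-memory model. It does not call DynamoDB or Stripe; the names Canceled, transact, and deliver are hypothetical, and the store dict merely stands in for the table. It shows a failed first delivery leaving no marker, the re-send being processed, and a later duplicate being rejected.

```python
class Canceled(Exception):
    """Stands in for a canceled TransactWriteItems call in this toy model."""


def transact(store: dict, marker_key: str, side_effect_ok: bool) -> None:
    """All-or-nothing: write marker and side effect together, or neither."""
    if marker_key in store["markers"]:
        raise Canceled("ConditionalCheckFailed")  # duplicate delivery
    if not side_effect_ok:
        raise Canceled("side effect rejected")    # nothing is written
    store["markers"].add(marker_key)
    store["applied"] += 1


def deliver(store: dict, event_id: str, side_effect_ok: bool = True) -> str:
    """Handle one webhook delivery and report what happened."""
    try:
        transact(store, event_id, side_effect_ok)
        return "processed"
    except Canceled as exc:
        return "duplicate" if "ConditionalCheckFailed" in str(exc) else "failed"


store = {"markers": set(), "applied": 0}
results = [
    deliver(store, "evt_1", side_effect_ok=False),  # first attempt fails
    deliver(store, "evt_1"),                        # the provider's re-send
    deliver(store, "evt_1"),                        # a further duplicate
]
```

If the marker were written before the side effect instead, the first failed delivery would leave a marker behind and the re-send would be misjudged as a duplicate.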
Staged rollout and observability
Making the idempotency key mandatory all at once breaks existing clients. So I migrated:
- Optional phase: use the key if present. On absence, fire the IdempotencyKeyMissing metric
- Monitor the production absence rate with CloudWatch: confirm with numbers that it converged to 0
- Mandatory phase: reject key absence with HTTP 400
I also observe replays (re-arrival of the same key) with the IdempotencyReplay metric, enabling detection of abnormal re-send patterns. It's the idea of judging "when it's OK to make it mandatory" not by intuition but by metrics.
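The optional and mandatory phases can be sketched as one function with a switch. This is an illustration under assumptions: resolve_idempotency_key, MissingIdempotencyKey, and the emit_metric callback are hypothetical names, and the real metric emission would go to CloudWatch rather than a list.

```python
from typing import Callable, Optional


class MissingIdempotencyKey(ValueError):
    """Raised in the mandatory phase; the API layer would map it to HTTP 400."""


def resolve_idempotency_key(
    body: dict, *, required: bool, emit_metric: Callable[[str], None]
) -> Optional[str]:
    """Return the client's key, counting every request that arrives without one."""
    key = body.get("idempotencyKey")
    if key:
        return key
    # Counted in both phases so the absence rate stays visible.
    emit_metric("IdempotencyKeyMissing")
    if required:
        raise MissingIdempotencyKey("idempotencyKey is required")
    return None  # optional phase: proceed without a marker


emitted = []
with_key = resolve_idempotency_key(
    {"idempotencyKey": "k-1"}, required=False, emit_metric=emitted.append
)
without_key = resolve_idempotency_key({}, required=False, emit_metric=emitted.append)

try:
    resolve_idempotency_key({}, required=True, emit_metric=emitted.append)
    rejected = False
except MissingIdempotencyKey:
    rejected = True
```

Flipping required to True is then a one-line change made only after the metric has stayed at 0.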
2. Atomicity: don't break the balance even under contention
The problem
Payment, charge, and refund can fly to the same card and same customer simultaneously. The naive implementation is this.
# Anti-pattern: read-modify-write creates a race
balance = read_balance(card_id)           # ① read the balance
if balance < amount:                      # ② check
    raise InsufficientError
write_balance(card_id, balance - amount)  # ③ write back
If another transaction cuts in between ① and ③, the check runs against a stale balance and the balance can go negative. An app-side lock would prevent that, but a lock is a new single point of failure and a breeding ground for deadlocks.
The solution: erase read-modify-write with ADD + ConditionExpression
In DynamoDB, you can fold "read and subtract" into one conditional Update.
from decimal import Decimal


# Pure function that builds the balance update as data (simplified).
def build_deduct(table: str, card_id: str, amount: Decimal, now: int) -> dict:
    return {
        "Update": {
            "TableName": table,
            "Key": {"uuid": {"S": card_id}, "type": {"S": "card"}},
            # ADD is atomic: there is no read-then-write race.
            "UpdateExpression": "ADD balance :delta SET updated_at = :ua",
            # Committed only when the balance is sufficient.
            # The check is done atomically on the DB side.
            "ConditionExpression": "balance >= :amount",
            "ExpressionAttributeValues": {
                ":delta": {"N": str(-amount)},
                ":amount": {"N": str(amount)},
                ":ua": {"N": str(now)},
            },
        }
    }
The "read the balance" step disappeared. DynamoDB does the conditional judgment atomically at commit time, so no matter how much it runs in parallel, the balance never goes negative.
An "adding side" constraint such as a charge cap can be expressed with the same idea. For example, to guarantee balance + amount ≤ cap on a card charge, rely on the fact that the condition is evaluated against the item's state before the ADD is applied, and write it like this.
ADD balance :delta # :delta = charge amount + bonus
CONDITION balance <= :max_allowed # :max_allowed = cap − charge amount
The ADD is applied only when balance <= cap − amount holds, so balance + amount <= cap is guaranteed atomically. Note that in this form the cap bounds the charge amount only; the bonus included in :delta is added on top of it.
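Written as a builder in the same style as the deduct function, the capped charge might look like the sketch below. The name build_capped_topup, its parameters, and the sample amounts are assumptions for illustration; only the ADD expression and the balance <= :max_allowed condition come from the text above.

```python
from decimal import Decimal


def build_capped_topup(
    table: str, card_id: str, *, amount: Decimal, bonus: Decimal,
    cap: Decimal, now: int,
) -> dict:
    """Build a top-up Update that commits only if balance + amount <= cap."""
    return {
        "Update": {
            "TableName": table,
            "Key": {"uuid": {"S": card_id}, "type": {"S": "card"}},
            "UpdateExpression": "ADD balance :delta SET updated_at = :ua",
            # Evaluated against the stored balance before the ADD is applied.
            "ConditionExpression": "balance <= :max_allowed",
            "ExpressionAttributeValues": {
                ":delta": {"N": str(amount + bonus)},
                ":max_allowed": {"N": str(cap - amount)},
                ":ua": {"N": str(now)},
            },
        }
    }


item = build_capped_topup(
    "payments", "card-1",
    amount=Decimal("1000"), bonus=Decimal("50"), cap=Decimal("20000"),
    now=1_700_000_000,
)
values = item["Update"]["ExpressionAttributeValues"]
```

Computing :max_allowed on the application side keeps the condition a plain comparison against one stored attribute.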
Classify failure by "type"
When a conditional write fails, the reason isn't one. Insufficient balance, cap exceeded, or just contention — these are different in the action the caller should take, so don't swallow them vaguely. Classify the failure cause into a typed enum.
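from enum import StrEnum  # Python 3.11+


class BalanceTransactFailure(StrEnum):
    INSUFFICIENT = "insufficient"  # insufficient balance -> HTTP 400 (retrying is pointless)
    CAP_EXCEEDED = "cap_exceeded"  # cap exceeded -> HTTP 400
    CONFLICT = "conflict"          # pure contention -> retryable
    THROTTLED = "throttled"        # capacity exceeded -> 503 / retryable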
Treating INSUFFICIENT and CONFLICT as the same "failure" wastes latency and cost by uselessly retrying an insufficient-balance request. Classifying the cause lets you separate failures that should be retried from those that shouldn't (connecting to the next chapter).
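One way to do this classification is to read the cancellation reasons that DynamoDB returns for a canceled transaction, which are listed in the same order as the submitted TransactItems. The sketch below is an assumption about how that mapping could look: classify_cancellation and the condition_meaning argument are hypothetical, and the enum uses the str/Enum mixin so the example also runs on Python versions before 3.11.

```python
from enum import Enum
from typing import Dict, List, Optional


class BalanceTransactFailure(str, Enum):
    INSUFFICIENT = "insufficient"
    CAP_EXCEEDED = "cap_exceeded"
    CONFLICT = "conflict"
    THROTTLED = "throttled"


_THROTTLE_CODES = {"ThrottlingError", "ProvisionedThroughputExceeded"}


def classify_cancellation(
    reasons: List[dict],
    condition_meaning: Dict[int, BalanceTransactFailure],
) -> Optional[BalanceTransactFailure]:
    """Map per-item cancellation reasons to a typed failure.

    condition_meaning says what a failed condition means for the item at a
    given index (the caller knows which item it built as a deduct or top-up).
    """
    codes = [reason.get("Code") for reason in reasons]
    for index, code in enumerate(codes):
        if code == "ConditionalCheckFailed" and index in condition_meaning:
            return condition_meaning[index]
    if "TransactionConflict" in codes:
        return BalanceTransactFailure.CONFLICT
    if _THROTTLE_CODES.intersection(codes):
        return BalanceTransactFailure.THROTTLED
    return None  # e.g. an idempotency replay, handled by the caller


insufficient = classify_cancellation(
    [{"Code": "None"}, {"Code": "ConditionalCheckFailed"}],
    {1: BalanceTransactFailure.INSUFFICIENT},
)
conflict = classify_cancellation(
    [{"Code": "TransactionConflict"}, {"Code": "None"}],
    {1: BalanceTransactFailure.INSUFFICIENT},
)
```

With boto3, the reasons list is available on the caught exception as exc.response["CancellationReasons"].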
All amounts are Decimal. Don't accept float
Amounts and the CO2 conversion (the conversion rate is Decimal('0.01')) are all unified as Decimal. float is binary floating point, a world where 0.1 + 0.2 != 0.3. Allow it in payments and rounding errors accumulate with every transaction.
So at the stage of the low-level function that converts a value to a DynamoDB attribute, I reject float at runtime. By rejecting float both statically with type annotations (mypy strict) and dynamically in the conversion function, I doubly prevent the accident of "float accidentally mixing in."
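A runtime guard of this kind can be very small. The following is a sketch, not the actual conversion function: to_dynamo_number and rejects are hypothetical names, and the choice to reject bool and non-finite Decimals is an extra assumption on top of the float rule described above.

```python
from decimal import Decimal


def to_dynamo_number(value) -> dict:
    """Convert an int or Decimal to a DynamoDB number attribute; reject float."""
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, Decimal)):
        raise TypeError(
            f"amounts must be int or Decimal, got {type(value).__name__}"
        )
    number = Decimal(value)
    if not number.is_finite():
        raise ValueError("amounts must be finite")
    # Fixed-point formatting avoids exponent notation such as '1E+3'.
    return {"N": format(number, "f")}


def rejects(value) -> bool:
    """Return True if the converter refuses the value."""
    try:
        to_dynamo_number(value)
    except (TypeError, ValueError):
        return True
    return False
```

Because every builder goes through one converter, a float that slips past the type checker still fails loudly before anything is written.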
3. Retry: absorb only failures that may be retried
TransactWriteItems can be temporarily canceled with TransactionConflict by optimistic concurrency control. Since this is just "they happened to try to write at the same time," waiting a bit and retrying succeeds.
On the other hand, ConditionalCheckFailed (insufficient balance, idempotency collision, cap exceeded) doesn't change the result no matter how many times you retry. Rather, retrying is harmful.
I implemented a retry helper that distinguishes these two.
import random
import time

_RETRY_MAX_ATTEMPTS = 3    # number of retries after the first attempt
_RETRY_BASE_DELAY_MS = 50


def transact_write_with_retry(client, transact_items, *, max_retries=_RETRY_MAX_ATTEMPTS):
    # any_condition_failed, is_transaction_conflict and
    # _emit_transaction_conflict_metric live in the shared module.
    for attempt in range(max_retries + 1):
        try:
            client.transact_write_items(TransactItems=transact_items)
            return
        except client.exceptions.TransactionCanceledException as exc:
            # ConditionalCheckFailed is a semantic failure: propagate at once, never retry.
            if any_condition_failed(exc):
                raise
            # Cancellations other than TransactionConflict are not retried either.
            if not is_transaction_conflict(exc):
                raise
            _emit_transaction_conflict_metric()  # fires a CloudWatch metric
            if attempt == max_retries:
                raise
            # Exponential backoff (50ms -> 100ms -> 200ms) with +/-50% jitter.
            delay_s = (_RETRY_BASE_DELAY_MS * (2 ** attempt)) / 1000
            jitter_s = random.uniform(-0.5, 0.5) * delay_s
            time.sleep(delay_s + jitter_s)
The design points.
- Retry only on TransactionConflict. Since ConditionalCheckFailed is tied directly to the semantics of idempotency collisions and balance conditions, propagate it to the caller immediately. Keeping "failures fixed by retry" apart from "failures not fixed by retry" protects both latency and cost.
- Exponential backoff + jitter. Double the wait time with 50ms × 2^attempt, and add ±50% jitter on top. Without jitter, multiple contended requests retry all at once at the same interval, causing a thundering herd.
- SSoT the reason-code judgment. Don't scatter string checks for "TransactionConflict" / "ConditionalCheckFailed" everywhere; consolidate them in a shared module (any_condition_failed / is_transaction_conflict). Fix one place and the change reaches all Lambdas, which is DRY in practice.
- Contention is observable. On every conflict, fire the EcoPay/Payments::TransactionConflict metric, and a CloudWatch alarm (alerts on more than 5 in 5 minutes) detects a surge in concurrent contention. Don't let retries swallow conflicts and make them invisible; the point is to absorb them while measuring them.
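The backoff arithmetic from the second point can be isolated into a small function, which also makes it testable without sleeping. This is a sketch under assumptions: backoff_delay_s and its rng parameter are hypothetical, with rng injected so a test can pin the jitter.

```python
import random


def backoff_delay_s(
    attempt: int, *, base_ms: int = 50, jitter: float = 0.5, rng=random.uniform
) -> float:
    """Delay before retry number `attempt` (0-based), in seconds.

    The nominal delay doubles per attempt and is then scaled by a random
    factor in [1 - jitter, 1 + jitter] so concurrent retries spread out.
    """
    nominal_s = base_ms * (2 ** attempt) / 1000
    return nominal_s * (1 + rng(-jitter, jitter))


# With the jitter pinned to zero, the nominal schedule is visible.
nominal = [backoff_delay_s(n, rng=lambda low, high: 0.0) for n in range(3)]

# With real randomness, each delay stays within +/-50% of nominal.
samples = [backoff_delay_s(0) for _ in range(1000)]
```

Two requests that collided once are unlikely to draw the same factor, so their next attempts rarely line up again.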
4. Zero-downtime migration: swap the engine while running
The problem
The initial data model had one customer's balance, points, J-Credit, and profile all cohabiting in one huge record (the so-called God Record). Write conflicts on concurrent updates were likely, and separation of concerns wasn't done.
I wanted to migrate this to a "new schema split by concern." But — production payments can't be stopped for even one second. The option of putting up a maintenance screen late at night and doing a bulk batch migration was off the table from the start. A bulk migration has too high a risk of data inconsistency during migration and of rollback on failure.
The solution: staged migration by dual writes (mirror writes)
What I adopted was a staged migration applying the idea of Expand / Migrate / Contract to payment data.
- Dual write (Expand) — reflect new writes to both the old schema and the new schema
- Re-read / dedup (Migrate) — switch reads to be new-schema-centric, and during the period where old/new are mixed, dedup and reconcile. In parallel, backfill old data into the new schema
- Remove old data (Contract) — confirm unification to the new schema and stop writes to the old schema
I broke this into 13+ phases, designing it so that each phase alone, no matter when stopped, doesn't break production. For example, "split the God Record into profile and balance (12-3)," "split BALANCE#POINT/BALANCE#JCREDIT from the aggregate value METRICS (12-4)," and so on.
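For the Expand step, a dual write can be expressed as two updates returned together, so that one transaction moves the old and the new record in lockstep. The sketch below is illustrative only: build_dual_write_items, the legacy sort key "customer", and the legacy attribute point_balance are assumptions, since the old schema's actual layout is not shown in this article.

```python
def build_dual_write_items(
    table: str, customer_id: str, *, point_delta: int, now: int
) -> list:
    """Return updates for both the legacy record and the new BALANCE#POINT record."""
    if not point_delta:
        return []  # nothing to change, so no write at all
    values = {":delta": {"N": str(point_delta)}, ":ua": {"N": str(now)}}
    legacy = {
        "Update": {
            "TableName": table,
            # Hypothetical key and attribute name of the old combined record.
            "Key": {"uuid": {"S": customer_id}, "type": {"S": "customer"}},
            "UpdateExpression": "ADD point_balance :delta SET updated_at = :ua",
            "ExpressionAttributeValues": dict(values),
        }
    }
    mirror = {
        "Update": {
            "TableName": table,
            "Key": {"uuid": {"S": customer_id}, "type": {"S": "BALANCE#POINT"}},
            "UpdateExpression": "ADD balance :delta SET updated_at = :ua",
            "ExpressionAttributeValues": dict(values),
        }
    }
    return [legacy, mirror]


dual = build_dual_write_items("payments", "cust-1", point_delta=100, now=1_700_000_000)
```

Sending both updates in one TransactWriteItems call means the two schemas cannot drift apart because of a partial failure; dropping the legacy entry later is the Contract step.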
I made all migration writes a form where a pure builder function returns an Update TransactItem.
# Pure function that builds the balance mirror write as data (simplified).
def build_balance_mirror_items(table, customer_id, *, point_delta=0, now) -> list[dict]:
    items: list[dict] = []
    if point_delta:
        items.append({
            "Update": {
                "TableName": table,
                "Key": {"uuid": {"S": customer_id}, "type": {"S": "BALANCE#POINT"}},
                # ADD is an atomic increment, not an idempotent one. The caller
                # must apply each delta exactly once (the backfill transfers
                # only unprocessed records); then the final state converges.
                "UpdateExpression": "ADD balance :delta SET updated_at = :ua",
                "ExpressionAttributeValues": {
                    ":delta": {"N": str(point_delta)},
                    ":ua": {"N": str(now)},
                },
            }
        })
    return items
The reason this approach is safe is the crux of migration.
- ADD is atomic, but it is not idempotent by itself: run "add +100 to the balance" twice and the balance rises by 200. The idempotency comes from the surrounding design. A new dual write is ADD'd exactly once, and the backfill is designed to transfer only unprocessed records, once each. Because ADD is atomic and increments commute, the final balance converges to the same value even if dual writes and backfill interleave in time.
- Protect existing values with if_not_exists. In the profile backfill, write created_at = if_not_exists(created_at, :ca) so the backfill doesn't overwrite a value that already exists.
- Distinguish deletion intent from None. "Don't update this attribute" (None) and "delete this attribute" are different. To express the latter, prepare a dedicated CLEAR sentinel and make deletion explicit, as in phone_number=CLEAR. This removes the ambiguity of deleting with None.
- Return an empty array if no write is needed. When there's nothing to change, the builder returns [] and performs no wasteful write (no-op).
- Dedup on the read side. During the migration period, both old-format and new-format transaction history can exist, so on read, dedup by the identity (PK, SK, timestamp, type) so the migration is transparent to the user.
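The if_not_exists, CLEAR, and empty-list rules can be combined in one profile builder, sketched below. The names build_profile_update and _Clear, the "PROFILE" sort key, and the two example attributes are assumptions for illustration, not the real schema.

```python
class _Clear:
    """Sentinel type: 'delete this attribute', as opposed to None ('leave it')."""

    def __repr__(self) -> str:
        return "CLEAR"


CLEAR = _Clear()


def build_profile_update(
    table: str, customer_id: str, *, phone_number=None, created_at=None
) -> list:
    """Return [] when nothing changes, else one Update TransactItem."""
    sets, removes, names, values = [], [], {}, {}
    if phone_number is CLEAR:
        removes.append("#phone")
        names["#phone"] = "phone_number"
    elif phone_number is not None:
        sets.append("#phone = :phone")
        names["#phone"] = "phone_number"
        values[":phone"] = {"S": phone_number}
    if created_at is not None:
        # Never overwrite a value that already exists.
        sets.append("#ca = if_not_exists(#ca, :ca)")
        names["#ca"] = "created_at"
        values[":ca"] = {"N": str(created_at)}
    if not sets and not removes:
        return []
    clauses = []
    if sets:
        clauses.append("SET " + ", ".join(sets))
    if removes:
        clauses.append("REMOVE " + ", ".join(removes))
    update = {
        "TableName": table,
        # Hypothetical sort key for the split-out profile record.
        "Key": {"uuid": {"S": customer_id}, "type": {"S": "PROFILE"}},
        "UpdateExpression": " ".join(clauses),
        "ExpressionAttributeNames": names,
    }
    if values:  # DynamoDB rejects an empty ExpressionAttributeValues map
        update["ExpressionAttributeValues"] = values
    return [{"Update": update}]


noop = build_profile_update("payments", "cust-1")
cleared = build_profile_update("payments", "cust-1", phone_number=CLEAR)
backfilled = build_profile_update("payments", "cust-1", created_at=1_600_000_000)
```

Comparing with `is CLEAR` rather than equality keeps the sentinel from ever being confused with a real value.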
As a result, without stopping the running payments for even one second, I completed the migration of 13+ phases non-stop from the God Record to a concern-separated schema.
A cross-cutting foundation: testability, type safety, observability
Three designs are common to the code so far. These weren't things to "add later" but preconditions that make reliability hold.
Testability (pure functions). The idempotency marker, balance update, and migration builders are all pure functions with no DB I/O. Since you only need to verify "what TransactItem it returns" for the input, you can test fast — including boundary conditions — without starting DynamoDB or assembling mocks. Furthermore, I fix the wire format of SK, conditional expressions, and TTL with golden vectors, stopping the accident of "the storage format changing unnoticed" with regression tests.
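A golden-vector test in this sense pins known inputs to their exact expected wire values. The sketch below assumes two small helpers, marker_sk and marker_ttl, which are hypothetical stand-ins for the parts of the marker builder that produce the sort key and the TTL; the vectors themselves are example data.

```python
def marker_sk(operation: str, key: str) -> str:
    """Sort key of an idempotency marker, in the card_op_idem#<operation>#<key> format."""
    return f"card_op_idem#{operation}#{key}"


def marker_ttl(timestamp: int, ttl_seconds: int) -> str:
    """TTL attribute value as the string stored in the N field."""
    return str(timestamp + ttl_seconds)


# (operation, key, timestamp, ttl_seconds) -> (expected SK, expected TTL)
GOLDEN_VECTORS = [
    (("topup", "k-1", 1_700_000_000, 90 * 86400),
     ("card_op_idem#topup#k-1", "1707776000")),
    (("payment", "abc", 0, 60),
     ("card_op_idem#payment#abc", "60")),
]


def golden_mismatches() -> list:
    """Return (expected, actual) pairs for every vector that no longer matches."""
    mismatches = []
    for (operation, key, timestamp, ttl_seconds), expected in GOLDEN_VECTORS:
        actual = (marker_sk(operation, key), marker_ttl(timestamp, ttl_seconds))
        if actual != expected:
            mismatches.append((expected, actual))
    return mismatches
```

If someone changes the separator or the TTL arithmetic, the stored strings stop matching and the regression test fails before the new format reaches production data.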
Type safety (mypy strict). The shared Layer enforces function type annotations in mypy strict mode with disallow_untyped_defs. By eliminating Any from the payment logic, I can detect oversights at refactor time at a compile-equivalent stage. Types are the best — and the cheapest — test.
Observability (structured logs + metrics). I continuously observe the idempotency-key absence rate, replays, and transaction conflicts with structured logs and CloudWatch metrics. What matters is not leaving PII in logs. Mask emails and phone numbers, and never output PINs, tokens, passwords, and the like to logs. Observability and personal-information protection need to coexist.
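The masking rule can be enforced at the single point where log lines are built. The following is a sketch under assumptions: the field names (email, phone_number, pin, token, password), the masking shapes, and the helper names are all illustrative rather than the platform's actual logging code.

```python
import json

# Fields that must never reach the logs, even masked.
_NEVER_LOG = {"pin", "token", "password"}


def mask_email(value: str) -> str:
    """Keep the first character and the domain: taro@example.com -> t***@example.com."""
    local, _, domain = value.partition("@")
    if not domain:
        return "***"
    return f"{local[:1]}***@{domain}"


def mask_phone(value: str) -> str:
    """Keep only the last 4 digits."""
    digits = "".join(ch for ch in value if ch.isdigit())
    if len(digits) <= 4:
        return "*" * len(digits)
    return "*" * (len(digits) - 4) + digits[-4:]


def sanitize(fields: dict) -> dict:
    """Drop secrets and mask PII before anything is serialized."""
    clean = {}
    for key, value in fields.items():
        if key in _NEVER_LOG:
            continue
        if key == "email":
            value = mask_email(value)
        elif key == "phone_number":
            value = mask_phone(value)
        clean[key] = value
    return clean


def log_line(event: str, **fields) -> str:
    """Build one structured (JSON) log line."""
    return json.dumps({"event": event, **sanitize(fields)}, sort_keys=True)


line = log_line(
    "IdempotencyReplay",
    email="taro@example.com", phone_number="090-1234-5678",
    pin="1234", amount="500",
)
```

Routing every log call through one sanitizer keeps the rule structural: a handler cannot leak a PIN by forgetting to mask it.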
The principles running through the design
Finally, let me organize the principles running through this reliability layer.
| Principle | How it appears in this foundation |
|---|---|
| Protect correctness by structure | idempotency via attribute_not_exists, consistency via ADD + conditional expression, atomicity delegated to TransactWriteItems |
| SSoT / DRY | payment logic consolidated in the shared Layer. Reason-code judgment in one place too |
| SRP | separate "judgment (pure function)" and "execution (I/O)." The builder returns only data |
| Classify failure | distinguish retriable (CONFLICT/THROTTLED) and non-retriable (INSUFFICIENT/CAP_EXCEEDED) by type |
| Cost efficiency | idempotency markers auto-expire with TTL. Follow demand with serverless |
| Observability | not just absorb but measure. Judge by metrics |
| Non-stop evolution (ETC) | swap without stopping via Expand→Migrate→Contract |
Payment reliability isn't a flashy feature. "Nothing happened" itself is the achievement. No double charges happen, the balance doesn't break, no one notices even during migration — guaranteeing that by design, not by chance, is the duty of a system entrusted with actual money, I believe. In fact, I maintain 0 double charges / balance inconsistencies in production operation.
If you're troubled by the same kind of "reliability design of a foundation handling actual money/points," "idempotent, atomic data consistency in serverless," or "schema migration that doesn't stop production," I can help from the design stage.