Designing Systems That Don't Fall Over When External Dependencies Do: A Retry, Exponential-Backoff + Jitter, and Circuit-Breaker Implementation Guide

"External APIs fail." That is not pessimism; it is a premise of distributed systems. Payment providers, email delivery, inventory APIs, internal microservices: any dependency you do not control will eventually slow down, time out, or return 5xx, and you cannot predict when. The question is not whether the other side fails, but whether you fail with it.

On a serverless payment platform in the environmental sector, I designed and implemented a payment-reliability layer (exactly-once processing, atomic balance updates) built around decimal arithmetic and idempotency. On a B2B SaaS that won an award related to the Ministry of Economy, Trade and Industry's IT-introduction subsidy, I built resilience against an external API dependency, Stripe Connect. In payments, "no double charge even when something fails" is non-negotiable. This article explains the resilience patterns that paid off in that work. It stays faithful to the official AWS and Microsoft documentation, puts extra weight on when, how, and why to use each pattern, and shows everything in working TypeScript.

This article relies on primary sources only. Exponential backoff with jitter follows the AWS Builders' Library article Timeouts, retries, and backoff with jitter and the AWS Architecture Blog post Exponential Backoff And Jitter; circuit breaker, retry, and bulkhead follow the respective pattern pages of the Microsoft Azure Architecture Center. I give no specific figures (such as success-rate improvements) because I have not measured them.

0. The whole picture: resilience is a stack of six layers

Resilience is not a single silver bullet. It is a design that stacks defensive layers with different roles in the correct order. Confuse the roles or the order and the layers hurt you instead (the "retry storm" in §2 is the typical case).

Layer	What it protects	On-failure behavior	This article's chapter
Timeout	Your own thread/connection	Don't keep waiting, give up early	§3
Retry	Absorbing transient failures	Wait and retry (idempotency premised)	§1, §2
Backoff + jitter	Preventing the mutual collapse of you and the other side	Disperse retries in time	§2
Circuit breaker	Cutting off cascade failures	Fail immediately (fail fast)	§4
Bulkhead	Isolating resource exhaustion	Drop only a part	§5
Idempotency	The premise of all the above	Same result however many times executed	§1 (cross-cutting)

The article proceeds in this order, starting with the most important principle: when you may retry at all. Miss it and everything that follows becomes counterproductive.

1. The most important principle: retry only "transient × idempotent"

The first principle to internalize for resilience patterns is this.

Retry only for "transient faults" AND "idempotent operations."

A retry that doesn't satisfy both is destruction, not improvement. Let me look at them in order.

1-1. What is a "transient fault"

Azure's Retry pattern limits retry targets to "transient faults." It says, "These faults are typically self-correcting, and if the action that triggered a fault is repeated after a suitable delay it's likely to be successful."

Conversely, failures you must not retry are also clearly defined. The same doc lists 3 strategies.

Cancel: if the fault isn't transient, or a retry is unlikely to succeed, cancel the operation and report an exception.
Retry immediately: for an exceptional failure like a rare network-packet corruption, you may retry immediately.
Retry after delay: for a common connection failure or busy state, insert a delay and retry.

In practice, classifying mechanically by HTTP status and error kind is robust. Most 4xx responses (especially 400/401/403/404/422) must not be retried: they are client errors meaning "the request itself is wrong," so the result does not change no matter how many times you send it. The only things you may retry are signals that describe the other side's condition.

Kind of failure	Example	Retry?	Reason
Network-related	Connection failure, ECONNRESET, transient DNS failure	Yes	A typical transient fault
Timeout	The response doesn't return	Yes (only if idempotent)	The other side may have already processed it → idempotency mandatory
5xx (some)	502/503/504	Yes	Server-side temporary overload / restart
429 Too Many Requests	Rate limit	Yes (honor Retry-After)	Follow the wait time the other side instructs
4xx client error	400/401/403/404/422	No	Resending yields the same result. Hides the bug
Business-logic exception	Insufficient balance, validation failure	No	Not transient. The Azure docs state this too

The Azure docs explicitly cite, as a case where "the Retry pattern is not useful," "for handling failures that aren't due to transient faults, such as internal exceptions caused by errors in the business logic." Retrying a 4xx is self-deception of the worst kind: it delays the discovery of a bug and hides it.
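The status table above can be written down as a small classifier. This is a minimal sketch, not an official mapping: the names classifyStatus and RetryDecision are hypothetical, and it handles only the delta-seconds form of Retry-After (the header may also carry an HTTP date).

```typescript
// Hypothetical helper: maps a failure status to a retry decision, following the table above.
type RetryDecision =
  | { retry: true; waitMs?: number } // waitMs is set only when Retry-After was given in seconds
  | { retry: false; reason: string };

function classifyStatus(status: number, retryAfterHeader: string | null = null): RetryDecision {
  if (status === 429) {
    // Assumption: only the delta-seconds form is parsed; an HTTP date falls back to plain backoff.
    const secs = retryAfterHeader === null ? NaN : Number(retryAfterHeader);
    return Number.isFinite(secs) && secs >= 0 ? { retry: true, waitMs: secs * 1000 } : { retry: true };
  }
  if (status === 502 || status === 503 || status === 504) return { retry: true };
  if (status >= 400 && status < 500) {
    return { retry: false, reason: "client error: resending yields the same result" };
  }
  // Anything else (including 500) is not listed as transient in the table, so it is not retried here.
  return { retry: false, reason: `status ${status} is not classified as transient` };
}
```

Keeping the decision in one function means a change of policy (for example, deciding to treat 500 as transient for a specific dependency) is a one-line edit rather than a search through call sites.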

1-2. If it's not "idempotent," a retry becomes double execution

This is where payment systems get hurt. The Azure Retry pattern writes about idempotency:

"Consider whether the operation is idempotent. If so, it's inherently safe to retry. Otherwise, a retry might cause the operation to be performed more than once, with unintended side effects. For example, a service might receive the request, process the request successfully, but fail to send a response. At that point, the retry logic might re-send the request."

The AWS Builders' Library says the same thing in stronger words: "APIs with side effects aren't safe to retry unless they provide idempotency. This guarantees that no matter how many times you retry, the side effect happens only once."

In other words, what makes a timeout dangerous is that the caller cannot tell whether the request failed or whether it succeeded and only the response was lost. Naively resending POST /charge (a charge) after a timeout double-charges the customer.

The idempotency key prevents this. The client generates a unique key and attaches it to the request, and the server uses that key to decide whether the request is new or a retry.

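// Client side: issue one idempotency key per operation.
// The crux is reusing the same key across retries (a fresh key per attempt defeats the purpose).
import { randomUUID } from "node:crypto";

// Placeholder for the §7 wrapper (timeout + retry + breaker); this signature is assumed for illustration.
declare function resilientFetch(url: string, init: RequestInit): Promise<Response>;

type ChargeRequest = {
  amount: number;
  currency: "JPY";
  customerId: string;
};

async function chargeWithIdempotency(req: ChargeRequest): Promise<void> {
  // One key per "intent to charge". It is generated once here, so every retry made
  // inside resilientFetch sends the same key. If the caller may invoke this function
  // again for the same intent, derive the key from a stable ID (such as the order ID in §7).
  const idempotencyKey = randomUUID();

  // All retries happen inside this single call, so the key stays constant across them
  await resilientFetch("https://api.example.com/charges", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Idempotency-Key": idempotencyKey, // contract: the same key never charges twice
    },
    body: JSON.stringify(req),
  });
}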

The server uses this key to guarantee that a request it has already processed is not charged again and returns the same result as before. This is what "exactly-once" amounts to in practice. It is the layer I designed in the payment platform's reliability layer, which has kept double charges at zero in production. The concrete implementation of idempotent async processing (deduplication via DynamoDB conditional writes) is covered with real code in the idempotent async-processing guide for SQS × Lambda × EventBridge; idempotency keys and webhook handling with Stripe are covered in the Stripe payments production-operations guide.

The one sentence to remember from this article: guarantee that the operation is idempotent before you add a retry mechanism. Reverse the order and the resilience layer becomes a double-charge machine.
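To make the server-side half of the contract concrete, here is a minimal in-memory sketch. It is not the implementation from the payment platform: the class name IdempotentCharges and its record shape are hypothetical, and a Map stands in for a durable store with atomic conditional writes.

```typescript
// Hypothetical sketch: deduplicate charges by idempotency key.
// Assumption: a real service needs a durable store and an atomic "insert if absent";
// an in-process Map only illustrates the decision logic.
type ChargeRecord = { chargeId: string; amount: number };

class IdempotentCharges {
  private readonly byKey = new Map<string, ChargeRecord>();
  private nextId = 1;

  charge(idempotencyKey: string, amount: number): { record: ChargeRecord; replayed: boolean } {
    const seen = this.byKey.get(idempotencyKey);
    if (seen) {
      // Retry of a request that was already processed: return the stored result, charge nothing.
      return { record: seen, replayed: true };
    }
    const record: ChargeRecord = { chargeId: `ch_${this.nextId++}`, amount };
    this.byKey.set(idempotencyKey, record);
    return { record, replayed: false };
  }

  get chargeCount(): number {
    return this.byKey.size;
  }
}

const charges = new IdempotentCharges();
const first = charges.charge("order-42", 1200);
const retried = charges.charge("order-42", 1200); // same key: the first result is replayed
console.log(first.record.chargeId === retried.record.chargeId, charges.chargeCount); // true 1
```

The property that matters is that the second call with the same key returns the stored record rather than creating a new charge.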

1-3. Express "retryability" with the type

To embed this judgment in code, express failures as a typed Result. Throwing exceptions freely scatters the "is this retryable?" decision across the call stack. Consolidate it in one place.

// Treat failures as classified values. Do not fall back to any or a raw throw.
export type Transient =
  | { kind: "network"; cause: unknown }
  | { kind: "timeout" }
  | { kind: "server"; status: 502 | 503 | 504 }
  | { kind: "throttled"; retryAfterMs?: number }; // 429

export type Permanent =
  | { kind: "client"; status: number } // 4xx (except 429)
  | { kind: "business"; code: string }; // domain-level failure

export type Result<T> =
  | { ok: true; value: T }
  | { ok: false; error: Transient | Permanent };

// Decide "may this failure be retried?" in exactly one place (DRY)
export function isRetryable(e: Transient | Permanent): e is Transient {
  return (
    e.kind === "network" ||
    e.kind === "timeout" ||
    e.kind === "server" ||
    e.kind === "throttled"
  );
}

The key point is that isRetryable returns the type predicate e is Transient. Inside a block that has been judged retryable, the error is typed as Transient, so after narrowing on kind you can access fields specific to transient failures, such as retryAfterMs, in a type-safe way.

2. Exponential backoff + jitter: a naive retry amplifies the failure

Once you have decided that you may retry, the next question is how to wait. The two traps beginners almost always fall into here are the naive immediate retry and fixed backoff without jitter.

2-1. Why "retrying simultaneously" is fatal (the retry storm)

Suppose a server temporarily returns 503 because it is overloaded. There are 1000 clients, and all of them retry at the same moment. The result is that 1000 requests hit the recovering server simultaneously, again. The AWS Builders' Library writes:

"If all the failed calls back off to the same time, they cause contention or overload again when they are retried."

This is a retry storm, also known as a thundering herd. Multi-layer retries amplify it further. AWS gives a concrete number:

"If each layer retries independently, the load on the database will increase 243x"—if 5 layers of services each retry 3 times, 3^5 = 243.

There are two countermeasures, and both AWS and Azure prescribe them.

Retry at "a single point in the stack" only (AWS: "retry at a single point in the stack"). Azure also writes, "If a task with a retry policy calls another task that also has a retry policy, it can cause lengthy delays. Lower-level tasks should fail fast, and only the top should handle it with a policy."
Add jitter (randomness) to the backoff.
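The 243x figure is plain multiplication, and the same arithmetic shows why retrying at a single point helps. A minimal sketch, assuming a hypothetical helper loadMultiplier that takes the maximum number of calls each layer makes per incoming call:

```typescript
// Hypothetical helper: worst-case load multiplier on the deepest dependency when
// each layer makes up to `attempts` calls downstream per incoming call.
function loadMultiplier(attemptsPerLayer: number[]): number {
  return attemptsPerLayer.reduce((product, attempts) => product * attempts, 1);
}

// Five layers, each making up to 3 calls: 3^5
const everyLayerRetries = loadMultiplier([3, 3, 3, 3, 3]);
// Only the top layer retries; the lower four fail fast (1 call each)
const singleRetryPoint = loadMultiplier([3, 1, 1, 1, 1]);

console.log(everyLayerRetries, singleRetryPoint); // 243 3
```

Moving all retries to one layer turns exponential growth in the stack depth into a constant factor.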

2-2. The official 4 backoff formulas (the identity of jitter)

The formulas below are the ones shown in the AWS Architecture Blog post Exponential Backoff And Jitter. base is the base wait, cap is the upper bound (capped backoff), attempt is the attempt count, and prev_sleep is the wait used on the previous attempt.

1. Plain exponential backoff (no jitter — bad)
   sleep = min(cap, base * 2^attempt)

2. Full Jitter (recommended)
   sleep = random_between(0, min(cap, base * 2^attempt))

3. Equal Jitter
   temp  = min(cap, base * 2^attempt)
   sleep = temp/2 + random_between(0, temp/2)

4. Decorrelated Jitter
   sleep = min(cap, random_between(base, prev_sleep * 3))

The same article concludes: "The no-jitter exponential backoff approach is the clear loser. It not only takes more work, but also takes more time than the jittered approaches." Comparing Full and Decorrelated, it states that the "'Full Jitter' approach uses less work," and concludes that both bring large reductions in client work and server load.

Method	Wait-time spread	Work	Adoption decision
No jitter	Zero (all clients wait the same time)	Most, slowest	Do not use (AWS calls it the "clear loser")
Full Jitter	Maximum (uniform over 0 to min(cap, base * 2^attempt))	Least	The default first choice
Equal Jitter	Medium (half fixed, half random)	Medium	When you want a guaranteed minimum wait
Decorrelated	Large	Slightly more	When you can keep state (each wait depends on the previous one)

When in doubt, use Full Jitter. The reason is simple: the AWS post finds that it does the least work, and it is also the simplest to implement.
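For side-by-side comparison, the four formulas translate directly into TypeScript. This is a minimal sketch: the function names and the injectable rng parameter (which makes the functions deterministic in tests) are my additions, not part of the AWS post.

```typescript
type Rng = () => number; // returns a value in [0, 1), like Math.random

const randomBetween = (lo: number, hi: number, rng: Rng): number => lo + rng() * (hi - lo);

// 1. Plain exponential backoff (no jitter)
function noJitter(attempt: number, base: number, cap: number): number {
  return Math.min(cap, base * 2 ** attempt);
}

// 2. Full Jitter: uniform over [0, capped exponential)
function fullJitter(attempt: number, base: number, cap: number, rng: Rng = Math.random): number {
  return randomBetween(0, noJitter(attempt, base, cap), rng);
}

// 3. Equal Jitter: half fixed, half random
function equalJitter(attempt: number, base: number, cap: number, rng: Rng = Math.random): number {
  const temp = noJitter(attempt, base, cap);
  return temp / 2 + randomBetween(0, temp / 2, rng);
}

// 4. Decorrelated Jitter: depends on the previous sleep instead of the attempt count
function decorrelatedJitter(prevSleep: number, base: number, cap: number, rng: Rng = Math.random): number {
  return Math.min(cap, randomBetween(base, prevSleep * 3, rng));
}

// With a fixed rng of 0.5, attempt 3, base 100 ms, cap 20 s:
const half: Rng = () => 0.5;
console.log(noJitter(3, 100, 20_000));         // 800
console.log(fullJitter(3, 100, 20_000, half)); // 400
console.log(equalJitter(3, 100, 20_000, half)); // 600
console.log(decorrelatedJitter(100, 100, 20_000, half)); // 200
```

Note that Decorrelated Jitter needs the caller to carry prevSleep between attempts (starting from base), which is the "state" the comparison table refers to.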

2-3. Naive implementation (bad) vs Full Jitter (good)

First, the example you must not write. It is a bomb built with good intentions.

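// Anti-pattern: fixed interval, no jitter, no idempotency check, no upper bound
async function badRetry<T>(fn: () => Promise<T>): Promise<T> {
  while (true) {                 // ❌ unbounded: loops forever while the failure lasts
    try {
      return await fn();
    } catch {
      await sleep(1000);         // ❌ fixed 1 second: all clients come back at the same time
      // ❌ no distinction between failure kinds (4xx and business failures get retried too)
    }
  }
}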

Next, the correct implementation combining §1's isRetryable and Full Jitter.

export type RetryOptions = {
  maxAttempts: number; // an upper bound is mandatory; unbounded retry is forbidden
  baseMs: number;      // base wait (e.g. 100)
  capMs: number;       // maximum wait (e.g. 20_000), i.e. capped backoff
  signal?: AbortSignal; // tied to the overall deadline (§3)
};

// Full Jitter: sleep = random(0, min(cap, base * 2^attempt))
function fullJitterDelay(attempt: number, baseMs: number, capMs: number): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * exp; // uniform in [0, exp); this is what breaks up the storm
}

export async function retry<T>(
  op: () => Promise<Result<T>>,
  opts: RetryOptions,
): Promise<Result<T>> {
  let lastError: Transient | Permanent = { kind: "timeout" };

  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    const res = await op();
    if (res.ok) return res;

    lastError = res.error;
    // Give up immediately on permanent failures (never retry 4xx or business failures)
    if (!isRetryable(res.error)) return res;

    // No more waiting after the final attempt
    if (attempt === opts.maxAttempts - 1) break;

    // For 429, treat the server's Retry-After as a minimum; wait longer if the jittered delay is larger
    const jittered = fullJitterDelay(attempt, opts.baseMs, opts.capMs);
    const delay =
      lastError.kind === "throttled" && lastError.retryAfterMs
        ? Math.max(lastError.retryAfterMs, jittered)
        : jittered;

    try {
      await sleepAbortable(delay, opts.signal); // interruptible by the overall deadline
    } catch {
      break; // deadline hit while waiting: return the last failure as a Result instead of throwing
    }
    // Observability: always emit the retry count as a metric (§8)
    emitMetric("retry.attempt", { attempt: attempt + 1, kind: lastError.kind });
  }
  return { ok: false, error: lastError };
}

The sleep utilities look like this. What matters in practice is that the wait follows the overall deadline through AbortSignal.

function sleep(ms: number): Promise<void> {
  return new Promise((r) => setTimeout(r, ms));
}

function sleepAbortable(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise((resolve, reject) => {
    if (signal?.aborted) return reject(new Error("aborted"));
    const onAbort = () => {
      clearTimeout(t);
      reject(new Error("aborted")); // deadline exceeded: stop waiting immediately
    };
    const t = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort); // do not leave one listener behind per sleep
      resolve();
    }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}

2-4. Cap the "total retry amount" with a token bucket

Jitter spreads retries over time, but it does not limit their total number. As a failure drags on, the absolute number of retries keeps growing even with jitter. The AWS Builders' Library describes an additional measure:

"Limiting retries locally using a token bucket ... allows all calls to retry as long as there are tokens, and then retry at a fixed rate when the tokens are exhausted."

// Manages a retry "budget": a success restores tokens, a retry spends one.
// Even if a failure drags on, the total number of retries is capped (prevents self-inflicted DDoS).
export class RetryBudget {
  private tokens: number;
  constructor(
    private readonly capacity: number,
    private readonly refillPerSuccess: number = 1,
  ) {
    this.tokens = capacity;
  }
  tryAcquire(): boolean {
    if (this.tokens <= 0) return false; // budget exhausted: fail immediately without retrying
    this.tokens -= 1;
    return true;
  }
  onSuccess(): void {
    this.tokens = Math.min(this.capacity, this.tokens + this.refillPerSuccess);
  }
}

This gives a double defense, jitter (spreading retries in time) plus a token bucket (capping their total), that structurally defuses the retry storm.
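The AWS quote also mentions retrying "at a fixed rate when the tokens are exhausted," which a success-refilled budget does not do by itself. As a variation, here is a minimal sketch of a time-refilled bucket. It is a different technique from the class above, and the name TimedRetryBucket, its parameters, and the injected clock are all hypothetical.

```typescript
// Hypothetical variant: tokens refill with elapsed time, so once the bucket is empty,
// retries continue at roughly `refillPerSec` per second instead of stopping entirely.
class TimedRetryBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private readonly capacity: number,
    private readonly refillPerSec: number,
    private readonly now: () => number = Date.now, // injected so tests can control time
  ) {
    this.tokens = capacity;
    this.lastRefill = now();
  }

  tryAcquire(): boolean {
    const t = this.now();
    const refilled = ((t - this.lastRefill) / 1000) * this.refillPerSec;
    this.tokens = Math.min(this.capacity, this.tokens + refilled);
    this.lastRefill = t;
    if (this.tokens < 1) return false; // empty: this retry is not allowed right now
    this.tokens -= 1;
    return true;
  }
}

// Example with a fake clock: capacity 2, one token per second.
let fakeNow = 0;
const bucket = new TimedRetryBucket(2, 1, () => fakeNow);
console.log(bucket.tryAcquire(), bucket.tryAcquire(), bucket.tryAcquire()); // true true false
fakeNow += 1000; // one second later, one token is back
console.log(bucket.tryAcquire(), bucket.tryAcquire()); // true false
```

Which refill rule fits better depends on the dependency: refilling on success ties retry volume to observed health, while refilling on time guarantees a steady trickle of retries during a long outage.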

3. The timeout budget: not waiting forever is the foundation of resilience

The whole retry discussion assumes that an appropriate timeout exists in the first place. Without one, a slow dependency holds your threads and connections indefinitely and leads to the situation the Azure Circuit Breaker pattern describes: "hold critical system resources such as memory, threads, and database connections ... This problem can exhaust resources, which might fail other unrelated parts of the system."

3-1. Two kinds of timeout: per-attempt and overall

In practice, hold it in 2 layers.

Per-attempt timeout: the threshold to judge that one call won't return.
Overall deadline: the "total time allowed for this operation," including retries. Back-calculated from the upper limit the user can wait.

// Per-attempt timeout. AbortController reliably cancels the underlying call.
async function withTimeout<T>(
  fn: (signal: AbortSignal) => Promise<T>,
  ms: number,
  parentSignal?: AbortSignal, // propagates the overall deadline
): Promise<T> {
  const ctrl = new AbortController();
  const onParentAbort = () => ctrl.abort();
  // If the deadline has already passed, the "abort" event will not fire again, so check up front
  if (parentSignal?.aborted) ctrl.abort();
  parentSignal?.addEventListener("abort", onParentAbort, { once: true });

  const timer = setTimeout(() => ctrl.abort(), ms);
  try {
    return await fn(ctrl.signal);
  } finally {
    clearTimeout(timer);
    parentSignal?.removeEventListener("abort", onParentAbort);
  }
}

// Overall deadline: caps the total time spent across all retries
function deadline(totalMs: number): AbortSignal {
  return AbortSignal.timeout(totalMs); // Node 18+ / modern browsers
}

Pass the overall deadline as retry's signal and you can budget, for example, 5 seconds per attempt and 15 seconds overall. Once the 15 seconds are up, the operation stops without waiting any longer, even if attempts remain.
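One consequence of the two-layer budget is that an attempt started late gets less than its full per-attempt timeout, because the overall deadline cuts it short. A minimal sketch of that arithmetic, assuming a hypothetical helper effectiveAttemptTimeoutMs that takes explicit timestamps in milliseconds:

```typescript
// Hypothetical helper: how long an attempt starting at `nowMs` can actually run
// before either the per-attempt timeout or the overall deadline stops it.
function effectiveAttemptTimeoutMs(perAttemptMs: number, deadlineAtMs: number, nowMs: number): number {
  const remaining = deadlineAtMs - nowMs;
  return Math.max(0, Math.min(perAttemptMs, remaining));
}

// 5 s per attempt, 15 s overall, operation started at t = 0:
console.log(effectiveAttemptTimeoutMs(5_000, 15_000, 0));      // 5000 (full per-attempt timeout)
console.log(effectiveAttemptTimeoutMs(5_000, 15_000, 12_000)); // 3000 (the deadline wins)
console.log(effectiveAttemptTimeoutMs(5_000, 15_000, 16_000)); // 0 (no budget left, do not start)
```

A result of 0 is a signal to skip the attempt entirely rather than to issue a request that is certain to be cancelled.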

3-2. How to decide the timeout value (by percentile, not intuition)

Picking a timeout like "30 seconds or so" by feel is not design. The AWS Builders' Library makes the method explicit.

"choose an acceptable rate of false timeouts (such as 0.1%). Then, we look at the corresponding latency percentile."

That is, with an acceptable false-timeout rate of 0.1%, set the timeout to the p99.9 of the downstream's measured latency distribution. A shorter value cuts off healthy responses; a longer one holds resources for no benefit.
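The percentile rule is easy to apply to recorded latencies. Below is a minimal sketch, assuming a hypothetical helper timeoutFromSamples and an in-memory array of samples; in a real system the percentile would usually come from your metrics backend.

```typescript
// Hypothetical helper: pick the smallest observed latency such that at most
// `falseTimeoutRate` of the samples are slower than it (they would have timed out).
function timeoutFromSamples(latenciesMs: number[], falseTimeoutRate: number): number {
  if (latenciesMs.length === 0) throw new Error("need at least one latency sample");
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  // Number of samples we accept to time out; the epsilon guards against floating-point error.
  const allowedTimeouts = Math.floor(falseTimeoutRate * sorted.length + 1e-9);
  const index = Math.max(0, sorted.length - 1 - allowedTimeouts);
  return sorted[index];
}

// Ten samples, 10% acceptable false timeouts: only the slowest sample may exceed the timeout.
const samples = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100];
console.log(timeoutFromSamples(samples, 0.1)); // 90
console.log(timeoutFromSamples(samples, 0));   // 100
```

The estimate is only as good as the sample: it should cover representative traffic, and it needs re-deriving when the downstream's latency profile changes.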

3-3. Beware the "re-delivery fork bomb" in async queues

Async processing is a common pitfall in timeout design. The AWS Builders' Library article Avoiding insurmountable queue backlogs warns that when processing latency under overload exceeds the SQS visibility timeout, "the service essentially fork-bombs itself": the visibility timeout expires, the same message is redelivered, and multiple copies of it are processed in parallel, which compounds the overload. The countermeasure is to stop processing a message once it has expired, or to send SQS a heartbeat that extends the visibility timeout while processing continues. Here too, idempotency is the last line of defense (processing the same message twice must not corrupt the result). Implementing a transactional delivery guarantee is covered in detail in the Transactional Outbox pattern guide.

4. The circuit breaker: stop cascade failures by "immediate cutoff"

Retry is a pattern that assumes the dependency will recover soon. When a failure drags on, continuing to retry only does harm. The Azure Circuit Breaker pattern draws the distinction this way:

"The Circuit Breaker pattern serves a different purpose from the Retry pattern. The Retry pattern enables an application to retry an operation with the expectation that it eventually succeeds. The Circuit Breaker pattern prevents an application from performing an operation that's likely to fail."

The two are used in combination. As Azure says, "use the Retry pattern to invoke an operation through a circuit breaker. But the retry logic should be sensitive to the exceptions the breaker returns, and should stop retrying if the breaker indicates 'not transient.'"

4-1. The 3 states (faithful to the official definitions)

Drop Azure's definitions straight into the implementation.

Closed: normal operation. Requests pass. Count failures, and if a threshold is exceeded within a set period, transition to Open and start the timeout timer. The failure counter resets periodically on a time basis (so it doesn't open on occasional failures).
Open: requests fail immediately and return an exception (fail fast). On timer expiry, transition to Half-Open.
Half-Open: pass only a limited number of test requests. If they succeed, judge "it's fixed" and return to Closed, resetting the failure counter. If even one fails, judge "still no good" and immediately return to Open, restarting the timer.

Azure makes the significance of this Half-Open explicit. "The Half-Open state helps prevent a recovering service from suddenly being flooded with requests." Putting full load on a recovering service knocks it down again.

Transition	Trigger	Official basis
Closed → Open	Failures in the period exceed the threshold	"failure threshold is reached"
Open → Half-Open	The timeout timer expires	"time-out timer expired"
Half-Open → Closed	Consecutive successes reach the threshold	"success count threshold is reached"
Half-Open → Open	Even one test request fails	"the operation failed"

4-2. A typed circuit-breaker implementation

type BreakerState = "closed" | "open" | "half-open";

export type BreakerConfig = {
  failureThreshold: number;   // in Closed, this many failures trips the breaker to Open
  successThreshold: number;   // in Half-Open, this many consecutive successes closes it again
  openDurationMs: number;     // how long to stay Open (then move to Half-Open)
  rollingWindowMs: number;    // time window for the failure counter (older failures are forgotten)
};

export class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures: number[] = []; // failure timestamps (rolling window)
  private successesInHalfOpen = 0;
  private openedAt = 0;

  constructor(
    private readonly cfg: BreakerConfig,
    private readonly now: () => number = Date.now, // injected for testability
    private readonly onStateChange?: (s: BreakerState) => void, // observability
  ) {}

  /** Decides, before the call, whether the request may pass. */
  private canPass(): boolean {
    if (this.state === "open") {
      // Once the Open duration has elapsed, move to Half-Open and let a trial request through
      if (this.now() - this.openedAt >= this.cfg.openDurationMs) {
        this.transition("half-open");
        this.successesInHalfOpen = 0;
        return true;
      }
      return false; // still Open: fail fast
    }
    // Closed and Half-Open let requests through. Note that this simple version
    // does not cap the number of concurrent trial requests in Half-Open.
    return true;
  }

  async execute<T>(op: () => Promise<Result<T>>): Promise<Result<T>> {
    if (!this.canPass()) {
      // While Open, fail immediately and never touch the downstream (this is the essence)
      return { ok: false, error: { kind: "server", status: 503 } };
    }
    const res = await op();
    if (res.ok) this.onSuccess();
    else if (isRetryable(res.error)) this.onFailure(); // count transient failures only
    // Permanent failures (4xx etc.) say nothing about downstream health, so they are not counted
    return res;
  }

  private onSuccess(): void {
    if (this.state === "half-open") {
      this.successesInHalfOpen++;
      if (this.successesInHalfOpen >= this.cfg.successThreshold) {
        this.failures = [];
        this.transition("closed"); // recovery confirmed: back to normal operation
      }
    } else {
      this.failures = []; // a success in Closed clears the failure counter
    }
  }

  private onFailure(): void {
    if (this.state === "half-open") {
      this.trip(); // a failure during the trial goes straight back to Open
      return;
    }
    const t = this.now();
    // Drop failures outside the rolling window (so occasional failures do not open the breaker)
    this.failures = this.failures.filter((ts) => t - ts < this.cfg.rollingWindowMs);
    this.failures.push(t);
    if (this.failures.length >= this.cfg.failureThreshold) this.trip();
  }

  private trip(): void {
    this.openedAt = this.now();
    this.transition("open");
  }

  private transition(next: BreakerState): void {
    if (this.state !== next) {
      this.state = next;
      this.onStateChange?.(next); // always surface state transitions as metrics/alerts
    }
  }
}

Two design points are crucial. (1) Do not count permanent failures (4xx) toward the breaker's failure threshold. A 4xx means "my request is wrong," not "the downstream is unhealthy," so opening on it is a false positive. (2) Always observe state transitions (onStateChange). Azure also recommends, "If the circuit breaker raises an event on each state change, it can be used to monitor the health of the protected component, or to alert an administrator when it goes Open." The moment the breaker goes Open should almost always raise an alert.

4-3. When to use, and when not

Azure makes the application conditions explicit. Use it when you want to "prevent cascade failures" or "protect against slow dependencies to defend the SLO."

Conversely, the official docs also list cases where you should not use it. Not knowing them leads to needless complexity.

Managing access to a local private resource (an in-memory data structure, etc.) → the breaker is just overhead.
Using it as a substitute for business-logic exception handling → the purpose is different.
Message-driven / event-driven architectures → failed messages have a mechanism to go to a DLQ, and the built-in isolation/retries are often enough.
Failure recovery is managed at the infra/platform layer (a load balancer's or service mesh's health checks) → no need to duplicate it at the app layer.

In fact, if you use a service mesh (Istio/Envoy, etc.), Azure describes pushing the breaker into the sidecar as an option. It keeps the breaker logic out of the application code.

5. The bulkhead: don't drag "everything" down with one dependency's failure

If the circuit breaker is isolation along the time axis, the bulkhead is isolation along the resource axis. The Azure Bulkhead pattern takes its name from a ship's watertight bulkheads: "even if the hull is damaged, only the damaged compartment floods, preventing sinking."

5-1. The problem it solves: cascade of connection-pool exhaustion

The typical accident Azure depicts. "If a consumer keeps sending requests to an unresponsive service, the resources those requests use (like the client's connection pool) get exhausted. At that point, the consumer's requests to other services are also affected. Eventually, you can no longer send requests to any service, not just the originally-unresponsive one."

A slow dependency A eats up the shared connection pool and drags down calls to unrelated dependencies B and C—this is a cascade from resource exhaustion.

5-2. The solution: split concurrency slots per dependency

Azure's solution is "the consumer partitions resources so that the resources to call service A don't affect those to call service B. Allocate a connection pool per service. Even if a service starts to fail, the impact stays only in that service's pool." In TypeScript you can express it with a semaphore (an upper limit on concurrency).

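// Give each dependency its own concurrency slots. If A clogs up, the slots for B and C are unaffected.
export class Semaphore {
  private active = 0;
  private queue: Array<() => void> = [];
  constructor(private readonly max: number) {}

  async run<T>(fn: () => Promise<T>): Promise<T> {
    if (this.active >= this.max) {
      // Full: wait until a finishing task hands over its slot (active already counts it)
      await new Promise<void>((resolve) => this.queue.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await fn();
    } finally {
      // Handing the slot over directly, instead of decrementing and re-incrementing,
      // keeps a newly arriving caller from slipping in and pushing active above max.
      const next = this.queue.shift();
      if (next) next();
      else this.active--;
    }
  }
}

// Separate bulkhead per dependency. A and B cannot eat into each other's resources.
const bulkheads = {
  paymentApi: new Semaphore(10), // payments are critical: reserve 10 slots
  emailApi: new Semaphore(3),    // email must never stall the main flow: only 3 slots
} as const;

// Even if the email API clogs up completely, it consumes at most 3 slots. The 10 payment slots are untouched.
await bulkheads.emailApi.run(() => sendReceiptEmail(order));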

Azure discusses how to choose bulkhead granularity, mentions thread pools, semaphores, and queues as isolation mechanisms, and introduces libraries such as resilience4j for Java and Polly for .NET. In Node, a lightweight semaphore like the one above or a library like p-limit is enough.

The official docs also list cases where you should not use it: "When you can't accept the reduced resource-utilization efficiency" and "when the added complexity is unneeded." Because a bulkhead splits resources into fixed slots, the tradeoff is that you cannot always use them to the full. It is most effective when critical dependencies are mixed with dependencies you can afford to drop.

6. The courage not to retry: the trap of fallback

When people think of resilience, they tend to picture "keep operating through some alternate path no matter what" (fallback). But the AWS Builders' Library article Avoiding fallback in distributed systems strongly warns that fallback in distributed systems is dangerous in many cases. This is counterintuitive, so make sure you understand it.

6-1. Why fallback betrays you

Let me quote the core paradox.

"If hitting the database directly was more reliable than going through the cache, why bother with the cache in the first place?"

A fallback (for example, hitting the DB directly when the cache dies) exists only because the backup path is considered inferior to the primary. Yet the design hopes that, exactly when the primary is down, this inferior path will hold. In the 2001 Amazon outage that AWS cites, the cache failed and the fallback to hitting the DB directly "generated enough load to completely lock up the DB," expanding a partial failure into a full-site outage.

Furthermore, "distributed fallback strategies are hard to test" and "harbor latent bugs that show up only when an unlikely set of coincidences occur ... months or years after their introduction."

6-2. What to do instead of fallback

AWS's alternatives are these.

Raise the primary's reliability (rather than adding fallback branches, make it less prone to falling over in the first place. E.g., use the inherently highly-available DynamoDB).
Let the caller handle errors (don't fall back within the service; let the caller retry).
Turn it into failover: "Exercise backup paths continuously in production" so they're as reliable as the primary. Code that isn't normally used is surely broken when it matters.
Monitor the retry rate and watch that retries don't become a de-facto fallback.

A practical takeaway: before writing a casual fallback, first consider strengthening the primary and honestly returning an error to the caller (fail fast). Especially in payments, a fallback like "the charge API is down, so charge via another path" is the shortest route to broken consistency. Returning the failure honestly and letting the caller retry safely with an idempotency key is far more robust.

7. All-in-one: a resilient external-API client

Compose the layers covered so far into one wrapper. The order is decisive. From the outside in: breaker → retry → bulkhead → timeout → the actual call.

// 依存先ごとに1インスタンス。状態（ブレーカー/予算/枠）を共有する。
export class ResilientClient {
  constructor(
    private readonly breaker: CircuitBreaker,
    private readonly budget: RetryBudget,
    private readonly bulkhead: Semaphore,
    private readonly retryOpts: RetryOptions,
    private readonly perAttemptTimeoutMs: number,
  ) {}

  async call<T>(
    fn: (signal: AbortSignal) => Promise<T>,
    parse: (raw: unknown) => Result<T>, // 境界で型を絞る（信頼しない）
    overallDeadline: AbortSignal,
  ): Promise<Result<T>> {
    // 最外層：ブレーカー。Open中はここで即fail（下流を一切叩かない）
    return this.breaker.execute(() =>
      // 次層：リトライ（冪等な操作前提・Full Jitter）
      retry<T>(async () => {
        if (!this.budget.tryAcquire()) {
          return { ok: false, error: { kind: "throttled" } }; // 予算切れ
        }
        // 次層：バルクヘッド（同時実行枠の隔離）
        return this.bulkhead.run(async () => {
          try {
            // 最内層：試行ごとタイムアウト（全体デッドラインも伝播）
            const raw = await withTimeout(fn, this.perAttemptTimeoutMs, overallDeadline);
            const res = parse(raw);
            if (res.ok) this.budget.onSuccess(); // 成功で予算回復
            return res;
          } catch (e) {
            // タイムアウト/ネットワーク例外を型付き Transient に正規化
            const err: Transient =
              e instanceof Error && e.name === "AbortError"
                ? { kind: "timeout" }
                : { kind: "network", cause: e };
            return { ok: false, error: err };
          }
        });
      }, { ...this.retryOpts, signal: overallDeadline }),
    );
  }
}

A concrete usage example against fetch. It's complete combined with the idempotency key (§1).

// 依存先ごとに設定。決済は重要なので枠もリトライも厚めに。
const paymentClient = new ResilientClient(
  new CircuitBreaker(
    { failureThreshold: 5, successThreshold: 2, openDurationMs: 30_000, rollingWindowMs: 10_000 },
    Date.now,
    (s) => emitMetric("breaker.state", { dependency: "payment", state: s }),
  ),
  new RetryBudget(100),
  new Semaphore(10),
  { maxAttempts: 4, baseMs: 100, capMs: 20_000 },
  5_000, // per-attempt 5s
);

async function chargeOrder(order: Order): Promise<Result<Charge>> {
  const overall = AbortSignal.timeout(15_000); // 全体15秒デッドライン
  const idempotencyKey = order.id; // 注文IDを決定的キーに（リトライで不変）

  return paymentClient.call<Charge>(
    (signal) =>
      fetch("https://api.example.com/charges", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          "Idempotency-Key": idempotencyKey, // structurally prevents double charges
        },
        body: JSON.stringify({ amount: order.total, currency: "JPY" }),
        signal, // guarantees cancellation on timeout/deadline
      }).then(async (r) => {
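        // Normalize the HTTP status into a typed Result (classify at the boundary)
        if (r.ok) return await r.json();
        if (r.status === 429) {
          // Retry-After is either delay-seconds or an HTTP-date; only the numeric form is handled here.
          const ra = r.headers.get("Retry-After");
          const secs = ra === null ? NaN : Number(ra);
          throw Object.assign(new Error("throttled"), {
            transient: { kind: "throttled", retryAfterMs: Number.isFinite(secs) ? secs * 1000 : undefined },
          });
        }
        if (r.status >= 500) throw new Error(`server ${r.status}`);
        throw new Error(`client ${r.status}`); // 4xx: permanent, must not be retried
      }),
    (raw) => ({ ok: true, value: raw as Charge }), // in practice, validate with zod or similar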
    overall,
  );
}

With this one wrapper, the timeout, Full Jitter retry, retry budget, bulkhead, circuit breaker, and idempotency key all take effect independently for each dependency. The point is that the config values (thresholds, slot counts, timeouts) can be tuned per dependency: generous for payments, lean for email.
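When tuning these values per dependency, it can help to check how the retry settings relate to the overall deadline. The following is a minimal sketch and not part of the article's client: `RetryTiming` and `worstCaseElapsedMs` are hypothetical names, and it assumes the wait before retry n has a ceiling of min(capMs, baseMs * 2^n) with n starting at 0, as in the AWS Full Jitter formula. The article's own `retry` may index attempts differently.

```typescript
// Assumption: the wait before retry n (n = 0, 1, ...) is drawn from
// [0, min(capMs, baseMs * 2^n)], as in the AWS Full Jitter formula.
interface RetryTiming {
  maxAttempts: number; // total tries, including the first
  baseMs: number;
  capMs: number;
  perAttemptTimeoutMs: number;
}

// Upper bound on wall-clock time if every attempt times out and every
// jittered wait happens to hit its ceiling.
function worstCaseElapsedMs(t: RetryTiming): number {
  let total = t.maxAttempts * t.perAttemptTimeoutMs;
  for (let retry = 0; retry < t.maxAttempts - 1; retry++) {
    total += Math.min(t.capMs, t.baseMs * 2 ** retry);
  }
  return total;
}

// Values taken from the paymentClient configuration above.
const payment: RetryTiming = {
  maxAttempts: 4,
  baseMs: 100,
  capMs: 20_000,
  perAttemptTimeoutMs: 5_000,
};
const overallDeadlineMs = 15_000;

const worstCase = worstCaseElapsedMs(payment); // 4 * 5000 + (100 + 200 + 400) = 20700
const deadlineIsBinding = worstCase > overallDeadlineMs; // true
console.log({ worstCase, deadlineIsBinding });
```

Under these assumptions the payment settings allow up to 20.7 s of attempts and waits, which exceeds the 15 s overall deadline. In a fully failing call it is therefore the deadline, not `maxAttempts`, that ends the work, and the fourth attempt may never run.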

8. Common pitfalls (the ones that actually cause accidents)

Finally, here are the landmines that come up most often in reviews. If even one applies, the resilience layer itself becomes a source of failures.

Pitfall	What happens	The correct fix
Retrying a non-idempotent write	A re-send after a timeout causes a double charge or double registration	Introduce the idempotency key first (§1). Don't reverse the order
Fixed or exponential backoff without jitter	All clients retry at the same moment (retry storm)	Full Jitter (§2). AWS calls no-jitter backoff the "clear loser"
Retrying 4xx	Delays and hides the bug while adding wasted load	Exclude permanent failures with `isRetryable` (§1)
Unbounded retry (while true)	Spins forever; the caller never returns	`maxAttempts` + an overall deadline + a token bucket
No timeout	A slow dependency holds threads/connections until they are exhausted	Two layers: per-attempt + overall (§3)
No breaker	Keeps retrying throughout a long outage and the failure cascades	Fail fast with a circuit breaker (§4)
Every layer retries in a multi-layer setup	Load amplifies exponentially (3 attempts at each of 5 layers: 3^5 = 243×)	Retry in one place only. Lower layers fail fast
A casual fallback	The backup path collapses along with the primary when it fails	Strengthen the primary + fail fast (§6)
Not observing the breaker state	Nobody notices the breaker is Open, and the outage drags on	Always turn state transitions into metrics and alerts (§4)
The breaker opens on 4xx	Your own bug mistakenly cuts off a healthy downstream	Don't count permanent failures toward the breaker's failure judgment (§4-2)
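The 243× figure in the table is plain multiplication, and a short sketch makes the contrast with "retry in one place only" concrete. `worstCaseAmplification` is a hypothetical helper, and each array entry is assumed to be the total number of attempts a layer makes per incoming request.

```typescript
// Worst-case number of requests reaching the deepest dependency for one
// request at the top, when every call at every layer fails.
function worstCaseAmplification(attemptsPerLayer: number[]): number {
  return attemptsPerLayer.reduce((product, attempts) => product * attempts, 1);
}

// Five layers, each making 3 attempts: 3^5.
const everyLayerRetries = worstCaseAmplification([3, 3, 3, 3, 3]); // 243

// Same stack, but only the top layer retries and the rest fail fast.
const singleRetryPoint = worstCaseAmplification([3, 1, 1, 1, 1]); // 3

console.log({ everyLayerRetries, singleRetryPoint });
```

Moving all retries to a single layer keeps the worst case linear in that layer's attempt count, instead of exponential in the depth of the stack.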

Three design principles cut across all of these. (1) Observability: always emit metrics for retry counts, breaker state transitions, and timeout occurrences. Resilience you can't see is no different from never having recovered. (2) Cost: retries translate directly into API calls and compute time. The token bucket and the overall deadline cap not only unreliability but also cost. (3) Type safety: express failures with Result<T> and the discriminated union Transient | Permanent, and consolidate the "retryability" judgment in one place, isRetryable. Fall back to any or a raw throw, and the judgment scatters across the codebase; sooner or later someone retries a 4xx.
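As an illustration of principle (3), here is a minimal sketch of classifying an HTTP status exactly once, at the boundary. The type shapes below are assumptions made for this example (in particular the `Permanent` shape and the `classifyStatus` and `toResult` helpers are hypothetical); the definitions in §1-3 take precedence. Mapping every non-429 4xx to permanent is a simplification.

```typescript
// Assumed shapes, for illustration only.
type Transient =
  | { kind: "timeout" }
  | { kind: "network"; cause: unknown }
  | { kind: "throttled"; retryAfterMs?: number };
type Permanent = { kind: "permanent"; status: number };
type Failure = Transient | Permanent;
type Result<T> = { ok: true; value: T } | { ok: false; error: Failure };

// The single place that decides retryability.
function isRetryable(error: Failure): error is Transient {
  return error.kind !== "permanent";
}

// Classify a status code once. Returns null for 2xx.
function classifyStatus(status: number, retryAfterMs?: number): Failure | null {
  if (status >= 200 && status < 300) return null;
  if (status === 429) return { kind: "throttled", retryAfterMs };
  if (status >= 500) return { kind: "network", cause: `server ${status}` };
  return { kind: "permanent", status }; // remaining 3xx/4xx: never retried
}

// Build a Result from a status and an already-parsed body.
function toResult<T>(status: number, body: T, retryAfterMs?: number): Result<T> {
  const failure = classifyStatus(status, retryAfterMs);
  return failure === null ? { ok: true, value: body } : { ok: false, error: failure };
}

const notFound = toResult(404, null);
const unavailable = toResult(503, null);
console.log(!notFound.ok && isRetryable(notFound.error)); // false
console.log(!unavailable.ok && isRetryable(unavailable.error)); // true
```

Because the classification is a value rather than a thrown Error, a retry loop or a breaker only ever consults `isRetryable` and never has to inspect status codes itself.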

Summary: resilience is something you "stack"

That external dependencies are unreliable is a given. Even so, you can build a system that doesn't fall over by stacking defensive layers with different roles in the correct order. To close, the whole article in 5 lines.

Retry only "transient × idempotent." Introduce the idempotency key first. Don't retry 4xx or business failures (the core of the AWS/Azure docs).
Always use Full Jitter for backoff. AWS declares no-jitter backoff the "clear loser." Multi-layer retries amplify load 243×, so retry in one place only.
Use timeouts in 2 layers (per-attempt + an overall deadline). Derive the values from the downstream's p99.9 latency. Not waiting forever is the foundation of resilience.
Cut long failures off immediately with a circuit breaker (closed/open/half-open). Always observe and alert on state transitions. Don't open on 4xx.
Isolate resources with a bulkhead, and don't write casual fallbacks (strengthen the primary + fail fast). Idempotency, observability, and type safety are the threads running through every layer.

I have implemented this design philosophy in the reliability layer of a serverless payment platform (exactly-once processing, atomic balance updates, zero double charges in production) and in the Stripe Connect integration of an award-winning B2B SaaS. "Fast, cheap, and safe with one person × generative AI (Claude Code)": the substance of that "safe" is the resilience layer described in this article. For consultation on designing systems that stay robust against unreliable external dependencies, reach out via Contact.

