友田 陽大

AWS ECS Fargate SRE Practical Guide: ADOT Distributed Tracing, EMF Metrics, and SLO / Error Budget / Burn-Rate Alert Design

Using production operations on ECS Fargate as the subject, this observability/SRE guide covers distributed tracing with OpenTelemetry/ADOT, JSON structured logs and correlation IDs, EMF custom metrics, RED/USE, SLOs, error budgets, burn-rate alerts, composite alarms, and sampling design, with real code (TypeScript/Terraform) that follows the official documentation.

Reading time: 20 min read

"A failure is happening. Plenty of logs are being emitted too. And yet, why it's slow, where it's failing—I can't tell."

"Alerts fire so much that nobody looks at them anymore. The one truly important one gets buried in noise."

"I put in console.log, but I can't trace across requests. There are CPU and memory metrics, but they don't connect to the user experience."

"We say 'aim for 99.9%,' but nobody can answer in numbers how well that's being kept right now."

If any of these sound familiar from your production operations, what's missing is not log volume but observability design. On the payment platform of a B2B subscription SaaS that won a Minister of Economy, Trade and Industry Award, I implemented idempotency, atomic transactions, and zero-downtime migrations, and achieved zero double charges in production. Underpinning that was an observability layer of structured logs tagged with correlation IDs, distributed tracing, and SLO-based alerts. This article draws on my experience operating 221 endpoints on an API Gateway → NLB → ALB → ECS configuration.

This article follows the official AWS and OpenTelemetry documentation (as of June 2026) and explains, with real code, when and how to use each technique. The official sources referenced are noted at the end of each section, so you can check the primary sources against your own environment as you go.

The scope this article covers (map)

  1. The three pillars of observability (logs / metrics / traces) and OpenTelemetry's positioning
  2. Structured logs: JSON-ification, correlation-ID propagation, log levels, PII redaction, CloudWatch Logs aggregation
  3. Distributed tracing: OpenTelemetry instrumentation + running the ADOT Collector as an ECS Fargate sidecar
  4. Metrics: RED / USE and EMF (Embedded Metric Format)
  5. SLO / SLI / error budget / burn rate
  6. Alert design: composite alarms, symptom-based, avoiding alert fatigue
  7. Resilience and cost: sampling, retention, dashboards
  8. Testability: synthetics monitoring, unit verification of instrumentation

Why "the logs are there but I can't tell the cause"

A typical failure has a clear structure.

| Symptom | Root cause | Where this article addresses it |
| --- | --- | --- |
| Can't trace across requests | No correlation ID (trace_id / request_id) in logs | Ch. 2: generating and propagating correlation IDs |
| Know "it's slow" but not "where" | No distributed tracing; service boundaries are invisible | Ch. 3: OpenTelemetry + ADOT |
| See CPU/memory but not user impact | Metrics skewed toward resource indicators; RED missing | Ch. 4: RED / USE and EMF |
| Alerts fire too much and get ignored | A proliferation of cause-based threshold alerts | Ch. 6: SLO burn rate, symptom-based alerts |
| Can't say in numbers whether the target is being met | SLI/SLO not defined | Ch. 5: SLO / error budget |

Observability is "the degree to which you can infer internal state from a system's external output (telemetry)." The goal is to be able to answer questions you did not anticipate in advance ("why is this particular payment for this particular user 500 ms slow?") without redeploying.


1. The three pillars of observability and OpenTelemetry's positioning

In OpenTelemetry, the kinds of telemetry are called signals, and the major ones are traces / metrics / logs. This is what's often expressed as "the three pillars."

  • Logs: records of individual discrete events. "What happened."
  • Metrics: time-series aggregated values. "At what scale it's happening."
  • Traces: the whole path of one request passing through the system. "Where it happened."

A trace is composed of spans, and a span represents a unit of work. Each span has a Span Context, which contains a trace_id (binding all spans within one trace) and a span_id (uniquely identifying an individual span). A child span references the parent's span_id, and this assembles spans spanning different processes, services, and data centers into one trace. The core concept enabling this assembly is Context Propagation.
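
To make the trace_id / span_id relationship concrete, here is a minimal sketch of how a context can travel between services in the W3C Trace Context `traceparent` header, which OpenTelemetry's default propagator uses. The function names (`parseTraceparent`, `startChild`, `toTraceparent`) are hypothetical and only handle version `00`; in real code the OpenTelemetry SDK does this for you.

```typescript
// traceparent.ts: illustrative sketch of W3C Trace Context propagation (not the OTel SDK)
import { randomBytes } from "node:crypto";

interface SpanContext {
  traceId: string; // 32 lowercase hex chars, shared by every span in the trace
  spanId: string; // 16 lowercase hex chars, unique per span
  sampled: boolean;
}

// Format: version "00", trace-id, parent span-id, flags
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Parse an incoming `traceparent` header; returns undefined if it is malformed.
function parseTraceparent(header: string): SpanContext | undefined {
  const m = TRACEPARENT.exec(header.trim());
  if (!m) return undefined;
  const [, traceId, spanId, flags] = m;
  // All-zero IDs are invalid
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Start a child span: keep the trace_id, mint a new span_id, remember the parent.
function startChild(parent: SpanContext): SpanContext & { parentSpanId: string } {
  return {
    traceId: parent.traceId,
    spanId: randomBytes(8).toString("hex"),
    sampled: parent.sampled,
    parentSpanId: parent.spanId,
  };
}

// Serialize a context for the outgoing request to the next service.
function toTraceparent(ctx: SpanContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

const incoming = parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
if (incoming) {
  const child = startChild(incoming);
  console.log(toTraceparent(child)); // same trace_id, freshly generated span_id
}
```

Because every hop keeps the trace_id and records its parent's span_id, a backend can reassemble spans from different processes into one tree.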

What's important here is OpenTelemetry's vendor neutrality. Write instrumentation once with OpenTelemetry's standard API, and the export destination can be the OpenTelemetry Collector or any OSS / commercial backend. You can switch the destination to X-Ray, CloudWatch, or another vendor's SaaS without rewriting the app's code—this is the essence of avoiding lock-in.

Source: Traces — OpenTelemetry


2. Structured logs: turn "points" into "lines" with a correlation ID

Why JSON-structured

Plain text like console.log("user paid", amount) is readable to humans but not searchable/aggregatable by machines. For querying field-by-field in CloudWatch Logs Insights, converting to metrics, and correlating with traces, logs should be structured JSON.

An example of a minimal structured logger (a thin wrapper equivalent to pino). What matters is building in the three points "log level," "correlation ID," and "PII redaction" from the start.

// logger.ts: minimal structured logger
import { AsyncLocalStorage } from "node:async_hooks";

type LogLevel = "debug" | "info" | "warn" | "error";

// Map levels to numbers so the output threshold can be controlled via an environment variable
const LEVELS: Record<LogLevel, number> = { debug: 10, info: 20, warn: 30, error: 40 };
// Fall back to "info" when LOG_LEVEL is unset or is not a known level
const MIN_LEVEL = LEVELS[process.env.LOG_LEVEL as LogLevel] ?? LEVELS.info;

// Holds the request-scoped correlation context (set by the middleware shown later)
export const requestContext = new AsyncLocalStorage<{
  requestId: string;
  traceId?: string;
}>();

// Redaction so that PII never reaches the logs. Masks recursively, based on key names
const PII_KEYS = new Set(["email", "password", "cardNumber", "phone", "token"]);
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([k, v]) =>
        PII_KEYS.has(k) ? [k, "[REDACTED]"] : [k, redact(v)],
      ),
    );
  }
  return value;
}

function emit(level: LogLevel, message: string, fields: Record<string, unknown> = {}): void {
  if (LEVELS[level] < MIN_LEVEL) return;
  const ctx = requestContext.getStore();
  const line = {
    level,
    message,
    timestamp: new Date().toISOString(),
    // Correlation IDs: the key to making logs searchable across a request
    requestId: ctx?.requestId,
    traceId: ctx?.traceId,
    // redact() returns unknown, so cast back to an object before spreading
    ...(redact(fields) as Record<string, unknown>),
  };
  // CloudWatch Logs ingests each line of standard output as one event
  process.stdout.write(`${JSON.stringify(line)}\n`);
}

export const log = {
  debug: (m: string, f?: Record<string, unknown>) => emit("debug", m, f),
  info: (m: string, f?: Record<string, unknown>) => emit("info", m, f),
  warn: (m: string, f?: Record<string, unknown>) => emit("warn", m, f),
  error: (m: string, f?: Record<string, unknown>) => emit("error", m, f),
};

Generating and propagating the correlation ID

The starting point of correlation is to issue an ID at the request boundary and store it in AsyncLocalStorage. If you instrument with OpenTelemetry, writing its traceId into every log line lets you jump between logs and traces in both directions.

// middleware.ts: correlation-ID middleware for Express
import { randomUUID } from "node:crypto";
import { trace } from "@opentelemetry/api";
import { requestContext } from "./logger";
import type { Request, Response, NextFunction } from "express";

export function correlationMiddleware(req: Request, res: Response, next: NextFunction): void {
  // Respect an X-Request-Id set upstream (ALB / frontend); generate one if it is absent
  const requestId = (req.header("x-request-id") ?? randomUUID()).slice(0, 64);
  // Take the traceId from the active span (assumes OTel instrumentation is loaded)
  const traceId = trace.getActiveSpan()?.spanContext().traceId;

  res.setHeader("x-request-id", requestId);
  // Until this request finishes, every log line automatically carries the correlation IDs
  requestContext.run({ requestId, traceId }, () => next());
}
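
Why this works without passing the ID through every function signature can be seen in a small standard-library-only sketch. The names `handleRequest` and `chargeCard` are hypothetical stand-ins for a request handler and a deeply nested function; the point is that the store set by `run()` survives `await` boundaries and does not leak between concurrent requests.

```typescript
// als-demo.ts: stdlib-only sketch showing that the store survives async boundaries
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

const ctx = new AsyncLocalStorage<{ requestId: string }>();

// A nested function: note that no requestId parameter is passed in
async function chargeCard(): Promise<string> {
  await new Promise((resolve) => setTimeout(resolve, 5)); // simulated I/O
  return ctx.getStore()?.requestId ?? "no-context";
}

// Hypothetical request handler: run() binds the store to everything called inside it
function handleRequest(requestId: string): Promise<string> {
  return ctx.run({ requestId }, () => chargeCard());
}

async function main(): Promise<void> {
  const [a, b] = [randomUUID(), randomUUID()];
  // Two concurrent "requests" interleave, yet each sees only its own ID
  const [seenA, seenB] = await Promise.all([handleRequest(a), handleRequest(b)]);
  console.log(seenA === a, seenB === b);
  // Outside run() there is no store, so the fallback value is returned
  console.log(await chargeCard());
}

main().catch(console.error);
```

This is the same mechanism the logger relies on when it calls `requestContext.getStore()` inside `emit`.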

Log-level design and aggregation

  • debug: only during development / failure investigation. Don't emit by default in production (LOG_LEVEL=info).
  • info: business events (payment succeeded, job completed, etc.). Candidates to later turn into metrics.
  • warn: auto-recovered anomalies (retry succeeded, fallback fired).
  • error: failures with user impact. The input to alerts and SLI calculation.

On ECS Fargate, the standard is to aggregate the container's standard output to CloudWatch Logs with the awslogs log driver (configured in the task definition's logConfiguration). The app side just needs to "write JSON one line at a time to standard output," and file rotation and a resident agent become unnecessary.

Source: Send Amazon ECS logs to CloudWatch — Amazon ECS / Analyzing log data with CloudWatch Logs Insights


3. Distributed tracing: OpenTelemetry instrumentation + ADOT sidecar

Instrumentation: auto-instrumentation first, manual if insufficient

OpenTelemetry's Node.js instrumentation is best started from auto-instrumentation. It can generate spans for Express, HTTP, gRPC, and various DB drivers with almost no code rewriting. Since it needs to load before the app body, inject it at startup with --import (or --require).

# Required packages (following the official OpenTelemetry steps)
npm install @opentelemetry/sdk-node \
  @opentelemetry/api \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-proto

// instrumentation.ts: loaded before the application code
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-proto";

const sdk = new NodeSDK({
  // Send OTLP/HTTP to the ADOT sidecar in the same task (localhost:4318)
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT
      ? `${process.env.OTEL_EXPORTER_OTLP_ENDPOINT}/v1/traces`
      : "http://localhost:4318/v1/traces",
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

Business-meaningful sections ("payment processing," "inventory allocation," etc.) are best carved out into explicit spans with manual instrumentation, so the cause section is visible at a glance on the trace.

// payment.ts: make business sections visible with manual instrumentation
import { trace, SpanStatusCode } from "@opentelemetry/api";
// Hypothetical module wrapping the payment provider's API (not part of this article)
import { callPaymentProvider } from "./payment-provider";

const tracer = trace.getTracer("payment-service");

export async function charge(orderId: string, amountJpy: number): Promise<void> {
  // A consistent "<domain>.<action>" naming convention makes spans easier to search
  await tracer.startActiveSpan("payment.charge", async (span) => {
    // Put high-cardinality IDs in span attributes; never use them as metric dimensions
    span.setAttribute("order.id", orderId);
    span.setAttribute("payment.amount_jpy", amountJpy);
    try {
      await callPaymentProvider(orderId, amountJpy);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      // Recording the exception on the span marks this section as an error in the trace view
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: "charge failed" });
      throw err;
    } finally {
      span.end();
    }
  });
}

Running the ADOT Collector as an ECS Fargate sidecar

Running the ADOT (AWS Distro for OpenTelemetry) Collector as a sidecar in the same task means the app only has to send OTLP to the Collector on localhost, and the export destination (X-Ray, CloudWatch, Prometheus, etc.) can be swapped by changing the Collector config alone. AWS's official task-definition snippet for X-Ray integration uses the public.ecr.aws/aws-observability/aws-otel-collector image and specifies the config file with --config.

# ecs_task.tf: Fargate task definition with the app and the ADOT sidecar
resource "aws_ecs_task_definition" "app" {
  family                   = "payment-api"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "1024"
  memory                   = "3072"
  task_role_arn            = aws_iam_role.task.arn       # permissions for the app and its instrumentation
  execution_role_arn       = aws_iam_role.execution.arn  # image pull / log writes

  container_definitions = jsonencode([
    {
      name      = "payment-api"
      image     = "${aws_ecr_repository.app.repository_url}:latest"
      essential = true
      environment = [
        # In awsvpc mode, containers in the same task can reach each other over localhost
        { name = "OTEL_EXPORTER_OTLP_ENDPOINT", value = "http://localhost:4318" },
        { name = "OTEL_SERVICE_NAME", value = "payment-api" },
        { name = "LOG_LEVEL", value = "info" }
      ]
      # Start the Collector first, then start the app
      dependsOn = [{ containerName = "aws-otel-collector", condition = "START" }]
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = "/ecs/payment-api"
          "awslogs-region"        = var.region
          "awslogs-stream-prefix" = "ecs"
          "awslogs-create-group"  = "true"
        }
      }
    },
    {
      name      = "aws-otel-collector"
      image     = "public.ecr.aws/aws-observability/aws-otel-collector:v0.30.0"
      essential = true
      # Example using the bundled default ECS config; it can be replaced with a config that enables X-Ray + EMF
      command   = ["--config=/etc/ecs/ecs-default-config.yaml"]
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = "/ecs/ecs-aws-otel-sidecar-collector"
          "awslogs-region"        = var.region
          "awslogs-stream-prefix" = "ecs"
          "awslogs-create-group"  = "true"
        }
      }
    }
  ])
}

To write traces to X-Ray, attach AWSXRayDaemonWriteAccess to the task role. To emit CloudWatch metrics/logs from the Collector, attach CloudWatchAgentServerPolicy. If you need least privilege, use these managed policies as a reference and grant only the actions you actually need.

# iam.tf: attach observability permissions to the task role (AWS managed policies; narrow them if you need least privilege)
resource "aws_iam_role_policy_attachment" "xray" {
  role       = aws_iam_role.task.name
  policy_arn = "arn:aws:iam::aws:policy/AWSXRayDaemonWriteAccess"
}

resource "aws_iam_role_policy_attachment" "cw_agent" {
  role       = aws_iam_role.task.name
  policy_arn = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
}

In the Collector config, the standard is to combine an OTLP receiver for receiving (gRPC 4317 / HTTP 4318) with an awsxray exporter (traces) and an awsemf exporter (metrics to CloudWatch as EMF) for sending.

# otel-config.yaml: ADOT Collector receivers / processors / exporters
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}   # batch spans/metrics before sending, to reduce API calls

exporters:
  awsxray: {}                 # traces → AWS X-Ray
  awsemf:                     # metrics → CloudWatch (via EMF)
    namespace: PaymentApi
    log_group_name: /metrics/payment-api

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [awsxray]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [awsemf]

Source: OpenTelemetry Node.js — getting started / Specifying the ADOT sidecar for AWS X-Ray integration — Amazon ECS


4. Metrics: RED / USE and EMF

What to measure: RED and USE

Metrics become cost and noise if you increase them blindly. Narrow "what to measure" with two standard frameworks.

| Framework | Target | Indicators | Main use |
| --- | --- | --- | --- |
| RED | Request-driven services | Rate / Errors / Duration | User impact; the raw material of SLIs |
| USE | Resources (CPU, memory, connection pools, etc.) | Utilization / Saturation / Errors | Bottleneck diagnosis |

The principle is "RED for SLIs and symptom alerts, USE for cause investigation." Users experience the success/failure and speed of requests, not internal CPU utilization.
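
As a rough illustration of what the three RED indicators are, the sketch below derives them from raw request records. The `RequestRecord` shape and the nearest-rank percentile helper are assumptions made for this example; in production CloudWatch computes these statistics from the emitted metrics.

```typescript
// red.ts: derive RED indicators from raw request records (illustrative sketch)
interface RequestRecord {
  status: number; // HTTP status code
  durationMs: number; // server-side latency
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function red(records: RequestRecord[], windowSeconds: number) {
  const errors = records.filter((r) => r.status >= 500).length;
  const durations = records.map((r) => r.durationMs);
  return {
    ratePerSec: records.length / windowSeconds, // Rate
    errorRatio: records.length === 0 ? 0 : errors / records.length, // Errors
    p50Ms: percentile(durations, 50), // Duration (median)
    p99Ms: percentile(durations, 99), // Duration (tail)
  };
}

const sample: RequestRecord[] = [
  { status: 200, durationMs: 80 },
  { status: 200, durationMs: 120 },
  { status: 500, durationMs: 900 },
  { status: 200, durationMs: 95 },
];
console.log(red(sample, 60));
```

Duration is reported as percentiles rather than an average because a single slow request (900 ms here) is exactly what users notice and what an average hides.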

EMF: emit custom metrics on the same path as logs

Embedded Metric Format (EMF) is a specification in which you embed _aws metadata in a structured log event so that CloudWatch Logs automatically extracts metric values from it. The app can emit custom metrics just by writing one JSON log line, without separately calling the PutMetricData API. The requirement is that the root node contains an _aws object with a Timestamp and a CloudWatchMetrics array (Namespace, Dimensions, Metrics).

// emf.ts: emit RED's Duration / Errors via EMF
type EmfUnit = "Milliseconds" | "Count";

function putEmf(
  namespace: string,
  metrics: { name: string; unit: EmfUnit; value: number }[],
  dimensions: Record<string, string>,
): void {
  const line: Record<string, unknown> = {
    _aws: {
      Timestamp: Date.now(),
      CloudWatchMetrics: [
        {
          Namespace: namespace,
          // Caution: never use high-cardinality values (requestId, etc.) as dimensions.
          // Each distinct combination creates a separately billed metric
          Dimensions: [Object.keys(dimensions)],
          Metrics: metrics.map((m) => ({ Name: m.name, Unit: m.unit })),
        },
      ],
    },
    ...dimensions,
  };
  for (const m of metrics) line[m.name] = m.value;
  process.stdout.write(`${JSON.stringify(line)}\n`);
}

export function recordRequest(route: string, status: number, durationMs: number): void {
  putEmf(
    "PaymentApi",
    [
      { name: "Duration", unit: "Milliseconds", value: durationMs },
      { name: "Errors", unit: "Count", value: status >= 500 ? 1 : 0 },
      { name: "Count", unit: "Count", value: 1 },
    ],
    // Keep dimensions low-cardinality (route and environment at most)
    { Route: route, Environment: process.env.NODE_ENV ?? "production" },
  );
}

The JSON below follows the shape of the official spec's minimal example (request latency), adapted to this article's namespace and dimensions. Note that each distinct combination of dimension values becomes a separate CloudWatch metric.

{
  "_aws": {
    "Timestamp": 1574109732004,
    "CloudWatchMetrics": [
      {
        "Namespace": "PaymentApi",
        "Dimensions": [["Route", "Environment"]],
        "Metrics": [{ "Name": "Duration", "Unit": "Milliseconds" }]
      }
    ]
  },
  "Route": "/charge",
  "Environment": "production",
  "Duration": 135.5
}
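
To see why dimension cardinality matters, the following sketch counts how many distinct metrics a batch of events would create, treating each (metric name, dimension values) combination as one metric. The `EmfEvent` type and `countDistinctMetrics` are hypothetical helpers for illustration, not part of any AWS SDK.

```typescript
// emf-cardinality.ts: count the distinct metrics a batch of EMF events would create
interface EmfEvent {
  dimensions: Record<string, string>;
  metricNames: string[];
}

function countDistinctMetrics(events: EmfEvent[]): number {
  const seen = new Set<string>();
  for (const e of events) {
    // Sort keys so the same dimension set always yields the same identity
    const dims = Object.keys(e.dimensions)
      .sort()
      .map((k) => `${k}=${e.dimensions[k]}`)
      .join(",");
    for (const name of e.metricNames) seen.add(`${name}|${dims}`);
  }
  return seen.size;
}

const names = ["Duration", "Errors", "Count"];

// Low cardinality: 3 routes x 1 environment
const byRoute: EmfEvent[] = ["/charge", "/refund", "/status"].map((route) => ({
  dimensions: { Route: route, Environment: "production" },
  metricNames: names,
}));

// High cardinality: a unique RequestId per event (the anti-pattern)
const byRequest: EmfEvent[] = Array.from({ length: 1000 }, (_, i) => ({
  dimensions: { Route: "/charge", RequestId: `req-${i}` },
  metricNames: names,
}));

console.log(countDistinctMetrics(byRoute)); // 9
console.log(countDistinctMetrics(byRequest)); // 3000
```

With route and environment as dimensions the count stays bounded by the number of routes; with a per-request ID it grows with traffic, which is why such IDs belong in log fields or span attributes instead.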

Source: Specification: Embedded metric format — Amazon CloudWatch


5. SLO / SLI / error budget

Align the definitions

  • SLI (Service Level Indicator): an indicator expressing the service's good/bad. Often a ratio of "number of good events ÷ number of all events" (e.g., the proportion of non-5xx requests, the proportion of requests under a latency threshold).
  • SLO (Service Level Objective): the target value of the SLI (e.g., "availability 99.9% over 30 days").
  • Error Budget: the amount of failure allowed within the SLO period. For 99.9%, the allowed failure is 0.1%. Capturing this 0.1% as a "budget you can spend down" is the core.
  • Burn Rate: the speed of consuming the error budget. 1 is the pace of spending it exactly over the period, 10 is the pace of depleting it 10× as fast.
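
The arithmetic behind these definitions is short enough to write down. The sketch below uses hypothetical helper names and assumes a ratio-based SLI; it shows how a 99.9% SLO over 30 days translates into a budget, and what a burn rate of 10 means in practice.

```typescript
// slo-math.ts: error budget and burn-rate arithmetic (illustrative sketch)

// Fraction of events allowed to fail, e.g. an SLO of 0.999 gives 0.001
function errorBudget(slo: number): number {
  return 1 - slo;
}

// The budget expressed as minutes of full outage over the SLO period
function allowedDowntimeMinutes(slo: number, periodDays: number): number {
  return errorBudget(slo) * periodDays * 24 * 60;
}

// Burn rate = observed error ratio / error budget
function burnRate(errorRatio: number, slo: number): number {
  return errorRatio / errorBudget(slo);
}

// Days until the whole budget is gone if the current burn rate holds
function daysToExhaustion(rate: number, periodDays: number): number {
  return rate <= 0 ? Infinity : periodDays / rate;
}

console.log(allowedDowntimeMinutes(0.999, 30).toFixed(1)); // "43.2"
console.log(burnRate(0.01, 0.999).toFixed(1)); // "10.0"
console.log(daysToExhaustion(10, 30)); // 3
```

In other words, a 1% error ratio against a 99.9% SLO burns the budget ten times too fast, and a 30-day budget would be gone in 3 days.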

How to choose SLIs

Choose SLIs that are directly tied to the user experience and measurable. The previous chapter's RED metrics (Errors / Duration) become the raw material as-is.

| SLI type | Definition of a good event | Data source |
| --- | --- | --- |
| Availability | Requests whose HTTP status is not 5xx | EMF's Errors / Count |
| Latency | Requests answered in 300 ms or less | EMF's Duration |
| Freshness | Jobs processed within a prescribed time | Batch-completion metrics |
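
Each row above is the same calculation with a different definition of "good." A minimal sketch, assuming request records with a status and a duration (the `Req` type, the sample data, and the treatment of an empty window are assumptions for illustration):

```typescript
// sli.ts: ratio-based SLIs as "good events / all events" (illustrative sketch)
interface Req {
  status: number;
  durationMs: number;
}

// Generic ratio SLI, parameterized by what counts as a good event
function sli(events: Req[], isGood: (r: Req) => boolean): number {
  if (events.length === 0) return 1; // assumption: no traffic counts as meeting the objective
  return events.filter(isGood).length / events.length;
}

const isAvailable = (r: Req): boolean => r.status < 500; // availability: not a 5xx
const isFast = (r: Req): boolean => r.durationMs <= 300; // latency: 300 ms or less

// Ten sample requests: one 5xx, two slower than 300 ms
const requests: Req[] = [
  ...Array.from({ length: 7 }, () => ({ status: 200, durationMs: 120 })),
  { status: 200, durationMs: 450 },
  { status: 404, durationMs: 40 },
  { status: 503, durationMs: 1200 },
];

const availability = sli(requests, isAvailable);
const latency = sli(requests, isFast);
console.log({ availability, latency, meetsSlo: availability >= 0.999 });
```

Note that the 404 counts as a good event for availability: client errors are usually not the service's failure, which is why the table defines "good" as "not 5xx."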

Decide with the error budget

The error budget is a mechanism to decide "when to stop feature development and invest in reliability" with numbers, not politics. If the budget is ample, push new features. If the budget is near depletion, freeze releases and pivot to resilience improvements. It frees the consensus-building of SRE and product from emotion.


6. Alert design: alert on symptoms, not on causes

Anti-pattern: a proliferation of cause-based thresholds

"Notify when CPU exceeds 80%," "notify on one 500 of a specific API"—piling up such cause-based alerts keeps them firing even without user impact, and eventually everyone ignores them (alert fatigue). The principle is to alert on symptoms: make only the signs that users are actually in trouble (rising error rate, worsening latency, SLO-violation pace) the targets of paging.

Burn-rate alerts (multi-window, multi-burn-rate)

The multiwindow, multi-burn-rate approach recommended in the Google SRE Workbook combines a long and a short time window to detect a genuinely ongoing problem with high precision. The representative recommended values, which assume a 30-day SLO period, are below.

| Severity | Long window | Short window | Burn rate | Error budget consumed |
| --- | --- | --- | --- | --- |
| Page (immediate response) | 1 hour | 5 minutes | 14.4x | 2% |
| Page (immediate response) | 6 hours | 30 minutes | 6x | 5% |
| Ticket | 3 days | 6 hours | 1x | 10% |

The long window suppresses false positives from brief spikes, while the short window confirms that the problem is still ongoing, so the alert resets quickly after recovery. Alerting only when both windows exceed the threshold at the same time balances precision and recall.
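
The numbers in the table are related by simple arithmetic: the share of budget consumed is the burn rate multiplied by the long window, divided by the SLO period. The sketch below reproduces that relationship and the "both windows" condition; the `BurnRule` type and function names are hypothetical, and a 30-day period is assumed.

```typescript
// burn-rules.ts: multi-window, multi-burn-rate evaluation (illustrative sketch)
interface BurnRule {
  severity: "page" | "ticket";
  longWindowHours: number;
  shortWindowHours: number;
  burnRate: number;
}

const SLO_PERIOD_HOURS = 30 * 24; // assumption: 30-day SLO period

const RULES: BurnRule[] = [
  { severity: "page", longWindowHours: 1, shortWindowHours: 5 / 60, burnRate: 14.4 },
  { severity: "page", longWindowHours: 6, shortWindowHours: 0.5, burnRate: 6 },
  { severity: "ticket", longWindowHours: 72, shortWindowHours: 6, burnRate: 1 },
];

// Share of the error budget spent if the burn rate holds for the whole long window
function budgetConsumed(rule: BurnRule): number {
  return (rule.burnRate * rule.longWindowHours) / SLO_PERIOD_HOURS;
}

// Fire only when BOTH windows exceed the rule's burn-rate threshold
function fires(rule: BurnRule, longBurn: number, shortBurn: number): boolean {
  return longBurn > rule.burnRate && shortBurn > rule.burnRate;
}

for (const rule of RULES) {
  console.log(rule.severity, `${(budgetConsumed(rule) * 100).toFixed(0)}%`);
}

// A spike that has already recovered: the long window is still high, the short is not
console.log(fires(RULES[0], 20, 1)); // false
// An ongoing incident: both windows are above 14.4x
console.log(fires(RULES[0], 20, 18)); // true
```

The first `fires` call shows the reset behavior: once the last 5 minutes are healthy, the page clears even though the 1-hour average is still elevated.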

Combine "both windows" with a CloudWatch composite alarm

Create "the long-window alarm" and "the short-window alarm" individually, and AND-combine them in a composite alarm's AlarmRule to implement the above burn-rate condition as-is. The rule expression is written with ALARM(...) / OK(...) and AND / OR / NOT.

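# alarms.tf: Page alarm that AND-combines the 1-hour and 5-minute windows
# Burn rate = error ratio / (1 - SLO). Assumes a 99.9% SLO, i.e. an error budget of 0.001
locals {
  slo_error_budget = 0.001
  burn_windows     = { "1h" = 3600, "5m" = 300 } # window name => period in seconds
  # Assumption: must match the dimension set emitted via EMF (Route + Environment)
  slo_dimensions = { Route = "/charge", Environment = "production" }
}

resource "aws_cloudwatch_metric_alarm" "burn" {
  for_each            = local.burn_windows
  alarm_name          = "slo-burn-${each.key}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  threshold           = 14.4 # 14.4x burn rate
  treat_missing_data  = "notBreaching"

  metric_query {
    id          = "burn"
    expression  = "(errors / requests) / ${local.slo_error_budget}"
    label       = "Burn rate"
    return_data = true
  }
  metric_query {
    id = "errors"
    metric {
      metric_name = "Errors"
      namespace   = "PaymentApi"
      period      = each.value
      stat        = "Sum"
      dimensions  = local.slo_dimensions
    }
  }
  metric_query {
    id = "requests"
    metric {
      metric_name = "Count"
      namespace   = "PaymentApi"
      period      = each.value
      stat        = "Sum"
      dimensions  = local.slo_dimensions
    }
  }
}

resource "aws_cloudwatch_composite_alarm" "burn_page" {
  alarm_name = "slo-burn-page"
  # Page the on-call only when both windows are in ALARM at the same time
  alarm_rule = join(" ", [
    "ALARM(${aws_cloudwatch_metric_alarm.burn["1h"].alarm_name})",
    "AND",
    "ALARM(${aws_cloudwatch_metric_alarm.burn["5m"].alarm_name})",
  ])
  alarm_actions = [aws_sns_topic.oncall.arn] # SNS topic assumed to be defined elsewhere
}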

The minimum line for on-call operations

  • Paging is only for events where "a human must move right now or users are in trouble." Otherwise, demote to a ticket or dashboard check.
  • Give each alarm a link to a runbook (response procedure) (a composite alarm's description field supports Markdown).
  • Periodically take stock of alerts that fired but no one acted on, and delete them or review thresholds.

Source: Create a composite alarm — Amazon CloudWatch / Alerting on SLOs — Google SRE Workbook


7. Resilience and cost: sampling, retention, dashboards

Observability easily becomes high-cost if you "record everything at maximum precision." Weave the balance of securing resilience and cost into the design.

  • Trace sampling: you don't need to trace every request. Trace 100% at low traffic and sample probabilistically (e.g., 10%) at high traffic. To make sure error traces are always kept, configure a tail-sampling policy on the Collector side that never drops traces containing errors.
  • Log retention: set a retention period on every CloudWatch Logs log group: short for hot investigation use (e.g., 30 days), with long-term storage (S3 export) for auditing. Indefinite retention is a cost accident waiting to happen.
  • EMF cardinality management: as touched on in Chapter 4, making a high-cardinality value like requestId a dimension mass-produces billed metrics per combination. Keep dimensions low-cardinality.
  • Coordination with resilience patterns: retries (exponential backoff), circuit breakers, and timeouts each emit dedicated metrics (retry count, breaker open/close, timeout count) to make "is it working" observable. Recording the moment a circuit opens as a trace attribute lets you trace the cascade of a failure at a glance.
  • Dashboard design: lining up "RED (user impact) → SLO burn rate → USE (cause candidates)" on one screen gives an investigation flow that follows "in trouble? → how much → why" from top to bottom.

# logs.tf: always set a retention period on log groups
resource "aws_cloudwatch_log_group" "app" {
  name              = "/ecs/payment-api"
  retention_in_days = 30 # indefinite retention is a cost accident; set it explicitly per use case
}
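
The "sample probabilistically, but never drop errors" policy from the list above can be sketched as a single decision function. This is an illustration of the idea only, not OpenTelemetry's or the Collector's actual sampler; `FinishedTrace`, `traceIdFraction`, and `keepTrace` are hypothetical names.

```typescript
// sampling.ts: "probabilistic, but always keep errors" (illustrative sketch)
interface FinishedTrace {
  traceId: string; // 32 hex chars
  hasError: boolean; // true if any span in the trace ended with an error status
}

// Map the last 8 hex chars of the trace ID to [0, 1). The decision is deterministic,
// so every component makes the same choice for the same trace.
function traceIdFraction(traceId: string): number {
  return parseInt(traceId.slice(-8), 16) / 2 ** 32;
}

function keepTrace(t: FinishedTrace, ratio: number): boolean {
  if (t.hasError) return true; // tail-style rule: errors are always kept
  return traceIdFraction(t.traceId) < ratio; // otherwise keep roughly `ratio` of traces
}

const ok = { traceId: "4bf92f3577b34da6a3ce929dffffffff", hasError: false };
const failed = { traceId: "4bf92f3577b34da6a3ce929dffffffff", hasError: true };
console.log(keepTrace(ok, 0.1)); // false: outside the 10% sample
console.log(keepTrace(failed, 0.1)); // true: kept because it contains an error
```

Whether a trace contains an error is only known once its spans have finished, which is why this rule has to run after the fact (tail sampling on the Collector side) rather than at the start of the request.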

Source: Change log data retention in CloudWatch Logs / Sampling — OpenTelemetry


8. Testability: test observability itself

Because observability is first used "when a failure hits," if you don't verify its operation in normal times, the gaps surface at the very moment you need it.

  • Synthetic monitoring: with a CloudWatch Synthetics canary, periodically send simulated requests to the production URL and measure external (user-perspective) availability and latency as SLIs. This generates SLIs continuously even during hours with few real users, and it is also effective for detecting regressions right after a deploy.
  • Unit verification of instrumentation: OpenTelemetry has an InMemorySpanExporter, which lets you obtain and verify spans generated within a test as an array. You can pin "on payment success, one payment.charge span is generated and status=OK" in a unit test.
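
// payment.test.ts: verify the generated spans (assumes vitest)
import { afterEach, expect, test } from "vitest";
import { SpanStatusCode } from "@opentelemetry/api";
import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";
import {
  InMemorySpanExporter,
  SimpleSpanProcessor,
} from "@opentelemetry/sdk-trace-base";
import { charge } from "./payment";
// Note: stub callPaymentProvider (e.g. with vi.mock) so the test makes no real network call

const exporter = new InMemorySpanExporter();
const provider = new NodeTracerProvider({
  spanProcessors: [new SimpleSpanProcessor(exporter)],
});
provider.register();

// Clear collected spans so tests do not affect each other
afterEach(() => exporter.reset());

test("charge emits a single OK span", async () => {
  await charge("order-1", 1200);
  const spans = exporter.getFinishedSpans();
  expect(spans).toHaveLength(1);
  expect(spans[0].name).toBe("payment.charge");
  expect(spans[0].status.code).toBe(SpanStatusCode.OK);
  expect(spans[0].attributes["payment.amount_jpy"]).toBe(1200);
});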

Source: Using synthetic monitoring — Amazon CloudWatch / Testing — OpenTelemetry JS


FAQ

Q1. Should I run the ADOT Collector as a sidecar or as a standalone service (central collector)?
At small-to-medium scale, or if you want to start simple, a sidecar is the easy choice: it runs in the same task as the app, so sending over localhost keeps the network design simple. At large scale, when you want to manage sampling and aggregation centrally or optimize the number of Collectors, consider a centrally aggregated (gateway) configuration. You can migrate between the two in stages.

Q2. Should I export to X-Ray or OTLP?
If you want to stay AWS-native and use AWS features such as the X-Ray service map, use the X-Ray exporter. If you want to leave room for a future migration to another vendor's backend or to OSS (Jaeger, Grafana Tempo, etc.), emit OTLP and route on the Collector side. As long as you have instrumented with OpenTelemetry, you can change this decision in the Collector config without touching the app's code.

Q3. Isn't emitting logs and metrics separately double the work?
With EMF, you emit one JSON log line and CloudWatch Logs extracts the metric automatically. The extra PutMetricData API call becomes unnecessary, and because logs and metrics are generated from the same event, they stay consistent. Emitting RED metrics via EMF is efficient.

Q4. How do I decide the SLO number (99.9%, etc.)?
Start not from "the highest value technically achievable" but from "the lowest level at which users are still satisfied." An excessive SLO (99.999%, etc.) often raises the cost of achieving it steeply while making no perceptible difference to users. First measure the current SLI, start from a realistic target, and adjust while watching actual error-budget consumption.

Q5. Where do I start fixing an existing environment with too many alerts?
First, take stock of alerts that fired in the past 90 days but that no one responded to, and delete or demote them. Next, replace the remaining important alerts with symptom-based ones (SLO burn rate). Rather than deleting cause-based alerts outright, demote them from paging to a dashboard or ticket so you keep them as clues for investigation.


Summary: observability is "design," not "tool adoption"

  • Bind the three pillars of observability (logs / metrics / traces) into one with a correlation ID.
  • Write instrumentation with OpenTelemetry (vendor-neutral), and on ECS Fargate aggregate into an ADOT Collector sidecar. Switch the destination (X-Ray / CloudWatch / OSS) without changing the app.
  • Narrow metric targets with RED / USE and emit via EMF on the same path as logs.
  • Base alerts on SLO, error budget, and burn rate, and alert on symptoms, not causes. Combine multiple windows with composite alarms.
  • Design cost with sampling, retention, and cardinality, and verify observability itself with synthetics and instrumentation unit tests.

On the payment platform of a Minister of Economy, Trade and Industry Award-winning B2B SaaS, I implemented such observability, resilience, and IAM across the board, and achieved zero double charges in production with idempotency, atomic transactions, and zero-downtime migrations (API Gateway → NLB → ALB → ECS, operating 221 endpoints). And now, with "one person × generative AI (Claude Code)," I build such robust operational foundations fast and safely. AI is an accelerator, and before design decisions and production launches I always pass through a human verification gate.

"The logs are there but I can't tell the cause," "alerts fire so much they don't function"—if you have such observability/SRE challenges in production, feel free to reach out.




友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.
