# OpenTelemetry Production Observability Guide: Correlating Traces, Metrics, and Logs So You Can Spot a Stuck Process at a Glance

> An implementation guide for making production systems observable with OpenTelemetry. From the concepts of the three signals (traces / metrics / logs) and context propagation, to instrumenting FastAPI (Python) and Next.js (Node), the OTel Collector, head/tail sampling, log-to-trace correlation, PII scrubbing, and telemetry cost optimization — explained with official-spec-compliant, real code.

- Published: 2026-06-24
- Author: 友田 陽大
- Tags: Observability, OpenTelemetry, Architecture Design, Python, Next.js
- URL: https://tomodahinata.com/en/blog/opentelemetry-observability-production-tracing-metrics-logs
- Category: Observability & SRE

## Key points

- The heart of observability is correlating the three signals: notice an anomaly with metrics, pinpoint the location with traces, read the why in logs
- Keep span names low-cardinality (the kind of work) and put variable values like job.id into attributes
- A two-tier setup: cast a net over everything with zero-code instrumentation, and name only the domain concerns with manual spans
- Putting an OTel Collector in between lets you centralize vendor swap-out, sampling, and PII scrubbing, cutting off lock-in
- Don't put PII into telemetry; what correlation needs is not "who" but "the same person," so hash user.id

---

The scariest thing in production is not "going down." It's **"it's running but slow," "it's stuck somewhere, but I can't tell where."** Nothing shows up in the error logs. And yet users say "the process never comes back." You `grep` the logs, but you can't connect how a request passed through multiple services (front → API → worker → external API → DB).

What solves this "disconnectedness" is **Observability**, and its industry standard is **OpenTelemetry (OTel)**. This article is an implementation guide for making production systems observable with OpenTelemetry. As the subject matter, I'll weave in design decisions from an internal AI platform I built for a major domestic broadcaster ([a pipeline of long-running AI jobs](/case-studies/broadcaster-ai-content-platform)), where I made it possible to trace "which chunk, at which stage, the processing got stuck."

> **The rule of this article**: The sources for terms, APIs, and settings are based on the **OpenTelemetry official documentation (as of June 2026)**. Because SDKs, package names, and settings change quickly, always check the latest at the [official documentation](https://opentelemetry.io/docs/) before going to production. And the most important iron rule — **don't put PII (personal information) into telemetry**. The moment you put a personal name or contact info into an attribute, a log body, or a span name, the observability platform itself becomes a leak channel.

---

## 0. Why "logs alone" aren't enough

First, let's correct the mental model. The misconception that "if I emit logs, I'm observable" turns production-incident investigation into hell.

A log is a **point**. "Error at 14:03:12 in the payment service" — this is only a single-point fact. But an actual incident takes this shape:

> A user presses a button → a Next.js Route Handler calls FastAPI → FastAPI hits an external payment API → while waiting for that response, a DB lock gets stuck → as a result, the front gets "timeout after 30 seconds"

In this case, what you want to know is not "at which single point the error occurred," but **"in what order, through which services, and how many milliseconds in each, did one request pass."** This cannot be reconstructed from a set of points (logs). You need a **line (a trace)**.

OpenTelemetry deals with "outputs that describe a system's behavior = **Signals**." There are four primary signals defined by the official docs.

| Signal | Official definition | What it answers |
| --- | --- | --- |
| Traces | the path of a request through your application | "Where did this request go, and where did it eat up time?" |
| Metrics | a measurement captured at runtime | "Overall, are the error rate / latency / throughput healthy?" |
| Logs | a recording of an event | "At that moment, what concretely happened?" |
| Baggage | contextual information that is passed between signals | "I want to carry context like a tenant ID across the whole path" |

In addition, the official docs position **Events (a specific kind of log)** and **Profiles (code-level resource-usage records)** as under development.

Memorize the division of roles this way — **notice "there's an anomaly" with metrics, pinpoint "where" with traces, read "why" with logs.** Only when these three correlate does the opening "it's running but slow" become traceable at a glance.

You might wonder why you need all three. In a word, it's because **the trade-off between "resolution" and "cost" differs across the three**. Metrics are the cheapest (since they're aggregations of numbers) and can be viewed in full at all times, but their granularity is coarse. Logs are the most detailed, but their volume is enormous and, on their own, they lack context (which request). Traces are in between — they hold the path of a single request as a structure, but storing all of them is costly (which is why the sampling in Section 7 is needed). **Correlating these three layers** is OpenTelemetry's design philosophy, and any one alone won't complete a production incident investigation.

---

## 1. The minimal vocabulary: spans and traces

A trace is a tree structure of **spans**. One span = one unit of work (an HTTP handler, a DB query, an external API call, etc.); spans connect in parent-child relationships, forming a single trace overall.

In the official operational concepts, the minimal set you need for production is just this.

- **Span name**: the name of the work (e.g. `POST /jobs`, `transcribe_chunk`). Keep it low-cardinality (don't put IDs in it).
- **Attributes**: key-value pairs attached to a span (e.g. `job.id`, `chunk.index`, `http.response.status_code`). They become the axes for search and filtering.
- **Span Events**: point-in-time events that occurred within a span (e.g. "retry started").
- **Status**: `OK` / `ERROR`. Set `ERROR` on error.
- **record exception**: attach an exception to the span and leave a stack trace.
- **Span context**: the trace ID and span ID. This is **propagated**, so even across services it connects to the same trace (next section).

Here a design principle already kicks in. **Attributes DRY, span names KISS.** If you embed a high-cardinality value like `job.id` in a span name, the backend can't aggregate it and costs spike. Variable values go into **attributes**, and the name stays a low-cardinality value representing the **kind** — this is the iron rule.
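
To make the cardinality argument concrete, here is a minimal sketch in plain Python with no OTel SDK involved. The `SpanRecord` type and the job IDs are hypothetical and exist only for illustration; all the sketch does is count how many distinct span names a backend would have to group by under each naming style.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanRecord:
    """Hypothetical stand-in for a finished span: a name plus attributes."""

    name: str
    attributes: tuple[tuple[str, str], ...] = ()


def distinct_span_names(spans: list[SpanRecord]) -> int:
    """Count the distinct span names a backend would have to group by."""
    return len({span.name for span in spans})


job_ids = [f"job-{n}" for n in range(1000)]

# Anti-pattern: the ID is baked into the name, so every job mints a new name.
ids_in_name = [SpanRecord(name=f"transcribe_chunk {job_id}") for job_id in job_ids]

# Recommended: the name is the kind of work; the ID travels as an attribute.
ids_in_attributes = [
    SpanRecord(name="transcribe_chunk", attributes=(("job.id", job_id),))
    for job_id in job_ids
]

print(distinct_span_names(ids_in_name))        # 1000
print(distinct_span_names(ids_in_attributes))  # 1
```

With the ID in an attribute you can still filter down to one job, but "all `transcribe_chunk` spans" remains a single group that can be aggregated.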

---

## 2. Context propagation: how a trace stays connected across services

The heart of distributed tracing is **Context Propagation**. Once you understand this, it clicks why spans from separate services connect into one.

The official definitions are these.

- **Context**: an object that holds the information for a sender and a receiver to "correlate one signal with another signal." When service A calls service B, by passing the trace ID and span ID, B can create a new span that **belongs to the same trace and takes A's span as its parent**.
- **Propagation**: the mechanism that moves that Context between services and processes. It serializes/restores the Context object. Normally the instrumentation library handles it transparently.
- **Propagator**: what implements that movement. **The default propagation uses headers per the W3C TraceContext specification.**

In substance it's a single HTTP header, `traceparent`. The format is, per the official docs, `<version>-<trace-id>-<parent-id>-<trace-flags>`, and a concrete example takes this form.

```text
traceparent: 00-a0892f3577b34da6a3ce929d0e0e4736-f03067aa0ba902b7-01
```

On sending, service A **injects** its own context into the `traceparent` header, and on receiving, service B **extracts** it and puts it into its own local context. With this the parent-child relationship is established, and the spans of both services are linked into the same trace.
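
To see what "extract" has to do with that header, here is a small standard-library sketch that splits a `traceparent` value into its four fields. It is not the SDK's propagator; `parse_traceparent` and `TraceParent` are names made up for this example, and it only handles the four-field layout shown above.

```python
import re
from typing import NamedTuple, Optional

# version-trace_id-parent_id-flags, all lowercase hex (the version 00 layout)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed fields, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The W3C spec treats version ff and all-zero IDs as invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # Bit 0 of trace-flags is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


parsed = parse_traceparent("00-a0892f3577b34da6a3ce929d0e0e4736-f03067aa0ba902b7-01")
print(parsed)
```

In the example header, the trailing `01` means the upstream service marked the trace as sampled, which is how a sampling decision travels along with the IDs.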

A practically important point: **as long as you've installed instrumentation libraries for the HTTP client/server, this inject/extract happens automatically.** What you write by hand is only "manual spans within a service." However, **when crossing message queues, SSE, or custom protocols**, automatic propagation may not work, so for those parts alone be conscious of designing to manually carry the `traceparent` equivalent. In the long-running jobs described later, this was precisely the crux.
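
As a rough illustration of "manually carrying the `traceparent` equivalent," the sketch below puts the header into a message envelope on the producer side and reads it back on the consumer side. The envelope shape and the `wrap_message` / `unwrap_message` helpers are assumptions made for this example. In a real service you would more likely call the SDK's `inject` and `extract` functions from `opentelemetry.propagate` on a dict carrier rather than formatting the header by hand.

```python
import json
import secrets


def wrap_message(payload: dict, trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Producer side: put a traceparent next to the payload (the "inject" step)."""
    flags = "01" if sampled else "00"
    envelope = {
        "headers": {"traceparent": f"00-{trace_id}-{span_id}-{flags}"},
        "payload": payload,
    }
    return json.dumps(envelope)


def unwrap_message(raw: str) -> tuple[dict, dict[str, str]]:
    """Consumer side: read the traceparent back (the "extract" step).

    Returns the payload and the identity of a new child span that stays in
    the producer's trace and points at the producer's span as its parent.
    """
    envelope = json.loads(raw)
    _version, trace_id, parent_id, _flags = envelope["headers"]["traceparent"].split("-")
    child = {
        "trace_id": trace_id,             # same trace as the producer
        "parent_span_id": parent_id,      # the producer's span
        "span_id": secrets.token_hex(8),  # fresh 16-hex-char ID for this span
    }
    return envelope["payload"], child


raw = wrap_message(
    {"job_id": "job-42", "chunk_index": 3},
    trace_id="a0892f3577b34da6a3ce929d0e0e4736",
    span_id="f03067aa0ba902b7",
)
payload, child = unwrap_message(raw)
print(child["trace_id"], child["parent_span_id"])
```

The design point is that the trace context rides in message metadata, not in the business payload, so the consumer can restore it before doing any work.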

Here, let me also touch on the other propagation target, **Baggage**. Whereas the trace ID/span ID carry "the skeleton of the trace," baggage carries "application-specific context" (e.g. `tenant.id`, `request.priority`) across the whole path. For example, you might want to use a tenant ID determined at the entry point as a metric dimension in downstream services too; baggage helps in such cases. However, because **values put into baggage flow to each service via propagation headers**, putting PII here is strictly forbidden (this comes before the scrubbing in Section 6: just don't put it there in the first place). Remember baggage as "useful, but only for low-cardinality, non-sensitive values."
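
One way to enforce "only low-cardinality, non-sensitive values" is an allow-list applied before anything is written to the baggage header. The sketch below uses only the standard library and a simplified reading of the W3C `baggage` header (comma-separated `key=value` members, percent-encoded values). `ALLOWED_BAGGAGE_KEYS`, `encode_baggage`, and `decode_baggage` are hypothetical names, not part of the OTel API.

```python
from urllib.parse import quote, unquote

# Assumption for this sketch: only these keys are ever allowed to propagate.
ALLOWED_BAGGAGE_KEYS = frozenset({"tenant.id", "request.priority"})


def encode_baggage(entries: dict[str, str]) -> str:
    """Build a baggage header value, silently dropping keys not on the allow-list."""
    allowed = {k: v for k, v in entries.items() if k in ALLOWED_BAGGAGE_KEYS}
    return ",".join(f"{k}={quote(v, safe='')}" for k, v in sorted(allowed.items()))


def decode_baggage(header: str) -> dict[str, str]:
    """Parse a baggage header value back into a dict (member properties are ignored)."""
    result: dict[str, str] = {}
    for member in header.split(","):
        key, sep, value = member.partition("=")
        if not sep:
            continue  # skip empty or malformed members
        value = value.split(";", 1)[0]  # drop optional ";property" suffixes
        result[key.strip()] = unquote(value.strip())
    return result


header = encode_baggage(
    {"tenant.id": "acme", "request.priority": "high", "user.email": "a@example.com"}
)
print(header)                  # request.priority=high,tenant.id=acme
print(decode_baggage(header))  # {'request.priority': 'high', 'tenant.id': 'acme'}
```

The email never reaches the header, so no downstream service or intermediary ever sees it.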

---

## 3. Instrumenting FastAPI (Python)

There are **two paths** to instrumenting Python. Using both is the right answer.

### 3.1 Zero-code instrumentation (first, cast a net over everything)

The official "zero-code" approach adds auto-instrumentation without touching your app's code. It injects instrumentation into FastAPI, request libraries, DB drivers, and so on via **runtime monkey-patching**.

```bash
# The distro bundles the API, SDK, and the bootstrap/instrument tools; install the OTLP exporter as well
pip install opentelemetry-distro opentelemetry-exporter-otlp

# Scan the installed libraries and auto-install the matching instrumentation libraries
opentelemetry-bootstrap -a install
```

Launch is just wrapping your app's command with `opentelemetry-instrument`.

```bash
# Example: instrument and launch a FastAPI app served by uvicorn
opentelemetry-instrument \
    --service_name broadcaster-ai-api \
    --traces_exporter otlp \
    --metrics_exporter otlp \
    --exporter_otlp_endpoint http://otel-collector:4317 \
    uvicorn app.main:app --host 0.0.0.0 --port 8000
```

The same thing can also be set via environment variables (this is easier to handle in container operations).

```bash
export OTEL_SERVICE_NAME=broadcaster-ai-api
export OTEL_TRACES_EXPORTER=otlp
export OTEL_METRICS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
```

> CLI arguments and environment variables correspond one-to-one (`--service_name` ↔ `OTEL_SERVICE_NAME`, etc.). **Don't bake secrets or destinations into code; put them in env** — observability settings are no exception to this principle.

With this, automatic spans for "HTTP handlers, DB, external HTTP" start coming out. **First casting a net over everything** is the value of zero-code.

### 3.2 Manual instrumentation (naming the business concerns)

Auto-instrumentation only sees "technical boundaries." **Domain concerns** like "the OCR stage," "the speech-recognition stage," "the malware-scan stage" become visible only when you carve out spans yourself. The official manual instrumentation API (traces and metrics are Stable; logs are in Development) is used like this.

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

# Get the tracer once per module and reuse it (SRP: separate acquisition from use)
tracer = trace.get_tracer("broadcaster.ai.pipeline")


async def transcribe_chunk(job_id: str, chunk_index: int, audio: bytes) -> str:
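    # The span name is the kind of work. Variable values (job_id, etc.) go into attributes to keep cardinality low
    with tracer.start_as_current_span("transcribe_chunk") as span:
        # Attributes are the axes you filter on later. IDs go here, not in the span name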
        span.set_attribute("job.id", job_id)
        span.set_attribute("chunk.index", chunk_index)
        span.set_attribute("chunk.size_bytes", len(audio))

        try:
            span.add_event("asr.request.start")  # point-in-time event
            text = await call_asr(audio)
            span.add_event("asr.request.done")
            return text
        except Exception as ex:
            # On error, set the status and attach the exception (the stack trace is kept)
            span.set_status(Status(StatusCode.ERROR))
            span.record_exception(ex)
            raise
```

Nest them and they become parent-child spans, so the nesting of processing becomes the tree of the trace as-is.

```python
with tracer.start_as_current_span("process_job") as parent:
    parent.set_attribute("job.id", job_id)
    for i, chunk in enumerate(chunks):
        # This child span automatically gets process_job as its parent
        await transcribe_chunk(job_id, i, chunk)
```

> **The two-tier setup of zero-code (the net) + manual (the key spots)** is the production sweet spot. Use auto-instrumentation to eliminate gaps, and name only the business-critical stages by hand. This is neither "write everything by hand (DRY violation, exploding effort)" nor "auto only (the domain is invisible)" — it's the just-right middle.

### 3.3 Metrics with the same tracer mindset

Metrics capture "the overall trend." Following the official docs, a counter is created like this.

```python
from opentelemetry import metrics

meter = metrics.get_meter("broadcaster.ai.pipeline")

chunk_counter = meter.create_counter(
    "asr.chunks.processed",
    unit="1",
    description="Number of processed chunks",
)

# Slice dimensions with attributes (this makes the count aggregable per result)
chunk_counter.add(1, {"result": "ok"})
```

For "the distribution of processing time" use a histogram, and for instantaneous values like "the number of concurrently running jobs" use an observable gauge (observed via a callback). **Metrics are the most sensitive to cardinality.** If you put a value that grows infinitely, like `job.id`, into an attribute, the time series explodes and costs become unbounded. Limit metric attributes to a **finite set** (`result=ok|error`, `stage=ocr|asr`, etc.) — this is the line that prevents cost blowup.
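
The "time series explodes" claim is plain multiplication: in the worst case, one instrument produces one series per combination of attribute values. A minimal sketch of that arithmetic follows; the figure of 50,000 jobs is a made-up number for illustration.

```python
from math import prod


def max_series(distinct_values_per_attribute: dict[str, int]) -> int:
    """Upper bound on time series for one instrument: the product of distinct values."""
    return prod(distinct_values_per_attribute.values())


# Finite sets only: result=ok|error, stage=ocr|asr
bounded = max_series({"result": 2, "stage": 2})

# Same instrument with job.id added (assume 50,000 jobs over the retention window)
unbounded = max_series({"result": 2, "stage": 2, "job.id": 50_000})

print(bounded)    # 4
print(unbounded)  # 200000
```

The bounded version stays at 4 series no matter how many jobs run, while the `job.id` version grows with every new job.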

---

## 4. Instrumenting Next.js (Node)

This is the case where the front / BFF is Next.js (the Node runtime), just like this portfolio site. For Node too, think in the "auto-instrumentation + manual" two-tier setup.

### 4.1 Boot auto-instrumentation with NodeSDK

The key to the official Node setup is loading the SDK **before the app**.

```bash
npm install @opentelemetry/sdk-node \
  @opentelemetry/api \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/sdk-metrics \
  @opentelemetry/sdk-trace-node
```

Create `instrumentation.ts` and start the `NodeSDK` (the below is the official configuration; in production you replace the Console exporter with OTLP).

```ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { ConsoleSpanExporter } from "@opentelemetry/sdk-trace-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import {
  PeriodicExportingMetricReader,
  ConsoleMetricExporter,
} from "@opentelemetry/sdk-metrics";

const sdk = new NodeSDK({
  traceExporter: new ConsoleSpanExporter(), // replace with an OTLP exporter in production
  metricReader: new PeriodicExportingMetricReader({
    exporter: new ConsoleMetricExporter(),
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

For execution, have this file loaded before the app.

```bash
# TypeScript (Node v20+)
npx tsx --import ./instrumentation.ts app.ts

# JavaScript (ESM; the file is instrumentation.mjs)
node --import ./instrumentation.mjs app.js
```

> **A Next.js-specific note**: Next.js (App Router) has an official hook called `instrumentation.ts`, where `register()` is called at startup on the Node runtime. It's proper to place OTel initialization here. **However, instrumentation only works on the Node runtime** — behavior changes on the Edge runtime or on routes statically generated at build time. If you mistake "which routes run on Node," you'll get stuck with "no spans coming out," so keep the processing you want to instrument on the Node runtime (this site's rules, too, follow the policy of placing anything that needs server data on the RSC / Route Handler side).

### 4.2 Visualize "your own concerns" with manual spans

The official Node manual API is callback-style. Unlike Python's `with`, you must **always call `span.end()` yourself**; forget it and the span is never closed.

```ts
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("portfolio-bff", "1.0.0");

export async function submitContact(input: ContactInput): Promise<void> {
  return tracer.startActiveSpan("contact.submit", async (span) => {
    try {
      // Attributes are only axes for aggregation and search. No message body, email, or name (no PII)
      span.setAttribute("contact.phase", input.phase);
      await sendViaResend(input);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end(); // In Node, close the span explicitly; forgetting this leaks the span
    }
  });
}
```

Metrics can be written symmetrically too.

```ts
import { metrics } from "@opentelemetry/api";

const meter = metrics.getMeter("portfolio-bff", "1.0.0");
const submits = meter.createCounter("contact.submit.count");
submits.add(1, { phase: "discovery" }); // keep attributes to a finite set
```

Here too the principle is the same — **never put names, emails, phone numbers, or free-text bodies into attributes.** What you may put in is only **aggregable, non-personally-identifying labels** like a "phase tag" (this site's policy, too, is to leave nothing beyond a phase tag for the contact form's PII).

---

## 5. The OTel Collector: the linchpin for cutting off vendor lock-in

The apps so far send telemetry "somewhere" via OTLP. What **hides that "somewhere" from the app** is the **OpenTelemetry Collector**. In the official words, "a **vendor-agnostic** way to receive, process, and export telemetry."

Why put a Collector in between instead of sending from the app directly to the backend? The reasons the official docs recommend it for production are themselves the design advantages.

- **The app can offload quickly**: the app just hands telemetry to the Collector and is done. Retries, batching, encryption, and sensitive-data filtering are handled by the Collector.
- **You can swap vendors**: even if you change the backend "Cloud Monitoring → Datadog → Grafana," **all you change is one place — the Collector config**. The app's code is untouched. This is ETC (Easy To Change) itself.
- **Centralize cross-cutting processing**: sampling, PII scrubbing, attribute shaping can be consolidated in the Collector rather than scattered across each app (DRY).

A Collector's pipeline is composed of four kinds + extensions.

| Component | Role |
| --- | --- |
| Receivers | Receive telemetry (e.g. OTLP) |
| Processors | Transform, filter, batch, sample |
| Exporters | Send to the backend |
| Connectors | Connect pipelines to each other |
| Extensions | Add-on features such as health checks |

A minimal `config.yaml` is written like this. The flow of **receive → process → export** appears directly in `service.pipelines`.

```yaml
receivers:
  otlp:
    protocols:
      grpc:          # default OTLP gRPC (4317)
      http:          # OTLP HTTP (4318)

processors:
  memory_limiter:    # enforce a memory cap so the Collector itself doesn't crash (reliability)
    check_interval: 1s
    limit_mib: 512
  batch:             # send in batches to cut the number of exports and the cost

exporters:
  otlp/backend:
    endpoint: your-backend:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/backend]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/backend]
```

There are two deployment forms. **Agent** (co-located with each service) and **Gateway** (centrally aggregated). The standard is: Agent if starting small, Gateway if you want to apply cross-cutting policies (sampling, scrubbing). Many production setups assemble it in two tiers: "an Agent on each node → a central Gateway → the backend."

> In the broadcaster's platform, I **exported malware-scan results to Cloud Monitoring via OpenTelemetry**. The payoff of putting a Collector in between is that **the app doesn't need to know about GCP**. Whether the destination is Cloud Monitoring or another vendor, the app's code just "throws to the Collector via OTLP." You can decouple the observability exit from the business logic.

---

## 6. PII scrubbing: don't make telemetry a leak channel

This is not "nice to have" but **mandatory**. Because the observability platform is a hub into which all services' data flows, if PII rides in here, the damage in a leak is maximized. Defend it **doubly**.

**The first line = the app side (don't put it in).** As repeated in Section 4, **don't write personal information into attributes, span names, or log bodies in the first place**. Refer to people by `user.id` (an irreversible ID) instead of by name, leave email bodies out, and exclude free text entirely rather than summarizing it. The cheapest defense is "not putting it in from the start."

**The second line = the Collector side (strip it).** In case it still leaks, delete or mask attributes with the Collector's processors before export. The official docs list "sensitive data filtering" as a Collector role precisely for this second line.

```yaml
processors:
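  # Delete or hash sensitive attributes before they leave (the last line of defense at the exit)
  attributes/scrub:
    actions:
      - key: enduser.email
        action: delete          # drop emails outright
      - key: http.request.header.authorization
        action: delete          # never keep auth headers
      - key: user.id
        action: hash             # make needed IDs irreversible, keeping only correlation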
```

The design philosophy is clear — **what correlation needs is "whether it's the same person," not "who."** So if you hash `user.id`, correlation is preserved. Names and emails are utterly unneeded for observability, so don't put them in from the start, and delete them at the exit. This is the concrete form of "reconciling observability and privacy."
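
The same delete-or-hash policy can also be applied on the app side, before attributes are ever set on a span. The sketch below is an illustration using only the standard library. `scrub_attributes`, the two key sets, and the `salt` parameter are assumptions made for this example, and it is not meant to reproduce the Collector processor's exact hashing.

```python
import hashlib

# Assumed policy for this sketch, mirroring the Collector config above.
DELETE_KEYS = frozenset({"enduser.email", "http.request.header.authorization"})
HASH_KEYS = frozenset({"user.id"})


def scrub_attributes(attributes: dict[str, str], salt: str = "") -> dict[str, str]:
    """Drop sensitive keys and replace identifying keys with a SHA-256 digest."""
    scrubbed: dict[str, str] = {}
    for key, value in attributes.items():
        if key in DELETE_KEYS:
            continue  # never leaves the process
        if key in HASH_KEYS:
            # Same input -> same digest, so "the same person" still correlates.
            scrubbed[key] = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
        else:
            scrubbed[key] = value
    return scrubbed


first = scrub_attributes({"user.id": "u-1001", "enduser.email": "a@example.com"}, salt="s3")
second = scrub_attributes({"user.id": "u-1001", "contact.phase": "discovery"}, salt="s3")
print(first["user.id"] == second["user.id"])  # True
print("enduser.email" in first)               # False
```

A secret salt is worth considering because short, guessable IDs can be recovered from an unsalted hash by brute force. The salt has to be identical across services, or the correlation is lost.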

---

## 7. Sampling: the trade-off between cost and coverage

Storing all traces would be ideal, but at high traffic, **cost and storage break down**. So with sampling you keep only "a part that represents the whole." The official docs use "consider it if you exceed 1000 traces/sec" as one rule of thumb. The decision axis is the binary of **Head or Tail**.

| Aspect | Head-based sampling | Tail-based sampling |
| --- | --- | --- |
| Decision point | At trace start (early) | After the trace completes (after seeing all spans) |
| Representative method | Consistent Probability (probabilistic sampling based on trace ID, e.g. 5%) | Conditional sampling by error / latency / attributes |
| "Keep all errors" | **Can't guarantee** (because it decides before completion) | Can (e.g. always keeping errors is possible) |
| Implementation / operation | Simple, lightweight, efficient | Hard, stateful, resource-intensive |
| Suited scale | Small to medium, mostly healthy traffic | Large scale, where smart selection is needed |

The official core trade-off is this. **Head is light but can miss errors. Tail can target errors and latency, but it's stateful and heavy because it buffers all spans once.**

A practical prescription:

- **Start with Head (probabilistic sampling).** If you apply consistent trace-ID-based probabilistic sampling on the SDK side, the same trace is consistently "taken / not taken" across all services (preventing the accident of being sampled disjointly and the tree being incomplete). This is KISS, and sufficient for most medium-scale systems.
- **Add Tail at the stage of "I don't want to miss a single error or slowdown."** With the Collector's **Tail Sampling Processor**, you compose a policy like "100% of errors, 1% of normal." But because it's stateful and heavy, add it only after you're convinced "there are important traces that Head misses" (YAGNI — don't introduce a heavy mechanism preemptively).
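
The two prescriptions differ in what information the decision needs, which the following standard-library sketch tries to show. It illustrates the idea only and is not the SDK's or the Collector's implementation. `head_keep`, `tail_keep`, `FinishedTrace`, and the 10-second slowness threshold are all assumptions. In practice the Python SDK exposes head sampling through settings such as `OTEL_TRACES_SAMPLER=parentbased_traceidratio`, and tail sampling lives in the Collector.

```python
from dataclasses import dataclass

_LOW_64_BITS = (1 << 64) - 1


def head_keep(trace_id_hex: str, ratio: float) -> bool:
    """Head sampling: decide from the trace ID alone, before any span has finished.

    Every service that sees the same trace ID reaches the same answer,
    so a trace is kept or dropped as a whole.
    """
    bound = round(ratio * (1 << 64))
    return (int(trace_id_hex, 16) & _LOW_64_BITS) < bound


@dataclass(frozen=True)
class FinishedTrace:
    """What a tail sampler knows only after buffering the whole trace."""

    trace_id_hex: str
    has_error: bool
    duration_ms: float


def tail_keep(finished: FinishedTrace, slow_ms: float = 10_000, baseline_ratio: float = 0.01) -> bool:
    """Tail sampling: keep every error and slow trace, plus ~1% of the rest."""
    if finished.has_error or finished.duration_ms >= slow_ms:
        return True
    return head_keep(finished.trace_id_hex, baseline_ratio)


trace_id = "a0892f3577b34da6a3ce929d0e0e4736"
print(head_keep(trace_id, 0.5))                           # False
print(head_keep(trace_id, 0.7))                           # True
print(tail_keep(FinishedTrace(trace_id, True, 120.0)))    # True (error)
print(tail_keep(FinishedTrace(trace_id, False, 120.0)))   # False (healthy, not in the 1%)
```

`head_keep` needs nothing but the ID, which is why it is cheap. `tail_keep` needs `has_error` and `duration_ms`, which exist only once all spans have arrived, and that is where the statefulness and memory cost come from.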

An example of probabilistic sampling on the Collector side:

```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 10   # sample 10% consistently (decided per trace ID)

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, probabilistic_sampler, batch]
      exporters: [otlp/backend]
```

> **Cost isn't determined by sampling alone.** Metric attribute cardinality, log verbosity, span granularity — all of these affect billing. "Observe everything at the highest precision" is YAGNI. **Thick for paths related to the SLO, thin for the rest** — once you regard observability too as a problem of investment allocation, cost becomes controllable.
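
For a feel of the volumes involved, here is back-of-the-envelope arithmetic for the 1000 traces/sec rule of thumb mentioned earlier. It counts traces only; spans per trace, payload size, and vendor pricing vary too much to assume here. `traces_stored_per_day` is a helper name invented for this sketch.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def traces_stored_per_day(traces_per_second: float, sampling_percentage: float) -> int:
    """Traces kept per day at a steady rate and a fixed sampling percentage."""
    return round(traces_per_second * SECONDS_PER_DAY * sampling_percentage / 100)


print(traces_stored_per_day(1000, 100))  # 86400000
print(traces_stored_per_day(1000, 10))   # 8640000
print(traces_stored_per_day(1000, 5))    # 4320000
```

Even at 5%, a steady 1000 traces/sec still leaves over four million traces a day, which is why attribute cardinality and span granularity matter alongside the sampling rate.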

---

## 8. Correlating logs and traces: the moment incident investigation transforms

This is the most rewarding part of this article. **Put a trace ID on every single log line.** This alone dramatically changes the investigation experience.

The production observability operational flow goes like this.

1. **Metrics**: the dashboard notices "`asr.chunks.processed{result=error}` is surging."
2. **Traces**: open the ERROR trace for that time window. In the `process_job → transcribe_chunk` tree, you see at a glance that **the 4th chunk hung for 13 seconds and then timed out on the external API call**.
3. **Logs**: cross-search the logs of all services with that span's trace ID. The logs of FastAPI, the worker, and the external call line up **for just that one request**. The work of picking up points with `grep` disappears.

The implementation is simple. Take the current span's trace ID from the OTel context and attach it to structured logs.

```python
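import logging

from opentelemetry import trace

logger = logging.getLogger(__name__)


def log_context() -> dict[str, str]:
    """Read trace_id / span_id from the current span so logs can be correlated with it."""
    span = trace.get_current_span()
    ctx = span.get_span_context()
    if not ctx.is_valid:
        return {}
    return {
        "trace_id": format(ctx.trace_id, "032x"),
        "span_id": format(ctx.span_id, "016x"),
    }


def log_chunk_failure(chunk_index: int) -> None:
    # Always merge the trace context into structured logs (never put PII in the message)
    logger.info("asr chunk failed", extra={**log_context(), "chunk.index": chunk_index})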
```
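
Passing `log_context()` by hand on every call is easy to forget. One possible alternative is a `logging.Filter` that stamps the IDs onto every record. The sketch below is standard library only: `TraceContextFilter` is a hypothetical helper, and `fake_ids` stands in for the `log_context()` function above so the example runs without the OTel SDK installed.

```python
import logging
from typing import Callable


class TraceContextFilter(logging.Filter):
    """Stamp trace_id / span_id onto every record that passes through."""

    def __init__(self, get_ids: Callable[[], dict[str, str]]) -> None:
        super().__init__()
        self._get_ids = get_ids

    def filter(self, record: logging.LogRecord) -> bool:
        ids = self._get_ids()
        # Empty strings when there is no active span, so the format string never fails.
        record.trace_id = ids.get("trace_id", "")
        record.span_id = ids.get("span_id", "")
        return True  # never drop the record, only enrich it


def fake_ids() -> dict[str, str]:
    """Stand-in for log_context(); returns fixed IDs for the demo."""
    return {"trace_id": "a0892f3577b34da6a3ce929d0e0e4736", "span_id": "f03067aa0ba902b7"}


handler = logging.StreamHandler()
handler.addFilter(TraceContextFilter(fake_ids))  # in a real app: TraceContextFilter(log_context)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s")
)

demo_logger = logging.getLogger("demo.correlation")
demo_logger.addHandler(handler)
demo_logger.warning("asr chunk failed")
```

Attaching the filter to the handler rather than to one logger means records from every logger that reaches that handler get the IDs.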

With this, "pinpoint the bottleneck with a trace → jump to the logs by that trace ID → read the causal exception" becomes seamless. **Metrics (something's wrong) → Traces (it's here) → Logs (this is the reason)** — only when the three signals are stitched together by the trace ID does observability become a tool for "solving incidents" rather than "staring at dashboards."
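
Once every line carries a `trace_id`, the "cross-search" step is just a filter plus a sort. Here is a minimal sketch over JSON log lines. The field names (`timestamp`, `service`, `trace_id`) and the sample lines are assumptions; a real log backend would run this as a query rather than in application code.

```python
import json


def logs_for_trace(log_lines: list[str], trace_id: str) -> list[dict]:
    """Return the JSON log records of one trace, ordered by timestamp."""
    matched = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines instead of failing the whole search
        if isinstance(record, dict) and record.get("trace_id") == trace_id:
            matched.append(record)
    # ISO 8601 timestamps in the same format sort correctly as strings.
    return sorted(matched, key=lambda r: r.get("timestamp", ""))


TRACE = "a0892f3577b34da6a3ce929d0e0e4736"
lines = [
    '{"timestamp": "2026-06-24T14:03:12Z", "service": "worker", "trace_id": "%s", "message": "asr chunk failed"}' % TRACE,
    '{"timestamp": "2026-06-24T14:03:01Z", "service": "api", "trace_id": "%s", "message": "job accepted"}' % TRACE,
    '{"timestamp": "2026-06-24T14:03:05Z", "service": "api", "trace_id": "ffffffffffffffffffffffffffffffff", "message": "other request"}',
    "plain text line without structure",
]

for record in logs_for_trace(lines, TRACE):
    print(record["timestamp"], record["service"], record["message"])
```

The output interleaves the API and worker logs for that single request in time order, which is the view `grep` could not give you.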

---

## 9. Observability underpins reliability: where does a long-running job get stuck?

Finally, let me close with a real example of how all of this affects **Reliability**.

The core of the broadcaster's platform was **long-running AI jobs**. The stack was FastAPI (async) + Cloud Run Jobs + Cloud Workflows. We parallelized OCR and speech recognition with Cloud Workflows, **shortening processing time by about 30% (sequential 18 min → parallel 13 min)**. Progress was delivered in near-real-time with Firestore snapshots + SSE.

For this "parallelized, 13-minute job," what's always asked is — **"Of the 13 minutes, which stage is dominant?" "If it fails, at which chunk, at which stage, does it get stuck?"** This can't be answered without observability.

- Looking at the span tree, you see at a glance, by time width, **which of the OCR stage and the ASR stage is the bottleneck**. Verification of the parallelization effect, too, is backed by traces rather than guesswork.
- Because spans are carved out per chunk, biases like **"only the 4th chunk is slow every time"** surface. You can separate whether it's a problem on the input-data side or the external-API side.
- Malware-scan results are exported to Cloud Monitoring via OpenTelemetry. You can discern by signal **"was it rejected by the scan, or did it get stuck in processing?"**
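
The first two questions above boil down to simple aggregations over span durations. The sketch below shows them on made-up numbers. `StageSpan` and the durations are hypothetical and are not measurements from the actual platform; a tracing backend would normally compute this for you from the `stage` and `chunk.index` attributes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class StageSpan:
    """Hypothetical flattened view of one per-chunk span."""

    stage: str        # e.g. "ocr" or "asr"
    chunk_index: int  # 0-based
    duration_s: float


def total_by_stage(spans: list[StageSpan]) -> dict[str, float]:
    """Sum span durations per stage to see which stage dominates."""
    totals: dict[str, float] = defaultdict(float)
    for span in spans:
        totals[span.stage] += span.duration_s
    return dict(totals)


def slowest_span(spans: list[StageSpan]) -> StageSpan:
    """Find the single slowest chunk-level span."""
    return max(spans, key=lambda span: span.duration_s)


spans = [
    StageSpan("ocr", 0, 40.0), StageSpan("ocr", 1, 42.0),
    StageSpan("ocr", 2, 41.0), StageSpan("ocr", 3, 39.0),
    StageSpan("asr", 0, 60.0), StageSpan("asr", 1, 58.0),
    StageSpan("asr", 2, 61.0), StageSpan("asr", 3, 190.0),
]

print(total_by_stage(spans))  # {'ocr': 162.0, 'asr': 369.0}
print(slowest_span(spans))    # StageSpan(stage='asr', chunk_index=3, duration_s=190.0)
```

In this made-up data the ASR stage dominates, and within it the 4th chunk (index 3) is the outlier. That is the kind of bias you can only see when spans are carved out per chunk.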

The design crux that paid off here is **the manual supplementing of propagation** touched on in Section 2. When crossing Cloud Run Jobs / Workflows / SSE, traces tend to break with HTTP's automatic propagation alone. By intentionally carrying the context (trace ID) across the stage boundaries, I kept **"one job = one trace."** It's precisely because of this that "at which chunk, at which stage, it stopped" can be traced at a glance.

Let me connect just one line to the context of reliability. If you **decide on an SLO (e.g. job success rate 99%, p95 processing time 15 min) and set an error budget**, then you measure its attainment with metrics, and dig into the causes of deviation with traces and logs. **Observability is the foundation that turns an SLO into something "measurable and keepable."** What you can't observe, you can't promise (an SLO) either.
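
The error-budget idea is also just arithmetic on a success-rate SLO. A minimal sketch, using the 99% job success rate from the example above and invented job counts; `error_budget_remaining` is a helper name made up for illustration.

```python
def error_budget_remaining(slo_target: float, total_jobs: int, failed_jobs: int) -> float:
    """Fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means exactly exhausted, negative means overspent.
    """
    allowed_failures = (1.0 - slo_target) * total_jobs
    if allowed_failures <= 0:
        raise ValueError("the SLO leaves no room for failures in this window")
    return 1.0 - failed_jobs / allowed_failures


# 99% success over 10,000 jobs allows about 100 failures in the window.
print(round(error_budget_remaining(0.99, 10_000, 25), 2))   # 0.75
print(round(error_budget_remaining(0.99, 10_000, 150), 2))  # -0.5
```

The inputs (`total_jobs`, `failed_jobs`) are exactly what a counter with a `result=ok|error` attribute gives you, which is how the metric and the SLO connect.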

---

## 10. Summary: a production observability cheat sheet

A quick-reference list for when you're unsure.

- **The roles of the three signals**: notice with metrics → pinpoint the location with traces → read the reason in logs.
- **The basics of traces**: span names low-cardinality (the kind), variable values (IDs) into attributes.
- **Propagation**: for HTTP, the instrumentation library auto-injects/extracts `traceparent`. Manually supplement only for MQ / SSE / custom paths.
- **Python**: cast a zero-code net with `opentelemetry-distro`, and manually instrument only the key spots with `start_as_current_span`.
- **Node/Next.js**: auto with `NodeSDK` + `getNodeAutoInstrumentations()`, manual with `startActiveSpan` (don't forget `span.end()`). Keep the processing to be instrumented on the Node runtime.
- **Collector**: the app just throws via OTLP. Consolidate vendor swap-out, sampling, and scrubbing → avoid lock-in (ETC).
- **PII**: the first line = don't put it in, the second line = strip/hash at the Collector. What correlation needs is not "who" but "the same person."
- **Sampling**: first Head (probabilistic). Add Tail at the stage where you can't miss a single error or slowdown (YAGNI).
- **Correlation**: put trace_id on every log. This alone changes investigation from `grep` to "follow the trace."
- **Reliability**: the foundation for measuring SLOs / error budgets. What you can't observe, you can't protect.

Observability is not "installing a tool" but **"designing a state in which, when an incident occurs, you can trace at a glance where and what got stuck."** With the broadcaster's internal AI platform, I made long-running AI jobs observable with OpenTelemetry and assembled it into a production system where "at which chunk, at which stage, it stopped" can be followed with traces. Accelerating the implementation with generative AI (Claude Code) while tightening **the design gates that humans should judge** — like PII scrubbing and the sampling policy — by my own hand: with this combination, I introduce observability fast, cheap, and safe.

**"I want my company's system to be in a state where I can trace it when it stops" — from that design through instrumentation and Collector operation, I can accompany you end to end.** Even from a consultation about retrofitting instrumentation onto an existing system, feel free to reach out.

---

### Reference (official documentation)

- [Signals (the concept of signals)](https://opentelemetry.io/docs/concepts/signals/) — definitions of Traces / Metrics / Logs / Baggage
- [Context propagation](https://opentelemetry.io/docs/concepts/context-propagation/) — Context / Propagation / Propagator and the W3C traceparent
- [OpenTelemetry Python](https://opentelemetry.io/docs/languages/python/) / [Python Instrumentation](https://opentelemetry.io/docs/languages/python/instrumentation/) — manual instrumentation API, metrics
- [Python Zero-code instrumentation](https://opentelemetry.io/docs/zero-code/python/) — `opentelemetry-distro` / `opentelemetry-bootstrap` / `opentelemetry-instrument` and environment variables
- [OpenTelemetry JavaScript](https://opentelemetry.io/docs/languages/js/) / [Node.js Getting Started](https://opentelemetry.io/docs/languages/js/getting-started/nodejs/) / [JS Instrumentation](https://opentelemetry.io/docs/languages/js/instrumentation/) — `NodeSDK`, manual instrumentation API
- [Sampling](https://opentelemetry.io/docs/concepts/sampling/) — Head vs Tail and the trade-offs
- [Collector](https://opentelemetry.io/docs/collector/) — the receivers / processors / exporters pipeline and its operational value