Category
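Practical Guides to Observability and SRE (OpenTelemetry / SLO)
Observability is not about emitting logs; it is about being able to trace a stuck process at a glance. This category covers how to run production reliability by the numbers: correlating the three pillars (logs, metrics, traces) with OpenTelemetry, carrying correlation IDs through structured logs, making decisions with SLOs and error budgets, and alerting on symptoms rather than causes.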
3 articles in total
Foundational guide (start here)
OpenTelemetry Production Observability Guide: Correlating Traces, Metrics, and Logs So You Can Spot a Stuck Process at a Glance
An implementation guide to making production systems observable with OpenTelemetry. It covers the three signals (traces, metrics, logs) and context propagation, instrumentation of FastAPI (Python) and Next.js (Node), the OTel Collector, head and tail sampling, log-to-trace correlation, PII scrubbing, and telemetry cost optimization, all with real code that follows the official specification.
Related practical articles
- Architecture design, AWS, TypeScript, Serverless
A practical guide to incident response 2026: designing Incident Commander, Runbooks, postmortems, and on-call the SRE way
How to build a team that is resilient to production failures, grounded in Google's published SRE guidance. The article covers the Incident Commander model, SEV1 to SEV4 severity design, a detect → mitigate → verify → communicate Runbook template, blameless postmortems, on-call hygiene (reducing toil and alert fatigue), MTTD/MTTR, and error budgets, with real code and templates that treat operations as part of the design.
20 min read
- AWS, Observability, OpenTelemetry, SRE, ECS
AWS ECS Fargate SRE Practical Guide: ADOT Distributed Tracing, EMF Metrics, and SLO / Error Budget / Burn-Rate Alert Design
Using production operation of ECS Fargate as the running example, this definitive observability and SRE guide explains distributed tracing with OpenTelemetry/ADOT, JSON structured logs and correlation IDs, EMF custom metrics, RED/USE, SLOs, error budgets, burn-rate alerts, composite alarms, and sampling design, with real code (TypeScript/Terraform) that follows the official documentation.
20 min read