An internal AI platform supporting program production at a major Japanese broadcaster (built a multi-service foundation and auth hub)
A self-built OIDC auth hub (BFF) binding 5 AI services with Google Workspace SSO, broadcast-quality speech synthesis, a caption typo-detection pipeline cross-checking OCR × speech recognition, and generative-AI compliance review — built on GCP with IaC
Client
An internal content-production AI platform for a major Japanese TV broadcaster (company undisclosed, NDA) | Domain: broadcast program-production workflows (AI voice narration of scripts, automatic typo detection in captions, generative-AI content/compliance review, malware scanning of uploaded assets) | Form: internal multi-tenant SaaS (offered to all staff via Google Workspace SSO) | Architecture: a microservices monorepo on GCP (a shared BFF auth hub + multiple AI-tool services)
My role
Technical architect and full-stack developer. Implemented across frontend, backend, GCP infrastructure (Terraform), CI/CD, and observability — from the shared foundation (BFF auth hub, OIDC provider, back-channel logout, PII encryption) to the speech-synthesis service (ElevenLabs / Google Chirp3), the ML/OCR caption typo-detection pipeline (FastAPI + Python + Cloud Workflows), generative-AI compliance review (Gemini 2.5 / Vertex AI), and the malware scanner (ClamAV on Cloud Run). Unified each service's caching, auth, logging, and Terraform modules under a single convention, building it as a horizontally scalable platform.
Challenge (Situation & Task)
A broadcaster's production floor has many highly specialized, easily siloed tasks — narration recording, caption proofreading, and compliance review (broadcast-standard checks). Several AI tools to support these had to be delivered while simultaneously satisfying competing requirements: (1) staff use them across the board with a single login, (2) they meet the broadcaster's internal-control and information-security standards, (3) each tool can be deployed and improved independently, and (4) heavy, expensive AI processing (GPU/LLM/speech synthesis) runs stably at production quality. Not a one-off PoC, but an "always-on internal platform" embedded in the company-wide production workflow.
Building an internal AI platform for a broadcaster brought several enterprise-specific difficulties together:
- Integrating multiple AI tools with SSO: building tools of different natures (speech synthesis, caption typo detection, generative-AI review, video summarization) separately fragments auth, permissions, and operations. Starting from Google Workspace SSO, an "auth hub" was needed to single-sign-on into each tool securely while separating permissions and deployments per tool.
- Broadcaster-grade security and internal control: even for an internal system, authorization bypass, PII leakage, and impersonation are unacceptable. Token lifetimes, audit logs, PII encryption, WAF, MFA, least-privilege IAM, and keyless CI/CD had to be built in as preconditions, forming a multi-layered defense that withstands internal control.
- Production operation of heavy, expensive AI processing: caption typo detection is a long-running job combining OCR (images) and speech recognition (matching captions to speech); LLM OCR is accurate but costly and slow. Speech synthesis demands broadcast quality (sample rate, fades, pronunciation dictionaries). These had to run cost-controlled, able to complete and resume.
- Safe asset intake (a zero-trust entry point): externally supplied video/image assets had to be malware-scanned before reaching the platform and routed clean/quarantined. Scanning had to avoid exhausting memory even for multi-GiB assets and be idempotent under retries.
Why these technologies (Rationale)
Unify the BFF and each tool's UI on Next.js 16 / React 19 (App Router · RSC): the auth hub (BFF) keeps sensitive logic server-side and runs IP-allowlist + session validation in Edge Runtime middleware. Each tool follows the same Next.js convention to minimize learning cost and review burden.
Auth is a "self-built OIDC provider + per-tool short-lived JWT": starting from Google Workspace SSO (NextAuth.js v5 / Identity Platform), the BFF issues short-lived tokens scoped to each tool's audience (10-minute access token, 1-hour ID token). PKCE S256 is required, structurally eliminating cross-tool impersonation and auth-code interception.
Split heavy AI work into jobs + workflows: caption typo detection runs on FastAPI (async) + Cloud Run Jobs + Cloud Workflows. OCR and speech recognition run in parallel, and long jobs are decoupled from HTTP to guarantee progress, resumption, and idempotency. Speech synthesis places two providers (ElevenLabs / Google Chirp3) behind a provider abstraction, with a dummy mode for testing without API cost.
Generative-AI compliance review on Gemini 2.5 + grounding: per-criterion prompts (broadcast standards, reporters' handbooks, etc.) assess risks in scripts, drafts, and storyboards, attaching grounded citations so the rationale is traceable. The API-key approach and the Vertex AI (GCP service-account) approach are abstracted and switched by control and cost.
Consolidate data on GCP managed services: Cloud SQL (PostgreSQL / MySQL) runs with IAM auth, required TLS (ENCRYPTED_ONLY), and private IP; cache is Memorystore (Redis); real-time progress is Firestore; assets are Cloud Storage. No dedicated VMs — leaning on managed services to reduce operational load and cost.
Infrastructure is 100% Terraform (IaC): VPC through Cloud Run, Cloud SQL, Cloud Armor, Secret Manager, Identity Platform, and Workflows codified in ~71 modules. Separates stg/prod state, and separates responsibilities — Cloud Build owns "container image and latest env," Terraform owns "infrastructure config" — to prevent drift.
CI/CD is keyless via Workload Identity Federation (OIDC): GitHub Actions authenticates to GCP without issuing service-account keys. Cloud Build separates stg/prod, DB migrations are split into a dedicated job, and CodeQL and dependency updates are automated.
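To make the PKCE S256 requirement mentioned above concrete, here is a minimal Python sketch of the challenge derivation and the token-endpoint check as defined in RFC 7636. The function names are hypothetical, and the real hub is described as a Next.js BFF, so this illustrates the mechanism rather than the project's code.

```python
import base64
import hashlib
import hmac
import secrets


def make_code_verifier() -> str:
    """Client side: a random URL-safe verifier (43 characters for 32 bytes)."""
    return secrets.token_urlsafe(32)


def s256_challenge(code_verifier: str) -> str:
    """BASE64URL(SHA256(verifier)) without padding, as RFC 7636 specifies."""
    digest = hashlib.sha256(code_verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def verify_pkce(stored_challenge: str, method: str, code_verifier: str) -> bool:
    """Token-endpoint check: accept only S256 and compare in constant time."""
    if method != "S256":  # "plain" is refused outright
        return False
    return hmac.compare_digest(stored_challenge, s256_challenge(code_verifier))
```

Because the authorization code is only redeemable together with the verifier that produced the stored challenge, an intercepted code on its own is useless to an attacker.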
What I did (Action)
[Self-built OIDC auth hub (BFF)] Starting from Google Workspace SSO, the BFF issues audience-scoped short-lived JWTs to each tool (10-minute access token, 1-hour ID token). The authorization flow requires PKCE S256, and token hand-off uses an auto-POSTed HTML form (not the URL) plus an HMAC signature over the originating origin to prevent man-in-the-middle tampering. A per-device session ID carries sidGeneration and entitlementsVersion, enabling per-device revocation and detection of permission changes.
[Back-channel logout and permission sync] Built a mechanism to asynchronously deliver logout and permission changes to each tool. Events are HMAC-SHA256-signed, retried up to 3 times with exponential backoff (1s/5s/30s), and deduplicated/replay-protected by a (eventId, toolId) unique constraint. Tools stay loosely coupled while company-wide logout propagates reliably.
[PII encryption and audit logs] External users' email, phone, and name are stored encrypted with AES-256-GCM, and search uses HMAC-SHA256 tokens for partial matching without decryption. Secret comparison uses constant-time comparison (timingSafeEqual) to prevent timing attacks. Session creation, revocation, and renewal are recorded in audit logs to provide internal-control evidence.
[Caption typo-detection pipeline] Captions are extracted from video, local OCR detects caption "transitions," and LLM OCR is applied only to the set of unique captions (not every frame). In parallel, speech recognition transcribes the audio and is cross-checked against the OCR results to detect typos, NG words, and proper-noun mismatches. Running OCR and speech recognition in parallel on Cloud Workflows cut processing time by ~30% (sequential 18 min → parallel 13 min).
[Idempotent, resumable long jobs] Each segment is assigned a global caption ID = segment_index × stride + local_id, so IDs stay stable through downstream merge under parallelism and are idempotent on re-run. Progress is delivered to the UI near-real-time via a Firestore snapshot subscription + SSE, with resolve_monotonic_progress_percent preventing progress from going backward. Uploads use chunked parallel signed URLs (up to 8 concurrent) for large assets.
[Broadcast-quality speech synthesis] Two TTS systems (ElevenLabs / Google Chirp3) are unified behind a provider abstraction, with output as MP3 44.1kHz/192kbps and FFmpeg fade normalization. It supports pronunciation dictionaries and text preprocessing (kanji conversion, whitespace removal), and scripts use optimistic locking (a version column) to detect concurrent-edit conflicts. dummyMode tests the synthesis path without hitting real APIs, and usage logs (voice ID, character count, status) are fully recorded to make cost observable.
[Asset malware scanning (zero-trust entry)] Uploaded assets are passed to ClamAV (Cloud Run) via Eventarc/GCS events, streaming-scanned up to 10GiB (without buffering, to avoid memory exhaustion) and routed to clean/quarantine buckets. Intermediate chunks of a GCS composite upload are skipped by regex; zero-length, in-progress, and deleted files are safely ignored. The atomicity of File.move makes it idempotent under retries, and scan results are emitted to Cloud Monitoring via OpenTelemetry.
[IaC, security, CI/CD] The whole of GCP is codified in ~71 Terraform modules. Multi-layered defense with Cloud Armor (OWASP CRS 3.3 + adaptive DDoS protection + rate limiting), Cloud SQL (IAM auth, required TLS, private IP), Identity Platform (SMS MFA, reCAPTCHA Enterprise), Secret Manager (latest-version only), and least-privilege service accounts. CI/CD authenticates keylessly via Workload Identity Federation (OIDC), with the WAF fully enabled in stg to eliminate false positives before production.
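The progress guard named above, resolve_monotonic_progress_percent, is only mentioned by name in the text. The sketch below is one plausible shape for it, assuming it receives the last value shown to the user and a freshly reported value; the signature and the clamping to 0..100 are assumptions for illustration.

```python
from typing import Optional


def resolve_monotonic_progress_percent(previous: Optional[int], reported: float) -> int:
    """Clamp the reported value to 0..100 and never go below the last shown value."""
    clamped = max(0, min(100, int(reported)))
    if previous is None:  # first update for this job
        return clamped
    return max(previous, clamped)


# Parallel workers can report out of order; the guard smooths that out.
shown = None
history = []
for reported in [10, 40, 35, 120, 90]:
    shown = resolve_monotonic_progress_percent(shown, reported)
    history.append(shown)
```

With out-of-order or retried reports, the UI then sees a value that only stays flat or rises.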
The platform's design philosophy was to "bind multiple AI tools under a single auth/operations convention, at a quality that withstands a broadcaster's controls."
Loosely coupled integration via an auth hub: Giving each tool its own login makes auth quality vary per tool and breaks propagation of permission changes and logout. So the BFF is the sole identity provider, and tools only receive audience-scoped short-lived JWTs. Token hand-off uses an auto-POST form + origin HMAC signature instead of the URL, eliminating token leakage through redirects. Logout and permission changes are delivered over the back channel — signed, retried, and deduplicated — so tools stay loosely coupled yet follow company-wide session control.
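The back-channel delivery described above (signed, retried, deduplicated) can be sketched in a few lines of Python. Everything here is illustrative: the class and function names, the canonical-JSON payload, the event fields other than eventId, and the reading of "up to 3 retries" as one initial attempt followed by retries after 1s, 5s, and 30s are assumptions, and the in-memory set stands in for the (eventId, toolId) unique constraint in a database.

```python
import hashlib
import hmac
import json
import time

RETRY_DELAYS_SECONDS = (1, 5, 30)  # backoff schedule quoted in the text


def sign_event(secret: bytes, event: dict) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


class ToolReceiver:
    """Receiving tool: verify the signature, then dedupe on (eventId, toolId)."""

    def __init__(self, tool_id: str, secret: bytes) -> None:
        self.tool_id = tool_id
        self.secret = secret
        self.seen = set()              # stand-in for a DB unique constraint
        self.revoked_sessions = set()

    def handle(self, event: dict, signature: str) -> str:
        if not hmac.compare_digest(sign_event(self.secret, event), signature):
            return "rejected"          # tampered or signed with the wrong key
        key = (event["eventId"], self.tool_id)
        if key in self.seen:
            return "duplicate"         # retry or replay: already applied
        self.seen.add(key)
        self.revoked_sessions.add(event["sid"])
        return "applied"


def deliver(send, event: dict, secret: bytes, sleep=time.sleep) -> bool:
    """One attempt, then a retry after each delay; safe because receivers dedupe."""
    signature = sign_event(secret, event)
    for delay in (0, *RETRY_DELAYS_SECONDS):
        if delay:
            sleep(delay)
        try:
            send(event, signature)
            return True
        except ConnectionError:
            continue                   # transient failure: try the next delay
    return False
```

The point of combining retries with deduplication is that the sender can retry freely without knowing whether an earlier attempt reached the tool.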
Making heavy AI work "complete, resumable, idempotent": Caption typo detection combines heavy OCR (images) and speech recognition (speech). Applying LLM OCR to every frame would blow up cost, so local OCR detects caption transitions and LLM is applied only to unique captions. OCR and speech recognition have no mutual dependency, so Cloud Workflows parallelizes them, cutting time by ~30%. Long jobs are segmented, and deterministic global caption IDs make the final result converge uniquely under parallelism, partial re-runs, or retries.
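The deterministic ID scheme above (global caption ID = segment_index × stride + local_id) is easy to check with a small Python sketch. The stride value and the helper names are assumptions; the only requirement is that the stride exceeds the largest number of captions any one segment can produce.

```python
STRIDE = 10_000  # assumed upper bound on captions per segment


def global_caption_id(segment_index: int, local_id: int, stride: int = STRIDE) -> int:
    """Deterministic ID: the same segment and local index always map to the same ID."""
    if not 0 <= local_id < stride:
        raise ValueError("local_id must fit within the stride")
    return segment_index * stride + local_id


def merge_segment(results: dict, segment_index: int, captions: list) -> dict:
    """Upsert one segment's captions; re-running a segment overwrites, never duplicates."""
    for local_id, text in enumerate(captions):
        results[global_caption_id(segment_index, local_id)] = text
    return results


# Segments finish out of order and segment 1 is retried; the result is unchanged.
merged = {}
merge_segment(merged, 1, ["caption C"])
merge_segment(merged, 0, ["caption A", "caption B"])
merge_segment(merged, 1, ["caption C"])
```

Since the ID depends only on the segment index and the position within the segment, completion order and retries cannot change the merged result.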
Building control as a first-class feature: PII is encrypted at rest and searched via HMAC tokens without decryption. Secret comparison is constant-time, the auth flow requires PKCE, tokens are short-lived, and operations leave audit logs — all built into the initial design, not bolted on. On the infrastructure side, Cloud SQL uses IAM auth, required TLS, and private IP; the entry point is Cloud Armor; CI/CD is keyless (OIDC) — multi-layered — with the WAF fully enabled in stg to surface false positives and misconfigurations before production.
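Searching encrypted PII "via HMAC tokens without decryption" is often done with a keyed search token stored next to the ciphertext. The Python sketch below shows the exact-match case only; the names, the normalization rules, and the row layout are assumptions, and the partial matching mentioned in the text would need additional tokens (for example one per substring or n-gram), which is not shown here. AES-256-GCM encryption of the value itself is also omitted.

```python
import hashlib
import hmac
import unicodedata


def search_token(index_key: bytes, value: str) -> str:
    """Keyed HMAC-SHA256 of the normalized value; stored alongside the ciphertext."""
    normalized = unicodedata.normalize("NFKC", value).strip().lower()
    return hmac.new(index_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def find_user_ids(rows: list, index_key: bytes, query: str) -> list:
    """Match on tokens only; the plaintext email is never decrypted for search."""
    wanted = search_token(index_key, query)
    return [
        row["id"]
        for row in rows
        if hmac.compare_digest(row["email_token"], wanted)  # constant-time compare
    ]


INDEX_KEY = b"demo-only-key"  # in practice this would come from a secret store
rows = [
    {"id": 1, "email_token": search_token(INDEX_KEY, "alice@example.com")},
    {"id": 2, "email_token": search_token(INDEX_KEY, "bob@example.com")},
]
matches = find_user_ids(rows, INDEX_KEY, " Alice@Example.com ")
```

Using a key that is separate from the encryption key means a leaked index alone does not reveal the stored values, and a leaked database without the index key cannot be searched by guessing emails.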
Optimizing cost and operations by consolidating on managed services: With no dedicated VMs or Kubernetes, it leans on managed services — Cloud Run (services + jobs), Cloud SQL, Memorystore, Firestore, Cloud Workflows. The primary region keeps one instance warm while the secondary scales to zero for DR — an asymmetric setup — and the Artifact Registry image-retention cap keeps steady-state cost low while preserving recovery during incidents.
Key technical decisions
Self-built OIDC provider + per-tool short-lived JWT (PKCE S256 required): loosely coupled SSO across multiple AI tools
Back-channel logout (HMAC-signed, retried, deduplicated): reliable propagation of company-wide session control
Parallelizing OCR × speech recognition on Cloud Workflows: ~30% faster caption typo detection
Hybrid OCR (local transition detection → LLM only on the diff): accuracy preserved while curbing LLM cost
Cloud SQL (IAM auth, required TLS, private IP) + Cloud Armor + keyless CI/CD: broadcaster-grade multi-layered defense
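As an illustration of the hybrid OCR decision above, the following Python sketch picks which frames would be sent to LLM OCR, given the text that local OCR read from each sampled frame. The function name is hypothetical, and exact string equality is a simplification: real local OCR output is noisy, so a production version would likely need fuzzy comparison.

```python
def select_frames_for_llm(frame_texts: list) -> list:
    """Return the index of the first frame of each distinct caption."""
    seen = set()
    selected = []
    previous = None
    for index, raw in enumerate(frame_texts):
        text = raw.strip()
        if text == previous:      # caption still on screen: no transition
            continue
        previous = text           # a transition happened at this frame
        if text and text not in seen:
            seen.add(text)
            selected.append(index)
    return selected


frames = ["", "Hello", "Hello", "Hello", "", "World", "World", "Hello"]
to_send = select_frames_for_llm(frames)
```

In this example only two of the eight frames would go to the expensive model, which is the cost lever the decision describes.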
Responsibilities
- Platform design and BFF auth hub (OIDC / back-channel logout / PII encryption) implementation
- Speech-synthesis service (ElevenLabs / Google Chirp3) frontend & backend
- Caption typo-detection pipeline (FastAPI / Python / OCR / speech recognition / Cloud Workflows)
- Generative-AI compliance review (Gemini 2.5 / Vertex AI / grounding)
- Malware scanner (ClamAV on Cloud Run / Eventarc)
- GCP infrastructure build & operations (Terraform / Cloud Run / Cloud SQL / Cloud Armor / Identity Platform)
- CI/CD (Cloud Build / GitHub Actions / Workload Identity Federation) and observability
Results in numbers
- AI services integrated: 5 services. A shared BFF auth hub plus speech synthesis, caption typo detection, generative-AI review, and malware scanning, unified under a single SSO.
- Caption typo detection speedup: ~30%. OCR and speech recognition run in parallel on Cloud Workflows (sequential 18 min → parallel 13 min).
- Access-token lifetime: 10 min. Short-lived JWT scoped to each tool's audience. ID token is 1 hour; PKCE S256 required.
- Terraform modules: 71 modules. VPC, Cloud Run, Cloud SQL, Cloud Armor, Identity Platform, and more: GCP 100% codified (IaC).
- PII encryption: 256-bit (AES-GCM). Email, phone, and name stored encrypted; search via HMAC tokens without decryption.
- Streaming-scan limit: 10 GiB. Scanned by ClamAV without buffering to avoid memory exhaustion. Idempotent under retries.
Results
- Integrated 5 AI services of different natures into an internal platform usable across the board with a single Google Workspace SSO login — a loosely coupled setup where each tool deploys and improves independently
- A self-built OIDC auth hub (BFF) issues audience-scoped short-lived JWTs per tool (10-minute access token). PKCE S256 is required, and an auto-POST + origin HMAC signature (never putting tokens in URLs) structurally eliminates impersonation and interception
- Logout and permission changes propagate reliably to all tools via HMAC-signed, exponential-backoff, deduplicated back-channel events (company-wide session control while staying loosely coupled)
- PII is stored encrypted with AES-256-GCM and searched via HMAC tokens without decryption. With constant-time comparison and audit logs, it meets the broadcaster's internal-control requirements
- Automatically detects caption typos by cross-checking OCR (images) and speech recognition (speech). Cloud Workflows parallelism cut processing time by ~30% (sequential 18 min → parallel 13 min)
- Hybrid OCR (local caption-transition detection → LLM OCR only on the diff) minimizes LLM calls and cost while preserving accuracy
- Deterministic global caption IDs make long jobs idempotent and resumable — the result converges uniquely under parallel processing, partial re-runs, or retries
- Broadcast-quality AI narration (two TTS systems, ElevenLabs / Google Chirp3, 44.1kHz/192kbps, pronunciation dictionaries) built maintainably with a provider abstraction and a dummy mode
- A zero-trust entry point that streaming-scans uploaded assets with ClamAV (up to 10GiB, avoiding memory exhaustion) and routes them clean/quarantine atomically and idempotently
- GCP-wide IaC in ~71 Terraform modules. Multi-layered defense with Cloud SQL (IAM auth, required TLS, private IP), Cloud Armor (OWASP CRS 3.3), Identity Platform (MFA), and least-privilege IAM, with CI/CD made keyless via Workload Identity Federation (OIDC)