OpenAI Whisper in Production: A Transcription Design That Chooses Between Self-Hosting (large-v3-turbo) and the Audio API (gpt-4o-transcribe)

"I want to transcribe audio" is a one-line requirement. The moment you take it to production, though, the decisions multiply. Self-host or call the API? Which model? What do you do with long audio over 25MB? How do you stop proper nouns from being mis-transcribed? How do you eliminate the "hallucinations" that appear in silent passages?

This article is an implementation guide for running OpenAI Whisper at production quality. Throughout, I draw on design decisions from the internal AI platform I built for a major domestic broadcaster, where speech recognition served as the "ears" (speech-to-text) in a pipeline that detects typos in telops (on-screen captions).

Ground rules for this article: model specs, API specs, and pricing are based on the official OpenAI documentation (as of June 2026). Prices and model names get revised, so always check the latest values on the official pricing page before going to production. The code is written to be usable in real operation, but secrets are assumed to come from environment variables (never hardcode them).

0. The first fork: "Whisper" refers to two different things

If you start designing while conflating the two, you will get stuck later. Whisper is offered in two forms with different characteristics.

Item	① Self-hosting (OSS)	② OpenAI Audio API
What it is	`openai-whisper` (PyPI) / Hugging Face weights	a hosted endpoint on `api.openai.com`
Representative models	`large-v3` / `turbo` (large-v3-turbo)	`whisper-1` / `gpt-4o-transcribe` / `gpt-4o-mini-transcribe`
Where it runs	your own GPU / CPU (audio never leaves your environment)	OpenAI's servers (you send the audio)
Billing	compute resources (GPU time)	per-minute or per-token billing
File size limit	none (bounded by memory)	25MB per request
Main strengths	privacy, fixed cost, no length limit	no operations burden, high accuracy, usable the same day

When you say "use Whisper," first settle whether you mean ① or ②. This article covers both and gives selection criteria in section 3.

1. Self-hosting: the current models and how to use them correctly

1.1 The model list (official numbers)

The model table in the official README is reproduced below. Relative speed is measured against large (1x).

Size	Parameters	Required VRAM (guideline)	Relative speed	English-only (`.en`)
tiny	39M	~1GB	~10x	yes
base	74M	~1GB	~7x	yes
small	244M	~2GB	~4x	yes
medium	769M	~5GB	~2x	yes
large	1550M	~10GB	1x	none (multilingual only)
turbo	809M	~6GB	~8x	none

turbo (= large-v3-turbo) is the current sweet spot. It is a pruned, fine-tuned variant of large-v3 with the decoder cut from 32 layers to 4, which gives roughly 8x the speed of large with only a small loss in accuracy. For multilingual transcription, Japanese included, trying turbo first is the standard choice.

A turbo pitfall that the official README states explicitly: turbo was trained without translation data, so it may return text in the original language even when you specify --task translate. For "Japanese audio → English text" translation, use medium or large. The README also notes that for English-only applications the .en models tend to perform better, especially at the tiny and base sizes.

1.2 Installation (ffmpeg required)

# Python package
pip install -U openai-whisper

# ffmpeg is required for audio decoding (install it per OS)
# macOS:         brew install ffmpeg
# Ubuntu/Debian: sudo apt update && sudo apt install ffmpeg

1.3 CLI: run it first

# Transcribe with turbo (you can pass several files at once)
whisper audio.mp3 --model turbo --language Japanese --output_format srt

# To translate Japanese audio into English, use medium/large with translate
whisper meeting.wav --model large --language Japanese --task translate

--output_format accepts srt / vtt / txt / json, and the CLI writes the subtitle file directly. If subtitles are the goal, check whether the CLI's output formats already cover your needs before assembling timestamps yourself (YAGNI).

1.4 The Python API: the form used in production

The minimal example is three lines, but in production you should also pin the language, request word timestamps, and suppress hallucinations.

import whisper

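# Load the model once and reuse it (reuse within the process avoids cold starts)
model = whisper.load_model("turbo")

result = model.transcribe(
    "audio.mp3",
    language="ja",            # pin the language: removes auto-detection variance and its up-front cost
    word_timestamps=True,     # per-word times (needed for subtitles, search, highlighting)
    # --- Three settings for hallucination suppression (details in section 6) ---
    condition_on_previous_text=False,  # stop the decoder from being pulled along by the previous text
    no_speech_threshold=0.6,           # silence threshold (0.6 is the library default; tune it per audio source)
    temperature=0,                     # deterministic decoding (reproducibility, easier testing)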
)

print(result["text"])
for seg in result["segments"]:
    print(f"[{seg['start']:6.2f}–{seg['end']:6.2f}] {seg['text'].strip()}")

With word_timestamps=True, each segment gains a words list (word / start / end / probability). A low-probability word is a place where the model is unsure, so you can design in a "human confirmation gate," for example by highlighting such words in yellow in a proofreading UI.
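
The confirmation gate could start from something like the following. This is a minimal sketch, not a fixed recipe: the function name `low_confidence_words` and the 0.5 cutoff are assumptions made for illustration, and the code relies only on the `segments[].words[]` structure (with `word` and `probability`) described above.

```python
from typing import Any


def low_confidence_words(result: dict[str, Any], threshold: float = 0.5) -> list[dict[str, Any]]:
    """Collect words whose probability is below `threshold` (a hypothetical cutoff)."""
    flagged: list[dict[str, Any]] = []
    for seg in result.get("segments", []):
        # Segments without a "words" key (word_timestamps disabled) are skipped
        for w in seg.get("words", []):
            if w["probability"] < threshold:
                flagged.append({
                    "word": w["word"].strip(),
                    "start": w["start"],
                    "end": w["end"],
                    "probability": w["probability"],
                })
    return flagged
```

A proofreading UI could then highlight the returned time ranges; the right cutoff depends on the audio and should be tuned against real recordings.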

A design tip: do not load the model per request. Loading alone takes from several seconds to well over ten for the large family. In a web server, load it once at process startup and reuse it for every request (SRP: keep loading separate from inference).
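
One way to enforce "load once, reuse everywhere" is a small in-process cache. The sketch below is an assumption-laden illustration: `ModelCache` is a hypothetical helper, and the loader is injected so the same class works with `whisper.load_model` or with a stub in tests.

```python
import threading
from typing import Any, Callable


class ModelCache:
    """Load each model at most once per process and hand out the same instance."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        self._loader = loader            # e.g. whisper.load_model
        self._models: dict[str, Any] = {}
        self._lock = threading.Lock()    # prevents two threads from loading the same model

    def get(self, name: str = "turbo") -> Any:
        with self._lock:
            if name not in self._models:
                self._models[name] = self._loader(name)
            return self._models[name]
```

At startup you would create `models = ModelCache(whisper.load_model)` and call `models.get("turbo").transcribe(...)` inside request handlers. Note that this shares the model within one process only; a server running several worker processes still loads one copy per process.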

2. The OpenAI Audio API: high accuracy with no operations burden

If you don't want to run GPUs and you want high-accuracy transcription today, the hosted API is the fastest route.

2.1 Models and endpoints (the official correspondence table)

Model	Use	Supported `response_format`	Streaming
`gpt-4o-transcribe`	highest-accuracy transcription	`json` / `text`	✅ (`stream=True`)
`gpt-4o-mini-transcribe`	low-cost, fast	`json` / `text`	✅
`gpt-4o-transcribe-diarize`	with speaker diarization	`diarized_json`, etc.	—
`whisper-1`	the legacy all-rounder, rich subtitle formats	`json` / `text` / `srt` / `verbose_json` / `vtt`	—

The gpt-4o-transcribe family is a newer generation built on GPT-4o, with a lower WER (word error rate) and better language detection. If you need subtitle formats such as SRT/VTT or the fine-grained timestamps of verbose_json, choose whisper-1 instead. A rule of thumb: accuracy first means the gpt-4o family; machine-readable subtitle formats first means whisper-1.

Transcriptions: audio → text in the original language.
Translations: audio → English text. whisper-1 only.
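
The table and the rule of thumb can be encoded as a small guard so that an unsupported model/format combination fails before any request is sent. This is a sketch under assumptions: `choose_api_model` and its parameters are hypothetical names, the format sets are copied from the table above, and the diarization model is left out for brevity.

```python
SUPPORTED_FORMATS = {
    "gpt-4o-transcribe": {"json", "text"},
    "gpt-4o-mini-transcribe": {"json", "text"},
    "whisper-1": {"json", "text", "srt", "verbose_json", "vtt"},
}


def choose_api_model(response_format: str = "json", translate: bool = False,
                     low_cost: bool = False) -> str:
    """Pick a model name from the requirements, then verify the format is supported."""
    if translate or response_format in {"srt", "vtt", "verbose_json"}:
        model = "whisper-1"  # translation and subtitle formats are whisper-1 territory
    else:
        model = "gpt-4o-mini-transcribe" if low_cost else "gpt-4o-transcribe"
    if response_format not in SUPPORTED_FORMATS[model]:
        raise ValueError(f"{model} does not support response_format={response_format!r}")
    return model
```

Since model names and supported formats change over time, treat the dictionary as configuration to re-check against the official guide rather than as a permanent fact.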

2.2 Basics: Python and TypeScript

from openai import OpenAI

client = OpenAI()  # the API key is read from the OPENAI_API_KEY environment variable (never hardcode it)

with open("speech.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        language="ja",  # passing the language stabilizes accuracy and latency
    )

print(transcription.text)

For Next.js / Node (TypeScript, the same as this site):

import OpenAI from "openai";
import { createReadStream } from "node:fs";

const openai = new OpenAI(); // reads process.env.OPENAI_API_KEY automatically

const transcription = await openai.audio.transcriptions.create({
  file: createReadStream("audio.mp3"),
  model: "gpt-4o-transcribe",
  language: "ja",
});

console.log(transcription.text);

2.3 Subtitles (SRT) and word timestamps: whisper-1

When you need machine-readable subtitles or per-word timestamps, use whisper-1 with verbose_json.

with open("lecture.mp3", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="ja",
        response_format="verbose_json",
        timestamp_granularities=["segment", "word"],  # requires verbose_json
    )

# Per segment (subtitle-sized blocks)
for seg in result.segments:
    print(f"[{seg.start:.2f}-{seg.end:.2f}] {seg.text}")

# Per word (for search indexes and highlighting)
for w in result.words:
    print(w.word, w.start, w.end)

If you only want the SRT file itself, specifying response_format="srt" returns the subtitle text with no conversion code needed (DRY: don't write your own SRT serializer).

2.4 Getting proper nouns and technical terms spelled correctly: `prompt`

The most common complaint about transcription is mis-transcribed proper nouns, company names, and technical terms. Passing the expected spellings in the prompt parameter biases the model toward them.

# The prompt stays in the language of the audio; the prefix means "Proper nouns that appear:"
GLOSSARY = "登場する固有名詞: 友田陽大, ハコキット, Next.js, Supabase, RLS, gpt-4o-transcribe"

with open("podcast.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        language="ja",
        prompt=GLOSSARY,  # teach the domain vocabulary up front
    )

This is a cheap, high-impact way to tune. In transcription for program production, inconsistent spellings of person names, program names, and product names lead directly to on-air incidents, so feeding a per-project glossary into the prompt works well.
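
A per-project glossary is usually kept as a list, so a small builder helps keep the prompt deduplicated and bounded. The sketch below is illustrative: `build_glossary_prompt` is a hypothetical helper, and the `max_chars` default of 600 is an arbitrary assumption, not a documented API limit.

```python
def build_glossary_prompt(terms: list[str], prefix: str = "登場する固有名詞: ",
                          max_chars: int = 600) -> str:
    """Join unique, non-empty terms into one prompt string of at most max_chars characters."""
    seen: set[str] = set()
    kept: list[str] = []
    length = len(prefix)
    for term in terms:
        term = term.strip()
        if not term or term in seen:
            continue
        added = len(term) + (2 if kept else 0)  # account for the ", " separator
        if length + added > max_chars:
            break  # earlier terms win, so list the most important spellings first
        seen.add(term)
        kept.append(term)
        length += added
    return prefix + ", ".join(kept)
```

The result can be passed directly as `prompt=build_glossary_prompt(project_terms)`. Because terms are dropped from the end when the budget runs out, ordering the list by importance is part of the design.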

2.5 Streaming: output as it is transcribed (gpt-4o family)

For UX where you don't want to keep users waiting, such as meeting minutes or subtitles, stream=True lets you receive partial results incrementally.

with open("speech.mp3", "rb") as audio_file:
    stream = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        response_format="text",
        stream=True,  # available for gpt-4o-transcribe / gpt-4o-mini-transcribe
    )

    # Consume the stream while the file handle is still open
    for event in stream:
        # delta: incremental text / done: final transcript
        if event.type == "transcript.text.delta":
            print(event.delta, end="", flush=True)
        elif event.type == "transcript.text.done":
            print("\n--- final ---")

To transcribe microphone input in real time, the right path is the Realtime API (WebSocket / WebRTC), not file streaming. "Incrementally returning the transcript of a recorded file" and "live incremental recognition" are different things, so don't confuse the two requirements.

3. Selection framework: how to decide self-host vs. API

The question is not which one is correct; choose the one that fits your requirements along four axes.

Decision axis	Self-hosting (turbo/large) wins when	The Audio API wins when
Privacy	audio cannot leave your environment (medical, government, confidential)	sending audio to an external service is acceptable
Cost structure	volume is large and constant, and you want a fixed cost (GPU amortization)	volume is small or spiky, and a variable cost makes sense
Length	you want to process multi-hour files in one pass (no limit)	files fit in 25MB, or splitting is acceptable
Operational capacity	you can operate GPUs, model updates, and scaling	you don't want to operate anything / you want accuracy today

A feel for the cost

The official per-minute prices (as of June 2026; confirm before use) are roughly as follows.

whisper-1 / gpt-4o-transcribe: $0.006/min (≈ $0.36/hour)
gpt-4o-mini-transcribe: $0.003/min (≈ $0.18/hour)

At a few dozen hours a month, the API is overwhelmingly cheaper (cheaper than renting a single GPU, with no operations burden). Conversely, for a workload that steadily processes hundreds of hours a day, there is a break-even point beyond which running turbo on a GPU instance lowers the unit price. In many projects the right order is "validate the value with the API first, then consider migrating to self-hosting once volume is predictable" (YAGNI / cost efficiency).
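
The break-even reasoning is simple arithmetic, sketched below. The per-minute price is the figure quoted above; the GPU cost of $500 per month is purely a hypothetical placeholder, and the calculation deliberately ignores operations effort and the GPU's actual throughput, both of which push the real break-even point higher.

```python
def api_cost_usd(audio_hours: float, usd_per_minute: float) -> float:
    """Variable cost of transcribing `audio_hours` of audio through the API."""
    return audio_hours * 60 * usd_per_minute


def break_even_hours(gpu_cost_usd_per_month: float, usd_per_minute: float) -> float:
    """Monthly audio hours at which the API bill equals a fixed GPU bill."""
    return gpu_cost_usd_per_month / (usd_per_minute * 60)


if __name__ == "__main__":
    GPU_MONTHLY_USD = 500.0  # hypothetical figure; substitute your own quote
    PRICE = 0.006            # USD per minute, as quoted above
    for hours in (30, 300, 3000):
        print(f"{hours:>5} h/month via API: ${api_cost_usd(hours, PRICE):,.2f}")
    print(f"break-even: {break_even_hours(GPU_MONTHLY_USD, PRICE):,.0f} h/month")
```

Run it with your own numbers before deciding; the point is only that the comparison is between a per-minute variable cost and a fixed monthly cost.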

4. Getting past the 25MB limit: splitting long audio into chunks

The API's file limit is 25MB, and a one-hour meeting recording easily exceeds it. The standard approach is to cut at silent intervals (VAD or silence detection): because you never cut mid-sentence, accuracy degrades less.

"""Split long audio at silences, transcribe each chunk, and join the results.
Fixed-length cuts split words, so the key point is to cut at silence boundaries."""
from openai import OpenAI
from pydub import AudioSegment, silence

client = OpenAI()

def split_on_silence_bounded(path: str, max_ms: int = 15 * 60_000) -> list[AudioSegment]:
    """Cut at silences while keeping every chunk at or below max_ms (safely under 25MB)."""
    audio = AudioSegment.from_file(path)
    silences = silence.detect_silence(audio, min_silence_len=700, silence_thresh=-40)
    # Candidate cuts: the middle of each silent interval, plus the end of the file
    cut_points = [(s + e) // 2 for s, e in silences] + [len(audio)]

    chunks: list[AudioSegment] = []
    start = 0
    last_cut = None  # latest silence boundary that keeps the current chunk within max_ms
    for point in cut_points:
        while point - start > max_ms:
            # Cut at the last boundary inside the limit; hard-cut at max_ms if there is none
            end = last_cut if last_cut is not None and last_cut > start else start + max_ms
            chunks.append(audio[start:end])
            start, last_cut = end, None
        last_cut = point
    if start < len(audio):
        chunks.append(audio[start:])
    return chunks

def transcribe_long(path: str, language: str = "ja") -> str:
    parts: list[str] = []
    context = ""  # tail of the previous chunk, passed as the next prompt to keep context across the boundary
    for chunk in split_on_silence_bounded(path):
        buf = chunk.export(format="mp3", bitrate="64k")  # smaller file: comfortable margin under 25MB
        res = client.audio.transcriptions.create(
            model="gpt-4o-transcribe",
            file=("chunk.mp3", buf, "audio/mpeg"),
            language=language,
            prompt=context[-200:],  # hint for proper nouns and context that span the boundary
        )
        parts.append(res.text)
        context = res.text
    return "\n".join(parts)

Three points matter here.

Cut at silence: fixed-length splitting cuts through words and lowers accuracy. Cut at silence boundaries instead.
Lower the bitrate: transcription does not need high audio quality. A 64kbps mp3 is small enough that the 25MB limit stops being a concern (cost efficiency).
Pass context between chunks: put the tail of the previous chunk into the next prompt to preserve the continuity of proper nouns and topics across the boundary.
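
To see why a lower bitrate removes the size worry, it helps to estimate chunk size from bitrate and duration. The sketch below assumes constant-bitrate encoding and ignores container overhead; the function names are hypothetical, and 25,000,000 bytes is used as a conservative reading of "25MB".

```python
LIMIT_BYTES = 25_000_000  # conservative interpretation of the 25MB limit (assumption)


def estimated_bytes(duration_s: float, bitrate_kbps: int) -> int:
    """Approximate encoded size: kilobits per second times seconds, converted to bytes."""
    return int(duration_s * bitrate_kbps * 1000 / 8)


def max_duration_s(limit_bytes: int, bitrate_kbps: int) -> float:
    """Longest duration that still fits under limit_bytes at the given bitrate."""
    return limit_bytes * 8 / (bitrate_kbps * 1000)


if __name__ == "__main__":
    print(estimated_bytes(15 * 60, 64))        # size of a 15-minute chunk at 64kbps
    print(max_duration_s(LIMIT_BYTES, 64) / 60)  # minutes that fit under the limit
```

At 64kbps, a 15-minute chunk comes to about 7.2 MB and roughly 52 minutes would fit under the limit, so a 15-minute `max_ms` leaves a wide margin.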

5. Production-operation design: idempotency, retry, observability

Transcription is a long-running, billed job against an external API. Call it naively and a failure partway through means redoing everything and paying twice. Here is the design that actually worked on the broadcaster platform, reduced to its minimal form.

5.1 A per-chunk idempotent cache

If you cache each chunk under a key derived from its content hash, a re-run can skip completed chunks, which makes retries idempotent (you can resume after a partial failure and avoid double charges).

import hashlib
from pathlib import Path

from openai import OpenAI

client = OpenAI()

def chunk_key(data: bytes, model: str, language: str) -> str:
    """Deterministic key from the audio content plus parameters: same input, same key, cache hit."""
    h = hashlib.sha256()
    h.update(data)
    h.update(f"{model}:{language}".encode())
    return h.hexdigest()

def transcribe_chunk_idempotent(data: bytes, model: str, language: str, cache_dir: Path) -> str:
    key = chunk_key(data, model, language)
    cache_dir.mkdir(parents=True, exist_ok=True)  # make sure the cache directory exists
    cached = cache_dir / f"{key}.txt"
    if cached.exists():
        return cached.read_text(encoding="utf-8")  # re-runs do not call the API (idempotent, saves cost)

    res = client.audio.transcriptions.create(
        model=model, file=("c.mp3", data, "audio/mpeg"), language=language,
    )
    cached.write_text(res.text, encoding="utf-8")
    return res.text
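
One gap in a file-based cache is a crash in the middle of writing, which can leave a truncated file that later counts as a cache hit. A common hardening step, shown here as a sketch and not as part of the original design, is to write to a temporary file and rename it into place. `atomic_write_text` is a hypothetical helper name.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> None:
    """Write to a temp file in the same directory, then rename it over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
        os.replace(tmp_name, path)  # atomic when source and target share a filesystem
    except BaseException:
        # Clean up the partial temp file so it never looks like a cache entry
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

In the cache function, the final write would become `atomic_write_text(cached, res.text)`; readers then see either no file or a complete one.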

5.2 Retry with exponential backoff

An external API will fail at some point, through rate limits or transient faults. Apply retries only to idempotent operations.

import time
from openai import APIConnectionError, RateLimitError, APIStatusError

def with_retry(fn, *, max_attempts: int = 4, base: float = 1.0):
    """Exponential backoff. Retries rate limits, connection errors, and 5xx; other 4xx (invalid input) fail immediately."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except (RateLimitError, APIConnectionError):
            # RateLimitError (429) is caught here, before the generic 4xx check below
            if attempt == max_attempts:
                raise
            time.sleep(base * (2 ** (attempt - 1)))  # 1s, 2s, 4s, ...
        except APIStatusError as e:
            if 400 <= e.status_code < 500:
                raise  # retrying invalid input is pointless: fail fast
            if attempt == max_attempts:
                raise
            time.sleep(base * (2 ** (attempt - 1)))
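
When many chunks fail at once (for example after a rate-limit burst), fixed 1s/2s/4s delays make them all retry at the same instant. A common variation, shown here as an optional sketch and not as the method used above, is to randomize each delay ("full jitter") and cap it. The function name and the 30-second cap are assumptions.

```python
import random
from typing import Optional


def backoff_delays(max_attempts: int = 4, base: float = 1.0, cap: float = 30.0,
                   rng: Optional[random.Random] = None) -> list[float]:
    """Delays to sleep between attempts: uniform in [0, min(cap, base * 2**n)]."""
    rng = rng or random.Random()
    delays: list[float] = []
    for attempt in range(1, max_attempts):  # no sleep is needed after the final attempt
        ceiling = min(cap, base * 2 ** (attempt - 1))
        delays.append(rng.uniform(0, ceiling))
    return delays
```

Swapping these values in for the fixed `base * 2 ** (attempt - 1)` sleep spreads concurrent retries out in time; passing a seeded `random.Random` keeps tests deterministic.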

5.3 Observability: what you must record

For a transcription job, log metadata rather than the transcript body, which may contain PII (this reconciles privacy with observability).

Job ID / chunk number / input hash / model / language
Processing time (and its ratio to audio length, the real-time factor = RTF)
Billed amount (minutes or tokens) and estimated cost
Failure type (rate limit / timeout / invalid input) and retry count

Emitting these as structured logs (OpenTelemetry, etc.) lets you see at a glance which chunk got stuck and whether the cost is reasonable. Keeping PII (personal names, contact details) out of the logs is a hard requirement in projects subject to internal controls.
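
As an illustration of "metadata only," the sketch below builds one JSON log line per chunk. The field names and the `job_log_record` function are hypothetical; the important property is that the transcript text is never an argument, so it cannot leak into the log by accident.

```python
import json
from typing import Optional


def job_log_record(job_id: str, chunk_index: int, input_sha256: str, model: str,
                   language: str, audio_seconds: float, processing_seconds: float,
                   retries: int = 0, failure: Optional[str] = None,
                   usd_per_minute: Optional[float] = None) -> str:
    """Build one JSON log line from metadata only (no transcript body)."""
    record = {
        "job_id": job_id,
        "chunk_index": chunk_index,
        "input_sha256": input_sha256,
        "model": model,
        "language": language,
        "audio_seconds": audio_seconds,
        "processing_seconds": processing_seconds,
        # RTF = processing time / audio duration (below 1.0 means faster than real time)
        "rtf": round(processing_seconds / audio_seconds, 4) if audio_seconds > 0 else None,
        "retries": retries,
        "failure": failure,  # e.g. "rate_limit", "timeout", "invalid_input"
    }
    if usd_per_minute is not None:
        record["estimated_cost_usd"] = round(audio_seconds / 60 * usd_per_minute, 6)
    return json.dumps(record, ensure_ascii=False, sort_keys=True)
```

Each line can be shipped through whatever structured-logging pipeline you already use; the per-minute price is passed in so the estimate follows the current price list.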

6. Hallucination and the silence problem: the last 10% of accuracy

A known weakness of Whisper is that it generates sentences that were never spoken in intervals containing only silence, noise, or background music. A stock phrase such as "thank you for watching" appearing out of nowhere is the typical example. In production, eliminate it with the following layered countermeasures.

VAD (voice-activity detection) in preprocessing: drop non-speech intervals before inference. Removing silence with silero-vad or similar takes away the breeding ground for hallucinations.
condition_on_previous_text=False: stops the model from being pulled along by the previous text (which is how one hallucination chains into the next).
Discard by threshold: Whisper skips a segment when its no-speech probability exceeds no_speech_threshold and its average log-probability is below logprob_threshold, so lower the former or raise the latter to discard low-confidence segments more aggressively.
Cross-check against an independent second source: the strongest countermeasure is a cross-check. In the broadcaster's telop-typo detection, I compared two independent channels, "the on-screen telop (OCR = eyes)" and "the speech transcript (ASR = ears)," and flagged the discrepancies. Using the mismatch between two sources as the signal, rather than the confidence of a single source, surfaces both hallucinations and mis-transcriptions.
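
Threshold-based discarding can also be applied after the fact, using the `no_speech_prob` and `avg_logprob` values that openai-whisper attaches to each segment. The sketch below is a stricter post-filter than the library's built-in rule (it drops a segment if either signal looks bad); the function name and both cutoffs are assumptions to be tuned on real audio.

```python
from typing import Any


def drop_suspect_segments(segments: list[dict[str, Any]],
                          no_speech_max: float = 0.6,
                          avg_logprob_min: float = -1.0) -> list[dict[str, Any]]:
    """Keep only segments that look like real speech by both signals."""
    kept: list[dict[str, Any]] = []
    for seg in segments:
        likely_silence = seg.get("no_speech_prob", 0.0) > no_speech_max
        low_confidence = seg.get("avg_logprob", 0.0) < avg_logprob_min
        if likely_silence or low_confidence:
            continue  # candidate hallucination: discard, or route to human review
        kept.append(seg)
    return kept
```

Instead of silently discarding, a production pipeline may prefer to send the dropped segments to a review queue so that quiet but genuine speech is not lost.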

The "confidence" of a single source is not reliable. The discrepancy between two independent paths is what makes a trustworthy detection signal. This is a general principle for putting AI into production, not something specific to speech recognition.
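
A heavily simplified sketch of such a cross-check is shown below. It is not the broadcaster pipeline: it merely compares an OCR string with an ASR string after Unicode normalization, using the standard library's `difflib`. The function names and the 0.3 review threshold are assumptions, and real telops often paraphrase the speech, so a production system needs alignment and domain rules on top of this.

```python
import difflib
import unicodedata


def normalize(text: str) -> str:
    """NFKC-normalize, drop whitespace, and lowercase so trivial differences don't count."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text if not ch.isspace()).lower()


def mismatch_score(ocr_text: str, asr_text: str) -> float:
    """0.0 means identical after normalization; 1.0 means nothing in common."""
    matcher = difflib.SequenceMatcher(None, normalize(ocr_text), normalize(asr_text))
    return 1.0 - matcher.ratio()


def needs_review(ocr_text: str, asr_text: str, threshold: float = 0.3) -> bool:
    """Flag the pair for a human when the two independent sources disagree too much."""
    return mismatch_score(ocr_text, asr_text) > threshold
```

The value of the approach comes from the independence of the two inputs, not from the particular similarity measure, so the scoring function is the part most worth replacing.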

7. Security and privacy: where to draw the line

Audio can be personal information: it may contain a voice, names, contact details, or medical history. Audio with confidential or sensitive personal information often cannot be sent to the API, and in that case self-hosting (turbo/large) is the only one of the two options left.
Keep the API key on the server side: don't call OpenAI directly from the browser (the key would leak). In Next.js, go through a Route Handler / Server Action and keep the key in an environment variable.
Validate input: check the file format (mp3/mp4/mpeg/mpga/m4a/wav/webm), size (25MB), and duration at the boundary before passing anything to the API. Never pass user-supplied input straight through.
Make the storage policy explicit: if you store transcripts, decide the retention period, encryption, and deletion flow up front. Encrypting PII with AES-256-GCM or similar, and searching by partial match over HMAC tokens without decrypting, is a standard design under internal controls.
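
The boundary validation mentioned above might look like the following. This is a minimal sketch: `validate_upload` is a hypothetical helper, the extension list is the one given above, and 25,000,000 bytes is a conservative reading of the 25MB limit. An extension check alone does not prove the content type, so a real service should also inspect the file itself (and its duration) before forwarding it.

```python
from pathlib import PurePath

ALLOWED_EXTENSIONS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}
MAX_BYTES = 25 * 1000 * 1000  # conservative interpretation of "25MB" (assumption)


def validate_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the upload should not be forwarded to the API."""
    ext = PurePath(filename).suffix.lower().lstrip(".")
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {ext or '(none)'}")
    if size_bytes <= 0:
        raise ValueError("empty file")
    if size_bytes > MAX_BYTES:
        raise ValueError(f"file too large: {size_bytes} bytes (limit {MAX_BYTES})")
```

Calling this at the start of the request handler keeps obviously invalid uploads from consuming API quota or reaching the splitting step.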

8. Summary: a selection cheat sheet

Finally, a quick reference for when you are unsure.

High accuracy, today: gpt-4o-transcribe (API). Pass the language and steer proper nouns with prompt.
Cost is the top priority: gpt-4o-mini-transcribe (API, $0.003/min).
You need SRT/VTT subtitles or word timestamps: whisper-1 (response_format srt / vtt, or verbose_json for word timestamps).
Audio cannot leave your environment, or you process many long files: self-hosted turbo (multilingual transcription) / large (translation).
Live subtitles: the Realtime API. For incremental output from a recorded file, stream=True.
Common equipment for production: silence-based splitting, an idempotent cache, exponential backoff, observability that emits no PII, and layered hallucination countermeasures.

Transcription looks like a one-line requirement, but it is really the work of designing trade-offs among cost, accuracy, privacy, and reliability. On the broadcaster's internal AI platform, I ran speech recognition in production as the "ears" of telop-typo detection, built as a long-running job with idempotency, resumability, and observability guaranteed.

From design through implementation and operation, I can work with you end-to-end on how to transcribe your audio and fold it into your operations. Feel free to get in touch, even at the requirements-gathering stage.

References (official documentation)

OpenAI Whisper (GitHub, OSS) — the model list, CLI, Python API, turbo caveats
openai/whisper-large-v3-turbo (Hugging Face) — turbo's architecture (32→4 layers)
OpenAI Speech-to-Text guide — the Audio API's parameters, formats, streaming
OpenAI API pricing — the latest values of per-minute / token billing (confirm before production)
