# OpenAI Whisper production operations guide: choosing between self-hosting (large-v3-turbo) and the Audio API (gpt-4o-transcribe)

> An implementation guide for running OpenAI Whisper at production quality. Staying faithful to the official documentation, it lays out the self-hosted models (large-v3 / turbo) and the Audio API models (whisper-1 / gpt-4o-transcribe / gpt-4o-mini-transcribe), then walks through, with working code, a self-host vs. API selection framework, working around the 25MB limit, SRT subtitle generation, steering proper nouns with `prompt`, hallucination countermeasures, and idempotency, resumption, and observability.

- Published: 2026-06-24
- Author: 友田 陽大
- Tags: Python, Speech recognition, OpenAI API, Architecture design, Performance, Observability
- URL: https://tomodahinata.com/en/blog/openai-whisper-production-guide-selfhost-vs-api
- Category: Voice AI
- Pillar guide: https://tomodahinata.com/en/blog/voice-ai-production-guide-stt-tts-voice-agents

## Key points

- Whisper comes in two forms, self-hosted (OSS) and the OpenAI Audio API; decide first which one you mean.
- For self-hosting, turbo (large-v3-turbo) is the current sweet spot; on the API, choose whisper-1 when subtitle formats matter.
- Choose along four axes: privacy, cost structure, audio length, and operational capacity. If confidential audio cannot leave your environment, self-hosting is the only option.
- To stay under the 25MB limit, split at silent intervals (VAD) so you never cut mid-sentence, and pass the context at each boundary into the next chunk's prompt.
- Against hallucination, combine VAD preprocessing and threshold-based discarding with a cross-check against an independent second source (OCR × ASR), and flag the discrepancies.

---

"I want to transcribe audio" is a one-line requirement. But the moment you try to put it into production, the decisions multiply. **Self-host or call the API? Which model? What about long audio over 25MB? How do you stop proper nouns from being mis-transcribed? How do you eliminate the "hallucinations" that surface in silent intervals?**

This article is an implementation guide for operating OpenAI Whisper at **production quality.** It draws on design decisions from the internal AI platform I built for a major domestic broadcaster, where speech recognition served as the "ears" (transcription of speech) in the [telop-typo-detection pipeline](/case-studies/broadcaster-ai-content-platform). ("Telop" is the broadcast term for on-screen captions.)

> **The rules of this article**: model specs, API specs, and pricing are based on the **OpenAI official documentation (as of June 2026).** Pricing and model names get revised, so always confirm the latest values on the [official pricing page](https://openai.com/api/pricing/) before going to production. The code is written to be usable in real operation, and secrets are assumed to come from environment variables (hardcoding is strictly forbidden).

---

## 0. The first fork: "Whisper" refers to two different things

If you start designing without separating the two, you will get stuck later. Whisper is delivered in **two forms with different characteristics.**

| | ① Self-hosting (OSS) | ② OpenAI Audio API |
| --- | --- | --- |
| What it is | `openai-whisper` (PyPI) / Hugging Face weights | a hosted endpoint at `api.openai.com` |
| Representative models | `large-v3` / `turbo` (large-v3-turbo) | `whisper-1` / `gpt-4o-transcribe` / `gpt-4o-mini-transcribe` |
| Where it runs | your own GPU / CPU (audio never leaves your environment) | OpenAI's servers (you send the audio) |
| Billing | compute resources (GPU time) | per-minute or per-token billing |
| File limit | none (bounded by memory) | **25MB / request** |
| Main strengths | privacy, fixed cost, unlimited length | nothing to operate, high accuracy, same-day adoption |

**When you say "use Whisper," decide first whether you mean ① or ②.** This article covers both and gives the selection criteria (chapter 3).

---

## 1. The self-host version: the latest model and correct usage

### 1.1 The model list (official numbers)

The model table in the official README is reproduced below. `Relative speed` is measured relative to `large` (1x).

| Size | Parameters | Required VRAM (guideline) | Relative speed | English-only (`.en`) |
| --- | --- | --- | --- | --- |
| tiny | 39M | ~1GB | ~10x | yes |
| base | 74M | ~1GB | ~7x | yes |
| small | 244M | ~2GB | ~4x | yes |
| medium | 769M | ~5GB | ~2x | yes |
| large | 1550M | ~10GB | 1x | none (multilingual only) |
| **turbo** | **809M** | **~6GB** | **~8x** | none |

**`turbo` (= large-v3-turbo) is the current sweet spot.** It is a pruned and fine-tuned variant of large-v3 whose decoder was cut from **32 layers to 4,** which yields roughly 8x speed with minimal accuracy loss. For multilingual transcription, Japanese included, trying turbo first is the standard move.

> **A turbo pitfall the official README states explicitly**: turbo was trained **without translation data,** so even with `--task translate` it **may return text in the original language.** For "Japanese audio → English text" translation, use `medium` or `large`. The README also notes that for English-only applications the `.en` models tend to perform better.

### 1.2 Installation (ffmpeg required)

```bash
# Python package
pip install -U openai-whisper

# ffmpeg is required for audio decoding (install per OS)
# macOS:        brew install ffmpeg
# Ubuntu/Debian: sudo apt update && sudo apt install ffmpeg
```

### 1.3 CLI: first, run it

```bash
# Transcribe with turbo (multiple files can be passed at once)
whisper audio.mp3 --model turbo --language Japanese --output_format srt

# To translate Japanese audio into English, use medium/large + translate
whisper meeting.wav --model large --language Japanese --task translate
```

`--output_format` accepts `srt` / `vtt` / `txt` / `json`, and the subtitle file is written out as-is. **If the goal is subtitle generation, check whether the CLI's output formats are enough before assembling timestamps yourself** (YAGNI).

### 1.4 The Python API: the form used in production

The minimal example is three lines, but in production add **a fixed language, word timestamps, and hallucination suppression.**

```python
import whisper

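# Load the model once and reuse it (reuse within the process = no repeated cold start)
model = whisper.load_model("turbo")

result = model.transcribe(
    "audio.mp3",
    language="ja",            # fix the language -> no auto-detection drift or startup cost
    word_timestamps=True,     # per-word times (needed for subtitles, search, highlighting)
    # --- Hallucination-suppression trio (details in chapter 6) ---
    condition_on_previous_text=False,  # stop carry-over from the previous text
    no_speech_threshold=0.6,           # silence threshold (0.6 is the library default; lower it to skip silence more aggressively)
    temperature=0,                     # deterministic (reproducible, easier to test)
)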
)

print(result["text"])
for seg in result["segments"]:
    print(f"[{seg['start']:6.2f}–{seg['end']:6.2f}] {seg['text'].strip()}")
```

Setting `word_timestamps=True` adds `words` (`word` / `start` / `end` / `probability`) to each `segment`. Because **a low-`probability` word marks a spot where the model is unsure,** you can design in a "human confirmation gate," such as highlighting those words in yellow in a proofreading UI.

> **A design tip**: don't load `model` per request. The `large` family takes anywhere from a few seconds to well over ten seconds just to load. In a web server, load it once at process startup and reuse it across requests (SRP: separate loading from inference).
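
As a rough illustration of that confirmation gate, the sketch below collects the words a proofreading UI might highlight. It only assumes the dict shape returned by `model.transcribe(..., word_timestamps=True)`; the function name `low_confidence_words` and the `0.5` cutoff are illustrative assumptions, not values from the Whisper documentation.

```python
from typing import Any


def low_confidence_words(
    result: dict[str, Any], threshold: float = 0.5
) -> list[dict[str, Any]]:
    """Collect words whose probability is below `threshold`.

    `threshold` is an assumed starting point; tune it on your own audio.
    """
    flagged: list[dict[str, Any]] = []
    for seg in result.get("segments", []):
        for w in seg.get("words", []):
            if w["probability"] < threshold:
                flagged.append(
                    {
                        "word": w["word"].strip(),  # Whisper words carry leading spaces
                        "start": w["start"],
                        "end": w["end"],
                        "probability": w["probability"],
                    }
                )
    return flagged
```

A reviewer-facing UI could then jump to each `start` time and play the audio around the flagged word.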

---

## 2. The OpenAI Audio API version: high accuracy with nothing to operate

You don't want to run a GPU, and you want high-accuracy transcription today. In that case, the hosted API is the fastest route.

### 2.1 Models and endpoints (the official correspondence table)

| Model | Use | Supported `response_format` | Streaming |
| --- | --- | --- | --- |
| `gpt-4o-transcribe` | highest-accuracy transcription | `json` / `text` | ✅ (`stream=True`) |
| `gpt-4o-mini-transcribe` | low-cost, fast | `json` / `text` | ✅ |
| `gpt-4o-transcribe-diarize` | with speaker diarization | `diarized_json`, etc. | — |
| `whisper-1` | the legacy all-rounder, **rich subtitle formats** | `json` / `text` / `srt` / `verbose_json` / `vtt` | — |

**The `gpt-4o-transcribe` family is a newer generation** built on GPT-4o, with a lower WER (word error rate) and better language detection. On the other hand, **if you need subtitle formats such as SRT/VTT or the fine-grained timestamps of `verbose_json`, choose `whisper-1`.** A simple rule of thumb: if accuracy comes first, use the gpt-4o family; if machine-readable subtitle formats come first, use whisper-1.

- **Transcriptions**: audio → "original-language" text.
- **Translations**: audio → **English** text. **`whisper-1`-only.**
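
One way to keep this correspondence from being forgotten in code is to encode the table as data and check it before calling the API. The following is a minimal sketch: the mapping is copied from the table above, while `SUPPORTED_FORMATS` and `pick_model` are hypothetical names introduced only for illustration.

```python
# Supported response formats per model, copied from the table above.
# Re-check against the official documentation before relying on it.
SUPPORTED_FORMATS: dict[str, set[str]] = {
    "gpt-4o-transcribe": {"json", "text"},
    "gpt-4o-mini-transcribe": {"json", "text"},
    "whisper-1": {"json", "text", "srt", "verbose_json", "vtt"},
}


def pick_model(response_format: str, preferred: str = "gpt-4o-transcribe") -> str:
    """Return `preferred` if it supports the format, else the first model that does."""
    if response_format in SUPPORTED_FORMATS.get(preferred, set()):
        return preferred
    for model, formats in SUPPORTED_FORMATS.items():
        if response_format in formats:
            return model
    raise ValueError(f"no model supports response_format={response_format!r}")
```

With this in place, a request for `srt` falls back to `whisper-1`, while a request for `json` stays on the preferred gpt-4o model.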

### 2.2 Basics: Python and TypeScript

```python
from openai import OpenAI

client = OpenAI()  # the API key comes from the OPENAI_API_KEY environment variable (never hardcode it)

with open("speech.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        language="ja",  # passing the language stabilizes accuracy and latency
    )

print(transcription.text)
```

For Next.js / Node (TypeScript, the same as this site):

```ts
import OpenAI from "openai";
import { createReadStream } from "node:fs";

const openai = new OpenAI(); // reads process.env.OPENAI_API_KEY automatically

const transcription = await openai.audio.transcriptions.create({
  file: createReadStream("audio.mp3"),
  model: "gpt-4o-transcribe",
  language: "ja",
});

console.log(transcription.text);
```

### 2.3 Subtitles (SRT) and word timestamps: whisper-1

When you need machine-readable subtitles or per-word times, use `whisper-1` + `verbose_json`.

```python
with open("lecture.mp3", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="ja",
        response_format="verbose_json",
        timestamp_granularities=["segment", "word"],  # requires verbose_json
    )

# Per segment (subtitle-sized blocks)
for seg in result.segments:
    print(f"[{seg.start:.2f}-{seg.end:.2f}] {seg.text}")

# Per word (for search indexes and highlighting)
for w in result.words:
    print(w.word, w.start, w.end)
```

If you only want the SRT file itself, `response_format="srt"` returns the subtitle text directly, with no conversion code (DRY: don't write your own SRT serializer).

### 2.4 Make it spell proper nouns / technical terms correctly: `prompt`

The most common complaint about transcription is **mis-transcribed proper nouns, company names, and technical terms.** Passing the expected spellings in the `prompt` parameter nudges the model toward them.

```python
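# The prompt stays in Japanese to match the audio language.
# Meaning: "Proper nouns that appear: <list of names and terms>"
GLOSSARY = "登場する固有名詞: 友田陽大, ハコキット, Next.js, Supabase, RLS, gpt-4o-transcribe"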

with open("podcast.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        language="ja",
        prompt=GLOSSARY,  # teach the domain vocabulary up front
    )
```

This is a **cheap, high-impact** tuning lever. In transcription for program production, inconsistent spellings of person names, program names, and product names lead directly to on-air mistakes, so feeding a per-project glossary into `prompt` pays off.
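
A per-project glossary tends to grow, and the API only takes a limited amount of prompt into account, so it can help to build the string programmatically. The sketch below deduplicates terms and caps the total length. `build_glossary_prompt` is a hypothetical helper, and the 400-character cap is an assumed, crude stand-in for the real token limit, which should be checked in the official documentation.

```python
def build_glossary_prompt(
    terms: list[str], prefix: str = "Glossary: ", max_chars: int = 400
) -> str:
    """Join unique terms into a prompt string no longer than `max_chars`.

    Terms are kept in input order, so put the most important ones first.
    """
    seen: set[str] = set()
    kept: list[str] = []
    length = len(prefix)
    for raw in terms:
        term = raw.strip()
        if not term or term in seen:
            continue
        extra = len(term) + (2 if kept else 0)  # 2 = len(", ") separator
        if length + extra > max_chars:
            break  # stop before exceeding the cap
        seen.add(term)
        kept.append(term)
        length += extra
    return prefix + ", ".join(kept)
```

The result can be passed directly as `prompt=...`; for Japanese audio, a Japanese prefix would be the natural choice.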

### 2.5 Streaming: receive output incrementally (gpt-4o family)

For UX where you don't want to keep users waiting, such as meeting minutes or subtitles, `stream=True` lets you receive partial results as they are produced.

```python
with open("speech.mp3", "rb") as audio_file:  # the file handle is closed when the block exits
    stream = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        response_format="text",
        stream=True,  # available for gpt-4o-transcribe / mini
    )

    for event in stream:
        # delta: incremental text / done: final
        if event.type == "transcript.text.delta":
            print(event.delta, end="", flush=True)
        elif event.type == "transcript.text.done":
            print("\n--- final ---")
```

To transcribe microphone input in real time, the right path is the **Realtime API** (WebSocket / WebRTC), not file streaming. "Incremental results from a recorded file" and "live incremental recognition" are different things, so don't mix up the requirements.

---

## 3. Selection framework: how to decide self-host vs. API

The question is not "which one is correct." Choose whichever fits your requirements along **four axes.**

| Decision axis | Self-hosting (turbo/large) has the edge | The Audio API has the edge |
| --- | --- | --- |
| **Privacy** | medical, government, or confidential audio that **cannot be sent outside** | sending audio externally is acceptable |
| **Cost structure** | high-volume, always-on workloads you want as a **fixed cost** (GPU amortization) | low-volume or spiky workloads where a **variable cost** makes sense |
| **Length** | multi-hour files you want to process **in one pass** (no limit) | files fit in 25MB, or splitting is acceptable |
| **Operational capacity** | you can **operate** GPUs, model updates, and scaling | you don't want to operate anything / you want accuracy today |

### Cost intuition

The official per-minute pricing (as of June 2026; confirm before use) is roughly as follows.

- `whisper-1` / `gpt-4o-transcribe`: **$0.006/min** (≒ $0.36/hour)
- `gpt-4o-mini-transcribe`: **$0.003/min** (≒ $0.18/hour)

**At a few dozen hours a month, the API is far cheaper** (less than renting a single GPU, and nothing to operate). Conversely, for a workload that **continuously processes hundreds of hours a day,** there is a break-even point beyond which running turbo on a GPU instance lowers the unit price. In many projects the right order is "validate the value with the API first, then consider migrating to self-hosting once volume is predictable" (YAGNI / cost efficiency).
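
The break-even reasoning can be made concrete with a few lines of arithmetic. This sketch uses the per-minute prices listed above; the GPU monthly cost you pass in is your own figure (the `360.0` in the usage note is purely hypothetical), and the calculation deliberately ignores GPU throughput limits and operations labor.

```python
# Per-minute API prices quoted in this article (confirm the latest values).
API_PRICE_PER_MIN: dict[str, float] = {
    "whisper-1": 0.006,
    "gpt-4o-transcribe": 0.006,
    "gpt-4o-mini-transcribe": 0.003,
}


def api_monthly_cost(audio_hours: float, model: str) -> float:
    """API cost in USD for `audio_hours` of audio per month."""
    return audio_hours * 60 * API_PRICE_PER_MIN[model]


def break_even_audio_hours(gpu_monthly_cost: float, model: str) -> float:
    """Monthly audio hours at which the API bill equals a fixed GPU bill."""
    return gpu_monthly_cost / (60 * API_PRICE_PER_MIN[model])
```

For example, `api_monthly_cost(30, "gpt-4o-transcribe")` is about $10.80, and with a hypothetical $360/month GPU, `break_even_audio_hours(360.0, "gpt-4o-transcribe")` comes out at roughly 1,000 hours of audio per month.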

---

## 4. Crossing the 25MB wall: chunk-splitting long audio

The API's file limit is **25MB.** A one-hour meeting recording easily exceeds it. The standard approach is to **cut at silent intervals (VAD)**; because you never cut mid-sentence, accuracy degradation stays small.

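```python
"""Split long audio at silences, transcribe each chunk, and join the results.
Fixed-length cuts split words, so the key is to cut at silence boundaries."""
from openai import OpenAI
from pydub import AudioSegment, silence

client = OpenAI()

def split_on_silence_bounded(path: str, max_ms: int = 15 * 60_000) -> list[AudioSegment]:
    """Cut at silences while keeping every chunk at or under max_ms (safe side of 25MB)."""
    audio = AudioSegment.from_file(path)
    silences = silence.detect_silence(audio, min_silence_len=700, silence_thresh=-40)
    # Candidate cuts: the midpoint of each silent interval, plus the end of the audio.
    cut_points = [(s + e) // 2 for s, e in silences] + [len(audio)]

    chunks: list[AudioSegment] = []
    start = 0
    last_cut = 0  # latest silence seen that still fits in the current chunk
    for point in cut_points:
        while point - start > max_ms:
            # Cut at the last silence that fits; hard-cut only if there is none.
            cut = last_cut if last_cut > start else start + max_ms
            chunks.append(audio[start:cut])
            start = cut
        last_cut = point
    if start < len(audio):
        chunks.append(audio[start:])
    return chunks

def transcribe_long(path: str, language: str = "ja") -> str:
    parts: list[str] = []
    context = ""  # tail of the previous chunk, passed as the next prompt to keep context across the boundary
    for chunk in split_on_silence_bounded(path):
        # Export to a temporary file; 64k keeps each chunk well under 25MB.
        with chunk.export(format="mp3", bitrate="64k") as buf:
            res = client.audio.transcriptions.create(
                model="gpt-4o-transcribe",
                file=("chunk.mp3", buf.read(), "audio/mpeg"),
                language=language,
                prompt=context[-200:],  # hint for proper nouns and context that span the boundary
            )
        parts.append(res.text)
        context = res.text
    return "\n".join(parts)
```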

Three points matter.

1. **Cut at silence**: fixed-length splitting cuts words in half and lowers accuracy. Cut at silence boundaries instead (here, pydub's level-based silence detection stands in for a full VAD).
2. **Lower the bitrate**: transcription doesn't need high audio quality. A 64kbps mp3 is small enough that the 25MB limit stops being a concern (cost efficiency).
3. **Pass the context between chunks**: put the tail of the previous chunk into the next `prompt` to preserve the continuity of proper nouns and topics across the boundary.
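
To see why a 64kbps export leaves headroom, the size arithmetic can be written down directly. This is a back-of-the-envelope sketch that assumes constant-bitrate audio and ignores container overhead; `LIMIT_BYTES` reads "25MB" conservatively as 25,000,000 bytes, and the 90% safety margin is an assumption.

```python
import math

LIMIT_BYTES = 25_000_000  # conservative reading of the 25MB request limit


def estimated_bytes(duration_s: float, bitrate_kbps: int) -> int:
    """Approximate size of constant-bitrate audio (container overhead ignored)."""
    return math.ceil(duration_s * bitrate_kbps * 1000 / 8)


def max_chunk_seconds(
    bitrate_kbps: int, limit_bytes: int = LIMIT_BYTES, safety_pct: int = 90
) -> int:
    """Longest chunk (in seconds) that stays within `safety_pct`% of the limit."""
    budget_bits = limit_bytes * safety_pct // 100 * 8
    return budget_bits // (bitrate_kbps * 1000)
```

At 64kbps, a 15-minute chunk is about 7.2MB, and the longest chunk that fits the margin is a little under 47 minutes, so the 15-minute default in `split_on_silence_bounded` is comfortably on the safe side.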

---

## 5. Production-operation design: idempotency, retry, observability

Transcription is a long-running, billed job against an external API. **Called naively, a failure partway through means redoing everything and paying twice.** Here is the design that actually worked on the broadcaster platform, in minimal form.

### 5.1 A per-chunk idempotent cache

If you cache each chunk **keyed by a content hash,** re-runs skip completed chunks and retries become idempotent (resumable after a partial failure, with no double charge).

```python
import hashlib
from pathlib import Path

def chunk_key(data: bytes, model: str, language: str) -> str:
    """Deterministic key from audio content + parameters. Same input -> same key -> cache hit."""
    h = hashlib.sha256()
    h.update(data)
    h.update(f"{model}:{language}".encode())
    return h.hexdigest()

def transcribe_chunk_idempotent(data: bytes, model: str, language: str, cache_dir: Path) -> str:
    key = chunk_key(data, model, language)
    cached = cache_dir / f"{key}.txt"
    if cached.exists():
        return cached.read_text(encoding="utf-8")  # re-runs never call the API (idempotent, no extra charge)

    # `client` is the OpenAI() instance created earlier.
    res = client.audio.transcriptions.create(
        model=model, file=("c.mp3", data, "audio/mpeg"), language=language,
    )
    cache_dir.mkdir(parents=True, exist_ok=True)  # make sure the cache directory exists
    cached.write_text(res.text, encoding="utf-8")
    return res.text
```
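
One caveat with a file-based cache: if the process dies in the middle of the write, a truncated file could later be read as a cache hit. A common mitigation, sketched below under the assumption of a local filesystem, is to write to a temporary file in the same directory and rename it into place. `write_cache_atomic` is a hypothetical helper, not part of the design described above.

```python
import os
import tempfile
from pathlib import Path


def write_cache_atomic(cache_dir: Path, key: str, text: str) -> Path:
    """Write `text` to <cache_dir>/<key>.txt so readers never see a partial file."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    final_path = cache_dir / f"{key}.txt"
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
        os.replace(tmp_name, final_path)  # replaces the destination in one step
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)  # clean up the partial temp file
        raise
    return final_path
```

Because cache lookups only match `<key>.txt`, a leftover `.tmp` file from a crash is never mistaken for a completed chunk.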

### 5.2 Retry with exponential backoff

An external API will fail at some point due to rate limits and transient errors. Apply retries **only to idempotent operations.**

```python
import time
from openai import APIConnectionError, RateLimitError, APIStatusError

def with_retry(fn, *, max_attempts: int = 4, base: float = 1.0):
    """Exponential backoff. Retries only transient failures; other 4xx (= invalid input) fail immediately."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except (RateLimitError, APIConnectionError):
            # RateLimitError (429) must be caught before the generic APIStatusError below.
            if attempt == max_attempts:
                raise
        except APIStatusError as e:
            if 400 <= e.status_code < 500:
                raise  # invalid input will not succeed on retry -> fail fast
            if attempt == max_attempts:
                raise
        time.sleep(base * (2 ** (attempt - 1)))  # 1s, 2s, 4s, ...
```
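
When many chunks are retried in parallel, fixed delays can make all workers hit the API again at the same instant. A common refinement is to randomize each delay ("full jitter"). The sketch below only computes the delay schedule; `backoff_delays`, the `cap` of 30 seconds, and the injectable `rng` are illustrative assumptions rather than part of the retry helper above.

```python
import random
from typing import Optional


def backoff_delays(
    max_attempts: int = 4,
    base: float = 1.0,
    cap: float = 30.0,
    rng: Optional[random.Random] = None,
) -> list[float]:
    """Delays to sleep between attempts: uniform in [0, min(cap, base * 2**n)]."""
    rng = rng or random.Random()
    delays: list[float] = []
    for attempt in range(1, max_attempts):  # no sleep after the final attempt
        ceiling = min(cap, base * 2 ** (attempt - 1))
        delays.append(rng.uniform(0, ceiling))
    return delays
```

Passing a seeded `random.Random` makes the schedule reproducible in tests, while production code can rely on the default generator.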

### 5.3 Observability: what to definitely record

In a transcription job, record **metadata, not the transcript body (which may contain PII)**, so that privacy and observability coexist.

- Job ID / chunk number / input hash / model / language
- Processing time and the real-time factor (RTF = processing time ÷ audio length)
- Billed usage (minutes or tokens) and estimated cost
- Failure type (rate limit / timeout / invalid input) and retry count

Emitting these as structured logs (OpenTelemetry, etc.) lets you see at a glance which chunk got stuck and whether the cost is reasonable. **Never leaving PII (personal names, contact details) in log bodies** is a hard requirement in internal-control projects.
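
As a minimal sketch of such a record, the function below builds one JSON log line per chunk from the fields listed above. The field names and the helper `chunk_log_record` are assumptions for illustration; the point is that the transcript text itself is never a parameter, so it cannot leak into the log.

```python
import json
from typing import Optional


def chunk_log_record(
    *,
    job_id: str,
    chunk_index: int,
    input_hash: str,
    model: str,
    language: str,
    audio_seconds: float,
    processing_seconds: float,
    retries: int,
    failure_type: Optional[str] = None,
    price_per_min: Optional[float] = None,
) -> str:
    """Return a JSON log line containing metadata only (no transcript text)."""
    record = {
        "job_id": job_id,
        "chunk_index": chunk_index,
        "input_hash": input_hash,
        "model": model,
        "language": language,
        "audio_seconds": audio_seconds,
        "processing_seconds": processing_seconds,
        # RTF = processing time / audio length; below 1.0 means faster than real time.
        "rtf": round(processing_seconds / audio_seconds, 3) if audio_seconds > 0 else None,
        "estimated_cost_usd": (
            round(audio_seconds / 60 * price_per_min, 6) if price_per_min is not None else None
        ),
        "failure_type": failure_type,
        "retries": retries,
    }
    return json.dumps(record, ensure_ascii=False)
```

The resulting line can be handed to whatever structured logger or exporter the project already uses.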

---

## 6. Hallucination and the silence problem: the last 10% of accuracy

Whisper's known weakness is that it **generates sentences that were never spoken in intervals containing only silence, noise, or background music.** Boilerplate such as "thank you for watching" appearing out of nowhere is the typical example. In production, suppress it with the following layered countermeasures.

1. **VAD (voice-activity detection) in preprocessing**: drop intervals with no speech before inference. Removing silence with `silero-vad` or similar eliminates the breeding ground for hallucination.
2. **`condition_on_previous_text=False`**: stops carry-over from the previous text, so one hallucination does not chain into the next.
3. **Discard by threshold**: tune `no_speech_threshold` (a segment becomes a candidate for skipping when its no-speech probability exceeds this value, so lowering it discards more) and drop low-confidence segments with `logprob_threshold`.
4. **Cross-check against an independent second source**: this is the strongest countermeasure. In the broadcaster's telop-typo detection, I cross-checked **two independent channels**, the on-screen telop (OCR = eyes) and the speech transcription (ASR = ears), and flagged the discrepancies. Relying on **the mismatch between two sources** rather than the confidence of a single source surfaces both hallucinations and mis-transcriptions.

> The "confidence" of a single information source isn't reliable. **The "discrepancy" between two independent paths** is what becomes a trustworthy detection signal. This is a general principle for putting AI into production, not something limited to speech recognition.
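
Countermeasures 3 and 4 can both be sketched as small post-processing steps. The first function filters self-hosted Whisper segments using their `no_speech_prob` and `avg_logprob` fields; the second compares an OCR string with an ASR string and flags low agreement. This is an illustration using `difflib` from the standard library, not the broadcaster pipeline's actual implementation, and all thresholds are assumed starting values.

```python
import unicodedata
from difflib import SequenceMatcher
from typing import Any


def drop_suspect_segments(
    segments: list[dict[str, Any]],
    max_no_speech_prob: float = 0.6,
    min_avg_logprob: float = -1.0,
) -> list[dict[str, Any]]:
    """Keep only segments that look like confident speech (stricter post-filter)."""
    return [
        seg
        for seg in segments
        if seg.get("no_speech_prob", 0.0) <= max_no_speech_prob
        and seg.get("avg_logprob", 0.0) >= min_avg_logprob
    ]


def _normalize(text: str) -> str:
    """Fold width/case differences and drop spaces and punctuation."""
    folded = unicodedata.normalize("NFKC", text).lower()
    return "".join(ch for ch in folded if ch.isalnum())


def agreement(ocr_text: str, asr_text: str) -> float:
    """Similarity in [0, 1] between the two independent readings."""
    return SequenceMatcher(None, _normalize(ocr_text), _normalize(asr_text)).ratio()


def needs_review(ocr_text: str, asr_text: str, min_agreement: float = 0.8) -> bool:
    """Flag the pair for human review when the two sources disagree too much."""
    return agreement(ocr_text, asr_text) < min_agreement
```

In practice the OCR and ASR strings would first be aligned by timestamp, and a flagged pair goes to a human rather than being auto-corrected, since either source may be the one that is wrong.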

---

## 7. Security and privacy: where to draw the line

- **Audio can be personal information**: it may contain a voice, names, contact details, or medical history. **Audio containing confidential or sensitive personal information often cannot be sent to the API.** In that case, self-hosting (turbo/large) is the only option.
- **Keep the API key on the server side**: don't call OpenAI directly from the browser (the key would leak). In Next.js, go through a Route Handler / Server Action and keep the key in an environment variable.
- **Validate input**: check the file format (`mp3`/`mp4`/`mpeg`/`mpga`/`m4a`/`wav`/`webm`), size (25MB), and length **at the boundary** before passing anything to the API. Never pass user-supplied input straight through.
- **Make the storage policy explicit**: if you store transcription results, decide the retention period, encryption, and deletion flow up front. Encrypting PII with AES-256-GCM or similar, and searching via tokenization (HMAC) without decrypting, is a standard design in internal-control projects.
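
The boundary validation described above might look like the following sketch. The allowed extensions and the 25MB figure come from this article; `validate_upload`, `InvalidUpload`, and the four-hour duration cap are hypothetical choices. An extension check alone does not prove the content is audio, so treat this as a first gate, not the only one.

```python
from pathlib import PurePath
from typing import Optional

ALLOWED_EXTENSIONS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}
MAX_BYTES = 25_000_000  # conservative reading of the 25MB request limit


class InvalidUpload(ValueError):
    """Raised when an upload fails boundary validation."""


def validate_upload(
    filename: str,
    size_bytes: int,
    duration_s: Optional[float] = None,
    max_duration_s: float = 4 * 3600,  # assumed business rule, not an API limit
) -> None:
    """Reject unsupported formats, empty or oversized files, and overlong audio."""
    ext = PurePath(filename).suffix.lower().lstrip(".")
    if ext not in ALLOWED_EXTENSIONS:
        raise InvalidUpload(f"unsupported file type: {ext or '(none)'}")
    if size_bytes <= 0:
        raise InvalidUpload("empty file")
    if size_bytes > MAX_BYTES:
        raise InvalidUpload("file exceeds the 25MB limit; split it first")
    if duration_s is not None and duration_s > max_duration_s:
        raise InvalidUpload("audio is longer than the allowed duration")
```

Files that fail only the size check are candidates for the chunk-splitting path from chapter 4 rather than outright rejection.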

---

## 8. Summary: a selection cheat sheet

Finally, a quick reference for when you're unsure.

- **High accuracy today, with minimal setup**: `gpt-4o-transcribe` (API). Pass the language and steer proper nouns with `prompt`.
- **Cost is the top priority**: `gpt-4o-mini-transcribe` (API, $0.003/min).
- **You need SRT/VTT subtitles / word timestamps**: `whisper-1` + `verbose_json`.
- **You can't send audio outside / you process many long files**: self-hosted `turbo` (multilingual) / `large` (translation).
- **Live subtitles**: the Realtime API. For incremental results from a recorded file, `stream=True`.
- **Standard equipment for production**: silence-based splitting, an idempotent cache, exponential backoff, observability that emits no PII, and layered hallucination countermeasures.

Transcription looks like a one-line requirement, but it is really **the work of designing trade-offs among cost, accuracy, privacy, and reliability.** On an internal AI platform for a broadcaster, I ran speech recognition in production as the "ears" of telop-typo detection, built as a long-running job with idempotency, resumption, and observability guaranteed.

**"How should we transcribe our audio, and how do we fit it into operations?" From that design through implementation and operation, I can support you end-to-end.** Feel free to reach out, even at the requirements-gathering stage.

---

### References (official documentation)

- [OpenAI Whisper (GitHub, OSS)](https://github.com/openai/whisper) — the model list, CLI, Python API, turbo caveats
- [openai/whisper-large-v3-turbo (Hugging Face)](https://huggingface.co/openai/whisper-large-v3-turbo) — turbo's architecture (32→4 layers)
- [OpenAI Speech-to-Text guide](https://developers.openai.com/api/docs/guides/speech-to-text) — the Audio API's parameters, formats, streaming
- [OpenAI API pricing](https://openai.com/api/pricing/) — the latest values of per-minute / token billing (confirm before production)
