# Next.js × Qwen-TTS: implementing an accessible 'read-article-aloud' player at production quality (WCAG 2.2, type-safe, cache)

> A guide to implementing an accessible audio player that reads articles and documents aloud with Next.js 16 and Qwen-TTS. Using type-safe, working code, it covers server-side TTS generation (Zod validation, a content-hash cache, keeping the API key off the client) and a WCAG 2.2-compliant React player (keyboard operation, aria-live, no autoplay, prefers-reduced-motion, focus management).

- Published: 2026-06-25
- Author: 友田 陽大
- Tags: Next.js, speech synthesis, frontend, Qwen, a11y, TypeScript
- URL: https://tomodahinata.com/en/blog/nextjs-qwen-tts-accessible-audio-player-text-to-speech
- Category: Voice AI
- Pillar guide: https://tomodahinata.com/en/blog/voice-ai-production-guide-stt-tts-voice-agents

## Key points

- TTS can be a powerful ally for accessibility, but autoplay, missing keyboard support, and unannounced state changes easily turn it into a WCAG violation.
- Keep generation on the server: validate the boundary with Zod, cache by content hash, and never expose the API key to the browser.
- Never generate the same text twice (content hash = idempotent cache). This lowers cost and makes every playback after the first feel faster.
- Make the player default to no autoplay, with every control keyboard-operable, state announced through aria-live, and prefers-reduced-motion respected.
- Always show the text being read aloud on screen so that no information is audio-only (WCAG 1.2.1).

---

"I want readers to be able to listen to my articles." That feature helps both accessibility and the experience of listening while doing something else. But a naive implementation **disrupts screen readers with autoplay, locks out keyboard users with mouse-only controls, and gives no indication of loading or errors**, so a well-meaning effort easily becomes a **WCAG violation.**

This article is a guide to implementing a **truly accessible read-aloud player** at production quality with **Next.js 16 and [Qwen-TTS](/blog/qwen-tts-qwen3-tts-flash-production-guide).** It walks through server-side TTS generation (key kept on the server, caching) and a **WCAG 2.2-compliant** React player, using type-safe, working code. The a11y fundamentals follow the [Next.js × React accessibility (WCAG 2.2) guide](/blog/react-nextjs-web-accessibility-wcag22-guide).

> **Ground rules for this article**: the Qwen-TTS API details are based on the [official documentation](https://www.alibabacloud.com/help/en/model-studio/qwen-tts) (as of June 2026). The code assumes Next.js 16 / React 19 and is written to be reusable as-is. **The API key lives in an environment variable and is used on the server only** (never expose it to the browser).

---

## 0. Design: split responsibility into two

A read-aloud feature splits into **two responsibilities of a different nature.** Mixing them leads to key leakage or tight coupling, which is why the single-responsibility principle (SRP) applies here.

```text
[server]  manuscript → synthesize with Qwen-TTS → cache by content hash → return a stable URL (holds the key)
[client]  receive the stable URL → play accessibly (no autoplay, keyboard-operable, state announced)
```

- **Server** = management of generation, cost, and secrets (key, cache, validation).
- **Client** = playback and accessibility (operation, state, focus).

With this split, even if you switch to [another TTS provider](/blog/qwen-tts-vs-elevenlabs-openai-google-azure-tts-comparison), the client is unaffected (ETC: easier to change).
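As a minimal sketch of why the client survives a provider swap: if the server hides each vendor behind one small interface, the only thing that crosses the boundary is `{ url, cached }`. The names `TtsProvider`, `TtsRequest`, `TtsResponse`, `resolveAudio`, and the in-memory `store` are hypothetical, introduced only for illustration; they are not part of Qwen-TTS or Next.js.

```typescript
// Hypothetical contract: everything vendor-specific lives behind `synthesize`.
interface TtsRequest {
  text: string;
  voice: string;
  language: string;
}

interface TtsProvider {
  /** Identifier mixed into the cache key so vendors never share entries. */
  readonly id: string;
  /** Returns the raw audio bytes for the request. */
  synthesize(req: TtsRequest): Promise<Uint8Array>;
}

/** What the client receives, regardless of which provider produced the audio. */
interface TtsResponse {
  url: string;
  cached: boolean;
}

// In-memory stand-in for blob storage (assumption; real code would use S3, Vercel Blob, etc.).
const store = new Map<string, Uint8Array>();

async function resolveAudio(provider: TtsProvider, req: TtsRequest): Promise<TtsResponse> {
  // Simplified key for the sketch; section 1 hashes the same kind of fields with SHA-256.
  const key = [provider.id, req.voice, req.language, req.text].join("\0");
  const url = `/audio/${encodeURIComponent(key)}`;
  if (store.has(key)) return { url, cached: true };
  store.set(key, await provider.synthesize(req));
  return { url, cached: false };
}

// Two fake providers: swapping one for the other leaves the response shape unchanged.
const fakeA: TtsProvider = { id: "vendor-a", synthesize: async () => new Uint8Array([1]) };
const fakeB: TtsProvider = { id: "vendor-b", synthesize: async () => new Uint8Array([2]) };
```

Because the player only ever sees `TtsResponse`, replacing `fakeA` with `fakeB` is a server-side change; nothing in the client component needs to know which vendor produced the file.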

---

## 1. Server: generation, validation, cache, hiding the key

In a Route Handler, **validate the boundary with Zod** and **cache by content hash.** The same text, voice, and language are **never generated twice**; this idempotency is also the biggest cost saving.

```ts
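// app/api/tts/route.ts: server-only. The API key never leaves this file.
import { z } from "zod";
import { createHash } from "node:crypto";
import { put, head } from "@vercel/blob"; // any storage works (S3, etc.)

// Fix the allowed voices and languages as types (invalid values never reach the API, which prevents billing accidents and exceptions)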
const VOICES = ["Cherry", "Ethan", "Serena", "Jennifer"] as const;
const LANGS = ["Japanese", "English", "Chinese", "Korean"] as const;

const Body = z.object({
  text: z.string().min(1).max(4000),
  voice: z.enum(VOICES).default("Cherry"),
  language_type: z.enum(LANGS).default("Japanese"),
});

/** Deterministic key derived solely from the input. Same input → same key → no regeneration (idempotent). */
function cacheKey(input: z.infer<typeof Body>): string {
  return createHash("sha256")
    .update(`qwen3-tts-flash\0${input.voice}\0${input.language_type}\0${input.text}`)
    .digest("hex");
}

export async function POST(req: Request): Promise<Response> {
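  // Malformed JSON becomes null, so it fails validation (400) instead of throwing (500).
  const parsed = Body.safeParse(await req.json().catch(() => null));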
  if (!parsed.success) {
    return Response.json({ error: parsed.error.flatten() }, { status: 400 });
  }
  const key = cacheKey(parsed.data);
  const blobPath = `tts/${key}.wav`;

  // 1) On a cache hit, return immediately (no generation, no billing)
  const cached = await head(blobPath).catch(() => null);
  if (cached) return Response.json({ url: cached.url, cached: true });

  // 2) Only on a miss, generate with Qwen-TTS (the key comes from a server environment variable)
  const upstream = await fetch(
    "https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation",
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.DASHSCOPE_API_KEY!}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model: "qwen3-tts-flash", input: parsed.data }),
    },
  );
  if (!upstream.ok) return Response.json({ error: "tts_failed" }, { status: 502 });

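  // 3) The generated URL expires after 24 hours, so copy the audio to your own storage right away (important)
  const { output } = (await upstream.json()) as { output: { audio: { url: string } } };
  const audio = await fetch(output.audio.url).then((r) => r.arrayBuffer());
  // addRandomSuffix: false keeps the stored pathname equal to blobPath, so the head() lookup above can hit
  const saved = await put(blobPath, audio, {
    access: "public",
    contentType: "audio/wav",
    addRandomSuffix: false,
  });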

  return Response.json({ url: saved.url, cached: false });
}
```

Three points matter here: **① the key stays on the server, ② the cache is idempotent and keyed by content hash (lower cost and faster responses), and ③ the temporary URL, which expires after 24 hours, is copied into your own storage.** For Qwen-TTS pricing and model details, see the [production-operations guide](/blog/qwen-tts-qwen3-tts-flash-production-guide).
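To make the idempotency claim concrete, here is a small standalone sketch of the same hashing scheme as the Route Handler above, runnable with Node.js alone. `TtsInput` and `MODEL` are names introduced here for illustration. The fields are joined with `\0` so that adjacent fields cannot blur into each other (without a separator, `"ab" + "c"` and `"a" + "bc"` produce the same string).

```typescript
import { createHash } from "node:crypto";

interface TtsInput {
  text: string;
  voice: string;
  language_type: string;
}

// The model name is part of the key, so switching models never reuses old audio.
const MODEL = "qwen3-tts-flash";

function cacheKey(input: TtsInput): string {
  return createHash("sha256")
    .update([MODEL, input.voice, input.language_type, input.text].join("\0"))
    .digest("hex");
}

const a = cacheKey({ text: "Hello", voice: "Cherry", language_type: "English" });
const b = cacheKey({ text: "Hello", voice: "Cherry", language_type: "English" });
const c = cacheKey({ text: "Hello", voice: "Ethan", language_type: "English" });

console.log(a === b); // true: same input, same key, so no second generation
console.log(a === c); // false: a different voice gets its own cache entry
console.log(a.length); // 64 (hex characters of a SHA-256 digest)
```

Anything that changes the audio must be part of the key. If you later add a parameter that affects the output, include it in the hashed fields, or stale audio will be served for the new setting.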

---

## 2. Client: a WCAG 2.2-compliant player

This is the centerpiece of the article. **Use the HTML5 `<audio>` element as the base and layer an accessible custom UI on top.** Requirements: no autoplay, every control keyboard-operable, state announced to screen readers, `prefers-reduced-motion` respected, and adjustable playback speed.

```tsx
"use client";

import { useCallback, useEffect, useId, useRef, useState } from "react";
import { Play, Pause, Loader2, AlertCircle } from "lucide-react";
import { cn } from "@/lib/utils";

type Status = "idle" | "loading" | "ready" | "playing" | "paused" | "error";

interface Props {
  /** The text to read aloud. Always show it on screen as well (WCAG 1.2.1). */
  text: string;
  voice?: "Cherry" | "Ethan" | "Serena" | "Jennifer";
  language?: "Japanese" | "English" | "Chinese" | "Korean";
}

const SPEEDS = [0.75, 1, 1.25, 1.5, 2] as const;

export function ReadAloud({ text, voice = "Cherry", language = "Japanese" }: Props) {
  const audioRef = useRef<HTMLAudioElement>(null);
  const [status, setStatus] = useState<Status>("idle");
  const [progress, setProgress] = useState(0); // 0..1
  const [duration, setDuration] = useState(0);
  const [speed, setSpeed] = useState<(typeof SPEEDS)[number]>(1);
  const statusId = useId();

  // Request generation ONLY on the first play (never start audio automatically: WCAG 1.4.2)
  const ensureLoaded = useCallback(async (): Promise<boolean> => {
    const el = audioRef.current;
    if (!el) return false;
    if (el.src && !el.error) return true; // already loaded, with no media error to recover from
    setStatus("loading");
    try {
      const res = await fetch("/api/tts", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ text, voice, language_type: language }),
      });
      if (!res.ok) throw new Error("tts_failed");
      const { url } = (await res.json()) as { url: string };
      el.src = url;
      setStatus("ready");
      return true;
    } catch {
      setStatus("error");
      return false;
    }
  }, [text, voice, language]);

  const toggle = useCallback(async () => {
    const el = audioRef.current;
    if (!el) return;
    if (el.paused) {
      if (!(await ensureLoaded())) return;
      await el.play().catch(() => setStatus("error")); // playback starts only from a user action
    } else {
      el.pause();
    }
  }, [ensureLoaded]);

  const onSeek = useCallback((value: number) => {
    const el = audioRef.current;
    if (el && Number.isFinite(el.duration)) el.currentTime = value * el.duration;
  }, []);

  // Apply the playback speed to the audio element
  useEffect(() => {
    if (audioRef.current) audioRef.current.playbackRate = speed;
  }, [speed]);

  const isBusy = status === "loading";
  const isPlaying = status === "playing";
  const label = isPlaying ? "Pause" : status === "loading" ? "Loading" : "Play read-aloud";

  return (
    <section
      aria-label="Read this article aloud"
      className="flex items-center gap-3 rounded-xl border bg-card p-3"
    >
      {/* Play/pause: satisfies name/role/value (WCAG 4.1.2). aria-pressed expresses the state */}
      <button
        type="button"
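        // aria-disabled (not disabled) keeps focus on the button while loading; the error state stays clickable so the user can retry
        onClick={isBusy ? undefined : toggle}
        aria-disabled={isBusy}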
        aria-pressed={isPlaying}
        aria-label={label}
        aria-describedby={statusId}
        className={cn(
          "inline-flex size-11 shrink-0 items-center justify-center rounded-full",
          "bg-primary text-primary-foreground transition-colors",
          "focus-visible:outline-none focus-visible:ring-2 focus-visible:ring-ring focus-visible:ring-offset-2",
          "disabled:opacity-50",
        )}
      >
        {isBusy ? (
          // Respect prefers-reduced-motion: motion-reduce stops the spin
          <Loader2 className="size-5 animate-spin motion-reduce:animate-none" aria-hidden />
        ) : status === "error" ? (
          <AlertCircle className="size-5" aria-hidden />
        ) : isPlaying ? (
          <Pause className="size-5" aria-hidden />
        ) : (
          <Play className="size-5 translate-x-0.5" aria-hidden />
        )}
      </button>

      {/* Seek: a native range input is keyboard-operable. aria-valuetext announces the time */}
      <input
        type="range"
        min={0}
        max={1}
        step={0.01}
        value={progress}
        onChange={(e) => onSeek(Number(e.target.value))}
        aria-label="Playback position"
        aria-valuetext={`${formatTime(progress * duration)} / ${formatTime(duration)}`}
        className="h-1.5 flex-1 cursor-pointer accent-primary"
      />

      <span className="tabular-nums text-xs text-muted-foreground" aria-hidden>
        {formatTime(progress * duration)}
      </span>

      {/* Playback speed: a select tied to a label (keyboard-operable, has a name) */}
      <label className="text-xs text-muted-foreground">
        <span className="sr-only">Playback speed</span>
        <select
          value={speed}
          onChange={(e) => setSpeed(Number(e.target.value) as (typeof SPEEDS)[number])}
          className="rounded-md border bg-background px-1.5 py-1 focus-visible:ring-2 focus-visible:ring-ring"
        >
          {SPEEDS.map((s) => (
            <option key={s} value={s}>{s}×</option>
          ))}
        </select>
      </label>

      {/* Status announcements: delivered to screen readers without relying on vision (WCAG 4.1.3) */}
      <span id={statusId} role="status" aria-live="polite" className="sr-only">
        {status === "loading" && "Preparing audio"}
        {status === "playing" && "Playing"}
        {status === "paused" && "Paused"}
        {status === "error" && "Failed to load audio"}
      </span>

      <audio
        ref={audioRef}
        preload="none" // no automatic download (saves cost and bandwidth, prevents autoplay)
        onPlay={() => setStatus("playing")}
        onPause={() => audioRef.current && !audioRef.current.ended && setStatus("paused")}
        onTimeUpdate={(e) => {
          const el = e.currentTarget;
          if (Number.isFinite(el.duration)) setProgress(el.currentTime / el.duration);
        }}
        onLoadedMetadata={(e) => setDuration(e.currentTarget.duration)}
        onEnded={() => { setStatus("ready"); setProgress(0); }}
        onError={() => setStatus("error")}
      />
    </section>
  );
}

function formatTime(sec: number): string {
  if (!Number.isFinite(sec)) return "0:00";
  const m = Math.floor(sec / 60);
  const s = Math.floor(sec % 60);
  return `${m}:${s.toString().padStart(2, "0")}`;
}
```

---

## 3. The crux of accessibility (the WCAG this code satisfies)

The player above satisfies the following WCAG 2.2 success criteria **by design.**

| Success criterion | How it's satisfied |
| --- | --- |
| **1.2.1 Audio-only and Video-only (Prerecorded)** | **The text being read aloud is also shown on screen** (never audio-only). The player is only an aid. |
| **1.4.2 Audio Control** | **No autoplay**: `preload="none"`, and playback starts only from a user action. Play, pause, and speed controls are provided. |
| **2.1.1 Keyboard** | **Every control is keyboard-operable** through native `button` / `input[range]` / `select`. |
| **2.4.7 Focus Visible** | A clear focus ring via `focus-visible:ring-*`. |
| **4.1.2 Name, Role, Value** | The custom UI gets its meaning from `aria-label`, `aria-pressed`, and `aria-valuetext`. |
| **4.1.3 Status Messages** | Loading and errors are announced with `role="status"` + `aria-live="polite"` **without stealing focus.** |
| **Motion sensitivity** | The spinner respects `prefers-reduced-motion` via `motion-reduce:animate-none`. |

> **The most important one is 1.2.1**: however polished the player is, what matters is **never making information available through audio alone.** Read-aloud is "an additional way to access the text," not "a replacement for the text."

---

## 4. Performance and cost

- **Generation is lazy**: `preload="none"` plus generation only on the first play. **Merely opening the page costs nothing.**
- **Idempotent cache by content hash**: each article is generated only once. From the second listener on, playback starts right away with no new generation charge (section 1).
- **Intent-prediction prefetch (optional)**: attach `onPointerEnter` to the play button and start generating before the click for a snappier feel. The trade-off is wasted billing when the user never presses play, so limit it to high-traffic articles.
- **Render the page with RSC**: only the player is `"use client"`; the article body stays a server component, which keeps the bundle small ([Next.js 16 cache design](/blog/nextjs-16-app-router-cache-components-data-fetching)).
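One way to sketch the intent-prediction prefetch without double billing is to share a single in-flight promise between `onPointerEnter` and the click, so that hovering and then pressing triggers only one request. `createAudioLoader` and the injected `request` function are hypothetical names; in the player above, `request` would wrap the POST to `/api/tts`. The URL below is a placeholder.

```typescript
/** Hypothetical helper: memoizes one request so prefetch and click share it. */
function createAudioLoader(request: () => Promise<string>) {
  let inFlight: Promise<string> | null = null;

  return function load(): Promise<string> {
    if (!inFlight) {
      inFlight = request().catch((err: unknown) => {
        inFlight = null; // forget failures so the next call can retry
        throw err;
      });
    }
    return inFlight;
  };
}

// Usage with a fake request that counts how often it is called.
let calls = 0;
const load = createAudioLoader(async () => {
  calls += 1;
  return "https://example.com/tts/abc.wav"; // placeholder URL
});

const fromHover = load(); // e.g. called from onPointerEnter
const fromClick = load(); // e.g. called from onClick a moment later
// Both variables hold the same promise, and the fake request ran only once.
```

The sketch removes duplicate requests from one visitor, but a hover that never turns into a click still costs one generation on a cache miss, which is the trade-off noted in the list above.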

---

## 5. Type safety, errors, resilience

- **Zod at the boundary**: the server `safeParse`s the input, and the client constrains voice and language with union types. **Make illegal states unrepresentable** ([the discipline of type safety](/blog/typescript-type-safety-discipline-zod-nevererror-no-any)).
- **Don't swallow errors**: a generation failure becomes `status="error"`, is shown explicitly in the UI, and can be retried. It is also announced to screen readers via `aria-live`.
- **State is a finite set**: model `Status` as the union `idle | loading | ready | playing | paused | error` and branch the UI exhaustively.
- **Testability**: verify the player's state transitions with [Playwright](/blog/playwright-e2e-testing-production-design-guide), checking keyboard operation and ARIA attributes (plus a WCAG sweep with @axe-core).
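As a sketch of "state is a finite set" with exhaustive branching, the status-to-announcement mapping can be written as a `switch` that stops compiling when a member is added to `Status` but not handled. `announcement` and `assertNever` are illustrative names, not part of the component above, and the message strings are placeholders.

```typescript
type Status = "idle" | "loading" | "ready" | "playing" | "paused" | "error";

/** Compile-time guard: only type-checks if every Status member was handled. */
function assertNever(value: never): never {
  throw new Error(`Unhandled status: ${String(value)}`);
}

/** Text for the aria-live region; an empty string means "announce nothing". */
function announcement(status: Status): string {
  switch (status) {
    case "idle":
    case "ready":
      return "";
    case "loading":
      return "Preparing audio";
    case "playing":
      return "Playing";
    case "paused":
      return "Paused";
    case "error":
      return "Failed to load audio";
    default:
      // If a new Status member is added, `status` is no longer `never` here
      // and the compiler rejects this call.
      return assertNever(status);
  }
}

console.log(announcement("loading")); // prints: Preparing audio
```

A pure function like this is also easy to unit-test separately from the browser, leaving the end-to-end tests to cover focus and keyboard behavior.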

---

## 6. Conclusion: an implementation checklist

- **Responsibility split**: server (generation, key, cache) / client (playback, a11y).
- **Server**: Zod validation, idempotent cache by content hash, key kept on the server, audio from the 24-hour URL copied to your own storage immediately.
- **Client**: no autoplay, every control keyboard-operable, state announced with `aria-live`, `prefers-reduced-motion` respected.
- **WCAG**: 1.2.1 (text alongside) / 1.4.2 (no autoplay) / 2.1.1 / 2.4.7 / 4.1.2 / 4.1.3.
- **Cost**: lazy generation + cache = no billing from merely opening the page.
- **Type safety**: state is a finite union, Zod guards the boundary, errors are both shown and announced.

A read-aloud feature isn't accessible just because it exists. Only with this design (**no autoplay, fully keyboard-operable, state announced**) does it become usable by everyone. I've built frontends that balance UX, a11y, and cost on learning platforms and content platforms ([subscription learning platform](/case-studies/subscription-learning-platform)). **If you want to deliver your content in a form anyone can listen to, fast and at low cost, I can support you end-to-end, from design through implementation and testing.** Feel free to reach out.

---

### References (official documentation)

- [Qwen-TTS speech synthesis (Alibaba Cloud Model Studio)](https://www.alibabacloud.com/help/en/model-studio/qwen-tts) — model, voice, API parameters
- [WAI-ARIA Authoring Practices](https://www.w3.org/WAI/ARIA/apg/) — a11y patterns for custom UI
- [MDN: HTMLAudioElement](https://developer.mozilla.org/en-US/docs/Web/API/HTMLAudioElement) — `<audio>` events and properties
- [WCAG 2.2 Understanding](https://www.w3.org/WAI/WCAG22/Understanding/) — the intent and implementation of each success criterion
