Next.js × Qwen-TTS: implementing an accessible 'read-article-aloud' player at production quality (WCAG 2.2, type-safe, cache)

"I want readers to be able to listen to articles" is a worthwhile goal: it improves accessibility and supports listening while doing something else. Implemented naively, though, it interrupts screen readers with autoplay, locks out keyboard users with mouse-only controls, and gives no indication of whether it is loading or has failed. A well-meaning feature easily becomes a WCAG violation.

This article is a guide to implementing a genuinely accessible read-aloud player at production quality with Next.js 16 and Qwen-TTS. It covers server-side TTS generation (keeping the key hidden, caching) and a WCAG 2.2-compliant React player, with type-safe working code. The a11y basics are covered in the Next.js × React accessibility (WCAG 2.2) guide.

Ground rules for this article: the Qwen-TTS API details follow the official documentation (as of June 2026). The code assumes Next.js 16 / React 19 and is written to be reusable as-is. The API key lives in a server-side environment variable and is never exposed to the browser.

0. Design: split responsibility into two

A read-aloud feature splits into two responsibilities of a different nature. Mixing them leads to key leakage or tight coupling, which is why the single-responsibility principle (SRP) applies here.

[server]  manuscript → synthesize with Qwen-TTS → cache by content hash → return a stable URL (holds the key)
[client]  receive the stable URL → play accessibly (no autoplay, keyboard-operable, state announced)

Server = management of generation, cost, and secrets (key, cache, validation).
Client = playback and accessibility (operation, state, focus).

With this split, the client is unaffected even if you later swap the TTS provider, which keeps the design easy to change (ETC).
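One way to make that boundary explicit is a small provider interface on the server. The sketch below is illustrative only: `TtsRequest`, `TtsProvider`, `FakeProvider`, and `generate` are hypothetical names that do not appear in this article's route handler, and the URL is a placeholder.

```typescript
// Hypothetical server-side boundary: the route handler depends only on this
// interface, so swapping Qwen-TTS for another engine never touches the client.
interface TtsRequest {
  text: string;
  voice: string;
  language: string;
}

interface TtsProvider {
  /** Identifier that could be mixed into the cache key so providers never share entries. */
  readonly id: string;
  /** Resolves to a URL the server can download the generated audio from. */
  synthesize(req: TtsRequest): Promise<string>;
}

// Stand-in provider for tests; a real one would call the vendor's HTTP API.
class FakeProvider implements TtsProvider {
  readonly id = "fake-tts";
  calls = 0;
  async synthesize(req: TtsRequest): Promise<string> {
    this.calls += 1;
    return `https://example.invalid/${this.id}/${encodeURIComponent(req.voice)}.wav`;
  }
}

// The only thing the client ever sees is the { url } payload.
async function generate(provider: TtsProvider, req: TtsRequest): Promise<{ url: string }> {
  return { url: await provider.synthesize(req) };
}
```

Because the client contract is just `{ url }`, a provider change stays a server-only change under this assumption.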

1. Server: generation, validation, cache, hiding the key

In a Route Handler, validate the boundary with Zod and cache by content hash. For the same manuscript, voice, and language, audio is never generated twice. This idempotency is also the biggest cost saving.

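// app/api/tts/route.ts: server-only. The API key never leaves this file.
import { z } from "zod";
import { createHash } from "node:crypto";
import { put, head } from "@vercel/blob"; // the storage backend is up to you (S3 etc. also works)

// Fix the allowed voices and languages as types (never pass invalid values to the API; this prevents billing accidents and exceptions)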
const VOICES = ["Cherry", "Ethan", "Serena", "Jennifer"] as const;
const LANGS = ["Japanese", "English", "Chinese", "Korean"] as const;

const Body = z.object({
  text: z.string().min(1).max(4000),
  voice: z.enum(VOICES).default("Cherry"),
  language_type: z.enum(LANGS).default("Japanese"),
});

/** A deterministic key determined uniquely by the input. Same input → same key → no regeneration (idempotent). */
function cacheKey(input: z.infer<typeof Body>): string {
  return createHash("sha256")
    .update(`qwen3-tts-flash\0${input.voice}\0${input.language_type}\0${input.text}`)
    .digest("hex");
}

export async function POST(req: Request): Promise<Response> {
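  // Malformed JSON makes req.json() throw; map it to null so it is rejected with 400 below
  const parsed = Body.safeParse(await req.json().catch(() => null));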
  if (!parsed.success) {
    return Response.json({ error: parsed.error.flatten() }, { status: 400 });
  }
  const key = cacheKey(parsed.data);
  const blobPath = `tts/${key}.wav`;

  // 1) On a cache hit, return immediately (no generation, no billing)
  const cached = await head(blobPath).catch(() => null);
  if (cached) return Response.json({ url: cached.url, cached: true });

  // 2) Only on a miss, generate with Qwen-TTS (the key comes from a server environment variable)
  const upstream = await fetch(
    "https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation",
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.DASHSCOPE_API_KEY!}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model: "qwen3-tts-flash", input: parsed.data }),
    },
  );
  if (!upstream.ok) return Response.json({ error: "tts_failed" }, { status: 502 });

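  // 3) The generated URL expires after 24 hours, so copy the audio to our own storage right away (important)
  const { output } = (await upstream.json()) as { output: { audio: { url: string } } };
  const audioRes = await fetch(output.audio.url);
  if (!audioRes.ok) return Response.json({ error: "tts_failed" }, { status: 502 });
  const saved = await put(blobPath, await audioRes.arrayBuffer(), {
    access: "public",
    contentType: "audio/wav",
    addRandomSuffix: false, // keep the pathname deterministic so head() can find it again
  });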

  return Response.json({ url: saved.url, cached: false });
}

Three points matter here: ① the key stays on the server, ② an idempotent cache keyed by content hash (lower cost and faster responses), and ③ the temporary URL, which expires after 24 hours, is copied into your own storage. For Qwen-TTS pricing and model details, see the production-operations guide.
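The idempotency claim can be checked in isolation. The following is a minimal sketch that reuses the same recipe as `cacheKey` (model name, voice, language, and text joined by NUL bytes, then SHA-256); `TtsInput` and `demoCacheKey` are hypothetical names introduced only for this demonstration.

```typescript
import { createHash } from "node:crypto";

interface TtsInput {
  text: string;
  voice: string;
  language_type: string;
}

// Join the fields with NUL bytes so field boundaries cannot be confused,
// then hash. The model name is part of the key, so a model change yields new keys.
function demoCacheKey(model: string, input: TtsInput): string {
  return createHash("sha256")
    .update([model, input.voice, input.language_type, input.text].join("\0"))
    .digest("hex");
}

const base: TtsInput = { text: "Hello", voice: "Cherry", language_type: "English" };

const first = demoCacheKey("qwen3-tts-flash", base);
const again = demoCacheKey("qwen3-tts-flash", { ...base });
const otherVoice = demoCacheKey("qwen3-tts-flash", { ...base, voice: "Ethan" });

console.log(first === again);      // true: identical input, identical key
console.log(first === otherVoice); // false: changing the voice changes the key
console.log(first.length);         // 64 (hex characters of a SHA-256 digest)
```

Any field that affects the audio belongs in the key. If you later add a parameter such as speed or format on the server, it would need to be hashed too, otherwise different outputs would collide on one cache entry.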

2. Client: a WCAG 2.2-compliant player

This is the centerpiece of the article. Using HTML5 <audio> as the base, layer an accessible custom UI on top. Requirements: no autoplay, every control operable by keyboard, state announced to screen readers, respect for prefers-reduced-motion, and adjustable playback speed.

"use client";

import { useCallback, useEffect, useId, useRef, useState } from "react";
import { Play, Pause, Loader2, AlertCircle } from "lucide-react";
import { cn } from "@/lib/utils";

type Status = "idle" | "loading" | "ready" | "playing" | "paused" | "error";

interface Props {
  /** The manuscript to read aloud. Always show it on screen as well (WCAG 1.2.1). */
  text: string;
  voice?: "Cherry" | "Ethan" | "Serena" | "Jennifer";
  language?: "Japanese" | "English" | "Chinese" | "Korean";
}

const SPEEDS = [0.75, 1, 1.25, 1.5, 2] as const;

export function ReadAloud({ text, voice = "Cherry", language = "Japanese" }: Props) {
  const audioRef = useRef<HTMLAudioElement>(null);
  const [status, setStatus] = useState<Status>("idle");
  const [progress, setProgress] = useState(0); // 0..1
  const [duration, setDuration] = useState(0);
  const [speed, setSpeed] = useState<(typeof SPEEDS)[number]>(1);
  const statusId = useId();

  // Request generation only on the first playback (audio never starts on its own = WCAG 1.4.2)
  const ensureLoaded = useCallback(async (): Promise<boolean> => {
    const el = audioRef.current;
    if (!el || el.src) return true;
    setStatus("loading");
    try {
      const res = await fetch("/api/tts", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ text, voice, language_type: language }),
      });
      if (!res.ok) throw new Error("tts_failed");
      const { url } = (await res.json()) as { url: string };
      el.src = url;
      setStatus("ready");
      return true;
    } catch {
      setStatus("error");
      return false;
    }
  }, [text, voice, language]);

  const toggle = useCallback(async () => {
    const el = audioRef.current;
    if (!el) return;
    if (el.paused) {
      if (!(await ensureLoaded())) return;
      await el.play().catch(() => setStatus("error")); // play only in response to a user action
    } else {
      el.pause();
    }
  }, [ensureLoaded]);

  const onSeek = useCallback((value: number) => {
    const el = audioRef.current;
    if (el && Number.isFinite(el.duration)) el.currentTime = value * el.duration;
  }, []);

  // Apply the playback speed to the audio element
  useEffect(() => {
    if (audioRef.current) audioRef.current.playbackRate = speed;
  }, [speed]);

  const isBusy = status === "loading";
  const isPlaying = status === "playing";
  const label = isPlaying ? "Pause" : status === "loading" ? "Loading" : "Play read-aloud";

  return (
    <section
      aria-label="Read article aloud"
      className="flex items-center gap-3 rounded-xl border bg-card p-3"
    >
      {/* Play/pause: satisfies name/role/value (WCAG 4.1.2). aria-pressed expresses the state */}
      <button
        type="button"
        onClick={toggle}
        aria-busy={isBusy} // stays enabled on error so that another press retries the request
        aria-pressed={isPlaying}
        aria-label={label}
        aria-describedby={statusId}
        className={cn(
          "inline-flex size-11 shrink-0 items-center justify-center rounded-full",
          "bg-primary text-primary-foreground transition-colors",
          "focus-visible:outline-none focus-visible:ring-2 focus-visible:ring-ring focus-visible:ring-offset-2",
          "disabled:opacity-50",
        )}
      >
        {isBusy ? (
          // Respect prefers-reduced-motion: motion-reduce stops the spin
          <Loader2 className="size-5 animate-spin motion-reduce:animate-none" aria-hidden />
        ) : status === "error" ? (
          <AlertCircle className="size-5" aria-hidden />
        ) : isPlaying ? (
          <Pause className="size-5" aria-hidden />
        ) : (
          <Play className="size-5 translate-x-0.5" aria-hidden />
        )}
      </button>

      {/* Seek: a native range input is keyboard-operable. aria-valuetext exposes the time to screen readers */}
      <input
        type="range"
        min={0}
        max={1}
        step={0.01}
        value={progress}
        onChange={(e) => onSeek(Number(e.target.value))}
        aria-label="Playback position"
        aria-valuetext={`${formatTime(progress * duration)} / ${formatTime(duration)}`}
        className="h-1.5 flex-1 cursor-pointer accent-primary"
      />

      <span className="tabular-nums text-xs text-muted-foreground" aria-hidden>
        {formatTime(progress * duration)}
      </span>

      {/* Playback speed: a select tied to a label (keyboard-operable, has a name) */}
      <label className="text-xs text-muted-foreground">
        <span className="sr-only">Playback speed</span>
        <select
          value={speed}
          onChange={(e) => setSpeed(Number(e.target.value) as (typeof SPEEDS)[number])}
          className="rounded-md border bg-background px-1.5 py-1 focus-visible:ring-2 focus-visible:ring-ring"
        >
          {SPEEDS.map((s) => (
            <option key={s} value={s}>{s}×</option>
          ))}
        </select>
      </label>

      {/* State announcement: reaches screen readers without relying on vision (WCAG 4.1.3) */}
      <span id={statusId} role="status" aria-live="polite" className="sr-only">
        {status === "loading" && "Preparing audio"}
        {status === "playing" && "Playing"}
        {status === "paused" && "Paused"}
        {status === "error" && "Failed to load the audio"}
      </span>

      <audio
        ref={audioRef}
        preload="none" // no download until needed (saves cost and bandwidth); with no autoplay attribute, nothing plays on its own
        onPlay={() => setStatus("playing")}
        onPause={() => audioRef.current && !audioRef.current.ended && setStatus("paused")}
        onTimeUpdate={(e) => {
          const el = e.currentTarget;
          if (Number.isFinite(el.duration)) setProgress(el.currentTime / el.duration);
        }}
        onLoadedMetadata={(e) => setDuration(e.currentTarget.duration)}
        onEnded={() => { setStatus("ready"); setProgress(0); }}
        onError={() => setStatus("error")}
      />
    </section>
  );
}

function formatTime(sec: number): string {
  if (!Number.isFinite(sec)) return "0:00";
  const m = Math.floor(sec / 60);
  const s = Math.floor(sec % 60);
  return `${m}:${s.toString().padStart(2, "0")}`;
}

3. The crux of accessibility (the WCAG this code satisfies)

The player above satisfies the following WCAG 2.2 success criteria by design.

Success criterion	How it's satisfied
1.2.1 Audio-only and Video-only (Prerecorded)	show the same text that is read aloud on screen (don't make it audio-only). The player is merely an aid.
1.4.2 Audio Control	no autoplay (no `autoplay` attribute; playback starts only from a user action, and `preload="none"` avoids even downloading). Provide play/pause/speed.
2.1.1 Keyboard	all operations keyboard-operable with native `button` / `input[range]` / `select`.
2.4.7 Focus Visible	a clear focus ring with `focus-visible:ring-*`.
4.1.2 Name, Role, Value	give meaning to the custom UI with `aria-label`, `aria-pressed`, `aria-valuetext`.
4.1.3 Status Messages	announce loading/error with `role="status"` + `aria-live="polite"` without stealing focus.
Consideration for motion	the spinner respects `prefers-reduced-motion` with `motion-reduce:animate-none`.

The most important is 1.2.1: however polished the player is, the point is that no information should be available only as audio. Read-aloud is a supplement to the text, not a replacement for it.
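`aria-valuetext` is the string a screen reader speaks for the slider, and a clock-style value such as `1:05 / 3:20` may be read differently depending on the screen reader. A possible refinement, not part of the player above, is to spell the units out. `spokenTime` and `seekValueText` are hypothetical helper names, and the wording is English-only; a real UI would localize it.

```typescript
// Hypothetical helper: turn seconds into a phrase with explicit units.
function spokenTime(sec: number): string {
  if (!Number.isFinite(sec) || sec < 0) return "0 seconds";
  const total = Math.floor(sec);
  const m = Math.floor(total / 60);
  const s = total % 60;
  const minutes = m === 1 ? "1 minute" : `${m} minutes`;
  const seconds = s === 1 ? "1 second" : `${s} seconds`;
  if (m === 0) return seconds;
  return s === 0 ? minutes : `${minutes} ${seconds}`;
}

// Hypothetical helper: the full string to pass to aria-valuetext.
// progress is the 0..1 slider value, duration is in seconds.
function seekValueText(progress: number, duration: number): string {
  return `${spokenTime(progress * duration)} of ${spokenTime(duration)}`;
}

console.log(seekValueText(0.5, 130)); // "1 minute 5 seconds of 2 minutes 10 seconds"
```

The visible timestamp can stay in the compact `m:ss` form, since it is already hidden from assistive technology with `aria-hidden`.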

4. Performance and cost

Generation is lazy: preload="none" plus generation only on the first playback. Merely opening the page incurs no charge.
Idempotent cache by content hash: the same article is generated only once. From the second listener on, playback starts instantly and at no extra generation cost (section 1).
Intent-prediction prefetch (optional): add onPointerEnter to the play button and start generating before the press, which improves perceived speed. This is a trade-off against wasted billing, so limit it to high-traffic articles.
Deliver the base with RSC: only the player is "use client", and the article body stays a server component, which keeps the bundle small (Next.js 16 cache design).
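If you do add the optional prefetch, pointer-enter events can fire many times before the press, so the request should be de-duplicated. The sketch below shows one way to do that under stated assumptions: `createPrefetcher` is a hypothetical helper, and the loader here is a stand-in rather than the real `/api/tts` call.

```typescript
// Hypothetical helper: wrap a loader so that repeated calls share one request.
// A failed attempt is forgotten, so the next call can retry.
function createPrefetcher<T>(load: () => Promise<T>): () => Promise<T> {
  let inFlight: Promise<T> | null = null;
  return () => {
    if (inFlight === null) {
      inFlight = load().catch((err: unknown) => {
        inFlight = null; // allow a retry after a failure
        throw err;
      });
    }
    return inFlight;
  };
}

// Usage sketch: count how often the stand-in generation request really runs.
let requests = 0;
const prefetch = createPrefetcher(async () => {
  requests += 1;
  return "https://example.invalid/tts/abc.wav"; // placeholder URL
});

void prefetch(); // pointer enters the button
void prefetch(); // pointer enters again
void prefetch(); // the user finally presses play
console.log(requests); // 1
```

Pointer events never fire for keyboard users, so they would only benefit if the same function were also called from a focus handler; whether that extra billing risk is worth it is a judgment call.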

5. Type safety, errors, resilience

The boundary is Zod: the server runs safeParse on the input, and the client pins voice/language with union types. Make illegal states unrepresentable (the discipline of type safety).
Don't swallow errors: a generation failure becomes status="error", shown explicitly in the UI and retriable. It is also announced to screen readers via aria-live.
State is a finite set: express Status as the union idle | loading | ready | playing | paused | error and branch the UI exhaustively.
Testability: verify the player's state transitions with Playwright, checking keyboard operation and aria attributes (plus a WCAG sweep with @axe-core/playwright).
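The "finite set, branched exhaustively" idea can be sketched as a pure transition function. This is an illustration, not the player's actual code: the `PlayerEvent` names and `nextStatus` are hypothetical, and the transition table (including a retry from `error`) is an assumption about the intended behavior.

```typescript
type Status = "idle" | "loading" | "ready" | "playing" | "paused" | "error";

// Hypothetical event names; the player in this article drives status from <audio> events.
type PlayerEvent = "request" | "loaded" | "play" | "pause" | "ended" | "fail";

// Pure transition function: invalid combinations leave the status unchanged.
function nextStatus(status: Status, event: PlayerEvent): Status {
  if (event === "fail") return "error"; // any state can fail
  switch (status) {
    case "idle":
    case "error":
      return event === "request" ? "loading" : status; // error allows a retry
    case "loading":
      return event === "loaded" ? "ready" : status;
    case "ready":
    case "paused":
      return event === "play" ? "playing" : status;
    case "playing":
      if (event === "pause") return "paused";
      if (event === "ended") return "ready";
      return status;
    default: {
      // Compile-time exhaustiveness check: adding a Status member without a
      // matching case turns this assignment into a type error.
      const unreachable: never = status;
      return unreachable;
    }
  }
}

const events: PlayerEvent[] = ["request", "loaded", "play", "pause", "play", "ended"];
const finalStatus = events.reduce<Status>(nextStatus, "idle");
console.log(finalStatus); // "ready"
```

A function like this can be unit-tested without a browser, which leaves the Playwright tests free to focus on keyboard operation and the rendered aria attributes.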

6. Conclusion: an implementation checklist

Responsibility split: server (generation, key, cache) / client (playback, a11y).
Server: Zod validation, idempotent cache by content hash, hiding the key, immediate stash of the 24h URL.
Client: no autoplay, all operations keyboard-operable, state announced with aria-live, respecting prefers-reduced-motion.
WCAG: 1.2.1 (text alongside) / 1.4.2 (no autoplay) / 2.1.1 / 2.4.7 / 4.1.2 / 4.1.3.
Cost: lazy generation + cache = prevent "billing just by opening."
Type safety: state is a finite sum, the boundary is Zod, errors are visualized + announced.

A read-aloud feature isn't accessible just because it has been added. Only with this design (no autoplay, fully operable by keyboard, state announced) does it become a feature everyone can use. I've built frontends that balance UX, a11y, and cost on learning platforms and content infrastructure, including a subscription learning platform. If you want to deliver your content in a form anyone can listen to, quickly and cheaply, I can support you end-to-end from design through implementation and testing. Feel free to reach out.

References (official documentation)

Qwen-TTS speech synthesis (Alibaba Cloud Model Studio) — model, voice, API parameters
WAI-ARIA Authoring Practices — a11y patterns for custom UI
MDN: HTMLAudioElement — <audio> events and properties
WCAG 2.2 Understanding — the intent and implementation of each success criterion

Related articles

Voice-AI production-implementation guide [2026]: the big picture and tech selection of speech recognition (STT) × speech synthesis (TTS) × voice agents

Qwen-TTS / Qwen3-TTS-Flash Production Guide: A Speech-Synthesis Design for Choosing Between the DashScope API and OSS Across 49 Timbres, 10 Languages, Chinese Dialects, and Voice Cloning

Qwen-TTS real-time voice-agent implementation guide: WebSocket streaming, browser playback, and barge-in (interruption)

Qwen-TTS voice-cloning production guide: self-hosting the OSS version (Apache-2.0), and the governance design of consent, disclosure, and provenance