友田 陽大

Next.js × Qwen-TTS: implementing an accessible 'read-article-aloud' player at production quality (WCAG 2.2, type-safe, cache)

A guide to building an accessible audio player that reads articles and documents aloud, using Next.js 16 and Qwen-TTS. With type-safe, working code, it covers server-side TTS generation (Zod validation, content-hash caching, keeping the API key hidden) and a WCAG 2.2-compliant React player (keyboard operation, aria-live, no autoplay, prefers-reduced-motion, focus management).

Published
Reading time
10 min read
Author
友田 陽大

"I want people to be able to listen to my articles." Read-aloud audio helps on two fronts: accessibility, and the convenience of listening while doing something else. Implemented naively, though, it backfires: autoplay talks over the screen reader, mouse-only controls lock out keyboard users, and nothing tells the user whether the audio is loading or has failed. A well-meant feature easily turns into a WCAG violation.

This article is a guide to building a genuinely accessible read-aloud player at production quality with Next.js 16 and Qwen-TTS. It covers server-side TTS generation (keeping the API key hidden, caching) and a WCAG 2.2-compliant React player, using type-safe, working code. The a11y fundamentals are covered in the Next.js × React accessibility (WCAG 2.2) guide.

Ground rules for this article: the Qwen-TTS API details follow the official documentation (as of June 2026). The code assumes Next.js 16 / React 19 and is written to be reusable as-is. The API key lives in an environment variable and stays on the server; never expose it to the browser.


0. Design: split responsibility into two

A read-aloud feature splits into two responsibilities of different kinds. Mixing them leads to leaked keys or tight coupling, so keep them apart (the single-responsibility principle, SRP).

[server]  article text → synthesize with Qwen-TTS → cache by content hash → return a stable URL (holds the key)
[client]  receive the stable URL → play accessibly (no autoplay, keyboard-operable, state announced)
  • Server = management of generation, cost, and secrets (key, cache, validation).
  • Client = playback and accessibility (operation, state, focus).

With this split, swapping the TTS provider for another one leaves the client untouched (ETC: easier to change).
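
As a rough sketch of what this boundary could look like in code, the server can depend on a small provider interface while the client only ever sees a URL. The names TtsProvider, AudioStore, and makeTtsHandler are hypothetical, introduced here for illustration only, and the cache key is left unhashed for brevity.

```typescript
// Hypothetical sketch: none of these names come from Next.js, Qwen-TTS, or any library.
interface TtsRequest {
  text: string;
  voice: string;
  language: string;
}

/** Anything that can turn a request into audio bytes (Qwen-TTS or a replacement). */
interface TtsProvider {
  readonly id: string; // part of the cache key, so providers never share entries
  synthesize(req: TtsRequest): Promise<ArrayBuffer>;
}

/** Where finished audio lives; returns a stable URL. */
interface AudioStore {
  get(key: string): Promise<string | null>;
  put(key: string, audio: ArrayBuffer): Promise<string>;
}

/** The only shape the client depends on. */
interface TtsResult {
  url: string;
  cached: boolean;
}

function makeTtsHandler(provider: TtsProvider, store: AudioStore) {
  return async (req: TtsRequest): Promise<TtsResult> => {
    // A real handler would hash this; JSON keeps the sketch dependency-free.
    const key = JSON.stringify([provider.id, req.voice, req.language, req.text]);
    const hit = await store.get(key);
    if (hit !== null) return { url: hit, cached: true };
    const audio = await provider.synthesize(req);
    return { url: await store.put(key, audio), cached: false };
  };
}
```

Swapping providers then means passing a different TtsProvider to makeTtsHandler; the response shape, and therefore the player, stays the same.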


1. Server: generation, validation, cache, hiding the key

In a Route Handler, validate the input at the boundary with Zod and cache by content hash. The same text, voice, and language never trigger a second generation. That idempotency is the single biggest cost saving.

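// app/api/tts/route.ts: server-only. The API key never leaves this file.
import { z } from "zod";
import { createHash } from "node:crypto";
import { put, head } from "@vercel/blob"; // any storage works (S3, etc.)

// Pin the allowed voices and languages in the type (invalid values never reach the API, which prevents billing accidents and exceptions)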
const VOICES = ["Cherry", "Ethan", "Serena", "Jennifer"] as const;
const LANGS = ["Japanese", "English", "Chinese", "Korean"] as const;

const Body = z.object({
  text: z.string().min(1).max(4000),
  voice: z.enum(VOICES).default("Cherry"),
  language_type: z.enum(LANGS).default("Japanese"),
});

/** Deterministic key fully determined by the input. Same input → same key → no regeneration (idempotent). */
function cacheKey(input: z.infer<typeof Body>): string {
  return createHash("sha256")
    .update(`qwen3-tts-flash\0${input.voice}\0${input.language_type}\0${input.text}`)
    .digest("hex");
}

export async function POST(req: Request): Promise<Response> {
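  // A malformed JSON body becomes null here, so it is rejected with 400 below instead of throwing
  const parsed = Body.safeParse(await req.json().catch(() => null));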
  if (!parsed.success) {
    return Response.json({ error: parsed.error.flatten() }, { status: 400 });
  }
  const key = cacheKey(parsed.data);
  const blobPath = `tts/${key}.wav`;

  // 1) On a cache hit, return immediately (no generation, no charge)
  const cached = await head(blobPath).catch(() => null);
  if (cached) return Response.json({ url: cached.url, cached: true });

  // 2) Only on a miss, generate with Qwen-TTS (the key comes from a server environment variable)
  const upstream = await fetch(
    "https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation",
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.DASHSCOPE_API_KEY!}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model: "qwen3-tts-flash", input: parsed.data }),
    },
  );
  if (!upstream.ok) return Response.json({ error: "tts_failed" }, { status: 502 });

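  // 3) The generated URL expires after 24 hours, so copy the audio to your own storage right away (important)
  const { output } = (await upstream.json()) as { output: { audio: { url: string } } };
  const audio = await fetch(output.audio.url).then((r) => r.arrayBuffer());
  const saved = await put(blobPath, audio, {
    access: "public",
    contentType: "audio/wav",
    addRandomSuffix: false, // keep the path deterministic so head(blobPath) finds it next time
  });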

  return Response.json({ url: saved.url, cached: false });
}

Three points matter here: ① the key stays on the server; ② the content-hash cache makes generation idempotent (lower cost, faster responses); ③ the upstream URL is temporary and expires after 24 hours, so copy the audio into your own storage right away. For Qwen-TTS pricing and model details, see the production-operations guide.
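
One detail of cacheKey worth spelling out is the \0 between fields. The sketch below, using a hypothetical keyMaterial helper that mirrors the template string fed to the hash, shows what the separator guards against. With voice and language restricted to enums as above, a collision is unlikely in practice; the separator keeps the key unambiguous if those fields ever become free-form.

```typescript
// Hypothetical helper for illustration: builds the string that would be hashed.
function keyMaterial(
  model: string,
  voice: string,
  language: string,
  text: string,
  separator: string,
): string {
  return [model, voice, language, text].join(separator);
}

// Without a separator, two different inputs collapse into the same string,
// so they would share one cache entry and one audio file.
const collides =
  keyMaterial("m", "ab", "c", "hi", "") === keyMaterial("m", "a", "bc", "hi", "");
console.log(collides); // true

// With "\0" between fields, the field boundaries stay part of the key.
const stillCollides =
  keyMaterial("m", "ab", "c", "hi", "\0") === keyMaterial("m", "a", "bc", "hi", "\0");
console.log(stillCollides); // false
```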


2. Client: a WCAG 2.2-compliant player

This is the centerpiece of the article. Start from the HTML5 <audio> element and layer an accessible custom UI on top. Requirements: no autoplay, every control operable by keyboard, state changes announced to screen readers, prefers-reduced-motion respected, and adjustable playback speed.

"use client";

import { useCallback, useEffect, useId, useRef, useState } from "react";
import { Play, Pause, Loader2, AlertCircle } from "lucide-react";
import { cn } from "@/lib/utils";

type Status = "idle" | "loading" | "ready" | "playing" | "paused" | "error";

interface Props {
  /** Text to read aloud. Always show the same text on screen as well (WCAG 1.2.1). */
  text: string;
  voice?: "Cherry" | "Ethan" | "Serena" | "Jennifer";
  language?: "Japanese" | "English" | "Chinese" | "Korean";
}

const SPEEDS = [0.75, 1, 1.25, 1.5, 2] as const;

export function ReadAloud({ text, voice = "Cherry", language = "Japanese" }: Props) {
  const audioRef = useRef<HTMLAudioElement>(null);
  const [status, setStatus] = useState<Status>("idle");
  const [progress, setProgress] = useState(0); // 0..1
  const [duration, setDuration] = useState(0);
  const [speed, setSpeed] = useState<(typeof SPEEDS)[number]>(1);
  const statusId = useId();

  // Request generation only on the first play (never play automatically: WCAG 1.4.2)
  const ensureLoaded = useCallback(async (): Promise<boolean> => {
    const el = audioRef.current;
    if (!el || el.src) return true;
    setStatus("loading");
    try {
      const res = await fetch("/api/tts", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ text, voice, language_type: language }),
      });
      if (!res.ok) throw new Error("tts_failed");
      const { url } = (await res.json()) as { url: string };
      el.src = url;
      setStatus("ready");
      return true;
    } catch {
      setStatus("error");
      return false;
    }
  }, [text, voice, language]);

  const toggle = useCallback(async () => {
    const el = audioRef.current;
    if (!el) return;
    if (el.paused) {
      if (!(await ensureLoaded())) return;
      await el.play().catch(() => setStatus("error")); // play only in response to a user action
    } else {
      el.pause();
    }
  }, [ensureLoaded]);

  const onSeek = useCallback((value: number) => {
    const el = audioRef.current;
    if (el && Number.isFinite(el.duration)) el.currentTime = value * el.duration;
  }, []);

  // Apply the playback speed to the audio element
  useEffect(() => {
    if (audioRef.current) audioRef.current.playbackRate = speed;
  }, [speed]);

  const isBusy = status === "loading";
  const isPlaying = status === "playing";
  // Button label: "Pause" / "Loading" / "Play read-aloud"
  const label = isPlaying ? "一時停止" : status === "loading" ? "読み込み中" : "読み上げを再生";

  return (
    <section
      aria-label="記事の音声読み上げ" // "Read article aloud"
      className="flex items-center gap-3 rounded-xl border bg-card p-3"
    >
      {/* Play/pause: satisfies name/role/value (WCAG 4.1.2). aria-pressed conveys the state */}
      <button
        type="button"
        onClick={toggle}
        // Intentionally not disabled on error, so the next click can retry
        aria-pressed={isPlaying}
        aria-label={label}
        aria-describedby={statusId}
        className={cn(
          "inline-flex size-11 shrink-0 items-center justify-center rounded-full",
          "bg-primary text-primary-foreground transition-colors",
          "focus-visible:outline-none focus-visible:ring-2 focus-visible:ring-ring focus-visible:ring-offset-2",
          "disabled:opacity-50",
        )}
      >
        {isBusy ? (
          // Respect prefers-reduced-motion: motion-reduce stops the spin
          <Loader2 className="size-5 animate-spin motion-reduce:animate-none" aria-hidden />
        ) : status === "error" ? (
          <AlertCircle className="size-5" aria-hidden />
        ) : isPlaying ? (
          <Pause className="size-5" aria-hidden />
        ) : (
          <Play className="size-5 translate-x-0.5" aria-hidden />
        )}
      </button>

      {/* Seek: a native range input is keyboard-operable. aria-valuetext reads out the time */}
      <input
        type="range"
        min={0}
        max={1}
        step={0.01}
        value={progress}
        onChange={(e) => onSeek(Number(e.target.value))}
        aria-label="再生位置" // "Playback position"
        aria-valuetext={`${formatTime(progress * duration)} / ${formatTime(duration)}`}
        className="h-1.5 flex-1 cursor-pointer accent-primary"
      />

      <span className="tabular-nums text-xs text-muted-foreground" aria-hidden>
        {formatTime(progress * duration)}
      </span>

      {/* Playback speed: a select tied to its label (keyboard-operable, named). The label text means "Playback speed" */}
      <label className="text-xs text-muted-foreground">
        <span className="sr-only">再生速度</span>
        <select
          value={speed}
          onChange={(e) => setSpeed(Number(e.target.value) as (typeof SPEEDS)[number])}
          className="rounded-md border bg-background px-1.5 py-1 focus-visible:ring-2 focus-visible:ring-ring"
        >
          {SPEEDS.map((s) => (
            <option key={s} value={s}>{s}×</option>
          ))}
        </select>
      </label>

      {/* Status announcements: reach screen readers without relying on visuals (WCAG 4.1.3) */}
      <span id={statusId} role="status" aria-live="polite" className="sr-only">
        {status === "loading" && "音声を準備しています" /* "Preparing audio" */}
        {status === "playing" && "再生中" /* "Playing" */}
        {status === "paused" && "一時停止中" /* "Paused" */}
        {status === "error" && "音声の読み込みに失敗しました" /* "Failed to load audio" */}
      </span>

      <audio
        ref={audioRef}
        preload="none" // no automatic download (saves cost and bandwidth, prevents autoplay)
        onPlay={() => setStatus("playing")}
        onPause={() => audioRef.current && !audioRef.current.ended && setStatus("paused")}
        onTimeUpdate={(e) => {
          const el = e.currentTarget;
          if (Number.isFinite(el.duration)) setProgress(el.currentTime / el.duration);
        }}
        onLoadedMetadata={(e) => setDuration(e.currentTarget.duration)}
        onEnded={() => { setStatus("ready"); setProgress(0); }}
        onError={() => setStatus("error")}
      />
    </section>
  );
}

function formatTime(sec: number): string {
  if (!Number.isFinite(sec)) return "0:00";
  const m = Math.floor(sec / 60);
  const s = Math.floor(sec % 60);
  return `${m}:${s.toString().padStart(2, "0")}`;
}

3. The crux of accessibility (the WCAG this code satisfies)

The player above satisfies the following WCAG 2.2 success criteria by design.

Success criterion | How it's satisfied
1.2.1 Audio-only and Video-only (Prerecorded) | Show the text being read aloud on screen as well, so nothing is audio-only. The player is merely an aid.
1.4.2 Audio Control | No autoplay (preload="none", and playback starts only from a user action). Play, pause, and speed controls are provided.
2.1.1 Keyboard | Every control is a native button / input[range] / select, so all operations work from the keyboard.
2.4.7 Focus Visible | A clear focus ring via focus-visible:ring-*.
4.1.2 Name, Role, Value | aria-label, aria-pressed, and aria-valuetext give the custom UI its name and state.
4.1.3 Status Messages | Loading and error states are announced via role="status" + aria-live="polite" without moving focus.
Consideration for motion | The spinner respects prefers-reduced-motion via motion-reduce:animate-none.

The most important is 1.2.1. However polished the player is, the point is that no information should be available only as audio. Read-aloud is an alternative to the text, not a replacement for it.


4. Performance and cost

  • Lazy generation: preload="none", and generation happens only on the first play. Merely opening the page incurs no charge.
  • Idempotent cache by content hash: each article is generated only once. From the second listener on, playback starts immediately and costs nothing (section 1).
  • Intent-prediction prefetch (optional): attach onPointerEnter to the play button and start generating before the click to improve perceived speed. The trade-off is paying for audio nobody plays, so limit it to high-traffic articles.
  • Keep the page in RSC: only the player is "use client"; the article body stays a server component, which keeps the bundle small (Next.js 16 cache design).
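
If you do add the optional prefetch, hover and click should share one request instead of firing two. A minimal sketch of that idea follows; createPrefetcher is a hypothetical helper, not part of React or Next.js, and it assumes the load function is something like the fetch to /api/tts.

```typescript
// Hypothetical helper: wraps a loader so repeated calls share one in-flight request.
function createPrefetcher<T>(load: () => Promise<T>): () => Promise<T> {
  let inFlight: Promise<T> | null = null;
  return () => {
    if (inFlight !== null) return inFlight; // hover already started it; reuse
    const pending = load().catch((err: unknown) => {
      inFlight = null; // forget the failure so a later click can retry
      throw err;
    });
    inFlight = pending;
    return pending;
  };
}

// Example wiring with a stand-in loader (the URL is a placeholder):
let requests = 0;
const prefetchAudio = createPrefetcher(async () => {
  requests += 1;
  return "https://example.test/audio.wav";
});

const fromHover = prefetchAudio(); // e.g. called from onPointerEnter
const fromClick = prefetchAudio(); // e.g. awaited inside the click handler
console.log(requests); // 1
console.log(fromHover === fromClick); // true
```

In the hover handler the returned promise should still get a no-op catch, since a failed prefetch is not something to surface before the user has pressed play.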

5. Type safety, errors, resilience

  • Zod at the boundary: the server validates input with safeParse, and the client pins voice/language to union types. Make illegal states unrepresentable (the discipline of type safety).
  • Don't swallow errors: a generation failure sets status="error", which is shown in the UI, announced to screen readers via aria-live, and can be retried.
  • State is a finite set: model Status as the union idle | loading | ready | playing | paused | error and branch the UI exhaustively.
  • Testability: verify the player's state transitions with Playwright, checking keyboard operation and aria attributes (plus a WCAG sweep with @axe-core).
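
One way to make "branch exhaustively" something the compiler enforces is a switch with a never check. The sketch below maps each Status to its screen-reader announcement; announcement and assertNever are hypothetical names, and the English strings stand in for the Japanese ones used in the player.

```typescript
type Status = "idle" | "loading" | "ready" | "playing" | "paused" | "error";

// Reaching this at runtime means a Status value was not handled.
// At compile time, passing a non-never value is a type error.
function assertNever(value: never): never {
  throw new Error(`Unhandled status: ${String(value)}`);
}

/** Text for the role="status" region; an empty string means nothing to announce. */
function announcement(status: Status): string {
  switch (status) {
    case "idle":
    case "ready":
      return "";
    case "loading":
      return "Preparing audio";
    case "playing":
      return "Playing";
    case "paused":
      return "Paused";
    case "error":
      return "Failed to load audio";
    default:
      // If a new member is added to Status and not handled above,
      // this line stops compiling.
      return assertNever(status);
  }
}
```

Adding a seventh state such as "buffering" to Status would then fail the build until announcement handles it, rather than silently announcing nothing.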

6. Conclusion: an implementation checklist

  • Responsibility split: server (generation, key, cache) / client (playback, a11y).
  • Server: Zod validation, idempotent cache by content hash, hidden key, audio copied off the 24-hour URL immediately.
  • Client: no autoplay, every control keyboard-operable, state announced via aria-live, prefers-reduced-motion respected.
  • WCAG: 1.2.1 (text alongside) / 1.4.2 (no autoplay) / 2.1.1 / 2.4.7 / 4.1.2 / 4.1.3.
  • Cost: lazy generation + cache = no "billing just by opening the page."
  • Type safety: state is a finite union, Zod guards the boundary, errors are shown and announced.

Adding a read-aloud feature does not make a site accessible by itself. Only with this design (no autoplay, fully keyboard-operable, state announced) does it become usable by everyone. I have built frontends that balance UX, a11y, and cost on learning platforms and content platforms (a subscription learning platform). "Deliver your content in a form anyone can listen to, fast and cheap": I can support you end to end, from design through implementation and testing. Feel free to reach out.



友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.
