Type-safe structured output with Qwen3-8B-AWQ: vLLM guided decoding × Zod

The goal of this article

Asking an LLM to "return JSON" is not enough: sooner or later the output breaks in production. The usual culprits are a prefatory remark, a ```json fence, a trailing comma, or a brace cut off midway. In the Qwen3-8B-AWQ practical guide I touched lightly on response_format: json_object and Zod validation. This piece covers the topic thoroughly.

The aim is a double guard.

At the generation stage, "make it impossible to produce invalid JSON" (vLLM's structured output / guided decoding).
At the app boundary, "still don't trust it and validate" (Zod).

Both guards are driven by a single Zod schema, so the schema is never maintained in two places (DRY). This is a robust, practical way to pour a thinking LLM's output into the app as type-safe data.

Reliability disclosure: the vLLM structured-output API and its coexistence with thinking mode, as described here, are based on the official vLLM documentation (Structured Outputs) and the Qwen3-8B-AWQ model card. vLLM's API names change between versions, so always confirm against the version you run (see below).

The 30-second conclusion

Layer	What it does	Tool	Failure it guards
Generation constraint	Make invalid JSON impossible to generate	vLLM structured output (xgrammar)	Grammar breakage, prefaces, fences
Boundary validation	Don't trust the output, pass it through a type	Zod `parse`	Truncation, refusal, version differences, type mismatch
Source of truth	The schema in one place	Zod → JSON Schema	Double management of the schema

The essence: structured output (the constraint) yields well-formed JSON almost every time, but not 100% of the time (truncation, refusal, implementation differences). So the boundary Zod validation can't be skipped. The constraint reduces accidents; validation stops them. Do both.

Why `json_object` alone isn't enough

response_format: {type: "json_object"} only asks for a string that is valid JSON; it does not enforce the schema (which keys, which types). {"foo": 1} is also valid JSON. What you want in production is "JSON that follows this schema." What enforces that is guided decoding (structured output): a mechanism that, at decode time, only emits tokens the schema permits.
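
The gap between "valid JSON" and "JSON that follows the schema" fits in a few lines. The sketch below is dependency-free: a hand-written type guard stands in for the Zod schema used later in this article, and the Sentiment shape is a hypothetical example chosen only for illustration.

```typescript
// Hypothetical target shape (not from the article's codebase).
interface Sentiment {
  label: "positive" | "negative";
  score: number;
}

/** Hand-written structural check; in the article's design a Zod schema does this job. */
function isSentiment(value: unknown): value is Sentiment {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const labelOk = v["label"] === "positive" || v["label"] === "negative";
  return labelOk && typeof v["score"] === "number";
}

// Both strings are valid JSON, so JSON.parse succeeds for both.
const validJsonWrongShape: unknown = JSON.parse('{"foo": 1}');
const validJsonRightShape: unknown = JSON.parse('{"label": "positive", "score": 0.9}');

const wrongShapeAccepted = isSentiment(validJsonWrongShape); // false: valid JSON, wrong keys
const rightShapeAccepted = isSentiment(validJsonRightShape); // true
```

A json_object response can land in the first category; a schema constraint plus boundary validation is what keeps it out of the application.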

Server: enable structured output + thinking parsing

vLLM has structured output built in (the default backend is xgrammar). When it coexists with a thinking model, the principle is: thinking is free, only the final answer is constrained.

# Qwen3-8B-AWQ: reasoning parsing + structured output. The schema constraint applies only to the final answer.
vllm serve Qwen/Qwen3-8B-AWQ \
  --reasoning-parser qwen3 \
  --max-model-len 32768 \
  --port 8000
# Only for advanced cases where the constraint should also apply during thinking:
#   --structured-outputs-config.enable_in_reasoning=True (default is off: thinking stays free)

🔧 Mind version differences: vLLM has renamed its structured-output arguments across versions. The OpenAI-compatible response_format: json_schema is the most portable, so this piece uses it as the primary method. The vLLM-native form is extra_body={"structured_outputs": {"json": schema}} in newer versions and extra_body={"guided_json": schema} in older ones. Check the documentation for your vLLM version.
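
One way to keep that version difference out of the call sites is to build the constraint fields in a single helper. The following is a minimal sketch, not a vLLM API: the helper name constraintFields and the ConstraintStyle labels are invented here, and which spelling a given server accepts is an assumption to verify against its documentation. Because the Python extra_body merges its keys into the request body, the native forms appear as top-level fields.

```typescript
type JsonSchema = Record<string, unknown>;

/** The three spellings named above; labels are hypothetical, for this sketch only. */
type ConstraintStyle = "openai_json_schema" | "vllm_structured_outputs" | "vllm_guided_json";

/** Returns the request-body fields that carry the schema constraint. */
function constraintFields(
  style: ConstraintStyle,
  name: string,
  schema: JsonSchema,
): Record<string, unknown> {
  switch (style) {
    case "openai_json_schema": // portable, OpenAI-compatible form
      return { response_format: { type: "json_schema", json_schema: { name, schema, strict: true } } };
    case "vllm_structured_outputs": // vLLM-native, newer versions
      return { structured_outputs: { json: schema } };
    case "vllm_guided_json": // vLLM-native, older versions
      return { guided_json: schema };
  }
}

// Usage: spread the fields into the request body next to model and messages.
const exampleBody = {
  model: "Qwen/Qwen3-8B-AWQ",
  messages: [{ role: "user", content: "..." }],
  ...constraintFields("openai_json_schema", "Invoice", { type: "object" }),
};
```

Switching styles then becomes a one-argument change, for example driven by a configuration value per deployment.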

Client: make one Zod schema the source of truth

This is the core of the design. Write only one Zod schema, and derive from it both "the JSON Schema passed to vLLM (for the constraint)" and "the parser that validates the response." Not writing the schema twice = no divergence (DRY).

// lib/structured.ts: one Zod schema covers both the "constraint" and the "validation"
import OpenAI from "openai";
import { z } from "zod";

const client = new OpenAI({ baseURL: process.env.QWEN_BASE_URL, apiKey: "internal", timeout: 60_000 });

/** The structure to extract (the single source of truth). Descriptions also serve as hints to the model. */
export const Invoice = z.object({
  vendor: z.string().min(1).describe("Name of the billing company"),
  total: z.number().nonnegative().describe("Total including tax (yen)"),
  dueDate: z.string().regex(/^\d{4}-\d{2}-\d{2}$/).describe("Payment due date, YYYY-MM-DD"),
  currency: z.enum(["JPY", "USD", "EUR"]),
});
export type Invoice = z.infer<typeof Invoice>;

// Zod 4: z.toJSONSchema(Invoice) is built in. On Zod 3, use zod-to-json-schema.
const invoiceJsonSchema = z.toJSONSchema(Invoice);

/** Extracts invoice information from unstructured text as a typed value. */
export async function extractInvoice(text: string): Promise<Invoice> {
  const resp = await client.chat.completions.create({
    model: "Qwen/Qwen3-8B-AWQ",
    messages: [
      { role: "system", content: "Return only JSON that follows the specified schema from the invoice text. Do not fill fields by guessing." },
      { role: "user", content: text },
    ],
    temperature: 0.7, top_p: 0.8, // non-thinking mode is enough for extraction (fast, cheap)
    // ① Generation constraint: only JSON that follows this schema can be generated (OpenAI-compatible, portable)
    response_format: {
      type: "json_schema",
      json_schema: { name: "Invoice", schema: invoiceJsonSchema, strict: true },
    },
    presence_penalty: 1.5, // counters repetition in quantized models (standard OpenAI-compatible parameter)
    // vLLM extensions are sent at the top level. The Node SDK forwards unknown keys in the body, and because this is a spread, no type error is raised.
    // Note: the Python SDK's extra_body does not exist in the TS SDK, so it is not used.
    ...{ top_k: 20, chat_template_kwargs: { enable_thinking: false } },
  });

  const raw = resp.choices[0]?.message.content ?? "";
  // ② Boundary validation: don't trust the output even when constrained. Version differences and truncation are stopped here.
  return Invoice.parse(JSON.parse(raw));
}

From the same Invoice, z.toJSONSchema generates the constraint schema for vLLM, and Invoice.parse validates the response. Add a field to the schema and both follow automatically: this is ETC (Easy To Change).
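
To see why one definition can feed both sides, here is a toy, dependency-free sketch of the same idea. It is not Zod and does not reproduce the exact output of z.toJSONSchema: FieldSpec, toJsonSchema, and conforms are hypothetical names invented for this illustration, and the field rules are simplified.

```typescript
// Toy field spec: a stand-in for a Zod schema, NOT Zod's API.
type FieldSpec =
  | { kind: "string"; pattern?: string }
  | { kind: "number"; minimum?: number }
  | { kind: "enum"; values: readonly string[] };
type ObjectSpec = Record<string, FieldSpec>;

const invoiceSpec: ObjectSpec = {
  vendor: { kind: "string" },
  total: { kind: "number", minimum: 0 },
  dueDate: { kind: "string", pattern: "^\\d{4}-\\d{2}-\\d{2}$" },
  currency: { kind: "enum", values: ["JPY", "USD", "EUR"] },
};

/** Derivation 1: the JSON Schema sent to the server as the generation constraint. */
function toJsonSchema(spec: ObjectSpec): Record<string, unknown> {
  const properties: Record<string, unknown> = {};
  for (const [key, field] of Object.entries(spec)) {
    if (field.kind === "string") {
      properties[key] = field.pattern === undefined ? { type: "string" } : { type: "string", pattern: field.pattern };
    } else if (field.kind === "number") {
      properties[key] = field.minimum === undefined ? { type: "number" } : { type: "number", minimum: field.minimum };
    } else {
      properties[key] = { type: "string", enum: [...field.values] };
    }
  }
  return { type: "object", properties, required: Object.keys(spec), additionalProperties: false };
}

/** Derivation 2: the boundary check applied to the parsed response. */
function conforms(spec: ObjectSpec, value: unknown): boolean {
  if (typeof value !== "object" || value === null || Array.isArray(value)) return false;
  const record = value as Record<string, unknown>;
  const known = (key: string) => Object.prototype.hasOwnProperty.call(spec, key);
  if (!Object.keys(record).every(known)) return false; // reject unknown keys
  return Object.entries(spec).every(([key, field]) => {
    const v = record[key];
    if (field.kind === "string") {
      return typeof v === "string" && (field.pattern === undefined || new RegExp(field.pattern).test(v));
    }
    if (field.kind === "number") {
      return typeof v === "number" && Number.isFinite(v) && (field.minimum === undefined || v >= field.minimum);
    }
    return typeof v === "string" && field.values.includes(v);
  });
}
```

Adding a field to invoiceSpec changes both the generated constraint and the check at once, which is the property the Zod-based code above relies on.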

When it still breaks: a repair loop (defense in depth)

Structured output reduces accidents but doesn't eliminate them. Output can be cut off midway by an insufficient max_tokens, refused for safety reasons, or, rarely, broken by implementation differences. So wrap the call in a resilient function that detects a Zod parse failure and attempts a repair exactly once.

// lib/structured-safe.ts: absorb validation failure as a normal path (at most one retry)
import { z } from "zod";

export type Parsed<T> = { ok: true; data: T } | { ok: false; error: string };

/** At most 2 attempts until the output passes the schema. The 2nd attempt passes the failure to the model so it can self-repair. */
export async function generateValidated<T>(
  schema: z.ZodType<T>,
  run: (repairHint?: string) => Promise<string>,
): Promise<Parsed<T>> {
  for (let attempt = 0; attempt < 2; attempt++) {
    const raw = await run(attempt === 0 ? undefined : "The previous output did not match the schema. Fix the keys and types strictly and output again.");
    const result = schema.safeParse(safeJson(raw));
    if (result.success) return { ok: true, data: result.data };
    // Log the failure (no PII: only the attempt number and the issue count)
    logValidationFailure({ attempt, issues: result.error.issues.length });
  }
  return { ok: false, error: "schema validation failed after retry" };
}

const safeJson = (s: string): unknown => {
  try { return JSON.parse(s); } catch { return null; } // null reliably fails validation, so the loop above handles it
};
declare function logValidationFailure(meta: { attempt: number; issues: number }): void;

It uses safeParse so that exceptions don't become control flow, logs only metadata about the failure (observability without leaking PII), and caps repair at a finite number of attempts. Resilience, observability, and KISS coexist.
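
The retry behaviour can be exercised without a model by scripting the outputs. The sketch below mirrors the control flow of generateValidated but is deliberately dependency-free: a plain type guard stands in for schema.safeParse, and validatedWithRetry, scriptedRun, and isLabel are hypothetical names used only here.

```typescript
type Parsed<T> = { ok: true; data: T } | { ok: false; error: string };
type Validator<T> = (value: unknown) => value is T; // stand-in for schema.safeParse

/** Same two-attempt loop as above, with a type guard instead of a Zod schema. */
async function validatedWithRetry<T>(
  isValid: Validator<T>,
  run: (repairHint?: string) => Promise<string>,
): Promise<Parsed<T>> {
  for (let attempt = 0; attempt < 2; attempt++) {
    const raw = await run(attempt === 0 ? undefined : "Previous output did not match the schema.");
    let value: unknown = null;
    try { value = JSON.parse(raw); } catch { /* keep null so validation fails */ }
    if (isValid(value)) return { ok: true, data: value };
  }
  return { ok: false, error: "schema validation failed after retry" };
}

// Scripted "model": truncated JSON first, valid JSON once it receives a repair hint.
const hintsSeen: Array<string | undefined> = [];
const scriptedRun = async (repairHint?: string): Promise<string> => {
  hintsSeen.push(repairHint);
  return repairHint === undefined ? '{"label": "posi' : '{"label": "positive"}';
};

const isLabel = (v: unknown): v is { label: string } =>
  typeof v === "object" && v !== null && typeof (v as Record<string, unknown>)["label"] === "string";

// Resolves to an ok result on the second attempt; hintsSeen records what each attempt received.
const outcome: Promise<Parsed<{ label: string }>> = validatedWithRetry(isLabel, scriptedRun);
```

A run function that always returns broken text resolves to the ok: false branch after two attempts, which is the case the caller has to handle explicitly.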

Coexisting thinking mode and structured output

Qwen3's strength is "think, then answer." The iron rule when coexisting with structured output is "thinking is free, only the final answer has the schema constraint."

The server separates <think>…</think> from the body with --reasoning-parser qwen3. The schema constraint applies to the final answer after separation (forcing the thinking prose into JSON as well breaks the reasoning).
The app runs Zod parse only on content (the final JSON). reasoning_content (the thought process) is excluded from validation and sent to audit/evaluation logs.

const msg = resp.choices[0]?.message;
const answer = Invoice.parse(JSON.parse(msg?.content ?? "{}")); // only content is validated
auditLog({ reasoning: (msg as { reasoning_content?: string })?.reasoning_content }); // reasoning goes to the audit log
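
If the server was started without the reasoning parser, the thinking text may arrive inline in content instead of in reasoning_content. The following defensive sketch handles both cases; the inline <think> fallback is an assumption about how such a response would look, and splitReasoning is a hypothetical helper name.

```typescript
interface AssistantMessage {
  content?: string | null;
  reasoning_content?: string | null; // filled by the server when a reasoning parser is active
}

/** Splits a message into the part to validate (answer) and the part to audit (reasoning). */
function splitReasoning(msg: AssistantMessage): { answer: string; reasoning: string } {
  const content = msg.content ?? "";
  if (msg.reasoning_content) {
    return { answer: content.trim(), reasoning: msg.reasoning_content.trim() };
  }
  // Fallback (assumption): raw text came back with the <think> block still inline.
  const match = /<think>([\s\S]*?)<\/think>/.exec(content);
  if (match === null) return { answer: content.trim(), reasoning: "" };
  const before = content.slice(0, match.index);
  const after = content.slice(match.index + match[0].length);
  return { answer: (before + after).trim(), reasoning: (match[1] ?? "").trim() };
}

// Inline case: the thinking block is removed before the answer reaches validation.
const inlineSplit = splitReasoning({
  content: '<think>The total is 1200.</think>\n{"total": 1200}',
});
```

Only the answer field would then go through JSON.parse and the schema; the reasoning field goes to the audit log.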

💡 When to use which: tasks where the type is the main point, such as extraction, classification, and formatting, are fastest and cheapest with non-thinking mode + structured output. Reserve thinking mode + structured output for structured output that involves complex judgment (e.g., a risk assessment with stated grounds). Routing the mode by difficulty is the heart of cost design.
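
That routing rule can be written as a small lookup. This is a sketch under assumptions: the TaskKind categories and function names are invented for illustration, the non-thinking values match the client code above, and the thinking-mode values (temperature 0.6, top_p 0.95) are what I recall from the Qwen3 model card, so verify them there before relying on them.

```typescript
// Hypothetical task categories for this sketch.
type TaskKind = "extraction" | "classification" | "formatting" | "judgment";

interface ModeSettings {
  enable_thinking: boolean;
  temperature: number;
  top_p: number;
  top_k: number;
}

/** Picks thinking vs non-thinking per task kind. Sampling values: verify against the model card. */
function chooseMode(task: TaskKind): ModeSettings {
  if (task === "judgment") {
    // Complex judgment: let the model think before the constrained answer.
    return { enable_thinking: true, temperature: 0.6, top_p: 0.95, top_k: 20 };
  }
  // Type-centric tasks: non-thinking mode is faster and cheaper.
  return { enable_thinking: false, temperature: 0.7, top_p: 0.8, top_k: 20 };
}

/** Request fields for the chosen mode; chat_template_kwargs is the vLLM extension used in the client code above. */
function modeRequestFields(task: TaskKind): Record<string, unknown> {
  const { enable_thinking, ...sampling } = chooseMode(task);
  return { ...sampling, chat_template_kwargs: { enable_thinking } };
}
```

The result of modeRequestFields can be spread into the chat completion request in place of the hard-coded sampling and thinking settings.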

Pitfalls & best practices

🔴 Don't skip boundary validation even with structured output. Output can still break from truncation or refusal. json_object alone is out of the question, and even with json_schema, validate with Zod.
🟠 Don't set max_tokens too tight. If the JSON is long, it is cut off midway and becomes invalid. Budget for the schema's maximum size.
🟠 Keep the schema in one place (DRY). Auto-generate JSON Schema from Zod, and don't maintain a second handwritten definition. Divergence eventually causes an accident.
🟠 Use describe() as a hint. Each field's description stabilizes how the model fills it. But state clearly in the system prompt that fields must not be filled by guessing.
🟢 enum / union is structured output's forte. A choice like "positive | negative" is reliable with z.enum. Avoid ambiguous free strings.
🟢 Absorb version differences with response_format: json_schema. The vLLM-native argument names (structured_outputs / guided_json) depend on the version.
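
For the truncation and refusal pitfalls, it can help to inspect the response before parsing at all. In the OpenAI-compatible response, finish_reason is "length" when generation stopped at the token limit. The sketch below is hypothetical (triageChoice and the Triage type are invented names), and whether your server fills message.refusal is an assumption to check.

```typescript
// Minimal view of one entry of resp.choices; only the fields this sketch reads.
interface Choice {
  finish_reason?: string | null;
  message?: { content?: string | null; refusal?: string | null };
}

type Triage =
  | { kind: "ok"; content: string }
  | { kind: "truncated" }
  | { kind: "refused"; reason: string }
  | { kind: "empty" };

/** Classifies a choice before JSON parsing so each failure gets a distinct, loggable cause. */
function triageChoice(choice: Choice | undefined): Triage {
  if (choice === undefined) return { kind: "empty" };
  if (choice.finish_reason === "length") return { kind: "truncated" }; // hit max_tokens
  const refusal = choice.message?.refusal; // assumption: may be absent on vLLM
  if (refusal) return { kind: "refused", reason: refusal };
  const content = choice.message?.content ?? "";
  return content.trim() === "" ? { kind: "empty" } : { kind: "ok", content };
}
```

A truncated result suggests raising max_tokens instead of retrying with the same budget, while an ok result still goes through Zod as usual.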

Frequently asked questions (FAQ)

Q. json_object or json_schema: which? A. Use json_schema whenever you need schema conformance. json_object goes only as far as "valid JSON," leaving keys and types unconstrained. json_schema (guided decoding) is the option that constrains generation to the schema.

Q. If I use structured output, is Zod validation unnecessary? A. No. The constraint only reduces accidents, and it breaks on truncation/refusal/implementation differences. The boundary Zod validation is mandatory. Constraint = prevention, validation = stopping. Do both.

Q. Can I do structured output in thinking mode too? A. You can. Use it with --reasoning-parser qwen3, and apply the schema constraint only to the final answer. Validate content, and separate reasoning_content for auditing.

Q. Should I choose a backend like xgrammar / outlines? A. The default xgrammar is enough in most cases. Switch with --structured-outputs-config.backend only when you need a special grammar (regex, CFG). First measure with the default.

Q. Is the design the same with Pydantic (Python)? A. Yes. Pass Model.model_json_schema() to response_format.json_schema, and validate the response with Model.model_validate_json(). The philosophy of making one model the source of truth is the same (Pydantic v2 boundary validation).

Conclusion

To turn a thinking LLM's output into production data, drive the double guard of constraint and validation from one schema.

Generation constraint: make invalid JSON impossible to generate with vLLM's structured output (response_format: json_schema).
Boundary validation: still don't trust the output, and parse it with Zod. This stops truncation, refusal, and version differences.
Source of truth in one place: auto-generate Zod → JSON Schema (DRY, ETC).
Thinking is free, only the answer is constrained: validate content, send reasoning_content to auditing.
Ensure resilience with a repair loop + observability (finite count, metadata-only logs).


Sources / official resources

vLLM official (Structured Outputs) — guided decoding / response_format / backends
Qwen3-8B-AWQ model card — thinking mode and sampling
Zod official — schema, toJSONSchema, safeParse
xgrammar — vLLM's default structured-output engine

vLLM's structured-output API changes by version. Always confirm with primary sources and your version before implementation.
