# Type-safe structured output with Qwen3-8B-AWQ: vLLM guided decoding × Zod

> A practical guide to making your self-hosted LLM's JSON output 'unbreakable.' vLLM's structured output (guided decoding / response_format json_schema) makes syntactically invalid JSON impossible to generate, and boundary validation with Zod adds a second guard. A single Zod schema is the source of truth for both the constraint passed to vLLM and the app's validation. Real code covers coexistence with thinking mode and a repair loop.

- Published: 2026-06-25
- Author: 友田 陽大
- Tags: Qwen, vLLM, TypeScript, Zod, type safety, structured output, generative AI
- URL: https://tomodahinata.com/en/blog/qwen3-structured-output-json-vllm-guided-decoding-zod
- Category: Quantized LLMs & self-hosting
- Pillar guide: https://tomodahinata.com/en/blog/qwen3-8b-awq-self-hosting-reasoning-production-guide

## Key points

- An LLM's 'plausible JSON' breaks in production. Use vLLM's structured output (guided decoding) to make syntactically invalid JSON impossible to generate, then validate at the boundary with Zod. The point is the double guard of constraint plus validation.
- Keep one Zod schema as the source of truth: generate the constraint schema passed to vLLM with z.toJSONSchema, and parse the response with the same schema. This removes duplicate schema maintenance (DRY).
- On the API side, the OpenAI-compatible response_format: json_schema is the most portable. The vLLM-native form is extra_body.structured_outputs (formerly guided_json). The default backend is xgrammar.
- With thinking mode, the rule is 'thinking is free, only the final answer is constrained.' Combine --reasoning-parser qwen3 with structured output, and exclude reasoning_content from validation.
- Don't skip boundary validation even when the output is constrained: it can still break through truncation, refusal, or version differences. Detect Zod parse failures and handle them with a repair loop plus observability.

---

## The goal of this article

Ask an LLM to "return JSON" and, in production, it **will eventually break**: a prefatory remark, a ```` ```json ```` fence, a trailing comma, a brace cut off midway. In the [Qwen3-8B-AWQ practical guide](/blog/qwen3-8b-awq-self-hosting-reasoning-production-guide) I touched lightly on `response_format: json_object` and Zod validation. This piece covers that topic **in depth.**

The aim is a **double guard.**

1. **At the generation stage, "make it impossible to produce invalid JSON"** (vLLM's structured output / guided decoding).
2. **At the app boundary, "still don't trust it and validate"** (Zod).

Both guards are driven by **one Zod schema**, which removes duplicate schema maintenance (DRY). This is a robust way to feed a thinking LLM's output into the app as **type-safe data.**

> **Reliability disclosure**: this piece's vLLM structured-output API and coexistence with thinking mode are based on [vLLM official (Structured Outputs)](https://docs.vllm.ai/en/latest/features/structured_outputs.html) and the [Qwen3-8B-AWQ model card](https://huggingface.co/Qwen/Qwen3-8B-AWQ). vLLM's API names change by version, so **always confirm with your version** (below).

---

## The 30-second conclusion

| Layer | What it does | Tool | Failure it guards |
| --- | --- | --- | --- |
| **Generation constraint** | Make invalid JSON impossible to generate | vLLM structured output (xgrammar) | Grammar breakage, prefaces, fences |
| **Boundary validation** | Don't trust the output, pass it through a type | Zod `parse` | Truncation, refusal, version diff, type mismatch |
| **Source of truth** | The schema in one place | Zod → JSON Schema | Double management of the schema |

**The essence**: structured output (the constraint) produces "**almost always correct JSON**," but **not 100%** of the time (truncation, refusal, implementation differences). So **boundary validation with Zod can't be skipped.** The constraint "reduces accidents," and validation "stops accidents." Do both.

---

## Why `json_object` alone isn't enough

`response_format: {type: "json_object"}` only aims for "a string that is valid JSON" and **doesn't enforce the schema (which keys and types).** `{"foo": 1}` is valid JSON too. What production needs is "**JSON that follows this schema,**" and that is what **guided decoding (structured output)** provides: at decode time, it **only lets through tokens the schema permits.**

---

## Server: enable structured output + thinking parsing

vLLM has structured output built in (the default backend is **xgrammar**). When it is combined with a thinking model, the principle is **thinking is free, only the final answer is constrained.**

```bash
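# Qwen3-8B-AWQ: thinking parsing + structured output. The schema constraint applies only to the final answer.
vllm serve Qwen/Qwen3-8B-AWQ \
  --reasoning-parser qwen3 \
  --max-model-len 32768 \
  --port 8000
# Only for advanced cases where you want to constrain the thinking phase as well:
#   --structured-outputs-config.enable_in_reasoning=True (off by default, so thinking is free)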
```

> 🔧 **Mind version differences**: vLLM has renamed its structured-output arguments over time. **The OpenAI-compatible `response_format: json_schema` is the most portable**, so this piece uses it as the primary interface. The vLLM-native form is `extra_body={"structured_outputs": {"json": schema}}` in newer versions and `extra_body={"guided_json": schema}` in older ones. **Confirm against the documentation for your vLLM version.**
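
The three spellings in the note above can be isolated behind one small helper so the rest of the client never sees version-specific keys. This is a minimal sketch under assumptions: `StructuredOutputStyle` and `structuredOutputParams` are hypothetical names introduced here, and the field shapes are copied from the note, so check them against your vLLM version before relying on them.

```ts
/** Which request spelling the target server understands (hypothetical name). */
type StructuredOutputStyle = "openai" | "vllm-new" | "vllm-old";

type JsonSchemaObject = Record<string, unknown>;

/**
 * Returns the request-body fragment that carries the schema constraint.
 * The fragment is meant to be merged into the top level of the
 * chat-completions request body.
 */
function structuredOutputParams(
  style: StructuredOutputStyle,
  name: string,
  schema: JsonSchemaObject,
): Record<string, unknown> {
  switch (style) {
    case "openai": // OpenAI-compatible, the most portable form
      return {
        response_format: { type: "json_schema", json_schema: { name, schema, strict: true } },
      };
    case "vllm-new": // vLLM-native, newer versions
      return { structured_outputs: { json: schema } };
    case "vllm-old": // vLLM-native, older versions
      return { guided_json: schema };
  }
}
```

Keeping the choice in one function means a vLLM upgrade changes a single argument instead of every call site.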

---

## Client: make one Zod schema the source of truth

This is the core of the design. **Write a single Zod schema**, and derive **both** "the JSON Schema passed to vLLM (for the constraint)" and "the parser that validates the response" from it. The schema is never written twice, so the two can't diverge (DRY).

```ts
// lib/structured.ts: one Zod schema covers both the constraint and the validation
import OpenAI from "openai";
import { z } from "zod";

const client = new OpenAI({ baseURL: process.env.QWEN_BASE_URL, apiKey: "internal", timeout: 60_000 });

/** The structure to extract (the single source of truth). Descriptions also serve as hints to the model. */
export const Invoice = z.object({
  vendor: z.string().min(1).describe("Name of the billing company"),
  total: z.number().nonnegative().describe("Total including tax, in the invoice's currency"),
  dueDate: z.string().regex(/^\d{4}-\d{2}-\d{2}$/).describe("Payment due date, YYYY-MM-DD"),
  currency: z.enum(["JPY", "USD", "EUR"]),
});
export type Invoice = z.infer<typeof Invoice>;

// Zod 4: z.toJSONSchema(Invoice) is built in. With Zod 3, use zod-to-json-schema.
const invoiceJsonSchema = z.toJSONSchema(Invoice);

/** Extracts invoice information from unstructured text as a typed value. */
export async function extractInvoice(text: string): Promise<Invoice> {
  const resp = await client.chat.completions.create({
    model: "Qwen/Qwen3-8B-AWQ",
    messages: [
      { role: "system", content: "Return only JSON that matches the specified schema, extracted from the invoice text. Do not fill fields by guessing." },
      { role: "user", content: text },
    ],
    temperature: 0.7, top_p: 0.8, // Non-thinking mode is enough for extraction (fast, cheap)
    // (1) Generation constraint: only JSON that follows this schema can be generated (OpenAI-compatible, portable)
    response_format: {
      type: "json_schema",
      json_schema: { name: "Invoice", schema: invoiceJsonSchema, strict: true },
    },
    presence_penalty: 1.5, // Counters repetition in quantized models (a standard OpenAI-compatible parameter)
    // Send vLLM extensions at the top level. The Node SDK forwards unknown keys in the body,
    // and because they are spread in, there is no type error.
    // Note: the Python SDK's extra_body does not exist in the TS SDK, so it isn't used here.
    ...{ top_k: 20, chat_template_kwargs: { enable_thinking: false } },
  });

  const raw = resp.choices[0]?.message.content ?? "";
  // (2) Boundary validation: don't trust the output even when constrained.
  // Version differences and truncation are stopped here (JSON.parse or Invoice.parse throws).
  return Invoice.parse(JSON.parse(raw));
}
```

`z.toJSONSchema` generates the constraint schema for vLLM and `Invoice.parse` validates the response, both **from the same `Invoice`**. Add a field to the schema and **both follow automatically**, which keeps the code ETC (Easy To Change).

---

## When it still breaks: a repair loop (defense in depth)

Structured output **reduces** accidents but doesn't eliminate them. The output can be **cut off midway because `max_tokens` is too small**, **refused for safety reasons**, or, rarely, **broken by implementation differences**. So wrap the call in a resilient function that **detects a Zod `parse` failure and attempts a repair exactly once.**

```ts
// lib/structured-safe.ts: treat validation failure as a normal path (at most one retry)
import { z } from "zod";

export type Parsed<T> = { ok: true; data: T } | { ok: false; error: string };

/** Up to two attempts until the output passes the schema. The second attempt passes a fixed repair hint so the model can self-correct. */
export async function generateValidated<T>(
  schema: z.ZodType<T>,
  run: (repairHint?: string) => Promise<string>,
): Promise<Parsed<T>> {
  for (let attempt = 0; attempt < 2; attempt++) {
    const raw = await run(attempt === 0 ? undefined : "The previous output did not match the schema. Fix the keys and types strictly and output again.");
    const result = schema.safeParse(safeJson(raw));
    if (result.success) return { ok: true, data: result.data };
    // Log the failure without PII: only the attempt number and the issue count.
    logValidationFailure({ attempt, issues: result.error.issues.length });
  }
  return { ok: false, error: "schema validation failed after retry" };
}

const safeJson = (s: string): unknown => {
  // Unparseable text becomes null, which fails safeParse for object schemas and is handled above.
  try { return JSON.parse(s); } catch { return null; }
};
declare function logValidationFailure(meta: { attempt: number; issues: number }): void;
```
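
`generateValidated` leaves open how `run` uses its `repairHint`. One possible wiring, sketched below, appends the hint as an extra user message on the retry so the first attempt stays unchanged. `ChatMessage`, `EXTRACTION_SYSTEM_PROMPT`, and `buildExtractionMessages` are hypothetical names introduced here, the prompt text is only in the spirit of the one in `extractInvoice`, and whether two consecutive user messages suit your chat template is an assumption to verify.

```ts
/** Minimal chat message shape (hypothetical; mirrors the OpenAI-compatible format). */
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

const EXTRACTION_SYSTEM_PROMPT =
  "Return only JSON that matches the specified schema, extracted from the invoice text. Do not fill fields by guessing.";

/** Builds the messages for one attempt; the repair hint is present only on the retry. */
function buildExtractionMessages(text: string, repairHint?: string): ChatMessage[] {
  const messages: ChatMessage[] = [
    { role: "system", content: EXTRACTION_SYSTEM_PROMPT },
    { role: "user", content: text },
  ];
  if (repairHint !== undefined) {
    // Retry only: tell the model what went wrong without altering the original input.
    messages.push({ role: "user", content: repairHint });
  }
  return messages;
}
```

A `run` callback would then pass `buildExtractionMessages(text, repairHint)` as `messages` and return the response's `content` string.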

Use `safeParse` so that **exceptions don't become control flow**, **log only metadata** about the failure ([observability](/blog/opentelemetry-observability-production-tracing-metrics-logs), no PII), and cap repairs at **a finite number**. That way resilience, observability, and KISS coexist.
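
If the issue count alone is too coarse for debugging, a slightly richer summary can still avoid logging payload values. The sketch below reduces each issue to its path and code; `IssueLike` and `summarizeIssues` are hypothetical names, and the shape assumes only the `code` and `path` fields that Zod issues carry. Messages are deliberately left out because a validator's message may echo the received value.

```ts
/** The subset of a validation issue used here; payload values are never read. */
type IssueLike = { code: string; path: ReadonlyArray<PropertyKey> };

/** Collapses issues into "path:code" strings, capped at `limit` entries. */
function summarizeIssues(issues: ReadonlyArray<IssueLike>, limit = 5): string[] {
  return issues.slice(0, limit).map((issue) => {
    // An empty path means the failure is at the top level of the value.
    const path = issue.path.length > 0 ? issue.path.map(String).join(".") : "(root)";
    return `${path}:${issue.code}`;
  });
}
```

Paths normally come from the schema, but with record-like schemas they can contain keys taken from the input, so that case deserves a review before the summary goes into logs.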

---

## Coexisting thinking mode and structured output

Qwen3's strength is "think, then answer." The iron rule when coexisting with structured output is **"thinking is free, only the final answer has the schema constraint."**

- With `--reasoning-parser qwen3`, the server separates `<think>…</think>` **from the body**. The schema constraint applies to the **final answer after separation** (forcing the thinking prose into JSON as well breaks the reasoning).
- The app runs Zod's `parse` only on `content` (the final JSON). It **excludes `reasoning_content` (the thought process) from validation** and sends it to audit/evaluation logs.

```ts
const msg = resp.choices[0]?.message;
const answer = Invoice.parse(JSON.parse(msg?.content ?? "{}")); // only content is validated
auditLog({ reasoning: (msg as { reasoning_content?: string })?.reasoning_content }); // reasoning goes to the audit log
```

> 💡 **When to use which**: tasks where **the type is the main concern**, such as extraction, classification, and formatting, are fastest and cheapest with **non-thinking + structured output.** Reserve **thinking mode + structured output** for **structured output that involves complex judgment** (e.g., a risk assessment with supporting grounds). Routing the mode by difficulty is [the heart of cost design](/blog/llama-inference-cost-optimization-self-host-vs-api#コスト削減レバー効果の大きい順).
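
The routing described above can be expressed as a small pure function that picks the mode and its sampling parameters per task. This is a sketch under assumptions: `TaskKind`, `ModeParams`, and `modeParamsFor` are hypothetical names, the non-thinking values match `extractInvoice`, and the thinking-mode values (temperature 0.6, top_p 0.95) follow the Qwen3 model card's recommendation and should be checked against the card for your model revision.

```ts
/** Task categories used for routing (hypothetical; adjust to your workload). */
type TaskKind = "extraction" | "classification" | "formatting" | "judgment";

type ModeParams = {
  temperature: number;
  top_p: number;
  top_k: number;
  chat_template_kwargs: { enable_thinking: boolean };
};

/** Only tasks that involve complex judgment pay for thinking mode. */
function modeParamsFor(task: TaskKind): ModeParams {
  const needsThinking = task === "judgment";
  return needsThinking
    ? { temperature: 0.6, top_p: 0.95, top_k: 20, chat_template_kwargs: { enable_thinking: true } }
    : { temperature: 0.7, top_p: 0.8, top_k: 20, chat_template_kwargs: { enable_thinking: false } };
}
```

The returned object can be merged into the request next to `response_format`, so the schema constraint stays identical in both modes.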

---

## Pitfalls & best practices

- 🔴 **Don't skip boundary validation even with structured output.** Output can still break through truncation or refusal. `json_object` alone is out of the question, and even with `json_schema`, run the result through Zod.
- 🟠 **Don't set `max_tokens` too low.** Long JSON gets cut off midway and becomes invalid. Budget for the schema's maximum size.
- 🟠 **Keep the schema in one place (DRY).** Auto-generate JSON Schema from Zod, and don't maintain a second handwritten definition. Divergence eventually causes an incident.
- 🟠 **Use `describe()` as a hint.** Each field's description stabilizes how the model fills it. Still, state explicitly in the system prompt **not to fill fields by guessing.**
- 🟢 **enum / union is where structured output shines.** A choice like "positive | negative" is reliable with `z.enum`. Avoid ambiguous free-form strings.
- 🟢 **Absorb version differences with `response_format: json_schema`.** The vLLM-native argument names (`structured_outputs` / `guided_json`) depend on the version.
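
Truncation from a low `max_tokens` can be recognized before any parsing: in the OpenAI-compatible response, `finish_reason: "length"` means generation stopped at the token limit. The sketch below classifies a choice up front so that truncation shows up in logs as its own failure reason rather than as a generic parse error. `ChoiceLike`, `OutputCheck`, and `checkChoice` are hypothetical names, and only the fields used here are modeled.

```ts
/** The subset of a chat-completion choice that this check reads. */
type ChoiceLike = {
  finish_reason: string | null;
  message: { content: string | null };
};

type OutputCheck =
  | { usable: true; content: string }
  | { usable: false; reason: "no_choice" | "truncated" | "empty" };

/** Decides whether the content is worth handing to JSON.parse and Zod. */
function checkChoice(choice: ChoiceLike | undefined): OutputCheck {
  if (choice === undefined) return { usable: false, reason: "no_choice" };
  // "length" means the token limit was hit, so the JSON is likely cut off.
  if (choice.finish_reason === "length") return { usable: false, reason: "truncated" };
  const content = choice.message.content?.trim() ?? "";
  if (content === "") return { usable: false, reason: "empty" };
  return { usable: true, content };
}
```

A `usable: true` result still goes through Zod afterwards; this check only adds a clearer reason code and does not replace boundary validation.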

---

## Frequently asked questions (FAQ)

**Q. `json_object` or `json_schema` — which?**
A. **Always `json_schema`.** `json_object` guarantees only "valid JSON" and leaves keys and types unconstrained. If you need schema conformance, `json_schema` (guided decoding) is the only choice.

**Q. If I use structured output, is Zod validation unnecessary?**
A. **No.** The constraint only reduces accidents; output still breaks on truncation, refusal, or implementation differences. **Boundary validation with Zod is mandatory.** Constraint = prevention, validation = stopping. Do both.

**Q. Can I do structured output in thinking mode too?**
A. Yes. Use it with `--reasoning-parser qwen3`, and apply the **schema constraint only to the final answer.** Validate `content`, and keep `reasoning_content` separate for auditing.

**Q. Should I choose a backend like xgrammar / outlines?**
A. The default **xgrammar** is enough in most cases. Switch with `--structured-outputs-config.backend` only when you need a special grammar (regex, CFG). First measure with the default.

**Q. Is the design the same with Pydantic (Python)?**
A. Yes. Pass `Model.model_json_schema()` to `response_format.json_schema`, and validate the response with `Model.model_validate_json()`. The philosophy of making **one model the source of truth** is the same ([Pydantic v2 boundary validation](/blog/pydantic-v2-production-validation-type-safety)).

---

## Conclusion

To turn a "thinking LLM's" output into production data, drive a **double guard of constraint and validation** with **one schema.**

1. **Generation constraint**: make invalid JSON impossible to generate with vLLM's structured output (`response_format: json_schema`).
2. **Boundary validation**: still don't trust the output, and run Zod `parse`. This stops truncation, refusal, and version differences.
3. **Source of truth in one place**: auto-generate Zod → JSON Schema (DRY, ETC).
4. **Thinking is free, only the answer is constrained**: validate `content`, send `reasoning_content` to auditing.
5. **Ensure resilience with a repair loop + observability** (finite count, metadata-only logs).

> I build type-safe structured output for self-hosted LLMs, including schema design, guided decoding, boundary validation, and repair. Take a look at my AI platform [track record](/case-studies/ai-video-localization-lipsync) and get in touch. With **one person × generative AI**: fast, cheap, and safe.

### Sources / official resources

- [vLLM official (Structured Outputs)](https://docs.vllm.ai/en/latest/features/structured_outputs.html) — guided decoding / response_format / backends
- [Qwen3-8B-AWQ model card](https://huggingface.co/Qwen/Qwen3-8B-AWQ) — thinking mode and sampling
- [Zod official](https://zod.dev/) — schema, `toJSONSchema`, `safeParse`
- [xgrammar](https://github.com/mlc-ai/xgrammar) — vLLM's default structured-output engine

* vLLM's structured-output API changes by version. Always confirm with primary sources and your version before implementation.
