# The reliability of structured output: why constrained decoding still doesn't give you 'correct output,' and how to design for production

> Is LLM structured output (JSON) safe once you use constrained (guided) decoding? Constrained decoding guarantees 'syntactically valid JSON,' not 'semantically correct values.' Failures don't vanish; they change shape. This article explains a production design that combines schema validation, business-rule validation, repair retries, and fallback, based on real experience running structured AI output in production and illustrated with a Zod implementation.

- Published: 2026-06-25
- Author: 友田 陽大
- Tags: Generative AI, LLM, Type safety, vLLM, Zod, Architecture design
- URL: https://tomodahinata.com/en/blog/structured-output-reliability-constrained-decoding-semantic-validation
- Category: Generative AI, LLMs & RAG
- Pillar guide: https://tomodahinata.com/en/blog/vercel-ai-sdk-production-llm-apps-streaming-tools-rag

## Key points

- Constrained decoding guarantees 'syntactically valid JSON,' not 'semantically correct values.' Failures don't vanish; they change shape.
- Prompt-only JSON generation is reported to fail from a few percent to over ten percent of the time. Constrained decoding nearly eliminates syntax failures, but failures move to 'valid but wrong values' and 'refusal/degradation.'
- Production design has two layers: validate syntax (schema) and semantics (business rules) separately. Express business constraints with Zod's `superRefine`.
- On validation failure, feed the reason back to the LLM for a repair retry; when retries are exhausted, fall to the safe side with a fallback.
- Make failure reasons observable with structured logs. Without measuring which field failed and why, you can't improve.

---

Let me state the conclusion first. **"Use constrained (guided) decoding and LLM structured output (JSON) is safe" is a misconception. What constrained decoding guarantees is 'syntactically valid JSON,' not 'semantically correct values.'** Output that matches the schema but is wrong in business terms (a tax rate that doesn't exist, a total that doesn't match the line items, a required basis field left empty) is still generated without complaint. Adding constrained decoding makes syntax errors nearly vanish, but **failures don't vanish; they change shape and move to 'valid but wrong output' or 'refusal/degradation.'** In production, you need a design that validates syntax and semantics separately and backs that validation with repair retries and fallbacks.

This article draws on my experience **running generative-AI structured output (per-criterion risk judgments for review, etc.) in production** and explains, with implementation, a design that brings structured output up to production quality. For how to implement structured output in vLLM itself, see [Qwen3 structured output (guided decoding × Zod)](/blog/qwen3-structured-output-json-vllm-guided-decoding-zod).

---

## 1. "Valid JSON" and "correct values" are different things

First, internalize this distinction. Structured output has two kinds of correctness.

| | Syntax | Semantics |
|---|---|---|
| **What it guarantees** | The JSON's shape/type/required fields are correct | The values are correct in business terms |
| **Who guarantees it** | Constrained (guided) decoding | **No one guarantees it automatically; you validate it yourself** |
| **Failure example** | `{"total": "abc"}` (a string where a number should be) | `{"total": 9999}` (the type is right but it doesn't match the line items) |

Constrained decoding **constrains the model's output at the generation stage** so that it conforms to a schema (JSON Schema / grammar). With this, **syntax errors like "wrong type" or "missing closing bracket" drop dramatically.** Prompt-only JSON generation is reported to fail from a few percent to over ten percent of the time, whereas constrained decoding is said to hold syntax failures to a tiny fraction (under 0.1%).

But here is the essential point: **even if syntax is guaranteed, semantics aren't.** Business rules like "the tax rate is one of 0/8/10%," "the total matches the sum of the line items," and "the basis field is non-empty" **can't be fully expressed by the schema's types alone.** If you feel safe after adding constrained decoding, you'll pass output that is "perfect in shape but wrong in content" downstream without validation.
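To make the gap concrete, here is a minimal, dependency-free sketch. The `hasInvoiceShape` guard is a hypothetical stand-in for what a schema (and therefore constrained decoding) can check, and the sample object is made up for illustration.

```ts
// Hypothetical shape of an invoice candidate; field names are illustrative.
interface InvoiceShape {
  lines: { quantity: number; unitPriceJpy: number }[];
  taxRate: number;
  totalJpy: number;
}

function isRecord(x: unknown): x is Record<string, unknown> {
  return typeof x === "object" && x !== null && !Array.isArray(x);
}

// Layer 1 only: checks shape and types, the part a schema can express.
function hasInvoiceShape(x: unknown): x is InvoiceShape {
  if (!isRecord(x)) return false;
  const { lines, taxRate, totalJpy } = x;
  return (
    Array.isArray(lines) &&
    lines.every(
      (l) => isRecord(l) && typeof l.quantity === "number" && typeof l.unitPriceJpy === "number",
    ) &&
    typeof taxRate === "number" &&
    typeof totalJpy === "number"
  );
}

// Made-up model output: well-formed, yet the rate does not exist and the total is wrong.
const candidate: unknown = {
  lines: [{ quantity: 2, unitPriceJpy: 500 }],
  taxRate: 0.07,
  totalJpy: 9999,
};

const shapeOk = hasInvoiceShape(candidate); // true: every type check passes
```

Nothing in the shape check can notice that 7% is not a valid rate or that 9999 is not 2 × 500 plus tax. That is the job of the second layer.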

---

## 2. Failures don't vanish; they change shape

Introduce constrained decoding, and the failure mode moves. Without understanding this, you'll wrongly conclude that "syntax errors vanished, so it's safe now."

| With constrained decoding… | Where the failure moves |
|---|---|
| Syntax errors (broken JSON) nearly vanish | → **matches the schema but the value is wrong** (most dangerous: you can't notice) |
| | → **refusal/degradation** (over-constraint makes the model return empty, boilerplate, or meaningless output) |
| | → **quality drop** (bound by the grammar, a good answer it could otherwise give is trimmed) |

Especially troublesome is the second, "refusal/degradation." Over-constraining with a schema can prevent the model from exercising its inherent reasoning and make it return "a safe value that just satisfies the schema." **Constraint can sacrifice "semantic quality" in exchange for "syntactic guarantee,"** and you need to handle this trade-off by design.
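One way to keep this trade-off visible is to flag schema-valid outputs that look like placeholders. The sketch below is a heuristic under stated assumptions: the `Judgment` shape, the boilerplate phrases, and the length threshold are all hypothetical and would need tuning against real outputs.

```ts
// Hypothetical judgment shape: a verdict plus the free-text basis for it.
interface Judgment {
  verdict: "ok" | "risk";
  basis: string;
}

// Assumed boilerplate phrases; replace with ones observed in your own logs.
const BOILERPLATE = ["n/a", "none", "unknown", "no issues found"];

// Assumed minimum length for a meaningful basis.
const MIN_BASIS_CHARS = 20;

type Degradation = "empty" | "boilerplate" | "too_short" | null;

// Classifies a schema-valid judgment as possibly degraded, or null if it looks fine.
function detectDegradation(j: Judgment): Degradation {
  const basis = j.basis.trim().toLowerCase();
  if (basis.length === 0) return "empty";
  if (BOILERPLATE.includes(basis)) return "boilerplate";
  if (basis.length < MIN_BASIS_CHARS) return "too_short";
  return null;
}

// Share of outputs flagged as degraded, for comparing constraint settings.
function degradationRate(outputs: readonly Judgment[]): number {
  if (outputs.length === 0) return 0;
  const flagged = outputs.filter((j) => detectDegradation(j) !== null).length;
  return flagged / outputs.length;
}
```

Comparing `degradationRate` before and after tightening a schema gives a rough signal of whether the extra constraint is costing semantic quality. It is a heuristic, not proof of degradation.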

---

## 3. Production design ①: validate syntax and semantics separately

The solution is clear. **Validate syntax (schema) and semantics (business rules) in two layers.** In TypeScript, you can express business rules with Zod's `superRefine`.

```ts
import { z } from "zod";

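// Layer 1 (syntax): shape, types, required fields. This is the area constrained decoding covers.
const InvoiceLineSchema = z
  .object({
    description: z.string().min(1),
    quantity: z.number().int().positive(),
    unitPriceJpy: z.number().int().nonnegative(),
  })
  .strict(); // Disallow extra keys (rejects fields smuggled in via prompt injection)

// Layer 2 (semantics): business rules. Constrained decoding doesn't enforce these, so validate them yourself.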
const InvoiceSchema = z
  .object({
    lines: z.array(InvoiceLineSchema).min(1),
    taxRate: z.number(),
    totalJpy: z.number().int(),
  })
  .strict()
  .superRefine((inv, ctx) => {
    // Rule 1: the tax rate is one of 0 / 8% / 10% (the type is number, but the allowed values are limited)
    if (![0, 0.08, 0.1].includes(inv.taxRate)) {
      ctx.addIssue({ code: "custom", message: "tax rate must be 0, 0.08, or 0.1", path: ["taxRate"] });
    }
    // Rule 2: the total matches the sum of the line items plus tax (don't trust the LLM's arithmetic)
    const subtotal = inv.lines.reduce((s, l) => s + l.quantity * l.unitPriceJpy, 0);
    const expected = Math.round(subtotal * (1 + inv.taxRate));
    if (inv.totalJpy !== expected) {
      ctx.addIssue({
        code: "custom",
        message: `total mismatch: got ${inv.totalJpy}, expected ${expected}`,
        path: ["totalJpy"],
      });
    }
  });

export type Invoice = z.infer<typeof InvoiceSchema>;
```

Reject extra keys with `.strict()`, and validate "business rules that types can't express" with `superRefine`. **Have the LLM do arithmetic and business judgment, but don't trust the result as-is** — this is the same philosophy as [recomputing the amount server-side](/blog/payment-double-charge-prevention-idempotency-procurement-guide) in payments. The LLM produces "candidate judgments," but "correctness" is finalized by code.

---

## 4. Production design ②: repair retries and fallback

When validation fails, rather than just erroring, **feed the failure reason back to the LLM for a repair retry**, and if it still fails, **fall to the safe side with a fallback.** This resilience is what determines production quality.

```ts
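import { z } from "zod";

type Outcome<T> =
  | { readonly ok: true; readonly value: T }
  | { readonly ok: false; readonly reason: "exhausted" };

interface ExtractDeps<T> {
  /** LLM call (with constrained decoding). Accepts the previous failure reason as a repair hint. */
  readonly call: (repairHint?: string) => Promise<unknown>;
  readonly schema: z.ZodType<T>;
  readonly maxAttempts: number;
  /** Sends failures to structured logs, so which field failed and why is observable. */
  readonly onFailure: (e: { attempt: number; issues: string }) => void;
}

/**
 * Validates syntax and semantics; on failure, feeds the reason back for a repair retry.
 * When attempts are exhausted, returns Outcome.ok=false and the caller falls to the safe side (human review, default value).
 * Results are expressed as types rather than using exceptions for control flow (testability, recoverability).
 */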
export async function extractStructured<T>(deps: ExtractDeps<T>): Promise<Outcome<T>> {
  let hint: string | undefined;
  for (let attempt = 1; attempt <= deps.maxAttempts; attempt++) {
    const raw = await deps.call(hint);
    const parsed = deps.schema.safeParse(raw);
    if (parsed.success) return { ok: true, value: parsed.data };

    // Aggregate the failed fields and reasons, and pass them to the next attempt as a concrete repair instruction
    hint = parsed.error.issues
      .map((i) => `${i.path.join(".") || "(root)"}: ${i.message}`)
      .join("; ");
    deps.onFailure({ attempt, issues: hint });
  }
  return { ok: false, reason: "exhausted" };
}
```

There are three key points in this design.

1. **Don't use exceptions for control flow** — express success/failure with the `Outcome<T>` type, forcing the caller to handle "what to do on exhaustion" via types.
2. **Repair retry** — by returning "which field is bad and why" to the LLM, get a targeted repair rather than a blind retry.
3. **Fallback** — when retries are exhausted, fall to the safe side: route to human review, use a safe default, or degrade the processing. In an AI video pipeline I built, I likewise set a quantitative gate that aborts processing when the TTS failure rate exceeds a threshold, and otherwise degrades gracefully by inserting silence.
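On the caller side, the fallback in point 3 might look like the sketch below. `Outcome<T>` is copied from the code above so the block stands alone. The `Decision` type, `decideNextStep`, and the `review_queue` destination are hypothetical names for illustration.

```ts
// Same result type as in extractStructured, repeated so this block is self-contained.
type Outcome<T> =
  | { readonly ok: true; readonly value: T }
  | { readonly ok: false; readonly reason: "exhausted" };

// Hypothetical downstream decision: accept the value or hand off to a person.
type Decision<T> =
  | { readonly kind: "accepted"; readonly value: T }
  | { readonly kind: "needs_human_review"; readonly queue: string };

// Maps an Outcome to a decision. The switch is exhaustive: adding a new
// failure reason to Outcome makes the `never` assignment a compile error.
function decideNextStep<T>(outcome: Outcome<T>): Decision<T> {
  if (outcome.ok) return { kind: "accepted", value: outcome.value };
  switch (outcome.reason) {
    case "exhausted":
      // Fail safe: never guess a value; route to a (hypothetical) review queue.
      return { kind: "needs_human_review", queue: "review_queue" };
    default: {
      const unreachable: never = outcome.reason;
      return unreachable;
    }
  }
}
```

Because the failure branch is part of the type, a caller cannot read `value` without first checking `ok`. That is what forces the exhaustion case to be handled rather than forgotten.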

---

## 5. Production design ③: make failures observable

Finally, **you can't improve what you can't measure.** Make "which field failed validation, how often, for what reason" observable with structured logs. With this,

- spots where the schema or business rules are too strict (the model can't satisfy them) become visible.
- spots where the prompt should be improved can be identified.
- the trade-off "tightening the constraint increased refusal/degradation" can be judged in numbers.
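The measurement described above can be sketched as a small aggregation over failure events. The `ValidationFailure` record and its field names are assumptions for illustration. It uses one record per failed issue, which is finer-grained than the joined `issues` string passed to `onFailure` earlier.

```ts
// Hypothetical log record: one entry per failed validation issue.
interface ValidationFailure {
  readonly attempt: number;
  readonly path: string; // e.g. "totalJpy"
  readonly message: string; // e.g. "total mismatch"
}

// Counts failures per field so the most failure-prone fields stand out.
function countByField(failures: readonly ValidationFailure[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const f of failures) {
    counts.set(f.path, (counts.get(f.path) ?? 0) + 1);
  }
  return counts;
}

// Returns [field, count] pairs sorted by failure count, highest first.
function topFailingFields(
  failures: readonly ValidationFailure[],
  limit: number,
): Array<[string, number]> {
  return Array.from(countByField(failures).entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit);
}

// Serializes one failure as a single JSON line, a common structured-log format.
function toLogLine(f: ValidationFailure): string {
  return JSON.stringify({ event: "structured_output_validation_failed", ...f });
}
```

A field that dominates `topFailingFields` is a candidate for one of the three actions listed above: relax the rule, improve the prompt, or revisit how tightly that field is constrained.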

In the generative-AI review-support system I built for a broadcaster, I **attached grounding-derived citations to the output to make the basis of judgments traceable.** Including "the basis" in the structured output, then validating and recording it, is what makes root-cause investigation of wrong judgments and continuous quality improvement possible. Structured output becomes reliable only when the design includes not just the schema but also this **loop of observation and improvement.**

---

## FAQ

### Q. If I use constrained (guided) decoding, is JSON output safe?

It becomes syntactically safe but not semantically safe. Constrained decoding guarantees "a valid JSON shape" but doesn't guarantee "whether the value is correct in business terms." Output that matches the schema but is wrong (a non-existent tax rate, a total that doesn't match the line items) still gets generated. In addition to schema validation, business-rule validation is essential.

### Q. Isn't it enough to just instruct "return JSON" in the prompt?

It's insufficient in production. Prompt-only JSON generation is reported to fail syntactically from a few percent to over ten percent of the time. Constrained decoding nearly eliminates syntax failures, but failures then move to "valid but wrong values" and "refusal/degradation." Either way, you need a design that validates the output in two layers, syntax and semantics, after receiving it.

### Q. How do I implement business-rule validation?

Implement, as validation logic, constraints the schema's types alone can't express (value ranges, cross-field consistency, arithmetic agreement, etc.). In TypeScript, use Zod's `superRefine`; in Python, Pydantic validators. What matters is to not trust the arithmetic and judgments you had the LLM do as-is, but to recompute/re-validate in code and finalize correctness.

### Q. What should I do when validation fails?

Rather than just erroring, feed the failure reason (which field is bad and why) back to the LLM for a repair retry. If the retries are exhausted, have a fallback ready that falls to the safe side: route to human review, use a safe default, or degrade the processing. Expressing success/failure with a type (Result/Outcome) rather than exceptions forces the caller to implement the exhaustion handling, which makes the system robust.
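As a sketch of the feedback step, the hypothetical `buildRepairPrompt` helper below turns the aggregated hint into an instruction appended to the original prompt. The wording of the instruction is an assumption, not a tested prompt. The hint format is assumed to match the `path: message` pairs joined by `; ` in the retry code earlier.

```ts
// Hypothetical helper: appends validation feedback to the original prompt.
function buildRepairPrompt(basePrompt: string, repairHint?: string): string {
  // First attempt: no feedback yet, so send the prompt unchanged.
  if (repairHint === undefined || repairHint.trim() === "") return basePrompt;

  // Assumed hint format: "path: message; path: message".
  // A message that itself contains "; " would be split incorrectly;
  // passing the issues as an array would be more robust.
  const bullets = repairHint
    .split("; ")
    .map((s) => s.trim())
    .filter((s) => s.length > 0)
    .map((s) => `- ${s}`)
    .join("\n");

  return [
    basePrompt,
    "",
    "Your previous output failed validation. Fix these fields and return the full JSON again:",
    bullets,
  ].join("\n");
}
```

A `call` implementation could pass its base prompt and the incoming `repairHint` through this helper before invoking the model, so each retry carries the concrete reasons for the previous failure.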

### Q. Will stronger constraints raise quality?

Not necessarily. Over-constraining with a schema can prevent the model from exercising its reasoning and make it return "a safe value that just satisfies the schema" (refusal/degradation). Constraint is a trade-off that can sacrifice "semantic quality" in exchange for "syntactic guarantee." Make failure modes observable and tune the balance of constraint strength and output quality in numbers.

---

## Summary: separate syntax and semantics, and solidify with validation, repair, and observation

To make LLM structured output production-quality, here's what to grasp.

1. **What constrained decoding guarantees is syntax, not semantics**: failures don't vanish; they change shape.
2. **Validate syntax (schema) and semantics (business rules) in two layers**: express business constraints with Zod's `superRefine`.
3. **Don't trust the LLM's arithmetic/judgments; finalize in code**: the same philosophy as server-side amount recomputation.
4. **On failure, repair retry + fallback**: feed back the reason, and fall to the safe side when retries are exhausted.
5. **Make failures observable with structured logs**: what you can't measure, you can't improve.

If you're thinking "I'm having the LLM produce JSON, but a weird value slips through now and then" or "I want to stabilize structured output in production," that reliability comes from separating syntax and semantics, adding repair retries, and running the observation loop. With type-safe boundary design, I take on implementation work that raises generative AI to production-operations quality.