LLM structured output built with Pydantic: implementing JSON Schema generation, validation, and a self-healing loop with the raw API

Introduction: an LLM's output is also "unvalidated external input"

Extract the amount and line items from an invoice image. Classify an inquiry email by urgency, category, and summary. Turn meeting minutes into structured to-dos. Much of the practical work done with LLMs comes down to extracting fixed-shape data from free text, and that brings back a problem every backend engineer knows: an LLM's output is just another unvalidated external input.

In the PydanticAI practical guide, I covered how to solve this problem with a framework. This article goes one layer down: implementing structured output with only the raw Anthropic / OpenAI APIs and Pydantic, without a framework. Why learn the raw API? You may want to embed it in an existing codebase with minimal dependencies, control provider-specific features (prompt caching and the like) in detail, or simply understand exactly what is happening. For those practical needs, knowing what sits inside the abstraction pays off.

Pydantic's role reduces to a single principle: a Pydantic model is both the sender of the schema and the validator of the response. From one BaseModel you ① generate the JSON Schema passed to the LLM (sending) and ② validate the JSON that comes back (receiving). This article implements that round trip in line with the official documentation, with a little extra explanation along the way.

💡 For TypeScript readers: the same approach implemented with Zod is covered in the articles on reliability design for structured output and on the discipline of TypeScript type safety. This article is the Python / Pydantic counterpart.

1. One model plays two roles

First, declare the shape of the data you want to extract with BaseModel. This becomes the Single Source of Truth.

from pydantic import BaseModel, Field


class Invoice(BaseModel):
    vendor_name: str = Field(description="請求元の会社名")
    total_amount: int = Field(description="税込の合計金額（円、整数）", ge=0)
    due_date: str = Field(description="支払期日。YYYY-MM-DD 形式")
    line_items: list[str] = Field(description="明細の品目名リスト")

From this one class, you can generate ① the schema for sending.

schema = Invoice.model_json_schema()
# {
#   "type": "object",
#   "properties": {
#     "vendor_name": {"type": "string", "description": "請求元の会社名"},
#     "total_amount": {"type": "integer", "minimum": 0, "description": "..."},
#     ...
#   },
#   "required": ["vendor_name", "total_amount", "due_date", "line_items"]
# }

And with the same class, you can also do ② the validation for receiving.

raw = '{"vendor_name":"Acme","total_amount":50000,"due_date":"2026-07-31","line_items":["設計費"]}'
invoice = Invoice.model_validate_json(raw)  # validate and get a typed object

Why is this better? The most common bug in LLM integrations is drift between the schema told to the LLM and the type the code expects. If the schema is a hand-written dict and validation lives in a separate function, sooner or later you fix one and forget the other, a classic accident caused by violating DRY. With the Pydantic model as the source of truth, both the schema and the validation derive from the same definition, so they cannot drift apart structurally. Add one field, and the sending schema and the receiving validation update together.
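
To make the "cannot drift apart" claim concrete, here is a minimal sketch. The model name InvoiceV2 and the currency field are hypothetical, added only for illustration; they are not part of the Invoice model above. Adding the field changes the sending schema and the receiving validation at once.

```python
from pydantic import BaseModel, Field, ValidationError


class InvoiceV2(BaseModel):
    vendor_name: str = Field(description="Name of the billing company")
    total_amount: int = Field(description="Total amount including tax", ge=0)
    # Hypothetical new field, added only to illustrate the point.
    currency: str = Field(description="ISO 4217 currency code, e.g. JPY")


# Sending side: the new field shows up in the schema automatically.
schema = InvoiceV2.model_json_schema()
print(sorted(schema["properties"]))
print(schema["required"])

# Receiving side: a payload without the new field is now rejected.
try:
    InvoiceV2.model_validate_json('{"vendor_name": "Acme", "total_amount": 50000}')
    missing_fields = []
except ValidationError as e:
    missing_fields = [err["loc"][0] for err in e.errors() if err["type"] == "missing"]

print(missing_fields)
```

No second definition had to be touched: the one class drives both directions.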

2. Guide the LLM with the schema: `description` and `examples` are critical

How accurately the LLM returns schema-conforming data depends heavily on the quality of the descriptions embedded in the schema. Field(description=...) is not mere documentation; it is the extraction instruction the LLM itself reads.

from typing import Literal
from pydantic import BaseModel, Field


class SupportTicket(BaseModel):
    """ユーザーからの問い合わせを構造化したもの。"""  # the docstring becomes the schema's description

    category: Literal["bug", "billing", "feature_request", "other"] = Field(
        description="問い合わせの分類。判断に迷う場合は other を選ぶ。"
    )
    urgency: int = Field(
        description="緊急度を1（低）〜5（高）で。サービス停止に言及があれば5。",
        ge=1, le=5,
    )
    summary: str = Field(
        description="問い合わせ内容の日本語1文要約。",
        examples=["決済画面でエラーが出てログインできない"],
    )

As the official documentation confirms, description is carried into the generated JSON Schema as-is, and examples are placed on the schema in the same way.

SupportTicket.model_json_schema()
# urgency → {"type": "integer", "minimum": 1, "maximum": 5, "description": "緊急度を..."}
# summary → {"type": "string", "description": "...", "examples": ["決済画面で..."]}

Literal is expressed as an enum in the schema, which structurally narrows the values the LLM can output. examples acts as few-shot examples for the LLM and reduces variance in the output format.

⚠️ Pitfalls with by_alias and $ref: model_json_schema() defaults to by_alias=True, so a field declared with Field(alias=...) appears in the schema under its alias. The LLM will answer with that alias, and the validation side has to accept it consistently. In addition, nested models are expressed in the schema with $defs + $ref, but the strict structured-output mode of some LLM providers does not handle $ref well. In that case you need a preprocessing step that flattens the schema (Pydantic itself does not do the flattening). For OpenAPI compatibility, specify ref_template="#/components/schemas/{model}".
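
The flattening step mentioned above can be sketched as follows. inline_refs is a hypothetical helper written for this article, not a Pydantic API, and LineItem / Order are illustrative models. It assumes the schema uses Pydantic's default "#/$defs/..." reference template and is not recursive (a self-referencing model would never terminate).

```python
from copy import deepcopy
from typing import Any

from pydantic import BaseModel


class LineItem(BaseModel):
    name: str
    price: int


class Order(BaseModel):
    items: list[LineItem]


def inline_refs(schema: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `schema` with every local "#/$defs/..." $ref inlined.

    Assumption: no recursive models, and the default ref template is in use.
    """
    schema = deepcopy(schema)
    defs = schema.pop("$defs", {})

    def resolve(node: Any) -> Any:
        if isinstance(node, dict):
            ref = node.get("$ref")
            if isinstance(ref, str) and ref.startswith("#/$defs/"):
                target = deepcopy(defs[ref.removeprefix("#/$defs/")])
                # Keep sibling keys such as "description" that sit next to the $ref.
                siblings = {k: v for k, v in node.items() if k != "$ref"}
                return resolve({**target, **siblings})
            return {key: resolve(value) for key, value in node.items()}
        if isinstance(node, list):
            return [resolve(item) for item in node]
        return node

    return resolve(schema)


flat = inline_refs(Order.model_json_schema())
print(flat["properties"]["items"]["items"]["type"])
```

Whether a given provider needs this at all depends on that provider's current schema support, so treat it as a fallback and check the provider's documentation first.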

3. Pass it to the provider: Anthropic's tool use as an example

The surest way to pass the generated schema to the LLM is to use it as the input schema of a tool (function calling). With the Anthropic Messages API, you define a tool and force the model to call it.

import anthropic
from pydantic import BaseModel, Field


class Invoice(BaseModel):
    vendor_name: str = Field(description="請求元の会社名")
    total_amount: int = Field(description="税込の合計金額（円）", ge=0)


client = anthropic.Anthropic()  # the API key is read from an environment variable (never hard-code it)

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=[
        {
            "name": "save_invoice",
            "description": "抽出した請求書データを保存する。",
            "input_schema": Invoice.model_json_schema(),  # ← this is the key part
        }
    ],
    tool_choice={"type": "tool", "name": "save_invoice"},  # force the model to call this tool
    messages=[{"role": "user", "content": "請求書テキスト: ..."}],
)

# Pull the input (a dict) out of the tool-use block and validate it
tool_use = next(b for b in message.content if b.type == "tool_use")
invoice = Invoice.model_validate(tool_use.input)  # ← trust it only after validation
print(invoice.total_amount)  # guaranteed to be an int

Two points matter here: pass model_json_schema() as input_schema, and force the tool call with tool_choice. Together they make the model return structured data that follows the schema instead of free text. tool_use.input is a dict, so validate it with model_validate (for a raw JSON string, use model_validate_json as in chapter 1).
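
One detail worth hardening: a bare next(...) raises StopIteration if the response contains no tool-use block. The sketch below wraps the lookup in a hypothetical helper, parse_tool_input, that fails with a clearer error. To keep it runnable without an API call, SimpleNamespace objects stand in for the content blocks of a real response; that stand-in is an assumption made for illustration.

```python
from types import SimpleNamespace
from typing import Any, Iterable, TypeVar

from pydantic import BaseModel, Field

T = TypeVar("T", bound=BaseModel)


class Invoice(BaseModel):
    vendor_name: str = Field(description="Name of the billing company")
    total_amount: int = Field(description="Total amount including tax", ge=0)


def parse_tool_input(content: Iterable[Any], tool_name: str, model: type[T]) -> T:
    """Find the tool_use block for `tool_name` and validate its input dict."""
    for block in content:
        is_tool_use = getattr(block, "type", None) == "tool_use"
        if is_tool_use and getattr(block, "name", None) == tool_name:
            return model.model_validate(block.input)
    raise ValueError(f"no tool_use block for {tool_name!r} in the response")


# Stand-in for message.content, so the sketch runs without calling the API.
fake_content = [
    SimpleNamespace(type="text", text="Saving the invoice."),
    SimpleNamespace(
        type="tool_use",
        name="save_invoice",
        input={"vendor_name": "Acme", "total_amount": 50000},
    ),
]
invoice = parse_tool_input(fake_content, "save_invoice", Invoice)
print(invoice.total_amount)
```

With a real response you would pass message.content in place of fake_content.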

💡 The idea is the same with OpenAI: with OpenAI's Structured Outputs (a JSON Schema passed via response_format), the round trip is unchanged. You pass a schema generated with Model.model_json_schema() and validate the returned JSON with model_validate_json. Note that a provider's strict mode may place extra requirements on the schema, so the generated schema can need adjusting before it is accepted. The pattern stays the same when the provider changes, and that is the value of understanding the raw API. For Anthropic / Claude API details, see the Claude API practical guide.
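
As a rough sketch of the OpenAI side, the block below only builds the request payload and validates a stand-in reply; it makes no API call. The response_format shape reflects my understanding of the Chat Completions API at the time of writing, and the use of extra="forbid" to emit additionalProperties: false for strict mode is likewise an assumption to verify against OpenAI's current reference.

```python
from pydantic import BaseModel, ConfigDict, Field


class Invoice(BaseModel):
    # extra="forbid" makes Pydantic emit "additionalProperties": false in the schema.
    model_config = ConfigDict(extra="forbid")

    vendor_name: str = Field(description="Name of the billing company")
    total_amount: int = Field(description="Total amount including tax")


# Assumed payload shape for response_format; check the provider docs before relying on it.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "invoice",
        "schema": Invoice.model_json_schema(),
        "strict": True,
    },
}

# The receiving side is unchanged: validate whatever JSON text comes back.
returned_json = '{"vendor_name": "Acme", "total_amount": 50000}'  # stand-in for the model's reply
invoice = Invoice.model_validate_json(returned_json)
print(response_format["json_schema"]["schema"]["additionalProperties"], invoice.vendor_name)
```

The sending half differs per provider, while the receiving half is the same model_validate_json call as before.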

4. Validate the response: when to use `strict`

To validate the returned data, use model_validate (from a dict) or model_validate_json (from a JSON string). For a JSON string, model_validate_json parses and validates in a single pass, which is efficient.

One practical decision remains: whether to use strict=True.

# lax (default): even if the LLM returns the string "50000", it is converted to the int 50000
Invoice.model_validate_json(raw)

# strict: requires an exact type match; "50000" (a string) is rejected
Invoice.model_validate_json(raw, strict=True)

⚠️ With an LLM, strict means more retries: even when the schema says integer, an LLM can still return a number as a string ("50000"). With strict=True at the boundary, every such output becomes a validation error, retries fire frequently, and cost and latency balloon. For validating LLM output, relying on type coercion in the default lax mode is the realistic choice. Reserve strictness for the specific fields where you absolutely must not accept a string as a number, and mark only those with Field(strict=True). That split is the sensible landing point (for details on strict and type coercion, see chapter 4 of the Pydantic v2 practical guide).
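
The three options can be compared side by side. This is a minimal sketch: StrictAmountInvoice and the error_types helper are hypothetical names introduced here, and the payload deliberately sends the amount as a string.

```python
from pydantic import BaseModel, Field, ValidationError


class Invoice(BaseModel):
    vendor_name: str
    total_amount: int = Field(ge=0)


class StrictAmountInvoice(BaseModel):
    vendor_name: str
    # Only this field refuses coercion; the other fields stay lax.
    total_amount: int = Field(ge=0, strict=True)


raw = '{"vendor_name": "Acme", "total_amount": "50000"}'  # number sent as a string


def error_types(model: type[BaseModel], **kwargs) -> list[str]:
    """Validate `raw` against `model` and return the error types (empty if valid)."""
    try:
        model.model_validate_json(raw, **kwargs)
        return []
    except ValidationError as e:
        return [err["type"] for err in e.errors()]


lax = Invoice.model_validate_json(raw)
print(type(lax.total_amount).__name__, lax.total_amount)  # coerced in lax mode
print(error_types(Invoice, strict=True))   # whole-model strict: rejected
print(error_types(StrictAmountInvoice))    # per-field strict: rejected
```

Per-field strictness keeps the retry cost confined to the fields where coercion would be genuinely dangerous.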

5. Self-healing loop: feed validation errors back to the LLM

When validation fails, simply raising an exception and stopping wastes what the LLM can do. ValidationError carries detailed error information, and if you turn that into the next prompt, the model can correct its own output. This is the heart of the self-healing loop.

ValidationError.errors() returns a structured description of what went wrong and where.

from pydantic import ValidationError

# Uses the four-field Invoice from chapter 1
try:
    Invoice.model_validate({"vendor_name": "Acme", "total_amount": "とても高い"})
except ValidationError as e:
    for err in e.errors(include_url=False):
        print(err["loc"], err["type"], err["msg"])
    # ('total_amount',) int_parsing Input should be a valid integer, ...
    # ('due_date',) missing Field required
    # ('line_items',) missing Field required

Each error has type (int_parsing, missing, and so on), loc (location), msg (a human-readable explanation), and input (the value actually received). These can be fed back to the LLM as-is.

import json
from pydantic import BaseModel, ValidationError


def extract_with_retry(client, prompt: str, model: type[BaseModel], max_retries: int = 2):
    messages = [{"role": "user", "content": prompt}]
    for attempt in range(max_retries + 1):
        # call_llm is a helper you write yourself (not shown here): it calls the tool
        # from chapter 3 and must return the tool input as a JSON string
        raw = call_llm(client, messages, model)
        try:
            return model.model_validate_json(raw)
        except ValidationError as e:
            if attempt == max_retries:
                raise  # cap reached; stop retrying
            # Drop the url to save tokens, and send the errors back in structured form
            feedback = e.errors(include_url=False, include_input=True)
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                # Prompt text: "The output had validation errors. Fix them and output again:"
                "content": f"出力に検証エラーがありました。修正して再出力してください:\n"
                           f"{json.dumps(feedback, ensure_ascii=False, default=str)}",
            })
    raise RuntimeError("unreachable")

Why is this better? ValidationError reports every error found in a single validation pass together, instead of raising once per field. One round of feedback can therefore say "the amount is invalid and the due date is missing" at the same time, which gives the LLM a good chance of fixing everything in one shot. include_url=False drops the errors.pydantic.dev/... URL attached to each error and saves feedback tokens. Pydantic's error structure serves directly as the teaching signal for the LLM, and that is the design that connects validation to re-prompting.
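
If raw JSON feels too verbose for the feedback message, the same error list can be condensed into one line per problem. This is a sketch of one possible format, not a recommended standard: format_errors is a hypothetical helper, and the three-field Invoice is a simplified stand-in for the model used above.

```python
from pydantic import BaseModel, Field, ValidationError


class Invoice(BaseModel):
    vendor_name: str
    total_amount: int = Field(ge=0)
    due_date: str


def format_errors(exc: ValidationError) -> str:
    """Render each error as one short line: dotted path, message, offending input."""
    lines = []
    for err in exc.errors(include_url=False):
        path = ".".join(str(part) for part in err["loc"]) or "(root)"
        lines.append(f"- {path}: {err['msg']} (got {err['input']!r})")
    return "\n".join(lines)


try:
    # Two problems at once: a negative amount and a missing due_date.
    Invoice.model_validate({"vendor_name": "Acme", "total_amount": -1})
except ValidationError as e:
    error_count = e.error_count()
    feedback = format_errors(e)
    print(error_count)
    print(feedback)
```

Both problems arrive in a single exception, so a single re-prompt can carry both.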

⚠️ A retry cap is essential: the self-healing loop is powerful, but without a cap, cost and latency are unbounded. Always set max_retries, and if the output still does not come out right, the proper response is to revisit the schema and description design (chapter 2). A retry is insurance against occasional fluctuations, not a standing remedy for design flaws.
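
To watch the loop and its cap behave without spending API calls, the sketch below restates the loop in self-contained form with one change to the signature: the LLM call is passed in as a function. fake_llm is a scripted stand-in (an assumption for illustration) whose first reply fails validation and whose second reply is corrected.

```python
import json
from typing import Callable

from pydantic import BaseModel, Field, ValidationError


class Invoice(BaseModel):
    vendor_name: str
    total_amount: int = Field(ge=0)


def extract_with_retry(
    call_llm: Callable[[list[dict]], str],
    prompt: str,
    model: type[BaseModel],
    max_retries: int = 2,
) -> BaseModel:
    messages = [{"role": "user", "content": prompt}]
    for attempt in range(max_retries + 1):
        raw = call_llm(messages)  # expected to return a JSON string
        try:
            return model.model_validate_json(raw)
        except ValidationError as e:
            if attempt == max_retries:
                raise  # cap reached: surface the error to the caller
            feedback = e.errors(include_url=False)
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": "The output failed validation. Fix it and answer again:\n"
                           + json.dumps(feedback, ensure_ascii=False, default=str),
            })
    raise RuntimeError("unreachable")


# Scripted stand-in for a real model: the first reply is wrong, the second is fixed.
replies = iter([
    '{"vendor_name": "Acme", "total_amount": "very expensive"}',
    '{"vendor_name": "Acme", "total_amount": 50000}',
])
calls: list[int] = []


def fake_llm(messages: list[dict]) -> str:
    calls.append(len(messages))  # record the conversation length at each call
    return next(replies)


invoice = extract_with_retry(fake_llm, "Invoice text: ...", Invoice)
print(invoice, calls)
```

Passing the call in as a function also makes the retry logic unit-testable, which is useful when tuning max_retries.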

6. Streaming partial validation: `TypeAdapter`-only `experimental_allow_partial`

In a chat UI, you sometimes want to validate incomplete, still-being-generated JSON incrementally. Pydantic has an experimental partial validation feature that returns only the part that can be validated, even from JSON that is cut off midway.

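from typing_extensions import TypedDict  # Pydantic requires this instead of typing.TypedDict on Python < 3.12
from pydantic import TypeAdapter


class Item(TypedDict):
    a: int
    b: float


adapter = TypeAdapter(list[Item])

# Even with truncated JSON (the second item's "b" has not arrived), the completed elements are returned
adapter.validate_json('[{"a": 1, "b": 2.0}, {"a": 1', experimental_allow_partial=True)
# → [{'a': 1, 'b': 2.0}]  ← the incomplete second item is dropped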

But this feature comes with serious constraints, which the official documentation states explicitly:

experimental_allow_partial is limited to TypeAdapter methods. It can't be used in BaseModel.model_validate_json ("You can only pass experimental_allow_partial to TypeAdapter methods").
Supported types are list / set / dict / TypedDict (non-required fields), etc. It doesn't propagate to a nested collection that passes through a BaseModel.
"It's an experimental feature and should be considered a proof of concept." Errors at the end of the input are all ignored.

⚠️ The key design point: for streaming partial validation, wrap the output in TypeAdapter(list[YourTypedDict]) and not in a BaseModel. BaseModel.model_validate_json(..., experimental_allow_partial=True) does not work, and this is a point many blog articles get wrong. If you use PydanticAI, streaming validation of structured output (stream_output) is provided by the framework (chapter 6 of the PydanticAI practical guide). Building partial validation on the raw API takes real effort, so if you need that much, considering PydanticAI is the sensible move.

7. How much to build yourself and where to leave it to a framework

So far we have looked at the raw-API implementation, but in practice there are three tiers of options. The right answer is to choose according to your requirements.

Approach	What it does	Suited scene
Raw API + Pydantic (this article)	build schema generation, validation, retry yourself	minimal dependency, fine control, embedding into existing code
`instructor` (third-party)	just pass `response_model=Model` for validation + auto-retry	quick validated extraction. Multi-provider support
PydanticAI	agents, tools, DI, observability, durable execution	full-fledged agents / long-running workflows

instructor is a library that thinly wraps the pattern in this article (model → schema → validation → retry). Note that it is a third-party library and not an official Pydantic project.

# With instructor, this article's round trip shrinks to a few lines (unofficial library)
import instructor

client = instructor.from_provider("anthropic/claude-sonnet-4-6")
invoice = client.create(response_model=Invoice, messages=[{"role": "user", "content": "..."}])
# Internally it handles JSON Schema generation, validation, and automatic retry on failure

The criterion is simple. For one-shot extraction or classification, the raw API + Pydantic or instructor is enough. If tool chaining, state, human approval, or long-running execution is involved, use PydanticAI. The goal of turning a "smart guess" into a "part that can go to production" is the same in every case; only the weight of the means differs.

Conclusion: take LLM output inside the type system

Making LLM structured output robust takes no special magic. Accept that an LLM's output is also unvalidated external input, and make the Pydantic model the single source of truth for both schema and validation. It all comes down to this boundary design. To restate the key points of this article:

One BaseModel plays two roles: send with model_json_schema(), receive with model_validate_json().
Guide the LLM with Field(description=...), examples, and Literal. The descriptions directly decide extraction accuracy (mind the handling of by_alias and $ref).
Trust the response only after validating it. Since strict increases retries against an LLM, the default lax is realistic at the boundary.
Convert ValidationError.errors(include_url=False) into a re-prompt and build a self-healing loop. Always cap retries.
Partial validation is limited to TypeAdapter + experimental_allow_partial. It doesn't work on BaseModel. If you genuinely need it, consider PydanticAI.
Choose between building it yourself, instructor, and PydanticAI according to the weight of your requirements.

What separates "a working LLM feature" from "an LLM feature trustworthy in production" is whether you can take the output inside the type system. Pydantic is the gatekeeper standing at that intake.

As official primary sources, I recommend re-reading the following from this article's viewpoint.

Consulting on LLM structured-extraction pipelines

The author has run long-running AI jobs and structured extraction at production quality on an internal AI platform for a major domestic broadcaster. Reliably extracting validated structured data from non-standard documents such as invoices, contracts, inquiries, and meeting minutes depends on the accumulated quality of schema design, validation boundaries, self-healing, and observability. I build LLM structured-extraction, classification, and RAG pipelines with Pydantic / PydanticAI / Claude API, quickly and to a high standard using generative AI. Feel free to reach out.

Related articles

Pydantic v2 Practical Guide: Protect the System Boundary with Types and Pass Only Trustworthy Data

PydanticAI practical guide: running a type-safe AI agent in production (structured output, tools, DI, observability)

Pydantic advanced-types / custom-validators practical guide: make reusable 'domain types' with Annotated

Practical pydantic-settings guide: realize 12-factor with type-safe configuration management and secret protection

Also worth reading

FastAPI Input Validation Practical Guide: Type-Safe Query/Path/Body/Form with Annotated, Killing External Input at the Boundary

marshmallow vs Pydantic — A Thorough Comparison: Choosing by Design Philosophy, Performance, and Ecosystem (2026 Decision Guide)

Python Data Types Complete Guide: The 'Right Use' of Numbers, Strings, and Collections, and Designs That Don't Break in Production
