PydanticAI practical guide: running a type-safe AI agent in production (structured output, tools, DI, observability)

Following the official PydanticAI documentation, this guide walks through production-quality agent design in real code: creating an Agent, type-safe structured output via output_type, tools with @agent.tool / @agent.tool_plain and dependency injection with deps_type, self-repair with output_validator and ModelRetry, streaming with run_stream, observability with Logfire, and durable execution with Temporal / DBOS.

Author: 友田 陽大 · Reading time: 14 min read

Introduction: escaping the "string hell" of LLM apps

Left alone, an application that embeds an LLM turns into string hell. You pass in a string called a prompt, get back a string of free text, parse it nervously with regex or json.loads, and crash at runtime when the shape is off. Tests get signed off with "looks about right," and when something happens in production you grep the logs and guess. This is a regression to the untyped world that backend design spent 20 years leaving behind.

PydanticAI is an agent framework that answers this problem with Pydantic's philosophy. It is built by the team behind Pydantic itself, so the design has one consistent root: never trust data that crosses the system boundary. An LLM's output is just more unvalidated external input. So declare its shape with BaseModel, validate at the boundary, and pass it inward as a typed object. PydanticAI does for an LLM response what FastAPI does for an HTTP request.

The latest version at the time of writing is PydanticAI 2.0 (released June 23, 2026, Python 3.10+). The major version overhauled the API, and it diverges from older tutorials online on important points (result_type → output_type, system_prompt → instructions as the recommended option, and so on; covered below). This article follows the official documentation and uses the current 2.0 API throughout.

💡 The consistent theme of this blog: this portfolio is about building fast, cheaply, and safely as one person working with generative AI. PydanticAI is exactly the tool for using AI while a human holds the verification gate. Validate LLM output with types, isolate tools as deterministic code, and track behavior with observability: that is the design work that turns AI from a "smart guess" into "a part that ships in production." The TypeScript-side counterpart is the Vercel AI SDK production guide; for producing structured output with the raw API and no PydanticAI, see the LLM-structured-output guide built with Pydantic.


1. The minimal agent: run it in 5 lines

First, install. Choose either the full package or the lightweight pydantic-ai-slim with per-provider extras.

# Full package (all providers bundled)
pip install pydantic-ai

# Lightweight package plus only the providers you need (recommended)
pip install "pydantic-ai-slim[anthropic]"

The minimal agent is just this.

from pydantic_ai import Agent

agent = Agent(
    "anthropic:claude-sonnet-4-6",
    instructions="Be concise and answer in one sentence.",
)

result = agent.run_sync('Where does "hello world" come from?')
print(result.output)

Specify the model in the "provider:model-name" format ("anthropic:claude-sonnet-4-6", "openai:gpt-5.2", "google:gemini-3-flash-preview", etc.). There are three ways to run.

| Method | Use |
| --- | --- |
| agent.run_sync(...) | synchronous execution (scripts, batches) |
| await agent.run(...) | asynchronous execution (servers such as FastAPI) |
| async with agent.run_stream(...) as result: | streaming (chapter 6) |

💡 Use instructions (not system_prompt): v2 recommends instructions. The two differ in how they interact with conversation history. When you pass message_history, system_prompt messages stored in that history are sent again along with it, whereas instructions sends only the current agent's instructions. That avoids prompts accidentally piling up across turns, so choose instructions unless you have a specific reason not to.


2. Structured output: make the LLM a "typed function" with output_type

This is the core of PydanticAI. Pass a BaseModel to output_type and the LLM's response is automatically validated against that model, so result.output is a typed object.

from pydantic import BaseModel
from pydantic_ai import Agent


class CityLocation(BaseModel):
    city: str
    country: str


agent = Agent("anthropic:claude-sonnet-4-6", output_type=CityLocation)

result = agent.run_sync("Where were the 2012 Olympics held?")
print(result.output)        # city='London' country='United Kingdom'
print(result.output.city)   # 'London' (typed as str, so completion works)

result.output is of type CityLocation. Editor completion works, mypy and Pyright type-check it, and downstream code can rely on the guarantee that city is always a str. From the caller's side, the LLM behaves much like a typed function.

Output modes: ToolOutput / NativeOutput / PromptedOutput

By default, PydanticAI extracts structured output through the model's tool calling (function calling) feature, which is the most portable method. To allow several output types, pass them as a list.

from pydantic import BaseModel
from pydantic_ai import Agent, ToolOutput, NativeOutput


class Fruit(BaseModel):
    name: str
    color: str


class Vehicle(BaseModel):
    name: str
    wheels: int


# Default (via tool calling). You can also give each type an explicit tool name
agent = Agent(
    "anthropic:claude-sonnet-4-6",
    output_type=[
        ToolOutput(Fruit, name="return_fruit"),
        ToolOutput(Vehicle, name="return_vehicle"),
    ],
)

# NativeOutput is also available if the model supports native structured output
native_agent = Agent(
    "anthropic:claude-sonnet-4-6",
    output_type=NativeOutput([Fruit, Vehicle], name="fruit_or_vehicle"),
)

| Marker | Mechanism | Where to use |
| --- | --- | --- |
| ToolOutput (default) | structure via tool calling | portability first; the first choice that works on almost all models |
| NativeOutput | the model's native structured-output feature | the most reliable schema compliance on supported models |
| PromptedOutput | JSON requested in the prompt | a fallback for models with neither tools nor native support |

Why is this better? A design that lets the model return free text and parses it afterward is at the mercy of the LLM's whims every time: extra preambles, code-block fences, slightly different key names. output_type imposes the schema on the model side and validates the response with Pydantic, so a broken shape is funneled into a controlled "validation error → retry" flow. The shape of the output becomes a contract, and that contract lives on as the BaseModel source code. This is the exit from string hell.
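
To make the contrast concrete, here is a minimal stdlib-only sketch of validating at the boundary. It is not how PydanticAI is implemented: a plain dataclass stands in for the BaseModel, and `parse_city` and its error messages are names I made up for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CityLocation:
    city: str
    country: str


def parse_city(raw: str) -> CityLocation:
    """Turn raw model text into a typed object, or raise ValueError with a reason."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key in ("city", "country"):
        if not isinstance(data.get(key), str):
            raise ValueError(f"field {key!r} must be a string")
    return CityLocation(city=data["city"], country=data["country"])


# Typical free-text failure modes: a preamble, a renamed key, and a clean reply.
replies = [
    'Sure! Here is the answer: {"city": "London"}',
    '{"town": "London", "country": "United Kingdom"}',
    '{"city": "London", "country": "United Kingdom"}',
]

outcomes = []
for raw in replies:
    try:
        outcomes.append(parse_city(raw))
    except ValueError as exc:
        outcomes.append(str(exc))  # this message is what a retry would send back
```

The point of the sketch is that every failure ends up as a short, specific message rather than an unhandled crash, and a message like that is exactly what can be handed back to the model on a retry.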


3. Tools: hand "what it can do" to the LLM type-safely

Tools are how an agent acts on the outside world. In PydanticAI you just put a decorator on an ordinary Python function; the JSON schema passed to the LLM is generated from the function's type annotations and docstring.

import random
from pydantic_ai import Agent, RunContext

agent = Agent(
    "anthropic:claude-sonnet-4-6",
    deps_type=str,  # Explained in chapter 4. Here we inject the player's name
    instructions="You run a dice game. The player wins if the roll matches their guess.",
)


@agent.tool_plain
def roll_dice() -> str:
    """Roll a six-sided die and return the result."""
    return str(random.randint(1, 6))


@agent.tool
def get_player_name(ctx: RunContext[str]) -> str:
    """Get the player's name."""
    return ctx.deps

The difference between the two decorators is whether it receives a context.

  • @agent.tool: takes ctx: RunContext[...] as the first argument. Can access dependencies (DB connections, API keys, user information, etc.).
  • @agent.tool_plain: a pure tool that doesn't need a context.

Arguments other than ctx become the tool's input schema as-is. PydanticAI parses the docstring format (Google / NumPy / Sphinx) and copies each argument's description into the schema as well.

@agent.tool_plain(docstring_format="google", require_parameter_descriptions=True)
def search_products(keyword: str, max_results: int = 10) -> list[str]:
    """Search for products.

    Args:
        keyword: The search keyword.
        max_results: Maximum number of results to return.
    """
    ...
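
To give a feel for how a schema can be derived from a signature, here is a rough stdlib-only sketch. It is a deliberate simplification and not PydanticAI's actual generator: `tool_schema` and the `JSON_TYPES` mapping are hypothetical, and real schema generation handles far more types as well as per-argument descriptions.

```python
import inspect
from typing import get_type_hints

# Simplified mapping from Python annotations to JSON Schema type names.
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def tool_schema(func) -> dict:
    """Build a JSON-Schema-like description of a function's parameters."""
    hints = get_type_hints(func)
    properties = {}
    required = []
    for name, param in inspect.signature(func).parameters.items():
        properties[name] = {"type": JSON_TYPES.get(hints.get(name), "object")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default means the model must supply it
        else:
            properties[name]["default"] = param.default
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


def search_products(keyword: str, max_results: int = 10) -> list:
    """Search the product catalog."""
    return []


schema = tool_schema(search_products)
```

Because the schema comes from the same signature the interpreter enforces, the description the model sees cannot silently drift from the code it ends up calling.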

ModelRetry: asking the LLM for a redo from inside a tool

When a tool decides its input is invalid, it can raise ModelRetry to prompt the LLM to correct the call instead of crashing with an ordinary exception.

from pydantic_ai import Agent, ModelRetry

agent = Agent("anthropic:claude-sonnet-4-6")


@agent.tool_plain
def lookup_user(user_id: str) -> str:
    if not user_id.startswith("usr_"):
        # Don't crash: tell the LLM to call again with the correct format
        raise ModelRetry("user_id must start with 'usr_'.")
    return f"Info for user {user_id}"

Why is this better? A tool is deterministic code; the LLM is fuzzy judgment. Keeping the two apart is what makes the design robust. Confine work that must be exact, such as inventory lookups, payments, and DB writes, to tools (ordinary Python), and leave to the LLM only the decision of when to call them and with which arguments. ModelRetry repairs, within the conversation, any mismatch that arises at that boundary. In effect, PydanticAI offers as a framework feature the "separation of deterministic code and probabilistic judgment" discussed in the tool-use design of AI agents.
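
The feedback loop can be sketched without any model at all. In the toy below, `ToolRetry` is a hypothetical stand-in for ModelRetry, and `scripted_model` is a fake that fixes its argument once it has seen an error message. The loop shows the general idea only, not the framework's internals.

```python
class ToolRetry(Exception):
    """Hypothetical stand-in for ModelRetry: 'call me again with better input'."""


def lookup_user(user_id: str) -> str:
    """Deterministic tool: plain Python, no model involved."""
    if not user_id.startswith("usr_"):
        raise ToolRetry("user_id must start with 'usr_'.")
    return f"profile for {user_id}"


def scripted_model(feedback: list) -> str:
    """Fake model: guesses a bare id first, then fixes it once it sees feedback."""
    return "usr_42" if feedback else "42"


def call_tool_with_feedback(max_retries: int = 1):
    """Let the 'model' pick the argument; route ToolRetry messages back to it."""
    feedback = []
    for _ in range(max_retries + 1):
        argument = scripted_model(feedback)
        try:
            return lookup_user(argument), feedback
        except ToolRetry as exc:
            feedback.append(str(exc))
    raise RuntimeError(f"tool still failing after {max_retries} retries")


result, feedback = call_tool_with_feedback()
```

Note where each responsibility sits: the tool alone decides what counts as valid, and the model side only chooses arguments and reacts to the message it gets back.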


4. Dependency injection: inject side effects and make them testable

If a tool touches a DB or an external API, that dependency must not be hardcoded. PydanticAI provides dependency injection (DI) via deps_type. If you know FastAPI's Depends, the idea is the same.

from dataclasses import dataclass
import httpx
from pydantic_ai import Agent, RunContext


@dataclass
class Deps:
    """External dependencies the agent needs. Bundling them in a dataclass is the standard approach."""
    api_key: str
    http_client: httpx.AsyncClient


agent = Agent("anthropic:claude-sonnet-4-6", deps_type=Deps)


@agent.tool
async def fetch_weather(ctx: RunContext[Deps], city: str) -> str:
    """Get the weather for the given city."""
    resp = await ctx.deps.http_client.get(
        "https://api.example.com/weather",
        params={"city": city, "key": ctx.deps.api_key},
    )
    return resp.text


async def main(client: httpx.AsyncClient) -> None:
    deps = Deps(api_key="...", http_client=client)
    result = await agent.run("What's the weather in Tokyo?", deps=deps)
    print(result.output)

Declare the type with deps_type=Deps, and pass the instance at runtime with agent.run(..., deps=deps). Inside the tool, you can access it type-safely with ctx.deps.

Why is this better? The real value of DI is testability. In production you inject the actual httpx.AsyncClient; in tests, a mock. With agent.override(deps=...) you can swap the dependency only for the duration of a test. You can verify just the tool's logic without calling the LLM, or run the whole flow against a deterministic test model ("test"), which opens a path to testing AI-including code without AI. This is the "build the verification path first" principle from CLAUDE.md in practice.
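
As a minimal illustration of "testing the tool's logic without the LLM," the sketch below calls the tool body directly with a hand-written fake. `FakeWeatherClient`, `FakeResponse`, and `FakeContext` are hypothetical test doubles I introduce here; `FakeContext` only mimics the `.deps` attribute of RunContext, and the tool is shown without its decorator so the snippet runs on the standard library alone.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeResponse:
    text: str


class FakeWeatherClient:
    """Test double with the same async .get() shape the tool relies on."""

    def __init__(self) -> None:
        self.calls = []

    async def get(self, url: str, params: dict) -> FakeResponse:
        self.calls.append((url, params))
        return FakeResponse(text=f"sunny in {params['city']}")


@dataclass
class Deps:
    api_key: str
    http_client: FakeWeatherClient  # production code would inject httpx.AsyncClient


@dataclass
class FakeContext:
    """Hypothetical stand-in for RunContext[Deps]; only .deps is needed here."""
    deps: Deps


async def fetch_weather(ctx: FakeContext, city: str) -> str:
    """Same body as the tool above, minus the decorator."""
    resp = await ctx.deps.http_client.get(
        "https://api.example.com/weather",
        params={"city": city, "key": ctx.deps.api_key},
    )
    return resp.text


client = FakeWeatherClient()
ctx = FakeContext(deps=Deps(api_key="test-key", http_client=client))
weather = asyncio.run(fetch_weather(ctx, "Tokyo"))
```

Since the fake records every call, a test can assert on both the return value and the exact request the tool tried to make, with no network access and no API key.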


5. Output validators and self-repair: build verification into the conversation loop

Some checks are semantic, and Pydantic validation of output_type alone cannot cover them: is the SQL the LLM generated actually executable, or is the proposed date a business day? Do that verification in @agent.output_validator, and on failure raise ModelRetry to make the LLM produce a new answer.

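from pydantic import BaseModel
from pydantic_ai import Agent, ModelRetry, RunContext


class QueryError(Exception):
    """Placeholder: whatever error your DB layer raises for a bad query."""


class DatabaseConn:
    """Placeholder for your own connection wrapper (not part of PydanticAI)."""

    async def execute(self, sql: str) -> None: ...


class SqlQuery(BaseModel):
    sql_query: str


agent = Agent(
    "anthropic:claude-sonnet-4-6",
    output_type=SqlQuery,
    deps_type=DatabaseConn,  # e.g. a DB connection
)


@agent.output_validator
async def validate_sql(ctx: RunContext[DatabaseConn], output: SqlQuery) -> SqlQuery:
    try:
        # Use EXPLAIN to check only whether the query can run (it is not executed)
        await ctx.deps.execute(f"EXPLAIN {output.sql_query}")
    except QueryError as e:
        # Return the failure, error message included, so the LLM generates a fix
        raise ModelRetry(f"Invalid query: {e}") from e
    return output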

This loop is powerful. Pydantic's syntactic verification (types, required fields, constraints) and output_validator's semantic verification (is it correct for the business?) run as two stages, and a failure in either is fed back into the conversation as a retry request. You can declaratively assemble a self-repairing agent that redoes the work until it produces valid output.

⚠️ Cap the retries: limit the retry count with Agent(..., retries=2). Unlimited retries make cost and latency unbounded. Frequent retries are a sign of a design flaw in the prompt or schema, so prioritize fixing the root cause, for example by conveying each field's intent to the LLM with Field(description=...) (detailed in the LLM-structured-output guide built with Pydantic).
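
The two-stage check plus a retry cap can be sketched using the business-day example from earlier in this chapter. This is a stdlib-only illustration with scripted "model" attempts. `Retry` is a hypothetical stand-in for ModelRetry, and the weekday rule ignores public holidays.

```python
from datetime import date


class Retry(Exception):
    """Hypothetical stand-in for ModelRetry."""


def syntactic_check(raw: str) -> date:
    """Stage 1: is it a well-formed ISO date at all?"""
    try:
        return date.fromisoformat(raw)
    except ValueError as exc:
        raise Retry(f"not an ISO date: {raw!r}") from exc


def semantic_check(day: date) -> date:
    """Stage 2: is it acceptable to the business? (Mon-Fri only, holidays ignored)"""
    if day.weekday() >= 5:
        raise Retry(f"{day.isoformat()} falls on a weekend; propose a weekday")
    return day


def propose_date(attempts, retries: int = 2):
    """Consume scripted 'model' attempts; stop after `retries` corrections."""
    errors = []
    for raw in attempts[: retries + 1]:
        try:
            return semantic_check(syntactic_check(raw)), errors
        except Retry as exc:
            errors.append(str(exc))  # would be sent back to the model
    raise RuntimeError(f"no valid date after {retries} retries: {errors}")


# 2026-01-03 is a Saturday, 2026-01-05 is a Monday.
meeting, errors = propose_date(["next week", "2026-01-03", "2026-01-05"])
```

With retries=1 the same input would raise instead of succeeding, which is the intended behavior: a hard ceiling on cost, and a signal that the prompt or schema needs work.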


6. Streaming: flow partial results while validating

In a chat UI, returning only after everything is done is too slow. run_stream streams partial results while validating the structured output.

from pydantic_ai import Agent


async def stream_profile(agent: Agent, user_input: str) -> None:
    async with agent.run_stream(user_input) as result:
        # Validated "object so far" snapshots arrive one after another
        async for profile in result.stream_output():
            print(profile)
            # {'name': 'Ben'} → {'name': 'Ben', 'dob': date(1990, 1, 28)} → ...

To stream only text, use result.stream_text(); to receive the raw ModelResponse at a throttled rate, use result.stream_response(debounce_by=0.01).

⚠️ Beware of side effects on partial results: during streaming, still-incomplete objects are also passed to output_validator. Side effects such as DB writes should happen only for the completed version. Check the ctx.partial_output flag and skip validation and side effects for partial results. Neglecting this leads to polluting an external system with unconfirmed data.
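
The guard itself is small. The sketch below shows its shape with a hypothetical `FakeContext` that carries only a `partial_output` flag, and an in-memory `ProfileStore` standing in for a database. It illustrates the pattern and is not a drop-in output_validator.

```python
from dataclasses import dataclass, field


@dataclass
class FakeContext:
    """Hypothetical stand-in for RunContext; only the partial flag matters here."""
    partial_output: bool


@dataclass
class ProfileStore:
    saved: list = field(default_factory=list)


store = ProfileStore()


def validate_profile(ctx: FakeContext, profile: dict) -> dict:
    """Skip strict checks and side effects while the object is still streaming in."""
    if ctx.partial_output:
        return profile  # incomplete snapshot: pass it through for display only
    if "dob" not in profile:
        raise ValueError("final profile is missing 'dob'")
    store.saved.append(profile)  # side effect runs once, on the finished object
    return profile


snapshots = [
    ({"name": "Ben"}, True),
    ({"name": "Ben", "dob": "1990-01-28"}, True),
    ({"name": "Ben", "dob": "1990-01-28"}, False),  # final, complete output
]
for snapshot, is_partial in snapshots:
    validate_profile(FakeContext(partial_output=is_partial), snapshot)
```

Three snapshots pass through the validator, but the store ends up with exactly one record, the final one.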


7. Observability: see "AI's behavior" at a glance with Logfire

The biggest hurdle in an LLM app is debugging. "Why was this tool called?" "Which retry fixed it?" "Where am I wasting tokens?" Print debugging cannot answer these. PydanticAI integrates with Logfire, from the same Pydantic team, and instruments all of this behavior on top of OpenTelemetry.

import logfire
from pydantic_ai import Agent

logfire.configure()
logfire.instrument_pydantic_ai()  # This alone instruments every agent

agent = Agent("anthropic:claude-sonnet-4-6")
result = agent.run_sync("...")
print(result.usage)  # RunUsage(input_tokens=62, output_tokens=1, requests=1)

With just two lines (configure + instrument_pydantic_ai), every agent run, tool call, retry, and token count shows up as a trace. result.usage gives you the token and request counts directly, which is the foundation for cost monitoring. To see the raw HTTP requests as well, add logfire.instrument_httpx(capture_all=True).
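
As a rough sketch of what "foundation for cost monitoring" can look like, the snippet below turns token counters into an estimated cost. The `Usage` dataclass is a hypothetical mirror of the three counters shown above, and the per-token prices are placeholders rather than any provider's real rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Hypothetical mirror of the three counters shown above."""
    input_tokens: int
    output_tokens: int
    requests: int


# Placeholder prices in USD per million tokens; substitute your provider's rates.
INPUT_PRICE_PER_MTOK = 3.0
OUTPUT_PRICE_PER_MTOK = 15.0


def estimate_cost(usage: Usage) -> float:
    """Convert token counts into an estimated dollar cost."""
    return (
        usage.input_tokens * INPUT_PRICE_PER_MTOK
        + usage.output_tokens * OUTPUT_PRICE_PER_MTOK
    ) / 1_000_000


def total(usages) -> Usage:
    """Sum per-run counters, e.g. for a daily budget check."""
    return Usage(
        input_tokens=sum(u.input_tokens for u in usages),
        output_tokens=sum(u.output_tokens for u in usages),
        requests=sum(u.requests for u in usages),
    )


runs = [Usage(62, 1, 1), Usage(1_200, 340, 3)]
daily = total(runs)
daily_cost = estimate_cost(daily)
```

In practice you would feed in the counters from each real run and alert when the daily total crosses a budget threshold.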

Why is this better? PydanticAI's instrumentation follows OpenTelemetry's GenAI semantic conventions, so it can also flow to OTel backends other than Logfire (Grafana, Datadog, etc.). The AI's execution therefore fits naturally into the "correlate the three pillars" design I discussed in the OpenTelemetry observability guide. When running an AI agent in production, being able to trace a stalled process at a glance is not a feature but a precondition. Seeing which agent run or tool call consumes time and tokens is the basis for shorter debugging and early detection of cost anomalies.


8. Production resilience: "continue from where it stopped" with durable execution

An agent hits external APIs repeatedly, chains tools, and sometimes waits for human approval (human-in-the-loop). Long-running workflows like these will sooner or later crash midway: API rate limits, network drops, process restarts. Starting over from the beginning each time is unacceptable for both cost and UX.

PydanticAI provides durable execution via integration with Temporal / DBOS / Prefect / Restate. Just by wrapping an existing Agent in a dedicated class, progress is persisted and you can resume from where you left off across a failure.

from pydantic_ai import Agent
from pydantic_ai.durable_exec.temporal import TemporalAgent

# name is required (used to identify workflows/activities)
agent = Agent(
    "anthropic:claude-sonnet-4-6",
    instructions="...",
    name="geography",
)

temporal_agent = TemporalAgent(agent)  # Run inside a Temporal workflow

With DBOS, state is checkpointed to a database.

from pydantic_ai.durable_exec.dbos import DBOSAgent

dbos_agent = DBOSAgent(agent)
result = await dbos_agent.run("What is the capital of Mexico?")

| Backend | Nature | Best suited for |
| --- | --- | --- |
| Temporal | a workflow engine with powerful retries and timers | complex long-running orchestration |
| DBOS | a lightweight DB-checkpoint approach | when you want to keep state in an existing DB |
| Prefect / Restate | data-pipeline / durable-RPC oriented | matching whichever platform you already run |

⚠️ Always set name=: an agent wrapped for durable execution requires name= (it becomes the workflow identifier). TemporalAgent also has backend-specific constraints, such as having to be defined at the module's top level. When adopting one, always consult the target backend's documentation.

Why does this work? In the internal AI platform I built for a broadcaster (program-production support), I secured resilience by splitting long AI jobs out into Cloud Workflows / Cloud Run Jobs. PydanticAI's durable execution addresses that same requirement, running a long job in a form that survives crashes, at the agent layer. Think of it as the AI-agent version of moving work too heavy for FastAPI's BackgroundTasks onto a job platform (see the FastAPI production-operation guide).
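
The core idea behind "continue from where it stopped" is checkpointing each completed step. The toy below uses sqlite3 to show that idea in isolation. It is a conceptual sketch only, not how Temporal or DBOS work internally, and `run_workflow`, the step names, and the simulated crash are all made up for illustration.

```python
import sqlite3


def run_workflow(conn, steps, fail_at=None):
    """Run steps in order, persisting each result so a rerun skips finished work."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints (step TEXT PRIMARY KEY, result TEXT)"
    )
    executed = []
    for name, func in steps:
        row = conn.execute(
            "SELECT result FROM checkpoints WHERE step = ?", (name,)
        ).fetchone()
        if row is not None:
            continue  # already done in an earlier run: do not pay for it again
        if name == fail_at:
            raise ConnectionError(f"simulated crash during {name!r}")
        conn.execute("INSERT INTO checkpoints VALUES (?, ?)", (name, func()))
        conn.commit()
        executed.append(name)
    return executed


steps = [
    ("fetch", lambda: "raw data"),
    ("summarize", lambda: "summary"),
    ("notify", lambda: "sent"),
]

conn = sqlite3.connect(":memory:")  # a file path would survive a real restart
try:
    run_workflow(conn, steps, fail_at="summarize")
except ConnectionError:
    pass  # the process "crashed" after finishing only the first step

resumed = run_workflow(conn, steps)  # second run picks up where the first stopped
done = [row[0] for row in conn.execute("SELECT step FROM checkpoints ORDER BY rowid")]
```

On the second run, the step that completed before the crash is not executed again. For an agent, that translates directly into LLM calls and tool calls you do not pay for twice.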


Conclusion: change AI into "a part that ships in production"

PydanticAI is a framework that lifts an LLM app from a "smart guess" to "a production system that is type-safe, observable, and able to recover from crashes." To recap the key points of this article:

  1. Make a minimal agent with Agent + instructions, and validate the output as a typed object with output_type=BaseModel (choosing among ToolOutput / NativeOutput / PromptedOutput as appropriate).
  2. Define tools with @agent.tool / @agent.tool_plain, and auto-generate the schema from type annotations and the docstring. Self-repair discrepancies with ModelRetry.
  3. Inject side effects with deps_type and make them testable (swap mocks with override).
  4. Build semantic verification into the conversation loop with @agent.output_validator + ModelRetry, and cap the retries.
  5. Stream while validating with run_stream + stream_output (control side effects with partial_output).
  6. Integrate into OpenTelemetry with Logfire (instrument_pydantic_ai) and see all behavior and cost at a glance.
  7. Make long-running workflows fault-tolerant with durable execution (Temporal / DBOS, etc.).

At the root of PydanticAI is the same discipline as Pydantic itself: validate data coming from outside (LLM output included) at the boundary before passing it inward. That consistency is what connects AI to production-grade reliability.

For primary sources, I recommend re-reading the official PydanticAI documentation with this article's perspective in mind.


Consultation on type-safe AI-agent development

The author has designed and operated production-quality backends that embed generative AI, including an internal AI platform for a major domestic broadcaster. Validating LLM output with types, isolating tools as deterministic code, tracking behavior with observability, and building crash-resistant workflows with durable execution: I implement this design, quickly and at high quality with the help of generative AI, so that AI is not merely "running" but is held to the reliability the business requires. Feel free to consult me about building AI agents, RAG, and structured-extraction pipelines with PydanticAI / FastAPI.

友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.
