# PydanticAI practical guide: running a type-safe AI agent in production (structured output, tools, DI, observability)

> Staying faithful to the official PydanticAI documentation, this guide walks through production-quality AI-agent design in real code: creating an Agent, type-safe structured output via output_type, tools with @agent.tool/tool_plain and dependency injection with deps_type, self-repair with output_validator and ModelRetry, streaming with run_stream, observability with Logfire, and durable execution with Temporal/DBOS.

- Published: 2026-06-26
- Author: 友田 陽大
- Tags: Python, Pydantic, PydanticAI, AI agents, LLM, type safety, observability
- URL: https://tomodahinata.com/en/blog/pydantic-ai-agent-framework-production-guide
- Category: Pydantic & type-safe validation
- Pillar guide: https://tomodahinata.com/en/blog/pydantic-v2-production-validation-type-safety

## Key points

- PydanticAI is an agent framework that applies Pydantic's discipline of "never trust anything from outside the boundary" to LLM output. v2.0 was released in June 2026.
- Its most powerful feature is output_type=BaseModel: the LLM's response is validated into a typed object, and on failure the agent self-repairs via ModelRetry.
- Define tools with @agent.tool (receives RunContext) or @agent.tool_plain; the docstring and type annotations automatically become a JSON schema. Inject side-effecting dependencies with deps_type so they stay testable.
- Observability plugs into OpenTelemetry through Logfire (logfire.instrument_pydantic_ai), so tool calls, retries, and token usage are all visible in one place.
- Production resilience comes from durable execution with Temporal/DBOS (TemporalAgent/DBOSAgent), which restores progress across API failures and crashes.

---

## **Introduction: escaping the "string hell" of LLM apps**

Left alone, an application that embeds an LLM turns into **string hell.** You pass in a string called a prompt, get back a string of free text, nervously parse it with regex or `json.loads`, and crash at runtime whenever the shape is off. Tests get signed off with "it looks about right," and when something goes wrong in production you grep the logs and guess. This is a regression to the "world without types" that backend design spent 20 years leaving behind.

**PydanticAI is an agent framework that answers this problem with Pydantic's philosophy.** It is developed by the team behind Pydantic itself, so the root of the design is consistent: **"never trust data coming from outside the system boundary."** LLM output is just another unvalidated external input. So you declare its shape with `BaseModel`, validate it at the boundary, and pass it inward as a typed object. PydanticAI does for **the LLM's response** what FastAPI does for an HTTP request.

The latest version at the time of writing is **PydanticAI 2.0** (released June 23, 2026, Python 3.10+). The major version overhauled the API, so it **diverges in important ways** from older tutorials online (`result_type` → `output_type`, `instructions` recommended over `system_prompt`, and so on; covered below). This article follows the [official documentation](https://pydantic.dev/docs/ai/overview/) and uses the correct 2.0 API throughout.

> 💡 **The consistent theme of this blog**: this portfolio is about "building fast, cheap, and safe as one person × generative AI." PydanticAI is exactly a tool for **"using AI while a human holds the verification gate."** Validate LLM output with types, isolate tools as deterministic code, and track behavior with observability: that is the design work that turns AI from a "smart guess" into "a part that ships in production." The TypeScript counterpart is the [Vercel AI SDK production guide](/blog/vercel-ai-sdk-production-llm-apps-streaming-tools-rag); to build structured output on the raw API without PydanticAI, see the [LLM-structured-output guide built with Pydantic](/blog/pydantic-llm-structured-output-json-schema-validation-guide).

---

## **1. The minimal agent: run it in 5 lines**

Start by installing. Choose either the full package or the lightweight `pydantic-ai-slim` with only the provider extras you need.

```bash
# Full package (all providers bundled)
pip install pydantic-ai

# Slim package plus only the providers you need (recommended)
pip install "pydantic-ai-slim[anthropic]"
```

The minimal agent is just this.

```python
from pydantic_ai import Agent

agent = Agent(
    "anthropic:claude-sonnet-4-6",
    instructions="Answer concisely, in one sentence.",
)

result = agent.run_sync('What is the origin of "hello world"?')
print(result.output)
```

Specify the model in **`"provider:model-name"`** format (`"anthropic:claude-sonnet-4-6"`, `"openai:gpt-5.2"`, `"google:gemini-3-flash-preview"`, etc.). There are three ways to run an agent.

| Method | Use |
| --- | --- |
| `agent.run_sync(...)` | synchronous execution (scripts, batches) |
| `await agent.run(...)` | asynchronous execution (servers like FastAPI) |
| `async with agent.run_stream(...) as result:` | streaming (section 6) |

> 💡 **Use `instructions` (not `system_prompt`)**: v2 recommends `instructions`. The two differ in how they interact with conversation history. When you pass `message_history`, system prompts stored in that history are sent along with it, whereas with `instructions` **only the current agent's instructions** are sent. This avoids prompts piling up across turns, so choose `instructions` unless you have a specific reason not to.
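
To make the difference concrete, here is a toy, stdlib-only model of how a request might be assembled in each style. It is a sketch of the behavior described above, not PydanticAI's implementation: `Message`, `run_turn`, and the rule that a system prompt is only stored on the first turn are all assumptions of this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    role: str  # "system", "user", or "assistant"
    text: str


def run_turn(history, user_prompt, reply, *, system_prompt=None, instructions=None):
    """Toy model of one agent turn. Returns (request, new_history)."""
    messages = list(history)
    if system_prompt is not None and not history:
        # Assumption of this sketch: a system prompt is stored as a message
        # on the first turn, so it travels with the history afterwards.
        messages.append(Message("system", system_prompt))
    messages.append(Message("user", user_prompt))
    # Instructions ride alongside the request and are never written to history.
    request = {"instructions": instructions, "messages": messages}
    return request, messages + [Message("assistant", reply)]


# Turn 1 is handled by a "translator" agent, turn 2 by a "summarizer" agent.
_, hist_sp = run_turn([], "Hello", "Bonjour", system_prompt="You translate to French.")
req_sp, _ = run_turn(hist_sp, "Summarize our chat", "...", system_prompt="You summarize.")

_, hist_in = run_turn([], "Hello", "Bonjour", instructions="You translate to French.")
req_in, _ = run_turn(hist_in, "Summarize our chat", "...", instructions="You summarize.")

system_texts = [m.text for m in req_sp["messages"] if m.role == "system"]
print(system_texts)            # the translator's prompt is replayed to the summarizer
print(req_in["instructions"])  # only the summarizer's instructions are sent
```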

---

## **2. Structured output: make the LLM a "typed function" with `output_type`**

This is the core of PydanticAI. Pass a `BaseModel` as `output_type` and **the LLM's response is automatically validated against that model,** so `result.output` is a **typed object.**

```python
from pydantic import BaseModel
from pydantic_ai import Agent


class CityLocation(BaseModel):
    city: str
    country: str


agent = Agent("anthropic:claude-sonnet-4-6", output_type=CityLocation)

result = agent.run_sync("Where were the 2012 Olympics held?")
print(result.output)        # city='London' country='United Kingdom'
print(result.output.city)   # 'London' (the editor completes this as str)
```

`result.output` has type `CityLocation`. Editor completion works, `mypy` and Pyright type-check it, and downstream code can rely on the guarantee that `city` is always a `str`. **The LLM behaves like a typed function.**
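
For intuition about what "validate at the boundary" means, the following stdlib-only sketch does by hand a small part of what `output_type` automates. It is an illustration under simplifying assumptions: `parse_city` is a hypothetical helper, a plain dataclass stands in for the `BaseModel`, and real Pydantic validation covers far more (coercion, nested models, constraints).

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CityLocation:
    city: str
    country: str


def parse_city(raw: str) -> CityLocation:
    """Validate untrusted JSON text and return a typed object, or raise ValueError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    values = {}
    for field in fields(CityLocation):
        value = data.get(field.name)
        if not isinstance(value, str):
            raise ValueError(f"field {field.name!r} must be a string")
        values[field.name] = value
    return CityLocation(**values)


location = parse_city('{"city": "London", "country": "United Kingdom"}')
print(location.city)  # London
```

Everything past `parse_city` can treat `location` as trusted, typed data; malformed input never gets that far.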

### **Output modes: `ToolOutput` / `NativeOutput` / `PromptedOutput`**

By default, PydanticAI extracts structured output through **the model's tool calling (function calling) feature.** This is the most portable approach. To allow several output types, pass them as a **list.**

```python
from pydantic import BaseModel
from pydantic_ai import Agent, ToolOutput, NativeOutput


class Fruit(BaseModel):
    name: str
    color: str


class Vehicle(BaseModel):
    name: str
    wheels: int


# Default (via tool calls). You can also name each output tool explicitly
agent = Agent(
    "anthropic:claude-sonnet-4-6",
    output_type=[
        ToolOutput(Fruit, name="return_fruit"),
        ToolOutput(Vehicle, name="return_vehicle"),
    ],
)

# If the model supports native structured output, NativeOutput is also an option
native_agent = Agent(
    "anthropic:claude-sonnet-4-6",
    output_type=NativeOutput([Fruit, Vehicle], name="fruit_or_vehicle"),
)
```

| Marker | Mechanism | When to use |
| --- | --- | --- |
| `ToolOutput` (default) | structured output via tool calling | portability first; works on almost every model |
| `NativeOutput` | the model's native structured-output feature | strictest schema compliance on models that support it |
| `PromptedOutput` | asks for JSON in the prompt | fallback for models with neither tool calling nor native structured output |

**Why is this superior?**
A design that lets the model return free text and parses it afterward is at the mercy of the LLM's whims every time (stray preambles, code-block fences, slightly different key names). `output_type` **imposes the schema on the model side and validates the response with Pydantic,** so a malformed response is channeled into a controlled "validation error → retry" flow. The output shape becomes a **contract,** and that contract lives in the source code as the `BaseModel` itself. This is the way out of string hell.
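
One way to picture tool-based output with several allowed types: each type gets its own "output tool," and the tool the model calls determines which type is built. The sketch below is stdlib-only and hypothetical (`OUTPUT_TOOLS` and `resolve_output` are illustrative names, and dataclasses do not check field types the way Pydantic does); it only shows the dispatch idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fruit:
    name: str
    color: str


@dataclass(frozen=True)
class Vehicle:
    name: str
    wheels: int


# Hypothetical registry: one output tool name per allowed output type.
OUTPUT_TOOLS = {"return_fruit": Fruit, "return_vehicle": Vehicle}


def resolve_output(tool_name: str, arguments: dict):
    """Map a model's tool call onto the matching output type."""
    try:
        output_cls = OUTPUT_TOOLS[tool_name]
    except KeyError:
        raise ValueError(f"unknown output tool: {tool_name!r}") from None
    try:
        return output_cls(**arguments)
    except TypeError as exc:  # missing or unexpected keys
        raise ValueError(f"bad arguments for {tool_name}: {exc}") from exc


result = resolve_output("return_vehicle", {"name": "bicycle", "wheels": 2})
print(type(result).__name__, result.wheels)  # Vehicle 2
```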

---

## **3. Tools: hand "what it can do" to the LLM type-safely**

A **tool** is how an agent acts on the outside world. In PydanticAI you **just add a decorator to an ordinary Python function.** The JSON schema passed to the LLM is **generated automatically** from the function's type annotations and docstring.

```python
import random
from pydantic_ai import Agent, RunContext

agent = Agent(
    "anthropic:claude-sonnet-4-6",
    deps_type=str,  # Explained in section 4; here it injects the player's name
    instructions="You are the host of a dice game. The player wins if the roll matches their guess.",
)


@agent.tool_plain
def roll_dice() -> str:
    """Roll a six-sided die and return the result."""
    return str(random.randint(1, 6))


@agent.tool
def get_player_name(ctx: RunContext[str]) -> str:
    """Get the player's name."""
    return ctx.deps
```

The difference between the two decorators is **whether it receives a context.**

- **`@agent.tool`**: takes `ctx: RunContext[...]` as the first argument. Can access dependencies (DB connections, API keys, user information, etc.).
- **`@agent.tool_plain`**: a pure tool that doesn't need a context.

Every argument other than `ctx` becomes part of **the tool's input schema.** PydanticAI parses the docstring format (Google / NumPy / Sphinx) and copies each argument's description into the schema as well.

```python
@agent.tool_plain(docstring_format="google", require_parameter_descriptions=True)
def search_products(keyword: str, max_results: int = 10) -> list[str]:
    """Search for products.

    Args:
        keyword: The search keyword.
        max_results: Maximum number of results to return.
    """
    ...
```
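
To demystify the "signature becomes schema" step, here is a deliberately simplified stdlib sketch of the idea. It is not how PydanticAI generates schemas (that goes through Pydantic and handles far more types and docstring parsing); `tool_schema` and `JSON_TYPES` are hypothetical names for this illustration.

```python
import inspect
from typing import get_type_hints

# Simplified mapping; real JSON Schema generation covers far more types.
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def tool_schema(func) -> dict:
    """Build a JSON-Schema-like description from a function's signature."""
    hints = get_type_hints(func)
    properties, required = {}, []
    for name, param in inspect.signature(func).parameters.items():
        properties[name] = {"type": JSON_TYPES.get(hints.get(name), "object")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default means the model must supply it
        else:
            properties[name]["default"] = param.default
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {"type": "object", "properties": properties, "required": required},
    }


def search_products(keyword: str, max_results: int = 10) -> list:
    """Search the product catalog."""
    return []


schema = tool_schema(search_products)
print(schema["parameters"]["required"])  # ['keyword']
```

Because the schema is derived from the signature, renaming a parameter or changing its type updates what the model sees without a separate schema file to keep in sync.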

### **`ModelRetry`: asking the LLM for a "redo" from inside a tool**

When a tool decides that its input is invalid, it doesn't have to crash with an ordinary exception. It can **raise `ModelRetry` to prompt the LLM to correct itself.**

```python
from pydantic_ai import Agent, ModelRetry

agent = Agent("anthropic:claude-sonnet-4-6")


@agent.tool_plain
def lookup_user(user_id: str) -> str:
    if not user_id.startswith("usr_"):
        # Don't crash; tell the LLM to call again with the correct format
        raise ModelRetry("user_id must start with 'usr_'.")
    return f"Info for user {user_id}"
```

**Why is this superior?**
A tool is deterministic code; the LLM is fuzzy judgment. Keeping the two apart makes for a robust design. **Confine anything that needs certainty** (inventory lookups, payments, DB writes) **to tools (ordinary Python),** and leave the LLM only the decision of when to call them and with which arguments. `ModelRetry` is the mechanism for **repairing, within the conversation,** any mismatch that arises at that boundary. PydanticAI provides, as a framework feature, the "separation of deterministic code and probabilistic judgment" discussed in [the tool-use design of AI agents](/blog/ai-agent-tool-use-function-calling-production-design).
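
The conversational repair loop can be sketched without any LLM at all. In this stdlib-only illustration, `RetryRequest` stands in for a retry signal, `scripted_model` is a hypothetical fake model that fixes its argument once it sees feedback, and `call_tool_with_retries` plays the role of the framework. None of these names come from PydanticAI.

```python
class RetryRequest(Exception):
    """Stand-in for a retry signal: the message is meant for the model, not the user."""


def lookup_user(user_id: str) -> str:
    if not user_id.startswith("usr_"):
        raise RetryRequest("user_id must start with 'usr_'.")
    return f"Info for user {user_id}"


def scripted_model(feedback):
    """Hypothetical fake model: corrects its argument once it receives feedback."""
    return "42" if feedback is None else "usr_42"


def call_tool_with_retries(model, tool, max_retries: int = 1):
    feedback = None
    for attempt in range(max_retries + 1):
        argument = model(feedback)
        try:
            return tool(argument), attempt
        except RetryRequest as exc:
            feedback = str(exc)  # fed back into the next model turn
    raise RuntimeError(f"tool still failing after {max_retries} retries: {feedback}")


result, retries_used = call_tool_with_retries(scripted_model, lookup_user)
print(result, retries_used)  # Info for user usr_42 1
```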

---

## **4. Dependency injection: inject side effects and make them testable**

If a tool touches a DB or an external API, that dependency must not be hardcoded. PydanticAI offers **dependency injection (DI) via `deps_type`.** If you know FastAPI's `Depends`, the idea is the same.

```python
from dataclasses import dataclass
import httpx
from pydantic_ai import Agent, RunContext


@dataclass
class Deps:
    """External dependencies the agent needs; bundling them in a dataclass is the usual pattern."""
    api_key: str
    http_client: httpx.AsyncClient


agent = Agent("anthropic:claude-sonnet-4-6", deps_type=Deps)


@agent.tool
async def fetch_weather(ctx: RunContext[Deps], city: str) -> str:
    """Get the weather for the given city."""
    resp = await ctx.deps.http_client.get(
        "https://api.example.com/weather",
        params={"city": city, "key": ctx.deps.api_key},
    )
    return resp.text


async def main(client: httpx.AsyncClient) -> None:
    deps = Deps(api_key="...", http_client=client)
    result = await agent.run("What's the weather in Tokyo?", deps=deps)
    print(result.output)
```

Declare the **type** with `deps_type=Deps`, and pass the **instance** at runtime with `agent.run(..., deps=deps)`. Inside the tool, you can access it type-safely with `ctx.deps`.

**Why is this superior?**
The real value of DI is **testability.** In production you inject the real `httpx.AsyncClient`; in tests, a mock. With `agent.override(deps=...)` you can swap the dependency for the duration of a test only. You can verify a tool's logic without calling the LLM, or run the whole flow against the deterministic test model (`"test"`), which opens the way to **testing AI-containing code without the AI.** This is exactly the "build the verification path first" principle from CLAUDE.md.
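
As a rough illustration of that testability claim, the sketch below exercises tool logic against a fake client with no agent and no network. It uses only the standard library; `HttpClient`, its `get_text` method, and `FakeClient` are hypothetical names invented for this example (a real `httpx.AsyncClient` exposes `get`, which returns a response object).

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol


class HttpClient(Protocol):
    async def get_text(self, url: str, params: dict) -> str: ...


@dataclass
class Deps:
    api_key: str
    http_client: HttpClient


async def fetch_weather(deps: Deps, city: str) -> str:
    """Tool logic written against injected deps only (no globals)."""
    return await deps.http_client.get_text(
        "https://api.example.com/weather",
        {"city": city, "key": deps.api_key},
    )


class FakeClient:
    """Test double that records calls and returns a canned response."""

    def __init__(self) -> None:
        self.calls = []

    async def get_text(self, url: str, params: dict) -> str:
        self.calls.append((url, params))
        return "sunny"


fake = FakeClient()
weather = asyncio.run(fetch_weather(Deps(api_key="test-key", http_client=fake), "Tokyo"))
print(weather, fake.calls[0][1]["city"])  # sunny Tokyo
```

The same `Deps` shape works for production and tests; only the object placed in `http_client` changes.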

---

## **5. Output validators and self-repair: build verification into the conversation loop**

Some checks are **semantic** and go beyond what Pydantic validation of `output_type` can cover: "Is the SQL the LLM generated actually executable?" or "Is the proposed date a business day?" Do these checks in `@agent.output_validator`, and on failure raise `ModelRetry` to **make the LLM try again.**

```python
from pydantic import BaseModel
from pydantic_ai import Agent, ModelRetry, RunContext


class SqlQuery(BaseModel):
    sql_query: str


agent = Agent(
    "anthropic:claude-sonnet-4-6",
    output_type=SqlQuery,
    deps_type=DatabaseConn,  # e.g. a DB connection (DatabaseConn and QueryError come from your own DB layer)
)


@agent.output_validator
async def validate_sql(ctx: RunContext[DatabaseConn], output: SqlQuery) -> SqlQuery:
    try:
        # Use EXPLAIN to check only that the query is executable (it is not run)
        await ctx.deps.execute(f"EXPLAIN {output.sql_query}")
    except QueryError as e:
        # Return the failure, error message included, to the LLM so it generates a fix
        raise ModelRetry(f"Invalid query: {e}") from e
    return output
```

This loop is powerful. Pydantic's **syntactic validation** (types, required fields, constraints) and the `output_validator`'s **semantic validation** (is it correct for the business?) run as two stages, and a failure in either is fed back into the conversation as a retry. You can declaratively assemble a self-repairing agent that **"automatically retries until it produces correct output."**

> ⚠️ **Cap the retries**: limit the retry count with `Agent(..., retries=2)`. Unbounded retries make cost and latency unbounded too. Frequent retries are a sign of **a design flaw in the prompt or schema,** so prioritize fixing the root cause, for example by conveying each field's intent to the LLM with `Field(description=...)` (detailed in the [LLM-structured-output guide built with Pydantic](/blog/pydantic-llm-structured-output-json-schema-validation-guide)).
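
The two-stage check with a hard retry cap can be sketched in plain Python. This is a toy model under stated assumptions, not the framework's loop: `RetryRequest`, `check_shape`, `check_meaning`, and `run_with_retries` are hypothetical names, the "business rule" is a stand-in, and the model is a scripted iterator.

```python
import json


class RetryRequest(Exception):
    """Hypothetical stand-in for a retry signal carrying feedback for the model."""


def check_shape(raw: str) -> dict:
    """Stage 1 (syntactic): must be a JSON object with a string 'sql_query'."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise RetryRequest(f"invalid JSON: {exc.msg}") from exc
    if not isinstance(data, dict) or not isinstance(data.get("sql_query"), str):
        raise RetryRequest("expected an object with a string 'sql_query'")
    return data


def check_meaning(data: dict) -> dict:
    """Stage 2 (semantic): a toy business rule, read-only queries only."""
    if not data["sql_query"].lstrip().upper().startswith("SELECT"):
        raise RetryRequest("only SELECT statements are allowed")
    return data


def run_with_retries(model, prompt: str, retries: int = 2) -> dict:
    feedback = None
    for _ in range(retries + 1):  # one initial attempt plus `retries` redos
        raw = model(prompt, feedback)
        try:
            return check_meaning(check_shape(raw))
        except RetryRequest as exc:
            feedback = str(exc)
    raise RuntimeError(f"gave up after {retries} retries: {feedback}")


# Scripted fake model: broken JSON, then a forbidden statement, then a valid query.
replies = iter(['{"sql_query": ', '{"sql_query": "DROP TABLE users"}',
                '{"sql_query": "SELECT 1"}'])
output = run_with_retries(lambda prompt, feedback: next(replies), "count users")
print(output)  # {'sql_query': 'SELECT 1'}
```

With `retries=2` the third attempt is the last one allowed; a model that never converges ends in a `RuntimeError` rather than an unbounded bill.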

---

## **6. Streaming: flow partial results while validating**

In a chat UI, waiting until everything is finished before responding is too slow. `run_stream` can **stream partial results while validating the structured output.**

```python
async def stream_profile(agent: Agent, user_input: str) -> None:
    async with agent.run_stream(user_input) as result:
        # Validated partial objects arrive one after another
        async for profile in result.stream_output():
            print(profile)
            # {'name': 'Ben'} → {'name': 'Ben', 'dob': date(1990, 1, 28)} → ...
```

To stream text only, use `result.stream_text()`; to receive the raw `ModelResponse` at a throttled rate, use `result.stream_response(debounce_by=0.01)`.

> ⚠️ **Beware of side effects on partial results**: during streaming, incomplete objects are also passed to the `output_validator`. **Side effects such as DB writes should run only on the final version.** Check the `ctx.partial_output` flag and skip validation and side effects for partial results. Neglect this and you risk polluting an external system with unconfirmed data.
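
The guard itself is a one-line check. The sketch below simulates it with the standard library only: `Context` is a hypothetical stand-in for a run context that exposes a `partial_output` flag, and the `saved` list plays the role of a database.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """Hypothetical stand-in for a run context exposing a partial-output flag."""
    partial_output: bool
    saved: list = field(default_factory=list)  # pretend database


def validate_profile(ctx: Context, profile: dict) -> dict:
    if ctx.partial_output:
        # Still streaming: fields may be missing, so do nothing irreversible.
        return profile
    if "name" not in profile or "dob" not in profile:
        raise ValueError("final profile is missing required fields")
    ctx.saved.append(profile)  # side effect happens once, on the final object
    return profile


db = []
snapshots = [{"name": "Ben"}, {"name": "Ben", "dob": "1990-01-28"}]
for index, snapshot in enumerate(snapshots):
    is_last = index == len(snapshots) - 1
    validate_profile(Context(partial_output=not is_last, saved=db), snapshot)

print(db)  # [{'name': 'Ben', 'dob': '1990-01-28'}]
```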

---

## **7. Observability: see "AI's behavior" at a glance with Logfire**

The biggest hurdle in an LLM app is **debugging.** "Why was this tool called?" "Which retry fixed it?" "Where am I wasting tokens?" None of these can be traced with print debugging. PydanticAI integrates with **Logfire,** from the same Pydantic team, and instruments all of this behavior on top of **OpenTelemetry.**

```python
import logfire
from pydantic_ai import Agent

logfire.configure()
logfire.instrument_pydantic_ai()  # This alone instruments every agent

agent = Agent("anthropic:claude-sonnet-4-6")
result = agent.run_sync("...")
print(result.usage)  # RunUsage(input_tokens=62, output_tokens=1, requests=1)
```

With just two lines (`configure` + `instrument_pydantic_ai`), **every agent run, tool call, retry, and token count** shows up as a trace. `result.usage` gives you the token and request counts directly, which is the basis for cost monitoring. To see the raw HTTP requests as well, add `logfire.instrument_httpx(capture_all=True)`.

**Why is this superior?**
PydanticAI's instrumentation follows **OpenTelemetry's GenAI semantic conventions,** so it can be sent to OTel backends other than Logfire (Grafana, Datadog, etc.). That means AI execution fits naturally into the "correlate the three pillars" design I discussed in the [OpenTelemetry observability guide](/blog/opentelemetry-observability-production-tracing-metrics-logs). **"Trace a stalled process at a glance":** when running an AI agent in production, this kind of observability is not a feature but a precondition. Being able to trace which agent run or tool call is consuming time and tokens is the foundation for faster debugging and early detection of cost anomalies.
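
As a small illustration of cost monitoring built on usage numbers, the sketch below sums per-run usage records and estimates spend. Everything here is an assumption for demonstration: `Usage` is a hypothetical record with the same three fields shown earlier, and the prices are placeholders, not any provider's real rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Hypothetical per-run usage record (same three fields as shown above)."""
    input_tokens: int
    output_tokens: int
    requests: int


# Placeholder prices in USD per million tokens; substitute your provider's rates.
INPUT_PRICE_PER_MTOK = 3.0
OUTPUT_PRICE_PER_MTOK = 15.0


def total_usage(runs) -> Usage:
    return Usage(
        input_tokens=sum(r.input_tokens for r in runs),
        output_tokens=sum(r.output_tokens for r in runs),
        requests=sum(r.requests for r in runs),
    )


def estimated_cost_usd(usage: Usage) -> float:
    return (
        usage.input_tokens * INPUT_PRICE_PER_MTOK
        + usage.output_tokens * OUTPUT_PRICE_PER_MTOK
    ) / 1_000_000


runs = [Usage(62, 1, 1), Usage(1_200, 300, 3)]
total = total_usage(runs)
print(total.requests, round(estimated_cost_usd(total), 6))  # 4 0.008301
```

A run whose request count is much higher than its peers is often the first visible symptom of a retry loop, which is why requests are tracked alongside tokens.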

---

## **8. Production resilience: "continue from where it stopped" with durable execution**

An agent hits external APIs many times, chains tools, and sometimes waits for human approval (human-in-the-loop). Such **long-running workflows** will inevitably fail partway through: API rate limits, network drops, process restarts. **Starting over from the beginning each time is unacceptable in both cost and UX.**

PydanticAI provides **durable execution** through integrations with Temporal / DBOS / Prefect / Restate. Just wrap an existing `Agent` in a dedicated class and **progress is persisted, so a run can resume where it left off after a failure.**

```python
from pydantic_ai import Agent
from pydantic_ai.durable_exec.temporal import TemporalAgent

# name is required (it identifies the workflow/activities)
agent = Agent(
    "anthropic:claude-sonnet-4-6",
    instructions="...",
    name="geography",
)

temporal_agent = TemporalAgent(agent)  # Runs inside a Temporal workflow
```

DBOS checkpoints the state to a database.

```python
from pydantic_ai.durable_exec.dbos import DBOSAgent

dbos_agent = DBOSAgent(agent)
result = await dbos_agent.run("What is the capital of Mexico?")
```

| Backend | Characteristics | Best suited for |
| --- | --- | --- |
| Temporal | workflow engine with powerful retries and timers | complex long-running orchestration |
| DBOS | DB-checkpoint approach; lightweight | keeping state in an existing database |
| Prefect / Restate | data-pipeline / durable-RPC oriented | teams already built on that platform |

> ⚠️ **Always set `name=`**: an agent wrapped for durable execution requires `name=` (it becomes the workflow identifier). `TemporalAgent` also has backend-specific constraints, such as having to be defined at module top level. Always consult the target backend's documentation before adopting it.

**Why does this work?**
In the internal AI platform I built for a broadcaster ([program-production support](/case-studies/broadcaster-ai-content-platform)), I ensured resilience by splitting long AI jobs out into **Cloud Workflows / Cloud Run Jobs.** PydanticAI's durable execution addresses the same requirement, running a long job in a form that survives crashes, **at the agent layer.** Think of it as the AI-agent version of offloading work too heavy for FastAPI's `BackgroundTasks` to a job platform (see the [FastAPI production-operation guide](/blog/fastapi-production-async-pydantic-observability-guide)).
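
The core idea of "resume where it stopped" can be shown with a toy file-based checkpoint. This is only a conceptual sketch, not how Temporal or DBOS work internally: `run_durable`, the step functions, and the JSON checkpoint file are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def run_durable(steps, checkpoint: Path) -> dict:
    """Run named steps in order, persisting each result so a rerun skips finished work."""
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    for name, step in steps:
        if name in done:
            continue  # completed before the crash; reuse the stored result
        done[name] = step(done)
        checkpoint.write_text(json.dumps(done))
    return done


calls = []
fail_once = {"pending": True}


def plan(state):
    calls.append("plan")
    return "find the capital of Mexico"


def call_api(state):
    calls.append("call_api")
    if fail_once.pop("pending", False):
        raise ConnectionError("simulated network drop")
    return "Mexico City"


steps = [("plan", plan), ("call_api", call_api)]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "checkpoint.json"
    try:
        run_durable(steps, path)       # first run crashes in the second step
    except ConnectionError:
        pass
    result = run_durable(steps, path)  # second run resumes after "plan"

print(result["call_api"], calls)  # Mexico City ['plan', 'call_api', 'call_api']
```

Note that `plan` runs only once even though the workflow was started twice; in an agent, that skipped step would be an LLM call you do not pay for again.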

---

## **Conclusion: change AI into "a part that ships in production"**

PydanticAI is a framework that lifts an LLM app from a "smart guess" to "a production system that is type-safe, observable, and recovers from crashes." To recap the key points of this article:

1. Build a minimal agent with **`Agent` + `instructions`,** and **validate the output as a typed object** with `output_type=BaseModel` (choosing among `ToolOutput` / `NativeOutput` / `PromptedOutput` as appropriate).
2. Define tools with **`@agent.tool` / `@agent.tool_plain`,** and let the schema be **generated automatically** from type annotations and docstrings. Repair mismatches with `ModelRetry`.
3. Inject side-effecting dependencies with **`deps_type`** to keep them **testable** (swap in mocks with `override`).
4. Build semantic validation into the conversation loop with **`@agent.output_validator` + `ModelRetry`,** and cap the retries.
5. Stream while validating with **`run_stream` + `stream_output`** (guard side effects with `partial_output`).
6. Plug into OpenTelemetry with **Logfire (`instrument_pydantic_ai`)** and see all behavior and cost at a glance.
7. Make long-running workflows fault-tolerant with **durable execution (Temporal / DBOS, etc.).**

At the root of PydanticAI is the same discipline as Pydantic itself: **"validate data coming from outside (LLM output included) at the boundary before passing it inward."** That consistency is what bridges AI to production-grade reliability.

For official primary sources, I recommend re-reading the following with this article's perspective in mind.

- [PydanticAI Overview](https://pydantic.dev/docs/ai/overview/)
- [Agents](https://pydantic.dev/docs/ai/core-concepts/agent/)
- [Output (structured output, streaming)](https://pydantic.dev/docs/ai/core-concepts/output/)
- [Tools](https://pydantic.dev/docs/ai/tools-toolsets/tools/)
- [Dependencies](https://pydantic.dev/docs/ai/core-concepts/dependencies/)
- [Durable Execution](https://pydantic.dev/docs/ai/integrations/durable_execution/overview/)
- [Logfire integration](https://pydantic.dev/docs/ai/integrations/logfire/)

---

### **Consultation on type-safe AI-agent development**

The author has designed and operated backends that embed generative AI at **production quality,** including an internal AI platform for a major domestic broadcaster. Validating LLM output with types, isolating tools as deterministic code, tracking behavior with observability, and building crash-resistant workflows with durable execution: I implement this design, quickly and at high quality with the help of generative AI, so that you are not merely "running AI" but "**putting AI on the same reliability footing as the business.**" Feel free to get in touch about building AI agents, RAG, or structured-extraction pipelines with PydanticAI / FastAPI.
