"If you're just hitting an LLM, fetch is enough" — that's what I thought at first. Indeed, if you're just making a demo, calling each provider's SDK directly works. But the moment you want to switch providers, the moment you wire a streaming UI to the front, the moment you make it an agent with tool calls, the moment you want to trust the LLM's output as a type — at any one of these, a hand-written thin wrapper falls apart.
So far I've designed and implemented generative-AI production systems in the Python / AWS family, like a LangChain + Pinecone RAG platform, a Bedrock × pgvector voice-AI sales agent, an AI video-localization platform, and a broadcaster's internal multi-AI platform. This article rewrites that experience from the angle of the TypeScript / Next.js / Vercel ecosystem. I leave deep RAG theory and voice pipelines to the above two articles and concentrate here on "how to assemble a production LLM app with the Vercel AI SDK."
The reference version of this article: Vercel AI SDK v6 (the
aipackage). Model connection is in principle via Vercel AI Gateway (a"provider/model"string). All code assumes TypeScript / App Router and is posted at a granularity you can actually move your hands with. The official docs are linked throughout at ai-sdk.dev and Vercel AI Gateway.
0. The big picture: how to combine the AI SDK's "three layers"
Using the Vercel AI SDK in production effectively means designing these three layers.
| Layer | Role | Central API | This article's section |
|---|---|---|---|
| Connection layer | Unified connection to providers/models, fallback, billing observation | AI Gateway ("provider/model" string) | §1 |
| Core layer | Text generation, structured output, tool calling, embeddings | generateText / streamText / Output / tool / embed | §2–§4, §6 |
| UI layer | A streaming chat UI, rendering tool states | useChat (@ai-sdk/react) | §5 |
From bottom to top, "connect with the Gateway, make it think with Core, show it with the UI." We proceed in this flow. At the end, §7 and §8 summarize the cross-cutting concerns of production operation (cost, reliability, security, observability) and the pitfalls.
1. Why a unified SDK + AI Gateway (vs. calling providers directly)
The first design decision is here. Use a provider-specific SDK (@anthropic-ai/sdk, etc.) directly, or use the AI SDK + AI Gateway.
To say the conclusion first, for a new production app I make AI Gateway the default. The reasons are as follows (consistent with the official features).
- Hundreds of models with one key. With a single
AI_GATEWAY_API_KEY, you can cut across OpenAI / Anthropic / Google and others. - Provider-independent. You can swap
modelwith a one-line string. You can structurally avoid vendor lock-in. - Automatic fallback. If a provider goes down, it auto-retries to another provider (reliability §7).
- Spend observability. You can observe tokens and cost across providers.
- No markup on tokens. Same cost as a direct provider contract (official states this clearly).
// .env.local — Vercel本番ではOIDCで自動認証されるため不要。ローカル/他基盤ではこれを使う
// AI_GATEWAY_API_KEY=your_api_key_here
// lib/ai/models.ts — モデルIDを1箇所に集約(DRY / 差し替え容易性)
// AI Gateway は AI SDK の「既定のグローバルプロバイダ」なので、文字列だけで繋がる
export const MODELS = {
// 重い推論・最終回答
reasoning: "anthropic/claude-opus-4.8",
// 通常のチャット・要約・抽出(速度とコストのバランス)
chat: "anthropic/claude-sonnet-4.6",
// 分類・ルーティングなど軽量タスク(最安・最速)
fast: "anthropic/claude-haiku-4.5",
// RAG用の埋め込み
embedding: "openai/text-embedding-3-small",
} as const;
Don't write model IDs from memory. Confirming the latest IDs with
curl -s https://ai-gateway.vercel.sh/v1/models | jq -r '.data[].id'is the official practice (Gateway docs). The example above is matched to the latest Claude as of 2026-06.
So when do you call a provider directly? Only when you want to use a provider-specific preview feature (a specific beta API, or a latest flag not yet supported by the Gateway) the same day. In that case, use a dedicated provider package like @ai-sdk/anthropic and write model: anthropic("claude-..."). But this is exception operation. Decide "the default is the Gateway string," and your design won't waver.
2. generateText and streamText: when to stream
There are only two entrances to text generation. generateText (returns the finished form in one go) and streamText (streams tokens incrementally). Both are imported from the ai package (official: Generating Text).
2-1. The decision axis: choosing among the three APIs
What you agonize over in practice is "generateText or streamText, or structured output." Let me first pin down the big picture in a table.
| API | What it returns | When to use | Perceived speed |
|---|---|---|---|
generateText | Finished text/structured result in one go | Batch processing, server-internal intermediate processing, short responses, structured extraction | Wait until done |
streamText | An incremental token stream | A long response the user reads on screen, chat | The first character is fast |
generateText + Output | A typed object validated by a Zod schema | When you use the LLM's output as a type in subsequent code (§3) | Wait until done |
The principle: long responses humans read, stream; results a machine uses next, in one go. The perceived silence affects only humans.
2-2. One-shot generation in a Server Action (generateText)
Processing that completes inside the server — summarize and save to the DB, classify and branch — is straightforward with generateText. You can place it directly in a Next.js Server Action.
// app/actions/summarize.ts
"use server";
import { generateText } from "ai";
import { z } from "zod";
import { MODELS } from "@/lib/ai/models";
// 境界での入力検証:外部入力は常にZodで絞る(CLAUDE.md準拠)
const InputSchema = z.object({ text: z.string().min(1).max(20_000) });
export async function summarize(raw: unknown): Promise<string> {
const { text } = InputSchema.parse(raw);
const result = await generateText({
model: MODELS.chat,
system: "あなたは編集者です。日本語で3文以内に要約してください。",
prompt: text,
maxOutputTokens: 512, // v6では maxTokens ではなく maxOutputTokens
temperature: 0.3,
});
return result.text;
}
Be careful about the v6 naming changes.
maxTokenschanged tomaxOutputTokens, and themaxStepsdescribed later changed tostopWhen: stepCountIs(n)(Common Errors). Copy-pasting from old articles is the most accident-prone point.
2-3. Streaming in a Route Handler (streamText)
A response the user reads on screen, return incrementally with streamText. In an App Router Route Handler, you can convert the result directly into the response.
// app/api/generate/route.ts
import { streamText } from "ai";
import { z } from "zod";
import { MODELS } from "@/lib/ai/models";
export const maxDuration = 30; // Vercel Functionsのタイムアウト上限(秒)
const BodySchema = z.object({ prompt: z.string().min(1).max(4_000) });
export async function POST(req: Request) {
// 外部入力は必ず検証してから渡す
const parsed = BodySchema.safeParse(await req.json());
if (!parsed.success) {
return Response.json({ error: "invalid input" }, { status: 400 });
}
const result = streamText({
model: MODELS.chat,
prompt: parsed.data.prompt,
// 失敗を握り潰さない:監視基盤へ送る
onError: ({ error }) => console.error("streamText error", error),
});
// チャットUI(useChat)で受けるなら toUIMessageStreamResponse、
// 素のテキストストリームでよいなら toTextStreamResponse
return result.toTextStreamResponse();
}
result.textStream is a ReadableStream and an AsyncIterable, so you can also loop it inside the server with for await (const chunk of result.textStream). Use fullStream and you can observe all events including tool calls and step boundaries.
3. Structured output: "parse, don't validate" at the LLM boundary
This is the feature that makes a decisive difference from a hand-written wrapper. JSON.parse-ing the LLM's output and casting it as MyType is a source of accidents in production. The LLM breaks JSON, and adds or removes fields.
In v6, pass Output to generateText and get a validated, typed object with a Zod schema (the old generateObject was deprecated and integrated into the Output API — Generating Structured Data). The schema becomes the single source of truth (SSoT), unifying type, validation, and prompt instruction.
3-1. Output.object: extracting a single object
// app/actions/extract-invoice.ts
"use server";
import { generateText, Output } from "ai";
import { z } from "zod";
import { MODELS } from "@/lib/ai/models";
// このスキーマが「型」「実行時検証」「LLMへの指示」すべての源
const Invoice = z.object({
vendor: z.string().describe("請求元の会社名"),
total: z.number().describe("税込合計金額(円・整数)"),
dueDate: z.string().describe("支払期日 ISO 8601 (YYYY-MM-DD)"),
lineItems: z.array(
z.object({ name: z.string(), amount: z.number() }),
),
});
export async function extractInvoice(ocrText: string) {
const result = await generateText({
model: MODELS.chat,
output: Output.object({ schema: Invoice }),
prompt: `次のOCRテキストから請求情報を抽出してください:\n${ocrText}`,
});
// result.output は z.infer<typeof Invoice> として型付け&検証済み
const invoice = result.output;
return invoice; // ここから先は安全に型として扱える
}
Attach .describe() to the schema, and that description works as the field instruction to the LLM. The essence is that the need to write "return as JSON" in the prompt disappears.
3-2. Output.array and Output.choice
Array extraction and classification into fixed options each have their own dedicated output mode.
import { generateText, Output } from "ai";
import { z } from "zod";
// 配列:要素スキーマを element に渡す
const tasks = await generateText({
model: MODELS.fast,
output: Output.array({
element: z.object({ title: z.string(), priority: z.enum(["high", "low"]) }),
}),
prompt: "次の議事録からToDoを抽出: ...",
});
// tasks.output: { title: string; priority: "high" | "low" }[]
// 分類:選択肢を固定(ハルシネーションで未知ラベルが出ない)
const sentiment = await generateText({
model: MODELS.fast,
output: Output.choice({ options: ["positive", "negative", "neutral"] as const }),
prompt: "この声を分類: 「最高の製品でした」",
});
// sentiment.output: "positive" | "negative" | "neutral"
Output.choice is especially powerful in classification tasks. Because the model can't return outside the options, the subsequent branching logic doesn't fall apart. This is the thinking of "eliminating hallucination by structure," the same idea by which I crushed model-number errors in the voice-AI article.
If you want to draw structured results incrementally in streaming, use
streamText+outputwithpartialOutputStream. It suits UX like generating a form preview while filling it.
4. Tool calling and agents: when to give the model tools
What gives the LLM access to the external world (DB search, API calls, computation) is tool calling. In v6, you define it with the tool() helper. The argument schema is inputSchema (renamed from parameters in v6), and here too Zod is the SSoT (Tools and Tool Calling).
4-1. Defining tool()
// lib/ai/tools/get-order.ts
import { tool } from "ai";
import { z } from "zod";
import { findOrder } from "@/lib/db/orders";
export const getOrder = tool({
description: "注文IDから注文状況を取得する。ユーザーが配送状況を尋ねた時に使う。",
// inputSchema が引数の型・検証・モデルへの説明を兼ねる
inputSchema: z.object({
orderId: z.string().describe("注文ID(例: ORD-12345)"),
}),
execute: async ({ orderId }) => {
// execute の中は普通のTypeScript。ここでDB/APIに触る
const order = await findOrder(orderId);
if (!order) return { found: false as const };
return { found: true as const, status: order.status, eta: order.eta };
},
});
4-2. Multi-step agents (stopWhen)
Pass tools and specify stopWhen: stepCountIs(n), and the model autonomously loops "call a tool → read the result → think further" (in v6, use this instead of the old maxSteps).
// app/api/agent/route.ts
import { streamText, stepCountIs, convertToModelMessages, type UIMessage } from "ai";
import { getOrder } from "@/lib/ai/tools/get-order";
import { MODELS } from "@/lib/ai/models";
export async function POST(req: Request) {
const { messages }: { messages: UIMessage[] } = await req.json();
const result = streamText({
model: MODELS.chat,
system: "あなたはカスタマーサポートです。注文照会にはツールを使うこと。",
messages: convertToModelMessages(messages),
tools: { getOrder },
stopWhen: stepCountIs(5), // 暴走防止のステップ上限は必須
onStepFinish: ({ toolCalls, toolResults }) => {
// 可観測性:どのツールをどう呼んだかを必ずログ
console.log("step", { toolCalls, toolResults });
},
});
return result.toUIMessageStreamResponse();
}
If you want to structure an agent, v6 has the ToolLoopAgent class, and you can propagate the type all the way to the UI side with InferAgentUIMessage<typeof agent> (type-safe agents). The recommended structure is to split tools into lib/tools/ and agents into lib/agents/.
4-3. The decision axis: passing tools vs. deterministic code
This is where the designer shows their skill. Make everything a tool, and it becomes slow, expensive, and unstable.
| Passing tools (let the LLM judge) | Deterministic code (write it yourself) | |
|---|---|---|
| Suited case | Input is natural language and which operation is needed isn't decided in advance | Both input and processing flow are settled |
| Speed / cost | An LLM round-trip per step (slow, expensive) | Zero to one LLM call (fast, cheap) |
| Reliability | Wavers depending on the model | Deterministic, easy to test |
| Example | "Summarize last month's delayed orders" | A simple lookup where the order ID is known |
The principle: if the branching can be written statically, write it. Entrust the LLM with only the boundary part of "converting natural language into structured intent," and process beyond that with deterministic code — this wins on all of speed, cost, and testability. A tool like getOrder is convenient, but for "a screen that always only does order lookups," it's healthier to not pass tools and instead extract orderId with §3's structured output and directly call findOrder.
5. useChat: building a streaming UI all the way to a11y
The front-end chat UI is assembled with useChat (@ai-sdk/react). There are two things that changed greatly in v6, so you'll definitely get stuck with old knowledge (Chatbot).
- Input state is now self-managed (
input/handleInputChange/handleSubmitwere removed). - The API specification is now via
transport: new DefaultChatTransport({ api }). - Messages are not
contentbut apartsarray (a mix of text, tool calls, etc.).
// app/chat/chat.tsx
"use client";
import { useChat } from "@ai-sdk/react";
import { DefaultChatTransport } from "ai";
import { useState } from "react";
export function Chat() {
const [input, setInput] = useState("");
const { messages, sendMessage, status, stop, regenerate } = useChat({
transport: new DefaultChatTransport({ api: "/api/agent" }),
});
const isBusy = status === "submitted" || status === "streaming";
function onSubmit(e: React.FormEvent) {
e.preventDefault();
if (!input.trim() || isBusy) return;
sendMessage({ text: input });
setInput("");
}
return (
<div>
{/* a11y: 生成テキストはライブリージョンで読み上げる */}
<div aria-live="polite" aria-atomic="false" role="log">
{messages.map((m) => (
<article key={m.id} aria-label={m.role === "user" ? "あなた" : "アシスタント"}>
{m.parts.map((part, i) => {
if (part.type === "text") return <p key={i}>{part.text}</p>;
// 型付きツールパート: tool-getOrder / states で安全に分岐
if (part.type === "tool-getOrder") {
if (part.state === "output-available") {
return <p key={i}>注文状況: {String(part.output.found)}</p>;
}
return <p key={i}>注文を照会中…</p>;
}
return null;
})}
</article>
))}
</div>
<form onSubmit={onSubmit}>
<label htmlFor="msg" className="sr-only">メッセージ</label>
<input
id="msg"
value={input}
onChange={(e) => setInput(e.target.value)}
disabled={isBusy}
aria-disabled={isBusy}
/>
{isBusy ? (
// 停止ボタン:長い生成を中断できることはUXとコストの両面で重要
<button type="button" onClick={stop} aria-label="生成を停止">停止</button>
) : (
<button type="submit" aria-label="送信">送信</button>
)}
<button type="button" onClick={() => regenerate()} disabled={isBusy}>
再生成
</button>
</form>
</div>
);
}
Let me narrow a11y to three points.
- Live region: wrap with
aria-live="polite"and the streamed body is read aloud incrementally to a screen reader.assertiveinterrupts harshly, so usepolitefor a conversation log. - Stop button: emit
stop()whilestatusisstreaming. You can stop a runaway during read-aloud, and also stop wasted token billing. - State visualization: make loading and errors explicit with
status(submitted/streaming/ready/error). Don't swallowerror.
Tool parts arrive with a typed name tool-{toolName}, and you can draw them stepwise with state (input-streaming / input-available / output-available). Because part.input / part.output exist only in the corresponding state, accessing them without a state check is rejected by TS — this is a behavior to welcome as a safety device.
6. RAG essentials: embed / embedMany, and the deep dive goes to other articles
The essence of RAG (retrieval-augmented generation) is "vector-search the documents relevant to the question and inject them into the prompt." The AI SDK provides embedding APIs (embed / embedMany / cosineSimilarity, all from ai) (Embeddings).
// lib/ai/rag.ts
import { embed, embedMany, cosineSimilarity, generateText } from "ai";
import { MODELS } from "@/lib/ai/models";
// インデックス時:文書を一括で埋め込む(embedMany はバッチで効率的)
export async function indexDocuments(chunks: string[]) {
const { embeddings } = await embedMany({
model: MODELS.embedding,
values: chunks,
});
// embeddings[i] を chunks[i] と一緒にベクトルDBへ保存(pgvector / Pinecone 等)
return chunks.map((text, i) => ({ text, embedding: embeddings[i] }));
}
// 検索時:質問を埋め込み、類似文書を取る
export async function answer(question: string, store: { text: string; embedding: number[] }[]) {
const { embedding } = await embed({ model: MODELS.embedding, value: question });
// 本番ではベクトルDB側でANN検索する。ここは概念を示すためのインメモリ例
const top = store
.map((d) => ({ ...d, score: cosineSimilarity(embedding, d.embedding) }))
.sort((a, b) => b.score - a.score)
.slice(0, 4);
const context = top.map((d) => d.text).join("\n---\n");
const { text } = await generateText({
model: MODELS.chat,
system: "提供されたコンテキストのみを根拠に回答。無い情報は『分かりません』と答える。",
prompt: `コンテキスト:\n${context}\n\n質問: ${question}`,
});
return text;
}
But production RAG doesn't end with these 10 lines. Chunk-splitting strategy, metadata filters, reranking, hallucination countermeasures, quantitative accuracy evaluation, cost estimation — this is where the real battle is. I implemented this thoroughly in Python / LangChain / Pinecone. The deep dive is written in Building a production RAG system with LangChain + Pinecone and RAG design for voice AI with pgvector. Regard this article's TypeScript implementation as a port of that design philosophy to the Vercel ecosystem.
An honest take on choosing the language: if you have heavy preprocessing, evaluation pipelines, or existing ML assets, Python is still advantageous. On the other hand, for a thin RAG tightly coupled with a web app (like co-locating internal document search in a Next.js app), making it end-to-end with TypeScript + AI SDK is simpler to operate. This is a decision axis after doing both.
7. Production-operation pressure points: cost, reliability, security, observability
Up to here was "features." Let me summarize the cross-cutting design for withstanding production, in the order it paid off.
7-1. Cost: model routing and token budget
The biggest cost reduction is "not processing everything with the top-tier model."
- Model routing: classification, extraction, and routing with
haiku, normal responses withsonnet, only hard problems withopus. §1'sMODELSconstant is the foundation of this branching. Just by not using an excessive model for the task, cost changes by an order of magnitude. - Token budget: always set an upper limit with
maxOutputTokens. Unbounded output is an accident. - Prompt caching: for a long fixed system/context, push the variable part to the tail so the provider's prompt caching takes effect.
- Stream the output and stop early: provide
stop()in the UI (§5). Don't keep generating long text that isn't read.
7-2. Reliability: timeout, retry, fallback
- Timeout: set
maxDurationon the Route Handler (§2-3). For the client, anAbortController. - Fallback: AI Gateway auto-retries to another provider on a failure (official). This is one of the biggest practical benefits of choosing "unified SDK + Gateway." With direct calls, you'd write all this fallback yourself.
- Idempotency: if you update the DB with a generation result, absorb duplicates with an idempotency key. LLM calls fail and duplicate normally over the network. This design philosophy is detailed in the voice-AI article.
7-3. Security: keys, prompt injection
- Don't expose API keys to the client. Always make LLM calls from the server (Route Handler / Server Action).
AI_GATEWAY_API_KEYis a server-only environment variable; don't attachNEXT_PUBLIC_. A configuration that hits the provider directly from the browser is key leakage itself. - Validate and sanitize user input. As shown in §2 and §3, always pass external input through Zod at the boundary. Don't forget length limits either (token-bombing countermeasure).
- Design on the premise of prompt injection. Don't trust user input or external documents injected via RAG. Clearly delimit them from the system prompt, and place guards in tools like "make destructive operations require confirmation" inside
execute. Passing the LLM's output directly toeval, SQL, or shell is strictly forbidden. Narrow tools' execution permissions with the principle of least privilege.
7-4. Observability and eval: what you can't measure, you can't fix
- Always log: the model used, input/output tokens (
result.usage), latency,finishReason, and which tool was called.onFinish/onStepFinishare the hooks. You can also observe spend across the board on the AI Gateway dashboard. - Eval (evaluation): before going to production, have tests with fixed expected outputs against a representative input set. Structured output (§3) makes schema validation itself a minimal eval. Don't ship on "looks good by eye" — this is the most often-omitted and most accident-prone step in generative AI.
- Testability: a
tool'sexecutecan be unit-tested as a pure TypeScript function. Mock the LLM call itself, and test the tool's logic and the structured-output schema separately is the realistic strategy.
8. Common pitfalls
Let me line up the landmines I step (stepped) on in real projects, paired with countermeasures.
| Pitfall | What happens | Countermeasure |
|---|---|---|
Trusting LLM output with as MyType | Types collapse and crash in production | parse with Output.object + Zod (§3) |
Copy-pasting maxTokens / maxSteps | Doesn't work in v6 | maxOutputTokens / stopWhen: stepCountIs(n) (§2, §4) |
| Exposing the API key to the client | Key leakage, billing explosion | LLM calls server-only (§7-3) |
| No timeout / fallback | The whole app stops on a provider outage | maxDuration + the Gateway's automatic fallback (§7-2) |
| Ignoring prompt injection | An injected document overwrites the system | Don't trust input and delimit it, narrow tool permissions (§7-3) |
| Entrusting everything to tools | Slow, expensive, unstable | Deterministic code if you can branch statically (§4-3) |
| Going to production without eval | You can't notice "degraded before you knew it" | Tests with fixed expected outputs + schema validation (§7-4) |
The old useChat's input/handleSubmit | Doesn't exist in v6, build fails | Self-manage input state + DefaultChatTransport (§5) |
| Processing everything with the top-tier model | Cost swells by an order of magnitude | Per-task model routing (§7-1) |
Summary: the AI SDK is the foundation for assembling LLM apps "fast, cheap, and safe"
Using the Vercel AI SDK v6 in production effectively means connecting with the Gateway, making it think with Core, showing it with the UI, and protecting it with cross-cutting operational design. The key points in five lines.
- Connection is AI Gateway + a
"provider/model"string. Provider-independence, automatic fallback, and spend observation take effect by default. Keys server-only. - Responses humans read,
streamText; results a machine uses next,generateText. In v6,maxOutputTokens/stopWhen. - Parse structured output with
Output.object+ Zod. The schema is the SSoT. Anascast is forbidden. - Tools are
tool({ inputSchema, execute }), agents arestopWhen: stepCountIs(n). But lean statically-writable branching toward deterministic code — the optimal answer for speed, cost, and reliability. - Weave cost (model routing), reliability (timeout/fallback), security (keys/injection), and eval into the design from the start.
I've built multiple generative-AI production systems in Python / AWS (a RAG platform, voice-AI sales, AI video localization, a broadcaster's internal multi-AI platform). This article is a port of the design that bridges that "distance between demo and production" to the TypeScript / Vercel ecosystem.
"With one person × generative AI (Claude Code), fast, cheap, and safe" — if you want to integrate LLM features into an existing product, for that consultation, see the related case the generative-AI voice chatbot and reach out from contact. I provide the design together with the decision axis of not using excessive technology for the requirement.