友田 陽大
Generative AI, LLMs & RAG
TypeScript
RAG
Next.js
Vercel
AI
Claude
Architecture design

Building Production LLM Apps with Vercel AI SDK v6: Streaming, Tool Calling, Structured Output, and RAG in Real Code

A practical guide to building production-quality LLM apps in TypeScript. Centered on Vercel AI SDK v6 and AI Gateway, explained with working code and decision axes: generateText/streamText, structured output with Zod schemas, tool calling and agents, the useChat streaming UI, RAG with embed/embedMany, and cost, reliability, security, and observability.

Published
Reading time
19 min read
Author
友田 陽大

"If you're just hitting an LLM, fetch is enough" — that's what I thought at first. Indeed, if you're just making a demo, calling each provider's SDK directly works. But the moment you want to switch providers, the moment you wire a streaming UI to the front, the moment you make it an agent with tool calls, the moment you want to trust the LLM's output as a type — at any one of these, a hand-written thin wrapper falls apart.

So far I've designed and implemented generative-AI production systems in the Python / AWS family: a LangChain + Pinecone RAG platform, a Bedrock × pgvector voice-AI sales agent, an AI video-localization platform, and a broadcaster's internal multi-AI platform. This article revisits that experience from the angle of the TypeScript / Next.js / Vercel ecosystem. I leave deep RAG theory and voice pipelines to my separate articles on the RAG platform and the voice-AI agent, and concentrate here on "how to assemble a production LLM app with the Vercel AI SDK."

Reference version for this article: Vercel AI SDK v6 (the ai package). Models are connected, as a rule, through Vercel AI Gateway (a "provider/model" string). All code assumes TypeScript and the App Router, and is written at a granularity you can type in and run yourself. The official docs at ai-sdk.dev and Vercel AI Gateway are linked throughout.

0. The big picture: how to combine the AI SDK's "three layers"

Using the Vercel AI SDK in production effectively means designing these three layers.

| Layer | Role | Central API | This article's section |
| --- | --- | --- | --- |
| Connection layer | Unified connection to providers/models, fallback, billing observation | AI Gateway ("provider/model" string) | §1 |
| Core layer | Text generation, structured output, tool calling, embeddings | generateText / streamText / Output / tool / embed | §2–§4, §6 |
| UI layer | A streaming chat UI, rendering tool states | useChat (@ai-sdk/react) | §5 |

From bottom to top, "connect with the Gateway, make it think with Core, show it with the UI." We proceed in this flow. At the end, §7 and §8 summarize the cross-cutting concerns of production operation (cost, reliability, security, observability) and the pitfalls.


1. Why a unified SDK + AI Gateway (vs. calling providers directly)

The first design decision is here. Use a provider-specific SDK (@anthropic-ai/sdk, etc.) directly, or use the AI SDK + AI Gateway.

To state the conclusion first: for a new production app I make AI Gateway the default. The reasons are as follows (consistent with the officially documented features).

  • Hundreds of models with one key. With a single AI_GATEWAY_API_KEY, you can cut across OpenAI / Anthropic / Google and others.
  • Provider-independent. You can swap model with a one-line string. You can structurally avoid vendor lock-in.
  • Automatic fallback. If a provider goes down, it auto-retries to another provider (reliability §7).
  • Spend observability. You can observe tokens and cost across providers.
  • No markup on tokens. Same cost as a direct provider contract (official states this clearly).
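// .env.local: not needed in production on Vercel, where OIDC authenticates automatically. Use this locally or on other platforms.
// AI_GATEWAY_API_KEY=your_api_key_here

// lib/ai/models.ts: keep model IDs in one place (DRY, easy to swap)
// AI Gateway is the AI SDK's default global provider, so a plain string is enough to connect
export const MODELS = {
  // Heavy reasoning, final answers
  reasoning: "anthropic/claude-opus-4.8",
  // Everyday chat, summarization, extraction (balance of speed and cost)
  chat: "anthropic/claude-sonnet-4.6",
  // Lightweight tasks such as classification and routing (cheapest, fastest)
  fast: "anthropic/claude-haiku-4.5",
  // Embeddings for RAG
  embedding: "openai/text-embedding-3-small",
} as const;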

Don't write model IDs from memory. Confirming the latest IDs with curl -s https://ai-gateway.vercel.sh/v1/models | jq -r '.data[].id' is the official practice (Gateway docs). The example above is matched to the latest Claude as of 2026-06.

So when do you call a provider directly? Only when you want to use a provider-specific preview feature (a specific beta API, or a new flag not yet supported by the Gateway) on the day it ships. In that case, use a dedicated provider package like @ai-sdk/anthropic and write model: anthropic("claude-..."). But this is the exception. Decide that "the default is the Gateway string," and your design won't waver.


2. generateText and streamText: when to stream

There are only two entry points to text generation: generateText (returns the finished result in one go) and streamText (streams tokens incrementally). Both are imported from the ai package (official: Generating Text).

2-1. The decision axis: choosing among the three APIs

What you agonize over in practice is "generateText or streamText, or structured output." Let me first pin down the big picture in a table.

| API | What it returns | When to use | Perceived speed |
| --- | --- | --- | --- |
| generateText | Finished text/structured result in one go | Batch processing, server-internal intermediate processing, short responses, structured extraction | Wait until done |
| streamText | An incremental token stream | A long response the user reads on screen, chat | The first character is fast |
| generateText + Output | A typed object validated by a Zod schema | When you use the LLM's output as a type in subsequent code (§3) | Wait until done |

The principle: stream long responses that humans read; return results that a machine uses next in one go. Perceived latency matters only to humans.

2-2. One-shot generation in a Server Action (generateText)

Processing that completes inside the server — summarize and save to the DB, classify and branch — is straightforward with generateText. You can place it directly in a Next.js Server Action.

// app/actions/summarize.ts
"use server";

import { generateText } from "ai";
import { z } from "zod";
import { MODELS } from "@/lib/ai/models";

// Input validation at the boundary: always narrow external input with Zod (per CLAUDE.md)
const InputSchema = z.object({ text: z.string().min(1).max(20_000) });

export async function summarize(raw: unknown): Promise<string> {
  const { text } = InputSchema.parse(raw);

  const result = await generateText({
    model: MODELS.chat,
    system: "You are an editor. Summarize in Japanese in three sentences or fewer.",
    prompt: text,
    maxOutputTokens: 512, // v6 uses maxOutputTokens, not maxTokens
    temperature: 0.3,
  });

  return result.text;
}

Be careful about the naming changes that v6 inherits: maxTokens became maxOutputTokens, and the maxSteps option described later became stopWhen: stepCountIs(n) (Common Errors). Both renames date from the v5 migration and carry into v6, so copy-pasting from older articles is the most accident-prone point.

2-3. Streaming in a Route Handler (streamText)

A response the user reads on screen, return incrementally with streamText. In an App Router Route Handler, you can convert the result directly into the response.

// app/api/generate/route.ts
import { streamText } from "ai";
import { z } from "zod";
import { MODELS } from "@/lib/ai/models";

export const maxDuration = 30; // Vercel Functions timeout limit (seconds)

const BodySchema = z.object({ prompt: z.string().min(1).max(4_000) });

export async function POST(req: Request) {
  // Always validate external input before passing it on
  const parsed = BodySchema.safeParse(await req.json());
  if (!parsed.success) {
    return Response.json({ error: "invalid input" }, { status: 400 });
  }

  const result = streamText({
    model: MODELS.chat,
    prompt: parsed.data.prompt,
    // Don't swallow failures: send them to your monitoring platform
    onError: ({ error }) => console.error("streamText error", error),
  });

  // Use toUIMessageStreamResponse when a chat UI (useChat) consumes the stream,
  // or toTextStreamResponse when a plain text stream is enough
  return result.toTextStreamResponse();
}

result.textStream is a ReadableStream and an AsyncIterable, so you can also loop it inside the server with for await (const chunk of result.textStream). Use fullStream and you can observe all events including tool calls and step boundaries.


3. Structured output: "parse, don't validate" at the LLM boundary

This is the feature that makes a decisive difference from a hand-written wrapper. JSON.parse-ing the LLM's output and casting it as MyType is a source of accidents in production. The LLM breaks JSON, and adds or removes fields.

In v6, pass Output to generateText and get a validated, typed object with a Zod schema (the old generateObject was deprecated and integrated into the Output API — Generating Structured Data). The schema becomes the single source of truth (SSoT), unifying type, validation, and prompt instruction.
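
To make the failure mode concrete, here is a dependency-free sketch of the difference between casting and parsing. The simplified Invoice shape and the castInvoice / parseInvoice helpers are hypothetical and hand-rolled for illustration; in a real app the Zod schema passed to Output.object does this work, and this is not how the SDK implements it.

```typescript
// Hypothetical, simplified shape for illustration only.
type Invoice = { vendor: string; total: number };

// Unsafe: the cast tells the compiler to trust data nobody checked.
function castInvoice(json: string): Invoice {
  return JSON.parse(json) as Invoice;
}

// Safer: parse at the boundary and fail loudly on a bad shape.
function parseInvoice(json: string): Invoice {
  const data: unknown = JSON.parse(json);
  if (typeof data !== "object" || data === null) {
    throw new Error("invoice must be an object");
  }
  const record = data as Record<string, unknown>;
  const { vendor, total } = record;
  if (typeof vendor !== "string") throw new Error("vendor must be a string");
  if (typeof total !== "number" || !Number.isFinite(total)) {
    throw new Error("total must be a finite number");
  }
  return { vendor, total };
}

// A model that "helpfully" returns the amount as a string.
const modelOutput = '{"vendor":"ACME","total":"1200"}';

const unchecked = castInvoice(modelOutput);
// Typed as number, but it is a string at runtime: the bug surfaces later.
console.log(typeof unchecked.total); // "string"
```

With the cast, the wrong type travels silently into later code; with parseInvoice, the same input throws at the boundary, which is the behavior you want from the LLM edge of the system.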

3-1. Output.object: extracting a single object

// app/actions/extract-invoice.ts
"use server";

import { generateText, Output } from "ai";
import { z } from "zod";
import { MODELS } from "@/lib/ai/models";

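// This schema is the single source for the type, runtime validation, and instructions to the LLM
const Invoice = z.object({
  vendor: z.string().describe("Name of the billing company"),
  total: z.number().describe("Total amount including tax (yen, integer)"),
  dueDate: z.string().describe("Payment due date, ISO 8601 (YYYY-MM-DD)"),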
  lineItems: z.array(
    z.object({ name: z.string(), amount: z.number() }),
  ),
});

export async function extractInvoice(ocrText: string) {
  const result = await generateText({
    model: MODELS.chat,
    output: Output.object({ schema: Invoice }),
    prompt: `Extract the invoice information from the following OCR text:\n${ocrText}`,
  });

  // result.output is typed as z.infer<typeof Invoice> and already validated
  const invoice = result.output;
  return invoice; // from here on it is safe to treat as a typed value
}

Attach .describe() to the schema, and that description works as the field instruction to the LLM. The essence is that the need to write "return as JSON" in the prompt disappears.

3-2. Output.array and Output.choice

Array extraction and classification into fixed options each have their own dedicated output mode.

import { generateText, Output } from "ai";
import { z } from "zod";
import { MODELS } from "@/lib/ai/models";

// Array: pass the element schema as `element`
const tasks = await generateText({
  model: MODELS.fast,
  output: Output.array({
    element: z.object({ title: z.string(), priority: z.enum(["high", "low"]) }),
  }),
  prompt: "Extract the to-dos from the following meeting minutes: ...",
});
// tasks.output: { title: string; priority: "high" | "low" }[]

// Classification: fix the options (hallucination can't produce an unknown label)
const sentiment = await generateText({
  model: MODELS.fast,
  output: Output.choice({ options: ["positive", "negative", "neutral"] as const }),
  prompt: 'Classify this feedback: "It was a fantastic product"',
});
// sentiment.output: "positive" | "negative" | "neutral"

Output.choice is especially powerful in classification tasks. Because the model can't return a value outside the options, the downstream branching logic doesn't fall apart. This is the idea of "eliminating hallucination by structure," the same idea I used to stamp out model-number errors in the voice-AI article.
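
A closed label set also lets the compiler check the downstream branching. The sketch below has no SDK dependency; the Route names and the routeBySentiment helper are hypothetical, and the label list is simply the one used in the classification example.

```typescript
// The same closed label set you would hand to Output.choice.
const SENTIMENTS = ["positive", "negative", "neutral"] as const;
type Sentiment = (typeof SENTIMENTS)[number];

// Hypothetical routing targets, for illustration only.
type Route = "thank-you-flow" | "escalate-to-human" | "no-action";

function routeBySentiment(label: Sentiment): Route {
  switch (label) {
    case "positive":
      return "thank-you-flow";
    case "negative":
      return "escalate-to-human";
    case "neutral":
      return "no-action";
    default: {
      // If a new label is added to SENTIMENTS, this assignment stops compiling.
      const unreachable: never = label;
      throw new Error(`unhandled label: ${String(unreachable)}`);
    }
  }
}
```

Because the union and the switch are checked together, adding a fourth label without handling it becomes a compile error instead of a silent fallthrough in production.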

If you want to draw structured results incrementally in streaming, use streamText + output with partialOutputStream. It suits UX like generating a form preview while filling it.


4. Tool calling and agents: when to give the model tools

What gives the LLM access to the external world (DB search, API calls, computation) is tool calling. In v6, you define it with the tool() helper. The argument schema is inputSchema (renamed from the older parameters; the rename dates from v5 and carries into v6), and here too Zod is the SSoT (Tools and Tool Calling).

4-1. Defining tool()

// lib/ai/tools/get-order.ts
import { tool } from "ai";
import { z } from "zod";
import { findOrder } from "@/lib/db/orders";

export const getOrder = tool({
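  description: "Get the order status from an order ID. Use when the user asks about delivery status.",
  // inputSchema doubles as the argument type, the validation, and the description for the model
  inputSchema: z.object({
    orderId: z.string().describe("Order ID (e.g. ORD-12345)"),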
  }),
  execute: async ({ orderId }) => {
    // Inside execute it's ordinary TypeScript. Touch the DB or APIs here
    const order = await findOrder(orderId);
    if (!order) return { found: false as const };
    return { found: true as const, status: order.status, eta: order.eta };
  },
});

4-2. Multi-step agents (stopWhen)

Pass tools and specify stopWhen: stepCountIs(n), and the model autonomously loops "call a tool → read the result → think further" (in v6, use this instead of the old maxSteps).

// app/api/agent/route.ts
import { streamText, stepCountIs, convertToModelMessages, type UIMessage } from "ai";
import { getOrder } from "@/lib/ai/tools/get-order";
import { MODELS } from "@/lib/ai/models";

export async function POST(req: Request) {
  const { messages }: { messages: UIMessage[] } = await req.json();

  const result = streamText({
    model: MODELS.chat,
    system: "You are customer support. Use the tool for order lookups.",
    messages: await convertToModelMessages(messages),
    tools: { getOrder },
    stopWhen: stepCountIs(5), // a step cap is mandatory to prevent runaway loops
    onStepFinish: ({ toolCalls, toolResults }) => {
      // Observability: always log which tool was called and how
      console.log("step", { toolCalls, toolResults });
    },
  });

  return result.toUIMessageStreamResponse();
}

If you want to structure an agent, v6 has the ToolLoopAgent class, and you can propagate the type all the way to the UI side with InferAgentUIMessage<typeof agent> (type-safe agents). The recommended structure is to split tools into lib/tools/ and agents into lib/agents/.

4-3. The decision axis: passing tools vs. deterministic code

This is where the designer shows their skill. Make everything a tool, and it becomes slow, expensive, and unstable.

| | Passing tools (let the LLM judge) | Deterministic code (write it yourself) |
| --- | --- | --- |
| Suited case | Input is natural language and which operation is needed isn't decided in advance | Both input and processing flow are settled |
| Speed / cost | An LLM round-trip per step (slow, expensive) | Zero to one LLM call (fast, cheap) |
| Reliability | Wavers depending on the model | Deterministic, easy to test |
| Example | "Summarize last month's delayed orders" | A simple lookup where the order ID is known |

The principle: if the branching can be written statically, write it. Entrust the LLM with only the boundary part of "converting natural language into structured intent," and process beyond that with deterministic code — this wins on all of speed, cost, and testability. A tool like getOrder is convenient, but for "a screen that always only does order lookups," it's healthier to not pass tools and instead extract orderId with §3's structured output and directly call findOrder.
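
As a rough sketch of "deterministic first," the helper below tries a plain pattern match before involving a model at all. The planRequest function, the Plan type, and the ORD-digits ID format (inferred from the ORD-12345 example) are assumptions for illustration, not part of the AI SDK.

```typescript
// Assumed ID format, inferred from the ORD-12345 example in the tool definition.
const ORDER_ID_PATTERN = /\bORD-\d+\b/;

type Plan =
  | { kind: "direct-lookup"; orderId: string } // no LLM call needed
  | { kind: "needs-llm"; text: string }; // fall back to structured output or tools

function planRequest(text: string): Plan {
  const orderId = ORDER_ID_PATTERN.exec(text)?.[0];
  if (orderId !== undefined) {
    return { kind: "direct-lookup", orderId };
  }
  return { kind: "needs-llm", text };
}

console.log(planRequest("Where is ORD-12345?").kind); // "direct-lookup"
console.log(planRequest("Summarize last month's delayed orders").kind); // "needs-llm"
```

Only the second kind of request pays for an LLM round-trip; the first goes straight to findOrder, and both branches are trivially unit-testable.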


5. useChat: building a streaming UI all the way to a11y

The front-end chat UI is assembled with useChat (@ai-sdk/react). Three things differ greatly from older versions, so you'll definitely get stuck if you rely on old knowledge (Chatbot).

  1. Input state is now self-managed (input / handleInputChange / handleSubmit were removed).
  2. The API specification is now via transport: new DefaultChatTransport({ api }).
  3. Messages are not content but a parts array (a mix of text, tool calls, etc.).
// app/chat/chat.tsx
"use client";

import { useChat } from "@ai-sdk/react";
import { DefaultChatTransport } from "ai";
import { useState } from "react";

export function Chat() {
  const [input, setInput] = useState("");
  const { messages, sendMessage, status, stop, regenerate } = useChat({
    transport: new DefaultChatTransport({ api: "/api/agent" }),
  });

  const isBusy = status === "submitted" || status === "streaming";

  function onSubmit(e: React.FormEvent) {
    e.preventDefault();
    if (!input.trim() || isBusy) return;
    sendMessage({ text: input });
    setInput("");
  }

  return (
    <div>
      {/* a11y: announce generated text through a live region */}
      <div aria-live="polite" aria-atomic="false" role="log">
        {messages.map((m) => (
          <article key={m.id} aria-label={m.role === "user" ? "You" : "Assistant"}>
            {m.parts.map((part, i) => {
              if (part.type === "text") return <p key={i}>{part.text}</p>;
              // Typed tool part: branch safely on tool-getOrder and its state
              if (part.type === "tool-getOrder") {
                if (part.state === "output-available") {
                  return <p key={i}>Order status: {String(part.output.found)}</p>;
                }
                return <p key={i}>Looking up the order...</p>;
              }
              return null;
            })}
          </article>
        ))}
      </div>

      <form onSubmit={onSubmit}>
        <label htmlFor="msg" className="sr-only">Message</label>
        <input
          id="msg"
          value={input}
          onChange={(e) => setInput(e.target.value)}
          disabled={isBusy}
          aria-disabled={isBusy}
        />
        {isBusy ? (
          // Stop button: being able to interrupt a long generation matters for both UX and cost
          <button type="button" onClick={stop} aria-label="Stop generating">Stop</button>
        ) : (
          <button type="submit" aria-label="Send">Send</button>
        )}
        <button type="button" onClick={() => regenerate()} disabled={isBusy}>
          Regenerate
        </button>
      </form>
    </div>
  );
}

Let me narrow a11y to three points.

  • Live region: wrap with aria-live="polite" and the streamed body is read aloud incrementally to a screen reader. assertive interrupts harshly, so use polite for a conversation log.
  • Stop button: call stop() while status is streaming. This lets the user cut off a runaway read-aloud and also stops wasted token billing.
  • State visualization: make loading and errors explicit with status (submitted / streaming / ready / error). Don't swallow error.

Tool parts arrive with a typed name tool-{toolName}, and you can draw them stepwise with state (input-streaming / input-available / output-available). Because part.input / part.output exist only in the corresponding state, accessing them without a state check is rejected by TS — this is a behavior to welcome as a safety device.
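
The state-gated access described here is ordinary discriminated-union narrowing. The types below are a simplified, hand-written stand-in for the typed tool part (the real SDK type has more fields and states), and describePart is a hypothetical helper; the point is only to show why reading output before checking state does not compile.

```typescript
// Simplified stand-in for the SDK's typed tool part; not the actual type definition.
type OrderOutput = { found: true; status: string } | { found: false };

type GetOrderPart =
  | { type: "tool-getOrder"; state: "input-streaming" }
  | { type: "tool-getOrder"; state: "input-available"; input: { orderId: string } }
  | {
      type: "tool-getOrder";
      state: "output-available";
      input: { orderId: string };
      output: OrderOutput;
    };

function describePart(part: GetOrderPart): string {
  switch (part.state) {
    case "input-streaming":
      // part.input and part.output are not accessible here: the compiler rejects them.
      return "Preparing lookup...";
    case "input-available":
      return `Looking up ${part.input.orderId}...`;
    case "output-available": {
      const out = part.output;
      return out.found ? `Status: ${out.status}` : "Order not found";
    }
  }
}
```

A render function in the chat component can follow the same switch, returning a different element per state instead of a string.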


6. RAG with embed / embedMany

The essence of RAG (retrieval-augmented generation) is "vector-search the documents relevant to the question and inject them into the prompt." The AI SDK provides embedding APIs (embed / embedMany / cosineSimilarity, all from ai) (Embeddings).

// lib/ai/rag.ts
import { embed, embedMany, cosineSimilarity, generateText } from "ai";
import { MODELS } from "@/lib/ai/models";

// At index time: embed documents in bulk (embedMany batches efficiently)
export async function indexDocuments(chunks: string[]) {
  const { embeddings } = await embedMany({
    model: MODELS.embedding,
    values: chunks,
  });
  // Store embeddings[i] together with chunks[i] in a vector DB (pgvector, Pinecone, etc.)
  return chunks.map((text, i) => ({ text, embedding: embeddings[i] }));
}

// At query time: embed the question and fetch similar documents
export async function answer(question: string, store: { text: string; embedding: number[] }[]) {
  const { embedding } = await embed({ model: MODELS.embedding, value: question });

  // In production, run ANN search inside the vector DB. This in-memory version only illustrates the concept
  const top = store
    .map((d) => ({ ...d, score: cosineSimilarity(embedding, d.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 4);

  const context = top.map((d) => d.text).join("\n---\n");
  const { text } = await generateText({
    model: MODELS.chat,
    system: "Answer using only the provided context as evidence. If the information is missing, reply 'I don't know'.",
    prompt: `Context:\n${context}\n\nQuestion: ${question}`,
  });
  return text;
}
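
For intuition about the ranking step in answer(), here is a hand-written version of the cosine score and the top-k selection. The cosine and topK helpers are illustrative stand-ins written for this sketch, not the SDK's implementation; returning 0 for a zero-length vector is a convention chosen here.

```typescript
// Cosine similarity: dot product divided by the product of the vector lengths.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0; // convention chosen for this sketch
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank documents by similarity to the query and keep the best k.
function topK<T extends { embedding: number[] }>(query: number[], docs: T[], k: number): T[] {
  return docs
    .map((doc) => ({ doc, score: cosine(query, doc.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((entry) => entry.doc);
}

console.log(cosine([1, 0], [1, 0])); // 1
console.log(cosine([1, 0], [0, 1])); // 0
```

The score depends only on direction, not magnitude, which is why it works for comparing a short question against a longer chunk.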

But production RAG doesn't end with this short snippet. Chunk-splitting strategy, metadata filters, reranking, hallucination countermeasures, quantitative accuracy evaluation, cost estimation: this is where the real work is. I implemented all of this in Python / LangChain / Pinecone. The deep dives are in Building a production RAG system with LangChain + Pinecone and RAG design for voice AI with pgvector. Regard this article's TypeScript implementation as a port of that design philosophy to the Vercel ecosystem.
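
As one small example of a chunk-splitting strategy, the sketch below cuts text into fixed-size windows with overlap, so a sentence that straddles a boundary still appears whole in at least one chunk. It is deliberately naive: chunkText is a hypothetical helper, it counts characters rather than tokens, and it ignores sentence and heading boundaries that a production splitter would respect.

```typescript
// Fixed-size sliding window over characters, with `overlap` characters shared
// between neighbouring chunks.
function chunkText(text: string, size: number, overlap: number): string[] {
  if (size <= 0) throw new Error("size must be positive");
  if (overlap < 0 || overlap >= size) throw new Error("overlap must be in [0, size)");

  const chunks: string[] = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // this window already reached the end
  }
  return chunks;
}

console.log(chunkText("abcdefghij", 4, 1)); // ["abcd", "defg", "ghij"]
```

The resulting array is what you would pass to indexDocuments; tuning size and overlap against your evaluation set is part of the real work mentioned above.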

An honest take on choosing the language: if you have heavy preprocessing, evaluation pipelines, or existing ML assets, Python is still advantageous. On the other hand, for a thin RAG tightly coupled with a web app (like co-locating internal document search in a Next.js app), making it end-to-end with TypeScript + AI SDK is simpler to operate. This is a decision axis I arrived at after doing both.


7. Production-operation pressure points: cost, reliability, security, observability

Up to here was "features." Let me summarize the cross-cutting design for withstanding production, in the order it paid off.

7-1. Cost: model routing and token budget

The biggest cost reduction is "not processing everything with the top-tier model."

  • Model routing: classification, extraction, and routing with haiku, normal responses with sonnet, only hard problems with opus. §1's MODELS constant is the foundation of this branching. Just by not using an excessive model for the task, cost changes by an order of magnitude.
  • Token budget: always set an upper limit with maxOutputTokens. Unbounded output is an accident.
  • Prompt caching: for a long fixed system/context, push the variable part to the tail so the provider's prompt caching takes effect.
  • Stream the output and stop early: provide stop() in the UI (§5). Don't keep generating long text that isn't read.
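
The routing and budget bullets can be collapsed into one small lookup. In this sketch the Task categories, the planCall and buildPrompt helpers, and the token budgets are all hypothetical placeholders; only the model IDs are taken from the MODELS constant in §1.

```typescript
// Model IDs mirror the MODELS constant from §1.
const MODELS = {
  reasoning: "anthropic/claude-opus-4.8",
  chat: "anthropic/claude-sonnet-4.6",
  fast: "anthropic/claude-haiku-4.5",
} as const;

// Hypothetical task categories for illustration.
type Task = "classify" | "extract" | "route" | "chat" | "summarize" | "hard-reasoning";

type CallPlan = { model: string; maxOutputTokens: number };

// Pick the cheapest model that fits the task and always attach an output cap.
// The budgets are placeholder numbers, not recommendations.
function planCall(task: Task): CallPlan {
  switch (task) {
    case "classify":
    case "extract":
    case "route":
      return { model: MODELS.fast, maxOutputTokens: 256 };
    case "chat":
    case "summarize":
      return { model: MODELS.chat, maxOutputTokens: 1024 };
    case "hard-reasoning":
      return { model: MODELS.reasoning, maxOutputTokens: 4096 };
  }
}

// Prompt-caching-friendly layout: stable text first, variable text last.
function buildPrompt(fixedContext: string, userQuestion: string): string {
  return `${fixedContext}\n\nQuestion: ${userQuestion}`;
}
```

The returned plan can be spread into a generateText or streamText call, so every call site gets a model and a cap from one place.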

7-2. Reliability: timeout, retry, fallback

  • Timeout: set maxDuration on the Route Handler (§2-3). For the client, an AbortController.
  • Fallback: AI Gateway auto-retries to another provider on a failure (official). This is one of the biggest practical benefits of choosing "unified SDK + Gateway." With direct calls, you'd write all this fallback yourself.
  • Idempotency: if you update the DB with a generation result, absorb duplicates with an idempotency key. LLM calls fail and duplicate normally over the network. This design philosophy is detailed in the voice-AI article.
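
A minimal sketch of the idempotency-key idea, using an in-memory Map as a stand-in for a database table with a unique constraint on the key. IdempotentStore and the key format are hypothetical; a real implementation would be asynchronous and would rely on the database to enforce uniqueness.

```typescript
// In-memory stand-in for a DB table with a unique constraint on the key.
class IdempotentStore<T> {
  private readonly results = new Map<string, T>();

  // Runs `write` at most once per key; a retry with the same key
  // returns the stored result instead of writing again.
  runOnce(key: string, write: () => T): { value: T; replayed: boolean } {
    if (this.results.has(key)) {
      return { value: this.results.get(key) as T, replayed: true };
    }
    const value = write();
    this.results.set(key, value);
    return { value, replayed: false };
  }
}

const store = new IdempotentStore<string>();
let writes = 0;
const save = () => {
  writes += 1;
  return "summary saved";
};

// The key is derived from the request, not from the (non-deterministic) LLM output.
store.runOnce("summary:doc-42:req-1", save);
store.runOnce("summary:doc-42:req-1", save); // network retry of the same request
console.log(writes); // 1
```

Deriving the key from the request identity matters because two LLM calls for the same request can return different text, so the output itself cannot be used to detect duplicates.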

7-3. Security: keys, prompt injection

  • Don't expose API keys to the client. Always make LLM calls from the server (Route Handler / Server Action). AI_GATEWAY_API_KEY is a server-only environment variable; don't attach NEXT_PUBLIC_. A configuration that hits the provider directly from the browser is key leakage itself.
  • Validate and sanitize user input. As shown in §2 and §3, always pass external input through Zod at the boundary. Don't forget length limits either (token-bombing countermeasure).
  • Design on the premise of prompt injection. Don't trust user input or external documents injected via RAG. Clearly delimit them from the system prompt, and place guards in tools like "make destructive operations require confirmation" inside execute. Passing the LLM's output directly to eval, SQL, or shell is strictly forbidden. Narrow tools' execution permissions with the principle of least privilege.
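
Two of these guards fit in a few lines. The sketch below is illustrative only: wrapUntrusted, the untrusted tag name, guardedExecute, and the request shape are all hypothetical. Delimiting reduces the risk of injected text being read as instructions but does not eliminate it, which is why the confirmation guard sits in code rather than in the prompt.

```typescript
// Delimit untrusted text (user input, RAG documents) before placing it in a prompt.
// Escaping "<" means the payload cannot close the delimiter from the inside.
function wrapUntrusted(text: string): string {
  const escaped = text.replace(/</g, "&lt;");
  return `<untrusted>\n${escaped}\n</untrusted>`;
}

type ToolRequest = { action: "read" | "delete"; orderId: string; confirmed: boolean };
type ToolResult = { ok: true } | { ok: false; reason: "confirmation-required" };

// Guard to call at the top of a tool's execute: destructive actions need an
// explicit confirmation flag that only your own UI can set.
function guardedExecute(req: ToolRequest, run: (req: ToolRequest) => void): ToolResult {
  if (req.action === "delete" && !req.confirmed) {
    return { ok: false, reason: "confirmation-required" };
  }
  run(req);
  return { ok: true };
}
```

The model can ask for a delete as often as it likes; without the confirmation flag the side effect never runs, whatever the prompt said.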

7-4. Observability and eval: what you can't measure, you can't fix

  • Always log: the model used, input/output tokens (result.usage), latency, finishReason, and which tool was called. onFinish / onStepFinish are the hooks. You can also observe spend across the board on the AI Gateway dashboard.
  • Eval (evaluation): before going to production, have tests with fixed expected outputs against a representative input set. Structured output (§3) makes schema validation itself a minimal eval. Don't ship on "looks good by eye" — this is the most often-omitted and most accident-prone step in generative AI.
  • Testability: a tool's execute can be unit-tested as a plain TypeScript function. The realistic strategy is to mock the LLM call itself and test the tool's logic and the structured-output schema separately.
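
A fixed-expectation eval does not need a framework to get started. In the sketch below, runEval, EvalCase, and the stub classifier are hypothetical; the stub stands in for the mocked LLM call, and a real harness would be asynchronous and call your actual classification function.

```typescript
type EvalCase = { input: string; expected: string };
type Classifier = (input: string) => string;

// Run every case and report which ones disagree with the expected label.
function runEval(classify: Classifier, cases: EvalCase[]): { passed: number; failed: EvalCase[] } {
  const failed = cases.filter((c) => classify(c.input) !== c.expected);
  return { passed: cases.length - failed.length, failed };
}

// Stub standing in for the mocked LLM call.
const stubClassifier: Classifier = (input) => (input.includes("best") ? "positive" : "neutral");

const report = runEval(stubClassifier, [
  { input: "This was the best product", expected: "positive" },
  { input: "It arrived on Tuesday", expected: "neutral" },
  { input: "It broke in a day", expected: "negative" },
]);

console.log(report.passed, report.failed.length); // 2 1
```

Keeping the case list in version control and running it whenever a prompt or model ID changes is what turns "looks good by eye" into a regression signal.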

8. Common pitfalls

Here are the landmines I have stepped on (and still step on) in real projects, paired with countermeasures.

| Pitfall | What happens | Countermeasure |
| --- | --- | --- |
| Trusting LLM output with as MyType | Types collapse and crash in production | parse with Output.object + Zod (§3) |
| Copy-pasting maxTokens / maxSteps | Doesn't work in v6 | maxOutputTokens / stopWhen: stepCountIs(n) (§2, §4) |
| Exposing the API key to the client | Key leakage, billing explosion | LLM calls server-only (§7-3) |
| No timeout / fallback | The whole app stops on a provider outage | maxDuration + the Gateway's automatic fallback (§7-2) |
| Ignoring prompt injection | An injected document overwrites the system | Don't trust input and delimit it, narrow tool permissions (§7-3) |
| Entrusting everything to tools | Slow, expensive, unstable | Deterministic code if you can branch statically (§4-3) |
| Going to production without eval | You can't notice "degraded before you knew it" | Tests with fixed expected outputs + schema validation (§7-4) |
| The old useChat's input/handleSubmit | Doesn't exist in v6, build fails | Self-manage input state + DefaultChatTransport (§5) |
| Processing everything with the top-tier model | Cost swells by an order of magnitude | Per-task model routing (§7-1) |

Summary: the AI SDK is the foundation for assembling LLM apps "fast, cheap, and safe"

Using the Vercel AI SDK v6 in production effectively means connecting with the Gateway, making it think with Core, showing it with the UI, and protecting it with cross-cutting operational design. The key points in five lines.

  1. Connection is AI Gateway + a "provider/model" string. Provider-independence, automatic fallback, and spend observation take effect by default. Keys server-only.
  2. Responses humans read, streamText; results a machine uses next, generateText. In v6, maxOutputTokens / stopWhen.
  3. Parse structured output with Output.object + Zod. The schema is the SSoT. An as cast is forbidden.
  4. Tools are tool({ inputSchema, execute }), agents are stopWhen: stepCountIs(n). But lean statically-writable branching toward deterministic code — the optimal answer for speed, cost, and reliability.
  5. Weave cost (model routing), reliability (timeout/fallback), security (keys/injection), and eval into the design from the start.

I've built multiple generative-AI production systems in Python / AWS (a RAG platform, voice-AI sales, AI video localization, a broadcaster's internal multi-AI platform). This article is a port of the design that bridges that "distance between demo and production" to the TypeScript / Vercel ecosystem.

"One person × generative AI (Claude Code): fast, cheap, and safe." If you want to integrate LLM features into an existing product, see the related case study on the generative-AI voice chatbot and reach out via the contact page. I provide the design together with the decision axis of not using excessive technology for the requirement.


友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.
