# Building Production LLM Apps with Vercel AI SDK v6: Streaming, Tool Calling, Structured Output, and RAG in Real Code

> A practical guide to building production-quality LLM apps in TypeScript. Centered on Vercel AI SDK v6 and AI Gateway, explained with working code and decision axes: generateText/streamText, structured output with Zod schemas, tool calling and agents, the useChat streaming UI, RAG with embed/embedMany, and cost, reliability, security, and observability.

- Published: 2026-06-24
- Author: 友田 陽大
- Tags: TypeScript, RAG, Next.js, Vercel, AI, Claude, アーキテクチャ設計
- URL: https://tomodahinata.com/en/blog/vercel-ai-sdk-production-llm-apps-streaming-tools-rag
- Category: Generative AI, LLMs & RAG

## Key points

- Make AI Gateway the default for new production LLM apps, and get provider-independence and automatic fallback with a provider/model string
- Use streamText for long responses humans read, and generateText for results a machine uses next; in v6 use maxOutputTokens / stopWhen
- Parse structured output with Output.object + Zod, and an as cast is forbidden. The schema becomes the SSoT of type, validation, and instruction
- Define tools with inputSchema + execute, but lean statically-writable branching toward deterministic code — faster, cheaper, more certain
- Weave cost (model routing), reliability (timeout/fallback), security (keys/injection), and eval into the design from the start

---

"If you're just hitting an LLM, `fetch` is enough" — that's what I thought at first. Indeed, if you're just making a demo, calling each provider's SDK directly works. But **the moment you want to switch providers**, **the moment you wire a streaming UI to the front**, **the moment you make it an agent with tool calls**, **the moment you want to trust the LLM's output as a type** — at any one of these, a hand-written thin wrapper falls apart.

So far I've designed and implemented **generative-AI production systems in the Python / AWS family**, like a [LangChain + Pinecone RAG platform](/blog/langchain-pinecone-production-rag-system), a [Bedrock × pgvector voice-AI sales agent](/blog/production-voice-ai-sales-agent-bedrock-pgvector), an AI video-localization platform, and a broadcaster's internal multi-AI platform. This article rewrites that experience from the angle of the **TypeScript / Next.js / Vercel ecosystem**. I leave deep RAG theory and voice pipelines to the above two articles and concentrate here on "**how to assemble a production LLM app with the Vercel AI SDK**."

> The reference version of this article: **Vercel AI SDK v6** (the `ai` package). Model connection is in principle via **Vercel AI Gateway** (a `"provider/model"` string). All code assumes TypeScript / App Router and is posted at a granularity you can actually move your hands with. The official docs are linked throughout at [ai-sdk.dev](https://ai-sdk.dev/docs/introduction) and [Vercel AI Gateway](https://vercel.com/docs/ai-gateway).

## 0. The big picture: how to combine the AI SDK's "three layers"

Using the Vercel AI SDK in production effectively means designing these three layers.

| Layer | Role | Central API | This article's section |
| --- | --- | --- | --- |
| **Connection layer** | Unified connection to providers/models, fallback, billing observation | AI Gateway (`"provider/model"` string) | §1 |
| **Core layer** | Text generation, structured output, tool calling, embeddings | `generateText` / `streamText` / `Output` / `tool` / `embed` | §2–§4, §6 |
| **UI layer** | A streaming chat UI, rendering tool states | `useChat` (`@ai-sdk/react`) | §5 |

From bottom to top, "**connect with the Gateway, make it think with Core, show it with the UI**." We proceed in this flow. At the end, §7 and §8 summarize the cross-cutting concerns of production operation (cost, reliability, security, observability) and the pitfalls.

---

## 1. Why a unified SDK + AI Gateway (vs. calling providers directly)

The first design decision is here. **Use a provider-specific SDK (`@anthropic-ai/sdk`, etc.) directly, or use the AI SDK + AI Gateway.**

To say the conclusion first, for a new production app I make **AI Gateway the default**. The reasons are as follows (consistent with the [official](https://vercel.com/docs/ai-gateway) features).

- **Hundreds of models with one key.** With a single `AI_GATEWAY_API_KEY`, you can cut across OpenAI / Anthropic / Google and others.
- **Provider-independent.** You can swap `model` with a one-line string. You can structurally avoid vendor lock-in.
- **Automatic fallback.** If a provider goes down, it auto-retries to another provider (reliability §7).
- **Spend observability.** You can observe tokens and cost across providers.
- **No markup on tokens.** Same cost as a direct provider contract ([official](https://vercel.com/docs/ai-gateway) states this clearly).

```ts
// .env.local — Vercel本番ではOIDCで自動認証されるため不要。ローカル/他基盤ではこれを使う
// AI_GATEWAY_API_KEY=your_api_key_here
```

```ts
// lib/ai/models.ts — モデルIDを1箇所に集約（DRY / 差し替え容易性）
// AI Gateway は AI SDK の「既定のグローバルプロバイダ」なので、文字列だけで繋がる
export const MODELS = {
  // 重い推論・最終回答
  reasoning: "anthropic/claude-opus-4.8",
  // 通常のチャット・要約・抽出（速度とコストのバランス）
  chat: "anthropic/claude-sonnet-4.6",
  // 分類・ルーティングなど軽量タスク（最安・最速）
  fast: "anthropic/claude-haiku-4.5",
  // RAG用の埋め込み
  embedding: "openai/text-embedding-3-small",
} as const;
```

> **Don't write model IDs from memory.** Confirming the latest IDs with `curl -s https://ai-gateway.vercel.sh/v1/models | jq -r '.data[].id'` is the official practice ([Gateway docs](https://vercel.com/docs/ai-gateway)). The example above is matched to the latest Claude as of 2026-06.

**So when do you call a provider directly?** Only when you want to use a provider-specific preview feature (a specific beta API, or a latest flag not yet supported by the Gateway) the same day. In that case, use a dedicated provider package like `@ai-sdk/anthropic` and write `model: anthropic("claude-...")`. But this is **exception operation**. Decide "the default is the Gateway string," and your design won't waver.

---

## 2. generateText and streamText: when to stream

There are only two entrances to text generation. `generateText` (returns the finished form in one go) and `streamText` (streams tokens incrementally). Both are imported from the `ai` package ([official: Generating Text](https://ai-sdk.dev/docs/ai-sdk-core/generating-text)).

### 2-1. The decision axis: choosing among the three APIs

What you agonize over in practice is "`generateText` or `streamText`, or structured output." Let me first pin down the big picture in a table.

| API | What it returns | When to use | Perceived speed |
| --- | --- | --- | --- |
| `generateText` | Finished text/structured result in one go | Batch processing, server-internal intermediate processing, short responses, structured extraction | Wait until done |
| `streamText` | An incremental token stream | A long response the user reads on screen, chat | The first character is fast |
| `generateText` + `Output` | A **typed object** validated by a Zod schema | When you use the LLM's output as a type in subsequent code (§3) | Wait until done |

The principle: **long responses humans read, stream; results a machine uses next, in one go.** The perceived silence affects only humans.

### 2-2. One-shot generation in a Server Action (generateText)

Processing that completes inside the server — summarize and save to the DB, classify and branch — is straightforward with `generateText`. You can place it directly in a Next.js Server Action.

```ts
// app/actions/summarize.ts
"use server";

import { generateText } from "ai";
import { z } from "zod";
import { MODELS } from "@/lib/ai/models";

// 境界での入力検証：外部入力は常にZodで絞る（CLAUDE.md準拠）
const InputSchema = z.object({ text: z.string().min(1).max(20_000) });

export async function summarize(raw: unknown): Promise<string> {
  const { text } = InputSchema.parse(raw);

  const result = await generateText({
    model: MODELS.chat,
    system: "あなたは編集者です。日本語で3文以内に要約してください。",
    prompt: text,
    maxOutputTokens: 512, // v6では maxTokens ではなく maxOutputTokens
    temperature: 0.3,
  });

  return result.text;
}
```

> Be careful about the v6 naming changes. `maxTokens` changed to **`maxOutputTokens`**, and the `maxSteps` described later changed to **`stopWhen: stepCountIs(n)`** ([Common Errors](https://ai-sdk.dev/docs/troubleshooting)). Copy-pasting from old articles is the most accident-prone point.

### 2-3. Streaming in a Route Handler (streamText)

A response the user reads on screen, return incrementally with `streamText`. In an App Router Route Handler, you can **convert the result directly into the response**.

```ts
// app/api/generate/route.ts
import { streamText } from "ai";
import { z } from "zod";
import { MODELS } from "@/lib/ai/models";

export const maxDuration = 30; // Vercel Functionsのタイムアウト上限（秒）

const BodySchema = z.object({ prompt: z.string().min(1).max(4_000) });

export async function POST(req: Request) {
  // 外部入力は必ず検証してから渡す
  const parsed = BodySchema.safeParse(await req.json());
  if (!parsed.success) {
    return Response.json({ error: "invalid input" }, { status: 400 });
  }

  const result = streamText({
    model: MODELS.chat,
    prompt: parsed.data.prompt,
    // 失敗を握り潰さない：監視基盤へ送る
    onError: ({ error }) => console.error("streamText error", error),
  });

  // チャットUI(useChat)で受けるなら toUIMessageStreamResponse、
  // 素のテキストストリームでよいなら toTextStreamResponse
  return result.toTextStreamResponse();
}
```

`result.textStream` is **a ReadableStream and an AsyncIterable**, so you can also loop it inside the server with `for await (const chunk of result.textStream)`. Use `fullStream` and you can observe all events including tool calls and step boundaries.

---

## 3. Structured output: "parse, don't validate" at the LLM boundary

This is the feature that makes a decisive difference from a hand-written wrapper. **`JSON.parse`-ing the LLM's output and casting it `as MyType` is a source of accidents in production.** The LLM breaks JSON, and adds or removes fields.

In v6, pass **`Output`** to `generateText` and get a **validated, typed object** with a Zod schema (the old `generateObject` was deprecated and integrated into the `Output` API — [Generating Structured Data](https://ai-sdk.dev/docs/ai-sdk-core/generating-structured-data)). The schema becomes the **single source of truth (SSoT)**, unifying type, validation, and prompt instruction.

### 3-1. Output.object: extracting a single object

```ts
// app/actions/extract-invoice.ts
"use server";

import { generateText, Output } from "ai";
import { z } from "zod";
import { MODELS } from "@/lib/ai/models";

// このスキーマが「型」「実行時検証」「LLMへの指示」すべての源
const Invoice = z.object({
  vendor: z.string().describe("請求元の会社名"),
  total: z.number().describe("税込合計金額（円・整数）"),
  dueDate: z.string().describe("支払期日 ISO 8601 (YYYY-MM-DD)"),
  lineItems: z.array(
    z.object({ name: z.string(), amount: z.number() }),
  ),
});

export async function extractInvoice(ocrText: string) {
  const result = await generateText({
    model: MODELS.chat,
    output: Output.object({ schema: Invoice }),
    prompt: `次のOCRテキストから請求情報を抽出してください:\n${ocrText}`,
  });

  // result.output は z.infer<typeof Invoice> として型付け＆検証済み
  const invoice = result.output;
  return invoice; // ここから先は安全に型として扱える
}
```

Attach `.describe()` to the schema, and that description works as the field instruction to the LLM. The essence is that **the need to write "return as JSON" in the prompt disappears**.

### 3-2. Output.array and Output.choice

Array extraction and classification into fixed options each have their own dedicated output mode.

```ts
import { generateText, Output } from "ai";
import { z } from "zod";

// 配列：要素スキーマを element に渡す
const tasks = await generateText({
  model: MODELS.fast,
  output: Output.array({
    element: z.object({ title: z.string(), priority: z.enum(["high", "low"]) }),
  }),
  prompt: "次の議事録からToDoを抽出: ...",
});
// tasks.output: { title: string; priority: "high" | "low" }[]

// 分類：選択肢を固定（ハルシネーションで未知ラベルが出ない）
const sentiment = await generateText({
  model: MODELS.fast,
  output: Output.choice({ options: ["positive", "negative", "neutral"] as const }),
  prompt: "この声を分類: 「最高の製品でした」",
});
// sentiment.output: "positive" | "negative" | "neutral"
```

`Output.choice` is especially powerful in classification tasks. Because **the model can't return outside the options**, the subsequent branching logic doesn't fall apart. This is the thinking of "eliminating hallucination by structure," the same idea by which I crushed model-number errors in the [voice-AI article](/blog/production-voice-ai-sales-agent-bedrock-pgvector).

> If you want to draw structured results incrementally in streaming, use `streamText` + `output` with `partialOutputStream`. It suits UX like generating a form preview while filling it.

---

## 4. Tool calling and agents: when to give the model tools

What gives the LLM **access to the external world** (DB search, API calls, computation) is tool calling. In v6, you define it with the `tool()` helper. **The argument schema is `inputSchema`** (renamed from `parameters` in v6), and here too Zod is the SSoT ([Tools and Tool Calling](https://ai-sdk.dev/docs/ai-sdk-core/tools-and-tool-calling)).

### 4-1. Defining tool()

```ts
// lib/ai/tools/get-order.ts
import { tool } from "ai";
import { z } from "zod";
import { findOrder } from "@/lib/db/orders";

export const getOrder = tool({
  description: "注文IDから注文状況を取得する。ユーザーが配送状況を尋ねた時に使う。",
  // inputSchema が引数の型・検証・モデルへの説明を兼ねる
  inputSchema: z.object({
    orderId: z.string().describe("注文ID（例: ORD-12345）"),
  }),
  execute: async ({ orderId }) => {
    // execute の中は普通のTypeScript。ここでDB/APIに触る
    const order = await findOrder(orderId);
    if (!order) return { found: false as const };
    return { found: true as const, status: order.status, eta: order.eta };
  },
});
```

### 4-2. Multi-step agents (stopWhen)

Pass tools and specify `stopWhen: stepCountIs(n)`, and the model **autonomously loops** "call a tool → read the result → think further" (in v6, use this instead of the old `maxSteps`).

```ts
// app/api/agent/route.ts
import { streamText, stepCountIs, convertToModelMessages, type UIMessage } from "ai";
import { getOrder } from "@/lib/ai/tools/get-order";
import { MODELS } from "@/lib/ai/models";

export async function POST(req: Request) {
  const { messages }: { messages: UIMessage[] } = await req.json();

  const result = streamText({
    model: MODELS.chat,
    system: "あなたはカスタマーサポートです。注文照会にはツールを使うこと。",
    messages: convertToModelMessages(messages),
    tools: { getOrder },
    stopWhen: stepCountIs(5), // 暴走防止のステップ上限は必須
    onStepFinish: ({ toolCalls, toolResults }) => {
      // 可観測性：どのツールをどう呼んだかを必ずログ
      console.log("step", { toolCalls, toolResults });
    },
  });

  return result.toUIMessageStreamResponse();
}
```

If you want to structure an agent, v6 has the **`ToolLoopAgent`** class, and you can propagate the type all the way to the UI side with `InferAgentUIMessage<typeof agent>` ([type-safe agents](https://ai-sdk.dev/docs/agents)). The recommended structure is to split tools into `lib/tools/` and agents into `lib/agents/`.

### 4-3. The decision axis: passing tools vs. deterministic code

This is where the designer shows their skill. **Make everything a tool, and it becomes slow, expensive, and unstable.**

| | Passing tools (let the LLM judge) | Deterministic code (write it yourself) |
| --- | --- | --- |
| Suited case | Input is natural language and which operation is needed isn't decided in advance | Both input and processing flow are settled |
| Speed / cost | An LLM round-trip per step (slow, expensive) | Zero to one LLM call (fast, cheap) |
| Reliability | Wavers depending on the model | Deterministic, easy to test |
| Example | "Summarize last month's delayed orders" | A simple lookup where the order ID is known |

The principle: **if the branching can be written statically, write it.** Entrust the LLM with only the boundary part of "converting natural language into structured intent," and process beyond that with deterministic code — this wins on all of speed, cost, and testability. A tool like `getOrder` is convenient, but for "a screen that always only does order lookups," it's healthier to not pass tools and instead extract `orderId` with §3's structured output and directly call `findOrder`.

---

## 5. useChat: building a streaming UI all the way to a11y

The front-end chat UI is assembled with `useChat` (`@ai-sdk/react`). There are two things that **changed greatly in v6**, so you'll definitely get stuck with old knowledge ([Chatbot](https://ai-sdk.dev/docs/ai-sdk-ui/chatbot)).

1. **Input state is now self-managed** (`input` / `handleInputChange` / `handleSubmit` were removed).
2. The API specification is now via **`transport: new DefaultChatTransport({ api })`**.
3. Messages are not `content` but a **`parts` array** (a mix of text, tool calls, etc.).

```tsx
// app/chat/chat.tsx
"use client";

import { useChat } from "@ai-sdk/react";
import { DefaultChatTransport } from "ai";
import { useState } from "react";

export function Chat() {
  const [input, setInput] = useState("");
  const { messages, sendMessage, status, stop, regenerate } = useChat({
    transport: new DefaultChatTransport({ api: "/api/agent" }),
  });

  const isBusy = status === "submitted" || status === "streaming";

  function onSubmit(e: React.FormEvent) {
    e.preventDefault();
    if (!input.trim() || isBusy) return;
    sendMessage({ text: input });
    setInput("");
  }

  return (
    <div>
      {/* a11y: 生成テキストはライブリージョンで読み上げる */}
      <div aria-live="polite" aria-atomic="false" role="log">
        {messages.map((m) => (
          <article key={m.id} aria-label={m.role === "user" ? "あなた" : "アシスタント"}>
            {m.parts.map((part, i) => {
              if (part.type === "text") return <p key={i}>{part.text}</p>;
              // 型付きツールパート: tool-getOrder / states で安全に分岐
              if (part.type === "tool-getOrder") {
                if (part.state === "output-available") {
                  return <p key={i}>注文状況: {String(part.output.found)}</p>;
                }
                return <p key={i}>注文を照会中…</p>;
              }
              return null;
            })}
          </article>
        ))}
      </div>

      <form onSubmit={onSubmit}>
        <label htmlFor="msg" className="sr-only">メッセージ</label>
        <input
          id="msg"
          value={input}
          onChange={(e) => setInput(e.target.value)}
          disabled={isBusy}
          aria-disabled={isBusy}
        />
        {isBusy ? (
          // 停止ボタン：長い生成を中断できることはUXとコストの両面で重要
          <button type="button" onClick={stop} aria-label="生成を停止">停止</button>
        ) : (
          <button type="submit" aria-label="送信">送信</button>
        )}
        <button type="button" onClick={() => regenerate()} disabled={isBusy}>
          再生成
        </button>
      </form>
    </div>
  );
}
```

Let me narrow a11y to three points.

- **Live region**: wrap with `aria-live="polite"` and the streamed body is read aloud incrementally to a screen reader. `assertive` interrupts harshly, so use `polite` for a conversation log.
- **Stop button**: emit `stop()` while `status` is `streaming`. You can stop a runaway during read-aloud, and also stop wasted token billing.
- **State visualization**: make loading and errors explicit with `status` (`submitted` / `streaming` / `ready` / `error`). Don't swallow `error`.

Tool parts arrive with a typed name `tool-{toolName}`, and you can draw them stepwise with `state` (`input-streaming` / `input-available` / `output-available`). Because `part.input` / `part.output` exist only in the corresponding state, **accessing them without a state check is rejected by TS** — this is a behavior to welcome as a safety device.

---

## 6. RAG essentials: embed / embedMany, and the deep dive goes to other articles

The essence of RAG (retrieval-augmented generation) is "**vector-search the documents relevant to the question and inject them into the prompt**." The AI SDK provides embedding APIs (`embed` / `embedMany` / `cosineSimilarity`, all from `ai`) ([Embeddings](https://ai-sdk.dev/docs/ai-sdk-core/embeddings)).

```ts
// lib/ai/rag.ts
import { embed, embedMany, cosineSimilarity, generateText } from "ai";
import { MODELS } from "@/lib/ai/models";

// インデックス時：文書を一括で埋め込む（embedMany はバッチで効率的）
export async function indexDocuments(chunks: string[]) {
  const { embeddings } = await embedMany({
    model: MODELS.embedding,
    values: chunks,
  });
  // embeddings[i] を chunks[i] と一緒にベクトルDBへ保存（pgvector / Pinecone 等）
  return chunks.map((text, i) => ({ text, embedding: embeddings[i] }));
}

// 検索時：質問を埋め込み、類似文書を取る
export async function answer(question: string, store: { text: string; embedding: number[] }[]) {
  const { embedding } = await embed({ model: MODELS.embedding, value: question });

  // 本番ではベクトルDB側でANN検索する。ここは概念を示すためのインメモリ例
  const top = store
    .map((d) => ({ ...d, score: cosineSimilarity(embedding, d.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 4);

  const context = top.map((d) => d.text).join("\n---\n");
  const { text } = await generateText({
    model: MODELS.chat,
    system: "提供されたコンテキストのみを根拠に回答。無い情報は『分かりません』と答える。",
    prompt: `コンテキスト:\n${context}\n\n質問: ${question}`,
  });
  return text;
}
```

But **production RAG doesn't end with these 10 lines.** Chunk-splitting strategy, metadata filters, reranking, hallucination countermeasures, quantitative accuracy evaluation, cost estimation — this is where the real battle is. I implemented this thoroughly in Python / LangChain / Pinecone. **The deep dive is written in [Building a production RAG system with LangChain + Pinecone](/blog/langchain-pinecone-production-rag-system) and [RAG design for voice AI with pgvector](/blog/production-voice-ai-sales-agent-bedrock-pgvector).** Regard this article's TypeScript implementation as a port of that design philosophy to the Vercel ecosystem.

> **An honest take on choosing the language**: if you have heavy preprocessing, evaluation pipelines, or existing ML assets, Python is still advantageous. On the other hand, for **a thin RAG tightly coupled with a web app** (like co-locating internal document search in a Next.js app), making it end-to-end with TypeScript + AI SDK is simpler to operate. This is a decision axis after doing both.

---

## 7. Production-operation pressure points: cost, reliability, security, observability

Up to here was "features." Let me summarize the cross-cutting design for withstanding production, in the order it paid off.

### 7-1. Cost: model routing and token budget

The biggest cost reduction is "**not processing everything with the top-tier model**."

- **Model routing**: classification, extraction, and routing with `haiku`, normal responses with `sonnet`, only hard problems with `opus`. §1's `MODELS` constant is the foundation of this branching. Just by **not using an excessive model for the task**, cost changes by an order of magnitude.
- **Token budget**: always set an upper limit with `maxOutputTokens`. Unbounded output is an accident.
- **Prompt caching**: for a long fixed system/context, **push the variable part to the tail** so the provider's prompt caching takes effect.
- **Stream the output and stop early**: provide `stop()` in the UI (§5). Don't keep generating long text that isn't read.

### 7-2. Reliability: timeout, retry, fallback

- **Timeout**: set `maxDuration` on the Route Handler (§2-3). For the client, an `AbortController`.
- **Fallback**: AI Gateway auto-retries to another provider on a failure ([official](https://vercel.com/docs/ai-gateway)). This is one of the biggest practical benefits of choosing "unified SDK + Gateway." With direct calls, you'd write all this fallback yourself.
- **Idempotency**: if you update the DB with a generation result, absorb duplicates with an idempotency key. LLM calls fail and duplicate normally over the network. This design philosophy is detailed in the [voice-AI article](/blog/production-voice-ai-sales-agent-bedrock-pgvector).

### 7-3. Security: keys, prompt injection

- **Don't expose API keys to the client.** Always make LLM calls **from the server (Route Handler / Server Action)**. `AI_GATEWAY_API_KEY` is a server-only environment variable; don't attach `NEXT_PUBLIC_`. A configuration that hits the provider directly from the browser is key leakage itself.
- **Validate and sanitize user input.** As shown in §2 and §3, always pass external input through Zod at the boundary. Don't forget length limits either (token-bombing countermeasure).
- **Design on the premise of prompt injection.** Don't trust user input or external documents injected via RAG. **Clearly delimit** them from the system prompt, and place guards in tools like "make destructive operations require confirmation" inside `execute`. Passing the LLM's output directly to `eval`, SQL, or shell is strictly forbidden. Narrow tools' execution permissions with the **principle of least privilege**.

### 7-4. Observability and eval: what you can't measure, you can't fix

- **Always log**: the model used, input/output tokens (`result.usage`), latency, `finishReason`, and which tool was called. `onFinish` / `onStepFinish` are the hooks. You can also observe spend across the board on the AI Gateway dashboard.
- **Eval (evaluation)**: before going to production, have **tests with fixed expected outputs** against a representative input set. Structured output (§3) makes schema validation itself a minimal eval. Don't ship on "looks good by eye" — this is the most often-omitted and most accident-prone step in generative AI.
- **Testability**: a `tool`'s `execute` can be unit-tested as a pure TypeScript function. Mock the LLM call itself, and **test the tool's logic and the structured-output schema separately** is the realistic strategy.

---

## 8. Common pitfalls

Let me line up the landmines I step (stepped) on in real projects, paired with countermeasures.

| Pitfall | What happens | Countermeasure |
| --- | --- | --- |
| Trusting LLM output with `as MyType` | Types collapse and crash in production | **parse** with `Output.object` + Zod (§3) |
| Copy-pasting `maxTokens` / `maxSteps` | Doesn't work in v6 | `maxOutputTokens` / `stopWhen: stepCountIs(n)` (§2, §4) |
| Exposing the API key to the client | Key leakage, billing explosion | LLM calls server-only (§7-3) |
| No timeout / fallback | The whole app stops on a provider outage | `maxDuration` + the Gateway's automatic fallback (§7-2) |
| Ignoring prompt injection | An injected document overwrites the system | Don't trust input and delimit it, narrow tool permissions (§7-3) |
| Entrusting everything to tools | Slow, expensive, unstable | Deterministic code if you can branch statically (§4-3) |
| Going to production without eval | You can't notice "degraded before you knew it" | Tests with fixed expected outputs + schema validation (§7-4) |
| The old `useChat`'s `input`/`handleSubmit` | Doesn't exist in v6, build fails | Self-manage input state + `DefaultChatTransport` (§5) |
| Processing everything with the top-tier model | Cost swells by an order of magnitude | Per-task model routing (§7-1) |

---

## Summary: the AI SDK is the foundation for assembling LLM apps "fast, cheap, and safe"

Using the Vercel AI SDK v6 in production effectively means **connecting with the Gateway, making it think with Core, showing it with the UI, and protecting it with cross-cutting operational design**. The key points in five lines.

1. **Connection is AI Gateway + a `"provider/model"` string.** Provider-independence, automatic fallback, and spend observation take effect by default. Keys server-only.
2. **Responses humans read, `streamText`; results a machine uses next, `generateText`.** In v6, `maxOutputTokens` / `stopWhen`.
3. **Parse structured output with `Output.object` + Zod.** The schema is the SSoT. An `as` cast is forbidden.
4. **Tools are `tool({ inputSchema, execute })`, agents are `stopWhen: stepCountIs(n)`.** But **lean statically-writable branching toward deterministic code** — the optimal answer for speed, cost, and reliability.
5. **Weave cost (model routing), reliability (timeout/fallback), security (keys/injection), and eval into the design from the start.**

I've built multiple generative-AI production systems in Python / AWS (a [RAG platform](/blog/langchain-pinecone-production-rag-system), [voice-AI sales](/blog/production-voice-ai-sales-agent-bedrock-pgvector), AI video localization, a broadcaster's internal multi-AI platform). This article is a port of the design that bridges that "**distance between demo and production**" to the TypeScript / Vercel ecosystem.

"With one person × generative AI (Claude Code), fast, cheap, and safe" — if you want to integrate LLM features into an existing product, for that consultation, see the related case [the generative-AI voice chatbot](/case-studies/ai-voice-chatbot) and reach out from [contact](/contact). I provide the design together with the decision axis of **not using excessive technology for the requirement**.
