"I called an LLM in a PoC and it worked. But the moment I shipped to production, the JSON broke and it fell over."
"I assembled an agent that uses tools, and it melted tokens in an infinite loop."
"I made a streaming UI, but it can't be interrupted or canceled, and the user stares at a frozen screen."
"I can't read the cost. I first notice 'it's expensive' when the month-end bill comes."
Both companies considering outsourcing AI-feature development and developers implementing it themselves stop here. "Just call an LLM" finishes in a day. The work of raising it to production quality is the body itself.
I am a core engineer of a B2B SaaS that won the Minister of Economy, Trade and Industry Award, and I have built an in-house AI platform for a major domestic broadcaster (5 AI services, an auth hub, speech synthesis, OCR × speech-recognition typo detection, generative-AI review assistance). What worked there was not a flashy model but the unglamorous production techniques of structured output, safe tool design, cancelable streaming, prompt caching, and a verification gate.
This article shows them in TypeScript real code, faithful to Anthropic official (docs.anthropic.com / platform.claude.com) and Vercel AI SDK v6 official (ai-sdk.dev / vercel.com/docs). I state the referenced official URL at the end of each section.
The map of this article
- Model selection (Opus 4.8 / Sonnet 4.6 / Haiku 4.5 / Fable 5)
- Basics: AI SDK v6's
generateText/streamTextand model specification with the AI Gateway- Structured output: type-safe with
generateObject/streamObject+ Zod- Tool use:
tool()definition and turning it into a multi-step agent- Streaming UX: Route Handler →
useChat, cancel/interrupt- Cost / performance optimization: prompt caching, routing, measurement
- Reliability / observability: retries, guardrails, hallucination suppression
- Security: API-key concealment, prompt injection, PII
First, Grasp This: "One Person × Generative AI" Is Fast and Cheap Because It Doesn't Skip Production Techniques
Make generative AI (Claude Code) an accelerator and even one person can implement enterprise AI features fast and cheap. But that's not because "you can write code fast." It's because you automate the verification gates and uphold the production-quality patterns.
- Always bind the output with a schema and validate at the boundary (don't flow broken JSON downstream)
- Minimize permissions on tools and isolate side effects (make the blast radius of a runaway small)
- Make streaming cancelable (stop the user's departure)
- Pre-measure the cost and cut it with caching (don't be shocked by the bill)
- Pass the output through a human or another-model verification gate (don't put hallucinations into production)
Below, let me land these 5 principles into concrete code.
Model Selection: Intelligence vs Cost vs Latency
The first design judgment is model selection. Based on Anthropic's official model list, let me table the use distinctions as of June 2026. "All Opus" and "all Haiku" are both wrong. The optimal point differs per scene.
| Model | Model ID | Context | Assumed use | Decision axis |
|---|---|---|---|---|
| Claude Opus 4.8 | claude-opus-4-8 | 1M | Long-running autonomous agents, complex code generation, high-difficulty knowledge work | Highest intelligence. The core of hard tasks |
| Claude Sonnet 4.6 | claude-sonnet-4-6 | 1M | The default for most apps. RAG, summarization, dialogue, tool use | The best balance of speed and intelligence |
| Claude Haiku 4.5 | claude-haiku-4-5 | 200K | Classification, extraction, routing, preprocessing, large batches | Fastest and cheapest. Simple tasks |
| Claude Fable 5 | claude-fable-5 | 1M | The hardest reasoning, ultra-long-term agents | The highest-performance. The fee exceeds Opus |
The way to think about allocation in practice:
- The default is Sonnet 4.6. When in doubt, here. Many B2B features (RAG answers, in-house search, shaping) get sufficient quality with Sonnet.
- Only the hard steps to Opus 4.8. Bulk implementation from a spec, deep refactoring, a long-running autonomous loop. It costs, but invest where the cost of fixing an error is higher.
- The front-stage / large processing to Haiku 4.5. Simple tasks like "which category is this inquiry" or "extract the date from this document" go to the cheap, fast Haiku. Haiku also works for an agent's sub-agents.
- Fable 5 only when you explicitly choose it. The tokenizer changed, so the same content consumes about 30% more tokens, and the fee exceeds Opus. The
thinkingparameter's spec also differs (always on), so think of it as dedicated to "the hardest problems." It's not an object to choose with "the latest, just in case."
Tip: intelligence tuning can be done not only by swapping the model but with the
effortparameter (low/medium/high/xhigh/max). With Opus 4.8,highis the default, andxhighis recommended for coding/agents. Via the AI SDK, you can pass it as a model-specific option inproviderOptions.anthropic.
Source: Models overview / Effort
Basics: Call a Model with AI SDK v6 and the AI Gateway
The foundation of implementation is the Vercel AI SDK v6 (the ai package). How you specify the model governs the production design, so first get this right.
Recommended: The AI Gateway's "provider/model" String
Go through the Vercel AI Gateway (GA'd in August 2025) and you can specify a model with just a "provider/model" string like "anthropic/claude-opus-4.8". No import of a provider-specific package (@ai-sdk/anthropic) is needed.
Reasons to use the AI Gateway (the official advantages):
- Multiple providers with 1 key. Access Anthropic / Bedrock / Vertex, etc. with a single
AI_GATEWAY_API_KEY. - Automatic fallback. Even if one provider goes down, it auto-retries on another path.
- No markup on tokens (no markup). The same unit price as direct use.
- Spend visibility. Monitor cost across providers.
// app/api/summary/route.ts
import { generateText } from "ai";
export async function POST(req: Request) {
const { article } = (await req.json()) as { article: string };
const { text, usage, finishReason } = await generateText({
// "provider/model" 文字列。@ai-sdk/anthropic の import は不要
model: "anthropic/claude-sonnet-4.6",
system:
"あなたは技術記事の要約者です。事実のみを、箇条書き3点で日本語要約してください。",
prompt: `次の記事を要約してください:\n\n${article}`,
});
return Response.json({ text, usage, finishReason });
}
Authentication is just reading the environment variable AI_GATEWAY_API_KEY (an OIDC token also works on a Vercel deploy). Not writing the API key in code is the big premise (details in the security chapter later).
Supplement: the AI Gateway's model slug is officially notated with dot separators (
anthropic/claude-opus-4.8). On the other hand, when hitting the Anthropic API directly (with@ai-sdk/anthropicor the Anthropic SDK), the model ID is the hyphen-separated primary notation (claude-opus-4-8/claude-sonnet-4-6/claude-haiku-4-5). Note the notation changes by path.
Make Streaming the Default
For requests with long inputs, long outputs, or high max_tokens, make streaming the default to avoid HTTP timeouts. The server side is streamText:
import { streamText } from "ai";
const result = streamText({
model: "anthropic/claude-sonnet-4.6",
system: "簡潔で正確な日本語で答えてください。",
prompt: "Reactのレンダリング最適化を3行で説明して。",
});
for await (const textPart of result.textStream) {
process.stdout.write(textPart);
}
When You Want to Explicitly Use a Specific Provider
Only when you have explicit requirements like "I want to hit Anthropic directly" or "I want fine prompt-caching specification" do you use the specific package @ai-sdk/anthropic.
import { createAnthropic } from "@ai-sdk/anthropic";
import { generateText } from "ai";
const anthropic = createAnthropic({
apiKey: process.env.ANTHROPIC_API_KEY, // 直叩きの場合のみ
});
const { text } = await generateText({
model: anthropic("claude-sonnet-4-6"), // 一次表記(ハイフン区切り)
prompt: "...",
});
Source: AI Gateway / AI Gateway: Text Generation / AI SDK Core: generateText/streamText
Structured Output: Don't Flow Broken JSON to Production
What works first in a production AI feature is structured output. Just asking an LLM to "return it as JSON" mixes in explanatory text before/after, or drops fields. AI SDK v6 binds the output with a schema (Zod), and the SDK does the parsing and validation too.
generateObject: Extraction, Classification, Shaping
An example of extracting structured data from an inquiry email. Pass Zod to schema and the result's object is returned typed.
import { generateObject } from "ai";
import { z } from "zod";
const LeadSchema = z.object({
name: z.string().describe("問い合わせ者の氏名"),
email: z.string().describe("連絡先メールアドレス"),
plan: z.enum(["lite", "standard", "enterprise"]).describe("希望プラン"),
interests: z.array(z.string()).describe("関心のある機能"),
demoRequested: z.boolean().describe("デモ希望の有無"),
});
const { object, usage } = await generateObject({
model: "anthropic/claude-haiku-4.5", // 抽出は安価なHaikuで十分
schema: LeadSchema,
prompt:
"次の問い合わせから情報を抽出: 「田中太郎です(tanaka@example.com)。" +
"Enterpriseを検討中で、API連携とSSOに関心あり。デモ希望です」",
});
// object は LeadSchema 型として保証される
console.log(object.plan); // "enterprise"
Some of Zod's constraints (
min/max/minLength, etc.) aren't supported in Claude's schema, but the AI SDK validates them on the client side. Validation at the boundary is the core of the design of treating the AI's output too as "external input." Don't trust; always bind with a schema.
Classification Tasks: Fix the Choices with enum
Classification like "which department to route this inquiry to" is most robust by fixing the choices with enum. Send it to Haiku to curb cost.
const { object } = await generateObject({
model: "anthropic/claude-haiku-4.5",
schema: z.object({
category: z.enum(["技術サポート", "営業", "請求", "その他"]),
urgency: z.enum(["低", "中", "高"]),
}),
prompt: `次の問い合わせを分類: ${inquiry}`,
});
streamObject: Partial Display While Generating
When generating a long report or list, with streamObject you can receive partial objects sequentially. You can produce a "filling-in" experience in the UI.
import { streamObject } from "ai";
import { z } from "zod";
const { partialObjectStream } = streamObject({
model: "anthropic/claude-sonnet-4.6",
schema: z.object({
title: z.string(),
sections: z.array(z.object({ heading: z.string(), body: z.string() })),
}),
prompt: "新製品の発表ブログ記事の構成を生成して。",
});
for await (const partial of partialObjectStream) {
// partial は生成途中の部分オブジェクト(型は Deep Partial)
render(partial);
}
Tool Use: Turning It into an Agent and the "Don't Let It Run Away" Design
Giving an LLM access to external APIs, DBs, and computation is tool use (tool calling). In AI SDK v6, you define it with the tool() helper, bind input with Zod, and write the actual processing in execute.
Tool Definition
import { tool } from "ai";
import { z } from "zod";
const getOrderStatus = tool({
description:
"注文IDから配送ステータスを取得する。ユーザーが注文状況を尋ねたときに使う。",
inputSchema: z.object({
orderId: z.string().describe("注文ID(例: ORD-12345)"),
}),
execute: async ({ orderId }) => {
// 権限最小化: 読み取り専用の照会APIだけを呼ぶ
const status = await db.orders.findStatus(orderId);
return { orderId, status }; // 戻り値はモデルのコンテキストに入る
},
});
The knack of a tool's description is to make "when to call it" explicit. Not just "what it does" but writing "when the user asks for ○○" raises recent Claude's call-judgment accuracy.
Multi-Step = Turning It into an Agent (stopWhen)
What enables the iteration of "look at a tool's result and call the next tool" is stopWhen. Always cap the max step count with stepCountIs(n). This is the most important guard preventing a runaway (infinite loop, token melt).
import { generateText, stepCountIs, tool } from "ai";
const { text, steps } = await generateText({
model: "anthropic/claude-sonnet-4.6",
tools: { getOrderStatus, searchKnowledgeBase },
// 上限を必ず設ける。無限ループとコスト爆発の最初の防波堤
stopWhen: stepCountIs(5),
system:
"あなたはカスタマーサポートです。ツールで事実を確認してから答えてください。",
prompt: "注文ORD-12345はいつ届きますか?",
});
Safe Tool Design (Permission Minimization, Side-Effect Isolation)
A tool's execution is requested by the LLM, and your code actually runs it. That's exactly why a tool's shape decides safety.
- Separate read and write.
searchKnowledgeBase(read) may auto-execute, but interpose a human approval forsendEmail/issueRefund(irreversible side effects). - Gate irreversible operations. A robust design is, when the step boundary is reached in the AI SDK's
stopWhen, to inspectsteps, confirm "no side-effect tool was called," and continue after showing an approval UI. - Re-validate the input inside
execute. Zod can bind the type, but authorization like "is thisorderIdthe user's" you check separately insideexecute. Don't trust the value the LLM produced and pull the DB.
const issueRefund = tool({
description: "返金を実行する。金額と注文IDが必要。",
inputSchema: z.object({
orderId: z.string(),
amount: z.number(),
}),
execute: async ({ orderId, amount }, { abortSignal }) => {
// 認可: このツールを起動したユーザーが当該注文の持ち主か再確認
await assertOwnership(currentUserId, orderId);
// 副作用は隔離されたサービス層経由でのみ実行
return await refundService.execute(orderId, amount, { abortSignal });
},
});
When you want to switch the model or
toolChoiceper step, useprepareStep(e.g. force tool start withtoolChoice: 'required'only on the first move).
Source: AI SDK Core: Tools and Tool Calling / Tool use overview
Streaming UX: Cancelable, Accessible Sequential Display
What erases the "being made to wait" feeling is the streaming UX. Combine the server's Route Handler and the client's useChat.
Server: Route Handler
Convert the UI messages from the client to model messages with convertToModelMessages, and return the streamText result with toUIMessageStreamResponse().
// app/api/chat/route.ts
import { convertToModelMessages, streamText, type UIMessage } from "ai";
export async function POST(req: Request) {
const { messages } = (await req.json()) as { messages: UIMessage[] };
const result = streamText({
model: "anthropic/claude-sonnet-4.6",
system: "あなたは親切なアシスタントです。簡潔に答えてください。",
messages: convertToModelMessages(messages),
});
return result.toUIMessageStreamResponse();
}
Client: useChat (Including Interrupt and Cancel)
@ai-sdk/react's useChat manages the messages, input, state (status), and interrupt (stop). Always tie status and stop to the UI — this is the dividing point of production quality.
"use client";
import { useChat } from "@ai-sdk/react";
import { DefaultChatTransport } from "ai";
import { useState } from "react";
export function Chat() {
const [input, setInput] = useState("");
const { messages, sendMessage, status, stop } = useChat({
transport: new DefaultChatTransport({ api: "/api/chat" }),
});
const isBusy = status === "submitted" || status === "streaming";
return (
<div>
{/* aria-live で、追記される応答をスクリーンリーダーに伝える */}
<div aria-live="polite">
{messages.map((m) => (
<article key={m.id}>
<strong>{m.role === "user" ? "あなた" : "AI"}</strong>
{m.parts.map((part, i) =>
part.type === "text" ? <span key={i}>{part.text}</span> : null,
)}
</article>
))}
</div>
<form
onSubmit={(e) => {
e.preventDefault();
if (!input.trim() || isBusy) return;
sendMessage({ text: input });
setInput("");
}}
>
<input
value={input}
onChange={(e) => setInput(e.target.value)}
aria-label="メッセージを入力"
/>
{isBusy ? (
// 生成中はキャンセルボタンに切り替える
<button type="button" onClick={() => stop()}>
停止
</button>
) : (
<button type="submit">送信</button>
)}
</form>
</div>
);
}
The accessibility points:
- Attach
aria-live="polite"to the response region and convey the sequential appending to the screen reader. - During generation (
streaming), switch the send button to a stop button and make it interruptible withstop(). - Prepare a
status === "error"branch and surface a re-send path.
Source: AI SDK UI: Chatbot / Streaming
Cost / Performance Optimization: Prompt Caching and Routing
Let me solve "I can't read the cost." What works most is prompt caching and model routing/fallback.
Prompt Caching: Reusing the Prefix
Prompt caching is an exact prefix match. Because it renders in the order tools → system → messages, place stable content (a fixed system prompt, knowledge) at the front, and variable content (the user's question) at the back. Insert Date.now() or a request ID at the front and everything after it is cache-invalidated.
To use Anthropic's caching in the AI SDK, attach providerOptions.anthropic.cacheControl to the target content.
import { generateText } from "ai";
const { text, providerMetadata } = await generateText({
model: "anthropic/claude-sonnet-4.6",
messages: [
{
role: "system",
content: LARGE_KNOWLEDGE_BASE, // 数千トークンの固定コンテキスト
providerOptions: {
anthropic: { cacheControl: { type: "ephemeral" } }, // キャッシュ境界
},
},
{ role: "user", content: userQuestion }, // 変動部分は後ろ。マーカー無し
],
});
// キャッシュの効きを必ず計測する
console.log(providerMetadata?.anthropic);
// cacheCreationInputTokens(書き込み)/ cacheReadInputTokens(読み出し)
Via the AI Gateway, you can also have it auto-apply a per-provider caching strategy with providerOptions.gateway.caching: 'auto' (handy for providers like Anthropic that need an explicit cache marker).
const result = streamText({
model: "anthropic/claude-sonnet-4.6",
messages,
providerOptions: { gateway: { caching: "auto" } },
});
The economics guide: a cache read is about 0.1× the base input unit price, and a write about 1.25× (5-minute TTL). If you use the same prefix 2 or more times, it's almost certainly a win. If cacheReadInputTokens is always 0, suspect a "silent invalidation" like a datetime.now() mixing in or a non-deterministic JSON order.
Routing and Fallback (AI Gateway)
Control both availability and cost with order / only / sort of providerOptions.gateway.
const result = await generateText({
model: "anthropic/claude-sonnet-4.6",
prompt,
providerOptions: {
gateway: {
order: ["bedrock", "anthropic"], // Bedrock優先、ダメならAnthropic
sort: "cost", // コスト最小のプロバイダから試す('ttft'/'tps'も可)
},
},
});
Measuring Token Usage and the Way to Think About Batches
- Record usage every time. Emit
usage/totalUsageto logs and aggregate per route, per model. This is the only way to "not be shocked by the bill." - Curb unneeded regeneration. Cache the result for the same input on the app side (within the range determinism allows). Not calling the LLM is the biggest cost reduction.
- Latency-independent large processing in batch/parallel. Independent tasks like classification and extraction, throw them in parallel to the cheap Haiku. Anthropic's Message Batches API is asynchronous, at 50% of the standard price.
Source: Prompt caching / AI Gateway: Provider Options
Reliability / Observability: Design on the Premise of Failure
In production, assemble on the premise that failure happens. An LLM is probabilistic, external APIs go down, and you hit rate limits too.
Retries, Timeouts, Rate Limits
- The Anthropic SDK / AI SDK auto-retries 429/5xx with exponential backoff (with a default retry count). Leave it to that while separately setting an app-specific cap.
- Pass
abortSignalto a long-running tool to propagate the upstream timeout/cancel (see theissueRefundabove). - A rate limit (429) has a
retry-afterheader, which the SDK reads and waits on. On a burst, the realistic solution is to fall back to Haiku or interpose queuing.
Observability: Usage, Latency, Failure Rate
At minimum, emit the following to structured logs. Visualize not "is it working" but "how much, how fast, and how often it's failing."
const started = performance.now();
const { text, usage, finishReason } = await generateText({
model: "anthropic/claude-sonnet-4.6",
prompt,
});
logger.info("llm.call", {
route: "summary",
model: "anthropic/claude-sonnet-4.6",
inputTokens: usage.inputTokens,
outputTokens: usage.outputTokens,
finishReason, // "stop" / "length" / "tool-calls" など
latencyMs: Math.round(performance.now() - started),
});
The AI Gateway itself also provides cross-provider observability of usage, latency, and spend.
Guardrails and Hallucination Suppression (Verification Gate)
Don't flow an LLM's output to production as-is. Always interpose a verification gate.
- Bind with a schema (the structured output above). If the shape is broken, don't let it go downstream.
- Have it fetch facts with tools. State in the system prompt "don't answer from knowledge, first call
search" to suppress unfounded assertions. - Verify important output in a separate step. Separate generation and verification (generation is exhaustive, verification selects in a separate pass). Institutionalize a step where "a human or another model confirms," like code review or a review.
The review-assistance / typo detection I built for the broadcaster was exactly this form of "generation is AI, the final confirmation is the verification gate." Place AI as an accelerator and guarantee quality assurance with human (and another-model) verification. This is the crux of achieving both "fast, cheap" and "safe."
Source: Errors / rate limits / Tool use overview
Security: API Keys, Prompt Injection, PII
Finally, the security requirements you must always confirm even when considering outsourcing.
Concealing the API Key
- Don't write the key in code. Put
AI_GATEWAY_API_KEY/ANTHROPIC_API_KEYin an environment variable or a secret manager. Never emit it to the repository, the client bundle, or logs. - Don't call the LLM directly from the browser. Always go through a server Route Handler (don't expose the key to the client). OIDC-token authentication is also an option on a Vercel deploy.
Prompt-Injection Countermeasures
The principle is to treat externally-sourced text (user input, fetched web pages, tool results) as data, not instructions.
- Permission separation: don't execute a tool-mediated operation on the LLM's judgment alone; protect irreversible side effects with an approval gate + server-side authorization (the
issueRefundabove). - Make the trust boundary explicit: don't conflate the system prompt (the operator's instructions) and user/fetched content (untrusted data). Even if the latter says "ignore the prior instructions," if the side-effect tool is protected by server-side authorization, the blast radius is limited.
- Validate the output: make it defense in depth where, even if the injection succeeds, the schema, authorization, and verification gate stop it.
Handling PII
- Don't leave the inquiry form's PII in logs more than necessary. The log example above too is just token counts and meta info, not emitting the body or PII.
- Judge the retention of personal data against regulation (GDPR / Act on the Protection of Personal Information). Don't store secrets in a memory feature, etc.
Source: Tool use overview / AI Gateway: Authentication
FAQ
Q1. Is the AI Gateway mandatory? Won't the specific package (@ai-sdk/anthropic) do?
It's not mandatory. But v6's default recommendation is the AI-Gateway-mediated "provider/model" string specification. You get multiple providers with 1 key, automatic fallback, and spend visibility, with no markup on the token unit price. The practical distinction is to use the specific package only when you have an explicit requirement like fine prompt-caching specification.
Q2. How do I distinguish generateObject vs streamObject vs generateText + Output?
If "a structured result is the main goal," like extraction, classification, or shaping, generateObject / streamObject is straightforward. On the other hand, if you want structured output within text generation (combined with tool use, etc.), there's the hand of combining Output.object() with generateText / streamText. Both can bind type-safely with a Zod schema.
Q3. I'm anxious about an agent running away and melting tokens.
Always cap the max step count with stopWhen: stepCountIs(n). In addition, stop irreversible-side-effect tools with an approval gate, and log usage to detect abnormal consumption. From Opus 4.7 on, you can also use the API-native Task Budgets (conveying the remaining tokens to the model).
Q4. Should I just make everything Opus 4.8?
No. The cost and latency don't add up. The optimal allocation is the default Sonnet 4.6, simple tasks (classification, extraction, preprocessing) to Haiku 4.5, only the hard steps to Opus 4.8. You can also adjust intelligence/cost with the effort parameter.
Q5. The output is occasionally factually wrong (hallucination). Model selection alone doesn't solve it. Stack the verification gates: ① bind with a schema, ② have it fetch facts with tools ("don't answer from knowledge, call search"), ③ verify important output in a separate step. The division of generation by AI, final confirmation by a human or another model, works.
Summary: Production Quality Is the Accumulation of "Unglamorous Techniques"
The wall of raising "just call an LLM" to production quality is crossed by the accumulation of unglamorous but effective techniques — structured output, safe tool design, cancelable streaming, prompt caching, and a verification gate. Claude API × AI SDK v6 is a combination that lets you implement these straightforwardly along the official patterns.
As a core engineer of a B2B SaaS that won the Minister of Economy, Trade and Industry Award, and as the builder of an in-house AI platform for a major domestic broadcaster (5 AI services, an auth hub, speech synthesis, OCR × speech-recognition typo detection, generative-AI review assistance), I have operated this "generation is AI, quality is the verification gate" design in the field. Because of one person × generative AI (Claude Code), I can deliver AI features fast, cheap, and safe without skipping production techniques.
If you're considering AI-feature development, production-izing from a PoC, or integrating AI into an existing system, feel free to consult me.
References (Anthropic / Vercel AI SDK Official)
- Anthropic — Models overview
- Anthropic — Tool use overview
- Anthropic — Prompt caching
- Anthropic — Streaming
- Anthropic — Effort
- Vercel — AI Gateway
- Vercel — AI Gateway: Provider Options
- AI SDK Core — Generating Text
- AI SDK Core — Generating Structured Data
- AI SDK Core — Tools and Tool Calling
- AI SDK UI — Chatbot (useChat)
- AI SDK Providers — Anthropic