# Turning Qwen3-8B-AWQ into an agent: a production design of Qwen-Agent × function calling

> A production design for turning a self-hosted Qwen3-8B-AWQ into a tool-using agent. It walks through, with code: enabling vLLM's Hermes-format tool calling, a type-safe tool contract (Zod → JSON Schema), a safe loop that validates arguments before executing, an iteration cap, idempotent side effects, authorization guards, and the official caution against ReAct in thinking mode.

- Published: 2026-06-25
- Author: 友田 陽大
- Tags: Qwen, Agents, Tool Use, vLLM, TypeScript, Zod, Generative AI
- URL: https://tomodahinata.com/en/blog/qwen3-agent-tool-use-function-calling-qwen-agent-production
- Category: Quantized LLMs & self-hosting
- Pillar guide: https://tomodahinata.com/en/blog/qwen3-8b-awq-self-hosting-reasoning-production-guide

## Key points

- Qwen3 supports Hermes-format tool calling. Start vLLM with `--enable-auto-tool-choice --tool-call-parser hermes` and tool calls come back as structured `tool_calls` through the OpenAI-compatible `tools` parameter.
- Tools are a 'typed contract': with a Zod schema as the source of truth, auto-generate the JSON Schema into the tools definition, and validate the arguments the model returns with Zod before execution. Never execute arguments that don't pass validation.
- A safe loop: an iteration cap, tools as an allowlist, side effects protected with an idempotency key + authorization. Don't exec the LLM output as-is — separate judgment (LLM) from execution (deterministic code).
- The official caution: for thinking models, stopword-dependent templates like ReAct aren't recommended (a stopword can appear in the thinking part and break the tool call). Use the Hermes format.
- Write it all yourself, or leave it to Qwen-Agent. Qwen-Agent absorbs the boilerplate of templates and parsing. If you want control, use a plain OpenAI-compatible loop; this article implements that loop type-safely and compares it with Qwen-Agent.

---

## The goal of this article

Agentification means lifting an LLM from "just answering" to "**using tools to do work**." [Qwen3-8B-AWQ](/blog/qwen3-8b-awq-self-hosting-reasoning-production-guide) supports function calling, so you can build a tool-executing agent **on your own GPU**, without data leaving your infrastructure.

But agents are **the most accident-prone** area. If you execute the arguments the LLM returns as-is, they can end up as SQL or a shell command. This piece shows how to separate **judgment (LLM) from execution (deterministic code)** and build, type-safely, **a safe loop that validates arguments before executing.**

> **Reliability disclosure**: the tool-calling spec, vLLM flags, and cautions for thinking models are based on [Qwen official (Function Calling)](https://qwen.readthedocs.io/en/latest/framework/function_call.html), [vLLM official (Tool Calling)](https://docs.vllm.ai/en/latest/features/tool_calling.html), and [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent). The design principle (separating judgment and execution) is aligned with [the design of tool use / function calling](/blog/ai-agent-tool-use-function-calling-production-design). Production GPU operation is something I worked through on the [video AI localization platform](/case-studies/ai-video-localization-lipsync).

---

## The 30-second conclusion

- **Enable**: `vllm serve Qwen/Qwen3-8B-AWQ --enable-auto-tool-choice --tool-call-parser hermes --reasoning-parser qwen3`. With this, the OpenAI-compatible `tools` work as-is.
- **Tools are a typed contract**: with Zod as the source of truth, auto-generate the JSON Schema of the `tools` definition. **Validate the returned arguments with Zod before execution.**
- **Always have safety devices**: an iteration cap, a tool allowlist, an idempotency key for side effects, and authorization. **Don't execute the LLM output as-is.**
- **The official caution**: for thinking models, **don't use stopword-dependent templates like ReAct.** Use the Hermes format.
- **Build or borrow**: if you want to offload the boilerplate, use **Qwen-Agent**; if you want control, use **a plain OpenAI-compatible loop** (implemented here).

---

## Serving: enable Hermes-format tool calling

Qwen3 recommends **Hermes-style** tool use. In vLLM, enable it with flags.

```bash
vllm serve Qwen/Qwen3-8B-AWQ \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --reasoning-parser qwen3 \
  --max-model-len 32768 --port 8000
```
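
For orientation, Hermes-style output wraps each call in `<tool_call>` tags around a JSON object with `name` and `arguments`, and the `hermes` parser turns that text into the structured `tool_calls` field. The sketch below only illustrates the wire format; it is not vLLM's parser, and `extractHermesToolCalls` and the sample text are hypothetical.

```ts
// Illustration only: a simplified reading of Hermes-style tool-call text.
// This is NOT vLLM's parser; it shows the kind of work the server does for you.
interface ParsedToolCall {
  name: string;
  arguments: unknown;
}

function extractHermesToolCalls(raw: string): ParsedToolCall[] {
  const found: ParsedToolCall[] = [];
  const pattern = /<tool_call>\s*([\s\S]*?)\s*<\/tool_call>/g;
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(raw)) !== null) {
    try {
      const body = JSON.parse(m[1] ?? "") as { name?: unknown; arguments?: unknown };
      if (typeof body.name === "string") {
        found.push({ name: body.name, arguments: body.arguments });
      }
    } catch {
      // Malformed JSON inside the tags: skip it rather than guess.
    }
  }
  return found;
}

// Invented sample of raw model text before the server parses it.
const sampleOutput =
  'Checking stock.\n<tool_call>\n{"name": "get_stock", "arguments": {"sku": "ABCD-1234"}}\n</tool_call>';
const parsedCalls = extractHermesToolCalls(sampleOutput);
```

With the flags above you never write this yourself: the client receives `message.tool_calls` already split out, which is what the loop later in this article consumes.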

> 🔧 **The official caution (thinking models)**: Qwen official states that, for thinking models, **stopword-dependent tool templates like ReAct aren't recommended.** The reason is "**a stopword can be output in the thinking part (`<think>`) and break the tool call.**" The safe measure is to **use the Hermes format (`--tool-call-parser hermes`).**
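
The failure mode is easy to reproduce without a model. The sketch below simulates a server that cuts generation at the first stop sequence, as a ReAct template relying on `Observation:` would. The sample output text and the helper `truncateAtStop` are invented for illustration.

```ts
// Simulates stop-sequence truncation: generation ends at the first occurrence
// of any stop string, wherever it appears (including inside <think>).
function truncateAtStop(generated: string, stops: readonly string[]): string {
  let cut = generated.length;
  for (const s of stops) {
    const at = generated.indexOf(s);
    if (at !== -1 && at < cut) cut = at;
  }
  return generated.slice(0, cut);
}

// Invented model output: the thinking part happens to mention "Observation:".
const thinkingOutput = [
  "<think>",
  "I should call get_stock. After the Observation: arrives I can answer.",
  "</think>",
  "Action: get_stock",
  'Action Input: {"sku": "ABCD-1234"}',
  "Observation:",
].join("\n");

const truncated = truncateAtStop(thinkingOutput, ["Observation:"]);
// The cut lands inside <think>, so the Action lines never reach the parser.
const actionSurvived = truncated.includes("Action: get_stock");
```

Here `actionSurvived` is `false`: the text is cut before `</think>`, so there is no tool call left to parse. The Hermes format does not depend on a stop string, which is why it avoids this problem.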

---

## Tools are a "typed contract": make Zod the source of truth

An agent's safety starts from **the tool definition being a type.** Write one Zod schema and derive from it both **the JSON Schema of the `tools` passed to the model** and **the parser that validates the arguments the model returns** (DRY, [the same philosophy as structured output](/blog/qwen3-structured-output-json-vllm-guided-decoding-zod)).

```ts
// lib/tools.ts: a tool is a contract of "name, argument schema, deterministic execute function"
import { z } from "zod";

export interface Tool<A extends z.ZodType> {
  readonly name: string;
  readonly description: string;
  readonly args: A;                              // argument schema (source of truth)
  readonly execute: (a: z.infer<A>) => Promise<unknown>; // execution is deterministic code
}

/** Stock lookup (read-only, no side effects) */
export const getStock: Tool<z.ZodObject<{ sku: z.ZodString }>> = {
  name: "get_stock",
  description: "Returns the stock quantity for a SKU",
  args: z.object({ sku: z.string().regex(/^[A-Z0-9-]{4,32}$/) }), // tighten the input boundary with the type
  execute: async ({ sku }) => ({ sku, qty: await stockRepo.count(sku) }),
};

/** Convert to an OpenAI-compatible tools definition (Zod → JSON Schema, auto-generated) */
export function toOpenAITool<A extends z.ZodType>(t: Tool<A>) {
  return {
    type: "function" as const,
    function: { name: t.name, description: t.description, parameters: z.toJSONSchema(t.args) },
  };
}

declare const stockRepo: { count(sku: string): Promise<number> };
```

The point is that `execute` is **an ordinary function.** The LLM only **judges** "which tool to call with which arguments," and **execution is held by our deterministic code.**
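
One consequence is that a tool can be unit-tested with no model in the loop. The sketch below re-creates `get_stock` without Zod so that it runs standalone: `parseSku` is a hand-written stand-in for the schema, and `fakeStockRepo` and `getStockChecked` are hypothetical names used only here.

```ts
// Stand-in for the Zod schema: same SKU pattern as getStock, no dependency.
const SKU_PATTERN = /^[A-Z0-9-]{4,32}$/;

function parseSku(input: unknown): { sku: string } | null {
  if (typeof input !== "object" || input === null) return null;
  const sku = (input as { sku?: unknown }).sku;
  return typeof sku === "string" && SKU_PATTERN.test(sku) ? { sku } : null;
}

// Hypothetical in-memory repository that records every lookup it receives.
const lookups: string[] = [];
const fakeStockRepo = {
  async count(sku: string): Promise<number> {
    lookups.push(sku);
    return sku === "ABCD-1234" ? 7 : 0;
  },
};

// Deterministic execution: validate first, touch the repository only on success.
async function getStockChecked(input: unknown): Promise<unknown> {
  const args = parseSku(input);
  if (args === null) return { error: "invalid arguments" };
  return { sku: args.sku, qty: await fakeStockRepo.count(args.sku) };
}
```

Calling `getStockChecked({ sku: "x'; DROP TABLE stock;--" })` resolves to the error object and leaves `lookups` empty: the hostile string is rejected at the boundary and never reaches the repository.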

---

## A safe tool-execution loop (world-class practice)

The essence of the agent loop is "**the model requests a tool → validate and execute → return the result → repeat until completion.**" Weave **a cap, validation, and idempotency** into it.

```ts
// lib/agent-loop.ts: a safe loop with an iteration cap, argument validation, and an allowlist
import OpenAI from "openai";
import type { Tool } from "./tools";
import { toOpenAITool } from "./tools";
import { z } from "zod";

const client = new OpenAI({ baseURL: process.env.QWEN_BASE_URL, apiKey: "internal", timeout: 60_000 });

export async function runAgent(
  userMessage: string,
  registry: ReadonlyMap<string, Tool<z.ZodType>>, // allowlist: tools not in here cannot be called
  maxSteps = 6,                                    // cap on runaways and infinite loops (required)
): Promise<string> {
  const messages: OpenAI.ChatCompletionMessageParam[] = [
    { role: "system", content: "Use tools only when needed. Be strict with arguments." },
    { role: "user", content: userMessage },
  ];
  const tools = [...registry.values()].map(toOpenAITool);

  for (let step = 0; step < maxSteps; step++) {
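    const res = await client.chat.completions.create({
      model: "Qwen/Qwen3-8B-AWQ", messages, tools,
      temperature: 0.7, top_p: 0.8,
      // vLLM-specific fields sit at the top level of the body. `extra_body` is a Python SDK
      // option; the Node SDK sends unknown fields as-is, hence the cast below.
      top_k: 20,
      chat_template_kwargs: { enable_thinking: false }, // tool selection is stable without thinking
    } as OpenAI.Chat.ChatCompletionCreateParamsNonStreaming);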
    const msg = res.choices[0]?.message;
    if (!msg) throw new Error("empty response");
    messages.push(msg);

    const calls = msg.tool_calls ?? [];
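    if (calls.length === 0) return msg.content ?? ""; // no tool needed = final answer

    // Validate each requested call before executing it (sequential or parallel, always return a result)
    for (const call of calls) {
      const tool = registry.get(call.function.name);
      // A request for a tool outside the allowlist is not executed; the model is told so (no crash)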
      const result = tool
        ? await executeChecked(tool, call.function.arguments)
        : { error: `tool not allowed: ${call.function.name}` };
      messages.push({ role: "tool", tool_call_id: call.id, content: JSON.stringify(result) });
    }
  }
  throw new AgentStepLimitError(`exceeded ${maxSteps} steps`); // exceeding the cap fails explicitly
}

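/** Arguments are external input. Execute only after Zod validation; on failure, return the issues to the model instead of executing. */
async function executeChecked(tool: Tool<z.ZodType>, rawArgs: string): Promise<unknown> {
  const parsed = tool.args.safeParse(safeJson(rawArgs));
  if (!parsed.success) {
    // Return paths and messages (not just a count) so the model has something to correct against.
    const issues = parsed.error.issues.map((i) => ({ path: i.path.map(String).join("."), message: i.message }));
    return { error: "invalid arguments", issues };
  }
  return tool.execute(parsed.data); // side effects first happen here
}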

const safeJson = (s: string): unknown => { try { return JSON.parse(s); } catch { return null; } };
export class AgentStepLimitError extends Error {}
```

These 60 lines pack in a production agent's safety devices.

- **Iteration cap (`maxSteps`)**: structurally stops infinite loops and runaways (and prevents cost explosion).
- **Allowlist (`registry`)**: an unregistered tool **can't be called.** Even if the model requests a phantom tool, it isn't executed.
- **Argument validation (`executeChecked`)**: `function.arguments` is **external input in the form of a JSON string.** Execute only when it passes `safeParse`. Arguments that fail validation are **sent back** to the model so it can self-correct.
- **Separation of judgment and execution**: the LLM only requests; **side effects happen inside deterministic code.**
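
The cap and the send-back behaviour can be exercised without a GPU. The sketch below mirrors the control flow of `runAgent` with a scripted stand-in for the model. `FakeModel`, `runLoopSketch`, `runOneCall`, and the simplified call and tool shapes are assumptions made for this illustration, not the OpenAI SDK types.

```ts
// Simplified shapes (assumptions for this sketch, not the OpenAI SDK types).
interface SketchCall { id: string; name: string; arguments: string }
interface SketchReply { content: string; toolCalls: SketchCall[] }
interface SketchTool {
  parse(raw: unknown): object | null;      // null means "invalid arguments"
  execute(args: object): Promise<unknown>; // deterministic code, reached only after parse
}
// A scripted stand-in for the model: it sees the tool results so far and replies.
type FakeModel = (toolResults: readonly string[]) => SketchReply;

class StepLimitError extends Error {}

async function runOneCall(registry: ReadonlyMap<string, SketchTool>, call: SketchCall): Promise<unknown> {
  const tool = registry.get(call.name);
  if (!tool) return { error: `tool not allowed: ${call.name}` }; // allowlist miss
  let raw: unknown = null;
  try { raw = JSON.parse(call.arguments); } catch { /* broken JSON stays null */ }
  const args = tool.parse(raw);
  if (args === null) return { error: "invalid arguments" };      // sent back, not executed
  return tool.execute(args);
}

async function runLoopSketch(
  model: FakeModel,
  registry: ReadonlyMap<string, SketchTool>,
  maxSteps: number,
): Promise<string> {
  const toolResults: string[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const reply = model(toolResults);
    if (reply.toolCalls.length === 0) return reply.content;      // final answer
    for (const call of reply.toolCalls) {
      toolResults.push(JSON.stringify(await runOneCall(registry, call)));
    }
  }
  throw new StepLimitError(`exceeded ${maxSteps} steps`);
}
```

Driving it with a fake model that keeps requesting an unregistered tool ends in `StepLimitError` after `maxSteps` rounds with nothing executed, while a model that first sends broken JSON and then valid arguments triggers exactly one execution.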

---

## Protect side-effecting tools with "idempotency + authorization"

Read-only tools (such as a stock query) are forgiving, but side-effecting tools such as **writes, sends, and payments** are a different matter. The LLM can **request the same call twice**, on a retry or in parallel.

```ts
// Side-effecting tools require an idempotency key plus authorization (prevents double and unauthorized execution)
export const refund: Tool<z.ZodObject<{ orderId: z.ZodString; idempotencyKey: z.ZodString }>> = {
  name: "refund",
  description: "Refunds an order (requires authorization, idempotent)",
  args: z.object({ orderId: z.string().uuid(), idempotencyKey: z.string().min(8) }),
  execute: async ({ orderId, idempotencyKey }) => {
    await authz.require("refund", orderId);          // authorization: on whose authority does this run
    return payments.refund(orderId, { idempotencyKey }); // idempotent: called twice, takes effect once
  },
};
declare const authz: { require(action: string, resource: string): Promise<void> };
declare const payments: { refund(id: string, o: { idempotencyKey: string }): Promise<unknown> };
```

The practice of **idempotency** is the same as [payment idempotency design](/blog/stripe-payments-production-guide-webhooks-idempotency-subscriptions). When an agent is involved, "**double execution**" is a realistic risk, so design side-effecting tools with **a mandatory idempotency key.** Enforce authorization **inside the tool execution, not in a UI if-statement.**
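
What `payments.refund` has to guarantee can be sketched with an in-memory store. `IdempotentExecutor` and `issueRefund` are hypothetical helpers, and the sketch assumes a single process: a real system would keep the keys in a durable store with a uniqueness constraint rather than a `Map`.

```ts
// Minimal in-memory idempotency layer (assumption: single process, no persistence).
class IdempotentExecutor<R> {
  private readonly byKey = new Map<string, Promise<R>>();

  // Runs `effect` at most once per key; repeats receive the first call's promise,
  // even when they arrive while the first call is still in flight.
  run(key: string, effect: () => Promise<R>): Promise<R> {
    const existing = this.byKey.get(key);
    if (existing) return existing;
    const started = effect();
    this.byKey.set(key, started);
    return started;
  }
}

let refundsIssued = 0;
const refundExecutor = new IdempotentExecutor<{ refundId: string }>();

// Hypothetical side effect standing in for a payment-provider call.
async function issueRefund(orderId: string, idempotencyKey: string): Promise<{ refundId: string }> {
  return refundExecutor.run(idempotencyKey, async () => {
    refundsIssued += 1;
    return { refundId: `rf_${orderId}_${refundsIssued}` };
  });
}
```

Two calls with the same key yield the same `refundId` and increment `refundsIssued` once. A failed attempt stays cached in this sketch, so whether a retry under the same key is allowed is a policy decision to make explicitly. Note also that the protection only holds if the key is stable across retries: if the model invents a fresh key each time, deriving the key in code (for example from the order and the conversation) may be more robust.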

---

## Self-built loop vs. Qwen-Agent

| | Self-built OpenAI-compatible loop (this article) | Qwen-Agent |
| --- | --- | --- |
| Control | **Fully held** (cap, validation, logs) | Delegated to the framework |
| Boilerplate | Write yourself | **Absorbs templates/parsing** |
| Learning cost | OK if you know the OpenAI SDK | Learn the library's conventions |
| Suits | Production build-out, audit requirements | Quick prototyping, standard tool integration |

[Qwen-Agent](https://github.com/QwenLM/Qwen-Agent) "templatizes Qwen3's function calling on the OpenAI-compatible API, with `llm.chat()` auto-handling tool conversion and parsing." The rule of thumb: **Qwen-Agent for prototyping and standard cases, a self-built loop for production where you want to own auditing and guards.** Either way, the principles of **argument validation, cap, idempotency, and authorization** don't change.

---

## Pitfalls & best practices

- 🔴 **Don't execute the LLM output as-is.** `function.arguments` is external input. Execute **only when it passes Zod validation.**
- 🔴 **Always place an iteration cap.** A loop without `maxSteps` is a breeding ground for cost explosion and infinite loops. Make exceeding it fail explicitly.
- 🔴 **Make side-effecting tools idempotent and authorized.** Structurally prevent double execution and out-of-privilege execution. Dangerous tools (arbitrary SQL/shell) **simply aren't registered.**
- 🟠 **Tool selection is stable even in non-thinking mode.** Tool calling itself is fast enough without thinking. Turn thinking on only when a complex plan is needed.
- 🟠 **Don't use ReAct/stopword templates for thinking models** (official). Use the Hermes format (`--tool-call-parser hermes`).
- 🟢 **Least privilege with the allowlist.** Keep the tools the agent can touch to the minimum necessary. Send back phantom tool requests.
- 🟢 **Make each step observable.** [Log with metadata](/blog/opentelemetry-observability-production-tracing-metrics-logs) which tool was called with which arguments and whether validation passed/failed (hide PII in arguments).
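
A step log entry that follows the last point might look like the sketch below. The field names, `toStepLog`, and the `SENSITIVE_KEYS` list are assumptions for illustration; adapt them to your own telemetry schema.

```ts
// Hypothetical shape of one tool-call log record.
interface ToolStepLog {
  step: number;
  tool: string;
  args: Record<string, unknown>;
  validation: "passed" | "failed";
}

// Assumed list of argument names that must never reach the logs in clear text.
const SENSITIVE_KEYS = new Set(["email", "phone", "cardNumber", "address"]);

// Replaces sensitive top-level values with a marker; other values pass through.
// Nested objects are not inspected in this sketch.
function redactArgs(args: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of Object.keys(args)) {
    out[key] = SENSITIVE_KEYS.has(key) ? "[REDACTED]" : args[key];
  }
  return out;
}

function toStepLog(step: number, tool: string, args: Record<string, unknown>, ok: boolean): ToolStepLog {
  return { step, tool, args: redactArgs(args), validation: ok ? "passed" : "failed" };
}

const exampleLog = toStepLog(2, "refund", { orderId: "o-42", email: "a@example.com" }, true);
```

Here `exampleLog.args` keeps `orderId` but carries `"[REDACTED]"` in place of the email address, so the record is safe to attach to a trace span or a structured log line.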

---

## Frequently asked questions (FAQ)

**Q. Can 8B do tool selection accurately?**
A. Simple-to-moderate tool selection is practical. The more you **narrow the number of tools** and clarify each `description` and argument schema, the more stable it gets. If complex multi-step planning is needed, consider thinking mode or routing to a larger model.

**Q. Can I use parallel tool calls?**
A. The model may return multiple tool calls in one response. If you design the loop to **validate and execute each call and return all results**, it works whether you run the calls sequentially or in parallel (this article's implementation runs them sequentially and returns every result).
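
If the tools involved are read-only, the inner `for` loop can be swapped for a parallel variant. A sketch, assuming simplified call and message shapes (`PendingCall`, `ToolMessage`) rather than the SDK types, and a hypothetical `runCall` callback that does the validation and execution:

```ts
interface PendingCall { id: string; name: string; arguments: string }
interface ToolMessage { role: "tool"; tool_call_id: string; content: string }
type CallRunner = (call: PendingCall) => Promise<unknown>;

// Runs all calls concurrently and returns exactly one tool message per call,
// in the original order. A thrown error becomes an error payload so that no
// tool_call_id is left unanswered.
async function runCallsInParallel(
  calls: readonly PendingCall[],
  runCall: CallRunner,
): Promise<ToolMessage[]> {
  return Promise.all(
    calls.map(async (call): Promise<ToolMessage> => {
      let payload: unknown;
      try {
        payload = await runCall(call);
      } catch (err) {
        payload = { error: err instanceof Error ? err.message : "tool failed" };
      }
      return { role: "tool", tool_call_id: call.id, content: JSON.stringify(payload) };
    }),
  );
}
```

`Promise.all` preserves input order, so the results line up with the requested calls even when a later call finishes first. For side-effecting tools, sequential execution is usually the safer default.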

**Q. What if `arguments` returns as broken JSON?**
A. `executeChecked`'s `safeJson` + `safeParse` **send it back without executing.** The model sees the sent-back message and self-corrects. The point is to **not cause side effects with broken arguments.**

**Q. Qwen-Agent or a self-built loop, which?**
A. **Qwen-Agent for prototyping and standard cases, a self-built loop for production where you want to own auditing, guards, and logs.** The principles (validation, cap, idempotency, authorization) are mandatory either way.

**Q. I'm worried about agent cost.**
A. Limit the step count with the **iteration cap** and the output tokens with **non-thinking mode.** Caching tool results and [model routing](/blog/llama-inference-cost-optimization-self-host-vs-api#コスト削減レバー効果の大きい順) also help. Make each step's token usage observable to find where to cut.
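
Per-step token accounting can be as small as the sketch below. It assumes each response carries the OpenAI-style `usage` object (`prompt_tokens`, `completion_tokens`); `TokenLedger` and the budget figure are hypothetical.

```ts
interface StepUsage { prompt_tokens: number; completion_tokens: number }

class TokenLedger {
  private readonly steps: StepUsage[] = [];
  private readonly completionBudget: number;

  constructor(completionBudget: number) {
    this.completionBudget = completionBudget;
  }

  // Record one step; a missing usage object is counted as zero.
  record(usage: StepUsage | undefined): void {
    this.steps.push(usage ?? { prompt_tokens: 0, completion_tokens: 0 });
  }

  // Sum of all recorded steps.
  totals(): StepUsage {
    return this.steps.reduce(
      (sum, u) => ({
        prompt_tokens: sum.prompt_tokens + u.prompt_tokens,
        completion_tokens: sum.completion_tokens + u.completion_tokens,
      }),
      { prompt_tokens: 0, completion_tokens: 0 },
    );
  }

  // True once the generated tokens exceed the budget for this run.
  overBudget(): boolean {
    return this.totals().completion_tokens > this.completionBudget;
  }
}
```

Inside the loop this would be `ledger.record(res.usage)` after each completion. Because the whole transcript is resent on every step, `prompt_tokens` tends to grow step by step, which is one more reason to keep `maxSteps` small.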

---

## Conclusion

An agent built on Qwen3-8B-AWQ holds up in production if you rigorously enforce "**judgment is the LLM, execution is deterministic code**" with **types and safety devices.**

1. **Enable with the Hermes format** (`--enable-auto-tool-choice --tool-call-parser hermes`). Don't use ReAct with a thinking model.
2. **Tools are a typed contract** — generate `tools` with Zod as the source of truth, and validate the returned arguments before execution.
3. **A safe loop** — iteration cap, allowlist, argument validation, send-back.
4. **Side effects are idempotent + authorized** — structurally prevent double execution and out-of-privilege execution.
5. **Qwen-Agent for prototyping, a self-built loop for production** — the principles are unchanged.

> I build production-quality agents on top of your own LLM, including tool design, a safe loop, idempotent side effects, authorization, and observability. Take a look at my AI platform [track record](/case-studies/ai-video-localization-lipsync) and get in touch. With **one person × generative AI**: fast, cheap, and safe.

### Sources / official resources

- [Qwen official (Function Calling)](https://qwen.readthedocs.io/en/latest/framework/function_call.html) — Hermes format, cautions for thinking models
- [vLLM official (Tool Calling)](https://docs.vllm.ai/en/latest/features/tool_calling.html) — `--enable-auto-tool-choice` / parser
- [Qwen-Agent (GitHub)](https://github.com/QwenLM/Qwen-Agent) — templatizing tool calling
- [Qwen3-8B-AWQ model card](https://huggingface.co/Qwen/Qwen3-8B-AWQ) — thinking mode, sampling

* The tool-calling spec and vLLM flags get updated. Always confirm with primary sources and your version before implementation.
