Turning Qwen3-8B-AWQ into an agent: a production design of Qwen-Agent × function calling

The goal of this article

Agentification means moving an LLM from "just answering" to "using tools to get work done." Qwen3-8B-AWQ supports function calling, so you can build a tool-executing agent on your own GPU without data leaving your infrastructure.

But agents are where accidents happen most easily. If you execute the arguments the LLM returns as-is, they can end up as arbitrary SQL or shell commands. This article shows how to separate judgment (the LLM) from execution (deterministic code) and build a type-safe loop that validates arguments before executing anything.

Reliability disclosure: the tool-calling spec, vLLM flags, and cautions for thinking models are based on the official Qwen documentation (Function Calling), the official vLLM documentation (Tool Calling), and Qwen-Agent. The design principle of separating judgment from execution follows the general design of tool use / function calling. The production GPU operation experience comes from my own work on a video AI localization platform.

The 30-second conclusion

Enable: vllm serve Qwen/Qwen3-8B-AWQ --enable-auto-tool-choice --tool-call-parser hermes --reasoning-parser qwen3. With these flags, the OpenAI-compatible tools parameter works as-is.
Tools are a typed contract: with Zod as the source of truth, auto-generate the JSON Schema for the tools definition, and validate the returned arguments with Zod before execution.
Always build in safety devices: an iteration cap, a tool allowlist, an idempotency key for side effects, and authorization. Never execute LLM output as-is.
The official caution: for thinking models, don't use stopword-dependent templates like ReAct. Use the Hermes format.
Build or borrow: use Qwen-Agent if you want to hand off the boilerplate, or a plain OpenAI-compatible loop (implemented here) if you want control.

Serving: enable Hermes-format tool calling

Qwen recommends Hermes-style tool use for Qwen3. In vLLM, you enable it with the following flags.

vllm serve Qwen/Qwen3-8B-AWQ \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --reasoning-parser qwen3 \
  --max-model-len 32768 --port 8000

🔧 The official caution (thinking models): the Qwen documentation states that stopword-dependent tool templates such as ReAct are not recommended for thinking models. The reason is "a stopword can be output in the thinking part (<think>) and break the tool call." The safe choice is the Hermes format (--tool-call-parser hermes).
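To make the failure mode concrete, here is a toy simulation. It is not how vLLM or Qwen actually parse output; the "Observation:" stop word, the sample output, and both helper functions are assumptions made up for illustration.

```typescript
// Toy illustration only: this is NOT the real vLLM or Qwen parser.
// The sample output and the "Observation:" stop word are hypothetical.

/** Simulates stop-word handling: the output is cut at the first occurrence of the stop string. */
function cutAtStopWord(output: string, stop: string): string {
  const i = output.indexOf(stop);
  return i === -1 ? output : output.slice(0, i);
}

/** Hermes-style extraction: drop the <think> block, then read the JSON inside <tool_call> tags. */
function extractHermesCalls(output: string): unknown[] {
  const visible = output.replace(/<think>[\s\S]*?<\/think>/g, "");
  const pattern = /<tool_call>\s*([\s\S]*?)\s*<\/tool_call>/g;
  const calls: unknown[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(visible)) !== null) {
    try {
      calls.push(JSON.parse(m[1] ?? ""));
    } catch {
      // Malformed payload: skip it rather than guess.
    }
  }
  return calls;
}

const sampleOutput =
  "<think>After the call I should wait for Observation: lines.</think>\n" +
  '<tool_call>{"name": "get_stock", "arguments": {"sku": "ABC-1234"}}</tool_call>';

// The stop word shows up inside the thinking text, so the cut happens before the tool call.
const truncated = cutAtStopWord(sampleOutput, "Observation:");
const truncatedHasCall = truncated.includes("<tool_call>");

// Tag-based parsing ignores the thinking text and still finds the call.
const hermesCalls = extractHermesCalls(sampleOutput);

console.log({ truncatedHasCall, hermesCallCount: hermesCalls.length });
```

The point of the sketch is only the contrast: a template that relies on a stop string can be derailed by free-form thinking text, while a tag-delimited format does not care what the thinking part contains.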

Tools are a "typed contract": make Zod the source of truth

An agent's safety starts with the tool definition being typed. Write one Zod schema and derive from it both the JSON Schema of the tools passed to the model and the parser that validates the arguments the model returns (DRY, the same philosophy as structured output).

// lib/tools.ts: a tool is a contract of name, argument schema, and a deterministic execute function
import { z } from "zod";

export interface Tool<A extends z.ZodType> {
  readonly name: string;
  readonly description: string;
  readonly args: A;                              // argument schema (source of truth)
  readonly execute: (a: z.infer<A>) => Promise<unknown>; // execution is deterministic code
}

/** Stock lookup (read-only, no side effects) */
export const getStock: Tool<z.ZodObject<{ sku: z.ZodString }>> = {
  name: "get_stock",
  description: "Returns the stock quantity for a SKU",
  args: z.object({ sku: z.string().regex(/^[A-Z0-9-]{4,32}$/) }), // tighten the input boundary with the type
  execute: async ({ sku }) => ({ sku, qty: await stockRepo.count(sku) }),
};

/** Convert to an OpenAI-compatible tools definition (auto-generates JSON Schema from Zod) */
export function toOpenAITool<A extends z.ZodType>(t: Tool<A>) {
  return {
    type: "function" as const,
    function: { name: t.name, description: t.description, parameters: z.toJSONSchema(t.args) },
  };
}

declare const stockRepo: { count(sku: string): Promise<number> };

The point is that execute is an ordinary function. The LLM only decides which tool to call with which arguments; execution stays in our deterministic code.
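As a quick check of what the sku pattern in get_stock lets through, the following sketch runs the same regular expression against a few inputs. The sample values are made up, and it calls RegExp.test directly rather than going through Zod, so treat it as an approximation of the schema's string check.

```typescript
// Same pattern as the `sku` field in get_stock; the candidate inputs are invented examples.
const SKU_PATTERN = /^[A-Z0-9-]{4,32}$/;

const candidates = [
  "ABC-1234",                 // well-formed SKU
  "abc-1234",                 // lowercase letters are outside the character class
  "A1",                       // shorter than 4 characters
  "X'; DROP TABLE stock;--",  // injection attempt: quotes, spaces, semicolons
  "SKU-0001\n",               // trailing newline
];

// Record, for each candidate, whether the pattern accepts it.
const verdicts = candidates.map((value) => ({ value, accepted: SKU_PATTERN.test(value) }));

for (const { value, accepted } of verdicts) {
  console.log(JSON.stringify(value), accepted ? "accepted" : "rejected");
}
```

Only the first candidate passes. Anything that could carry SQL or shell syntax is rejected at the schema boundary, before execute ever sees it.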

A safe tool-execution loop (world-class practice)

The essence of the agent loop is "the model requests a tool → validate and execute → return the result → repeat until completion." Weave a cap, validation, and idempotency into it.

// lib/agent-loop.ts: a safe loop with an iteration cap, argument validation, and an allowlist
import OpenAI from "openai";
import type { Tool } from "./tools";
import { toOpenAITool } from "./tools";
import { z } from "zod";

const client = new OpenAI({ baseURL: process.env.QWEN_BASE_URL, apiKey: "internal", timeout: 60_000 });

// vLLM-specific request fields that are not part of the OpenAI types
type VllmExtras = { top_k?: number; chat_template_kwargs?: { enable_thinking?: boolean } };

export async function runAgent(
  userMessage: string,
  registry: ReadonlyMap<string, Tool<z.ZodType>>, // allowlist: tools not listed here cannot be called
  maxSteps = 6,                                    // cap against runaways and infinite loops (required)
): Promise<string> {
  const messages: OpenAI.ChatCompletionMessageParam[] = [
    { role: "system", content: "Use tools only when necessary. Be strict about arguments." },
    { role: "user", content: userMessage },
  ];
  const tools = [...registry.values()].map(toOpenAITool);

  for (let step = 0; step < maxSteps; step++) {
    // extra_body is a Python SDK option; the Node SDK sends extra top-level keys to vLLM as-is,
    // so the vLLM-only fields go directly into the request body.
    const params: OpenAI.Chat.ChatCompletionCreateParamsNonStreaming & VllmExtras = {
      model: "Qwen/Qwen3-8B-AWQ", messages, tools,
      temperature: 0.7, top_p: 0.8, top_k: 20,
      chat_template_kwargs: { enable_thinking: false }, // tool selection is stable without thinking
    };
    const res = await client.chat.completions.create(params);
    const msg = res.choices[0]?.message;
    if (!msg) throw new Error("empty response");
    messages.push(msg);

    const calls = msg.tool_calls ?? [];
    if (calls.length === 0) return msg.content ?? ""; // no tool needed = final answer

    // Validate each requested tool call before executing it, and always return a result
    for (const call of calls) {
      const tool = registry.get(call.function.name);
      // A tool outside the allowlist is not executed; the model is told so (the loop does not crash)
      const result = tool
        ? await executeChecked(tool, call.function.arguments)
        : { error: `tool not allowed: ${call.function.name}` };
      messages.push({ role: "tool", tool_call_id: call.id, content: JSON.stringify(result) });
    }
  }
  throw new AgentStepLimitError(`exceeded ${maxSteps} steps`); // exceeding the cap fails explicitly
}

/** Arguments are external input. Execute only after Zod validation; on failure, return the issues to the model. */
async function executeChecked(tool: Tool<z.ZodType>, rawArgs: string): Promise<unknown> {
  const parsed = tool.args.safeParse(safeJson(rawArgs));
  if (!parsed.success) {
    // Return the issue messages (not just a count) so the model has something to correct against
    const issues = parsed.error.issues.map((i) => `${i.path.map(String).join(".")}: ${i.message}`);
    return { error: "invalid arguments", issues };
  }
  return tool.execute(parsed.data); // side effects can only happen from this point on
}

const safeJson = (s: string): unknown => { try { return JSON.parse(s); } catch { return null; } };
export class AgentStepLimitError extends Error {}

These roughly 60 lines pack in the safety devices a production agent needs.

Iteration cap (maxSteps): structurally stops infinite loops and runaways, which also prevents cost explosion.
Allowlist (registry): an unregistered tool can't be called. Even if the model requests a phantom tool, it isn't executed.
Argument validation (executeChecked): function.arguments is a JSON string and must be treated as external input. Execute only when it passes safeParse; on failure, return the error to the model so it can self-correct.
Separation of judgment and execution: the LLM only requests; side effects happen inside deterministic code.
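The cap and the allowlist can be exercised without a GPU by replacing the model with a scripted stand-in. The sketch below is a simplified dry-run harness, not the article's runAgent: the names Turn, ScriptedModel, and runScripted are hypothetical, and schema validation is reduced to a JSON parse to keep the focus on the two guards.

```typescript
// Dry-run harness: a scripted function plays the model, so no server is needed.
// All names here are hypothetical and not part of any SDK.
type ToolCall = { id: string; name: string; args: string };
type Turn = { content: string; toolCalls: ToolCall[] };
type ToolMessage = { toolCallId: string; content: string };
type ScriptedModel = (toolResults: readonly ToolMessage[]) => Turn;
type ToolFn = (args: unknown) => Promise<unknown>;

class StepLimitError extends Error {}

const parseJson = (s: string): unknown => { try { return JSON.parse(s); } catch { return null; } };

async function runScripted(
  model: ScriptedModel,
  registry: ReadonlyMap<string, ToolFn>,
  maxSteps: number,
): Promise<{ answer: string; transcript: ToolMessage[] }> {
  const transcript: ToolMessage[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const turn = model(transcript);
    if (turn.toolCalls.length === 0) return { answer: turn.content, transcript };
    for (const call of turn.toolCalls) {
      const tool = registry.get(call.name);
      // Allowlist: an unknown tool is never executed; the refusal goes back to the model.
      const result = tool
        ? await tool(parseJson(call.args))
        : { error: `tool not allowed: ${call.name}` };
      transcript.push({ toolCallId: call.id, content: JSON.stringify(result) });
    }
  }
  throw new StepLimitError(`exceeded ${maxSteps} steps`);
}

// Scripted model 1: asks for a phantom tool once, then gives a final answer.
const phantomModel: ScriptedModel = (results) =>
  results.length === 0
    ? { content: "", toolCalls: [{ id: "c1", name: "drop_database", args: "{}" }] }
    : { content: "I cannot do that.", toolCalls: [] };

// Scripted model 2: never stops requesting tools.
const runawayModel: ScriptedModel = (results) => ({
  content: "",
  toolCalls: [{ id: `c${results.length}`, name: "get_stock", args: '{"sku":"ABC-1234"}' }],
});

const demoRegistry = new Map<string, ToolFn>([["get_stock", async () => ({ qty: 3 })]]);
```

Running phantomModel should end with a normal answer and one "tool not allowed" entry in the transcript, while runawayModel should end in StepLimitError once maxSteps is reached. A harness like this is one way to regression-test the guards in CI.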

Protect side-effecting tools with "idempotency + authorization"

Read-only tools (such as the stock query) are forgiving, but side-effecting tools such as writes, sends, and payments are a different matter. The LLM can request the same call twice, on retry or in parallel.

// Side-effecting tools require an idempotency key and authorization (prevents double and unauthorized execution)
export const refund: Tool<z.ZodObject<{ orderId: z.ZodString; idempotencyKey: z.ZodString }>> = {
  name: "refund",
  description: "Refunds an order (requires authorization, idempotent)",
  args: z.object({ orderId: z.string().uuid(), idempotencyKey: z.string().min(8) }),
  execute: async ({ orderId, idempotencyKey }) => {
    await authz.require("refund", orderId);          // authorization: whose privileges is this run under
    return payments.refund(orderId, { idempotencyKey }); // idempotency: called twice, takes effect once
  },
};
declare const authz: { require(action: string, resource: string): Promise<void> };
declare const payments: { refund(id: string, o: { idempotencyKey: string }): Promise<unknown> };

Idempotency here works the same way as in payment API design. Once an agent is involved, double execution is a realistic risk, so design side-effecting tools with a mandatory idempotency key. Enforce authorization inside the tool's execution path, not in a UI if-statement.
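One thing to watch in the refund example: the key arrives as a model-supplied argument, and a model may well produce a different key when it retries. A possible variation, sketched here under assumptions, derives the key on the server from data the model cannot change between retries. The conversationId parameter, canonicalJson, idempotencyKey, and OnceStore are all hypothetical names, and the in-memory store stands in for whatever your payment provider or database offers.

```typescript
import { createHash } from "node:crypto";

/** Serializes JSON-like values with sorted object keys, so equal arguments always hash equally. */
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map((v) => canonicalJson(v)).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const fields = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalJson(obj[k])}`);
    return `{${fields.join(",")}}`;
  }
  return JSON.stringify(value);
}

/** Hypothetical helper: same conversation + same tool + same arguments => same key. */
function idempotencyKey(conversationId: string, toolName: string, args: unknown): string {
  return createHash("sha256")
    .update(canonicalJson([conversationId, toolName, args]))
    .digest("hex");
}

/**
 * In-memory once-only runner for illustration. It caches the promise, including a rejected one;
 * a real system would persist keys and decide explicitly how failed attempts are retried.
 */
class OnceStore {
  private readonly started = new Map<string, Promise<unknown>>();

  run(key: string, effect: () => Promise<unknown>): Promise<unknown> {
    const existing = this.started.get(key);
    if (existing) return existing; // duplicate request: reuse the first attempt
    const attempt = effect();
    this.started.set(key, attempt);
    return attempt;
  }
}

// Usage: a retried or parallel duplicate maps to the same key, so the effect starts only once.
const store = new OnceStore();
let refundCalls = 0;
const keyA = idempotencyKey("conv-1", "refund", { orderId: "o-1", amount: 5 });
const keyB = idempotencyKey("conv-1", "refund", { amount: 5, orderId: "o-1" });
const first = store.run(keyA, async () => { refundCalls++; return "refunded"; });
const second = store.run(keyB, async () => { refundCalls++; return "refunded"; });
```

With this design the model never sees or chooses the key. The trade-off is that an intentional repeat of the exact same call within one conversation is also collapsed, which is usually what you want for refunds but may not suit every tool.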

Self-built loop vs. Qwen-Agent

	Self-built OpenAI-compatible loop (this article)	Qwen-Agent
Control	Fully held (cap, validation, logs)	Delegated to the framework
Boilerplate	Write yourself	Absorbs templates/parsing
Learning cost	OK if you know the OpenAI SDK	Learn the library's conventions
Suits	Production build-out, audit requirements	Quick prototyping, standard tool integration

Qwen-Agent "templatizes Qwen3's function calling on the OpenAI-compatible API, with llm.chat() auto-handling tool conversion and parsing." The rule of thumb: use Qwen-Agent for prototyping and standard cases, and a self-built loop for production where you need to own auditing and guardrails. Either way, the principles of argument validation, iteration cap, idempotency, and authorization don't change.

Pitfalls & best practices

🔴 Don't execute the LLM output as-is. function.arguments is external input. Execute only when it passes Zod validation.
🔴 Always set an iteration cap. A loop without maxSteps is a breeding ground for cost explosion and infinite loops. Make exceeding the cap fail explicitly.
🔴 Make side-effecting tools idempotent and authorized. Structurally prevent double execution and out-of-privilege execution. Dangerous tools (arbitrary SQL/shell) simply don't get registered.
🟠 Tool selection is stable even in non-thinking mode, and tool calling itself is fast enough there. Use thinking only when a complex plan is needed.
🟠 Don't use ReAct/stopword templates for thinking models (official). Use the Hermes format (--tool-call-parser hermes).
🟢 Apply least privilege through the allowlist. Keep the tools the agent can touch to the minimum necessary, and send phantom tool requests back to the model.
🟢 Make each step observable. Log, as metadata, which tool was called with which arguments and whether validation passed or failed (mask PII in the arguments).
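For the observability item, a per-step log record could look roughly like the sketch below. The StepLog shape, the list of sensitive keys, and the function names are assumptions for illustration; match them to your own logger and data model.

```typescript
// Hypothetical log shape and redaction list; adapt both to your own system.
const SENSITIVE_KEYS = new Set(["email", "phone", "address", "cardNumber"]);

type StepLog = {
  step: number;
  tool: string;
  validation: "passed" | "failed";
  args: unknown;
};

/** Recursively replaces the values of sensitive keys so raw PII never reaches the log sink. */
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((v) => redact(v));
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      out[key] = SENSITIVE_KEYS.has(key) ? "[redacted]" : redact(source[key]);
    }
    return out;
  }
  return value;
}

/** Builds one structured record per tool call from the raw arguments string the model returned. */
function toStepLog(step: number, tool: string, rawArgs: string, validationPassed: boolean): StepLog {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawArgs);
  } catch {
    parsed = "[unparseable]"; // never log the broken raw string, which may contain PII
  }
  return { step, tool, validation: validationPassed ? "passed" : "failed", args: redact(parsed) };
}

const exampleLog = toStepLog(
  0,
  "notify_customer",
  '{"orderId":"o-1","contact":{"email":"a@example.com"}}',
  true,
);
console.log(JSON.stringify(exampleLog));
```

Key-based redaction is deliberately simple. It will not catch PII embedded in free-text fields, so treat it as a first layer rather than a guarantee.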

Frequently asked questions (FAQ)

Q. Can 8B do tool selection accurately? A. Simple-to-moderate tool selection is practical. The fewer tools you expose and the clearer the descriptions and argument schemas, the more stable it gets. If complex multi-step planning is needed, consider thinking mode or routing to a larger model.

Q. Can I use parallel tool calls? A. The model may return multiple tool calls in one response. If the loop validates and executes each call and returns every result, it works whether you run the calls sequentially or in parallel (this article's implementation handles multiple calls, running them sequentially).
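If you do want concurrency within one turn, a minimal sketch could look like this. CheckedCall and runCalls are hypothetical names, and run stands for a tool invocation whose arguments have already been validated.

```typescript
// Sketch of concurrent execution for the tool calls of a single model turn.
type CheckedCall = { id: string; run: () => Promise<unknown> };
type ToolResultMessage = { role: "tool"; tool_call_id: string; content: string };

async function runCalls(calls: readonly CheckedCall[]): Promise<ToolResultMessage[]> {
  // Promise.all keeps the results in request order; each call catches its own failure,
  // so one failing tool cannot drop the results of the others.
  return Promise.all(
    calls.map(async (call) => {
      let result: unknown;
      try {
        result = await call.run();
      } catch {
        result = { error: "tool execution failed" };
      }
      return {
        role: "tool" as const,
        tool_call_id: call.id,
        content: JSON.stringify(result ?? null),
      };
    }),
  );
}
```

Every tool_call_id gets exactly one tool message back, which is what the next model turn expects. Concurrency is a natural fit for read-only tools; for side-effecting tools, sequential execution or the idempotency design above is the safer default.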

Q. What if arguments comes back as broken JSON? A. executeChecked rejects it through safeJson + safeParse and returns an error without executing. The model sees the returned error and self-corrects. The point is to never cause side effects with broken arguments.

Q. Qwen-Agent or a self-built loop, which? A. Qwen-Agent for prototyping and standard cases, a self-built loop for production where you want to hold audit, guards, and logs. The principles (validation, cap, idempotency, authorization) are mandatory either way.

Q. I'm worried about agent cost. A. Limit the step count with the iteration cap and the output tokens with non-thinking mode. Caching tool results and model routing also help. Make each step's token usage observable to find where to cut.
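One way to make per-step token usage observable is a small budget tracker fed from the usage object of each response. The sketch below is an assumption-laden example: TokenBudget is a hypothetical class, the limit is arbitrary, and only the prompt_tokens and completion_tokens field names come from the OpenAI-compatible response format.

```typescript
// Hypothetical budget tracker; field names mirror the OpenAI-compatible `usage` object.
type Usage = { prompt_tokens: number; completion_tokens: number };

class TokenBudget {
  private readonly perStep: Usage[] = [];
  private readonly maxTotalTokens: number;

  constructor(maxTotalTokens: number) {
    this.maxTotalTokens = maxTotalTokens;
  }

  /** Records one step's usage and reports whether the run is still within budget. */
  record(usage: Usage): boolean {
    this.perStep.push(usage);
    return this.total() <= this.maxTotalTokens;
  }

  /** Total tokens spent so far across all recorded steps. */
  total(): number {
    return this.perStep.reduce((sum, u) => sum + u.prompt_tokens + u.completion_tokens, 0);
  }

  /** Index of the most expensive step (or -1 if nothing was recorded). */
  costliestStep(): number {
    let best = -1;
    let bestCost = -1;
    this.perStep.forEach((u, i) => {
      const cost = u.prompt_tokens + u.completion_tokens;
      if (cost > bestCost) { best = i; bestCost = cost; }
    });
    return best;
  }
}

// Usage: call record() after each completion and stop the loop when it returns false.
const budget = new TokenBudget(1000);
const withinAfterFirst = budget.record({ prompt_tokens: 400, completion_tokens: 100 });
const withinAfterSecond = budget.record({ prompt_tokens: 500, completion_tokens: 200 });
```

Because the whole message history is resent on every step, prompt tokens tend to grow step by step, so a token budget complements maxSteps rather than replacing it.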

Conclusion

An agent built on Qwen3-8B-AWQ holds up in production if you rigorously enforce "the LLM judges, deterministic code executes" with types and safety devices.

Enable with the Hermes format (--enable-auto-tool-choice --tool-call-parser hermes). Don't use ReAct with a thinking model.
Tools are a typed contract: generate the tools definition with Zod as the source of truth, and validate the returned arguments before execution.
A safe loop: iteration cap, allowlist, argument validation, and errors sent back to the model.
Side effects are idempotent and authorized: structurally prevent double execution and out-of-privilege execution.
Qwen-Agent for prototyping, a self-built loop for production: the principles are unchanged.

I build production-quality agents on top of your own LLM, covering tool design, a safe loop, idempotent side effects, authorization, and observability. Take a look at my AI platform track record and get in touch. One person plus generative AI: fast, cheap, and safe.

Sources / official resources

Qwen official (Function Calling) — Hermes format, cautions for thinking models
vLLM official (Tool Calling) — --enable-auto-tool-choice / parser
Qwen-Agent (GitHub) — templatizing tool calling
Qwen3-8B-AWQ model card — thinking mode, sampling

The tool-calling spec and vLLM flags change over time. Always check the primary sources against the version you are running before implementing.
