友田 陽大
Quantized LLMs & self-hosting
Tags: Qwen, AWQ, quantization, vLLM, generative AI, self-hosting, Python

Qwen3-8B-AWQ practical guide: self-hosting a 'reasoning LLM' on a single GPU with 4-bit quantization

This guide explains Qwen3-8B-AWQ while staying faithful to the official documentation. AWQ 4-bit quantization compresses the weights to about 6GB, so the model can run in production on a single 24GB GPU. The guide covers switching between thinking and non-thinking modes (hybrid thinking), OpenAI-compatible serving with vLLM, the recommended sampling settings for each mode, extending the context to 131K with YaRN, tool calling, and quantization-specific pitfalls (presence_penalty and the ban on greedy decoding), all with working code.

Reading time: 16 min read
Author: 友田 陽大

The goal of this article

In my overview of putting open-weight LLMs into production, I wrote that "for self-hosting, use vLLM" and that "cost is decided by quantization and utilization." This article is a concrete example of both. The subject is Qwen3-8B-AWQ: an Apache-2.0 open-weight model in which the Qwen team at Alibaba has quantized its 8B model to 4-bit with AWQ.

What makes this model interesting is one point: it is small and cheap, yet it "thinks." Because AWQ compresses the weights to about one third of their FP16 size, the model fits on a single 24GB GPU, and the same model can switch between a reasoning mode and a fast dialogue mode. In other words, o1-style staged reasoning on your own server, at low cost, becomes realistic.

The goal is not just to be able to type vllm serve. It is to understand, in line with the official documentation, which feature to use in which situation and which settings degrade quality when you leave them out. All code is shown in a form you can run on your own machine.

Where my experience comes from: I ran into the difficulties of running LLMs on GPUs in production (VRAM, throughput, resilience) first-hand on a video AI localization platform. The specs, commands, and sampling values in this article are all based on official primary sources: the Qwen3-8B-AWQ model card, the official Qwen documentation, and the vLLM documentation. Actual VRAM and throughput figures depend on the environment and must be benchmarked.


The 30-second conclusion: when to choose Qwen3-8B-AWQ

| | Qwen3-8B-AWQ (4-bit) | Qwen3-8B (FP16/BF16) | Closed API |
|---|---|---|---|
| Weight VRAM | ≈6GB (needs measurement) | ≈16GB | 0 (no self-hosting) |
| GPU it fits | a single 24GB-class GPU (L4 / A10 / RTX 4090, etc.) | 24GB+ | n/a |
| Thinking mode | yes (switchable) | yes | model-dependent |
| License | Apache-2.0 (commercial OK) | Apache-2.0 | terms of use |
| Data sovereignty | self-contained | self-contained | external transmission |
| Operation | self (this article) | self | almost zero |

The essence is this one line: Qwen3-8B-AWQ is an option for "running a small model that can think, without sending data outside, cheaply on a single GPU."

  • Suited: on-prem or in-VPC inference, projects where sensitive data cannot leave your network, cutting the cost of the "reasoning role" in RAG or an agent, and anything from proof of concept to small-scale production on one GPU.
  • Not suited: you need the absolute highest quality (use a bigger model), or you want to start with zero operations work (start with an API). For the full set of decision criteria, see the article on the API vs. self-hosting break-even point.

What Qwen3-8B-AWQ is (the official specs accurately)

First, let me lay out the official facts without inflating them (source: the model card).

| Item | Value |
|---|---|
| Total parameters | 8.2B (of which 6.95B non-embedding) |
| Number of layers | 36 |
| Attention | GQA: 32 Q heads / 8 KV heads |
| Native context length | 32,768 tokens |
| Extended context length | 131,072 tokens (when using YaRN) |
| Quantization | AWQ 4-bit |
| License | Apache-2.0 |
| Citation | Qwen3 Technical Report (arXiv:2505.09388) |

The Qwen team states that Qwen3 surpasses the earlier QwQ and Qwen2.5 models in reasoning ability, and that alignment with human preferences in creative writing, role-play, and instruction following has also improved (see the technical report for specific scores). Rather than quoting benchmark numbers, this article concentrates on running the model in production exactly as the official spec describes.

🔎 Beware of variants: Hugging Face lists Qwen/Qwen3-8B (full precision), Qwen/Qwen3-8B-FP8, Qwen/Qwen3-8B-Base, community MLX builds, and others. This article covers the official Qwen/Qwen3-8B-AWQ (4-bit). Note the GQA layout of 32 Q heads and 8 KV heads: fewer KV heads mean a smaller KV cache per token, which makes it easier to serve many concurrent requests and so suits self-hosting.


What AWQ (4-bit quantization) is: why it fits on "a single GPU"

AWQ = Activation-aware Weight Quantization. The official documentation describes it as "hardware-friendly low-bit weight quantization." What matters is the "activation-aware" part: quantization is done so as not to damage the weights that matter most to the output, which limits quality loss even at 4-bit. The AutoAWQ implementation reports about a 3x memory reduction and about a 3x speedup compared with FP16 (see the AWQ documentation).

A rough VRAM estimate makes "why it fits on one" click.

FP16 weights  ≈ 8.2B × 2 byte ≈ 16.4 GB
AWQ 4bit      ≈ 8.2B × 0.5 byte ≈ 4.1 GB (+ scale/zero-point, ~6GB measured)

In other words, the weights come to roughly 6GB. The remaining VRAM goes to the KV cache, which determines concurrency and context length. On a single 24GB GPU, the picture is about 6GB of weights plus plenty of room for the KV cache (always pin down actual concurrency and throughput with a load test).
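The arithmetic above can be written out as a small sketch. The group size of 128, the per-group metadata layout, and the idea that the embedding parameters stay in 16-bit are assumptions made for illustration, not figures from the model card; the function names are hypothetical. Treat the result as a ballpark and measure the real footprint on your own GPU.

```python
# Rough weight-memory estimate for an 8.2B-parameter model.
# Every constant below is an illustrative assumption, not a measured value.

GB = 1e9  # decimal gigabytes, matching the back-of-the-envelope figures in the text


def fp16_weight_gb(total_params: float) -> float:
    """16-bit weights: 2 bytes per parameter."""
    return total_params * 2 / GB


def awq4_weight_gb(total_params: float, non_embedding_params: float,
                   group_size: int = 128) -> float:
    """4-bit weights plus per-group metadata, with embeddings assumed to stay 16-bit."""
    # Assumption: each group of `group_size` weights carries one 16-bit scale
    # and one 4-bit zero-point.
    bits_per_weight = 4 + (16 + 4) / group_size
    quantized_bytes = non_embedding_params * bits_per_weight / 8
    # Assumption: the embedding-side parameters are not quantized.
    unquantized_bytes = (total_params - non_embedding_params) * 2
    return (quantized_bytes + unquantized_bytes) / GB


fp16 = fp16_weight_gb(8.2e9)
awq = awq4_weight_gb(8.2e9, 6.95e9)
print(f"FP16 ≈ {fp16:.1f} GB, AWQ 4-bit ≈ {awq:.1f} GB")
```

Under these assumptions the estimate lands near the ~6GB figure quoted above, which suggests that most of the gap between the naive 4.1GB and the observed size comes from parameters that are not stored in 4-bit.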

| Quantization | Weight VRAM | Quality | Main use |
|---|---|---|---|
| FP16/BF16 | ≈16GB | baseline | the highest quality when there's room |
| AWQ 4-bit | ≈6GB | practically slight degradation | 1-GPU self-host, cost optimization |
| FP8 | ≈8GB | almost no degradation | speed-focused on a supported GPU (H100, etc.) |

💡 Which to use: if you have an FP8-capable GPU such as the H100, FP8 is also a strong choice (see the article on serving Llama in FP8). For simply loading the model and running it cheaply on a 24GB GPU such as the L4, A10, or RTX 4090, AWQ 4-bit is the standard choice. Choosing among AWQ, GPTQ, FP8, and GGUF is covered in detail in the article on how to choose a quantization method.


The killer feature: hybrid thinking (thinking / non-thinking)

Qwen3's biggest differentiator is the ability to switch between "think first" and "answer immediately" within one model.

  • Thinking mode: reasons in stages inside <think> … </think> before answering. Strong at math, code, and complex judgment.
  • Non-thinking mode: answers immediately without emitting a reasoning block. For uses where speed and cost matter, like classification, extraction, and routine dialogue.

There are two ways to switch (official spec)

  1. Hard switch: apply_chat_template(..., enable_thinking=True/False) (through the API, chat_template_kwargs).
  2. Soft switch: with enable_thinking=True, writing /think or /no_think in a user message overrides the mode for that turn.
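A minimal sketch of the soft switch, assuming the usual OpenAI-style list of message dictionaries. The helper name with_soft_switch and the sample messages are hypothetical; the only thing taken from the spec above is that /think or /no_think in the user message sets the mode for that turn.

```python
from copy import deepcopy


def with_soft_switch(messages: list, think: bool) -> list:
    """Return a copy of `messages` with /think or /no_think appended to the last user turn."""
    tag = "/think" if think else "/no_think"
    out = deepcopy(messages)  # leave the caller's history untouched
    for msg in reversed(out):
        if msg.get("role") == "user":
            msg["content"] = f'{msg["content"].rstrip()} {tag}'
            return out
    raise ValueError("no user message to attach the soft switch to")


history = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Classify this ticket as bug or feature request."},
]
fast_turn = with_soft_switch(history, think=False)
print(fast_turn[-1]["content"])
```

The soft switch only takes effect when the request itself was made with enable_thinking=True, so this helper complements the hard switch rather than replacing it.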

Note that the official documentation specifies different sampling settings for each mode.

| Parameter | Thinking mode | Non-thinking mode |
|---|---|---|
| Temperature | 0.6 | 0.7 |
| TopP | 0.95 | 0.8 |
| TopK | 20 | 20 |
| MinP | 0 | 0 |
| greedy decoding | forbidden | |

⚠️ An explicit warning from the official documentation: do not use greedy decoding (equivalent to temperature=0) in thinking mode. It invites infinite loops, repetition, and degraded performance. Setting temp=0 because you want deterministic reasoning output is counterproductive.

The official documentation also gives guidance on output length: 32,768 tokens for most queries, and up to 38,912 tokens for complex problems such as math or programming. Because thinking mode reasons at length before answering, a max_tokens that is too small cuts the output off before it reaches a conclusion.
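One way to keep these settings in a single place is a small helper that builds the keyword arguments for an OpenAI-compatible call. This is a sketch: the names SAMPLING and request_kwargs are hypothetical, and passing top_k, min_p, and chat_template_kwargs through extra_body assumes a vLLM server behind the OpenAI Python client.

```python
from typing import Optional

# Per-mode sampling values copied from the table above.
SAMPLING = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},
}


def request_kwargs(mode: str, max_tokens: int,
                   temperature: Optional[float] = None) -> dict:
    """Build keyword arguments for chat.completions.create on an OpenAI-compatible server."""
    if mode not in SAMPLING:
        raise ValueError(f"unknown mode: {mode!r}")
    preset = SAMPLING[mode]
    temp = preset["temperature"] if temperature is None else temperature
    # Guard against the one setting the documentation explicitly forbids.
    if mode == "thinking" and temp <= 0:
        raise ValueError("greedy decoding (temperature=0) is not allowed in thinking mode")
    return {
        "temperature": temp,
        "top_p": preset["top_p"],
        "max_tokens": max_tokens,
        # Not standard OpenAI fields, so they travel in extra_body (assumes vLLM).
        "extra_body": {
            "top_k": preset["top_k"],
            "min_p": preset["min_p"],
            "chat_template_kwargs": {"enable_thinking": mode == "thinking"},
        },
    }


kwargs = request_kwargs("thinking", max_tokens=32768)
print(kwargs["temperature"], kwargs["extra_body"]["chat_template_kwargs"])
```

A call site can then write client.chat.completions.create(model=..., messages=..., **kwargs), and an accidental temperature=0 in thinking mode fails fast instead of silently degrading output. Keep in mind that the prompt and max_tokens together still have to fit inside the server's configured context length.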


Running it ①: minimal confirmation with transformers (for development)

Start with a minimal local check. Running AWQ weights requires the AWQ kernels, so install autoawq (for production, vLLM, described later, is recommended).

# pip install -U transformers accelerate autoawq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Which is larger, 9.11 or 9.9? Explain why."}]

# enable_thinking=True turns on staged reasoning inside <think>...</think> (the hard switch)
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Official recommended sampling for thinking mode (never greedy)
generated = model.generate(
    **inputs, max_new_tokens=32768,
    do_sample=True,  # explicit, so temperature/top_p/top_k are guaranteed to apply
    temperature=0.6, top_p=0.95, top_k=20,
)
# Keep only the newly generated tokens (drop the prompt)
output_ids = generated[0][len(inputs.input_ids[0]):].tolist()
print(tokenizer.decode(output_ids, skip_special_tokens=True))

For a quick single-machine trial during development, this is enough. But generate in transformers has no continuous batching, so it is not suited to many simultaneous requests in production. That is where vLLM comes in.


Running it ②: production serving with vLLM (OpenAI-compatible)

For production, use vLLM. Continuous batching and PagedAttention keep the GPU from idling, and because vLLM exposes an OpenAI-compatible endpoint, you can switch the target URL without changing application code (building the full foundation is covered in the vLLM production self-hosting operations log).

# Serve Qwen3-8B-AWQ with an OpenAI-compatible API and parse the thinking block separately
vllm serve Qwen/Qwen3-8B-AWQ \
  --reasoning-parser qwen3 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --port 8000

  • --reasoning-parser qwen3: the form used with vLLM 0.9.0 and later. It separates the <think> block from the answer and adds a reasoning_content field to the response. On older vLLM versions, use --enable-reasoning --reasoning-parser deepseek_r1.
  • --max-model-len 32768: the native upper limit. Narrowing it to the length you actually need is the basic rule for controlling cost and latency (do not default to the maximum).
  • --gpu-memory-utilization 0.90: the fraction of GPU memory vLLM reserves. Too high risks OOM; too low leaves less room for the KV cache and reduces concurrency.
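To see how --max-model-len and --gpu-memory-utilization interact, it helps to estimate the KV cache per token. The sketch below uses the layer and KV-head counts from the spec table; the head dimension of 128, the 16-bit KV cache, and the 2GB allowance for activations and runtime overhead are assumptions for illustration, and the function names are hypothetical. Real capacity should come from vLLM's startup log and a load test.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Keys and values (the factor 2) for every layer and KV head, for one token."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_cached_tokens(gpu_gb: float, utilization: float, weights_gb: float,
                      overhead_gb: float, per_token_bytes: int) -> int:
    """Tokens that fit in the memory left after weights and an assumed runtime overhead."""
    budget_bytes = (gpu_gb * utilization - weights_gb - overhead_gb) * 1e9
    return max(0, int(budget_bytes // per_token_bytes))


# 36 layers and 8 KV heads come from the spec table; head_dim=128 is an assumption.
per_token = kv_bytes_per_token(layers=36, kv_heads=8, head_dim=128)

# 24GB GPU at 0.90 utilization, ~6GB of weights, 2GB overhead (placeholder value).
tokens = max_cached_tokens(24, 0.90, 6.0, 2.0, per_token)
full_length_sequences = tokens // 32768

print(f"{per_token / 1024:.0f} KiB per token, room for about {tokens:,} cached tokens")
print(f"about {full_length_sequences} concurrent sequences at the full 32,768 tokens")
```

Under these assumptions only a handful of full-length sequences fit at once, while the same budget holds many more short ones. That is the practical reason to set --max-model-len to what the workload needs rather than to the maximum.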

Once the server is up, you can call it with the OpenAI client as-is. Toggle thinking with extra_body.chat_template_kwargs.enable_thinking.

from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-8B-AWQ",
    messages=[{"role": "user", "content": "Point out where stock could go negative in this SQL"}],
    temperature=0.6, top_p=0.95,
    # If a quantized model falls into endless repetition, set presence_penalty between 0 and 2 (1.5 recommended)
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,  # vLLM extension, not a standard OpenAI field
        "chat_template_kwargs": {"enable_thinking": True},  # thinking mode
    },
)
msg = resp.choices[0].message
print("Reasoning:", getattr(msg, "reasoning_content", None))  # contents of <think> (for logs/audit)
print("Answer:", msg.content)                                  # final answer returned to the user

Having reasoning_content (the thinking process) and content (the final answer) structurally separated pays off in production. Return only content to the user and handle reasoning_content internally for auditing and evaluation. This separation is the foundation for the type safety and observability described later.


Long context: 32K → 131K with YaRN (only when needed)

The native context length is 32,768 tokens. Extend it with YaRN only when you need more. It can also be declared on the CLI.

vllm serve Qwen/Qwen3-8B-AWQ \
  --reasoning-parser qwen3 \
  --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
  --max-model-len 131072

The official documentation also shows the equivalent config.json form (factor: 4.0, original_max_position_embeddings: 32768).

⚠️ Do not leave YaRN enabled all the time. The official documentation clearly says to add it only when needed. Applying long-context scaling to short inputs as well has the side effect of lowering quality on short text. If 32K is enough, remove YaRN; setting the maximum "just in case" is counterproductive. The longer the context, the more the KV cache grows and the more VRAM and latency increase, so set max-model-len to match actual demand.
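The "only when needed" rule can be encoded as a small helper that returns no scaling at all for workloads that fit the native window, and otherwise picks the smallest whole-number factor that covers the target length. The function name and the rounding policy are assumptions for illustration; the JSON keys mirror the --rope-scaling example above.

```python
import json
import math
from typing import Optional

NATIVE_CONTEXT = 32_768     # native context length from the spec table
MAX_EXTENDED = 131_072      # extended context length with YaRN (factor 4.0)


def yarn_rope_scaling(target_len: int) -> Optional[str]:
    """Return a --rope-scaling JSON string, or None when the native window is enough."""
    if target_len <= NATIVE_CONTEXT:
        return None  # leave YaRN off for short-context workloads
    if target_len > MAX_EXTENDED:
        raise ValueError(f"{target_len} exceeds the documented maximum of {MAX_EXTENDED}")
    # Assumed policy: smallest whole-number factor that covers the target length.
    factor = float(math.ceil(target_len / NATIVE_CONTEXT))
    return json.dumps({
        "rope_type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": NATIVE_CONTEXT,
    })


for length in (8_000, 65_536, 131_072):
    print(length, yarn_rope_scaling(length))
```

A deployment script could use the returned string as the value of --rope-scaling and omit the flag entirely when the helper returns None.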


Tool calling / agents

Qwen3 supports function calling (tool use). In vLLM, enable it with a flag.

vllm serve Qwen/Qwen3-8B-AWQ \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes

After that, if you pass OpenAI-compatible tools, the model returns a function call when needed. If you are building agent-style tool integration, the official Qwen-Agent (which includes templates for tool definitions and parsing) saves you from writing the tool-call handling yourself. A safe loop, including Zod validation of arguments, an iteration cap, and idempotent side effects, is worked out in the article on making Qwen3 an agent. For the core design principle, which is to separate the judgment you leave to the LLM from the execution you hand to deterministic code, see the article on designing tool use / function calling.
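As an illustration of that split between LLM judgment and deterministic execution, here is a minimal sketch of the execution side. The get_stock tool, its schema, and the stubbed return value are hypothetical; only the overall tools shape follows the OpenAI-compatible format. The model proposes a tool name and a JSON argument string, and plain code decides whether to run it.

```python
import json

# Hypothetical tool definition in the OpenAI-compatible "tools" format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_stock",
        "description": "Return the current stock level for a SKU.",
        "parameters": {
            "type": "object",
            "properties": {"sku": {"type": "string"}},
            "required": ["sku"],
        },
    },
}]


def get_stock(sku: str) -> dict:
    """Stub implementation; a real one would query an inventory system."""
    return {"sku": sku, "quantity": 12}


REGISTRY = {"get_stock": get_stock}


def dispatch(name: str, arguments_json: str) -> str:
    """Validate a model-proposed call, run it, and return a JSON string for the tool message."""
    if name not in REGISTRY:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError:
        return json.dumps({"error": "arguments were not valid JSON"})
    params = next(t["function"]["parameters"] for t in TOOLS
                  if t["function"]["name"] == name)
    missing = [k for k in params["required"] if not isinstance(args, dict) or k not in args]
    unknown = [k for k in args if k not in params["properties"]] if isinstance(args, dict) else []
    if missing or unknown or not isinstance(args, dict):
        return json.dumps({"error": "arguments do not match the schema"})
    return json.dumps(REGISTRY[name](**args))


print(dispatch("get_stock", '{"sku": "A-1"}'))
print(dispatch("drop_table", "{}"))
```

Errors are returned as data rather than raised, so the result can be appended as a tool message and the model gets a chance to correct itself. A production loop would also cap the number of iterations and check argument types, which this sketch leaves out.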


In what scenes to use it: a type-safe production client (application)

From here on is the applied part. The strengths of Qwen3-8B-AWQ are that it can think, it is cheap, and it is self-hosted. Wrapping it in type safety, idempotency, and resilience turns it into a component that pays for itself in production. A typical application is self-hosted RAG that does not send sensitive data outside.

Design policy (following the principles in CLAUDE.md: SRP, KISS, and type-safe boundaries):

  1. Route the mode by difficulty: easy tasks go to non-thinking (fast, cheap), and only the hard parts go to thinking (the main lever for cost optimization).
  2. Separate the thinking process from the answer: send reasoning_content to logs, and validate and return only content.
  3. Validate structured output with Zod at the boundary: LLM output is external input, so reject anything invalid with parse (the discipline of type safety). For a two-stage setup with guided decoding that prevents invalid JSON from being generated in the first place, see the article on type-safe structured output.
  4. Timeout + idempotent cache: never generate twice for the same input, which also keeps cost stable.

// lib/qwen-client.ts: a thin client that routes thinking/non-thinking and type-validates output at the boundary
import OpenAI from "openai";
import { z } from "zod";
import { createHash } from "node:crypto";

const client = new OpenAI({
  baseURL: process.env.QWEN_BASE_URL, // e.g. http://qwen-internal:8000/v1 (keep it private, never public)
  apiKey: "internal",
  timeout: 60_000, // thinking mode runs long; too short a timeout cuts off legitimate deliberation
});

/** Task difficulty → mode and official recommended sampling (single source of truth) */
const MODE = {
  fast: { enable_thinking: false, temperature: 0.7, top_p: 0.8, top_k: 20 },
  think: { enable_thinking: true, temperature: 0.6, top_p: 0.95, top_k: 20 },
} as const;
type Difficulty = keyof typeof MODE;

/** Expected structured output. Do not trust plausible-looking LLM strings; reject them here */
const RiskFinding = z.object({
  hasRisk: z.boolean(),
  severity: z.enum(["low", "medium", "high"]),
  reason: z.string().min(1),
});
type RiskFinding = z.infer<typeof RiskFinding>;

const cache = new Map<string, RiskFinding>(); // replace with Redis or similar in production

/** Same input, same result (idempotent). Do not waste cost on repeated clicks, retries, or duplicate requests */
const keyOf = (prompt: string, d: Difficulty) =>
  createHash("sha256").update(`qwen3-8b-awq:${d}:${prompt}`).digest("hex");

export async function assessRisk(prompt: string, difficulty: Difficulty = "think"): Promise<RiskFinding> {
  const key = keyOf(prompt, difficulty);
  const hit = cache.get(key);
  if (hit) return hit;

  const m = MODE[difficulty];
  const resp = await client.chat.completions.create({
    model: "Qwen/Qwen3-8B-AWQ",
    messages: [
      { role: "system", content: 'Reply with JSON only: {"hasRisk":bool,"severity":"low|medium|high","reason":string}' },
      { role: "user", content: prompt },
    ],
    temperature: m.temperature,
    top_p: m.top_p,
    response_format: { type: "json_object" },
    presence_penalty: 1.5, // guards against repetition in quantized models (official range 0-2, OpenAI-compatible)
    // vLLM extensions are sent at the top level. The Node SDK forwards unknown keys in the request body,
    // and because they are spread in, there is no type error.
    // Note: the Python SDK's extra_body does not exist in the TS SDK, so it is not used.
    ...{ top_k: m.top_k, chat_template_kwargs: { enable_thinking: m.enable_thinking } },
  });

  // Validate content only. Send reasoning_content (the thinking process) to the audit log (without PII)
  const raw = resp.choices[0]?.message.content ?? "{}";
  const finding = RiskFinding.parse(JSON.parse(raw)); // throws on an invalid shape → handle it upstream

  cache.set(key, finding);
  return finding;
}

The key point of this client is that it never takes the output of a "reasoning LLM" at face value. reasoning_content is kept for human auditing and evaluation, and only data that has passed Zod validation is returned to the user. In addition, switching enable_thinking by task difficulty lets a single model support a cost design in which the easy 80% is handled cheaply in non-thinking mode and only the hard 20% is handled accurately in thinking mode.


Production build-out (avoiding duplication, just the key points)

Observability, autoscaling, graceful drain, resilience, and network isolation work the same way whether the model is Qwen or Llama. To avoid duplication, the details are left to the vLLM production self-hosting operations log. Only the points that matter especially for Qwen3-8B-AWQ are listed here:

  • Observability: monitor num_requests_waiting, gpu_cache_usage_perc, and TTFT through vLLM's /metrics endpoint (Prometheus). Because thinking mode tends to produce long output, charting reasoning token consumption per use case gives a feel for the cost (see the OpenTelemetry correlation design).
  • Resilience: do not let the failure of your own node become a total outage. Combine timeouts, retries, and a fallback. If the 8B node goes down, fail over to another node or a larger model (retry, circuit breaker).
  • Security: vLLM's OpenAI-compatible server has no strong authentication of its own. Do not expose it directly to the internet. Place it in a private VPC, and handle authentication, rate limiting, and auditing at an API gateway in front of it. Do not keep prompt bodies (PII) in logs; not sending data outside is, after all, the motive for self-hosting.
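The resilience point can be sketched as a small wrapper: retry the primary node a bounded number of times with exponential backoff, then hand the request to a fallback. Everything here is a hypothetical illustration (function names, retry counts, delays); the two callables stand in for "call the 8B node" and "call another node or a larger model." A real implementation would catch only timeout and connection errors and would add a circuit breaker.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T], fallback: Callable[[], T],
                       retries: int = 2, base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Try `primary` up to retries + 1 times with exponential backoff, then use `fallback`."""
    for attempt in range(retries + 1):
        try:
            return primary()
        except Exception:  # sketch only: narrow this to timeout/connection errors in real code
            if attempt < retries:
                sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    return fallback()


# Simulated primary that times out twice before succeeding.
attempts = {"n": 0}


def flaky_primary() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "answer from 8B node"


result = call_with_fallback(flaky_primary, lambda: "answer from fallback",
                            sleep=lambda _: None)  # skip real sleeping in the demo
print(result)
```

Injecting sleep as a parameter keeps the wrapper testable without waiting. Because thinking-mode requests are long, the total retry budget should be weighed against the client timeout.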

Gotchas & official best practices

From the official documentation, here are the items that degrade quality if you skip them.

  • 🔴 Greedy decoding is forbidden in thinking mode. Do not set temp=0. It causes repetition and infinite loops.
  • 🟠 presence_penalty ≈ 1.5 against repetition in quantized models (range 0 to 2). Raising it too far can occasionally cause language mixing or a slight performance drop, so apply only as much as needed when repetition appears.
  • 🟠 Do not keep thinking content (<think>) in multi-turn history. Carry only the final answer into the next turn's input. Accumulating thinking content pollutes the context and hurts both quality and cost.
  • 🟠 Do not cap the output length too tightly. Thinking runs long before the answer. Budget max_tokens at about 32,768 normally and 38,912 for complex problems.
  • 🟢 For math, specify the output format: "Please reason step by step, and put your final answer within \boxed{}." makes the final solution easier to extract by machine.
  • 🟢 Use YaRN only when needed. Enabling it all the time lowers quality on short text.
  • 🟢 With transformers, AWQ depends on the AWQ kernels. For production throughput, rely on vLLM (continuous batching).
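Two of these items are easy to automate: keeping thinking content out of the history, and extracting a \boxed{} answer. The sketch below assumes you are working with raw text that still contains the <think> block (for example the transformers path; with vLLM's reasoning parser, content already arrives without it). The helper names are hypothetical.

```python
import re
from typing import Optional

_THINK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks so only the final answer remains."""
    return _THINK.sub("", text).strip()


def append_assistant_turn(history: list, raw_output: str) -> list:
    """Store only the final answer in the multi-turn history."""
    history.append({"role": "assistant", "content": strip_thinking(raw_output)})
    return history


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, handling nested braces."""
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    begin = start + len(marker)
    depth = 1
    for i in range(begin, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i]
    return None  # unbalanced braces: treat as "no answer found"


raw = "<think>\n9.9 = 9.90, and 9.90 > 9.11\n</think>\n\nThe larger one is \\boxed{9.9}."
history = append_assistant_turn([], raw)
print(history[0]["content"])
print(extract_boxed(raw))
```

A brace counter is used instead of a regular expression because answers such as \boxed{\frac{1}{2}} contain nested braces that a simple pattern would cut short.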

Frequently asked questions (FAQ)

Q. Qwen3-8B-AWQ or the FP8 version: which should I use?
A. For loading the model and running it cheaply on a 24GB-class general-purpose GPU (L4 / A10 / RTX 4090), use AWQ 4-bit. For speed first on an FP8-capable GPU such as the H100, use FP8. The weights are roughly AWQ ≈6GB / FP8 ≈8GB / FP16 ≈16GB (needs measurement).

Q. Should "thinking" always be on?
A. No. Classification, extraction, and routine dialogue are fine in non-thinking mode (fast, cheap). Send only math, code, and complex judgment to thinking mode. Routing by task difficulty is the main lever in cost design (see the client example in the body).

Q. Can I use greedy decoding (temp=0) for reproducibility?
A. In thinking mode, the official documentation forbids it. If you need deterministic output, secure reproducibility not through the reasoning but through the final answer's schema (Zod) and an idempotent cache. Disabling sampling itself is counterproductive.

Q. How many GPUs do I need?
A. Start with one. Because the weights are ≈6GB, the model runs on a single 24GB GPU. Concurrency and throughput depend on the KV cache, so find the saturation point with a load test and work out the number of GPUs from there.

Q. Does it run on Ollama too?
A. Ollama is handy for a quick single-machine trial in local development. For high-throughput production (continuous batching), use vLLM. Use the convenient option for development and vLLM for production.

Q. Is the license commercially usable?
A. Yes, it is Apache-2.0. You hold the weights, can modify them, and run them in your own environment. There are no constraints like Llama's 700M MAU limit or "Built with Llama" attribution requirement, so the licensing side is simpler (for the difference from a closed API, see the pillar article).


Summary

Qwen3-8B-AWQ straightforwardly satisfies the requirement that was previously hard to reconcile: "running a small model that can think, without sending data outside, cheaply on a single GPU."

  1. With AWQ 4-bit, weights ≈6GB → fits on a single 24GB GPU (Apache-2.0, commercial OK).
  2. Route hybrid thinking by difficulty — easy cheaply in non-thinking, hard accurately in thinking.
  3. OpenAI-compatible serving with vLLM (--reasoning-parser qwen3), separate thinking and answer with reasoning_content.
  4. Strictly observe the official sampling (thinking: 0.6/0.95/20, greedy forbidden, presence_penalty≈1.5, don't keep thinking in the history).
  5. Type-validate output at the boundary (Zod) + idempotent cache. The production practice is consolidated in the vLLM operation log.

I build a "reasoning LLM" on-prem / inside a VPC, including observability, resilience, and type safety. See my GPU-production track record and consult me on model selection, cost design, and self-host migration. With one person × generative AI, fast, cheap, and safe.

Sources / official resources

※ Specs, flags, and sampling recommendations get updated. VRAM and throughput are environment-dependent and need benchmarking. Before implementing, always confirm with the primary sources and your own bench.


友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.
