# Qwen3-8B-AWQ practical guide: self-hosting a 'reasoning LLM' on a single GPU with 4-bit quantization

> A guide to Qwen3-8B-AWQ that stays faithful to the official documentation. AWQ 4-bit quantization compresses the weights to about 6GB, which lets the model run in production on a single 24GB GPU. It covers, with real code: switching hybrid thinking (thinking/non-thinking), OpenAI-compatible serving with vLLM, the recommended sampling per mode, extending context to 131K with YaRN, tool calling, and quantization-specific pitfalls (presence_penalty, no greedy decoding).

- Published: 2026-06-25
- Author: 友田 陽大
- Tags: Qwen, AWQ, quantization, vLLM, generative AI, self-hosting, Python
- URL: https://tomodahinata.com/en/blog/qwen3-8b-awq-self-hosting-reasoning-production-guide
- Category: Quantized LLMs & self-hosting

## Key points

- Qwen3-8B-AWQ is an open-weight model under Apache-2.0. AWQ 4-bit quantization shrinks the weights to roughly a third of FP16 (≈6GB), so you can self-host a 'reasoning LLM' on a single 24GB-class GPU.
- The biggest differentiator is 'hybrid thinking': `enable_thinking` switches one model between a complex-reasoning mode and a fast-dialogue mode. The `/think` and `/no_think` soft switches are also part of the official spec.
- The recommended sampling differs per mode: Temp 0.6 / TopP 0.95 / TopK 20 for thinking, Temp 0.7 / TopP 0.8 / TopK 20 for non-thinking. The official documentation explicitly warns against greedy decoding in thinking mode.
- With vLLM, `vllm serve Qwen/Qwen3-8B-AWQ --reasoning-parser qwen3` gives you an OpenAI-compatible server. Thinking and answer are separated via `reasoning_content`, and tool calling and YaRN (131K) are enabled with flags.
- Notes specific to quantized models: use presence_penalty≈1.5 against endless repetition, and don't keep thinking content in the history. Deviating from the official best practices degrades quality.

---

## The goal of this article

In [the big picture of putting an open-weight LLM into production](/blog/meta-llama-open-weight-llm-production-guide), I wrote that "for self-hosting, use vLLM" and that "what decides the cost is quantization and utilization." This piece is **the concrete worked example of that advice.** The subject is **Qwen3-8B-AWQ**: the 8B model from Alibaba's Qwen team, **quantized to 4-bit with AWQ** and released as open weights under Apache-2.0.

What makes this model interesting comes down to one point: it is **"small and cheap, yet it 'thinks.'"** Because AWQ compresses the weights to about a third, it fits on **a single 24GB GPU,** and **one model can switch between a reasoning mode and a fast-dialogue mode.** In other words, o1-style staged reasoning on your own server, at low cost, becomes realistic.

The goal isn't merely to be able to type `vllm serve`. It is to reach **a state where, staying faithful to the official documentation, you understand which feature to use in which situation, and which settings degrade quality if you skip them.** All code is shown in a form you can run on your own machine.

> **Where my credibility comes from**: the difficulty of running an LLM in production on a GPU (VRAM, throughput, resilience) is something I ran into first-hand on the [video-AI-localization platform](/case-studies/ai-video-localization-lipsync). The specs, commands, and sampling values in this article are all based on official primary sources: the [Qwen3-8B-AWQ model card](https://huggingface.co/Qwen/Qwen3-8B-AWQ), the [Qwen official documentation](https://qwen.readthedocs.io/), and the [vLLM documentation](https://docs.vllm.ai/). To be explicit, real-world VRAM and throughput figures are **environment-dependent and need benchmarking.**

---

## The 30-second conclusion: when to choose Qwen3-8B-AWQ

| | Qwen3-8B-AWQ (4-bit) | Qwen3-8B (FP16/BF16) | Closed API |
| --- | --- | --- | --- |
| Weight VRAM | **≈6GB (needs measurement)** | ≈16GB | 0 (no self-hosting) |
| GPU it fits | **a single 24GB-class** (L4 / A10 / RTX 4090, etc.) | 24GB+ | — |
| Thinking mode | **yes (switchable)** | yes | model-dependent |
| License | **Apache-2.0 (commercial OK)** | Apache-2.0 | terms of use |
| Data sovereignty | **self-contained** | self-contained | external transmission |
| Operation | self (this article) | self | almost zero |

**The essence in one line**: Qwen3-8B-AWQ is the option for "**running a small model that can think, cheaply on a single GPU, without sending data outside.**"

- ✅ **Well suited**: on-prem or in-VPC inference, projects where sensitive data can't leave your environment, cutting the cost of the "reasoning role" in RAG or an agent, and running anything from verification to small-scale production on one GPU.
- ❌ **Not suited**: when you need the absolute highest quality (go to a bigger model), or when you want to start with zero operations work (start with an API). For the full set of decision criteria, see [the break-even of API vs. self-host](/blog/llama-inference-cost-optimization-self-host-vs-api).

---

## What Qwen3-8B-AWQ is (the official specs, stated accurately)

First, the **official facts**, without inflating them (source: the [model card](https://huggingface.co/Qwen/Qwen3-8B-AWQ)).

| Item | Value |
| --- | --- |
| Total parameters | 8.2B (of which 6.95B non-embedding) |
| Number of layers | 36 |
| Attention | GQA: 32 Q heads / 8 KV heads |
| Native context length | 32,768 tokens |
| Extended context length | 131,072 tokens (when using YaRN) |
| Quantization | **AWQ 4-bit** |
| License | **Apache-2.0** |
| Citation | Qwen3 Technical Report (arXiv:2505.09388) |

According to the official documentation, Qwen3 **surpasses the earlier QwQ and Qwen2.5 families in reasoning ability,** and its alignment with human preferences in creative writing, role-play, and instruction-following has also improved (for specific scores, see the technical report). This article does not argue over benchmark numbers; it concentrates on **"running the model in production exactly per the official spec."**

> 🔎 **Beware of variants**: Hugging Face hosts `Qwen/Qwen3-8B` (full precision), `Qwen/Qwen3-8B-FP8`, `Qwen/Qwen3-8B-Base`, community MLX versions, and more. This article covers the official **`Qwen/Qwen3-8B-AWQ`** (4-bit). GQA uses 32 Q heads and 8 KV heads. Fewer KV heads means a lighter KV cache, which makes it **easier to serve many concurrent requests** and therefore self-host-friendly.

---

## What AWQ (4-bit quantization) is: why it fits on "a single GPU"

**AWQ = Activation-aware Weight Quantization.** The official documentation describes it as "**hardware-friendly low-bit weight quantization.**" The important part is *activation-aware*: it quantizes in a way that protects the weights that matter most to the output, which limits quality degradation even at 4-bit. According to the [AWQ documentation](https://qwen.readthedocs.io/en/latest/quantization/awq.html), the AutoAWQ implementation gives **about 3x memory reduction and about 3x speedup vs. FP16.**

A rough VRAM estimate shows why it fits on one GPU.

```text
FP16 weights  ≈ 8.2B × 2 byte ≈ 16.4 GB
AWQ 4bit      ≈ 8.2B × 0.5 byte ≈ 4.1 GB (+ scale/zero-point, ~6GB measured)
```

In other words, **the weights take roughly 6GB.** The remaining VRAM goes to **the KV cache, which determines concurrency and context length.** On a single 24GB GPU, that means about 6GB of weights plus plenty of KV room (**always pin down the actual concurrency and throughput with a [load test](/blog/vllm-llama-self-hosting-production-inference-server#負荷試験本番前にどこで折れるかを知る)**).
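
To make the budget concrete, here is a minimal sketch of the arithmetic, not a measurement. The 36 layers and 8 KV heads come from the spec table above; `head_dim=128`, the 2-byte (FP16) KV cache, and the 2GB runtime overhead are my assumptions, and the function names are hypothetical. It treats GB and GiB as interchangeable, which is fine at this level of roughness.

```python
# Back-of-the-envelope VRAM budget. Estimates only; measure on your own GPU.
GIB = 1024**3


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: K and V for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_cached_tokens(gpu_gib: float, utilization: float, weights_gib: float,
                      overhead_gib: float, per_token: int) -> int:
    """Tokens that fit in the VRAM left after weights and runtime overhead."""
    free_bytes = (gpu_gib * utilization - weights_gib - overhead_gib) * GIB
    return max(0, int(free_bytes // per_token))


# 36 layers / 8 KV heads: spec table. head_dim=128: assumption.
PER_TOKEN = kv_bytes_per_token(layers=36, kv_heads=8, head_dim=128)
# 24 GiB card, 0.90 utilization, ~6 GiB weights, 2 GiB overhead (assumed).
TOKENS = max_cached_tokens(24, 0.90, 6.0, 2.0, PER_TOKEN)
print(f"{PER_TOKEN / 1024:.0f} KiB per token, room for about {TOKENS} cached tokens")
```

Under these assumptions one token costs 144 KiB of KV cache, so a single full 32,768-token sequence needs about 4.5 GiB. That is why the context length you allow translates so directly into how many requests fit at once.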

| Quantization | Weight VRAM | Quality | Main use |
| --- | --- | --- | --- |
| FP16/BF16 | ≈16GB | baseline | the highest quality when there's room |
| **AWQ 4-bit** | **≈6GB** | practically slight degradation | **1-GPU self-host, cost optimization** |
| FP8 | ≈8GB | almost no degradation | speed-focused on a supported GPU (H100, etc.) |

> 💡 **Which to use when**: if you have an FP8-capable GPU such as an H100, FP8 is also a strong choice (see [the write-up on serving Llama in FP8](/blog/vllm-llama-self-hosting-production-inference-server#サーブするscout-を-fp8--テンソル並列で)). On the other hand, **when you just want to load the model and run it cheaply on a 24GB GPU like an L4 / A10 / RTX 4090, AWQ 4-bit** is the standard route. Choosing among AWQ, GPTQ, FP8, and GGUF is covered in detail in [how to choose a quantization method](/blog/qwen3-quantization-awq-gptq-fp8-gguf-comparison-guide).

---

## The killer feature: hybrid thinking (thinking / non-thinking)

Qwen3's biggest differentiator is that **one model can switch between "think first" and "answer immediately."**

- **Thinking mode**: reasons in stages inside `<think> … </think>` before answering. Strong at math, code, and complex judgment.
- **Non-thinking mode**: answers immediately without emitting a reasoning block. For uses where **speed and cost** matter, like classification, extraction, and routine dialogue.

### Two ways to switch (official spec)

1. **Hard switch**: `apply_chat_template(..., enable_thinking=True/False)` (over the API, pass it via `chat_template_kwargs`).
2. **Soft switch**: while `enable_thinking=True`, writing **`/think` or `/no_think`** in a user message overrides the mode for that turn.
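
The soft switch is just text in the user message, so it can be applied with a small helper. This is a minimal sketch; `with_soft_switch` is a hypothetical name of mine, not part of any Qwen or vLLM API, and it assumes messages are plain OpenAI-style dicts with string content.

```python
from copy import deepcopy


def with_soft_switch(messages: list[dict], think: bool) -> list[dict]:
    """Return a copy of messages whose last user turn ends with /think or /no_think."""
    tag = "/think" if think else "/no_think"
    out = deepcopy(messages)  # never mutate the caller's history
    for msg in reversed(out):
        if msg["role"] == "user":
            msg["content"] = f'{msg["content"].rstrip()} {tag}'
            return out
    raise ValueError("no user message to attach the soft switch to")


history = [{"role": "user", "content": "Classify this ticket: 'refund not received'"}]
print(with_soft_switch(history, think=False)[-1]["content"])
```

Because the tag only takes effect while `enable_thinking=True`, I would keep the hard switch on at the server or request level and use the tag for per-turn overrides.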

### The recommended sampling per mode (quality breaks if you ignore it)

Note that the official documentation specifies **different sampling per mode.**

| Parameter | Thinking mode | Non-thinking mode |
| --- | --- | --- |
| Temperature | **0.6** | 0.7 |
| TopP | **0.95** | 0.8 |
| TopK | 20 | 20 |
| MinP | 0 | 0 |
| greedy decoding | **forbidden** | — |

> ⚠️ **An explicit official warning**: in thinking mode, **do not use greedy decoding (equivalent to temperature=0).** It leads to infinite loops, repetition, and degraded performance. Setting temp=0 because you "want deterministic reasoning output" is **counterproductive.**
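
One way to keep these values from drifting across a codebase is a single preset table with a guard. The sketch below uses the numbers from the table above; the names `SAMPLING` and `sampling_for` are hypothetical, and the guard is my own convention rather than anything the official documentation ships.

```python
# Recommended sampling per mode, copied from the table above.
SAMPLING = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},
}


def sampling_for(mode: str, **overrides: float) -> dict:
    """Return the preset for a mode, refusing greedy decoding in thinking mode."""
    params = {**SAMPLING[mode], **overrides}
    if mode == "thinking" and params["temperature"] <= 0:
        raise ValueError("greedy decoding (temperature=0) is not allowed in thinking mode")
    return params


print(sampling_for("thinking"))
print(sampling_for("non_thinking", temperature=0.5))
```

Failing loudly on `temperature=0` turns the warning into something a code review cannot miss.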

There is also an official guideline for output length: **32,768 tokens for most queries,** and up to **38,912 tokens** for complex problems such as math or programming. Thinking mode "thinks" at length before answering, so if you skimp on `max_tokens`, the output **is cut off before it reaches a conclusion.** Note that prompt and output together must still fit within the context length the server is configured with.
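
Since prompt and output share one context window, the output budget that is actually available depends on prompt length. Here is a minimal sketch of that arithmetic; `output_budget` is a hypothetical helper, and the defaults simply reuse the 32,768 figures from this article.

```python
def output_budget(prompt_tokens: int, max_model_len: int = 32768, desired: int = 32768) -> int:
    """Largest max_tokens that still fits in the context window alongside the prompt."""
    room = max_model_len - prompt_tokens
    if room <= 0:
        raise ValueError("prompt already fills the context window")
    return min(desired, room)


# A 2,000-token prompt on a 32K server leaves less than the 32,768 guideline.
print(output_budget(2_000))
# The 38,912 guideline only fits when the server allows a longer context.
print(output_budget(1_000, max_model_len=131_072, desired=38_912))
```

The practical reading: with a 32K context limit, the 38,912-token guideline for hard problems cannot be met, so either accept a shorter thinking budget or raise the limit for that workload only.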

---

## Running it ①: minimal confirmation with transformers (for development)

Start with a minimal check on your local machine. Running AWQ weights requires the AWQ kernels, so install `autoawq` (for production, vLLM, described later, is recommended).

```python
# pip install -U transformers accelerate autoawq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Which is larger, 9.11 or 9.9? Explain why."}]

# enable_thinking=True turns on staged reasoning inside <think>...</think> (hard switch)
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Official recommended sampling for thinking mode (no greedy decoding).
# do_sample=True is set explicitly so the sampling parameters are sure to apply.
generated = model.generate(
    **inputs, max_new_tokens=32768,
    do_sample=True, temperature=0.6, top_p=0.95, top_k=20,
)
# Keep only the newly generated tokens (drop the prompt) and decode them
output_ids = generated[0][len(inputs.input_ids[0]):].tolist()
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```

This is enough for trying the model on a single machine during development. However, the `generate` method in `transformers` **has no continuous batching,** so it is not suited to many simultaneous requests in production. For that, move to vLLM.
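
When running through `transformers` directly, nothing separates the reasoning from the answer for you. The sketch below splits decoded text on the `</think>` tag; `split_thinking` is a hypothetical helper, and it assumes the `<think>` tags survive decoding as plain text, which you should confirm against your own output.

```python
def split_thinking(text: str) -> tuple[str, str]:
    """Split decoded output into (thinking, answer) at the last </think> tag."""
    head, sep, tail = text.rpartition("</think>")
    if not sep:  # no reasoning block, e.g. non-thinking mode
        return "", text.strip()
    return head.replace("<think>", "", 1).strip(), tail.strip()


sample = "<think>\n9.9 is 9.90, and 9.90 > 9.11.\n</think>\n\n9.9 is larger."
thinking, answer = split_thinking(sample)
print("thinking:", thinking)
print("answer:", answer)
```

Splitting on the last closing tag keeps the helper safe if the reasoning itself happens to mention the tag.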

---

## Running it ②: production serving with vLLM (OpenAI-compatible)

For production, use **vLLM.** Continuous batching and PagedAttention keep the GPU from idling, and vLLM exposes an **OpenAI-compatible endpoint,** so you can switch the destination without changing app-side code (the full infrastructure build-out is covered in [the vLLM production-self-host operation log](/blog/vllm-llama-self-hosting-production-inference-server)).

```bash
# Serve Qwen3-8B-AWQ behind an OpenAI-compatible API, with parsing that separates the thinking block
vllm serve Qwen/Qwen3-8B-AWQ \
  --reasoning-parser qwen3 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --port 8000
```

- `--reasoning-parser qwen3`: the form for **vLLM 0.9.0 and later.** It **separates the `<think>` block from the answer body** and adds a `reasoning_content` field to the response. On older vLLM, use `--enable-reasoning --reasoning-parser deepseek_r1`.
- `--max-model-len 32768`: the native upper limit. **Narrowing it to the length you actually need** is the iron rule for cost and latency (don't leave it at the maximum by default).
- `--gpu-memory-utilization 0.90`: the fraction of GPU memory vLLM reserves. Too high risks OOM; too low reduces concurrency.

Once the server is up, you can call it **with an OpenAI client as-is.** Control thinking on/off with `extra_body.chat_template_kwargs.enable_thinking`.

```python
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-8B-AWQ",
    messages=[{"role": "user", "content": "Point out where stock can go negative in this SQL"}],
    temperature=0.6, top_p=0.95, extra_body={
        "top_k": 20,
        "chat_template_kwargs": {"enable_thinking": True},  # thinking mode
        # If a quantized model shows never-ending repetition, set presence_penalty in 0-2 (1.5 recommended)
        "presence_penalty": 1.5,
    },
)
msg = resp.choices[0].message
print("Thinking:", getattr(msg, "reasoning_content", None))  # contents of <think> (for logs/audit)
print("Answer:", msg.content)                                # final answer returned to the user
```

Having `reasoning_content` (the thinking process) and `content` (the final answer) **structurally separated** pays off in production. **Return only `content` to the user, and handle `reasoning_content` internally for audit and evaluation.** This separation is the foundation of the type safety and observability described later.

---

## Long context: 32K → 131K with YaRN (only when needed)

The native context length is 32,768 tokens. Extend it with **YaRN** only when you need more. You can declare it on the CLI.

```bash
vllm serve Qwen/Qwen3-8B-AWQ \
  --reasoning-parser qwen3 \
  --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
  --max-model-len 131072
```

The official documentation also shows the equivalent written into `config.json` (`factor: 4.0`, `original_max_position_embeddings: 32768`).

> ⚠️ **Don't leave YaRN enabled all the time.** The official documentation says to add it "**only when needed.**" Applying long-context scaling to short inputs as well has the side effect of **lowering quality on short text.** If 32K is enough, leave YaRN off; "max it out just in case" is counterproductive. The longer the context, the larger the KV cache grows and the more VRAM and latency increase, so set `max-model-len` **to match actual demand.**
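
"Only when needed" can be encoded as a small decision function that emits no YaRN config at all below the native length. This is a minimal sketch: `yarn_rope_scaling` is a hypothetical helper, and rounding the factor up to the next whole multiple of 32,768 is my own simplification, on the understanding that a smaller factor is preferable when your typical context is shorter than 131K.

```python
import json
import math
from typing import Optional

NATIVE_CONTEXT = 32768
MAX_FACTOR = 4.0  # 4 x 32,768 = 131,072 tokens


def yarn_rope_scaling(required_tokens: int) -> Optional[dict]:
    """Return a rope-scaling config for the required context, or None if YaRN is unnecessary."""
    if required_tokens <= NATIVE_CONTEXT:
        return None  # native length is enough: leave YaRN off
    factor = float(math.ceil(required_tokens / NATIVE_CONTEXT))
    if factor > MAX_FACTOR:
        raise ValueError("required context exceeds the 131,072-token extended limit")
    return {
        "rope_type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": NATIVE_CONTEXT,
    }


for needed in (20_000, 65_536, 131_072):
    config = yarn_rope_scaling(needed)
    print(needed, json.dumps(config) if config else "no YaRN")
```

The JSON it prints has the same shape as the `--rope-scaling` argument above, so a deploy script could pass it through unchanged.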

---

## Tool calling / agents

Qwen3 supports function calling (tool use). In vLLM, enable it with flags.

```bash
vllm serve Qwen/Qwen3-8B-AWQ \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```

After that, if you pass OpenAI-compatible `tools`, the model returns a function call when needed. If you are building agent-style tool integration, the official **Qwen-Agent** (which ships templates for tool definition and parsing) saves you from writing the tool-call handling yourself. The implementation of a safe loop, including Zod validation of arguments, an iteration cap, and idempotent side effects, is worked out in [making Qwen3 an agent](/blog/qwen3-agent-tool-use-function-calling-qwen-agent-production). For the core design idea, "separate **the judgment you leave to the LLM** from **the execution you assign to deterministic code**," see [the design of tool use / function calling](/blog/ai-agent-tool-use-function-calling-production-design).
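
As a minimal sketch of that split between LLM judgment and deterministic execution, the block below defines one tool in the OpenAI `tools` format and a dispatcher that validates the model-issued arguments before running anything. The `get_stock` tool, the `INVENTORY` data, and `dispatch` are all hypothetical examples of mine.

```python
import json

# Tool schema in the OpenAI-compatible format, passed as tools=TOOLS.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_stock",
        "description": "Return the units in stock for a SKU.",
        "parameters": {
            "type": "object",
            "properties": {"sku": {"type": "string"}},
            "required": ["sku"],
        },
    },
}]

INVENTORY = {"A-100": 12}  # stand-in for a real data source


def get_stock(sku: str) -> dict:
    return {"sku": sku, "units": INVENTORY.get(sku, 0)}


HANDLERS = {"get_stock": get_stock}


def dispatch(name: str, arguments_json: str) -> str:
    """Validate a model-issued tool call, then run it with deterministic code."""
    if name not in HANDLERS:
        raise ValueError(f"unknown tool: {name}")
    args = json.loads(arguments_json)
    if not isinstance(args, dict) or set(args) != {"sku"} or not isinstance(args["sku"], str):
        raise ValueError(f"invalid arguments for {name}: {arguments_json}")
    return json.dumps(HANDLERS[name](**args))


print(dispatch("get_stock", '{"sku": "A-100"}'))
```

With the OpenAI Python client, each entry in `resp.choices[0].message.tool_calls` carries `function.name` and `function.arguments` (a JSON string), which map onto the two parameters of `dispatch`.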

---

## Where to use it: a type-safe production client (application)

From here on is the **application** part. Qwen3-8B-AWQ's strengths are that it can think, it is cheap, and it is self-hosted. Wrapping it in **type safety, idempotency, and resilience** turns it into a component that earns its keep in production. A representative application is [a self-hosted RAG that doesn't send sensitive data outside](/blog/qwen3-self-hosted-rag-reasoning-hybrid-search-production).

Design policy (following [the principles of CLAUDE.md](/about): SRP, KISS, and type-safe boundaries):

1. **Route the mode by difficulty**: easy tasks go to non-thinking (fast, cheap), and only the hard parts go to thinking ([the main lever of cost optimization](/blog/llama-inference-cost-optimization-self-host-vs-api#コスト削減レバー効果の大きい順)).
2. **Separate the thinking process from the answer**: send `reasoning_content` to logs; validate and return only `content`.
3. **Validate structured output with Zod at the boundary**: LLM output is external input. Reject invalid data with `parse` ([the discipline of type safety](/blog/typescript-type-safety-discipline-zod-nevererror-no-any)). For a two-stage setup that adds guided decoding so invalid JSON cannot be generated in the first place, see [type-safe structured output](/blog/qwen3-structured-output-json-vllm-guided-decoding-zod).
4. **Timeout + idempotent cache**: never generate twice for the same input, which also keeps cost stable.

```ts
// lib/qwen-client.ts: a thin client that routes thinking/non-thinking and type-validates output at the boundary
import OpenAI from "openai";
import { z } from "zod";
import { createHash } from "node:crypto";

const client = new OpenAI({
  baseURL: process.env.QWEN_BASE_URL, // e.g. http://qwen-internal:8000/v1 (keep it private, never public)
  apiKey: "internal",
  timeout: 60_000, // thinking mode runs long; too short a timeout cuts off normal deliberation
});

/** Task difficulty → mode and official recommended sampling (SSoT) */
const MODE = {
  fast: { enable_thinking: false, temperature: 0.7, top_p: 0.8, top_k: 20 },
  think: { enable_thinking: true, temperature: 0.6, top_p: 0.95, top_k: 20 },
} as const;
type Difficulty = keyof typeof MODE;

/** Expected structured output. Don't trust plausible-looking strings from the LLM; reject them here */
const RiskFinding = z.object({
  hasRisk: z.boolean(),
  severity: z.enum(["low", "medium", "high"]),
  reason: z.string().min(1),
});
type RiskFinding = z.infer<typeof RiskFinding>;

const cache = new Map<string, RiskFinding>(); // replace with Redis or similar in real operation

/** Same input returns the same cached result (idempotent). No cost wasted on repeated clicks, retries, or duplicate requests */
const keyOf = (prompt: string, d: Difficulty) =>
  createHash("sha256").update(`qwen3-8b-awq:${d}:${prompt}`).digest("hex");

export async function assessRisk(prompt: string, difficulty: Difficulty = "think"): Promise<RiskFinding> {
  const key = keyOf(prompt, difficulty);
  const hit = cache.get(key);
  if (hit) return hit;

  const m = MODE[difficulty];
  const resp = await client.chat.completions.create({
    model: "Qwen/Qwen3-8B-AWQ",
    messages: [
      { role: "system", content: 'Reply with JSON only: {"hasRisk":bool,"severity":"low|medium|high","reason":string}' },
      { role: "user", content: prompt },
    ],
    temperature: m.temperature,
    top_p: m.top_p,
    response_format: { type: "json_object" },
    presence_penalty: 1.5, // countermeasure for repetition in quantized models (official recommended range 0-2, OpenAI-compatible)
    // Send vLLM extensions at the top level. The Node SDK forwards unknown keys to the request body, and because this is a spread there is no type error.
    // Note: the Python SDK's extra_body does not exist in the TS SDK, so don't use it.
    ...{ top_k: m.top_k, chat_template_kwargs: { enable_thinking: m.enable_thinking } },
  });

  // Validate only content. reasoning_content (the thinking process) belongs in the audit log (no PII)
  const raw = resp.choices[0]?.message.content ?? "{}";
  const finding = RiskFinding.parse(JSON.parse(raw)); // throws on an invalid shape → handle it upstream

  cache.set(key, finding);
  return finding;
}
```

The crux of this client is that it **never takes the output of a "reasoning LLM" at face value.** Keep `reasoning_content` for human audit and evaluation, and return to the user **only the type that has passed Zod validation.** Furthermore, toggling `enable_thinking` by task difficulty makes a cost design possible within one model: **the easy 80% cheaply in non-thinking, and only the hard 20% accurately in thinking.**
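
How a caller decides the difficulty is left open above. One simple option, sketched here in Python, is a static map from task type to mode. The task-type labels and `pick_mode` are hypothetical, and defaulting unknown tasks to `think` is my own quality-first choice that mirrors the default argument of `assessRisk`.

```python
# Task types that are fine without reasoning (fast, cheap).
FAST_TASKS = {"classification", "extraction", "routine_dialogue"}
# Task types that benefit from staged reasoning.
THINK_TASKS = {"math", "code", "complex_judgment"}


def pick_mode(task_type: str) -> str:
    """Map a task type to the client's mode key: 'fast' or 'think'."""
    if task_type in FAST_TASKS:
        return "fast"
    # Known hard tasks and anything unrecognized both go to thinking mode.
    return "think"


for task in ("classification", "math", "something_new"):
    print(task, "->", pick_mode(task))
```

A static map is easy to audit and to measure; if the 80/20 split turns out wrong for your traffic, changing it is a one-line edit.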

---

## Production build-out (avoiding duplication, just the key points)

Observability, auto-scaling, graceful drain, resilience, and network isolation follow **the same practices** whether the model is Qwen or Llama. Rather than duplicate the details, I defer to [the vLLM production-self-host operation log](/blog/vllm-llama-self-hosting-production-inference-server). Here are only the points that matter especially for Qwen3-8B-AWQ:

- **Observability**: monitor `num_requests_waiting` / `gpu_cache_usage_perc` / TTFT via vLLM's `/metrics` (Prometheus). Because **thinking mode tends to produce long output,** visualizing the token consumption of `reasoning` per use case gives you a feel for the cost ([the correlation design of OpenTelemetry](/blog/opentelemetry-observability-production-tracing-metrics-logs)).
- **Resilience**: don't let the failure of your own node become a total outage. Use timeout + retry + fallback. If the 8B goes down, fail over to another node or a larger model ([retry, circuit breaker](/blog/retry-backoff-circuit-breaker-resilience-patterns-guide)).
- **Security**: vLLM's OpenAI-compatible server **has no strong authentication of its own.** **Don't expose it directly to the internet.** Place it in a private VPC, and handle authentication, rate limiting, and auditing at an API Gateway in front. Don't keep prompt bodies (PII) in logs; after all, "not sending data outside" is the whole motive for self-hosting.
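
For a quick check without a full Prometheus stack, the text that `/metrics` returns can be parsed with the standard library. This is a minimal sketch: the sample payload, the `vllm:` prefix, the label shape, and both thresholds are assumptions of mine, so compare them with what your own server emits. It also assumes sample lines carry no timestamp.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Sum Prometheus text-format samples by metric name, ignoring labels."""
    totals: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals


def needs_attention(metrics: dict[str, float], max_waiting: float = 5, max_cache: float = 0.9) -> bool:
    """True when the queue or the KV cache is near saturation (thresholds are illustrative)."""
    return (metrics.get("vllm:num_requests_waiting", 0.0) > max_waiting
            or metrics.get("vllm:gpu_cache_usage_perc", 0.0) > max_cache)


SAMPLE = """\
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="Qwen/Qwen3-8B-AWQ"} 3.0
vllm:gpu_cache_usage_perc{model_name="Qwen/Qwen3-8B-AWQ"} 0.82
"""
metrics = parse_metrics(SAMPLE)
print(metrics, needs_attention(metrics))
```

In real operation I would let Prometheus scrape and alert; a parser like this is for smoke tests and load-test scripts.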

---

## Gotchas & official best practices

From the official documentation, here are the items that **degrade quality if you skip them.**

- 🔴 **No greedy decoding in thinking mode.** Don't set temp=0. It causes repetition and infinite loops.
- 🟠 **`presence_penalty ≈ 1.5` against repetition in quantized models** (range 0–2). Raising it too far can occasionally cause language mixing or a slight performance drop, so apply it only as much as needed when repetition appears.
- 🟠 **Don't keep thinking content (`<think>`) in the history in multi-turn conversations.** Carry **only the final answer** into the next turn's input. Accumulating thinking content pollutes the context and worsens both quality and cost.
- 🟠 **Don't restrict the output length too much.** Thinking runs long before the answer. Budget `max_tokens` using the guideline of 32,768 tokens normally and 38,912 tokens for complex problems.
- 🟢 **For math, specify the output format**: adding "Please reason step by step, and put your final answer within \boxed{}." makes the final answer easier to extract by machine.
- 🟢 **YaRN only when needed.** Leaving it on all the time lowers quality on short text.
- 🟢 **With `transformers`, AWQ depends on the AWQ kernels.** For production throughput, rely on vLLM (continuous batching).
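
The history rule is the easiest one to get wrong in application code, so here is a minimal sketch of it. `append_assistant_turn` is a hypothetical helper; it assumes OpenAI-style message dicts and strips any inline `<think>` block defensively, in case the server was started without a reasoning parser.

```python
import re

# Matches a whole <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def append_assistant_turn(history: list[dict], content: str) -> list[dict]:
    """Add only the final answer to the history; any <think> block is dropped."""
    answer = THINK_BLOCK.sub("", content).strip()
    return [*history, {"role": "assistant", "content": answer}]


history = [{"role": "user", "content": "Which is larger, 9.11 or 9.9?"}]
reply = "<think>\nCompare 9.90 and 9.11.\n</think>\n\n9.9 is larger."
history = append_assistant_turn(history, reply)
print(history[-1])
```

When the reasoning parser is enabled, `content` already excludes the thinking, and the point is simply to never append `reasoning_content` to the next request.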

---

## Frequently asked questions (FAQ)

**Q. Qwen3-8B-AWQ or the FP8 version, which to use?**
A. **To load the model and run it cheaply on a 24GB-class general-purpose GPU (L4/A10/RTX 4090), use AWQ 4-bit.** **For speed first on an FP8-capable GPU such as an H100, use FP8.** The weights are roughly AWQ≈6GB / FP8≈8GB / FP16≈16GB (needs measurement).

**Q. Should "thinking" always be on?**
A. No. Classification, extraction, and routine dialogue are fine with **non-thinking (fast, cheap).** Send only math, code, and complex judgment to **thinking.** Routing by task difficulty is the main lever of cost design (see the client example in the body).

**Q. Can I use greedy decoding (temp=0) for reproducibility?**
A. In thinking mode, **the official documentation rules it out.** If you need deterministic output, guarantee reproducibility not in `reasoning` but through **the final answer's schema (Zod) and an idempotent cache.** Switching off sampling itself is counterproductive.

**Q. How many GPUs do I need?**
A. **Start with one.** Since the weights are ≈6GB, the model runs on a single 24GB GPU. Concurrency and throughput depend on the KV cache, so measure the saturation point with a [load test](/blog/vllm-llama-self-hosting-production-inference-server#負荷試験本番前にどこで折れるかを知る) and work back the number of GPUs from there.

**Q. Does it run on Ollama too?**
A. It's handy for trying the model on a single machine in local development. But **for high-throughput production (continuous batching), use vLLM.** Use the convenient tool for development and vLLM for production.

**Q. Is the license commercially usable?**
A. It's **Apache-2.0.** You hold the weights yourself and can modify them and run them in your own environment. There are no constraints like Llama's "700M MAU limit / 'Built with Llama' notice," so **the licensing side is simpler** (for the difference from a closed API, see the [pillar article](/blog/meta-llama-open-weight-llm-production-guide)).

---

## Summary

Qwen3-8B-AWQ squarely meets a requirement that used to be hard to reconcile: "**running a small model that can think, cheaply on a single GPU, without sending data outside.**"

1. With **AWQ 4-bit**, weights ≈6GB → fits on **a single 24GB GPU** (Apache-2.0, commercial OK).
2. **Route hybrid thinking by difficulty** — easy cheaply in non-thinking, hard accurately in thinking.
3. **OpenAI-compatible serving with vLLM** (`--reasoning-parser qwen3`), separate thinking and answer with `reasoning_content`.
4. **Strictly observe the official sampling** (thinking: 0.6/0.95/20, greedy forbidden, presence_penalty≈1.5, don't keep thinking in the history).
5. **Type-validate output at the boundary (Zod) + idempotent cache.** The production practice is consolidated in [the vLLM operation log](/blog/vllm-llama-self-hosting-production-inference-server).

> I build "reasoning LLM" deployments on-prem or inside a VPC, including observability, resilience, and type safety. See my GPU-production [track record](/case-studies/ai-video-localization-lipsync) and get in touch about model selection, cost design, and self-host migration. **One person × generative AI:** fast, cheap, and safe.

### Sources / official resources

- [Qwen/Qwen3-8B-AWQ — Hugging Face model card](https://huggingface.co/Qwen/Qwen3-8B-AWQ) — specs, mode switching, recommended sampling, best practices
- [Qwen official documentation (vLLM deployment)](https://qwen.readthedocs.io/en/latest/deployment/vllm.html) — reasoning-parser, tool-call, YaRN flags
- [Qwen official documentation (AWQ quantization)](https://qwen.readthedocs.io/en/latest/quantization/awq.html) — how AWQ works and how to use it
- [vLLM official documentation](https://docs.vllm.ai/) — server, metrics, quantization
- [QwenLM/Qwen3 (GitHub)](https://github.com/QwenLM/qwen3) — reference implementation
- Qwen3 Technical Report (arXiv:2505.09388)

※ Specs, flags, and sampling recommendations change over time. VRAM and throughput are **environment-dependent and need benchmarking.** Before implementing, always check the primary sources and run your own benchmark.
