Implementation guide to quantized, self-hosted LLMs (Qwen3-8B-AWQ / AWQ / vLLM)

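Quantization is the key to putting a large, capable model on a single inexpensive GPU and making it pay for itself in production. With 4-bit AWQ, the weights shrink to about one-third of their original size, so you can run a "thinking LLM" yourself on a 24GB-class GPU, without sending data outside your environment and at a fixed cost rather than per-token billing. Using Qwen3-8B-AWQ as the worked example, this cluster covers how to choose a quantization method, type-safe structured output (JSON) with vLLM, self-hosted RAG that combines thinking mode with hybrid search, and safe agent design. Throughout, it treats quantized self-hosting as a production design problem framed by type safety, idempotency, observability, resilience, and cost.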
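The "about one-third" figure can be sanity-checked with a back-of-the-envelope calculation. The sketch below is only an illustration under stated assumptions: the parameter counts, the group size of 128, and the choice to leave embedding weights in 16-bit are assumed values, not figures taken from this guide. It counts weight memory only and ignores the KV cache, activations, and runtime overhead.

```python
# Rough weight-memory estimate for a 4-bit AWQ checkpoint.
# All constants below are assumptions made for illustration.
TOTAL_PARAMS = 8.2e9    # assumed total parameter count
EMBED_PARAMS = 1.25e9   # assumed embedding parameters kept in 16-bit
GROUP_SIZE = 128        # assumed quantization group size
BYTES_16BIT = 2


def fp16_weight_gb(total_params: float) -> float:
    """Weight size in GB when every parameter is stored in 16 bits."""
    return total_params * BYTES_16BIT / 1e9


def awq_weight_gb(total_params: float, embed_params: float,
                  group_size: int = GROUP_SIZE) -> float:
    """Weight size in GB with 4-bit weights plus per-group metadata."""
    quantized_params = total_params - embed_params
    # 4 bits per weight, plus one 16-bit scale and one 4-bit zero point
    # shared by each group of `group_size` weights.
    bits_per_weight = 4 + (16 + 4) / group_size
    quantized_bytes = quantized_params * bits_per_weight / 8
    unquantized_bytes = embed_params * BYTES_16BIT
    return (quantized_bytes + unquantized_bytes) / 1e9


full = fp16_weight_gb(TOTAL_PARAMS)
awq = awq_weight_gb(TOTAL_PARAMS, EMBED_PARAMS)
print(f"16-bit: {full:.1f} GB, AWQ 4-bit: {awq:.1f} GB, ratio: {awq / full:.2f}")
```

Under these assumptions the ratio lands somewhat above one-quarter, because the per-group scales and the unquantized embeddings add to the pure 4-bit cost. Whatever remains of the 24GB after the weights is the budget for the KV cache and runtime overhead.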

6 articles in total

Foundational guide (start here)

Qwen

Qwen3-8B-AWQ practical guide: self-hosting a 'reasoning LLM' on a single GPU with 4-bit quantization

An explanation of Qwen3-8B-AWQ that stays faithful to the official documentation. AWQ 4-bit quantization compresses the weights to about 6GB, so the model can run in production on a single 24GB GPU. Covers switching between the hybrid thinking modes (thinking/non-thinking), OpenAI-compatible serving with vLLM, the recommended sampling settings for each mode, context extension to 131K with YaRN, tool calling, and quantization-specific pitfalls (presence_penalty, no greedy decoding), all with real code.

6/25/2026 · 16 min read

Related practical articles

The serving economics of quantization: AWQ vs FP8, and how the KV cache and VRAM budget decide your production cost

Turning Qwen3-8B-AWQ into an agent: a production design of Qwen-Agent × function calling

How to choose a Qwen3-8B quantization method: deciding AWQ, GPTQ, FP8, and GGUF by use

Self-hosted RAG with Qwen3-8B-AWQ: a production design of thinking mode × hybrid search

Type-safe structured output with Qwen3-8B-AWQ: vLLM guided decoding × Zod