Category
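Implementation guide to quantized, self-hosted LLMs (Qwen3-8B-AWQ / AWQ / vLLM)
Quantization is the key to putting a large, capable model on a single inexpensive GPU and making it earn its keep in production. With 4-bit AWQ, the weights shrink to roughly one third of their original size, so you can run a "reasoning LLM" yourself on a 24GB-class GPU: no data leaves your environment, and you pay a fixed cost instead of per-token fees. Using Qwen3-8B-AWQ as the running example, this cluster covers how to choose a quantization method, type-safe structured output (JSON) with vLLM, self-hosted RAG that combines thinking mode with hybrid search, and safe agent design. Throughout, it treats quantized self-hosting as a production design built around type safety, idempotency, observability, resilience, and cost.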
6 articles in total
Foundational guide (start here)
Qwen3-8B-AWQ practical guide: self-hosting a 'reasoning LLM' on a single GPU with 4-bit quantization
A walkthrough of Qwen3-8B-AWQ that stays faithful to the official documentation. AWQ 4-bit quantization compresses the weights to about 6GB, so the model runs in production on a single 24GB GPU. It covers switching between the hybrid thinking modes (thinking / non-thinking), OpenAI-compatible serving with vLLM, the recommended sampling settings per mode, extending the context to 131K with YaRN, tool calling, and quantization-specific pitfalls (presence_penalty, and the ban on greedy decoding), all with real code.
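As a rough illustration of the per-mode sampling and mode switching mentioned in this guide, the sketch below builds a request body for vLLM's OpenAI-compatible `/v1/chat/completions` endpoint. The sampling values are the ones recalled from the Qwen3 model card (thinking: temperature 0.6, top_p 0.95; non-thinking: temperature 0.7, top_p 0.8; top_k 20 and min_p 0 for both), and `presence_penalty: 1.5` is assumed from the card's note for quantized checkpoints; verify both against the current card. `buildChatRequest`, `SAMPLING`, and the default model name are hypothetical names for illustration and are not taken from the article.

```typescript
type Mode = "thinking" | "non-thinking";

interface Sampling {
  temperature: number;
  top_p: number;
  top_k: number;
  min_p: number;
}

// Assumed values, recalled from the Qwen3 model card; check before relying on them.
const SAMPLING: Record<Mode, Sampling> = {
  thinking: { temperature: 0.6, top_p: 0.95, top_k: 20, min_p: 0 },
  "non-thinking": { temperature: 0.7, top_p: 0.8, top_k: 20, min_p: 0 },
};

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Builds the JSON body only; sending it (for example with fetch) is left to the caller.
function buildChatRequest(
  mode: Mode,
  messages: ChatMessage[],
  model: string = "Qwen/Qwen3-8B-AWQ", // assumed served model name
) {
  return {
    model,
    messages,
    ...SAMPLING[mode],
    // Never temperature 0: greedy decoding is the pitfall the guide warns about.
    presence_penalty: 1.5, // assumption: value suggested for quantized checkpoints
    // vLLM forwards chat_template_kwargs to the chat template,
    // which is how Qwen3's thinking switch is toggled per request.
    chat_template_kwargs: { enable_thinking: mode === "thinking" },
  };
}
```

Keeping the two presets in one typed table means a caller cannot ask for a mode without also getting that mode's sampling settings, which is one simple way to avoid mixing thinking-mode prompts with non-thinking parameters.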
Related practical articles
- Generative AI, LLM, vLLM, Self-hosting, Cost optimization
The serving economics of quantization: AWQ vs FP8, and how the KV cache and VRAM budget decide your production cost
Choosing an LLM quantization method (AWQ / GPTQ / FP8 / GGUF) tends to be discussed in terms of accuracy alone, but production serving cost is decided by how you split a single GPU's VRAM budget between the model weights and the KV cache. Quantization shrinks the weights and lets you spend the freed space on concurrency and long context. The article explains this as serving economics, with VRAM-budget estimation code and lessons from running a quantized model in production on a T4 GPU.
9 min read
- Qwen, Agent, Tool Use, vLLM, TypeScript
Turning Qwen3-8B-AWQ into an agent: a production design of Qwen-Agent × function calling
A production design that turns your own Qwen3-8B-AWQ deployment into a tool-using agent. With working code, it explains how to enable vLLM's Hermes-format tool calling, define a type-safe tool contract (Zod → JSON Schema), run a safe loop that validates arguments before executing, and apply an iteration cap, idempotent side effects, and authorization guards. It also covers the official caution against ReAct in thinking mode.
9 min read
- Qwen, AWQ, Quantization, vLLM, GGUF
How to choose a Qwen3-8B quantization method: picking among AWQ, GPTQ, FP8, and GGUF by use case
Which quantization should you run Qwen3-8B with? This article compares AWQ, GPTQ, FP8, and GGUF by supported hardware, VRAM, throughput, and official support status. VRAM calculations and a type-safe selection function (with tests) make the decision straightforward: AWQ or FP8 for GPU production, GGUF for local use on Mac or CPU.
10 min read
- Qwen, RAG, vLLM, pgvector, Self-hosting
Self-hosted RAG with Qwen3-8B-AWQ: a production design of thinking mode × hybrid search
A production design that makes Qwen3-8B-AWQ the "reasoner" of a RAG pipeline, running on your own GPU so confidential documents never leave your environment. It walks through the flow in real code: hybrid search → re-ranking → integration in thinking mode → structured answers with citations. It also covers verifying that cited sources exist (a hallucination countermeasure), prompt-injection defense, the context budget, and observability.
9 min read
- Qwen, vLLM, TypeScript, Zod, Type safety
Type-safe structured output with Qwen3-8B-AWQ: vLLM guided decoding × Zod
A practical guide to making your own LLM's JSON output "unbreakable." vLLM's structured output (guided decoding / response_format json_schema) makes syntactically invalid JSON impossible to generate, and boundary validation with Zod adds a second guard. A single Zod schema serves as the source of truth for both the constraint passed to vLLM and the app's own validation. Real code covers coexistence with thinking mode and a repair loop.
9 min read
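The VRAM-budget framing used in the serving-economics entry above can be sketched with a few lines of arithmetic. This is a minimal estimate under stated assumptions, not the article's own code: the KV-cache shape (36 layers, 8 KV heads, head dimension 128, 2 bytes per element for an FP16 cache) is assumed from the published Qwen3-8B configuration, the 2 GiB overhead for activations and runtime buffers is a placeholder guess, and `gpuMemoryUtilization` mirrors the meaning of vLLM's `--gpu-memory-utilization` option. All function and type names are hypothetical.

```typescript
// Shape of the attention KV cache for one model (assumed values, see the text above).
interface KvShape {
  layers: number;
  kvHeads: number;
  headDim: number;
  bytesPerElem: number;
}

interface VramBudget {
  totalGiB: number;             // physical VRAM on the card
  gpuMemoryUtilization: number; // fraction the server is allowed to use
  weightsGiB: number;           // model weights after quantization
  overheadGiB: number;          // activations and runtime buffers (a guess)
}

const GIB = 1024 * 1024 * 1024;

// One key vector and one value vector per layer per KV head.
function kvBytesPerToken(s: KvShape): number {
  return 2 * s.layers * s.kvHeads * s.headDim * s.bytesPerElem;
}

// Number of tokens of KV cache that fit in what is left after weights and overhead.
function kvTokenCapacity(b: VramBudget, s: KvShape): number {
  const usableGiB = b.totalGiB * b.gpuMemoryUtilization;
  const kvGiB = usableGiB - b.weightsGiB - b.overheadGiB;
  if (kvGiB <= 0) return 0;
  return Math.floor((kvGiB * GIB) / kvBytesPerToken(s));
}

const qwen3_8b: KvShape = { layers: 36, kvHeads: 8, headDim: 128, bytesPerElem: 2 };

const awqOn24GiB: VramBudget = {
  totalGiB: 24,
  gpuMemoryUtilization: 0.9,
  weightsGiB: 6, // the "about 6GB" figure quoted above, treated as GiB for simplicity
  overheadGiB: 2,
};

const capacity = kvTokenCapacity(awqOn24GiB, qwen3_8b);
```

Under these assumptions each token costs 147,456 bytes (144 KiB) of KV cache, and the remaining 13.6 GiB holds roughly 99,000 tokens, which the server can spend either on many short concurrent requests or on a few long-context ones. Every GiB saved on weights adds about 7,280 tokens to that pool.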