The goal of this article
In the Qwen3-8B-AWQ practical guide, I wrote "with AWQ it fits on one 24GB GPU." So which of AWQ, GPTQ, FP8, and GGUF should you choose? This question comes up in every project, and this article answers it in terms of hardware, VRAM, throughput, and official support status.
Quantization isn't something you pick with "4bit for now." The right answer depends on which GPU (or CPU/Mac) you have and what you prioritize. By the end, the goal is that you can state, as a formula, "for our environment, XX, because YY."
Reliability disclosure: the method characteristics, commands, and official support status here are based on the Qwen official documentation (quantization) and each model card (AWQ / FP8 / GGUF). VRAM and throughput are environment-dependent and require benchmarking. The numbers here are rules of thumb; always measure in your own environment before implementation. Production GPU operation is something I dealt with firsthand on the video AI localization platform.
The 30-second conclusion: quick-reference table
| Method | Bits | Supported HW | Weight VRAM (8.2B) | Main use | Qwen3 official |
|---|---|---|---|---|---|
| FP8 | 8bit float | FP8-capable GPU (H100/H200, Ada, Blackwell) | ≈8GB | GPU server, speed-first, nearly no degradation | Qwen3-8B-FP8 |
| AWQ | 4bit weight | General GPU (even 24GB-class) | ≈6GB | GPU server, max compression, single-card operation | Qwen3-8B-AWQ |
| GPTQ | 4/8bit | GPU (Marlin kernel in vLLM) | ≈6GB (Int4) | When you have existing GPTQ assets | ⚠️ "To be updated" |
| GGUF | 2–8bit variable | CPU / Apple Metal / GPU | ≈5GB (Q4_K_M) | Local, Mac, edge, single machine | Qwen3-8B-GGUF |
In 3 lines:
- Production serving on GPU → FP8 (fast and clean if you have a capable GPU) or AWQ (max compression on a general GPU).
- Local on Mac/CPU → GGUF (Ollama / LM Studio / llama.cpp).
- GPTQ is "awaiting update" per the official docs for Qwen3 and has known issue reports, so for Qwen the safe default is AWQ.
First, "what happens in quantization" in one minute
LLM weights are usually FP16 (2 bytes per parameter). Quantization represents these weights in fewer bits, cutting VRAM and memory bandwidth to achieve "fits, fast, cheap." The price is slight quality degradation, and how cleverly that loss is limited is where each method shows its skill.
The VRAM estimation formula is simple.
Weight VRAM ≈ parameter count × (bits / 8) [bytes]
FP16 : 8.2e9 × 2 ≈ 16.4 GB
FP8 : 8.2e9 × 1 ≈ 8.2 GB
4bit : 8.2e9 × 0.5 ≈ 4.1 GB (plus scales/zero points: 5 to 6 GB measured)
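The estimate above can be written as a small helper. This is a minimal sketch: the names `estimateWeightGB` and `weightEstimates` are hypothetical, and the result covers weights only (no scales/zero points, KV cache, or activations), so treat it as a lower bound rather than a measured figure.

```typescript
// Rough weight-only VRAM estimate: parameters × (bits / 8) bytes.
// Ignores quantization scales/zero points, the KV cache, and activations.
export function estimateWeightGB(paramCount: number, bitsPerWeight: number): number {
  if (paramCount <= 0 || bitsPerWeight <= 0) {
    throw new RangeError("paramCount and bitsPerWeight must be positive");
  }
  const bytes = paramCount * (bitsPerWeight / 8);
  return bytes / 1e9; // decimal GB, matching the figures in the text
}

// 8.2e9 is the parameter count used in this article for Qwen3-8B.
const QWEN3_8B_PARAMS = 8.2e9;

export const weightEstimates = {
  fp16: estimateWeightGB(QWEN3_8B_PARAMS, 16),
  fp8: estimateWeightGB(QWEN3_8B_PARAMS, 8),
  int4: estimateWeightGB(QWEN3_8B_PARAMS, 4),
};
```

The three entries reproduce the 16.4 / 8.2 / 4.1 GB rows; the gap between 4.1 GB and the measured 5 to 6 GB is exactly what the helper leaves out.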
💡 Lighter weights = more concurrency: the VRAM you free up can go to the KV cache, which determines the number of concurrent requests and the context length. So cutting the weights not only makes the model fit but also directly raises the effective throughput of self-hosting (the denominator of the cost calculation).
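To make the KV-cache point concrete, here is a rough sketch of how many cached tokens fit in the VRAM left over after the weights. The attention shape (36 layers, 8 KV heads, head dimension 128) and the FP16 cache are assumptions not stated in this article, so check them against the model's config.json. The 2 GB overhead is a hypothetical placeholder, and real servers allocate the cache in blocks, so actual numbers will differ.

```typescript
// Assumed Qwen3-8B attention shape; verify against the model's config.json.
export interface KvShape {
  readonly layers: number;
  readonly kvHeads: number;
  readonly headDim: number;
  readonly bytesPerElement: number; // 2 for an FP16/BF16 KV cache
}

export const ASSUMED_QWEN3_8B: KvShape = {
  layers: 36,
  kvHeads: 8,
  headDim: 128,
  bytesPerElement: 2,
};

/** Bytes of KV cache for one token: keys and values (the factor 2) in every layer. */
export function kvBytesPerToken(s: KvShape): number {
  return 2 * s.layers * s.kvHeads * s.headDim * s.bytesPerElement;
}

/** Cached tokens that fit in the VRAM left after weights and a fixed overhead. */
export function kvTokenBudget(
  totalGB: number,
  weightGB: number,
  overheadGB: number,
  s: KvShape,
): number {
  const freeBytes = (totalGB - weightGB - overheadGB) * 1e9;
  return freeBytes <= 0 ? 0 : Math.floor(freeBytes / kvBytesPerToken(s));
}

// On a 24GB card with a hypothetical 2GB overhead:
export const awqBudget = kvTokenBudget(24, 6, 2, ASSUMED_QWEN3_8B); // AWQ weights ≈6GB
export const fp8Budget = kvTokenBudget(24, 8, 2, ASSUMED_QWEN3_8B); // FP8 weights ≈8GB
```

Under these assumptions the 2 GB saved by AWQ turns into a noticeably larger token budget than FP8, and that budget is shared across all concurrent requests.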
AWQ: the "main option" for a GPU server
AWQ (Activation-aware Weight Quantization) is activation-aware 4bit weight quantization that "doesn't break the weights more impactful to the output." The official docs show about 3× memory reduction and about 3× speedup vs. FP16 with the AutoAWQ implementation.
- Strength: small quality degradation at 4bit. Pairs well with vLLM's continuous batching and runs practically on one 24GB-class GPU. Qwen officially distributes Qwen3-8B-AWQ.
- Suits: serving in production with maximum compression on a general GPU like L4 / A10 / RTX 4090.
# AWQ: serve as-is with an OpenAI-compatible API (details in the practical guide)
vllm serve Qwen/Qwen3-8B-AWQ --reasoning-parser qwen3 --max-model-len 32768
The practices for serving and production operation are consolidated in the Qwen3-8B-AWQ practical guide.
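Once the server is up, any OpenAI-compatible client can talk to it. Below is a minimal sketch that only builds the request. The base URL `http://localhost:8000/v1` assumes vLLM's default port on the same machine, and `buildChatRequest`, `PreparedRequest`, and the `max_tokens` value are hypothetical names and settings chosen for illustration.

```typescript
export interface ChatMessage {
  readonly role: "system" | "user" | "assistant";
  readonly content: string;
}

export interface PreparedRequest {
  readonly url: string;
  readonly init: { method: "POST"; headers: Record<string, string>; body: string };
}

/** Build a Chat Completions request for a vLLM server (baseUrl is an assumption). */
export function buildChatRequest(
  messages: readonly ChatMessage[],
  model = "Qwen/Qwen3-8B-AWQ",
  baseUrl = "http://localhost:8000/v1",
): PreparedRequest {
  if (messages.length === 0) throw new Error("messages must not be empty");
  return {
    // Strip trailing slashes so the path is joined exactly once.
    url: `${baseUrl.replace(/\/+$/, "")}/chat/completions`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, messages, max_tokens: 512 }),
    },
  };
}

// Usage (needs a running server, so it is left commented out):
// const req = buildChatRequest([{ role: "user", content: "Summarize AWQ in two lines." }]);
// const res = await fetch(req.url, req.init);
```

Keeping the builder separate from the network call makes it testable without a GPU, and switching to the FP8 model is just a different `model` argument.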
FP8: the leading choice if you have a capable GPU
FP8 is 8bit floating-point quantization. Qwen3's FP8 is fine-grained (block size 128) and is characterized by being close to no degradation.
- Strength: minimal accuracy degradation. Computation is fast on an FP8-capable GPU (Tensor Cores process FP8 directly).
- Premise: an FP8-capable GPU is required: Hopper (H100/H200), Ada (L4/L40S/RTX 40 series), or Blackwell. On an unsupported GPU (e.g., A100/V100), it gets slow or fails to start.
- VRAM: ≈8GB, heavier than AWQ, but fits comfortably in 24GB.
# FP8: fast and clean on a capable GPU. Weights are ≈8GB
vllm serve Qwen/Qwen3-8B-FP8 --reasoning-parser qwen3 --max-model-len 32768
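Before launching the FP8 model, it can help to check the GPU generation in code. The sketch below rests on one assumption beyond this article: that native FP8 Tensor Cores start at CUDA compute capability 8.9 (Ada), with Hopper at 9.0 and Ampere at 8.0/8.6. The function names are hypothetical, and the input string is whatever your tooling reports as the compute capability.

```typescript
/** Parse a compute capability string such as "8.9" into [major, minor]. */
export function parseComputeCap(text: string): [number, number] {
  const m = /^(\d+)\.(\d+)$/.exec(text.trim());
  if (!m) throw new Error(`unrecognized compute capability: ${text}`);
  return [Number(m[1]), Number(m[2])];
}

/**
 * Assumption: native FP8 Tensor Cores exist from compute capability 8.9 upward
 * (Ada = 8.9, Hopper = 9.0). Ampere (8.0 / 8.6) and Volta (7.0) fall below it.
 */
export function hasNativeFp8(computeCap: string): boolean {
  const [major, minor] = parseComputeCap(computeCap);
  return major > 8 || (major === 8 && minor >= 9);
}
```

A check like this is only a first filter; the startup log of the serving process remains the place to confirm which dtype and kernels are actually in use.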
🔧 AWQ or FP8? If you have an H100 or Ada GPU and speed is the top priority, FP8. If you want to "just fit it and go cheap" on a general 24GB GPU, AWQ. If VRAM is tight and you want more concurrency, 4bit AWQ has the edge; if you don't want to give up any quality at all, FP8. That's the dividing line.
GPTQ: "awaiting update" for Qwen3, generally safe to avoid
GPTQ is one-shot weight quantization using approximate second-order information, with Int4/Int8 common. In vLLM it's accelerated with the Marlin kernel.
However, the top of Qwen's official GPTQ documentation states "Attention: To be updated for Qwen3," so official GPTQ for Qwen3 is awaiting update. In addition, the Qwen2.5 family had reported known issues such as "72B-GPTQ-Int4 can't stop generating" and "32B-GPTQ-Int4 garbles on multi-GPU vLLM," with AWQ recommended as the workaround.
⚠️ Conclusion: if choosing quantization for Qwen, AWQ first. Unless there's a clear reason like "I have an existing GPTQ pipeline," there's little reason to actively choose GPTQ for Qwen3. Even if you choose it, always verify the behavior on a single machine, single GPU.
GGUF: "another world" of Mac, CPU, and local
GGUF is a llama.cpp family format. Whereas AWQ/FP8/GPTQ are "for a GPU server," GGUF is in a different arena.
- Strength: runs on both CPU and Apple Silicon (Metal). Ollama / LM Studio / llama.cpp read it directly. The definitive choice for local development, Mac, edge, single machine.
- Bits: variable. Q4_K_M is the standard sweet spot of size and quality (effectively about 4.5 to 5 bits per weight), and Q8_0 is nearly no degradation (≈8bit). Extreme Q2/Q3 variants show visible degradation.
- Weakness: high-throughput serving of many concurrent requests. That's the domain of vLLM × AWQ/FP8.
# Shortest path on Mac/local: run Qwen3-8B (GGUF) with Ollama
ollama run qwen3:8b
# To pick a specific quantization with llama.cpp (e.g., Q4_K_M)
llama-cli -hf Qwen/Qwen3-8B-GGUF:Q4_K_M -p "Explain RAG reranking in two lines"
💡 The essence of using them differently: GGUF is "on my own Mac, without sending outside, try right away." AWQ/FP8 is "on a server, many concurrent, handle fast." Align with the division of Ollama and vLLM: GGUF for development, vLLM for production.
VRAM and hardware correspondence table
| Environment | Recommended method | Reason |
|---|---|---|
| H100 / H200 / L40S (FP8-capable) + speed-first | FP8 | Fast on Tensor Cores, nearly no degradation |
| L4 / A10 / RTX 4090 (24GB-class) + max compression | AWQ 4bit | Compress to 6GB and gain concurrency |
| A100 / V100 (no FP8) | AWQ 4bit | No native FP8 support. Fit it with 4bit |
| Apple Silicon Mac | GGUF (Metal) | Runs immediately with Ollama/LM Studio |
| No GPU, CPU only | GGUF (Q4_K_M) | CPU inference means llama.cpp; effectively no other choice |
| Have existing GPTQ assets | GPTQ (verify) | Runs with Marlin but Qwen3 awaits update |
A "selection function" that removes hesitation (type-safe, testable)
"World-class" here means not flashiness but a small pure function determined only by its inputs. Pass in the environment and the priority axis, and it returns the quantization method deterministically, so it drops straight into a unit test (the discipline of type safety).
// lib/pick-quantization.ts: pure function that maps environment × priority axis to a quantization method, deterministically
export type Hardware = "fp8-gpu" | "ampere-24gb" | "older-gpu" | "apple-silicon" | "cpu-only";
export type Priority = "throughput" | "max-compression" | "no-quality-loss" | "local-dev";
export interface QuantChoice {
  readonly method: "FP8" | "AWQ" | "GGUF";
  readonly model: `Qwen/Qwen3-8B${"-FP8" | "-AWQ" | "-GGUF"}`;
  readonly runtime: "vLLM" | "Ollama/llama.cpp";
  readonly reason: string;
}
/** Uniquely determine the Qwen3-8B quantization method from hardware and priority axis. */
export function pickQuantization(hw: Hardware, priority: Priority): QuantChoice {
  // Local/CPU/Mac is a different arena: GGUF only (keep it out of the GPU-server discussion)
  if (hw === "cpu-only" || hw === "apple-silicon" || priority === "local-dev") {
    return { method: "GGUF", model: "Qwen/Qwen3-8B-GGUF", runtime: "Ollama/llama.cpp",
      reason: "CPU / Apple Metal / local single machine. Ollama, LM Studio, and llama.cpp read it directly." };
  }
  // On an FP8-capable GPU, if the priority is no degradation or top speed, choose FP8
  if (hw === "fp8-gpu" && (priority === "no-quality-loss" || priority === "throughput")) {
    return { method: "FP8", model: "Qwen/Qwen3-8B-FP8", runtime: "vLLM",
      reason: "Tensor Cores are fast on FP8-capable GPUs, and fine-grained FP8 is nearly lossless." };
  }
  // Any other GPU (including ones without FP8) or max compression: AWQ 4bit
  return { method: "AWQ", model: "Qwen/Qwen3-8B-AWQ", runtime: "vLLM",
    reason: "Max compression on a general GPU (weights ≈6GB). Activation-aware, so quality loss is small; pairs well with vLLM continuous batching." };
}
// pick-quantization.test.ts: pin the spec down as a table (stops regressions)
import { describe, it, expect } from "vitest";
import { pickQuantization } from "./pick-quantization";
describe("pickQuantization", () => {
  it("Mac is GGUF regardless of priority", () => {
    expect(pickQuantization("apple-silicon", "throughput").method).toBe("GGUF");
  });
  it("FP8-capable GPU with no-quality-loss priority picks FP8", () => {
    expect(pickQuantization("fp8-gpu", "no-quality-loss").method).toBe("FP8");
  });
  it("GPU without FP8 falls back to AWQ", () => {
    expect(pickQuantization("older-gpu", "throughput").method).toBe("AWQ");
  });
  it("24GB-class with max compression picks AWQ", () => {
    expect(pickQuantization("ampere-24gb", "max-compression").model).toBe("Qwen/Qwen3-8B-AWQ");
  });
});
Putting the selection logic into code and tests means everyone on the team reaches the same conclusion, and the branches stay maintained in one place even as environments multiply (DRY, SRP).
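One way to extend the same idea is to map the chosen method to the launch command used in this article. This is a sketch, not part of the original function: `launchCommand` and the local `Method` type are hypothetical, and the flags are copied from the commands shown in the AWQ, FP8, and GGUF sections.

```typescript
type Method = "FP8" | "AWQ" | "GGUF";

/** Map a chosen method to the launch command used in this article. */
export function launchCommand(method: Method): string {
  switch (method) {
    case "FP8":
    case "AWQ":
      return `vllm serve Qwen/Qwen3-8B-${method} --reasoning-parser qwen3 --max-model-len 32768`;
    case "GGUF":
      return "ollama run qwen3:8b";
    default: {
      // Compile-time exhaustiveness check: adding a new Method fails the build here.
      const unreachable: never = method;
      throw new Error(`unhandled method: ${String(unreachable)}`);
    }
  }
}
```

The `never` assignment in the default branch is the design point: if someone later adds a method, the compiler forces them to decide how it is launched.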
Pitfalls & best practices
- 🔴 FP8 requires a capable GPU. Putting FP8 on an unsupported GPU (A100/V100, etc.) causes slowdown or startup failure. Confirm the GPU generation with nvidia-smi and the actual dtype in the startup log.
- 🟠 GPTQ is awaiting update and has known issues for Qwen3. If you choose it, verify the behavior on a single GPU. For Qwen, default to AWQ.
- 🟠 Don't use GGUF for a high-throughput server. Many-concurrent is vLLM × AWQ/FP8. GGUF's strength is single-machine, local.
- 🟠 Don't double-quantize. Don't further quantize an already-quantized distributed model. Use the official quantized version as-is.
- 🟢 Keep the thinking-mode sampling settings with any method. Regardless of quantization, avoid greedy decoding and set presence_penalty≈1.5, as per the practical guide.
- 🟢 Evaluate quality on your own task. Rather than a general benchmark, compare AWQ and FP8 on your own prompts. If there's no difference, the lighter one (AWQ) wins.
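The sampling rule in the list above can be guarded in code so a config change does not silently reintroduce greedy decoding. In this sketch only `presence_penalty: 1.5` and "no greedy decoding" come from this article. The other preset values (temperature 0.6, top_p 0.95, top_k 20) and the 0 to 2 range are assumptions based on the Qwen3 model card recommendations, and all names are hypothetical.

```typescript
export interface SamplingParams {
  readonly temperature: number;
  readonly top_p: number;
  readonly top_k: number;
  readonly presence_penalty: number;
}

// Thinking-mode preset. presence_penalty follows the article; the other
// values are assumptions taken from the Qwen3 model card recommendations.
export const THINKING_PRESET: SamplingParams = {
  temperature: 0.6,
  top_p: 0.95,
  top_k: 20,
  presence_penalty: 1.5,
};

/** Return human-readable problems; an empty array means the settings pass. */
export function checkThinkingSampling(p: SamplingParams): string[] {
  const problems: string[] = [];
  if (p.temperature <= 0) problems.push("temperature <= 0 is greedy decoding");
  if (p.top_k === 1) problems.push("top_k = 1 is greedy decoding");
  if (p.presence_penalty < 0 || p.presence_penalty > 2) {
    problems.push("presence_penalty is outside the 0 to 2 range this sketch assumes");
  }
  return problems;
}
```

Running the check at startup or in a unit test keeps the sampling settings identical across AWQ, FP8, and GGUF deployments.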
Frequently asked questions (FAQ)
Q. So what do you recommend? A. For GPU production, AWQ (general GPU) or FP8 (capable GPU), for Mac/CPU local, GGUF. For Qwen, you basically don't need to choose GPTQ. If in doubt, put your environment into the selection function in the body.
Q. How much does quality drop with 4bit? A. AWQ is activation-aware, so in practice the drop is slight, but it is task-dependent. Rather than relying on a general benchmark, compare AWQ against FP8/FP16 on your own eval set and confirm the difference.
Q. FP8 or AWQ, which is higher quality? A. Generally FP8 has smaller degradation (8bit, fine-grained). On the other hand, AWQ is lighter (4bit) so it gains concurrency. FP8 for "don't lose a millimeter of quality," AWQ for "fit it and go cheap."
Q. GGUF's Q4_K_M or Q8_0, which? A. Q4_K_M if size-first (the standard sweet spot), Q8_0 if quality-first (nearly no degradation but about 2× the size). Choose in consultation with your Mac's memory.
Q. How much does quantization change cost? A. Lighter weights = more fit simultaneously on the same GPU = effective throughput rises and the cost per 1M output tokens drops. Always confirm the effect with a benchmark.
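The cost relationship in the last answer reduces to one line of arithmetic. The sketch below uses a hypothetical function name, and the price and throughput figures in the usage lines are placeholders for illustration, not measurements.

```typescript
/** USD per 1M output tokens = hourly GPU price / tokens generated per hour × 1e6. */
export function costPerMillionTokens(gpuUsdPerHour: number, tokensPerSecond: number): number {
  if (gpuUsdPerHour < 0 || tokensPerSecond <= 0) {
    throw new RangeError("price must be >= 0 and throughput must be > 0");
  }
  const tokensPerHour = tokensPerSecond * 3600;
  return (gpuUsdPerHour / tokensPerHour) * 1e6;
}

// Hypothetical placeholders: a $1.00/hour GPU at two aggregate throughputs.
export const baseline = costPerMillionTokens(1.0, 1000);
export const doubled = costPerMillionTokens(1.0, 2000);
```

Because throughput sits in the denominator, doubling the aggregate tokens per second on the same GPU halves the cost per 1M tokens, which is why freeing VRAM for more concurrency matters. Plug in your own benchmarked throughput rather than these placeholders.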
Conclusion
Choosing a Qwen3-8B quantization is not a matter of trend; it is a function of the environment and the priority axis.
- GPU server × speed/no-degradation → FP8 (a capable GPU is the premise).
- GPU server × max compression (24GB-class) → AWQ 4bit (Qwen's main option).
- Mac / CPU / local single machine → GGUF (Ollama, llama.cpp).
- GPTQ awaits update + has known issues for Qwen3 — basically AWQ.
- Make the final judgment with your task's evaluation and measured VRAM/throughput. Drop selection into code and tests for reproducibility.
I support you from model-quantization selection and benchmark design through vLLM production serving and cost estimation. Take a look at my GPU-inference-platform track record and get in touch. One person × generative AI: fast, cheap, and safe.
Sources / official resources
- Qwen official documentation (AWQ) / (GPTQ)
- Model cards: Qwen3-8B-AWQ / Qwen3-8B-FP8 / Qwen3-8B-GGUF
- vLLM official documentation / llama.cpp / Ollama
- Method characteristics, official support status, and VRAM/throughput depend on updates and the environment. Always confirm with primary sources and your own benchmark before implementation.