# How to choose a Qwen3-8B quantization method: deciding between AWQ, GPTQ, FP8, and GGUF by use case

> Which quantization should you run Qwen3-8B with? This guide compares AWQ, GPTQ, FP8, and GGUF by supported hardware, VRAM, throughput, and official support status. VRAM calculations and a type-safe selection function (with tests) make the decision repeatable: AWQ or FP8 for GPU production, GGUF for local use on Mac or CPU.

- Published: 2026-06-25
- Author: 友田 陽大
- Tags: Qwen, AWQ, quantization, vLLM, GGUF, FP8, self-hosting
- URL: https://tomodahinata.com/en/blog/qwen3-quantization-awq-gptq-fp8-gguf-comparison-guide
- Category: Quantized LLMs & self-hosting
- Pillar guide: https://tomodahinata.com/en/blog/qwen3-8b-awq-self-hosting-reasoning-production-guide

## Key points

- Quick conclusion: on an FP8-capable GPU (H100, Ada) where speed comes first, choose FP8; on a 24GB-class GPU where maximum compression matters, choose AWQ (4bit); for local use on Mac or CPU, choose GGUF. The Qwen official GPTQ docs are marked "To be updated for Qwen3" and there are known issues, so AWQ is the default recommendation for Qwen.
- VRAM rule of thumb (8.2B parameters; measure in your own environment): FP16≈16GB / FP8≈8GB / AWQ or GPTQ Int4≈6GB / GGUF Q4_K_M≈5GB. The lighter the weights, the more VRAM is left for the KV cache (concurrency and context length).
- AWQ is activation-aware weight quantization with small quality degradation, and it pairs well with vLLM's continuous batching, which makes it the GPU-server option. GGUF is the llama.cpp-family format that supports CPU and Apple Metal, which makes it the local/edge option. They serve different arenas.
- FP8 is fine-grained (block size 128) with almost no degradation, but it assumes an FP8-capable GPU (Hopper/Ada/Blackwell). On an unsupported GPU it runs slowly or fails to start.
- Select by rule, not by feel: a pure, unit-testable function that takes the hardware and the priority axis as input and deterministically returns the quantization method makes the choice reproducible.

---

## The goal of this article

In the [Qwen3-8B-AWQ practical guide](/blog/qwen3-8b-awq-self-hosting-reasoning-production-guide), I wrote that "with AWQ it fits on one 24GB GPU." So which of **AWQ, GPTQ, FP8, and GGUF** should you choose? That question comes up on every project, and this piece answers it by **hardware, VRAM, throughput, and official support status.**

You don't choose quantization with "4bit for now." The right answer depends on **which GPU (or CPU/Mac) you run and what you prioritize.** By the end, you should be able to **state by rule**: "for our environment, XX, because YY."

> **Reliability disclosure**: the method characteristics, commands, and official support status here are based on the [Qwen official documentation (quantization)](https://qwen.readthedocs.io/en/latest/quantization/awq.html) and each model card ([AWQ](https://huggingface.co/Qwen/Qwen3-8B-AWQ) / [FP8](https://huggingface.co/Qwen/Qwen3-8B-FP8) / [GGUF](https://huggingface.co/Qwen/Qwen3-8B-GGUF)). **VRAM and throughput are environment-dependent and require benchmarking.** The numbers here are rules of thumb; always measure in your own environment before deploying. I ran GPUs in production myself on the [video AI localization platform](/case-studies/ai-video-localization-lipsync).

---

## The 30-second conclusion: quick-reference table

| Method | Bits | Supported HW | Weight VRAM (8.2B) | Main use | Qwen3 official |
| --- | --- | --- | --- | --- | --- |
| **FP8** | 8bit float | **FP8-capable GPU** (H100/H200, Ada, Blackwell) | ≈8GB | **GPU server, speed-first, nearly no degradation** | `Qwen3-8B-FP8` |
| **AWQ** | 4bit weight | General GPU (even 24GB-class) | ≈6GB | **GPU server, max compression, single-card operation** | `Qwen3-8B-AWQ` |
| **GPTQ** | 4/8bit | GPU (Marlin kernel in vLLM) | ≈6GB (Int4) | When you have existing GPTQ assets | ⚠️ "To be updated" |
| **GGUF** | 2–8bit variable | **CPU / Apple Metal / GPU** | ≈5GB (Q4_K_M) | **Local, Mac, edge, single machine** | `Qwen3-8B-GGUF` |

**In 3 lines**:

- **Production serving on GPU** → **FP8** (fast with almost no degradation if you have a capable GPU) or **AWQ** (max compression on a general GPU).
- **Local on Mac/CPU** → **GGUF** (Ollama / LM Studio / llama.cpp).
- **GPTQ** is officially "awaiting update" for Qwen3 and has known issue reports, so **defaulting to AWQ for Qwen** is the safe choice.

---

## First: what quantization does, in one minute

LLM weights are usually stored as FP16 (2 bytes per parameter). **Quantization represents these weights in fewer bits**, cutting VRAM use and memory bandwidth so the model fits, runs faster, and costs less. The price is **slight quality degradation**, and the methods differ in how cleverly they trim.

The VRAM estimation formula is simple.

```text
Weight VRAM ≈ parameter count × (bits / 8) [bytes]
  FP16 : 8.2e9 × 2     ≈ 16.4 GB
  FP8  : 8.2e9 × 1     ≈ 8.2 GB
  4bit : 8.2e9 × 0.5   ≈ 4.1 GB (5–6 GB measured, with scales/zero-points)
```
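
The same arithmetic can be wrapped in a small helper so the estimate is reproducible. This is a minimal sketch: the function name `estimateWeightGB` is hypothetical, and it deliberately covers only the raw weights, not scales/zero-points, the KV cache, or runtime overhead.

```ts
// Rough weight-only VRAM estimate: parameters × (bits / 8) bytes.
// Ignores quantization scales/zero-points, the KV cache, and runtime overhead.
function estimateWeightGB(paramCount: number, bitsPerWeight: number): number {
  if (paramCount <= 0 || bitsPerWeight <= 0) {
    throw new RangeError("paramCount and bitsPerWeight must be positive");
  }
  const bytes = paramCount * (bitsPerWeight / 8);
  return bytes / 1e9; // decimal GB, matching the formula above
}

const QWEN3_8B_PARAMS = 8.2e9; // parameter count used throughout this article

for (const [label, bits] of [["FP16", 16], ["FP8", 8], ["4bit", 4]] as const) {
  console.log(`${label}: ${estimateWeightGB(QWEN3_8B_PARAMS, bits).toFixed(1)} GB`);
}
// FP16: 16.4 GB
// FP8: 8.2 GB
// 4bit: 4.1 GB
```

Treat the result as a lower bound; the measured footprint of a 4bit checkpoint is higher, as noted in the formula.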

> 💡 **Lighter weights = more concurrency**: the VRAM you free up goes to the **KV cache**, which determines the number of concurrent requests and the context length. So cutting the weights not only makes the model fit but also directly raises **the effective throughput of self-hosting** ([the denominator that determines cost](/blog/llama-inference-cost-optimization-self-host-vs-api)).
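
To make the KV-cache point concrete, here is a rough sketch of how many cached tokens fit once the weights are loaded. The shape values (36 layers, 8 KV heads, head dimension 128) are what I recall from the Qwen3-8B config and are an assumption to verify against `config.json`. The 2 GB overhead and the helper names are hypothetical, and real servers such as vLLM reserve memory differently.

```ts
interface KvShape {
  layers: number;       // transformer layers
  kvHeads: number;      // key/value heads (GQA)
  headDim: number;      // dimension per head
  bytesPerElem: number; // 2 for an FP16/BF16 KV cache
}

// Bytes of KV cache per token: one K and one V vector for every layer and KV head.
function kvBytesPerToken(s: KvShape): number {
  return 2 * s.layers * s.kvHeads * s.headDim * s.bytesPerElem;
}

// Tokens of KV cache that fit after weights and a fixed overhead are subtracted.
function kvTokenBudget(vramGB: number, weightGB: number, overheadGB: number, s: KvShape): number {
  const freeBytes = (vramGB - weightGB - overheadGB) * 1e9;
  return freeBytes <= 0 ? 0 : Math.floor(freeBytes / kvBytesPerToken(s));
}

// ASSUMPTION: shape values as recalled from the Qwen3-8B config; verify before relying on them.
const QWEN3_8B_KV: KvShape = { layers: 36, kvHeads: 8, headDim: 128, bytesPerElem: 2 };

// 24 GB card with an assumed 2 GB overhead: FP16 weights vs AWQ weights.
console.log("FP16:", kvTokenBudget(24, 16.4, 2, QWEN3_8B_KV), "tokens");
console.log("AWQ :", kvTokenBudget(24, 6, 2, QWEN3_8B_KV), "tokens");
```

Under these assumptions the AWQ case leaves room for roughly 108k cached tokens against roughly 38k for FP16, and that gap is where the extra concurrency comes from.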

---

## AWQ: the "main option" for a GPU server

**AWQ (Activation-aware Weight Quantization)** is **4bit weight quantization** that uses activation statistics to protect the weights that most affect the output. The official docs report **about 3× memory reduction and about 3× speedup vs. FP16** with the AutoAWQ implementation.

- **Strength**: **small quality degradation** at 4bit. It pairs well with vLLM's continuous batching and runs practically on **one 24GB-class GPU.** Qwen officially distributes `Qwen3-8B-AWQ`.
- **Suits**: teams that want to **serve in production with maximum compression** on a general GPU such as L4 / A10 / RTX 4090.

```bash
# AWQ: serve as-is with an OpenAI-compatible API (see the practical guide for details)
vllm serve Qwen/Qwen3-8B-AWQ --reasoning-parser qwen3 --max-model-len 32768
```

The practices for serving and production operation are consolidated in the [Qwen3-8B-AWQ practical guide](/blog/qwen3-8b-awq-self-hosting-reasoning-production-guide).
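
Once the server is up, any OpenAI-compatible client can talk to it. The sketch below builds a chat request and posts it with `fetch`. It assumes the server listens on `http://localhost:8000` (vLLM's default port). The `temperature` and `top_p` values are what I recall from the Qwen3 model card for thinking mode, so check them there. The function names are hypothetical.

```ts
interface ChatRequest {
  model: string;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  temperature: number;
  top_p: number;
  presence_penalty: number;
  max_tokens: number;
}

// Build the JSON body for POST /v1/chat/completions.
function buildChatRequest(prompt: string, model = "Qwen/Qwen3-8B-AWQ"): ChatRequest {
  return {
    model,
    messages: [{ role: "user", content: prompt }],
    temperature: 0.6,      // ASSUMPTION: non-greedy thinking-mode value; confirm on the model card
    top_p: 0.95,           // ASSUMPTION: same source as above
    presence_penalty: 1.5, // the value this article cites for thinking-mode sampling
    max_tokens: 1024,      // illustrative cap
  };
}

// Send the request. ASSUMPTION: vLLM is reachable at baseUrl.
async function chat(prompt: string, baseUrl = "http://localhost:8000"): Promise<string> {
  const res = await fetch(`${baseUrl}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildChatRequest(prompt)),
  });
  if (!res.ok) throw new Error(`vLLM returned HTTP ${res.status}`);
  const data = (await res.json()) as { choices: { message: { content: string } }[] };
  return data.choices[0].message.content;
}
```

Keeping the request builder separate from the network call means the sampling settings can be unit-tested without a running GPU server.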

---

## FP8: the leading choice if you have a capable GPU

**FP8** is 8bit floating-point quantization. Qwen3's FP8 is **fine-grained (block size 128)** and shows **almost no degradation.**

- **Strength**: minimal accuracy degradation. **Computation is fast on an FP8-capable GPU** (Tensor Cores process FP8 directly).
- **Prerequisite**: **an FP8-capable GPU is required**: Hopper (H100/H200), Ada (L4/L40S/RTX 40 series), or Blackwell. On an unsupported GPU (e.g., A100/V100), it **runs slowly or fails to start.**
- **VRAM**: ≈8GB, heavier than AWQ, but it **fits comfortably in 24GB.**

```bash
# FP8: fast and clean on a capable GPU. Weights are ≈8GB
vllm serve Qwen/Qwen3-8B-FP8 --reasoning-parser qwen3 --max-model-len 32768
```

> 🔧 **AWQ or FP8?** **If you have an H100 or Ada GPU and speed is the top priority, choose FP8.** **If you want to "just fit it and run cheaply" on a general 24GB GPU, choose AWQ.** When VRAM is tight and you want more concurrency, 4bit AWQ has the edge; when you don't want to give up any quality, choose FP8.
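
Whether a card counts as "FP8-capable" can be checked from its CUDA compute capability: Ada reports 8.9 and Hopper 9.0, while A100 (8.0), A10 (8.6), and V100 (7.0) fall below that line. The sketch below encodes that threshold. The function names are hypothetical, and it assumes you obtained the value yourself, for example from `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` on a recent driver.

```ts
interface ComputeCapability {
  major: number;
  minor: number;
}

// Parse a "major.minor" string such as "8.9" or "9.0".
function parseComputeCapability(text: string): ComputeCapability {
  const match = /^(\d+)\.(\d+)$/.exec(text.trim());
  if (!match) throw new Error(`unrecognized compute capability: "${text}"`);
  return { major: Number(match[1]), minor: Number(match[2]) };
}

// FP8 Tensor Core support starts at compute capability 8.9 (Ada) and continues upward.
function supportsFp8(computeCapability: string): boolean {
  const { major, minor } = parseComputeCapability(computeCapability);
  return major > 8 || (major === 8 && minor >= 9);
}

console.log(supportsFp8("8.0")); // A100 → false
console.log(supportsFp8("8.9")); // L4 / L40S / RTX 40 series → true
console.log(supportsFp8("9.0")); // H100 / H200 → true
```

Comparing major and minor as integers avoids floating-point surprises that a plain `parseFloat` comparison could introduce.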

---

## GPTQ: "awaiting update" for Qwen3, usually best avoided

**GPTQ** is one-shot weight quantization that uses approximate second-order information; Int4 and Int8 are the common variants. vLLM accelerates it with the **Marlin kernel.**

However, Qwen's [GPTQ official documentation](https://qwen.readthedocs.io/en/latest/quantization/gptq.html) states at the top **"Attention: To be updated for Qwen3,"** so **official GPTQ support for Qwen3 is still pending.** In addition, the Qwen2.5 family had reported issues such as **"72B-GPTQ-Int4 can't stop generating"** and **"32B-GPTQ-Int4 garbles on multi-GPU vLLM,"** and **AWQ is suggested as the workaround.**

> ⚠️ **Conclusion**: **when choosing quantization for Qwen, try AWQ first.** Unless you have a clear reason such as "I have an existing GPTQ pipeline," there is little reason to pick GPTQ for Qwen3. If you do choose it, always verify behavior on **a single machine with a single GPU.**

---

## GGUF: a different world for Mac, CPU, and local use

**GGUF** is the **llama.cpp**-family format. Whereas AWQ/FP8/GPTQ are "for a GPU server," GGUF plays in **a different arena.**

- **Strength**: **runs on both CPU and Apple Silicon (Metal).** **Ollama / LM Studio / llama.cpp** read it directly. It is the definitive choice for local development, Mac, edge, and single-machine use.
- **Bits**: variable. **Q4_K_M is the standard sweet spot of size and quality** (effective ~4.5bit), and **Q8_0 shows almost no degradation (≈8bit).** The extreme Q2/Q3 variants degrade visibly.
- **Weakness**: it is **not suited to high-throughput serving of many concurrent requests.** That is the domain of vLLM × AWQ/FP8.

```bash
# Shortest path on Mac/local: run Qwen3-8B (GGUF) with Ollama
ollama run qwen3:8b
# To pick a specific quantization with llama.cpp (e.g., Q4_K_M)
llama-cli -hf Qwen/Qwen3-8B-GGUF:Q4_K_M -p "Explain RAG re-ranking in two lines"
```

> 💡 **How to split them**: GGUF is for "**trying it right away on my own Mac, without sending data outside.**" AWQ/FP8 is for "**handling many concurrent requests fast on a server.**" This matches [the division of Ollama and vLLM](/blog/vllm-llama-self-hosting-production-inference-server#よくある質問faq): GGUF for development, vLLM for production.
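
For local development, the same model can also be called from code through Ollama's REST API. The sketch below assumes Ollama is running on its default port (`http://localhost:11434`) and that `qwen3:8b` has already been pulled. The function names are hypothetical.

```ts
interface OllamaChatRequest {
  model: string;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  stream: boolean;
}

// Build the JSON body for POST /api/chat; stream=false returns one complete JSON response.
function buildOllamaRequest(prompt: string, model = "qwen3:8b"): OllamaChatRequest {
  return { model, messages: [{ role: "user", content: prompt }], stream: false };
}

// ASSUMPTION: a local Ollama daemon is reachable at baseUrl.
async function askLocal(prompt: string, baseUrl = "http://localhost:11434"): Promise<string> {
  const res = await fetch(`${baseUrl}/api/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildOllamaRequest(prompt)),
  });
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  const data = (await res.json()) as { message: { content: string } };
  return data.message.content;
}
```

Because only the base URL, path, and body shape differ from the server-side client, application code can hide both behind one interface and switch between local GGUF and production vLLM per environment.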

---

## VRAM and hardware correspondence table

| Environment | Recommended method | Reason |
| --- | --- | --- |
| H100 / H200 / L40S (FP8-capable) + speed-first | **FP8** | Fast on Tensor Cores, almost no degradation |
| L4 / A10 / RTX 4090 (24GB-class) + max compression | **AWQ 4bit** | Compresses weights to ≈6GB and frees VRAM for concurrency |
| A100 / V100 (no FP8) | **AWQ 4bit** | No FP8 hardware support, so fit it with 4bit |
| Apple Silicon Mac | **GGUF (Metal)** | Runs immediately with Ollama/LM Studio |
| No GPU, CPU only | **GGUF (Q4_K_M)** | CPU inference means llama.cpp; there is no other practical choice |
| Have existing GPTQ assets | GPTQ (verify) | Runs with Marlin, but Qwen3 support awaits an update |

---

## A "selection function" so you never hesitate (type-safe, testable)

What makes this "world-class" is not flashiness but **a small pure function determined only by its inputs.** Pass in the environment and the priority axis and it returns the quantization method **deterministically**, so it drops straight into a unit test ([the discipline of type safety](/blog/typescript-type-safety-discipline-zod-nevererror-no-any)).

```ts
// lib/pick-quantization.ts: pure function that deterministically maps environment × priority to a quantization method
export type Hardware = "fp8-gpu" | "ampere-24gb" | "older-gpu" | "apple-silicon" | "cpu-only";
export type Priority = "throughput" | "max-compression" | "no-quality-loss" | "local-dev";

export interface QuantChoice {
  readonly method: "FP8" | "AWQ" | "GGUF";
  readonly model: `Qwen/Qwen3-8B${"-FP8" | "-AWQ" | "-GGUF"}`;
  readonly runtime: "vLLM" | "Ollama/llama.cpp";
  readonly reason: string;
}

/** Uniquely determine the Qwen3-8B quantization method from the hardware and the priority axis. */
export function pickQuantization(hw: Hardware, priority: Priority): QuantChoice {
  // Local/CPU/Mac is a different arena: GGUF only (kept out of the GPU-server discussion)
  if (hw === "cpu-only" || hw === "apple-silicon" || priority === "local-dev") {
    return { method: "GGUF", model: "Qwen/Qwen3-8B-GGUF", runtime: "Ollama/llama.cpp",
      reason: "CPU / Apple Metal / local single machine. Ollama, LM Studio, and llama.cpp read it directly." };
  }
  // On an FP8-capable GPU, choose FP8 for "no quality loss" or "speed first"
  if (hw === "fp8-gpu" && (priority === "no-quality-loss" || priority === "throughput")) {
    return { method: "FP8", model: "Qwen/Qwen3-8B-FP8", runtime: "vLLM",
      reason: "Tensor Cores run FP8 fast on a capable GPU, and fine-grained FP8 has almost no degradation." };
  }
  // Any other GPU (including those without FP8) or max compression → AWQ 4bit
  return { method: "AWQ", model: "Qwen/Qwen3-8B-AWQ", runtime: "vLLM",
    reason: "Max compression on a general GPU (weights ≈6GB). Activation-aware, so quality loss is small, and it pairs well with vLLM continuous batching." };
}
```

```ts
// pick-quantization.test.ts: pin the spec down as a table (stops regressions)
import { describe, it, expect } from "vitest";
import { pickQuantization } from "./pick-quantization";

describe("pickQuantization", () => {
  it("picks GGUF on a Mac regardless of priority", () => {
    expect(pickQuantization("apple-silicon", "throughput").method).toBe("GGUF");
  });
  it("picks FP8 on an FP8-capable GPU when no quality loss is the priority", () => {
    expect(pickQuantization("fp8-gpu", "no-quality-loss").method).toBe("FP8");
  });
  it("falls back to AWQ on a GPU without FP8", () => {
    expect(pickQuantization("older-gpu", "throughput").method).toBe("AWQ");
  });
  it("picks AWQ on a 24GB-class GPU for max compression", () => {
    expect(pickQuantization("ampere-24gb", "max-compression").model).toBe("Qwen/Qwen3-8B-AWQ");
  });
});
```

Putting the selection **into code and tests** means everyone on the team reaches the same conclusion, and the branching stays in one place as environments multiply (DRY, SRP).
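
One way to review every branch at once is to print the full decision table. The helper below is a generic sketch that enumerates every hardware × priority pair for any picker function. `decisionMatrix` and `demoPick` are hypothetical names, and `demoPick` is only a stand-in so the snippet runs by itself; in a real project you would pass `pickQuantization`.

```ts
// Enumerate every (hardware, priority) pair and record what the picker returns.
function decisionMatrix<H extends string, P extends string, R>(
  hardwares: readonly H[],
  priorities: readonly P[],
  pick: (hw: H, priority: P) => R,
): { hw: H; priority: P; result: R }[] {
  const rows: { hw: H; priority: P; result: R }[] = [];
  for (const hw of hardwares) {
    for (const priority of priorities) {
      rows.push({ hw, priority, result: pick(hw, priority) });
    }
  }
  return rows;
}

// Stand-in picker (ASSUMPTION: illustration only; it is not the article's full function).
const demoPick = (hw: "fp8-gpu" | "cpu-only", priority: "throughput" | "local-dev"): string =>
  hw === "cpu-only" || priority === "local-dev" ? "GGUF" : "FP8";

const rows = decisionMatrix(
  ["fp8-gpu", "cpu-only"] as const,
  ["throughput", "local-dev"] as const,
  demoPick,
);
for (const row of rows) {
  console.log(`${row.hw} × ${row.priority} → ${row.result}`);
}
```

Snapshotting this table in a test makes any change to the selection rules show up as an explicit diff during review.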

---

## Pitfalls & best practices

- 🔴 **FP8 requires a capable GPU.** Running FP8 on an unsupported GPU (A100/V100, etc.) causes slowdowns or startup failure. Confirm the GPU generation with `nvidia-smi` and the actual dtype in the startup log.
- 🟠 **GPTQ is awaiting an update for Qwen3 and has known issues.** If you choose it, verify behavior on a single GPU. For Qwen, default to AWQ.
- 🟠 **Don't use GGUF for a high-throughput server.** High concurrency is the job of vLLM × AWQ/FP8. GGUF's strength is local, single-machine use.
- 🟠 **Don't double-quantize.** Don't quantize an already-quantized distributed model again. Use the official quantized version as-is.
- 🟢 **Keep the thinking-mode sampling settings with every method.** Regardless of quantization, avoid greedy decoding and use `presence_penalty≈1.5`, as described in the [practical guide](/blog/qwen3-8b-awq-self-hosting-reasoning-production-guide#ハマりどころ--公式ベストプラクティス).
- 🟢 **Evaluate quality on your own task.** Skip general benchmarks and compare AWQ and FP8 **on your prompts.** If there's no difference, the lighter one (AWQ) wins.
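
The last point can be reduced to a small scoring helper. This is a minimal sketch under simple assumptions: each eval case has an expected substring, outputs were collected beforehand from each quantized model, and a 2-point margin counts as "no difference." The names `passRate` and `preferLighter` and the margin are hypothetical, and real evaluations often need task-specific graders.

```ts
interface EvalCase {
  prompt: string;
  expected: string; // substring that a correct answer must contain
}

// Fraction of cases whose output contains the expected answer (case-insensitive).
function passRate(cases: readonly EvalCase[], outputs: readonly string[]): number {
  if (cases.length === 0 || cases.length !== outputs.length) {
    throw new Error("cases and outputs must be non-empty and the same length");
  }
  let passed = 0;
  cases.forEach((c, i) => {
    if (outputs[i].toLowerCase().includes(c.expected.toLowerCase())) passed++;
  });
  return passed / cases.length;
}

// Prefer the lighter method unless the heavier one wins by more than `margin`.
function preferLighter(lightRate: number, heavyRate: number, margin = 0.02): "lighter" | "heavier" {
  return heavyRate - lightRate > margin ? "heavier" : "lighter";
}

// Illustrative data only; replace with outputs from your own AWQ and FP8 runs.
const cases: EvalCase[] = [
  { prompt: "Capital of Japan?", expected: "Tokyo" },
  { prompt: "2 + 2 = ?", expected: "4" },
];
const awqRate = passRate(cases, ["It is Tokyo.", "The answer is 4."]);
const fp8Rate = passRate(cases, ["Tokyo", "4"]);
console.log(preferLighter(awqRate, fp8Rate)); // both score 1.0, so "lighter"
```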

---

## Frequently asked questions (FAQ)

**Q. So what do you recommend?**
A. **For GPU production, AWQ (general GPU) or FP8 (capable GPU)**; **for local use on Mac/CPU, GGUF.** For Qwen, you generally don't need to choose GPTQ. If in doubt, feed your environment into the selection function in the body.

**Q. How much does capability drop with 4bit?**
A. AWQ is activation-aware, so the drop is **slight in practice.** But it is **task-dependent.** Skip general benchmarks, compare AWQ against FP8/FP16 on **your own eval set**, and confirm the difference.

**Q. FP8 or AWQ, which is higher quality?**
A. Generally **FP8 degrades less** (8bit, fine-grained). AWQ, on the other hand, is **lighter (4bit)**, so it gains concurrency. Choose FP8 when you don't want to give up any quality, AWQ when you want to fit it and run cheaply.

**Q. GGUF's Q4_K_M or Q8_0, which?**
A. **Q4_K_M if size comes first** (the standard sweet spot), **Q8_0 if quality comes first** (almost no degradation but about 2× the size). Choose according to your Mac's memory.

**Q. How much does quantization change cost?**
A. Lighter weights = **more requests fit simultaneously on the same GPU** = effective throughput rises and [the cost per 1M output tokens](/blog/llama-inference-cost-optimization-self-host-vs-api#検証可能なコスト計算機純粋関数) drops. Always confirm the effect with a benchmark.
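
The relationship between throughput and cost is simple arithmetic, sketched below. The hourly price and token rates are made-up placeholders rather than measurements, and `costPerMillionTokens` is a hypothetical helper name.

```ts
// USD per 1M output tokens = hourly GPU price ÷ tokens generated per hour × 1,000,000.
function costPerMillionTokens(gpuUsdPerHour: number, tokensPerSecond: number): number {
  if (gpuUsdPerHour < 0 || tokensPerSecond <= 0) {
    throw new RangeError("price must be non-negative and throughput must be positive");
  }
  const tokensPerHour = tokensPerSecond * 3600;
  return (gpuUsdPerHour / tokensPerHour) * 1e6;
}

// ASSUMPTION: placeholder numbers for illustration; substitute your own benchmark results.
const gpuUsdPerHour = 1.0;
const before = costPerMillionTokens(gpuUsdPerHour, 500);  // aggregate tokens/s before quantization
const after = costPerMillionTokens(gpuUsdPerHour, 1500);  // aggregate tokens/s with more concurrency
console.log(`before: $${before.toFixed(3)} / 1M tokens, after: $${after.toFixed(3)} / 1M tokens`);
```

With the GPU price fixed, cost falls in direct proportion to measured aggregate throughput, which is why the benchmark matters more than the nominal bit width.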

---

## Conclusion

Choosing a Qwen3-8B quantization method is not about trends; it is **a function of the environment and the priority axis.**

1. **GPU server × speed/no-degradation** → **FP8** (requires a capable GPU).
2. **GPU server × max compression (24GB-class)** → **AWQ 4bit** (Qwen's main option).
3. **Mac / CPU / local single machine** → **GGUF** (Ollama, llama.cpp).
4. **GPTQ** is awaiting an update for Qwen3 and has known issues, so **default to AWQ.**
5. Make the final call with **your task's evaluation and measured VRAM/throughput.** Put the selection into code and tests for reproducibility.

> I can support you from model-quantization selection and benchmark design through vLLM production serving and cost estimation. See my GPU inference platform [track record](/case-studies/ai-video-localization-lipsync) and get in touch. **One person × generative AI**: fast, cheap, and safe.

### Sources / official resources

- [Qwen official documentation (AWQ)](https://qwen.readthedocs.io/en/latest/quantization/awq.html) / [(GPTQ)](https://qwen.readthedocs.io/en/latest/quantization/gptq.html)
- Model cards: [Qwen3-8B-AWQ](https://huggingface.co/Qwen/Qwen3-8B-AWQ) / [Qwen3-8B-FP8](https://huggingface.co/Qwen/Qwen3-8B-FP8) / [Qwen3-8B-GGUF](https://huggingface.co/Qwen/Qwen3-8B-GGUF)
- [vLLM official documentation](https://docs.vllm.ai/) / [llama.cpp](https://github.com/ggml-org/llama.cpp) / [Ollama](https://ollama.com/)

* Method characteristics, official support status, and VRAM/throughput change with updates and with the environment. Always confirm with primary sources and your own benchmark before deploying.
