友田 陽大
Quantized LLMs & self-hosting
Generative AI
LLM
vLLM
Self-hosting
Cost optimization
Type safety

The serving economics of quantization: AWQ vs FP8, and how the KV cache and VRAM budget decide your production cost

LLM quantization methods (AWQ / GPTQ / FP8 / GGUF) tend to be compared on accuracy alone, but production serving cost is decided by how you split a single GPU's VRAM budget between the model weights and the KV cache. Quantization shrinks the weights and lets you spend the freed space on concurrency and long context. This article explains that idea as serving economics, with VRAM-budget estimation code and lessons from running a quantized model in production on a T4 GPU.


Let me state the conclusion first. In production, the choice of LLM quantization method (AWQ / GPTQ / FP8 / GGUF) is decided less by "which loses the least accuracy" than by serving economics: how you split a single GPU's VRAM budget between the model weights and the KV cache. Compressing the weights frees VRAM that you can spend on the KV cache, that is, on concurrency and context length. This directly governs how many requests one GPU can handle, and therefore the cost per token. Choosing a quantization method on accuracy alone misjudges production cost efficiency.

This article draws on my experience running Qwen3-8B-AWQ in production on vLLM on a Tesla T4 GPU. It reframes quantization as "serving economics" and works through it with VRAM-budget estimation code. For an accuracy comparison of the quantization methods themselves, see the Qwen3 quantization comparison (AWQ/GPTQ/FP8/GGUF); for the break-even between API usage and self-hosting, see generative-AI cost and break-even.


1. A GPU's VRAM is fought over by "weights" and "KV cache"

When you run an LLM on a GPU, VRAM (GPU memory) is used mainly for two things.

| VRAM use | Content | How its size is determined |
| --- | --- | --- |
| Model weights | The model's parameters themselves | Parameter count × bit width |
| KV cache | A working area holding the context during generation | Context length × concurrency × model configuration |

These two compete for the limited VRAM. If the weights are large, less is left for the KV cache, so both the number of requests you can serve concurrently and the context length you can support shrink. Conversely, if quantization shrinks the weights, you can spend the freed space on the KV cache: more concurrency, hence higher throughput, hence lower per-token cost.

This is the core of "serving economics." Quantization is not merely a memory-saving technique; it is a lever that decides how many processed tokens you get out of a GPU's fixed cost.
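The trade-off between context length and concurrency can be made concrete with a small sketch. It assumes the KV cache holds a fixed total number of tokens and that, in the worst case, every request grows to its full length. The function name and the capacity figure are illustrative assumptions, not measured values.

```typescript
// Sketch: how a fixed KV-cache token capacity trades context length against concurrency.
// `kvTokenCapacity` is the total number of tokens the KV cache can hold (an input you
// measure or estimate); the capacity used in the example is hypothetical.
export function maxConcurrentSequences(
  kvTokenCapacity: number,
  tokensPerSequence: number,
): number {
  if (kvTokenCapacity <= 0 || tokensPerSequence <= 0) return 0;
  // Worst case: every sequence grows to its full length (prompt + generated tokens).
  return Math.floor(kvTokenCapacity / tokensPerSequence);
}

// Same capacity, different context lengths: longer contexts leave room for fewer requests.
const capacity = 64000; // hypothetical KV capacity in tokens
for (const contextLength of [2048, 8192, 32768]) {
  const concurrent = maxConcurrentSequences(capacity, contextLength);
  console.log(`${contextLength} tokens/request -> ${concurrent} concurrent`);
}
```

With this hypothetical capacity, quadrupling the context length from 2,048 to 8,192 tokens cuts worst-case concurrency from 31 to 7. A real scheduler allocates cache incrementally, so average-case concurrency is usually higher than this bound.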


2. Estimating the VRAM budget

Let's put numbers on this contest. The pure function below takes every assumption as an explicit argument and splits the VRAM budget into "weights" and "KV cache."

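/**
 * Pure function that splits a single GPU's VRAM budget into "weights" and "KV cache"
 * and estimates how much headroom quantization creates for concurrency and context length.
 * Every assumption (bit width, KV bytes per token, overhead) is passed in as an argument,
 * which keeps the estimate reproducible and easy to test.
 */
interface VramBudgetInputs {
  readonly paramsBillions: number;        // parameter count, in billions
  readonly weightBitsPerParam: number;    // bits per weight (AWQ=4, FP8=8, FP16=16)
  readonly gpuVramGiB: number;            // VRAM on the GPU (e.g. T4=16, L4=24, H100=80)
  readonly kvBytesPerToken: number;       // KV-cache bytes per token (depends on model configuration)
  readonly runtimeOverheadGiB: number;    // fixed overhead for the runtime and activations
}

interface VramBudget {
  readonly weightGiB: number;
  readonly kvBudgetGiB: number;           // what is left for the KV cache
  readonly maxKvTokens: number;           // total tokens that fit (≈ context length × concurrency)
  readonly fits: boolean;                 // whether the weights fit at all
}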

const BYTES_PER_GIB = 1024 ** 3;

export function estimateVramBudget(input: VramBudgetInputs): VramBudget {
  const weightBytes = input.paramsBillions * 1e9 * (input.weightBitsPerParam / 8);
  const weightGiB = weightBytes / BYTES_PER_GIB;

  const kvBudgetGiB = input.gpuVramGiB - weightGiB - input.runtimeOverheadGiB;
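  // Guard against a non-positive kvBytesPerToken, which would otherwise yield Infinity.
  const maxKvTokens =
    kvBudgetGiB > 0 && input.kvBytesPerToken > 0
      ? Math.floor((kvBudgetGiB * BYTES_PER_GIB) / input.kvBytesPerToken)
      : 0;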

  return {
    weightGiB,
    kvBudgetGiB: Math.max(0, kvBudgetGiB),
    maxKvTokens,
    fits: kvBudgetGiB > 0,
  };
}

The estimate shows, for example, that an 8B model in FP16 (16-bit) needs about 16 GB (roughly 14.9 GiB) for the weights alone, so it essentially does not fit on a 16 GB-class GPU such as the T4 once runtime overhead is counted. Switching to AWQ (4-bit) shrinks the weights to about a quarter, so the same GPU holds the model and still has room for the KV cache. This is exactly why I ran Qwen on a T4 with AWQ (4-bit quantization): to "fit a model large enough to reason onto a 16 GB GPU and run it at a fixed cost." (The T4 has no native FP8 support, and as an older GPU it also needs care over settings such as the attention backend.)
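The arithmetic behind that comparison is short enough to spell out. The sketch below counts only the raw weight bytes: 8 × 10^9 parameters at 2 bytes each is 16 GB, or about 14.9 GiB. The function name is illustrative. Real 4-bit checkpoints come out somewhat larger than the figure here, because they also store quantization scales and typically leave some layers unquantized.

```typescript
// Sketch: raw weight footprint of a model at different bit widths.
// Counts parameter bytes only; quantization metadata and unquantized layers are ignored.
const GIB = 1024 ** 3;

export function weightFootprintGiB(paramsBillions: number, bitsPerParam: number): number {
  return (paramsBillions * 1e9 * (bitsPerParam / 8)) / GIB;
}

const formats = [
  { name: "FP16", bits: 16 },
  { name: "FP8", bits: 8 },
  { name: "AWQ (4-bit)", bits: 4 },
];

for (const { name, bits } of formats) {
  // For an 8B model: about 14.90, 7.45, and 3.73 GiB respectively.
  console.log(`${name}: ${weightFootprintGiB(8, bits).toFixed(2)} GiB`);
}
```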

Note: the estimate above is a first-order approximation of weights and KV cache. In reality it varies with activations, reserved memory, and whether KV-cache quantization is used. kvBytesPerToken changes greatly with the model's layer count, hidden dimension, and attention-head configuration (GQA, etc.), so always calibrate with the measured value for the target model.
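As a starting point before measuring, kvBytesPerToken can be derived from the model's attention shape. The sketch below assumes a standard transformer decoder that caches one key vector and one value vector per layer per KV head. The field names and the example shapes are hypothetical, so read the real values from the target model's configuration and treat the result as an estimate to calibrate.

```typescript
// Sketch: per-token KV-cache size for a standard decoder with MHA or GQA attention.
// Field names are hypothetical; take the actual values from the target model's config.
interface AttentionShape {
  readonly numLayers: number;         // transformer decoder layers
  readonly numKvHeads: number;        // key/value heads (fewer than query heads under GQA)
  readonly headDim: number;           // dimension of each attention head
  readonly kvBytesPerElement: number; // 2 for an FP16/BF16 cache, 1 for an 8-bit cache
}

export function kvBytesPerToken(shape: AttentionShape): number {
  // Factor 2: one key vector and one value vector per layer per KV head.
  return 2 * shape.numLayers * shape.numKvHeads * shape.headDim * shape.kvBytesPerElement;
}

// Illustrative shapes (assumed for the example, not read from a specific checkpoint).
const gqa: AttentionShape = { numLayers: 36, numKvHeads: 8, headDim: 128, kvBytesPerElement: 2 };
const mha: AttentionShape = { ...gqa, numKvHeads: 32 };

console.log(kvBytesPerToken(gqa)); // 147456 bytes (144 KiB) per token
console.log(kvBytesPerToken(mha)); // 589824 bytes (576 KiB) per token
```

With these assumed shapes, cutting the KV heads from 32 to 8 makes the per-token cache four times smaller, which is why GQA matters as much to the KV budget as the weight format does.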


3. AWQ vs FP8 vs GGUF: choosing from the serving perspective

Lining up quantization methods by "serving economics" rather than "accuracy," we can organize them like this.

| Method | Weight compression | Compute throughput | Hardware requirement | Best-suited serving scenario |
| --- | --- | --- | --- | --- |
| AWQ (4-bit weight-only) | Large (~1/4 vs FP16) | Helps reduce decode memory bandwidth | Possible even on older GPUs (T4, etc.) | Tight VRAM, older GPUs, fitting a large model on one card |
| GPTQ (4-bit weight-only) | Large | Same family as AWQ | Possible even on older GPUs | Same as AWQ (different calibration method) |
| FP8 (8-bit) | Medium (~1/2 vs FP16) | High (native FP8 compute) | Hopper/Ada generation and later required | High throughput while preserving accuracy on newer GPUs |
| GGUF (llama.cpp family) | Variable (multi-bit widths) | Tends to be lower than vLLM | Strong at CPU+GPU offload | Local, edge, no-GPU/small-VRAM |

The practical guidance is as follows.

  • Tight VRAM / older GPU (T4, etc.) / want to fit a large model on one card: AWQ (or GPTQ). Shrink the weights greatly at 4-bit and maximize KV-cache headroom.
  • Newer GPU (L4 / L40S / H100, etc.), high throughput while limiting accuracy degradation: FP8. Native compute is fast and the accuracy falloff is gentle. But it compresses the weights less than AWQ, so it can be a disadvantage in a configuration where VRAM is right at the edge.
  • No GPU / edge / personal use: GGUF (llama.cpp). It can offload across CPU and GPU and is flexible. But for maximum server-side throughput it tends to lose to vLLM + AWQ/FP8.

The optimal quantization method depends on which GPU generation you use. Choosing FP8 on a T4, which does not support it, is pointless, and deliberately using 4-bit weight-only quantization on an H100 can leave its compute performance underused.
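The guidance above can be written down as a small decision rule. This is only a sketch of the rule of thumb from this section: the type names, the three GPU classes, and the vramTight flag are assumptions introduced for illustration, and a real decision should also weigh the workload's accuracy tolerance.

```typescript
// Sketch: the section's rule of thumb as a pure function. All names are hypothetical.
type GpuClass = "none" | "no-native-fp8" | "native-fp8"; // e.g. CPU only, T4, Ada/Hopper
type Suggestion = "GGUF" | "AWQ or GPTQ" | "FP8";

interface ServingContext {
  readonly gpu: GpuClass;
  // True when 8-bit weights would leave too little VRAM for the KV cache you need.
  readonly vramTight: boolean;
}

export function suggestQuantization(ctx: ServingContext): Suggestion {
  if (ctx.gpu === "none") return "GGUF";                 // CPU or CPU+GPU offload
  if (ctx.gpu === "no-native-fp8") return "AWQ or GPTQ"; // older GPUs: 4-bit weight-only
  // Newer GPUs: prefer FP8 unless VRAM headroom forces the smaller 4-bit weights.
  return ctx.vramTight ? "AWQ or GPTQ" : "FP8";
}

console.log(suggestQuantization({ gpu: "no-native-fp8", vramTight: true })); // AWQ or GPTQ
console.log(suggestQuantization({ gpu: "native-fp8", vramTight: false }));   // FP8
```

The vramTight flag is best derived from a VRAM-budget estimate for the target model and context length rather than set by hand.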


4. Decode is "memory-bandwidth-bound" — that's why quantization helps speed

Finally, why does quantization speed things up as well as save memory? LLM inference has two phases:

  • Prefill (processing the input) — tends to be compute-bound.
  • Decode (generating one token at a time) — tends to be memory-bandwidth-bound.

In decode, the model weights are read from VRAM at every step that generates a token. The byte count of the weights therefore tends to be the bottleneck for decode speed. If quantization reduces that byte count, less data is read per step and decode gets faster. This is how quantization delivers memory savings and throughput gains at the same time.
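A rough ceiling follows directly from this: if every decode step must read all the weights once, a single sequence cannot generate tokens faster than memory bandwidth divided by weight bytes. The sketch below assumes exactly that and ignores KV-cache reads, dequantization cost, and kernel efficiency, so real numbers will be lower. The function name is illustrative, and the 320 GB/s figure is an assumed nominal bandwidth for a T4-class GPU.

```typescript
// Sketch: bandwidth-bound ceiling on decode speed for a single sequence.
// Assumes each decode step reads all weights once; everything else is ignored.
export function decodeTokensPerSecCeiling(
  memoryBandwidthGBps: number,
  weightBytes: number,
): number {
  if (memoryBandwidthGBps <= 0 || weightBytes <= 0) return 0;
  return (memoryBandwidthGBps * 1e9) / weightBytes;
}

const bandwidthGBps = 320; // assumed nominal figure; substitute your GPU's value
const params = 8e9;        // 8B parameters

console.log(decodeTokensPerSecCeiling(bandwidthGBps, params * 2));   // FP16: 20 tokens/s
console.log(decodeTokensPerSecCeiling(bandwidthGBps, params * 0.5)); // 4-bit: 80 tokens/s
```

Under these assumptions, a quarter of the weight bytes gives four times the per-sequence ceiling. Batching raises aggregate throughput further, because one read of the weights serves every sequence in the batch.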

Per-token serving cost is roughly GPU cost per hour ÷ tokens processed per hour (throughput in tokens/sec × 3,600). Quantization raises throughput on two fronts: (1) smaller weights leave more KV cache and therefore more concurrency, and (2) less memory traffic in decode speeds up each request. Both lower per-token cost. Weighing "accuracy drops a little but cost drops a lot" against the workload's accuracy tolerance is the core judgment of serving economics.
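The cost relation is simple enough to express as a function. In the sketch below, the hourly price and both throughput figures are hypothetical placeholders chosen to show the shape of the calculation, not measurements. It also assumes the GPU is fully utilized for the whole hour.

```typescript
// Sketch: per-token serving cost from GPU price and sustained throughput.
// Assumes full utilization; idle time makes the effective cost higher.
export function costPerMillionTokens(gpuCostPerHour: number, tokensPerSec: number): number {
  if (tokensPerSec <= 0) return Infinity;
  const tokensPerHour = tokensPerSec * 3600;
  return (gpuCostPerHour / tokensPerHour) * 1e6;
}

const hourlyPrice = 0.5; // hypothetical GPU price per hour, in any currency
const before = costPerMillionTokens(hourlyPrice, 200); // hypothetical aggregate tokens/s
const after = costPerMillionTokens(hourlyPrice, 800);  // hypothetical, after quantization

console.log(before.toFixed(3)); // about 0.694 per million tokens
console.log(after.toFixed(3));  // about 0.174 per million tokens
```

Because the GPU price is fixed, cost per token falls in direct proportion to throughput: a fourfold throughput gain is a fourfold cost reduction.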


FAQ

Q. On what basis should I choose a quantization method?

Choose by "serving economics," not by accuracy alone. Concretely, weigh three axes: GPU generation, workload (concurrency, context length), and accuracy tolerance. As a baseline: AWQ (4-bit) if VRAM is tight or the GPU is older (T4, etc.); FP8 if you want to preserve accuracy on a newer GPU (H100, etc.); GGUF if you have no GPU or are deploying at the edge. Quantization is a lever that shrinks the weights to create KV-cache headroom (concurrency, context length) and lower per-token cost.

Q. Which is better, AWQ or FP8?

It depends on hardware and workload. AWQ (4-bit weight-only) compresses the weights heavily and lets even older GPUs without FP8 support (T4, etc.) fit a large model on one card. FP8 has small accuracy degradation and high compute throughput, but native support requires the Hopper/Ada generation or later, and it compresses the weights less than AWQ does. The rule of thumb: AWQ if VRAM is tight, FP8 if you prioritize accuracy on a newer GPU.

Q. What is the KV cache? Why is it important?

The KV cache is a working area that holds the context during generation, and it consumes VRAM in proportion to context length × concurrency. Because it competes with the model weights for VRAM, large weights leave less for the KV cache, which reduces both the number of concurrent requests and the context length you can support. Shrinking the weights via quantization lets you spend the freed space on the KV cache and increase concurrency, and this governs throughput and cost.

Q. Why does quantization make inference faster?

LLM decode (generating one token at a time) tends to be memory-bandwidth-bound, because the weights are read from VRAM at every decode step. Reducing the weights' byte count via quantization reduces the amount of data read and speeds up decode. So quantization brings "memory saving" and "speed-up" simultaneously. Since per-token serving cost is roughly GPU cost per hour ÷ tokens processed per hour, this directly affects cost.

Q. Can an 8B LLM run on a 16GB GPU (T4)?

In FP16 (16-bit) the weights alone occupy about 16 GB (roughly 14.9 GiB), so it essentially won't fit once runtime overhead is counted. But switching to AWQ (4-bit) shrinks the weights to about a quarter, so the model fits and room is left for the KV cache too. I have run Qwen3-8B-AWQ in production on a T4. However, the T4 has no native FP8 support, and as an older GPU it needs configuration-side care for things like the attention backend.


Summary: quantization is a VRAM-budget allocation problem

To correctly choose LLM quantization for production serving, here's what to grasp.

  1. Production selection is "serving economics," not "accuracy" — how you allocate the VRAM budget between weights and KV cache.
  2. Quantization shrinks the weights and creates KV-cache (concurrency, context length) headroom — this governs per-token cost.
  3. AWQ (4-bit) achieves large VRAM reduction, letting even older GPUs (T4, etc.) host a large model.
  4. FP8 has small accuracy degradation and high throughput, but needs Hopper/Ada generation and later.
  5. Decode is memory-bandwidth-bound — quantization brings memory saving and speed-up at the same time.

"I want to run an LLM on my own GPU, but on which GPU and with which quantization does it pencil out?" — that discernment is solved by analysis of the VRAM budget and the workload. From quantization-method selection through production serving on vLLM to cost optimization, I take it on at production-operations quality.

