Designing Llama inference cost: deriving the break-even of API vs. self-hosting with TCO

The goal of this article

"How much does it actually cost to run Llama in production?" — it's the first question asked in a project. This piece answers it not by feel but with TCO (total cost of ownership). The aim is to nail down the "cost" chapter of the big picture of putting Llama into production with formulas, real numbers, and code.

When you finish reading, the goal is a state where you can:

Calculate the cost of API and self-hosting on the same footing.
Derive where the break-even is for your own volume by plugging into the formula.
Judge which lever to pull for the biggest effect, in design order.

Reliability disclosure: GPU cost is something I dealt with first-hand on a video AI localization platform. The numbers here are examples based on public information as of June 2026, and pricing fluctuates. The value is in the formulas and the way of thinking: plug in your own numbers rather than relying on the absolute values.

The 30-second conclusion

	API (Bedrock, etc., usage-based)	Self-host (rent GPUs and run them yourself)
Billing	Pay-per-use (input/output tokens)	GPU-hours are a fixed cost (billed even when idle)
Initial/operations	Nearly zero	Environment setup, scaling, monitoring, incident response
Condition for lower cost	Low–mid volume, high variability	High, stable utilization
Suited for	Prototypes, variable demand, small-batch high-variety	Steady high volume, data sovereignty, minimizing per-request cost

The essence is this one line: API is variable cost only, self-hosting is fixed cost (GPU-hours). So "how much you can keep the GPU from sitting idle (utilization)" decides whether self-hosting wins.

Cost formulas: put them on the same footing

API (usage-based) cost

It is simply token counts multiplied by unit prices, which makes it easy to predict.

API cost = (input tokens / 1e6) × input unit price + (output tokens / 1e6) × output unit price

Example as of June 2026 (Amazon Bedrock, on-demand; verify against the current pricing page):

Model	Input / 1M	Output / 1M	Positioning
Llama 4 Scout	$0.17	$0.66	Long-context, standard
Llama 4 Maverick	$0.24	$0.97	High-quality flagship
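
As a quick sanity check on the formula, the following is a minimal sketch that turns the example prices in the table into a monthly bill. The traffic profile (requests per day, tokens per request) and the 30-day month are made-up assumptions, and the helper name `monthlyApiCost` is not from any SDK.

```typescript
// Sketch: monthly API bill from the example Bedrock prices above.
// The traffic numbers are hypothetical; replace them with figures from your own logs.
type TokenPrice = { readonly inputPer1M: number; readonly outputPer1M: number };

const SCOUT: TokenPrice = { inputPer1M: 0.17, outputPer1M: 0.66 };
const MAVERICK: TokenPrice = { inputPer1M: 0.24, outputPer1M: 0.97 };

function monthlyApiCost(
  requestsPerDay: number,
  inputTokensPerRequest: number,
  outputTokensPerRequest: number,
  price: TokenPrice,
  daysPerMonth = 30, // assumption: a flat 30-day month
): number {
  const requests = requestsPerDay * daysPerMonth;
  const inputCost = ((requests * inputTokensPerRequest) / 1e6) * price.inputPer1M;
  const outputCost = ((requests * outputTokensPerRequest) / 1e6) * price.outputPer1M;
  return inputCost + outputCost;
}

// Hypothetical workload: 10,000 requests/day, 2,000 input and 500 output tokens each.
const pricedModels: ReadonlyArray<readonly [string, TokenPrice]> = [
  ["Llama 4 Scout", SCOUT],
  ["Llama 4 Maverick", MAVERICK],
];
for (const [label, price] of pricedModels) {
  const usd = monthlyApiCost(10_000, 2_000, 500, price);
  console.log(`${label}: $${usd.toFixed(2)} / month`);
}
```

With these assumed numbers Scout comes to about $201 per month and Maverick to about $289.50. Note that input and output contribute roughly equally here, even though the output unit price is several times higher.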

Self-host cost

Divide the fixed GPU-hour price by the amount of tokens you actually emitted in that time. Utilization matters here.

Self-host cost per 1M output tokens = (GPU-hour price × GPU count) ÷ (effective output throughput [tok/s] × 3600 × utilization / 1e6)

Example as of June 2026 (verify current prices): an H100 is roughly $1.5–4 per GPU-hour (on-demand median ≈ $3, spot ≈ $1–2.5). Always benchmark the effective throughput yourself; it varies greatly by model, quantization, batching, and context length.
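
Since the throughput term has to come from your own measurement, here is a minimal sketch of how an aggregate output throughput could be derived from a load-test log. The `RequestRecord` shape is hypothetical; map whatever your benchmark tool emits onto it.

```typescript
// Sketch: aggregate output throughput (tok/s) from a load-test log.
// RequestRecord is a hypothetical shape, not the output format of any specific tool.
type RequestRecord = {
  readonly startMs: number;      // when the request was sent
  readonly endMs: number;        // when the last token arrived
  readonly outputTokens: number; // completion tokens actually generated
};

/** Total output tokens divided by the wall-clock span of the whole run. */
function aggregateOutputTokPerSec(records: readonly RequestRecord[]): number {
  if (records.length === 0) return 0;
  let firstStart = Infinity;
  let lastEnd = -Infinity;
  let totalTokens = 0;
  for (const r of records) {
    firstStart = Math.min(firstStart, r.startMs);
    lastEnd = Math.max(lastEnd, r.endMs);
    totalTokens += r.outputTokens;
  }
  const wallSeconds = (lastEnd - firstStart) / 1000;
  return wallSeconds > 0 ? totalTokens / wallSeconds : 0;
}

// Three overlapping requests over a 5-second window, 1,500 output tokens in total.
const sampleRun: RequestRecord[] = [
  { startMs: 0, endMs: 4000, outputTokens: 400 },
  { startMs: 500, endMs: 5000, outputTokens: 600 },
  { startMs: 1000, endMs: 4000, outputTokens: 500 },
];
console.log(aggregateOutputTokPerSec(sampleRun)); // 300
```

Dividing by the wall-clock span of the run, rather than summing per-request speeds, is deliberate: the cost formula needs what the GPUs emit in total under concurrent load, so the benchmark should run at a concurrency close to production.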

A verifiable cost calculator (pure functions)

"World-class" isn't flashy code but small, testable pure functions. With zero side effects and determined only by inputs, they go straight into unit tests (the discipline of type safety).

// lib/llm-cost.ts: pure functions that compare API and self-hosting in the same unit ($ per 1M output tokens)
export type TokenPrice = { readonly inputPer1M: number; readonly outputPer1M: number };

/** Usage-based API cost (USD). Determined entirely by input/output token counts and unit prices. */
export function apiCost(inputTokens: number, outputTokens: number, p: TokenPrice): number {
  return (inputTokens / 1e6) * p.inputPer1M + (outputTokens / 1e6) * p.outputPer1M;
}

/** Self-hosting unit cost per 1M output tokens. Utilization dominates the result. */
export function selfHostCostPer1MOutput(params: {
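  gpuHourlyUsd: number;      // e.g. 3.0 (H100 on-demand)
  gpuCount: number;          // e.g. 2 (tensor parallel)
  aggOutputTokPerSec: number; // e.g. 2500 (benchmark required: aggregate throughput under continuous batching)
  utilization: number;        // 0 to 1 (fraction of time the GPU is actually generating)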
}): number {
  const { gpuHourlyUsd, gpuCount, aggOutputTokPerSec, utilization } = params;
  const tokensPerHour = aggOutputTokPerSec * 3600 * utilization;
  if (tokensPerHour <= 0) return Infinity; // an idle GPU has infinite unit cost: a pure loss
  return (gpuHourlyUsd * gpuCount) / (tokensPerHour / 1e6);
}

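/**
 * Break-even utilization: the utilization self-hosting needs to match the API output price.
 * The result is clamped to 1, so a return value of 1 can also mean break-even is out of reach.
 */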
export function breakEvenUtilization(params: {
  gpuHourlyUsd: number; gpuCount: number; aggOutputTokPerSec: number; apiOutputPer1M: number;
}): number {
  const fullUtilCost = selfHostCostPer1MOutput({ ...params, utilization: 1 });
  return Math.min(1, fullUtilCost / params.apiOutputPer1M);
}

Plug in numbers (example, benchmark required)

Setting H100×2 at $3/GPU-hour and an aggregate 2,500 tok/s (a hypothetical continuous-batching throughput):

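Output cost at 100% utilization = (3×2) ÷ (2500×3600×1.0 / 1e6) = 6 ÷ 9 ≈ $0.67 / 1M output tok
That is almost the same as Bedrock Scout's output price of $0.66 / 1M: even at 100% utilization you only just reach parity.
Break-even utilization ≈ 0.667 / 0.66 ≈ 1.01, which is just above 100%, so at this throughput self-hosting cannot quite undercut the API on output price. If throughput is higher (e.g., 5,000 tok/s), the break-even drops to ~0.5. Note that this compares output price only; the API also bills input tokens, so for input-heavy workloads the real break-even sits somewhat lower.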

💡 Interpretation: self-hosting "loses the moment you let the GPU sit idle." Unless the workload is steady and runs near full utilization around the clock, the API (variable cost only) is often cheaper. The standard playbook is to confirm quality and demand on the API first, then move to self-hosting once utilization is reliably high. Spot instances (roughly $1–2.5/GPU-hour in the example above) and Bedrock's Provisioned Throughput are levers that move this break-even.
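
The same break-even can be restated as a monthly volume, which is often easier to compare against real traffic. The sketch below is one way to do that under stated assumptions: GPUs are rented around the clock for an average 730-hour month, and only the output price is compared. The function name `monthlyBreakEven` is illustrative.

```typescript
// Sketch: monthly output-token volume at which renting GPUs matches the API bill.
// Compares output price only, like the worked example above; input tokens are ignored.
const HOURS_PER_MONTH = 730; // assumption: average month, GPUs rented around the clock

type BreakEvenInput = {
  readonly gpuHourlyUsd: number;
  readonly gpuCount: number;
  readonly aggOutputTokPerSec: number; // measured aggregate throughput
  readonly apiOutputPer1M: number;     // API output price to beat
};

function monthlyBreakEven(p: BreakEvenInput) {
  const monthlyGpuUsd = p.gpuHourlyUsd * p.gpuCount * HOURS_PER_MONTH;
  // Volume where (tokens / 1e6) * apiOutputPer1M equals the fixed GPU bill.
  const breakEvenTokens = (monthlyGpuUsd / p.apiOutputPer1M) * 1e6;
  // The most the cluster can emit in a month at 100% utilization.
  const capacityTokens = p.aggOutputTokPerSec * 3600 * HOURS_PER_MONTH;
  return {
    monthlyGpuUsd,
    breakEvenTokens,
    capacityTokens,
    reachable: breakEvenTokens <= capacityTokens,
  };
}

const slow = monthlyBreakEven({ gpuHourlyUsd: 3, gpuCount: 2, aggOutputTokPerSec: 2500, apiOutputPer1M: 0.66 });
const fast = monthlyBreakEven({ gpuHourlyUsd: 3, gpuCount: 2, aggOutputTokPerSec: 5000, apiOutputPer1M: 0.66 });
console.log(slow.reachable, fast.reachable); // false true
```

With the example numbers the fixed bill is $4,380 per month, which corresponds to roughly 6.6 billion output tokens at the Scout price. At 2,500 tok/s the cluster cannot emit that much in a month, so break-even is unreachable; at 5,000 tok/s it needs about half of its capacity.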

Cost-reduction levers (in order of largest effect)

Each lever acts on a specific term of the formula. They are listed here in order of effect at design time.

1. Model routing (the most effective)

Sending everything through the top model is the biggest waste. Route models by task difficulty.

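// Route to a cheaper model by difficulty. Sending the easy 80% to a cheap model alone cuts the total substantially.
function pickModel(task: { kind: "classify" | "extract" | "reason" }): string {
  switch (task.kind) {
    case "classify": return "meta/llama-3.3-8b"; // a small model is enough for classification and moderation
    case "extract":  return "meta/llama-4-scout"; // mid-tier for extraction
    case "reason":   return "meta/llama-4-maverick"; // top model only for the hard cases
  }
}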
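
To see how much routing can move the total, here is a small sketch that computes a blended output price for a traffic mix. The 80/15/5 split and the small-model price of $0.20 / 1M are placeholders chosen for illustration, not quoted prices; the Scout and Maverick prices are the example values from this article.

```typescript
// Sketch: blended output price per 1M tokens for a given traffic mix.
// "share" is the fraction of output tokens (not requests) sent to each model.
type Route = { readonly share: number; readonly outputPer1M: number };

function blendedOutputPer1M(routes: readonly Route[]): number {
  const totalShare = routes.reduce((sum, r) => sum + r.share, 0);
  if (Math.abs(totalShare - 1) > 1e-9) {
    throw new Error(`shares must sum to 1, got ${totalShare}`);
  }
  return routes.reduce((sum, r) => sum + r.share * r.outputPer1M, 0);
}

const allMaverick = blendedOutputPer1M([{ share: 1, outputPer1M: 0.97 }]);
const routed = blendedOutputPer1M([
  { share: 0.8, outputPer1M: 0.2 },   // hypothetical small-model price (placeholder)
  { share: 0.15, outputPer1M: 0.66 }, // Scout, example price from this article
  { share: 0.05, outputPer1M: 0.97 }, // Maverick, example price from this article
]);
console.log(allMaverick.toFixed(2), routed.toFixed(2));
```

Under these assumptions the blended price is about $0.31 per 1M output tokens against $0.97 when everything goes to the top model. The result is only as good as the difficulty classification, so measure quality per route before trusting the saving.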

2. Quantization (FP8)

Running FP8 with vLLM reduces VRAM use, so more fits on the same GPU and throughput (the denominator of the formula) goes up. The --kv-cache-dtype fp8 option stores the KV cache in FP8, which leaves room for longer contexts or more concurrent sequences. The accuracy loss is often small in practice, but check it on your own evaluation set.
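
The memory effect of an FP8 KV cache can be estimated with simple arithmetic. The sketch below uses the usual per-token KV size (two tensors per layer, one for keys and one for values); the model dimensions and the 20 GiB budget are hypothetical and do not describe any specific Llama model. It also ignores scaling-factor overhead and block-level allocation, so treat the result as a rough estimate.

```typescript
// Sketch: KV-cache bytes per token, and how many tokens fit in a memory budget.
// The model dimensions below are hypothetical, not those of any specific Llama model.
type KvShape = {
  readonly numLayers: number;
  readonly numKvHeads: number; // key/value heads (fewer than query heads under GQA)
  readonly headDim: number;
};

/** 2 tensors (K and V) per layer, each numKvHeads * headDim elements per token. */
function kvBytesPerToken(shape: KvShape, bytesPerElement: number): number {
  return 2 * shape.numLayers * shape.numKvHeads * shape.headDim * bytesPerElement;
}

function tokensThatFit(budgetGiB: number, shape: KvShape, bytesPerElement: number): number {
  return Math.floor((budgetGiB * 1024 ** 3) / kvBytesPerToken(shape, bytesPerElement));
}

const exampleShape: KvShape = { numLayers: 48, numKvHeads: 8, headDim: 128 };
const kvBudgetGiB = 20; // hypothetical VRAM left for the KV cache after loading weights
console.log("16-bit KV cache:", tokensThatFit(kvBudgetGiB, exampleShape, 2));
console.log("FP8 KV cache:", tokensThatFit(kvBudgetGiB, exampleShape, 1));
```

Halving the bytes per element roughly doubles the number of cached tokens for the same budget, which is what allows longer contexts or a larger batch on the same GPU.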

3. Continuous batching

This is where vLLM earns its keep. It dynamically batches incoming requests so the GPU does not sit idle between them. That determines the "effective throughput" and therefore the self-hosting break-even; it is the core of "raise utilization = lower cost."
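
Continuous batching sets the capacity; whether that capacity is used depends on demand. As a rough sketch, the long-run utilization term of the formula can be estimated from offered load, assuming every request is eventually served and that the measured aggregate throughput holds at that load. The function and parameter names are illustrative.

```typescript
// Sketch: long-run GPU utilization estimated from offered load.
// Assumes all requests are eventually served; bursts show up as queueing, not as lost work.
function estimateUtilization(params: {
  requestsPerSec: number;
  avgOutputTokensPerRequest: number;
  aggOutputTokPerSec: number; // measured capacity under continuous batching
}): number {
  const { requestsPerSec, avgOutputTokensPerRequest, aggOutputTokPerSec } = params;
  if (aggOutputTokPerSec <= 0) return 0;
  const demandTokPerSec = requestsPerSec * avgOutputTokensPerRequest;
  // Demand above capacity cannot be served by this cluster, so cap at 1.
  return Math.min(1, demandTokPerSec / aggOutputTokPerSec);
}

// Hypothetical load: 2 requests/s averaging 500 output tokens, against 2,500 tok/s capacity.
console.log(estimateUtilization({
  requestsPerSec: 2,
  avgOutputTokensPerRequest: 500,
  aggOutputTokPerSec: 2500,
})); // 0.4
```

A result of 0.4 against a break-even near 1.0 means the API is still the cheaper option for that load; a value pinned at 1 means demand exceeds capacity and more GPUs are needed.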

4. Idempotent caching (zero regeneration)

Don't run the same generation twice for the same input. Caching results under sha256(model + prompt + parameters) as the key removes the cost of rapid repeated submissions, retries, and duplicate requests (see the idempotent route implementation in the pillar article). Prompt caching also helps with the boilerplate part of prompts.
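
A minimal sketch of the key derivation and a cache wrapper follows. It is not the pillar article's implementation: `GenerationRequest`, `generateOnce`, and the `generate` callback are hypothetical shapes, and the in-memory `Map` stands in for a shared store such as Redis with a TTL.

```typescript
import { createHash } from "node:crypto";

// Sketch: idempotency key and a tiny in-memory cache around a generation call.
// GenerationRequest and the generate callback are hypothetical shapes, not a real SDK.
type GenerationRequest = {
  readonly model: string;
  readonly prompt: string;
  readonly params: Readonly<Record<string, number | string | boolean>>;
};

/** sha256 over model, prompt and parameters, with parameter keys sorted for stability. */
function cacheKey(req: GenerationRequest): string {
  const sortedParams = Object.keys(req.params)
    .sort()
    .map((k) => [k, req.params[k]]);
  const canonical = JSON.stringify([req.model, req.prompt, sortedParams]);
  return createHash("sha256").update(canonical).digest("hex");
}

const resultCache = new Map<string, Promise<string>>();

/** Returns the cached result for an identical request instead of regenerating. */
function generateOnce(
  req: GenerationRequest,
  generate: (req: GenerationRequest) => Promise<string>,
): Promise<string> {
  const key = cacheKey(req);
  const hit = resultCache.get(key);
  if (hit) return hit;
  // Store the promise itself so concurrent duplicates share one in-flight call.
  const pending = generate(req).catch((err) => {
    resultCache.delete(key); // do not cache failures
    throw err;
  });
  resultCache.set(key, pending);
  return pending;
}
```

Sorting the parameter keys keeps the hash stable when callers build the same request in a different property order. Reusing a cached output is only appropriate when a repeated answer is acceptable, for example deterministic settings or retries of the same user action.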

5. Spot/preemptible GPUs

Spot capacity lowers the self-host GPU-hour price itself, sometimes to less than half (in the example range above, from about $3 to $1–2.5). The prerequisite is interruption tolerance: checkpoints, draining, and retry design.
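
Interruptions are not free, so the headline spot price overstates the saving. The following sketch uses a deliberately simple model in which a fixed share of paid GPU time is lost to interrupted work; `reworkFraction` is a hypothetical knob you would estimate from your own interruption history.

```typescript
// Sketch: effective GPU-hour price on spot once interrupted work is redone.
// reworkFraction is hypothetical: the share of paid GPU time lost to interruptions.
function effectiveSpotHourlyUsd(spotHourlyUsd: number, reworkFraction: number): number {
  if (reworkFraction < 0 || reworkFraction >= 1) {
    throw new RangeError("reworkFraction must be in [0, 1)");
  }
  // Only (1 - reworkFraction) of each paid hour produces output that is kept.
  return spotHourlyUsd / (1 - reworkFraction);
}

const onDemandHourlyUsd = 3.0;                            // example on-demand price from this article
const spotEffective = effectiveSpotHourlyUsd(1.5, 0.1);   // example spot price, 10% of time lost
console.log(spotEffective.toFixed(2), spotEffective < onDemandHourlyUsd); // "1.67" true
```

Feed the effective price, not the list price, into the self-host formula when comparing against the API.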

Lever	Term it hits	Felt effect	Prerequisite
Model routing	unit price × count	Large	Difficulty judgment
Quantization (FP8)	throughput	Mid–large	Supported GPU
Continuous batching	utilization/throughput	Large	vLLM
Idempotent caching	count	Mid	KV/Redis
Spot GPU	GPU-hour price	Mid–large	Interruption tolerance

Observability: you can't lower what you don't measure

Cost optimization starts from measurement. Drop input/output tokens, model, and latency per call into structured logs, and visualize per-model, per-use unit cost on a dashboard. Without this, "somehow it's expensive" leaves you with no improvement move (observability with OpenTelemetry). Don't include PII (prompt bodies) — metadata alone is enough to track cost.
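
As an illustration of what such a log could contain, here is a minimal sketch of a metadata-only record and a per-model, per-use aggregation. The field names and the `buildCallLog` and `costByModelAndUse` helpers are hypothetical; align them with whatever logging or tracing schema you already use.

```typescript
// Sketch: one structured log line per LLM call, metadata only (no prompt text).
// Field names are hypothetical; align them with your logging or tracing schema.
type TokenPrice = { readonly inputPer1M: number; readonly outputPer1M: number };

type LlmCallLog = {
  readonly ts: string;
  readonly model: string;
  readonly useCase: string;
  readonly inputTokens: number;
  readonly outputTokens: number;
  readonly latencyMs: number;
  readonly costUsd: number;
};

function buildCallLog(call: {
  model: string; useCase: string; inputTokens: number; outputTokens: number;
  latencyMs: number; price: TokenPrice; now?: Date;
}): LlmCallLog {
  const costUsd =
    (call.inputTokens / 1e6) * call.price.inputPer1M +
    (call.outputTokens / 1e6) * call.price.outputPer1M;
  return {
    ts: (call.now ?? new Date()).toISOString(),
    model: call.model, useCase: call.useCase,
    inputTokens: call.inputTokens, outputTokens: call.outputTokens,
    latencyMs: call.latencyMs, costUsd,
  };
}

/** Sum cost per "model/useCase" pair, the number a dashboard would chart. */
function costByModelAndUse(logs: readonly LlmCallLog[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const log of logs) {
    const key = `${log.model}/${log.useCase}`;
    totals.set(key, (totals.get(key) ?? 0) + log.costUsd);
  }
  return totals;
}

// Example: emit one JSON line for a hypothetical call priced at the Scout example rates.
console.log(JSON.stringify(buildCallLog({
  model: "meta/llama-4-scout", useCase: "extract",
  inputTokens: 2000, outputTokens: 500, latencyMs: 840,
  price: { inputPer1M: 0.17, outputPer1M: 0.66 },
})));
```

Computing `costUsd` at log time, with the price that was in force for that call, keeps the dashboard correct even after unit prices change.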

Frequently asked questions (FAQ)

Q. In the end, which is cheaper, API or self-hosting? A. The API if volume is small or highly variable; self-hosting if volume is steadily high and you can maintain high utilization. The boundary is how close to full utilization you can keep the GPU. Plug your measured throughput and volume into this article's formula to derive it.

Q. What tok/s should I compute with? A. Always benchmark yourself. It varies several-fold by model, quantization, batching, context length, and GPU. The 2,500 tok/s here is an example. Take a measured value from a vLLM load test and plug it into the formula.

Q. What is Provisioned Throughput? A. A way to reserve dedicated throughput in Bedrock, billed by the hour. When usage-based pricing gets expensive at steady high volume, it is an option one step short of running your own GPUs. Consider it when you want the ease of a managed service together with better per-unit cost at high volume.

Q. Should I worry about input tokens or output tokens? A. Generally output unit price is higher (e.g., Scout is input $0.17 / output $0.66). Not letting long generations run on, narrowing maxTokens, and keeping summaries concise help. On the other hand, long-context RAG balloons the input, so narrowing the context with retrieval is also a cost measure.

Q. What's the single most effective move? A. Model routing. Just sending easy tasks to cheap models moves the total significantly. Use the top model "only for the hard parts."
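
To make the input-versus-output question above concrete, this small sketch computes the share of a request's API cost that comes from input tokens. The prices are the Scout example values from this article; the two token profiles are hypothetical.

```typescript
// Sketch: share of one request's API cost that comes from input tokens.
// Prices are the Scout example values from this article; token counts are hypothetical.
function inputCostShare(
  inputTokens: number, outputTokens: number,
  inputPer1M: number, outputPer1M: number,
): number {
  const inputCost = (inputTokens / 1e6) * inputPer1M;
  const outputCost = (outputTokens / 1e6) * outputPer1M;
  const total = inputCost + outputCost;
  return total === 0 ? 0 : inputCost / total;
}

const chatShare = inputCostShare(500, 800, 0.17, 0.66);   // short prompt, long answer
const ragShare = inputCostShare(20_000, 300, 0.17, 0.66); // long retrieved context, short answer
console.log(chatShare.toFixed(2), ragShare.toFixed(2)); // "0.14" "0.94"
```

In the chat-style profile output dominates, so capping generation length is the lever; in the RAG-style profile about 94% of the cost is input, so trimming retrieved context matters far more than the output price.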

Conclusion

Llama's cost is determined by design more than by model choice. API is variable cost, self-hosting is fixed cost (GPU-hours) — capture this difference with formulas and put utilization in the lead role, and "how much does it cost" becomes a calculation, not a feeling.

First, confirm quality and demand with API (variable cost, zero operations).
Measure and visualize per-model, per-use unit cost.
Cut in the order routing → quantization → batching → idempotent caching → spot.
Once utilization is steadily high, confirm the self-host/Provisioned break-even with the formula and migrate.

I can help with the cost design of your LLM stack (deriving the break-even, routing, migrating to self-hosting). Have a look at my track record with GPU cost and feel free to get in touch. One person plus generative AI: fast, cheap, and safe.

Sources / official resources

Amazon Bedrock pricing (usage-based unit price for each Llama model; verify current prices)
Meta's Llama in Amazon Bedrock (models and delivery forms)
vLLM Blog: Llama 4 (throughput and quantization)
For cloud GPU pricing, refer to each provider's public prices (the H100 estimate here is as of June 2026)

Pricing and throughput fluctuate and depend on measurement. The numbers here are examples; recompute with your own real numbers before implementation.
