# Designing Llama inference cost: deriving the break-even of API vs. self-hosting with TCO

> An article that answers 'how much does running Llama in production cost?' not by feel but with TCO. It explains, with verifiable code and real numbers: the cost formula for usage-based billing (Bedrock, etc.) and self-hosting (GPU-hours × throughput), how to derive the break-even, and the cost-reduction levers of model routing, quantization, batching, idempotent caching, and spot GPUs.

- Published: 2026-06-25
- Author: 友田 陽大
- Tags: Llama, Cost optimization, Generative AI, AWS Bedrock, FinOps, GPU, TypeScript
- URL: https://tomodahinata.com/en/blog/llama-inference-cost-optimization-self-host-vs-api
- Category: Llama & open-weight LLMs
- Pillar guide: https://tomodahinata.com/en/blog/meta-llama-open-weight-llm-production-guide

## Key points

- API (usage-based) is pay-per-use with zero operations. Self-hosting makes GPU-hours a fixed cost — 'billed even when idle' is the essential difference.
- Cost formulas: API = (input tok × unit price + output tok × unit price); self-host = GPU-hour price ÷ (effective throughput × utilization). For the latter, utilization decides everything.
- Real numbers (as of June 2026; verify against current pricing): Bedrock Llama 4 Scout is $0.17/$0.66, Maverick $0.24/$0.97 (input/output per 1M tokens); H100 is roughly $1.5–4/GPU-hour.
- Break-even rule of thumb: for self-hosting to beat API, you need high, stable utilization. Small/variable → API; steady/high-volume → self-host or Provisioned.
- The effective levers are ① model routing ② quantization (FP8) ③ continuous batching ④ idempotent caching for zero regeneration ⑤ spot GPUs. They are listed in the order to apply them at design time, roughly largest effect first.

---

## The goal of this article

"How much does it actually cost to run Llama in production?" — it's the first question asked in a project. This piece answers it **not by feel but with TCO (total cost of ownership).** The aim is to nail down the "cost" chapter of [the big picture of putting Llama into production](/blog/meta-llama-open-weight-llm-production-guide) **with formulas, real numbers, and code.**

By the end of this article, you should be able to:

1. **Calculate the cost of API and self-hosting on the same footing.**
2. **Derive where the break-even is for your own volume** by plugging into the formula.
3. **Judge which lever to pull for the biggest effect**, in design order.

> **Reliability disclosure**: GPU cost is an area I actually faced on the [video AI localization platform](/case-studies/ai-video-localization-lipsync). The numbers here are **examples** based on **public information as of June 2026**, and pricing fluctuates. The value is in "**the formulas and the way of thinking**" — judge by plugging in your own numbers, not the absolute values.

---

## The 30-second conclusion

| | API (Bedrock, etc., usage-based) | Self-host (rent GPUs and run them yourself) |
| --- | --- | --- |
| Billing | **Pay-per-use** (input/output tokens) | **GPU-hours are a fixed cost** (billed even when idle) |
| Initial/operations | Nearly zero | Environment setup, scaling, monitoring, incident response |
| Condition for lower cost | Low–mid volume, high variability | **High, stable utilization** |
| Suited for | Prototypes, variable demand, small-batch high-variety | Steady high volume, data sovereignty, minimizing per-request cost |

**The essence is this one line**: API is **variable cost only**, self-hosting is **fixed cost (GPU-hours).** So "**how much you can keep the GPU from sitting idle (utilization)**" decides whether self-hosting wins.

---

## Cost formulas: put them on the same footing

### API (usage-based) cost

It is simply token counts multiplied by unit prices. Because it is this simple, it is **easy to predict.**

```text
API cost = (input tokens / 1e6) × input unit price + (output tokens / 1e6) × output unit price
```

**Example as of June 2026 (Amazon Bedrock, on-demand; verify against the current pricing page)**:

| Model | Input / 1M | Output / 1M | Positioning |
| --- | --- | --- | --- |
| **Llama 4 Scout** | **$0.17** | **$0.66** | Long-context, standard |
| **Llama 4 Maverick** | **$0.24** | **$0.97** | High-quality flagship |
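
To make the formula concrete, here is a minimal sketch that prices one month of traffic against the two example rows above. The traffic profile (1M requests per month, 1,500 input and 300 output tokens each) and the names `monthlyApiCost`, `SCOUT`, and `MAVERICK` are assumptions for illustration, not measurements.

```ts
// Sketch: monthly API bill from the usage-based formula.
// Prices are the article's June 2026 examples; the traffic profile is hypothetical.
type TokenPrice = { readonly inputPer1M: number; readonly outputPer1M: number };

const SCOUT: TokenPrice = { inputPer1M: 0.17, outputPer1M: 0.66 };
const MAVERICK: TokenPrice = { inputPer1M: 0.24, outputPer1M: 0.97 };

function monthlyApiCost(
  requestsPerMonth: number,
  inputTokPerReq: number,
  outputTokPerReq: number,
  price: TokenPrice,
): number {
  const inputTokens = requestsPerMonth * inputTokPerReq;
  const outputTokens = requestsPerMonth * outputTokPerReq;
  return (inputTokens / 1e6) * price.inputPer1M + (outputTokens / 1e6) * price.outputPer1M;
}

// Hypothetical profile: 1M requests/month, 1,500 input and 300 output tokens per request.
const scoutBill = monthlyApiCost(1e6, 1500, 300, SCOUT);
const maverickBill = monthlyApiCost(1e6, 1500, 300, MAVERICK);
console.log(`Scout: $${scoutBill.toFixed(2)}, Maverick: $${maverickBill.toFixed(2)}`);
```

Under these assumptions Scout comes to about $453 and Maverick to about $651 per month. Even though the output unit price is higher, input tokens make up more than half of either bill for this input-heavy profile.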

### Self-host cost

Divide the fixed **GPU-hour price** by the amount of tokens you **actually emitted** in that time. **Utilization** matters here.

```text
Self-host cost per 1M output tokens = (GPU-hour price × GPU count) ÷ (effective output throughput [tok/s] × 3600 × utilization / 1e6)
```

**Example as of June 2026 (verify against current provider pricing)**: H100 is roughly **$1.5–4 / GPU-hour** (on-demand median ≈ $3, spot ≈ $1–2.5). **Always benchmark the effective throughput yourself** (it varies greatly by model, quantization, batching, and context length).
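
A quick way to see why utilization matters is to sweep it while holding everything else fixed. The sketch below assumes 2 GPUs at $3/GPU-hour and a placeholder 2,500 tok/s aggregate throughput; none of these are benchmark results, and `costPer1MOutput` is a name chosen for this example.

```ts
// Sketch: self-host cost per 1M output tokens at several utilization levels.
// gpuHourlyUsd, gpuCount and tokPerSec are illustrative assumptions; benchmark your own.
function costPer1MOutput(
  gpuHourlyUsd: number,
  gpuCount: number,
  tokPerSec: number,
  utilization: number,
): number {
  const millionTokPerHour = (tokPerSec * 3600 * utilization) / 1e6;
  return millionTokPerHour > 0 ? (gpuHourlyUsd * gpuCount) / millionTokPerHour : Infinity;
}

const utilizationSweep = [1.0, 0.5, 0.25, 0.1].map((u) => ({
  utilization: u,
  usdPer1M: Number(costPer1MOutput(3.0, 2, 2500, u).toFixed(2)),
}));
console.table(utilizationSweep);
```

With these placeholder inputs the unit cost goes from about $0.67 at full utilization to about $6.67 at 10%: every halving of utilization doubles the cost per token.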

---

## A verifiable cost calculator (pure functions)

"World-class" code isn't flashy; it is **small, testable pure functions.** With zero side effects and output determined only by inputs, they can go straight into unit tests ([the discipline of type safety](/blog/typescript-type-safety-discipline-zod-nevererror-no-any)).

```ts
// lib/llm-cost.ts: pure functions that compare API and self-hosting in the same unit ($ per 1M output tokens)
export type TokenPrice = { readonly inputPer1M: number; readonly outputPer1M: number };

/** Usage-based API cost (USD). Fully determined by input/output token counts and unit prices. */
export function apiCost(inputTokens: number, outputTokens: number, p: TokenPrice): number {
  return (inputTokens / 1e6) * p.inputPer1M + (outputTokens / 1e6) * p.outputPer1M;
}

/** Self-host unit cost per 1M output tokens. Utilization dominates the result. */
export function selfHostCostPer1MOutput(params: {
  gpuHourlyUsd: number;       // e.g. 3.0 (H100 on-demand)
  gpuCount: number;           // e.g. 2 (tensor parallel)
  aggOutputTokPerSec: number; // e.g. 2500 (benchmark required: aggregate throughput under continuous batching)
  utilization: number;        // 0 to 1 (share of time the GPU is actually generating)
}): number {
  const { gpuHourlyUsd, gpuCount, aggOutputTokPerSec, utilization } = params;
  const tokensPerHour = aggOutputTokPerSec * 3600 * utilization;
  if (tokensPerHour <= 0) return Infinity; // an idle GPU has infinite unit cost: pure loss
  return (gpuHourlyUsd * gpuCount) / (tokensPerHour / 1e6);
}

/**
 * Break-even utilization that self-hosting needs to undercut the API output unit price.
 * Clamped to 1, so a result of 1 can also mean break-even is out of reach at this throughput.
 */
export function breakEvenUtilization(params: {
  gpuHourlyUsd: number; gpuCount: number; aggOutputTokPerSec: number; apiOutputPer1M: number;
}): number {
  const fullUtilCost = selfHostCostPer1MOutput({ ...params, utilization: 1 });
  return Math.min(1, fullUtilCost / params.apiOutputPer1M);
}
```

### Plug in numbers (example, benchmark required)

Setting H100×2 at $3/GPU-hour and an aggregate 2,500 tok/s (a **hypothetical** continuous-batching throughput):

- Output cost at 100% utilization = `(3×2) ÷ (2500×3600×1.0 / 1e6)` = `6 ÷ 9` = **about $0.67 / 1M output tok**
- That is almost the same as Bedrock Scout's output price of **$0.66 / 1M**, so **even at 100% utilization you only just break even.**
- Break-even utilization ≈ `0.67 / 0.66` ≈ **1.01**, which `breakEvenUtilization` clamps to **1.0**. If throughput is higher (e.g., 5,000 tok/s), the full-utilization cost halves to about $0.33 and the break-even drops to ~0.5.

> 💡 **Interpretation**: self-hosting "loses the moment you let the GPU sit idle." Unless it's a **steady workload running near full utilization 24 hours a day**, API (variable cost only) is often cheaper. The standard cost playbook is to **first confirm quality and demand with API, then move to self-hosting once utilization is steadily high.** Spot (roughly $1–2.5/GPU-hour in the example above) and Bedrock's **Provisioned Throughput** are levers that move this break-even.
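
The same break-even can be expressed as a monthly volume, which is often easier to compare against real traffic. The sketch below is a simplification under stated assumptions: only output tokens are priced (as in the comparison above), the fleet runs all month, and a month is taken as 730 hours. `monthlyBreakEven` and `HOURS_PER_MONTH` are names introduced here for illustration.

```ts
// Sketch: monthly output volume at which an always-on GPU fleet matches the API bill.
// Assumptions: only output tokens are priced; the fleet is billed for every hour of the month.
const HOURS_PER_MONTH = 730; // average month length, an assumption

function monthlyBreakEven(p: {
  gpuHourlyUsd: number;
  gpuCount: number;
  aggOutputTokPerSec: number; // benchmark this yourself
  apiOutputPer1M: number;
}) {
  const fixedUsd = p.gpuHourlyUsd * p.gpuCount * HOURS_PER_MONTH;
  const breakEvenTokens = (fixedUsd / p.apiOutputPer1M) * 1e6;
  const capacityTokens = p.aggOutputTokPerSec * 3600 * HOURS_PER_MONTH;
  return {
    fixedUsd,        // what the GPUs cost per month, used or not
    breakEvenTokens, // output tokens/month where the API bill equals fixedUsd
    capacityTokens,  // most the fleet can emit in a month at 100% utilization
    requiredUtilization: breakEvenTokens / capacityTokens, // above 1 means unreachable
  };
}

const at2500 = monthlyBreakEven({ gpuHourlyUsd: 3, gpuCount: 2, aggOutputTokPerSec: 2500, apiOutputPer1M: 0.66 });
const at5000 = monthlyBreakEven({ gpuHourlyUsd: 3, gpuCount: 2, aggOutputTokPerSec: 5000, apiOutputPer1M: 0.66 });
console.log(at2500, at5000);
```

With the example inputs, the fleet costs $4,380 per month and needs roughly 6.6 billion output tokens per month to match the API. At 2,500 tok/s that is slightly more than the fleet can produce (required utilization just above 1); at 5,000 tok/s it is about half of capacity.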

---

## Cost-reduction levers (in order of largest effect)

Cost is determined by which term of the formula a lever acts on. The levers are listed **in the order to apply them at design time, roughly largest effect first.**

### 1. Model routing (the most effective)

Sending everything through the top model is **the biggest waste.** Route models by task difficulty.

```ts
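// Route to a cheaper model by difficulty. Sending the easy 80% to a cheap model alone cuts the total noticeably.
// Model IDs are illustrative labels; use the identifiers of your own provider.
function pickModel(task: { kind: "classify" | "extract" | "reason" }): string {
  switch (task.kind) {
    case "classify": return "meta/llama-3.1-8b"; // a small model is enough for classification and moderation
    case "extract":  return "meta/llama-4-scout"; // mid-tier for extraction
    case "reason":   return "meta/llama-4-maverick"; // top model only for the hard parts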
  }
}
```
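
To estimate what routing is worth before building it, a blended-cost calculation is usually enough. The sketch below compares sending everything to Maverick against sending only 20% there and the rest to Scout, using the example prices from this article. The 20% share and the per-request token counts are assumptions, and `blendedCost` is a name introduced for this example.

```ts
// Sketch: blended monthly cost when only a share of traffic needs the top model.
// Prices are the article's examples; traffic profile and routing share are hypothetical.
type TokenPrice = { readonly inputPer1M: number; readonly outputPer1M: number };

const SCOUT: TokenPrice = { inputPer1M: 0.17, outputPer1M: 0.66 };
const MAVERICK: TokenPrice = { inputPer1M: 0.24, outputPer1M: 0.97 };

function costPerRequest(inputTok: number, outputTok: number, p: TokenPrice): number {
  return (inputTok / 1e6) * p.inputPer1M + (outputTok / 1e6) * p.outputPer1M;
}

/** shareToTop is the fraction (0 to 1) of requests routed to the top model. */
function blendedCost(requests: number, inputTok: number, outputTok: number, shareToTop: number): number {
  const cheap = requests * (1 - shareToTop) * costPerRequest(inputTok, outputTok, SCOUT);
  const top = requests * shareToTop * costPerRequest(inputTok, outputTok, MAVERICK);
  return cheap + top;
}

const allTop = blendedCost(1e6, 1500, 300, 1.0); // everything on Maverick
const routed = blendedCost(1e6, 1500, 300, 0.2); // only the hard 20% on Maverick
const savingRatio = 1 - routed / allTop;
console.log(allTop.toFixed(2), routed.toFixed(2), (savingRatio * 100).toFixed(1) + "%");
```

For this profile the bill drops from about $651 to about $493, a saving of roughly 24%. The gap widens further if the easy tasks can go to a model cheaper than Scout, but that depends on prices not listed here.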

### 2. Quantization (FP8)

Using [FP8 with vLLM](/blog/vllm-llama-self-hosting-production-inference-server), you reduce VRAM usage, so you can **pack more onto the same GPU** and raise throughput (the denominator of the self-host formula). `--kv-cache-dtype fp8` stores the KV cache in FP8, which effectively widens the usable context, and the accuracy degradation is often small in practice.

### 3. Continuous batching

This is where vLLM's real value lies. **Dynamically bundle many requests so the GPU never idles.** This determines "effective throughput" and governs self-hosting's break-even. It's the heart of "**raise utilization = lower cost.**"

### 4. Idempotent caching (zero regeneration)

Don't run the same generation twice for the same input. Just caching with `sha256(model + prompt + parameters)` as the key **eliminates the cost of rapid repeated submissions, retries, and duplicate requests** ([the idempotent route implementation in the pillar article](/blog/meta-llama-open-weight-llm-production-guide#冪等性つきの本番ルートハンドラnextjs)). **Prompt caching** also works for the boilerplate part of prompts.
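
A minimal sketch of that cache key and a single-flight wrapper follows. It uses Node's built-in `node:crypto`; the in-memory `Map` stands in for KV/Redis, and `GenParams`, `cacheKey`, `generateOnce`, and the caller-supplied `generate` callback are hypothetical names, not the pillar article's implementation. It also assumes that reusing the first result is acceptable, which is clearest when sampling is deterministic.

```ts
import { createHash } from "node:crypto";

// Hypothetical parameter shape; include every parameter that changes the output.
type GenParams = { readonly temperature: number; readonly maxTokens: number };

// Serialize fields in a fixed order so the same logical request always hashes the same.
export function cacheKey(model: string, prompt: string, params: GenParams): string {
  const canonical = JSON.stringify([model, prompt, params.temperature, params.maxTokens]);
  return createHash("sha256").update(canonical).digest("hex");
}

// In-memory stand-in for KV/Redis. Storing the promise lets concurrent duplicates share one call.
const cache = new Map<string, Promise<string>>();

export function generateOnce(
  model: string,
  prompt: string,
  params: GenParams,
  generate: () => Promise<string>, // caller-supplied LLM call (hypothetical)
): Promise<string> {
  const key = cacheKey(model, prompt, params);
  const hit = cache.get(key);
  if (hit) return hit; // duplicate or retry: no second generation

  const pending = generate().catch((err) => {
    cache.delete(key); // do not cache failures, so a later retry can regenerate
    throw err;
  });
  cache.set(key, pending);
  return pending;
}
```

Caching the promise rather than the resolved string is a deliberate choice: two identical requests arriving at the same moment trigger one generation instead of two. A production version would add expiry and a shared store.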

### 5. Spot/preemptible GPUs

You can cut the self-host GPU-hour price itself to **less than half.** But interruption tolerance (checkpoints, draining, [retry design](/blog/retry-backoff-circuit-breaker-resilience-patterns-guide)) is a prerequisite.
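
Interruptions eat into the headline discount, so it helps to compute an effective price. The sketch below uses a deliberately simple model that is an assumption of this example, not a provider formula: `lostFraction` is the share of paid GPU time that produces no useful tokens (model reloads after a restart, redone work), which you would have to measure yourself.

```ts
// Sketch: effective spot price once interruption overhead is counted.
// lostFraction = share of paid GPU time producing no tokens (reloads, redone work); measure it.
function effectiveSpotHourly(spotHourlyUsd: number, lostFraction: number): number {
  if (lostFraction < 0 || lostFraction >= 1) {
    throw new RangeError("lostFraction must be in [0, 1)");
  }
  return spotHourlyUsd / (1 - lostFraction);
}

/** Largest lostFraction at which spot is still no more expensive than on-demand. */
function maxTolerableLoss(spotHourlyUsd: number, onDemandHourlyUsd: number): number {
  return 1 - spotHourlyUsd / onDemandHourlyUsd;
}

// Illustrative prices: $1.5 spot vs $3 on-demand, with 10% of paid time lost.
const effectiveHourly = effectiveSpotHourly(1.5, 0.1);
const tolerableLoss = maxTolerableLoss(1.5, 3.0);
console.log(effectiveHourly.toFixed(2), tolerableLoss);
```

With these example numbers, losing 10% of paid time raises the effective price from $1.50 to about $1.67 per GPU-hour, and spot stays ahead of on-demand until half of the paid time is wasted.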

| Lever | Formula term affected | Expected effect | Prerequisite |
| --- | --- | --- | --- |
| Model routing | unit price × request count | Large | Difficulty judgment |
| Quantization (FP8) | throughput | Mid–large | Supported GPU |
| Continuous batching | utilization/throughput | Large | vLLM |
| Idempotent caching | request count | Mid | KV/Redis |
| Spot GPU | GPU-hour price | Mid–large | Interruption tolerance |

---

## Observability: you can't lower what you don't measure

Cost optimization starts **from measurement.** Write **input/output tokens, model, and latency** for each call to structured logs, and **visualize unit cost per model and per use case** on a dashboard. Without this, all you have is "somehow it's expensive," with no basis for choosing what to improve ([observability with OpenTelemetry](/blog/opentelemetry-observability-production-tracing-metrics-logs)). Keep PII (prompt bodies) out of the logs: **metadata alone is enough to track cost.**
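
As an illustration of how little data this needs, the sketch below aggregates metadata-only log records into cost per model and use case. The `CallLog` shape, the `PRICES` table keys, and `summarize` are assumptions made for this example; map them onto whatever your logging pipeline actually emits.

```ts
// Sketch: per-call cost records (metadata only, no prompt body) rolled up per model and use case.
type TokenPrice = { readonly inputPer1M: number; readonly outputPer1M: number };

type CallLog = {
  readonly model: string;
  readonly useCase: string;
  readonly inputTokens: number;
  readonly outputTokens: number;
  readonly latencyMs: number;
};

// Hypothetical price table keyed by the model label used in logs (article's example prices).
const PRICES: Record<string, TokenPrice> = {
  "llama-4-scout": { inputPer1M: 0.17, outputPer1M: 0.66 },
  "llama-4-maverick": { inputPer1M: 0.24, outputPer1M: 0.97 },
};

function summarize(logs: readonly CallLog[]): Map<string, { calls: number; usd: number }> {
  const out = new Map<string, { calls: number; usd: number }>();
  for (const l of logs) {
    const p = PRICES[l.model];
    if (!p) continue; // unknown model: skip rather than guess a price
    const usd = (l.inputTokens / 1e6) * p.inputPer1M + (l.outputTokens / 1e6) * p.outputPer1M;
    const key = `${l.model}/${l.useCase}`;
    const prev = out.get(key) ?? { calls: 0, usd: 0 };
    out.set(key, { calls: prev.calls + 1, usd: prev.usd + usd });
  }
  return out;
}

const sampleLogs: CallLog[] = [
  { model: "llama-4-scout", useCase: "classify", inputTokens: 1000, outputTokens: 200, latencyMs: 310 },
  { model: "llama-4-scout", useCase: "classify", inputTokens: 1000, outputTokens: 200, latencyMs: 295 },
  { model: "llama-4-maverick", useCase: "reason", inputTokens: 4000, outputTokens: 1000, latencyMs: 2100 },
];
console.log(summarize(sampleLogs));
```

Dividing `usd` by `calls` gives the unit cost per use case, which is the number to watch when deciding where routing or caching would pay off first.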

---

## Frequently asked questions (FAQ)

**Q. In the end, which is cheaper, API or self-hosting?**
A. **API if small/highly variable**, **self-hosting if steadily high-volume (you can maintain high utilization).** The boundary is "how close to full utilization you can keep the GPU." Plug your measured throughput and volume into this article's formula to derive it.

**Q. What tok/s should I compute with?**
A. **Always benchmark yourself.** It varies several-fold by model, quantization, batching, context length, and GPU. The 2,500 tok/s here is an **example.** Measure it with a load test against your own vLLM deployment and plug that value into the formula.
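
One way to turn load-test output into the aggregate tok/s the formula expects is sketched below. The `RequestResult` shape is an assumption about what your load-test tool records, and `aggregateOutputTokPerSec` is a name introduced for this example.

```ts
// Sketch: aggregate output throughput from a load test.
// Divide by the wall-clock duration of the whole test, not by the sum of per-request
// durations, because concurrent requests overlap in time.
type RequestResult = { readonly outputTokens: number; readonly ok: boolean };

function aggregateOutputTokPerSec(results: readonly RequestResult[], wallClockSec: number): number {
  if (wallClockSec <= 0) throw new RangeError("wallClockSec must be positive");
  const totalOutput = results
    .filter((r) => r.ok) // failed requests produced no usable tokens
    .reduce((sum, r) => sum + r.outputTokens, 0);
  return totalOutput / wallClockSec;
}

// Tiny illustrative run: two successful requests and one failure over 2 seconds.
const measuredTokPerSec = aggregateOutputTokPerSec(
  [
    { outputTokens: 300, ok: true },
    { outputTokens: 500, ok: true },
    { outputTokens: 400, ok: false },
  ],
  2,
);
console.log(measuredTokPerSec); // 400
```

Run the test at the concurrency and context lengths you expect in production, since both change the result.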

**Q. What is Provisioned Throughput?**
A. A way to **reserve dedicated throughput, billed by the hour,** in Bedrock. When usage-based pricing gets expensive at steady high volume, it is the option one step short of running your own GPUs. Consider it when you want both the ease of a managed service and a lower per-unit cost at high volume.

**Q. Should I worry about input tokens or output tokens?**
A. Generally **the output unit price is higher** (e.g., Scout is input $0.17 / output $0.66), so capping long generations with a tighter `maxTokens` and keeping summaries concise helps. On the other hand, **long-context RAG balloons the input**, so narrowing the context with retrieval is also a cost measure.

**Q. What's the single most effective move?**
A. **Model routing.** Just sending easy tasks to cheap models moves the total significantly. Use the top model "only for the hard parts."

---

## Conclusion

Llama's cost is determined by **design** more than by model choice. API is variable cost, self-hosting is fixed cost (GPU-hours) — capture this difference **with formulas** and put **utilization** in the lead role, and "how much does it cost" becomes a **calculation**, not a feeling.

1. First, **confirm quality and demand with API** (variable cost, zero operations).
2. **Measure** and visualize per-model, per-use unit cost.
3. Cut in the order **routing → quantization → batching → idempotent caching → spot.**
4. Once utilization is steadily high, confirm the **self-host/Provisioned break-even** with the formula and migrate.

> I can accompany you through the cost design of your LLM stack (deriving the break-even, routing, self-host migration). Take a look at my [track record](/case-studies/ai-video-localization-lipsync) facing GPU cost, and feel free to consult me. With **one person × generative AI**, fast, cheap, and safe.

### Sources / official resources

- [Amazon Bedrock pricing](https://aws.amazon.com/bedrock/pricing/) — usage-based unit price for each Llama model (confirm)
- [Meta's Llama in Amazon Bedrock](https://aws.amazon.com/bedrock/meta/) — models and delivery forms
- [vLLM Blog: Llama 4](https://blog.vllm.ai/2025/04/05/llama4.html) — throughput and quantization
- For cloud GPU pricing, refer to each provider's public prices (the H100 estimate here is as of June 2026)

* Pricing fluctuates and throughput depends on your own measurements. The numbers here are examples; recompute with your own real numbers before implementation.
