友田 陽大

Designing Llama inference cost: deriving the break-even of API vs. self-hosting with TCO

An article that answers 'how much does running Llama in production cost?' not by feel but with TCO. It explains, with verifiable code and real numbers: the cost formula for usage-based billing (Bedrock, etc.) and self-hosting (GPU-hours × throughput), how to derive the break-even, and the cost-reduction levers of model routing, quantization, batching, idempotent caching, and spot GPUs.

Reading time
9 min read
Author
友田 陽大

The goal of this article

"How much does it actually cost to run Llama in production?" — it's the first question asked in a project. This piece answers it not by feel but with TCO (total cost of ownership). The aim is to nail down the "cost" chapter of the big picture of putting Llama into production with formulas, real numbers, and code.

By the end of the article, you should be able to:

  1. Calculate the cost of API and self-hosting on the same footing.
  2. Derive where the break-even is for your own volume by plugging into the formula.
  3. Judge which lever to pull for the biggest effect, in design order.

Reliability disclosure: GPU cost is something I dealt with firsthand on a video AI localization platform. The numbers here are examples based on public information as of June 2026, and pricing fluctuates. The value is in the formulas and the way of thinking: plug in your own numbers rather than relying on the absolute values.


The 30-second conclusion

| | API (Bedrock, etc., usage-based) | Self-host (rent GPUs and run them yourself) |
| --- | --- | --- |
| Billing | Pay-per-use (input/output tokens) | GPU-hours are a fixed cost (billed even when idle) |
| Initial/operations | Nearly zero | Environment setup, scaling, monitoring, incident response |
| Condition for lower cost | Low–mid volume, high variability | High, stable utilization |
| Suited for | Prototypes, variable demand, small-batch high-variety | Steady high volume, data sovereignty, minimizing per-request cost |

The essence is this one line: API is variable cost only, self-hosting is fixed cost (GPU-hours). So "how much you can keep the GPU from sitting idle (utilization)" decides whether self-hosting wins.


Cost formulas: put them on the same footing

API (usage-based) cost

It is token counts multiplied by unit prices. Because it is this simple, it is easy to predict.

API cost = (input tokens / 1e6) × input unit price + (output tokens / 1e6) × output unit price

Example as of June 2026 (Amazon Bedrock, on-demand; verify against current pricing):

| Model | Input / 1M tokens | Output / 1M tokens | Positioning |
| --- | --- | --- | --- |
| Llama 4 Scout | $0.17 | $0.66 | Long-context, standard |
| Llama 4 Maverick | $0.24 | $0.97 | High-quality flagship |
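
To make the formula concrete, here is a minimal sketch that plugs the two example prices above into it. The monthly token volumes (500M input, 100M output) are hypothetical values chosen for illustration, not measurements, and the function name is not from any library.

```typescript
// Example prices from the table above (USD per 1M tokens, on-demand).
type TokenPrice = { readonly inputPer1M: number; readonly outputPer1M: number };

const SCOUT: TokenPrice = { inputPer1M: 0.17, outputPer1M: 0.66 };
const MAVERICK: TokenPrice = { inputPer1M: 0.24, outputPer1M: 0.97 };

// API cost = (input / 1e6) × input price + (output / 1e6) × output price
function monthlyApiCost(inputTokens: number, outputTokens: number, p: TokenPrice): number {
  return (inputTokens / 1e6) * p.inputPer1M + (outputTokens / 1e6) * p.outputPer1M;
}

// Hypothetical workload: 500M input tokens and 100M output tokens per month.
const monthlyInputTokens = 500e6;
const monthlyOutputTokens = 100e6;

const scoutUsd = monthlyApiCost(monthlyInputTokens, monthlyOutputTokens, SCOUT);
const maverickUsd = monthlyApiCost(monthlyInputTokens, monthlyOutputTokens, MAVERICK);

console.log(`Scout:    $${scoutUsd.toFixed(2)} / month`);    // 500×0.17 + 100×0.66 = $151.00
console.log(`Maverick: $${maverickUsd.toFixed(2)} / month`); // 500×0.24 + 100×0.97 = $217.00
```

With this input-heavy mix, more than half of the Scout bill comes from input tokens, which is why the input side cannot be ignored even though its unit price is lower.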

Self-host cost

Divide the fixed GPU-hour price by the amount of tokens you actually emitted in that time. Utilization matters here.

Self-host cost per 1M output tokens = (GPU hourly price × GPU count) ÷ (effective output throughput [tok/s] × 3600 × utilization / 1e6)

Example as of June 2026 (verify against current pricing): H100 is roughly $1.5–4 / GPU-hour (on-demand median ≈ $3, spot ≈ $1–2.5). Always benchmark the effective throughput yourself (it varies greatly by model, quantization, batching, and context length).
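
The effect of the utilization term is easiest to see by sweeping it. The sketch below assumes 2 GPUs at $3/GPU-hour and an aggregate 2,500 tok/s; the throughput is a hypothetical placeholder that you would replace with your own benchmark, and the function name is illustrative.

```typescript
// Self-host cost per 1M output tokens
//   = (GPU hourly price × GPU count) ÷ (throughput [tok/s] × 3600 × utilization / 1e6)
function costPer1MOutput(
  gpuHourlyUsd: number,
  gpuCount: number,
  tokPerSec: number,
  utilization: number, // fraction of time the GPUs are actually generating, 0–1
): number {
  const millionTokensPerHour = (tokPerSec * 3600 * utilization) / 1e6;
  if (millionTokensPerHour <= 0) return Infinity;
  return (gpuHourlyUsd * gpuCount) / millionTokensPerHour;
}

// Assumed setup: 2 GPUs at $3/GPU-hour, 2,500 tok/s aggregate (hypothetical).
const utilizationLevels = [1.0, 0.75, 0.5, 0.25, 0.1];
const utilizationTable = utilizationLevels.map((u) => ({
  utilization: u,
  usdPer1MOutput: Number(costPer1MOutput(3, 2, 2500, u).toFixed(2)),
}));

// Unit cost scales with 1 / utilization: halving utilization doubles the cost.
console.table(utilizationTable);
```

Under these assumptions the unit cost goes from about $0.67 at full utilization to about $6.67 at 10%, a tenfold difference with no change in hardware or model.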


A verifiable cost calculator (pure functions)

"World-class" isn't flashy code but small, testable pure functions. With zero side effects and determined only by inputs, they go straight into unit tests (the discipline of type safety).

// lib/llm-cost.ts: pure functions that compare API and self-hosting in the same unit ($ per 1M output tokens)
export type TokenPrice = { readonly inputPer1M: number; readonly outputPer1M: number };

/** Usage-based API cost (USD). Determined entirely by input/output token counts and unit prices. */
export function apiCost(inputTokens: number, outputTokens: number, p: TokenPrice): number {
  return (inputTokens / 1e6) * p.inputPer1M + (outputTokens / 1e6) * p.outputPer1M;
}

/** Self-host cost per 1M output tokens. Utilization dominates the result. */
export function selfHostCostPer1MOutput(params: {
  gpuHourlyUsd: number;       // e.g. 3.0 (H100 on-demand)
  gpuCount: number;           // e.g. 2 (tensor parallel)
  aggOutputTokPerSec: number; // e.g. 2500 (benchmark required: aggregate throughput under continuous batching)
  utilization: number;        // 0–1 (fraction of time the GPU is actually generating)
}): number {
  const { gpuHourlyUsd, gpuCount, aggOutputTokPerSec, utilization } = params;
  const tokensPerHour = aggOutputTokPerSec * 3600 * utilization;
  if (tokensPerHour <= 0) return Infinity; // an idle GPU has infinite unit cost: a pure loss
  return (gpuHourlyUsd * gpuCount) / (tokensPerHour / 1e6);
}

/**
 * Break-even utilization: the utilization at which self-hosting matches the API output price.
 * Clamped to 1, so a result of 1 means break-even is reached only at full utilization, or not at all.
 */
export function breakEvenUtilization(params: {
  gpuHourlyUsd: number; gpuCount: number; aggOutputTokPerSec: number; apiOutputPer1M: number;
}): number {
  const fullUtilCost = selfHostCostPer1MOutput({ ...params, utilization: 1 });
  return Math.min(1, fullUtilCost / params.apiOutputPer1M);
}

Plug in numbers (example, benchmark required)

Setting H100×2 at $3/GPU-hour and an aggregate 2,500 tok/s (a hypothetical continuous-batching throughput):

  • Output cost at 100% utilization = (3×2) ÷ (2500×3600×1.0 / 1e6) = about $0.67 / 1M output tok
  • Almost the same as Bedrock Scout's output price of $0.66 / 1M, i.e., at 100% utilization you only just break even.
  • Break-even utilization ≈ 0.67 / 0.66 ≈ 1.0. If throughput is higher (e.g., 5,000 tok/s), the break-even drops to ~0.5.
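
The same break-even can also be expressed as a monthly output volume, which is sometimes easier to compare against a demand forecast. This is a minimal sketch under stated assumptions: a 730-hour month, output tokens only (input-token charges and operations labor are ignored), and a hypothetical function name.

```typescript
const HOURS_PER_MONTH = 730; // assumption: average month length in hours

function breakEvenMonthlyOutputTokens(p: {
  gpuHourlyUsd: number;
  gpuCount: number;
  aggOutputTokPerSec: number;
  apiOutputPer1M: number;
}): { breakEvenTokens: number; capacityTokens: number; reachable: boolean } {
  // GPUs are billed for every hour of the month, busy or idle.
  const fixedMonthlyUsd = p.gpuHourlyUsd * p.gpuCount * HOURS_PER_MONTH;
  // Output volume at which the API bill equals that fixed GPU bill.
  const breakEvenTokens = (fixedMonthlyUsd / p.apiOutputPer1M) * 1e6;
  // The most the cluster can emit in a month at 100% utilization.
  const capacityTokens = p.aggOutputTokPerSec * 3600 * HOURS_PER_MONTH;
  return { breakEvenTokens, capacityTokens, reachable: breakEvenTokens <= capacityTokens };
}

const base = { gpuHourlyUsd: 3, gpuCount: 2, apiOutputPer1M: 0.66 };
const at2500 = breakEvenMonthlyOutputTokens({ ...base, aggOutputTokPerSec: 2500 });
const at5000 = breakEvenMonthlyOutputTokens({ ...base, aggOutputTokPerSec: 5000 });

for (const [label, r] of [["2,500 tok/s", at2500], ["5,000 tok/s", at5000]] as const) {
  console.log(
    `${label}: break-even ${(r.breakEvenTokens / 1e9).toFixed(2)}B tok/month, ` +
      `capacity ${(r.capacityTokens / 1e9).toFixed(2)}B, reachable=${r.reachable}`,
  );
}
```

At 2,500 tok/s the break-even volume (about 6.64B output tokens per month) slightly exceeds what the two GPUs can produce (6.57B), which matches the break-even utilization of roughly 1.0 above. At 5,000 tok/s capacity doubles and the break-even sits at about half of it.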

💡 Interpretation: self-hosting loses the moment you let the GPU sit idle. Unless it's a steady workload running near full utilization 24 hours a day, API (variable cost only) is often cheaper. First confirm quality and demand with API, then move to self-hosting once utilization is steadily high: that is the standard cost playbook. Spot (about $1–2.5/GPU-hour in the example above) and Bedrock's Provisioned Throughput are levers that move this break-even.


Cost-reduction levers (in order of largest effect)

Each lever lowers cost by acting on a specific term of the formulas above. They are listed in order of effectiveness at design time.

1. Model routing (the most effective)

Sending everything through the top model is the biggest waste. Route models by task difficulty.

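// Route to a cheaper model by difficulty. Sending the easy 80% to a cheap model alone cuts the total substantially.
function pickModel(task: { kind: "classify" | "extract" | "reason" }): string {
  switch (task.kind) {
    case "classify": return "meta/llama-3.3-8b"; // a small model is enough for classification and moderation
    case "extract":  return "meta/llama-4-scout"; // mid-tier for extraction
    case "reason":   return "meta/llama-4-maverick"; // top tier only for the hard parts
  }
}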
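
To estimate what routing is worth, you can compare a traffic-weighted blended cost against sending everything to the top model. The sketch below is illustrative only: the traffic split (80/15/5), the per-request token counts, and the small-model price are all assumptions, while the Scout and Maverick prices are the example values from the price table.

```typescript
type TokenPrice = { readonly inputPer1M: number; readonly outputPer1M: number };
type Kind = "classify" | "extract" | "reason";

const PRICE_BY_KIND: Record<Kind, TokenPrice> = {
  classify: { inputPer1M: 0.05, outputPer1M: 0.1 }, // hypothetical small-model price
  extract: { inputPer1M: 0.17, outputPer1M: 0.66 }, // Scout (example price)
  reason: { inputPer1M: 0.24, outputPer1M: 0.97 },  // Maverick (example price)
};

// Assumed traffic split: most requests are easy.
const TRAFFIC_SHARE: Record<Kind, number> = { classify: 0.8, extract: 0.15, reason: 0.05 };

function requestCost(p: TokenPrice, inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1e6) * p.inputPer1M + (outputTokens / 1e6) * p.outputPer1M;
}

// Expected cost of one request when each kind goes to its own model.
function blendedCostPerRequest(inputTokens: number, outputTokens: number): number {
  return (Object.keys(TRAFFIC_SHARE) as Kind[]).reduce(
    (sum, kind) =>
      sum + TRAFFIC_SHARE[kind] * requestCost(PRICE_BY_KIND[kind], inputTokens, outputTokens),
    0,
  );
}

// Assumed request shape: 1,000 input tokens and 200 output tokens.
const routedUsd = blendedCostPerRequest(1000, 200);
const allTopUsd = requestCost(PRICE_BY_KIND.reason, 1000, 200);
const savingRatio = 1 - routedUsd / allTopUsd;

console.log(`routed:  $${(routedUsd * 1e6).toFixed(0)} per 1M requests`);
console.log(`all-top: $${(allTopUsd * 1e6).toFixed(0)} per 1M requests`);
console.log(`saving:  ${(savingRatio * 100).toFixed(0)}%`);
```

The result is only as good as the assumed split and prices, so the traffic shares should come from your own logs before you rely on the saving figure.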

2. Quantization (FP8)

Using FP8 with vLLM reduces VRAM usage, so you can pack more onto the same GPU and raise throughput (the denominator of the self-host formula). --kv-cache-dtype fp8 shrinks the KV cache, which effectively widens the usable context, and the accuracy degradation is often small in practice.

3. Continuous batching

This is vLLM's core strength. It dynamically bundles many requests so the GPU never idles. This determines the effective throughput and therefore governs self-hosting's break-even. It is the heart of "raise utilization = lower cost."

4. Idempotent caching (zero regeneration)

Don't run the same generation twice for the same input. Caching with sha256(model + prompt + parameters) as the key eliminates the cost of rapid repeated submissions, retries, and duplicate requests (see the idempotent route implementation in the pillar article). Prompt caching also helps with the boilerplate part of prompts.
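
The caching idea can be sketched as follows. This is not the pillar article's implementation: the `GenerationParams` shape, the `generate` callback, and the in-memory `Map` (standing in for KV/Redis) are assumptions made to keep the example self-contained. It uses Node.js's built-in `node:crypto` module for the SHA-256 hash.

```typescript
import { createHash } from "node:crypto";

// Hypothetical parameter shape; include every field that changes the output.
type GenerationParams = { temperature: number; maxTokens: number };
type Generate = (model: string, prompt: string, params: GenerationParams) => Promise<string>;

function cacheKey(model: string, prompt: string, params: GenerationParams): string {
  // Serialize in a fixed field order so equal requests always hash identically.
  const canonical = JSON.stringify([model, prompt, params.temperature, params.maxTokens]);
  return createHash("sha256").update(canonical).digest("hex");
}

// In-memory stand-in for a shared store such as Redis.
const generationCache = new Map<string, Promise<string>>();

async function generateOnce(
  model: string,
  prompt: string,
  params: GenerationParams,
  generate: Generate,
): Promise<string> {
  const key = cacheKey(model, prompt, params);
  const hit = generationCache.get(key);
  if (hit !== undefined) return hit; // duplicate or retry: no new generation

  // Cache the in-flight promise so concurrent duplicates share a single call.
  const pending = generate(model, prompt, params).catch((err: unknown) => {
    generationCache.delete(key); // do not cache failures, so a retry can succeed
    throw err;
  });
  generationCache.set(key, pending);
  return pending;
}
```

Storing the promise rather than the finished text means two identical requests arriving at the same moment still trigger only one generation. Exact-match caching like this fits best where reusing a previous answer is acceptable, for example with deterministic sampling settings.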

5. Spot/preemptible GPUs

Spot can cut the self-host GPU-hour price itself, in the example above from about $3 on-demand to as low as $1–2.5. But interruption tolerance (checkpoints, draining, retry design) is a prerequisite.
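
Interruptions eat into the spot discount, and a rough way to reason about it is to price the useful GPU-hours only. The model below is a simplifying assumption of this sketch (a single "lost fraction" covering reloads and re-run requests), and the function names are hypothetical.

```typescript
// Effective price per useful GPU-hour on spot capacity.
// lostFraction: share of paid time lost to interruptions (reloading weights,
// re-running dropped requests). Assumed to be measurable from your own runs.
function effectiveSpotHourlyUsd(spotHourlyUsd: number, lostFraction: number): number {
  if (lostFraction < 0 || lostFraction >= 1) {
    throw new RangeError("lostFraction must be in [0, 1)");
  }
  return spotHourlyUsd / (1 - lostFraction);
}

// Largest lostFraction at which spot is still no more expensive than on-demand.
function maxTolerableLoss(spotHourlyUsd: number, onDemandHourlyUsd: number): number {
  return Math.max(0, 1 - spotHourlyUsd / onDemandHourlyUsd);
}

const effectiveUsd = effectiveSpotHourlyUsd(1.5, 0.1); // $1.5 spot, 10% of paid time lost
const tolerableLoss = maxTolerableLoss(1.5, 3.0);      // compared with $3 on-demand

console.log(`effective spot price: $${effectiveUsd.toFixed(2)} / useful GPU-hour`); // $1.67
console.log(`tolerable loss: ${(tolerableLoss * 100).toFixed(0)}%`);                // 50%
```

The effective price can then be used as the GPU hourly price in the self-host formula, so spot and on-demand are compared on the same footing.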

| Lever | Term it hits | Felt effect | Prerequisite |
| --- | --- | --- | --- |
| Model routing | unit price × count | Large | Difficulty judgment |
| Quantization (FP8) | throughput | Mid–large | Supported GPU |
| Continuous batching | utilization/throughput | Large | vLLM |
| Idempotent caching | count | Mid | KV/Redis |
| Spot GPU | GPU-hour price | Mid–large | Interruption tolerance |

Observability: you can't lower what you don't measure

Cost optimization starts from measurement. Write input/output tokens, model, and latency for every call to structured logs, and visualize unit cost per model and per use case on a dashboard. Without this, a vague sense that it is expensive gives you nothing to act on (observability with OpenTelemetry). Don't log PII (prompt bodies): metadata alone is enough to track cost.
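
A minimal sketch of such a log record is shown below. The field names, the `useCase` label, and the aggregation helper are illustrative assumptions, not a prescribed schema. The record carries metadata only, and the fields are copied explicitly so that a prompt body can never leak in by accident.

```typescript
type TokenPrice = { readonly inputPer1M: number; readonly outputPer1M: number };

// Metadata only: deliberately no prompt or completion text.
type LlmCallLog = {
  readonly ts: string;
  readonly model: string;
  readonly useCase: string;
  readonly inputTokens: number;
  readonly outputTokens: number;
  readonly latencyMs: number;
  readonly costUsd: number;
};

type CallMeta = Omit<LlmCallLog, "ts" | "costUsd">;

function buildCallLog(call: CallMeta, price: TokenPrice, now: Date = new Date()): LlmCallLog {
  const costUsd =
    (call.inputTokens / 1e6) * price.inputPer1M + (call.outputTokens / 1e6) * price.outputPer1M;
  // Copy fields one by one instead of spreading, so extra properties are dropped.
  return {
    ts: now.toISOString(),
    model: call.model,
    useCase: call.useCase,
    inputTokens: call.inputTokens,
    outputTokens: call.outputTokens,
    latencyMs: call.latencyMs,
    costUsd,
  };
}

// Sum cost per "model/useCase" key, the grouping a dashboard would chart.
function costByModelAndUseCase(logs: readonly LlmCallLog[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const log of logs) {
    const key = `${log.model}/${log.useCase}`;
    totals.set(key, (totals.get(key) ?? 0) + log.costUsd);
  }
  return totals;
}

const sampleLog = buildCallLog(
  { model: "meta/llama-4-scout", useCase: "extract", inputTokens: 1200, outputTokens: 300, latencyMs: 850 },
  { inputPer1M: 0.17, outputPer1M: 0.66 },
);
console.log(JSON.stringify(sampleLog)); // one JSON line per call
```

Emitting one JSON line per call keeps the records easy to ship to whatever log pipeline you already run, and the per-key totals show directly which model and use case drive the bill.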


Frequently asked questions (FAQ)

Q. In the end, which is cheaper, API or self-hosting? A. API if small/highly variable, self-hosting if steadily high-volume (you can maintain high utilization). The boundary is "how close to full utilization you can keep the GPU." Plug your measured throughput and volume into this article's formula to derive it.

Q. What tok/s should I compute with? A. Always benchmark yourself. It varies several-fold by model, quantization, batching, context length, and GPU. The 2,500 tok/s here is an example. Take a measured value from a vLLM load test and plug it into the formula.

Q. What is Provisioned Throughput? A. A method to reserve dedicated throughput on an hourly basis in Bedrock. When usage-based gets expensive at steady high volume, it's an option just short of running your own GPUs. Consider it when you want both "the ease of managed × the per-unit cost at high volume."

Q. Should I worry about input tokens or output tokens? A. Generally output unit price is higher (e.g., Scout is input $0.17 / output $0.66). Not letting long generations run on, narrowing maxTokens, and keeping summaries concise help. On the other hand, long-context RAG balloons the input, so narrowing the context with retrieval is also a cost measure.

Q. What's the single most effective move? A. Model routing. Just sending easy tasks to cheap models moves the total significantly. Use the top model "only for the hard parts."


Conclusion

Llama's cost is determined by design more than by model choice. API is variable cost, self-hosting is fixed cost (GPU-hours) — capture this difference with formulas and put utilization in the lead role, and "how much does it cost" becomes a calculation, not a feeling.

  1. First, confirm quality and demand with API (variable cost, zero operations).
  2. Measure and visualize unit cost per model and per use case.
  3. Cut in the order routing → quantization → batching → idempotent caching → spot.
  4. Once utilization is steadily high, confirm the self-host/Provisioned break-even with the formula and migrate.

I can accompany you through the cost design of your LLM stack (deriving the break-even, routing, self-host migration). Take a look at my track record facing GPU cost, and feel free to consult me. With one person × generative AI, fast, cheap, and safe.

Sources / official resources

  • Pricing and throughput fluctuate and depend on measurement. The numbers here are examples; recompute with your own real numbers before implementation.
友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.
