Let me state the conclusion first. The choice between using generative AI through a cloud API and hosting it on your own GPU is, in essence, a choice between metered billing and fixed cost. What decides the break-even point is usage volume and GPU utilization. For small or fluctuating usage, the API has an overwhelming edge. Self-hosting is justified only when a clear condition holds: high-volume, always-on demand, data that cannot leave your environment, or a regulatory requirement. For many companies the right first move is to start with an API; building your own GPU platform is usually premature.
This article draws on my experience running a quantized open model on a self-hosted GPU in production in an AI video-localization platform (#1 in CrowdWorks contract ranking). From the perspective of buyers and decision-makers, it covers (1) the difference in cost structure, (2) how to find the break-even point (with estimation code), and (3) a hybrid strategy. It is the "cost installment" of the guide to taking generative AI to production, and it applies the in-house vs. outsource decision to AI infrastructure.
1. The difference in cost structure: metered billing vs fixed cost
First, understand the two cost structures accurately.
| Aspect | Cloud API (ChatGPT / Claude / Gemini, etc.) | Self-hosting (open model × GPU) |
|---|---|---|
| Billing model | Metered (per token) | Fixed cost (monthly GPU instance + operations) |
| Initial cost | Almost zero (usable immediately) | GPU build, model selection, ops setup |
| Small usage | Cheap | Expensive (GPU cost even when unused) |
| High usage | Expensive (unbounded) | Cheap (use the fixed cost to the hilt) |
| Data residency | Sent externally | Stays in your own environment |
| Operational load | Small (left to the provider) | Large (GPU, scaling, failure response) |
With an API you pay only for what you use, so cost rises linearly with usage and has no ceiling at peak. With self-hosting you pay for the GPU you rented, so the fixed cost applies whether you use it or not, but the more you process, the lower the unit price per request.
Where these two lines cross is the break-even point.
2. The break-even is decided by "usage volume × utilization"
The break-even idea is simple. Find the usage volume at which "the monthly API metered cost" equals "the monthly fixed cost of your own GPU." Beyond this, self-hosting becomes favorable.
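Written out as a formula, the idea looks like the sketch below. The symbols are introduced here purely for illustration: $p$ is the API price per token, $C_{\text{gpu}}$ is the monthly fixed cost of the GPU, $T$ is the monthly token count, and $T^{*}$ is the break-even volume. The numbers in the last line are hypothetical, not measured prices.

```latex
\begin{aligned}
C_{\text{api}}(T)  &= p \cdot T \\
C_{\text{self}}(T) &= C_{\text{gpu}} \\
p \cdot T^{*} = C_{\text{gpu}} \;&\Longrightarrow\; T^{*} = \frac{C_{\text{gpu}}}{p} \\
\text{e.g. } C_{\text{gpu}} = 1000\ \text{USD},\quad p = \frac{2\ \text{USD}}{10^{6}\ \text{tokens}}
\;&\Longrightarrow\; T^{*} = 5 \times 10^{8}\ \text{tokens/month}
\end{aligned}
```

This treats self-hosting as a single flat cost, which only holds while one GPU can actually process $T$ tokens in a month.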
What matters decisively here is utilization. Even if you rent a GPU costing several hundred thousand yen a month, using it for only one hour a day wastes almost all of that fixed cost. "Self-hosting is cheap" holds only when you can keep the GPU busy at high utilization. Conversely, if demand is sporadic, so the GPU sits idle between bursts, the API is often cheaper even when total volume is high.
Let me express the estimate as a pure function with its assumptions stated. It hides no prices or rates: the caller passes every assumption in, and that is the key to making cost estimation "reproducible and verifiable."
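To see how strongly utilization moves the unit price, here is a minimal sketch that converts a fixed GPU cost into an effective price per million tokens. The function name, parameters, and the 30-day month are assumptions made for illustration, and `peakTokensPerSecond` is whatever throughput you measure on your own workload.

```typescript
/**
 * Effective self-hosting price per 1M tokens at a given utilization.
 * All inputs are caller-supplied assumptions; nothing here is a measured value.
 */
export function selfHostPricePerMillionTokens(
  gpuMonthlyCostUsd: number,
  peakTokensPerSecond: number,
  utilization: number, // fraction of the month the GPU is busy, in (0, 1]
): number {
  if (utilization <= 0 || utilization > 1) {
    throw new RangeError("utilization must be in (0, 1]");
  }
  if (peakTokensPerSecond <= 0) {
    throw new RangeError("peakTokensPerSecond must be positive");
  }
  const secondsPerMonth = 60 * 60 * 24 * 30; // assumes a 30-day month
  const tokensPerMonth = peakTokensPerSecond * secondsPerMonth * utilization;
  return gpuMonthlyCostUsd / (tokensPerMonth / 1_000_000);
}
```

With a hypothetical 1,000 USD/month GPU at 500 tokens/second, full utilization works out to about 0.77 USD per million tokens, while one hour a day (utilization 1/24) works out to about 18.5 USD per million tokens: the same hardware, 24 times the unit price.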
```typescript
/**
 * Pure function comparing monthly inference cost: metered API billing vs. the fixed cost of a self-hosted GPU.
 * Every assumption (unit price, GPU monthly cost, effective throughput) is passed in as an argument; no hidden constants.
 * Anyone can plug in their own company's numbers to reproduce and verify the result (testability / KISS).
 */
interface InferenceCostInputs {
  /** Tokens processed per month (input + output combined) */
  readonly monthlyTokens: number;
  /** API unit price (USD per 1M tokens). If input and output prices differ, pass a weighted average */
  readonly apiPricePerMillionTokens: number;
  /** Monthly fixed cost of the self-hosted GPU (USD). Include the instance fee plus an apportioned share of operations labor */
  readonly gpuMonthlyCostUsd: number;
  /** Effective GPU throughput (tokens/second): a realistic figure that factors in batching efficiency and utilization */
  readonly gpuEffectiveTokensPerSecond: number;
}

interface InferenceCostComparison {
  readonly apiMonthlyCostUsd: number;
  readonly selfHostMonthlyCostUsd: number;
  /** Above this monthly token count, self-hosting wins on simple cost */
  readonly breakEvenTokens: number;
  /** Upper bound on the tokens one GPU can physically process in a month */
  readonly monthlyCapacityTokens: number;
  readonly cheaper: "api" | "self-host";
  /** Break-even exceeds one GPU's capacity, so a single GPU cannot reach it (more GPUs are needed) */
  readonly breakEvenExceedsSingleGpuCapacity: boolean;
}

// Seconds in a 30-day month (a unit conversion, not a pricing assumption)
const SECONDS_PER_MONTH = 60 * 60 * 24 * 30;

export function compareInferenceCost(
  input: InferenceCostInputs,
): InferenceCostComparison {
  const apiPricePerToken = input.apiPricePerMillionTokens / 1_000_000;
  const apiMonthlyCostUsd = input.monthlyTokens * apiPricePerToken;
  const breakEvenTokens = input.gpuMonthlyCostUsd / apiPricePerToken;
  const monthlyCapacityTokens =
    input.gpuEffectiveTokensPerSecond * SECONDS_PER_MONTH;

  return {
    apiMonthlyCostUsd,
    selfHostMonthlyCostUsd: input.gpuMonthlyCostUsd,
    breakEvenTokens,
    monthlyCapacityTokens,
    cheaper:
      apiMonthlyCostUsd <= input.gpuMonthlyCostUsd ? "api" : "self-host",
    // If break-even exceeds one GPU's capacity, either raise utilization or accept that this demand suits an API
    breakEvenExceedsSingleGpuCapacity: breakEvenTokens > monthlyCapacityTokens,
  };
}
```
This function does not compute one cost on its own: the "engineering tax" covered in the next section. It counts only if the caller folds it into `gpuMonthlyCostUsd`.
3. Self-hosting's "engineering tax"
The cost of self-hosting isn't just the GPU instance fee. Always factor the invisible cost of operations (the engineering tax) into the break-even.
| Invisible cost | Content |
|---|---|
| GPU operations/scaling | Scaling to follow demand, handling spot-GPU interruptions, avoiding VRAM exhaustion |
| Model selection/updates | Choosing the quantization method, following new models, verifying accuracy/speed |
| Failure response/observability | Monitoring GPU nodes, detecting inference latency/failures, restart/resume |
| Security | Protecting models/data, least privilege, network separation |
In my AI video-localization platform, I adopted Azure spot GPUs (Tesla T4) to keep cost down. But the cloud can forcibly stop a spot instance at its own convenience with almost no warning, so I had to design for resuming after interruption as a precondition. I split the video into segments per speech interval and cache each segment's output to a persistent disk, so that when a spot GPU goes down, processing resumes from the completed segments. This is a typical engineering tax hiding behind "the GPU is cheap."
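The resume-from-completed-segments idea can be sketched roughly as follows. This is a minimal illustration, not the platform's actual code: the `Segment` shape, the `.out` file naming, and the `processOnGpu` callback are all hypothetical, and the callback is kept synchronous only to keep the example short.

```typescript
import { existsSync, mkdirSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { join } from "node:path";

/** One speech interval of the video (hypothetical shape). */
interface Segment {
  readonly id: string;
  readonly startSec: number;
  readonly endSec: number;
}

/**
 * Process segments in order, skipping any whose output already exists in
 * cacheDir (which should live on a persistent disk, not on the spot VM).
 * `processOnGpu` stands in for the real inference call.
 */
export function processResumable(
  segments: readonly Segment[],
  cacheDir: string,
  processOnGpu: (segment: Segment) => string,
): string[] {
  mkdirSync(cacheDir, { recursive: true });
  const outputs: string[] = [];
  for (const segment of segments) {
    const finalPath = join(cacheDir, `${segment.id}.out`);
    if (existsSync(finalPath)) {
      // Finished before an earlier interruption: reuse it, no GPU work.
      outputs.push(readFileSync(finalPath, "utf8"));
      continue;
    }
    const result = processOnGpu(segment);
    // Write to a temp file and rename, so an eviction mid-write never
    // leaves a partial file that looks like a completed segment.
    const tmpPath = `${finalPath}.tmp`;
    writeFileSync(tmpPath, result, "utf8");
    renameSync(tmpPath, finalPath);
    outputs.push(result);
  }
  return outputs;
}
```

If the process is killed partway through, rerunning the same call with the same `cacheDir` repeats only the segments that had not finished.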
Implication for buyers: always confirm whether the "self-hosting is cheap" estimate includes the cost of GPU operations, scaling, failure response, and model updates. A self-hosting estimate missing these is usually an underestimate.
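One way to check an estimate is to make the engineering tax an explicit line item. The sketch below is an assumption-laden illustration: the field names are hypothetical, operations effort is reduced to hours times an hourly rate, and the GPU count is the smallest number of GPUs whose combined capacity covers the monthly volume.

```typescript
const SECONDS_PER_30_DAY_MONTH = 60 * 60 * 24 * 30;

interface SelfHostCostInputs {
  readonly monthlyTokens: number;
  /** Instance fee per GPU per month (USD) */
  readonly gpuInstanceMonthlyUsd: number;
  /** Effective throughput per GPU (tokens/second) */
  readonly gpuEffectiveTokensPerSecond: number;
  /** Engineer hours per month spent on scaling, updates, incidents, security */
  readonly opsHoursPerMonth: number;
  readonly engineerHourlyRateUsd: number;
}

interface SelfHostCostBreakdown {
  readonly gpuCount: number;
  readonly instanceCostUsd: number;
  readonly engineeringTaxUsd: number;
  readonly totalUsd: number;
}

export function estimateSelfHostMonthlyCost(
  input: SelfHostCostInputs,
): SelfHostCostBreakdown {
  if (input.gpuEffectiveTokensPerSecond <= 0) {
    throw new RangeError("gpuEffectiveTokensPerSecond must be positive");
  }
  const capacityPerGpu =
    input.gpuEffectiveTokensPerSecond * SECONDS_PER_30_DAY_MONTH;
  // At least one GPU is rented even when volume is tiny.
  const gpuCount = Math.max(1, Math.ceil(input.monthlyTokens / capacityPerGpu));
  const instanceCostUsd = gpuCount * input.gpuInstanceMonthlyUsd;
  const engineeringTaxUsd = input.opsHoursPerMonth * input.engineerHourlyRateUsd;
  return {
    gpuCount,
    instanceCostUsd,
    engineeringTaxUsd,
    totalUsd: instanceCostUsd + engineeringTaxUsd,
  };
}
```

With hypothetical inputs of 2 billion tokens a month, 800 USD per GPU, 500 tokens/second, and 40 operations hours at 60 USD, the result is 2 GPUs, 1,600 USD of instance fees, and 2,400 USD of engineering tax. In that scenario the line item a naive estimate omits is the larger one.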
4. The four conditions under which self-hosting is still justified
So when should you choose self-hosting? When at least one of the following four conditions clearly holds.
- High-volume/always-on — stable demand enough to use the GPU to the hilt at high utilization (consistently exceeding the break-even).
- Data sovereignty — confidential data/personal information can't be sent to an external API (finance, healthcare, public sector, highly confidential B2B).
- Regulation/governance — on-premises requirements or regulations like restrictions on cross-border data transfer.
- An extreme unit-cost requirement / customization — the per-token price directly governs the business's profitability, or domain-specific fine-tuning is essential.
My video-localization platform chose self-hosting mainly for the first and fourth conditions (high-volume/always-on demand and an extreme unit-cost requirement). Translating and dubbing videos into eight languages needs a large amount of LLM inference, and metered API billing doesn't pencil out. So translation runs on a self-hosted GPU with quantized open models (Qwen3-8B-AWQ on vLLM, 4-bit quantized Llama-3), and transcription uses faster-whisper (large-v3 / int8–float16). By optimizing the license, quality, and speed trade-offs in stages, I run high-volume processing within the fixed cost.
Conversely, if these conditions aren't clear, self-hosting is overinvestment. "I vaguely don't want data to go outside" or "self-hosting is cooler" don't count among the four conditions.
5. The real lever to lower cost is "design that doesn't call"
Whichever you choose — API or self-hosting — the biggest cost reduction comes from a design that "reduces the number of times you call AI at all." This is far more effective than negotiating a pricing plan.
Concrete examples of cost optimization I implemented.
- Silence-segment skipping (video localization): extract only the intervals where someone is actually speaking from the dubbed audio and subtitle intervals, and keep the original footage for silent intervals without passing them through the GPU. This alone cut GPU processing by about 40% and also resolved the diffusion model's mouth hallucinations.
- Hybrid OCR (broadcaster platform): rather than applying expensive LLM OCR to every video frame, detect transitions in the on-screen captions (telops) with local processing and apply the LLM only to the unique diffs. This minimized LLM calls while preserving accuracy.
- Content-hash idempotent caching: reuse the result for the same input to prevent double processing and double billing.
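The last item, content-hash caching, can be sketched as a small wrapper. This is a minimal illustration under stated assumptions: the in-memory `Map` stands in for whatever durable store a real system would use, the wrapped call is synchronous only for brevity, and `withContentCache` is a hypothetical name.

```typescript
import { createHash } from "node:crypto";

/** SHA-256 of the input content, used as the idempotency key. */
function contentHash(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

/**
 * Wrap an expensive call so that identical inputs are computed only once.
 * The Map is a stand-in: it lives in memory and is lost on restart.
 */
export function withContentCache(
  expensiveCall: (input: string) => string,
): (input: string) => string {
  const cache = new Map<string, string>();
  return (input: string): string => {
    const key = contentHash(input);
    const hit = cache.get(key);
    if (hit !== undefined) {
      return hit; // same content seen before: no second call, no second bill
    }
    const result = expensiveCall(input);
    cache.set(key, result);
    return result;
  };
}
```

Keying on a hash of the content rather than on a file name or request ID means a retried or re-uploaded input still hits the cache, which is what makes the operation idempotent.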
"How to cut wasted AI calls while preserving quality" — design skill shows here. Captivated by unit price alone and neglecting this "design that doesn't call," cost swells for both API and self-hosting.
6. The realistic answer: hybrid and "design that avoids lock-in"
The optimal answer for most companies is a hybrid that "reaches production with an API first, and moves only high-frequency, sensitive workloads to self-hosting in stages." Building your own GPU platform from the start delays the launch and prepays the engineering tax.
And what makes this staged migration possible is lock-in avoidance via a provider abstraction. In implementation, I place AI engines (transcription, translation, speech synthesis, lip sync, etc.) behind the abstraction boundary of "interface → provider → factory," making them swappable with just an environment variable. With this,
- a migration like "translation starts with an API, and moves to a self-hosted open model when volume grows" can be done without touching business logic (ETC: Easier to Change).
- you localize dependence on a specific vendor and reduce the risk of price revisions and service shutdowns.
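The "interface → provider → factory" boundary described above might look like the following for translation. Everything named here is hypothetical: the `TranslationProvider` interface, the two stub providers, and the `TRANSLATION_PROVIDER` variable are illustrative, and a real provider would call a vendor SDK or an inference server instead of returning a tagged string.

```typescript
/** Interface: the only thing business logic depends on. */
interface TranslationProvider {
  readonly name: string;
  translate(text: string, targetLang: string): Promise<string>;
}

/** Provider backed by a hosted API (stubbed for illustration). */
class ApiTranslationProvider implements TranslationProvider {
  readonly name = "api";
  async translate(text: string, targetLang: string): Promise<string> {
    return `[api:${targetLang}] ${text}`;
  }
}

/** Provider backed by a self-hosted open model (stubbed for illustration). */
class SelfHostedTranslationProvider implements TranslationProvider {
  readonly name = "self-hosted";
  async translate(text: string, targetLang: string): Promise<string> {
    return `[self-hosted:${targetLang}] ${text}`;
  }
}

/**
 * Factory: choose the provider from an environment variable.
 * The environment is passed in so the choice is easy to test.
 */
export function createTranslationProvider(
  env: Record<string, string | undefined>,
): TranslationProvider {
  const kind = env["TRANSLATION_PROVIDER"] ?? "api"; // default: start with an API
  switch (kind) {
    case "api":
      return new ApiTranslationProvider();
    case "self-hosted":
      return new SelfHostedTranslationProvider();
    default:
      throw new Error(`Unknown TRANSLATION_PROVIDER: ${kind}`);
  }
}
```

Application code would call `createTranslationProvider(process.env)` once at startup and use only the interface afterwards, so moving from the API to a self-hosted model is a configuration change rather than a code change.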
Don't make "API or self-hosting" a one-time, irreversible decision; keep a structure you can switch anytime — this is the core of cost strategy in the fast-changing AI domain.
FAQ
Q. For generative AI, is an API or self-hosting cheaper?
It depends on usage volume and utilization. For small, fluctuating usage, an API (metered) is overwhelmingly cheaper; if high-volume/always-on and you can use the GPU to the hilt at high utilization, self-hosting (fixed cost) is cheaper. The break-even point is the usage volume where "API monthly = GPU monthly fixed cost." But since self-hosting has invisible operational costs (the engineering tax), estimate including them. Most companies should start with an API first.
Q. I don't want data to go outside, so should I self-host?
If data sovereignty is a "clear requirement" (confidential data or personal information that regulation forbids sending to an API), self-hosting is justified. If the concern is only a vague unease, the requirement can often be met through the API providers' data-handling options (settings that exclude your data from training, zero-data-retention options, region selection). First confirm "whether there's truly a regulatory or contractual constraint the API can't meet."
Q. What are self-hosting's "invisible costs"?
Besides the GPU instance fee, there are operational costs (the engineering tax): GPU operations/scaling, handling spot-GPU interruptions, model selection/updates/accuracy verification, failure response/observability, and security. Estimating "self-hosting is cheap" without factoring these in is almost always an underestimate.
Q. What's the most effective way to lower cost?
A design that "reduces the number of times you call AI at all." Reducing wasted calls is far more effective than negotiating unit price. Cache the result for the same input, narrow the processing target to only the relevant spots (don't apply an expensive model to all data), and filter with cheap pre-processing before calling. In my real example, excluding silent intervals from GPU processing cut GPU processing by about 40%.
Q. Can I switch between API and self-hosting later?
It's possible depending on the design. If you place AI engines behind the abstraction boundary of "interface → provider → factory" and make them swappable with an environment variable, you can do the migration "API first, self-hosting when volume grows" without touching business logic. Not making it an irreversible decision from the start, but keeping a switchable structure, is the standard play in the fast-changing AI domain.
Summary: decided by utilization, prepared for with hybrid
To avoid overpaying for generative-AI infrastructure, keep these points in mind.
- The essence is "metered billing vs fixed cost" — the break-even is decided by usage volume, and utilization governs everything.
- Most companies should start with an API — self-hosting only when the four conditions (high-volume/always-on, data sovereignty, regulation, extreme unit-cost/customization) are clear.
- Self-hosting has an "engineering tax" — include GPU operations, scaling, and failure response in the estimate.
- The biggest lever is "design that doesn't call" — structurally reduce calls with caching, diff processing, and pre-filtering.
- Hybrid + provider abstraction — start with an API and move only the necessary parts to self-hosting. Keep a switchable structure.
Cost design for generative AI, estimating the API vs self-hosting break-even, building a self-hosted GPU platform, cost optimization of an existing AI system: the problem of "AI's cost is unpredictable / too expensive" can be solved by design. I handle it end to end, from requirements definition through cost design, infrastructure, and operations, at production-operations quality.