友田 陽大

Self-hosting Llama in production with vLLM: a high-throughput inference-server operations log

A practical guide to running Llama in production on your own GPUs with vLLM. Maximize throughput with continuous batching and PagedAttention, fit the model into VRAM with FP8 quantization and tensor parallelism, and serve it as an OpenAI-compatible endpoint. With real code, it covers how to build an inference platform that stays up: health checks, observability, autoscaling, graceful drain, Bedrock fallback, and network isolation.

Author
友田 陽大

The goal of this article

In the overview article on putting Llama into production, I recommended vLLM for self-hosting. This piece is the implementation runbook for that recommendation. When data cannot leave your environment, when you want to minimize per-request cost, or when you need to run fine-tuned weights, it shows the full process, in real code, for running Llama on your own GPUs quickly and without outages.

The goal is not to be able to type vllm serve. It is to be able to assemble a production inference platform equipped with health checks, observability, autoscaling, resilience, and security.

Reliability disclosure: the difficulty of running GPUs in production (throughput, failures, cost) is something I worked through first-hand on a video AI localization platform. The commands and flags here are based on the official sources (vLLM Blog: Llama 4, vLLM Recipes). Throughput numbers are environment-dependent, so benchmark on your own hardware.


Why vLLM

Whether self-hosting pays off is decided by how little the GPU sits idle, that is, by effective throughput. vLLM is optimized for exactly this.

  • PagedAttention: manages the KV cache in fixed-size blocks (pages), like an OS's virtual memory, which suppresses fragmentation and fits more concurrent requests into the same VRAM.
  • Continuous batching: batches requests dynamically as they arrive. New requests join the running batch as soon as there is room instead of waiting for the whole batch to finish, so the GPU does not sit idle. This directly moves the break-even point for self-hosting.
  • OpenAI-compatible server: exposes the API under /v1 (port 8000 by default). You can switch over by changing only the base URL, without touching application code.
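
To make the PagedAttention point concrete, here is a deliberately simplified sketch of why block-wise allocation wastes less KV-cache space than reserving one contiguous slab per request. It is not vLLM's actual allocator; the block size of 16 tokens, the sequence lengths, and the function names are assumptions chosen for illustration.

```typescript
// Simplified model of KV-cache reservation; not vLLM's real allocator.
const BLOCK_SIZE = 16; // tokens per block (assumed for illustration)

/** Tokens reserved when every request gets a contiguous slab sized for the maximum length. */
function contiguousReserved(seqLens: number[], maxLen: number): number {
  return seqLens.length * maxLen;
}

/** Tokens reserved when every request holds only the fixed-size blocks it has started to fill. */
function pagedReserved(seqLens: number[], blockSize: number = BLOCK_SIZE): number {
  return seqLens.reduce((sum, len) => sum + Math.ceil(len / blockSize) * blockSize, 0);
}

const seqLens = [100, 700, 35, 2048]; // current token counts of four in-flight requests
const maxLen = 4096;

const inUse = seqLens.reduce((a, b) => a + b, 0);
console.log("tokens in use:      ", inUse); // → 2883
console.log("contiguous reserved:", contiguousReserved(seqLens, maxLen)); // → 16384
console.log("paged reserved:     ", pagedReserved(seqLens)); // → 2912
```

In this toy model the paged layout wastes at most one partially filled block per request, which is what leaves room for more concurrent requests.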

Serve: Scout with FP8 × tensor parallelism

A minimal production command in line with the official recipe. FP8 quantization reduces VRAM usage, and tensor parallelism shards the model across multiple GPUs.

# Serve Llama 4 Scout (FP8) with tensor parallelism across 2 GPUs; an FP8 KV cache buys extra context
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct-FP8 \
  --tensor-parallel-size 2 \
  --kv-cache-dtype fp8 \
  --max-model-len 500000 \
  --gpu-memory-utilization 0.90 \
  --port 8000

  • --tensor-parallel-size 2: shard the weights and KV cache across 2 GPUs. Choose the count based on available VRAM and the model size.
  • --kv-cache-dtype fp8: store the KV cache in FP8. The usable context length effectively grows and throughput can improve. Accuracy degradation is usually small in practice, but verify it on your own evaluations.
  • --max-model-len: the maximum context length (prompt plus output) the server accepts. Scout nominally supports 10M tokens, but narrow this to what your VRAM and latency budget can actually sustain. Do not leave it unbounded.
  • --gpu-memory-utilization: the fraction of each GPU's memory vLLM may reserve. Too high risks OOM; too low shrinks the KV cache and reduces concurrency.
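
To see how --kv-cache-dtype and --max-model-len interact, a back-of-the-envelope estimate of KV-cache size helps. The sketch below uses the standard per-token formula (keys and values for every layer and KV head). The model shape and the 40 GiB budget are hypothetical placeholders, not Scout's real configuration; read the actual values from your model's config.json and from vLLM's startup log.

```typescript
interface KvShape {
  numLayers: number;  // transformer layers that keep a KV cache
  numKvHeads: number; // key/value heads (after grouped-query attention)
  headDim: number;    // dimension per head
}

/** Bytes of KV cache one token occupies: K and V, for every layer and KV head. */
function kvBytesPerToken(shape: KvShape, bytesPerElement: number): number {
  return 2 * shape.numLayers * shape.numKvHeads * shape.headDim * bytesPerElement;
}

/** How many tokens fit in a KV-cache budget (summed over all GPUs in the tensor-parallel group). */
function tokensThatFit(budgetBytes: number, shape: KvShape, bytesPerElement: number): number {
  return Math.floor(budgetBytes / kvBytesPerToken(shape, bytesPerElement));
}

// Hypothetical shape and budget, for illustration only.
const shape: KvShape = { numLayers: 48, numKvHeads: 8, headDim: 128 };
const budget = 40 * 1024 ** 3; // assume 40 GiB remain for the KV cache after loading weights

console.log("FP16 KV cache, tokens:", tokensThatFit(budget, shape, 2)); // → 218453
console.log("FP8 KV cache, tokens: ", tokensThatFit(budget, shape, 1)); // → 436906
```

The token budget is shared by all concurrent requests, so halving the bytes per element doubles either the context length or the concurrency you can afford, not both.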

🔧 Sticking point: FP8 assumes a supported GPU such as H100/H200/Ada/Blackwell. On an unsupported GPU the server either fails to start or falls back to a slower path. Always confirm the GPU generation with nvidia-smi and the actual dtype in the startup log.

Once it's up, you can hit it with the OpenAI client as-is.

import OpenAI from "openai";
// Application code stays unchanged: just point baseURL at your own endpoint.
const llm = new OpenAI({ baseURL: "http://llama-internal:8000/v1", apiKey: "internal" });
const r = await llm.chat.completions.create({
  model: "meta-llama/Llama-4-Scout-17B-16E-Instruct-FP8",
  messages: [{ role: "user", content: "Explain RAG re-ranking in two lines" }],
});

Containerization: make it a reproducible image

In production, a manual pip install is not reproducible. The rule is to bake the runtime into an image.

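# Dockerfile: build on the official vLLM image; the model is mounted or downloaded at runtime
# (:latest is shown for brevity; pin a specific tag to keep builds reproducible)
FROM vllm/vllm-openai:latest

# Keep the model ID and parallelism in environment variables (one image, swappable config)
ENV MODEL_ID=meta-llama/Llama-4-Scout-17B-16E-Instruct-FP8 \
    TP_SIZE=2

# Health check: no traffic until /health returns 200 (assumes curl is available in the image)
HEALTHCHECK --interval=15s --timeout=5s --start-period=600s --retries=10 \
  CMD curl -fsS http://localhost:8000/health || exit 1

# exec replaces the shell so that vllm receives SIGTERM directly (needed for graceful shutdown)
ENTRYPOINT ["sh", "-c", \
  "exec vllm serve $MODEL_ID --tensor-parallel-size $TP_SIZE --kv-cache-dtype fp8 --port 8000"]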

Setting a long --start-period (model loading takes minutes) is unglamorous but important. It keeps the container from being judged unhealthy while the model is still loading, which would otherwise send the orchestrator into a restart loop.
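
The same "wait out the model load" logic is useful in deploy scripts and smoke tests. The following is a minimal sketch, assuming Node 18+ (global fetch) and vLLM's /health endpoint; the function names, the base URL, and the default interval and timeout (mirroring the HEALTHCHECK values above) are illustrative choices.

```typescript
type HealthCheck = () => Promise<boolean>;

/** Polls `check` until it returns true or `timeoutMs` elapses. Resolves to the final verdict. */
async function waitUntilHealthy(
  check: HealthCheck,
  { intervalMs = 15_000, timeoutMs = 600_000 }: { intervalMs?: number; timeoutMs?: number } = {},
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    // A refused connection while the model is loading simply means "not ready yet".
    const ok = await check().catch(() => false);
    if (ok) return true;
    if (Date.now() + intervalMs > deadline) return false;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

/** Health check against vLLM's /health endpoint; the base URL is an assumption. */
const vllmHealth = (baseUrl: string): HealthCheck => async () => {
  const res = await fetch(`${baseUrl}/health`, { signal: AbortSignal.timeout(5_000) });
  return res.ok;
};

// Usage (not executed here):
// waitUntilHealthy(vllmHealth("http://localhost:8000")).then((ok) => console.log("healthy:", ok));
```

Passing the check as a function keeps the polling loop testable without a running server.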


Observability: vLLM emits metrics from the start

vLLM exposes Prometheus-format metrics at /metrics. Scrape them and you can visualize throughput, the wait queue, KV cache usage, and TTFT/TPOT, and decide when to scale based on numbers rather than feel.

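# Key metrics to watch (examples; exact names can differ between vLLM versions)
# vllm:num_requests_running     … requests currently running (saturation indicator)
# vllm:num_requests_waiting     … wait queue (scale out if it keeps growing)
# vllm:gpu_cache_usage_perc     … KV cache usage (staying high means you are at the limit)
# vllm:time_to_first_token_*    … perceived latency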
curl -s http://localhost:8000/metrics | grep -E "num_requests|gpu_cache_usage"

If num_requests_waiting keeps piling up, capacity is insufficient. Use that as the trigger to autoscale GPU nodes. Keep the observability design consistent with the OpenTelemetry article: correlate metrics, logs, and traces, and never emit PII.
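
As a sketch of how that trigger could be computed, the snippet below parses the Prometheus text format and flags a sustained backlog. The helper names, the threshold, and the window size are assumptions for illustration; in a real setup this logic would more likely live in a Prometheus alert rule or your autoscaler's configuration.

```typescript
/** Sums all samples of one metric in Prometheus text format (labels are ignored). */
function readMetric(text: string, name: string): number | undefined {
  let total: number | undefined;
  for (const line of text.split("\n")) {
    if (line.startsWith("#")) continue; // HELP / TYPE comment lines
    const match = line.match(/^([^\s{]+)(?:\{[^}]*\})?\s+(\S+)/);
    if (!match || match[1] !== name) continue;
    const value = Number(match[2]);
    if (!Number.isNaN(value)) total = (total ?? 0) + value;
  }
  return total;
}

/** True when the wait queue stayed above `threshold` for the last `window` samples. */
function sustainedBacklog(waitingSamples: number[], threshold: number, window: number): boolean {
  if (waitingSamples.length < window) return false;
  return waitingSamples.slice(-window).every((w) => w > threshold);
}

// Hand-written sample in the shape of a /metrics response (label value is made up).
const sample = [
  "# TYPE vllm:num_requests_waiting gauge",
  'vllm:num_requests_waiting{model_name="scout"} 7.0',
  'vllm:num_requests_running{model_name="scout"} 32.0',
].join("\n");

console.log(readMetric(sample, "vllm:num_requests_waiting")); // → 7
console.log(sustainedBacklog([0, 2, 6, 7, 9], 5, 3)); // → true
```

Requiring several consecutive samples above the threshold avoids adding an expensive GPU node because of a single short spike.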


Autoscaling and graceful drain

GPUs are expensive, so capacity should follow demand. At the same time, not dropping in-flight requests during scale-in or a rollout is part of service quality.

  • Scale-out decision: trigger on num_requests_waiting or queue dwell time to add GPU nodes. Account for the cold start (model loading takes minutes) and add capacity ahead of need.
  • Graceful drain: on receiving a termination signal, stop accepting new requests, let in-flight ones run to completion, then shut down. Take the pod out of load balancer rotation first, then wait in preStop.

# k8s (excerpt): readiness gates traffic in and out; preStop lets in-flight requests drain before shutdown
readinessProbe:
  httpGet: { path: /health, port: 8000 }
  periodSeconds: 10
lifecycle:
  preStop:
    exec: { command: ["sh", "-c", "sleep 30"] } # wait for in-flight requests to finish
terminationGracePeriodSeconds: 120
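
If you run your own gateway or proxy in front of vLLM, the same drain sequence (stop accepting, finish in-flight work, then exit) can be sketched as below. This illustrates the pattern for a hypothetical Node process; it is not something vLLM provides, and the class and method names are made up.

```typescript
/** Tracks in-flight work so a process can stop accepting requests and wait for completion. */
class InFlightTracker {
  private active = 0;
  private draining = false;
  private waiters: Array<() => void> = [];

  /** Runs `task` unless draining has begun; new work is rejected after that. */
  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.draining) throw new Error("draining: not accepting new requests");
    this.active++;
    try {
      return await task();
    } finally {
      this.active--;
      if (this.active === 0) this.waiters.splice(0).forEach((resolve) => resolve());
    }
  }

  /** Stops accepting new work; resolves true if in-flight work finished within `timeoutMs`. */
  async drain(timeoutMs: number): Promise<boolean> {
    this.draining = true;
    if (this.active === 0) return true;
    let timer: ReturnType<typeof setTimeout> | undefined;
    const idle = new Promise<boolean>((resolve) => this.waiters.push(() => resolve(true)));
    const timeout = new Promise<boolean>((resolve) => {
      timer = setTimeout(() => resolve(false), timeoutMs);
    });
    const finished = await Promise.race([idle, timeout]);
    clearTimeout(timer);
    return finished;
  }
}

// Usage (not executed here): drain on SIGTERM, staying inside terminationGracePeriodSeconds.
// const tracker = new InFlightTracker();
// process.on("SIGTERM", async () => { await tracker.drain(90_000); process.exit(0); });
```

Keep the drain timeout shorter than the pod's termination grace period, otherwise the process is killed before the drain can finish.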

Resilience: don't let a failure of your own node become a total outage

Design self-hosting on the premise that nodes will fail. Give the client timeouts, retries, and a fallback, so that even if your own node dies, traffic escapes to Bedrock and the service keeps running.

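// lib/llama-client.ts: prefer self-hosted vLLM; fall back to Bedrock on failure (never stop serving)
import OpenAI from "openai";
// Hypothetical module path: the Converse API call lives in its own function (SRP).
import { viaBedrock } from "./bedrock-client";

const selfHosted = new OpenAI({ baseURL: process.env.VLLM_URL, apiKey: "internal", timeout: 30_000 });

export async function generate(prompt: string): Promise<{ text: string; via: "self" | "bedrock" }> {
  try {
    const r = await selfHosted.chat.completions.create({
      model: "meta-llama/Llama-4-Scout-17B-16E-Instruct-FP8",
      messages: [{ role: "user", content: prompt }],
    });
    return { text: r.choices[0]?.message.content ?? "", via: "self" };
  } catch (err) {
    // A 4xx means the request itself is invalid; falling back would not help, so rethrow.
    if (err instanceof OpenAI.APIError && err.status !== undefined && err.status < 500) throw err;
    // Timeouts, connection failures, and 5xx are an expected path: fall back to the managed service.
    const text = await viaBedrock(prompt);
    return { text, via: "bedrock" };
  }
}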

To prevent a cascading failure, also put a circuit breaker in front of the fallback so that a flood of retries does not overwhelm it. The usual trio applies: retry, backoff, circuit breaker.
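
A circuit breaker can be sketched in a few lines. The version below is a minimal illustration, not a production library: after a configurable number of consecutive failures it fails fast for a cooldown period, then lets a probe call through. The class name, thresholds, and the injected clock are assumptions made for the example.

```typescript
/** Minimal circuit breaker: fail fast after repeated failures, probe again after a cooldown. */
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | undefined;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, cooldownMs: number, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  /** True while the breaker is rejecting calls. */
  get isOpen(): boolean {
    return this.openedAt !== undefined && this.now() - this.openedAt < this.cooldownMs;
  }

  /** Runs `fn` unless the breaker is open. Failures are rethrown so the caller can fall back. */
  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.isOpen) throw new Error("circuit open: failing fast");
    try {
      const result = await fn();
      this.failures = 0;
      this.openedAt = undefined; // a success closes the breaker
      return result;
    } catch (err) {
      this.failures++;
      // After the cooldown the next call is a probe; if it fails, the breaker reopens at once.
      if (this.failures >= this.failureThreshold) this.openedAt = this.now();
      throw err;
    }
  }
}

// Usage (not executed here): one breaker per upstream, e.g. one for vLLM and one for the fallback.
// const bedrockBreaker = new CircuitBreaker(5, 30_000);
// const text = await bedrockBreaker.call(() => viaBedrock(prompt));
```

This simplified version does not limit concurrent probes after the cooldown; a production breaker would usually allow only one.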


Security: never expose the inference endpoint

vLLM's OpenAI-compatible server has no strong authentication of its own; at most it can check a single static API key (--api-key). Not exposing it directly to the internet is the basic premise.

  • Network isolation: place it in a private subnet / VPC so it cannot be reached directly from outside.
  • An auth gateway in front: have an API Gateway or reverse proxy handle authentication, rate limiting, and audit logs. The application reaches it via internal DNS.
  • No data egress: the whole motive for self-hosting is that data does not leave. Do not keep prompt bodies (PII) in logs either; record only metadata.
  • Input validation: the prompt is external input too. Enforce a length cap and schema validation at the boundary, in the same spirit as type safety.
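
The input-validation bullet can be sketched as a small boundary check that runs in the gateway before anything reaches vLLM. The request shape, field names, and both caps below are hypothetical; choose limits that match your --max-model-len and your product's needs.

```typescript
interface ChatRequest {
  prompt: string;
  maxTokens: number;
}

type Validation = { ok: true; value: ChatRequest } | { ok: false; error: string };

const MAX_PROMPT_CHARS = 32_000; // assumed cap; keep it consistent with --max-model-len
const MAX_OUTPUT_TOKENS = 2_048; // assumed cap

/** Validates an untrusted request body and returns either a typed request or an error message. */
function validateChatRequest(input: unknown): Validation {
  if (typeof input !== "object" || input === null) {
    return { ok: false, error: "body must be an object" };
  }
  const { prompt, maxTokens = 512 } = input as Record<string, unknown>;
  if (typeof prompt !== "string" || prompt.trim().length === 0) {
    return { ok: false, error: "prompt must be a non-empty string" };
  }
  if (prompt.length > MAX_PROMPT_CHARS) {
    return { ok: false, error: `prompt exceeds ${MAX_PROMPT_CHARS} characters` };
  }
  if (typeof maxTokens !== "number" || !Number.isInteger(maxTokens) || maxTokens < 1 || maxTokens > MAX_OUTPUT_TOKENS) {
    return { ok: false, error: `maxTokens must be an integer between 1 and ${MAX_OUTPUT_TOKENS}` };
  }
  return { ok: true, value: { prompt, maxTokens } };
}

console.log(validateChatRequest({ prompt: "Explain RAG re-ranking in two lines" }));
console.log(validateChatRequest({ prompt: "" }));
```

Returning a discriminated union instead of throwing lets the gateway log only the error string and metadata, never the prompt body.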

Load testing: know "where it breaks" before production

Do not guess throughput and latency; measure them. Raise concurrency in stages and find the saturation point, where num_requests_waiting jumps and TTFT degrades. From there, decide the safe concurrency per node.

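# Measure effective throughput with vLLM's bundled benchmark (numbers are environment-dependent; measure your own)
# Recent releases expose it as `vllm bench serve`; older ones ship benchmarks/benchmark_serving.py in the repository.
# Confirm the available flags with --help for your version.
vllm bench serve \
  --backend openai --base-url http://localhost:8000 \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct-FP8 \
  --num-prompts 500 --request-rate 20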
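
Once you have one result per concurrency stage, picking the safe per-node concurrency can be sketched as below. The result shape, the thresholds, the 80% headroom factor, and the sample numbers are all made up for illustration; substitute your own measurements and latency budget.

```typescript
interface StageResult {
  concurrency: number; // concurrent requests driven in this stage
  ttftP95Ms: number;   // p95 time to first token observed in this stage
  avgWaiting: number;  // average num_requests_waiting during this stage
}

/**
 * Returns the highest tested concurrency that stayed within budget, scaled by a headroom factor.
 * Stops at the first stage that violates either the TTFT budget or the queue limit.
 */
function safeConcurrency(
  stages: StageResult[],
  ttftBudgetMs: number,
  maxWaiting: number,
  headroom = 0.8,
): number {
  const sorted = [...stages].sort((a, b) => a.concurrency - b.concurrency);
  let lastHealthy = 0;
  for (const stage of sorted) {
    if (stage.ttftP95Ms > ttftBudgetMs || stage.avgWaiting > maxWaiting) break;
    lastHealthy = stage.concurrency;
  }
  return Math.floor(lastHealthy * headroom);
}

// Illustrative numbers only, not measurements.
const stages: StageResult[] = [
  { concurrency: 8, ttftP95Ms: 180, avgWaiting: 0 },
  { concurrency: 16, ttftP95Ms: 210, avgWaiting: 0 },
  { concurrency: 32, ttftP95Ms: 260, avgWaiting: 0.5 },
  { concurrency: 64, ttftP95Ms: 900, avgWaiting: 12 },
];

console.log(safeConcurrency(stages, 500, 1)); // → 25
```

The headroom factor keeps normal operation below the measured knee, so a traffic burst does not immediately push the node into saturation.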

Only once you plug the effective throughput measured here into the formula from the cost article does the self-hosting break-even become a concrete number.
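
The cost article's formula is not reproduced here, but the basic arithmetic can be sketched under simple assumptions: a fixed hourly GPU price, a measured tokens-per-second figure, and a utilization factor for the share of each hour the node is actually busy. All prices and throughput values below are hypothetical.

```typescript
/** Cost of generating one million tokens on self-hosted GPUs. */
function costPerMillionTokens(
  gpuHourlyCost: number,   // price of one GPU per hour
  gpuCount: number,        // GPUs per serving node (e.g. the tensor-parallel size)
  tokensPerSecond: number, // measured effective throughput of the node
  utilization = 1,         // fraction of the hour the node is actually busy (0 to 1)
): number {
  const tokensPerHour = tokensPerSecond * 3600 * utilization;
  return ((gpuHourlyCost * gpuCount) / tokensPerHour) * 1_000_000;
}

/** Monthly token volume above which a fixed self-hosting cost undercuts a per-token managed price. */
function breakEvenTokensPerMonth(monthlyFixedCost: number, managedPricePerMillion: number): number {
  return (monthlyFixedCost / managedPricePerMillion) * 1_000_000;
}

// Hypothetical inputs: 2 GPUs at 4.5 per hour, 2,500 tokens/s measured.
console.log(costPerMillionTokens(4.5, 2, 2500).toFixed(2));      // fully utilized
console.log(costPerMillionTokens(4.5, 2, 2500, 0.25).toFixed(2)); // busy only 25% of the time
console.log(breakEvenTokensPerMonth(6000, 3));
```

The utilization factor is where the earlier point about keeping the GPU from idling shows up in money terms: the same hardware costs four times as much per token at 25% utilization.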


Frequently asked questions (FAQ)

Q. How do Ollama and vLLM differ? A. Ollama targets local development: single machine, easy to set up. vLLM targets production and high throughput, handling many concurrent requests with continuous batching. Use Ollama for development and vLLM for production (see the self-hosting chapter of the pillar article).

Q. How many GPUs are needed? A. It depends on the model, the quantization, and the context length. For Scout (FP8), tensor parallelism across a few GPUs is realistic. Load-test on one node first, then work back from the point where num_requests_waiting saturates to decide the count.

Q. Can I set the context length to 10M? A. You can, but VRAM and latency will not hold up. The iron rule in production is to narrow --max-model-len to the length you actually need. Use long context within a realistic range, together with --kv-cache-dtype fp8.

Q. Can I run fine-tuned weights too? A. Yes. Point vllm serve at the LoRA-merged weights and you can serve a specialized model in production as-is.

Q. Is it heavier to operate than a managed service (Bedrock)? A. Yes. You own scaling, failures, updates, and monitoring yourself, so choose it when steady high volume makes the cost worthwhile. For small or variable workloads, Bedrock is the right answer.


Conclusion

vllm serve is one line; the production inference platform is everything around it.

  1. Maximize throughput with continuous batching × FP8 × tensor parallelism.
  2. Stay up and never drop requests, with health checks, metrics, autoscaling, and graceful drain.
  3. Keep your own failure from becoming a total outage, with client-side fallback.
  4. Protect the endpoint with isolation, an auth gateway, and no data egress.
  5. Derive the break-even by putting load-test numbers into the cost formula.

I build private, observable, resilient Llama inference platforms, including load testing and cost estimation. Take a look at my track record running GPUs in production and get in touch. One person × generative AI: fast, cheap, and safe.

Sources / official resources

  • Flags, throughput, and GPU requirements depend on updates and measurement. Confirm with primary sources and your own benchmark before implementation.
友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.
