# Llama Complete Guide: Shipping Meta's Open-Weight LLM to Production, Faithful to the Official Docs (Llama 4, Bedrock, Llama API)

> A guide to Meta's open-weight LLM, Llama, grounded in the official documentation (llama.com, Meta AI, Hugging Face). It covers how Llama 4 Scout and Maverick work, implementation with the OpenAI-compatible Llama API, AWS Bedrock, Ollama, and vLLM, type-safe structured output, the license (700M MAU, Built with Llama), and how to choose a model in the Muse Spark era, all with production-oriented code.

- Published: 2026-06-24
- Author: 友田 陽大
- Tags: Llama, Generative AI, LLM, AWS Bedrock, Open weight, RAG, Python, TypeScript
- URL: https://tomodahinata.com/en/blog/meta-llama-open-weight-llm-production-guide
- Category: Llama & open-weight LLMs

## Key points

- Llama 4 is Meta's first natively multimodal, mixture-of-experts (MoE) open-weight model family. Scout has 17B active parameters across 16 experts and a 10M-token context, among the longest in the industry; Maverick has 17B active parameters across 128 experts, 400B in total. Both activate only a fraction of their weights per token, so inference is light
- Three ways to run it, by use: the Llama API (OpenAI-compatible, `/compat/v1`) to try it fastest; AWS Bedrock's Converse API for fully managed production; Ollama, vLLM, or transformers for self-hosting
- The value of open weights is data sovereignty, fine-tuning, cost optimization, and lock-in avoidance. When to use them and when to use a closed API (Claude and others) is presented as a design decision
- Meta pivoted to the proprietary Muse Spark in April 2026, but Llama 4 remains the open-weight mainstay you can download, own, and operate yourself
- Commercial use is allowed, but watch for the pitfalls of the Llama Community License: the 700M MAU threshold, the 'Built with Llama' attribution, the 'Llama' prefix on derived model names, and the Acceptable Use Policy (AUP)

---

## The Goal of This Article

**Llama** is the collective name for the **open-weight large language models (LLMs)** published by Meta. Unlike GPT or Claude, which you only borrow through an API, **you can download the weights and own, modify, and operate them yourself**. That is the essence of Llama.

This article stays **strictly grounded in the official documentation ([llama.com](https://www.llama.com/) / [Meta AI Blog](https://ai.meta.com/blog/llama-4-multimodal-intelligence/) / [Llama Developer Docs](https://llama.developer.meta.com/docs/overview/) / [Hugging Face](https://huggingface.co/meta-llama))** and gathers into one place, with working code, what the official docs leave scattered: **which model to use, where to run it, how to use it in which situation, and where it gets stuck**. By the end, you should be able to do the following three things.

1. Explain to others **what kind of model Llama 4 is and why it is "light yet long"** (MoE and the 10M-token context).
2. Decide which of **the Llama API, AWS Bedrock, or self-hosting** to choose, and **start building today**.
3. Put together an implementation that holds up in **production** rather than a demo: type-safe structured output, idempotency, observability, resilience, and **the license pitfalls**.

> **About the author (disclosed for credibility)**: I design, implement, and **run in production** generative-AI systems such as RAG and voice agents on my own, built on **AWS Bedrock** and the **Vercel AI SDK**. On Bedrock, Claude and Llama **sit behind the same Converse API**, so **choosing between a closed API and open weights** is a decision I make on every project. The code and design principles in this article are drawn from that real-world operation (the [voice-agent case study](/case-studies/ai-voice-chatbot), the [Bedrock×pgvector RAG design](/blog/production-voice-ai-sales-agent-bedrock-pgvector), [pgvector production RAG](/blog/pgvector-postgres-production-rag-hybrid-search)). I don't quote inflated Llama numbers; all sources are listed in the [official resources](#sources-and-official-resources) at the end.

---

## A 30-Second Summary (Conclusion First)

| Viewpoint | Conclusion |
| --- | --- |
| **What it is** | Meta's **open-weight LLM**. You can download the weights to own, fine-tune, and operate them yourself |
| **Current mainstay** | **Llama 4** (released April 2025). Meta's first **natively multimodal MoE** models. Scout and Maverick are downloadable |
| **Ultra-long context** | **Llama 4 Scout supports up to 10M tokens** (official figure; among the longest of any open or closed model) |
| **Why it is light** | MoE (Mixture-of-Experts) **activates only 17B of the total parameters per token**, so inference is fast and cheap |
| **Just to try it** | The **Llama API** (Meta's official hosting, **OpenAI-compatible** at `https://api.llama.com/compat/v1/`). A three-line swap |
| **Production** | **AWS Bedrock** (fully managed, Converse API). If you are already on AWS, the shortest path to production quality |
| **Self-hosting** | **Ollama** (local) / **vLLM** (high throughput) / **transformers** (fine-tuning foundation) |
| **Versus closed models** | Proprietary models (Claude and others) are a step ahead on frontier reasoning, but Llama **wins on ownership, fine-tuning, cost, and data sovereignty** |
| **The 2026 shift** | Meta moved its focus to the proprietary **Muse Spark**, but **Llama 4 is still the open-weight mainstay you can own** |
| **Commercial use** | Allowed, subject to the **Llama Community License** (the 700M MAU threshold, the `Built with Llama` attribution, the "Llama" prefix on derived names, the AUP) |

If you want to **check the quality on your own prompts first**, jump straight to the [Llama API chapter](#usage-a-llama-api-openai-compatible-the-fastest-way-to-try) right after this. It works just by swapping the OpenAI SDK's `baseURL`.

---

## What Is Llama 4: Native Multimodal × MoE

In April 2025, Meta released the **Llama 4 "herd."** In the official blog's words, it marks "**a new era of natively multimodal AI innovation**," and it differs decisively from earlier Llama generations in two ways.

1. **Native multimodality**: text and images are handled **by the same model from the start** (early fusion), not bolted on afterward.
2. **MoE (Mixture-of-Experts)**: out of a very large total parameter count, **only the experts needed for each input are activated**. The total is large, but the computation (and so the cost) per token is small.

MoE is how **Llama 4 manages to be large yet light**. For example, of Scout's 109B total parameters, **only 17B are active for any given token**. That keeps inference cheap and fast, and with Int4 quantization Scout fits on a single H100.
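
To make the "large yet light" arithmetic concrete, here is a rough sketch in Python. The parameter counts come from the spec table; the bytes-per-parameter values are the usual ones for each precision. The calculation deliberately ignores KV cache, activations, and runtime overhead, so treat the results as lower bounds rather than sizing guidance.

```python
# Back-of-the-envelope MoE arithmetic for Llama 4 Scout.
# Parameter counts are from this article; everything else is a simplification.

TOTAL_PARAMS = 109e9   # all experts together (this is what must fit in memory)
ACTIVE_PARAMS = 17e9   # parameters used per token (this is what drives compute)

BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def active_fraction(active: float, total: float) -> float:
    """Share of the weights that take part in computing one token."""
    return active / total


def weight_memory_gb(total: float, precision: str) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return total * BYTES_PER_PARAM[precision] / 1e9


fraction = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
print(f"active share per token: {fraction:.1%}")
for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_memory_gb(TOTAL_PARAMS, precision):.1f} GB of weights")
```

Compute per token scales with the 17B active parameters, while memory scales with the full 109B. That is why quantization, not MoE alone, is what brings Scout's weights (about 54.5 GB at Int4) within reach of one 80 GB H100.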

### The Herd's Composition (Official Specs)

| Model | Active / total params | Experts | Native context length | Modality | Availability | Positioning |
| --- | --- | --- | --- | --- | --- | --- |
| **Llama 4 Scout** | **17B / 109B** | 16 | **up to 10M tokens** | Text + image | **Downloadable** (HF, llama.com) | Single H100 (Int4). The ultra-long-context star |
| **Llama 4 Maverick** | **17B / 400B** | 128 | ~1M tokens | Text + image | **Downloadable** | Single H100 DGX host. The general-purpose flagship |
| **Llama 4 Behemoth** | 288B / about 2T | 16 | Not public | Multimodal | **In training, unreleased** | The teacher (distillation source) model |
| **Llama 3.3 70B Instruct** | 70B (dense) | n/a | 128k tokens | Text only | Downloadable, Llama API | The mature, everyday text model |
| **Llama 3.3 8B Instruct** | 8B (dense) | n/a | 128k tokens | Text only | Llama API | Low latency, low cost |

> ⚠️ **The most important caveat: the model's native context length and the context length a host actually serves are different things.** Scout is **natively 10M**, but it is served at **128k on the Llama API**, and **on AWS Bedrock Scout is about 3.5M and Maverick 1M** (per the official description, which says these will be expanded). If you design on the assumption that you have 10M tokens, you will hit the host's effective limit. **Always confirm the limit of the host you use.**

At the time of writing, Behemoth (about 2 trillion parameters) is **still in training and unreleased**. According to Meta, Scout and Maverick were **improved by distillation from Behemoth**. In other words, a huge parent model was used to make the smaller children smarter.
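
The native-versus-served caveat above is easy to encode as a guard that runs before any model call. The sketch below is one possible shape: the limits are the ones quoted in this article, while the host keys, the function names `effective_limit` and `check_fits`, and the idea that the caller already has a token count are all assumptions made for illustration.

```python
# Effective context limits per host, as stated in this article at the time of writing.
# Assumption: these values change over time; confirm them with each host.
NATIVE_LIMIT = {"scout": 10_000_000, "maverick": 1_000_000}

HOST_LIMIT = {
    ("llama-api", "scout"): 128_000,
    ("llama-api", "maverick"): 128_000,
    ("bedrock", "scout"): 3_500_000,
    ("bedrock", "maverick"): 1_000_000,
}


def effective_limit(host: str, model: str) -> int:
    """The usable limit: the smaller of the native and the host-served length.

    Hosts missing from HOST_LIMIT fall back to the native length, which is
    optimistic: a self-hosted server is usually capped lower by memory.
    """
    served = HOST_LIMIT.get((host, model), NATIVE_LIMIT[model])
    return min(served, NATIVE_LIMIT[model])


def check_fits(host: str, model: str, prompt_tokens: int, max_output_tokens: int) -> None:
    """Raise before calling the model if prompt plus output budget exceeds the limit."""
    limit = effective_limit(host, model)
    needed = prompt_tokens + max_output_tokens
    if needed > limit:
        raise ValueError(f"{needed} tokens needed, but {model} on {host} serves {limit}")
```

Calling `check_fits("llama-api", "scout", 200_000, 1024)` raises immediately, which is cheaper than discovering the limit from a failed request in production.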

---

## Why Choose "Open Weight" (Use-as-Appropriate with Claude / GPT)

The most important point about Llama is not its benchmark numbers but the single fact that **you can own the weights**. That is what separates it from Claude and GPT, and the right answer is to **use each where it fits**. No single model is best at everything.

**Where Llama (open weights) wins:**

- **Data sovereignty / compliance**: the input cannot be sent to an external API (medical, finance, local government, confidential data). Place the weights in **your own VPC or on-premises** and run inference without data leaving your environment.
- **Fine-tuning**: you want to **specialize the weights themselves** for your domain. Only a model whose weights you hold can serve as the base for LoRA or full fine-tuning.
- **Cost optimization**: for steady, high-volume inference, self-hosting on **GPUs you keep fully utilized** tends to beat metered **per-token pricing** on cost per request. The small 17B active parameter count from MoE helps too.
- **Lock-in avoidance**: your business is **not held hostage** by an API's price increase, spec change, or shutdown. The weights don't disappear.
- **Edge / isolated environments**: run the model at sites or on devices with no network connection.

**Where a closed API (Claude and others) wins:**

- **Frontier reasoning, coding, agents**: on high-difficulty tasks, proprietary frontier models are still a notch above (see the [benchmark comparison](#comparison-with-other-models-honest-version) later).
- **Starting with zero operations**: no GPU procurement, no MLOps. Call the API and it works.
- **Mature safety mechanisms and tooling**: moderation, tool execution, and well-developed multi-agent support.

In practice, I **use both behind the same abstraction**. On Bedrock, Claude and Llama can both be called through the Converse API, so a **two-tier split of cost and quality** falls out naturally: **the hard parts go to Claude; bulk preprocessing, extraction, and classification go to Llama** (the design philosophy is covered in the [Vercel AI SDK production article](/blog/vercel-ai-sdk-production-llm-apps-streaming-tools-rag) and the [Claude API implementation article](/blog/claude-api-ai-sdk-v6-production-ai-features)).
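
The cost-optimization point above can be turned into a break-even calculation. This is a minimal sketch: the function names are illustrative, and every number passed in is a hypothetical placeholder rather than a quoted price or a measured throughput.

```python
# Break-even sketch: metered API pricing vs. a GPU you pay for by the hour.
# Every number passed in below is a hypothetical placeholder, not a quoted price.


def self_hosted_cost_per_million(
    gpu_cost_per_hour: float, tokens_per_second: float, utilization: float
) -> float:
    """Cost per 1M tokens when the GPU is busy for `utilization` (0 to 1) of the time."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


def break_even_utilization(
    gpu_cost_per_hour: float, tokens_per_second: float, api_price_per_million: float
) -> float:
    """Utilization above which self-hosting is cheaper than the API.

    A result over 1.0 means self-hosting never wins at this throughput.
    """
    at_full_load = self_hosted_cost_per_million(gpu_cost_per_hour, tokens_per_second, 1.0)
    return at_full_load / api_price_per_million


# Placeholders: $3.60/hour GPU, 2,000 tokens/s aggregate, $1.00 per 1M tokens on the API.
u = break_even_utilization(3.60, 2000, 1.00)
print(f"self-hosting wins above {u:.0%} utilization")
```

With these placeholder inputs the break-even lands at about 50% utilization. Substitute your own measured throughput and real prices before drawing any conclusion; an idle GPU costs the same as a busy one.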

---

## The Current Location in 2026: Does Llama End with Muse Spark?

Let me answer plainly the question everyone is asking. In April 2026, Meta Superintelligence Labs released **Muse Spark**. It is Meta's first **proprietary, API-only reasoning model (weights not published)**, and Meta AI's smart glasses have already been switched from Llama 4 to Muse Spark. Some media reported that "**Meta abandoned open-source Llama**."

**To avoid a wrong design decision here, let me separate out just the facts.**

- Muse Spark's **weights cannot be obtained**. It **does not satisfy the very reasons for choosing Llama**: ownership, fine-tuning, self-hosting, and data sovereignty. Meta says it "hopes to open-source a future version," but for now that is only a hope.
- Meanwhile, **Llama 4 Scout and Maverick are still downloadable** and continue to run on Hugging Face, Bedrock, and Ollama. The weights **cannot be recalled**: open weights that have been distributed cannot be erased.
- So if your requirement is to **own the model and run it yourself**, the mainstay as of June 2026 is **still Llama 4**. Muse Spark belongs on a different evaluation axis, as another closed API competing with Claude and GPT.

Conclusion: **the arrival of Muse Spark is no reason to abandon Llama.** If anything, the line between projects that need open weights and projects that need frontier closed-model reasoning is now clearer than ever. This article is a map for the former: the side that **owns the model and takes it to production**.

---

## Usage A: Llama API (OpenAI-Compatible, the Fastest Way to Try)

When you first want to check whether a model handles your prompts well, **calling Meta's official Llama API, with no GPU to set up**, is the shortest path. The biggest advantage is **OpenAI SDK compatibility**: swap the `baseURL`, set the key to `LLAMA_API_KEY`, and your existing OpenAI code runs as-is.

- **OpenAI-compatible endpoint**: `https://api.llama.com/compat/v1/`
- **Authentication**: `LLAMA_API_KEY` (Bearer token)
- **Model IDs**: `Llama-4-Maverick-17B-128E-Instruct-FP8` / `Llama-4-Scout-17B-16E-Instruct-FP8` / `Llama-3.3-70B-Instruct` / `Llama-3.3-8B-Instruct` (all 128k context on the API)

> 📌 **A note on accuracy**: at the time of writing, the Llama API is **a preview offering** (waitlist-based). Model IDs, endpoints, and parameters may change, so **always check the latest [official documentation](https://llama.developer.meta.com/docs/overview/)**. The code in this article shows the **structure**.

### TypeScript (Use the OpenAI SDK As-Is)

```ts
// scripts/llama-quickstart.ts
import OpenAI from "openai";

// The Llama API is OpenAI-compatible: point baseURL at /compat/v1 and swap the key.
const client = new OpenAI({
  apiKey: process.env.LLAMA_API_KEY,
  baseURL: "https://api.llama.com/compat/v1/",
});

const completion = await client.chat.completions.create({
  model: "Llama-4-Maverick-17B-128E-Instruct-FP8",
  messages: [
    { role: "system", content: "You are a technical assistant that puts factual accuracy first. Label guesses as guesses." },
    { role: "user", content: "Give Llama 4 Scout's context length in one sentence." },
  ],
});

console.log(completion.choices[0]?.message.content);
```

### Python (Same Form)

```python
# pip install openai
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["LLAMA_API_KEY"],
    base_url="https://api.llama.com/compat/v1/",
)

resp = client.chat.completions.create(
    model="Llama-4-Scout-17B-16E-Instruct-FP8",
    messages=[{"role": "user", "content": "Explain MoE in three lines"}],
)
print(resp.choices[0].message.content)
```

What OpenAI compatibility buys you is that **the cost of switching providers is close to zero**. Prototype on the Llama API, verify locally on Ollama, self-serve on vLLM: the same code runs **just by changing `baseURL` and `model`** (Bedrock, covered next, is called through its own Converse API in this article). This swappability is the practical reason people who dislike lock-in choose Llama.
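
One way to keep that swap in a single place is a small provider table. The sketch below uses only the endpoints and model names that appear in this article; the `Provider` dataclass, the `client_kwargs` helper, the placeholder key for local servers, and the assumption that vLLM listens on `localhost` are all illustrative.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Provider:
    base_url: str
    model: str
    api_key_env: Optional[str]  # None: a local server that does not check the key


# Endpoints and model names as they appear in this article.
PROVIDERS = {
    "llama-api": Provider(
        "https://api.llama.com/compat/v1/",
        "Llama-4-Scout-17B-16E-Instruct-FP8",
        "LLAMA_API_KEY",
    ),
    "ollama": Provider("http://localhost:11434/v1", "llama3.3", None),
    # Assumption: vLLM is serving on localhost with its default port 8000.
    "vllm": Provider(
        "http://localhost:8000/v1", "meta-llama/Llama-4-Scout-17B-16E-Instruct", None
    ),
}


def client_kwargs(name: str) -> dict:
    """Keyword arguments for an OpenAI-compatible client constructor."""
    provider = PROVIDERS[name]
    # Local servers ignore the key, but client libraries expect a non-empty string.
    api_key = os.environ[provider.api_key_env] if provider.api_key_env else "not-needed"
    return {"base_url": provider.base_url, "api_key": api_key}
```

With the `openai` package installed, `OpenAI(**client_kwargs("ollama"))` plus `model=PROVIDERS["ollama"].model` would be the only provider-specific lines in the application.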

---

## Usage B: AWS Bedrock (Production, Fully Managed)

If your existing stack is on AWS, **Bedrock is the shortest path to production quality**. There is no GPU procurement or MLOps, and IAM, VPC, audit logs, and billing are all integrated with AWS. Llama is available through Bedrock's **Converse API** (a unified interface across models), so you can **swap between Claude and Llama within the same code**.

- **Model IDs (foundation models)**: `meta.llama4-scout-17b-instruct-v1:0` / `meta.llama4-maverick-17b-instruct-v1:0`
- **Cross-region inference profiles** (improved availability, recommended): `us.meta.llama4-scout-17b-instruct-v1:0` / `us.meta.llama4-maverick-17b-instruct-v1:0`
- **Context length (effective on Bedrock)**: Scout about 3.5M / Maverick 1M (per the official description, which says these will be expanded)

### A Minimal Production-Quality Implementation (Type-Safe, Resilient, Observable)

The difference between code that works and code that doesn't fall over in production comes down to **boundary validation, retries, and visibility into token billing**.

```python
# llama_bedrock.py: call Llama 4 through Bedrock at production quality
import logging
import boto3
from botocore.config import Config

logger = logging.getLogger("llama")

# Authentication is delegated to the IAM role / environment (no keys in code).
# Retries use botocore's adaptive mode, which backs off between attempts.
# Throttling (429) and transient failures are absorbed as a normal case.
_bedrock = boto3.client(
    "bedrock-runtime",
    region_name="us-east-1",
    config=Config(retries={"max_attempts": 4, "mode": "adaptive"}, read_timeout=60),
)

# Cross-region inference profile (us. prefix) for better availability and throughput.
MODEL_ID = "us.meta.llama4-scout-17b-instruct-v1:0"


def ask_llama(system: str, user: str, *, max_tokens: int = 1024) -> str:
    """Single-shot Q&A. Converse is model-agnostic, so switching to Claude is a one-line change."""
    resp = _bedrock.converse(
        modelId=MODEL_ID,
        system=[{"text": system}],
        messages=[{"role": "user", "content": [{"text": user}]}],
        inferenceConfig={"maxTokens": max_tokens, "temperature": 0.2, "topP": 0.9},
    )
    # Observability: record token usage in structured logs (never the body, which may hold PII).
    usage = resp["usage"]
    logger.info(
        "llama.converse",
        extra={"in_tokens": usage["inputTokens"], "out_tokens": usage["outputTokens"], "model": MODEL_ID},
    )
    return resp["output"]["message"]["content"][0]["text"]
```

### Streaming (Cut Perceived Latency)

In a conversational UI, simply **displaying tokens as they arrive** instead of waiting for the full text transforms the experience. Converse has a streaming variant.

```python
def stream_llama(system: str, user: str):
    """Yield the generation incrementally; the UI can render it as a typing effect."""
    stream = _bedrock.converse_stream(
        modelId=MODEL_ID,
        system=[{"text": system}],
        messages=[{"role": "user", "content": [{"text": user}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
    )
    for event in stream["stream"]:
        if delta := event.get("contentBlockDelta"):
            # Only text deltas carry a "text" field; skip any other delta type.
            if text := delta["delta"].get("text"):
                yield text
```

> 💡 **Why choose the Converse API**: `InvokeModel` uses **a different input/output JSON shape for each model**, while Converse uses **the same shape across models**. You can therefore route "Llama for cost, only the hard parts to Claude" with **a single conditional**, which makes vendor comparison and A/B testing much easier.
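
The "single conditional" can be sketched as a request builder whose only model-specific part is the `modelId`. The Llama ID is the one used in this article. The Claude ID is deployment-specific, so this sketch reads it from a hypothetical `CLAUDE_MODEL_ID` environment variable, and the task labels in `HARD_TASKS` are likewise made up for illustration.

```python
import os

LLAMA_MODEL_ID = "us.meta.llama4-scout-17b-instruct-v1:0"  # from this article
# Assumption: the Claude model ID comes from a hypothetical environment variable;
# the fallback string is a placeholder, not a real model ID.
CLAUDE_MODEL_ID = os.environ.get("CLAUDE_MODEL_ID", "<your-claude-model-id>")

# Illustrative task labels for "the hard parts".
HARD_TASKS = {"reasoning", "coding", "agent"}


def pick_model_id(task: str) -> str:
    """The single conditional: hard tasks go to Claude, everything else to Llama."""
    return CLAUDE_MODEL_ID if task in HARD_TASKS else LLAMA_MODEL_ID


def build_converse_request(task: str, system: str, user: str, max_tokens: int = 1024) -> dict:
    """Build keyword arguments for converse(); the shape is identical for both models."""
    return {
        "modelId": pick_model_id(task),
        "system": [{"text": system}],
        "messages": [{"role": "user", "content": [{"text": user}]}],
        "inferenceConfig": {"maxTokens": max_tokens, "temperature": 0.2},
    }
```

The result can be passed straight to the client from the earlier example, as in `_bedrock.converse(**build_converse_request("extraction", system, user))`. Keeping the routing in a pure function also makes it testable without AWS credentials.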

---

## Usage C: Self-Host / Local (Own It and Run It)

When data can't leave your environment, when you want to cut the cost per request, or when you want to fine-tune, there are three ways to **own the model and run it yourself**.

### Ollama (Local Development, PoC, No Data Egress)

Run `ollama` locally and Llama runs **without sending data over the network**. Because `ollama` also serves as an **OpenAI-compatible server** (`http://localhost:11434/v1`), you can test against it with your production code unchanged.

```bash
# Everyday text model (70B; mature and easy to work with)
ollama run llama3.3

# Llama 4 Scout: 109B MoE / 17B active (about 67GB)
ollama run llama4:16x17b

# Llama 4 Maverick: 400B MoE / 17B active (about 245GB; needs a lot of memory)
ollama run llama4:128x17b
```

```ts
// Point the unchanged production code at a local Ollama server for verification
import OpenAI from "openai";
const local = new OpenAI({ baseURL: "http://localhost:11434/v1", apiKey: "ollama" });
const r = await local.chat.completions.create({
  model: "llama3.3",
  messages: [{ role: "user", content: "Give three points on RAG chunking strategy" }],
});
console.log(r.choices[0]?.message.content);
```

### vLLM (High-Throughput Self-Serving)

For high-volume, low-latency serving that you run yourself, use **vLLM**. It exposes an OpenAI-compatible endpoint (`:8000/v1`) and distributes the model across multiple GPUs with tensor parallelism.

```bash
# Serve Scout across 8 GPUs with tensor parallelism and allow a long context
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 1000000
```

### transformers (The Foundation for Fine-Tuning / Research)

To work with the weights directly for fine-tuning, or to inspect the internals, use `transformers`. **v4.51.0 or later** is required, and you use the Llama 4-specific classes.

```python
# Maverick assumes an 8-GPU setup (torchrun --nproc-per-node=8 script.py)
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="flex_attention",  # handles long contexts efficiently
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
```

| Method | What it is | Prerequisite | Best suited for | Cost |
| --- | --- | --- | --- | --- |
| **Ollama** | Local / single machine | GPU or large memory | Development, PoC, **no data egress** | Your own hardware |
| **vLLM** | High-throughput self-serving | Multiple GPUs (tensor parallelism) | High-volume, low-latency serving you run yourself | Your own GPUs |
| **transformers** | Research, **the foundation for fine-tuning** | 8 GPUs (Maverick) | Fine-tuning, experiments | Your own GPUs |

---

## Practical Patterns: In Which Scene, How to Use It

The official docs stop at how to run the model. From here, I show **how to make it work in real projects**, using the Vercel AI SDK (the core stack for this site's AI articles). The key is to **constrain the output with types**. An LLM's raw output is outside the trust boundary: **it is safe only after it has been parsed with Zod** (type-safety discipline is covered in detail in the [TypeScript type-safety article](/blog/typescript-type-safety-discipline-zod-nevererror-no-any)).

### Pattern 1: Type-Safe Structured Extraction (Invoices, Forms, Inquiries)

Turning unstructured text into structured JSON is Llama's most common use case. Maverick is cheap enough to run at volume; **validate the boundary with a Zod schema**.

```ts
// lib/extract-invoice.ts: type-safe structured extraction with Llama 4
import { bedrock } from "@ai-sdk/amazon-bedrock";
import { generateObject } from "ai";
import { z } from "zod";

// Declare the expected shape of the output as the single source of truth.
const InvoiceSchema = z.object({
  vendor: z.string().min(1),
  total: z.number().nonnegative(),
  currency: z.enum(["JPY", "USD", "EUR"]),
  issuedAt: z.string().date(), // YYYY-MM-DD
  lineItems: z.array(z.object({ name: z.string(), amount: z.number() })).max(200),
});

export async function extractInvoice(text: string) {
  const { object, usage } = await generateObject({
    model: bedrock("us.meta.llama4-maverick-17b-instruct-v1:0"),
    schema: InvoiceSchema, // ← LLM output is validated here; a mismatched shape throws
    system: "Extract only structured data from the invoice text. Do not guess or invent values.",
    prompt: text,
  });
  // `object` is typed as, and has been validated against, InvoiceSchema.
  return { invoice: object, usage };
}
```

The point is that **whatever the model returns, only a value that passed `InvoiceSchema` flows downstream.** This eliminates the "LLM occasionally returns broken JSON" problem **structurally**: a malformed response raises an error instead of propagating. `usage` (the token count) is returned for the cost visibility described later.
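
The same boundary rule applies on the Python side of a stack. The sketch below mirrors part of `InvoiceSchema` using only the standard library, which is an assumption made to keep it dependency-free (a validation library is the more common choice). `lineItems` is left out for brevity, and the names `Invoice` and `parse_invoice` are illustrative.

```python
import json
import re
from dataclasses import dataclass
from datetime import date

ALLOWED_CURRENCIES = {"JPY", "USD", "EUR"}
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


@dataclass(frozen=True)
class Invoice:
    vendor: str
    total: float
    currency: str
    issued_at: date


def parse_invoice(raw: str) -> Invoice:
    """Parse untrusted model output; raise ValueError on any deviation from the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    vendor, total = data.get("vendor"), data.get("total")
    currency, issued_at = data.get("currency"), data.get("issuedAt")

    if not isinstance(vendor, str) or not vendor:
        raise ValueError("vendor must be a non-empty string")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(total, bool) or not isinstance(total, (int, float)) or total < 0:
        raise ValueError("total must be a non-negative number")
    if not isinstance(currency, str) or currency not in ALLOWED_CURRENCIES:
        raise ValueError("currency must be one of JPY, USD, EUR")
    if not isinstance(issued_at, str) or not ISO_DATE.fullmatch(issued_at):
        raise ValueError("issuedAt must be YYYY-MM-DD")
    try:
        issued = date.fromisoformat(issued_at)  # also rejects dates like 2025-02-30
    except ValueError as exc:
        raise ValueError("issuedAt is not a real calendar date") from exc

    return Invoice(vendor, float(total), currency, issued)
```

Downstream code receives either an `Invoice` or an exception, never a half-valid dictionary. The caller decides whether a `ValueError` means retry, fall back, or send the document to manual review.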

### Pattern 2: Ultra-Long-Context RAG (Leverage Scout's 10M)

The practical benefit of Scout's long context is that **you can chunk coarsely**. That doesn't mean you should dump everything in: the RAG design of search, compress, then feed still applies ([pgvector production RAG](/blog/pgvector-postgres-production-rag-hybrid-search) / [LangChain×Pinecone](/blog/langchain-pinecone-production-rag-system)). The right move is to treat the long context as **insurance against omissions** and to enforce **citation of evidence** with types.

```ts
// Pass search hits as evidence, and enforce answer + citations with types
import { bedrock } from "@ai-sdk/amazon-bedrock";
import { generateObject } from "ai";
import { z } from "zod";

const Answer = z.object({
  answer: z.string(),
  citations: z.array(z.string()).min(1), // disallow zero citations (discourages hallucination)
});

export async function answerWithRag(question: string, chunks: { id: string; text: string }[]) {
  const context = chunks.map((c) => `[${c.id}] ${c.text}`).join("\n\n");
  const { object } = await generateObject({
    model: bedrock("us.meta.llama4-scout-17b-instruct-v1:0"),
    schema: Answer,
    system: "Answer only from the provided evidence. If the evidence doesn't cover it, reply 'I don't know.' Always list the [id]s you used in citations.",
    prompt: `# Question\n${question}\n\n# Evidence\n${context}`,
  });
  return object;
}
```
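
A schema that requires at least one citation cannot tell whether the cited IDs are real. A follow-up check, sketched here in Python, compares the returned citations with the IDs of the chunks that were actually supplied. The function names are illustrative, and accepting both `[doc-1]` and `doc-1` is an assumption about how the model might format the IDs.

```python
from typing import Iterable, Set


def cited_ids(citations: Iterable[str]) -> Set[str]:
    """Normalize '[doc-1]' and 'doc-1' to the bare id 'doc-1'."""
    return {c.strip().strip("[]") for c in citations}


def unknown_citations(citations: Iterable[str], chunk_ids: Iterable[str]) -> Set[str]:
    """Cited ids that were never part of the supplied evidence."""
    return cited_ids(citations) - set(chunk_ids)


def verify_citations(citations: Iterable[str], chunk_ids: Iterable[str]) -> None:
    """Raise if there are no citations or if any citation points outside the evidence."""
    citations = list(citations)
    if not citations:
        raise ValueError("answer has no citations")
    unknown = unknown_citations(citations, chunk_ids)
    if unknown:
        raise ValueError(f"cited ids not in the provided chunks: {sorted(unknown)}")
```

Treating an unknown ID as a failed generation, rather than silently dropping it, keeps a fabricated source from ever reaching the user interface.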

### Pattern 3: Make It Provider-Independent with the Vercel AI Gateway

On Vercel, the most loosely coupled option is to pass a `"provider/model"` string through the **AI Gateway**. You don't depend on a provider SDK directly, and you can delegate **failover and usage measurement** to the Gateway.

```ts
import { generateText } from "ai";

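// Provider-independent. Fallback and observability can be managed centrally in the Gateway.
const { text } = await generateText({
  model: "meta/llama-4-maverick", // ← change only this string to switch to Claude or others
  prompt: "List the criteria for choosing open weights as bullet points",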
});
```

Beyond these, **agents / tool calling** ([the tool-execution design article](/blog/ai-agent-tool-use-function-calling-production-design)), **image understanding** (Scout and Maverick are natively multimodal), and **a first-pass classifier for input moderation** (a gate built with a cheap model) can all be built with the same principle: **constrain the boundary with types**.

---

## Production-Operation Design Principles (Type Safety, Idempotency, Observability, Resilience, Cost)

What takes an LLM from "it works" to "it earns its keep in production" is not the model choice but **the surrounding design**. The same principles apply whether you use Llama or a closed API.

- **Type safety (boundary validation)**: LLM output is outside the trust boundary. **Parse it with Zod** before passing it downstream. Don't accept it as `any`.
- **Idempotency**: cache expensive or long-running generations under a **job key** of `sha256(model + prompt + params)`, so that resends, rapid clicks, and retries aren't billed twice.
- **Observability**: for each call, write the **model ID, input/output tokens, latency, and temperature** to structured logs, so you can later trace why one particular answer was expensive or broken (don't log PII, which includes the prompt body; see the [OpenTelemetry article](/blog/opentelemetry-observability-production-tracing-metrics-logs)).
- **Resilience**: treat throttling, timeouts, and transient failures **as a normal case**. Keep serving with exponential backoff plus **fallback** (Maverick → Scout → 3.3 8B, or Bedrock → Llama API) (see the [retry/circuit-breaker article](/blog/retry-backoff-circuit-breaker-resilience-patterns-guide)).
- **Cost efficiency**: cut unit cost in three ways: ① **model routing** (simple classification on 8B, only the hard parts on Maverick), ② putting Llama, with its small MoE active parameter count, on the high-volume side, and ③ avoiding regeneration through the idempotency cache.
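
The idempotency bullet names `sha256(model + prompt + params)` as the job key. Below is a minimal Python sketch of that idea. Hashing a canonical JSON encoding instead of a plain string concatenation is a choice made here so that field boundaries and parameter order cannot change the key. `GenerationCache` is a hypothetical in-memory stand-in for whatever cache a real system uses.

```python
import hashlib
import json
from typing import Callable, Dict


def job_key(model: str, prompt: str, params: dict) -> str:
    """Deterministic key: the same model, prompt, and params always hash the same."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params},
        sort_keys=True,          # parameter order must not matter
        separators=(",", ":"),   # no incidental whitespace
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class GenerationCache:
    """Hypothetical in-memory cache; production would use a shared store."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get_or_generate(
        self, model: str, prompt: str, params: dict, generate: Callable[[], str]
    ) -> str:
        key = job_key(model, prompt, params)
        if key not in self._store:
            self._store[key] = generate()  # billed at most once per distinct input
        return self._store[key]
```

Because the key covers the parameters as well, changing `temperature` or `max_tokens` produces a new key, so a cached answer is never returned for a request that asked for something different.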

### A Production Route Handler with Idempotency (Next.js)

A minimal form that combines external-input validation, double-billing prevention, and fallback.

```ts
// app/api/generate/route.ts
import { NextResponse } from "next/server";
import { createHash } from "node:crypto";
import { generateText } from "ai";
import { z } from "zod";
// Assumption: `cache` and `metrics` are app-level helpers (for example a Redis
// wrapper and a metrics client). These import paths are hypothetical.
import { cache } from "@/lib/cache";
import { metrics } from "@/lib/metrics";

// ① Always validate external input at the boundary (the trust boundary is the server).
const Body = z.object({
  prompt: z.string().min(1).max(20_000),
  quality: z.enum(["fast", "best"]).default("fast"),
});

// ② Derive a deterministic key from the input: same input, same result, no double billing.
const keyOf = (s: string) => createHash("sha256").update(s).digest("hex");

// ③ Route to a model by quality requirement (cost optimization).
const MODEL = {
  fast: "meta/llama-4-scout",
  best: "meta/llama-4-maverick",
} as const;

export async function POST(req: Request) {
  const parsed = Body.safeParse(await req.json());
  if (!parsed.success) {
    return NextResponse.json({ error: parsed.error.flatten() }, { status: 422 });
  }
  const { prompt, quality } = parsed.data;
  const key = keyOf(`${quality}:${prompt}`);

  const cached = await cache.get(key);
  if (cached) return NextResponse.json({ text: cached, cached: true });

  try {
    const { text, usage } = await generateText({ model: MODEL[quality], prompt });
    await cache.set(key, text);
    // Send usage to metrics (token-billing observability). No PII.
    metrics.record({ model: MODEL[quality], tokens: usage.totalTokens });
    return NextResponse.json({ text, cached: false });
  } catch (err) {
    // ④ Resilience: on failure, fall back to the cheaper model and keep serving.
    //    When quality is already "fast", this retries the same model once.
    console.error("generate failed; falling back", err);
    const { text } = await generateText({ model: MODEL.fast, prompt });
    return NextResponse.json({ text, degraded: true });
  }
}
```

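Using `sha256(input)` as the key gives you **idempotency and cost efficiency** together: a resubmitted or retried request returns the cached result instead of regenerating it. Note that the check-then-set above is not atomic, so requests that arrive while the first one is still generating can each trigger a generation; closing that gap takes an in-flight lock or an atomic set-if-absent in the cache. The `degraded` path that drops to a cheaper model on failure is the last line of defense for the SLO (once the workload becomes asynchronous and high-volume, move toward [SQS/Lambda idempotent processing](/blog/aws-sqs-lambda-eventbridge-idempotent-async-processing-guide)).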
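
The fallback idea generalizes to a chain such as Maverick → Scout → 3.3 8B with backoff between attempts. The Python sketch below shows one way to structure it. `TransientError`, `call_with_fallback`, and the model labels are hypothetical; a real system would map its provider's throttling and timeout exceptions onto `TransientError`.

```python
import random
import time
from typing import Callable, Optional, Sequence, Tuple


class TransientError(Exception):
    """Stand-in for throttling, timeouts, and other retryable failures."""


def call_with_fallback(
    models: Sequence[str],
    call: Callable[[str], str],
    attempts_per_model: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, str]:
    """Try each model in order, retrying transient failures with backoff.

    Returns (model_used, text) so the caller can flag a degraded response.
    """
    last_error: Optional[Exception] = None
    for model in models:
        for attempt in range(attempts_per_model):
            try:
                return model, call(model)
            except TransientError as exc:
                last_error = exc
                if attempt < attempts_per_model - 1:
                    delay = base_delay * (2 ** attempt)        # exponential backoff
                    sleep(delay + random.uniform(0, delay))    # jitter spreads out retries
    raise RuntimeError("all models failed") from last_error
```

Only `TransientError` triggers a retry. A validation error or an authentication failure propagates immediately, since retrying it would only add latency and cost.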

---

## The License Pitfalls (Must-Read Before Commercial Use)

Commercial use of Llama is **allowed**, but the license is **not OSI-compliant "open source."** It is a custom license, the **Llama 4 Community License**, with several pitfalls that are easy to trigger. Before starting a project, make sure you understand that **free does not mean free-for-all.**

| Item | Content | Practical meaning |
| --- | --- | --- |
| **The 700M MAU threshold** | A licensee whose products or services had more than **700 million** monthly active users in the calendar month preceding the Llama 4 release date must request a separate license from Meta | Irrelevant to most companies. Applies **only to the very largest services** (in effect a clause aimed at big-tech rivals) |
| **Built with Llama** | **Display prominently** on a related website, user interface, blog post, about page, or product documentation | **Mandatory** if you distribute or offer a product or service that contains Llama |
| **Naming of derived models** | The name of a distributed derived model (fine-tuned, etc.) must **start with "Llama"** | E.g. `Llama-MyCompany-7B`. Don't distribute under an arbitrary name of your own |
| **License copy + Notice** | Include a copy of the license text and the Notice statement | Attach LICENSE / NOTICE when redistributing |
| **AUP (Acceptable Use Policy)** | Prohibits illegal and harmful uses (incorporated by reference) | A violation is **a breach of the license**. How you use generated output is also covered |

The Notice statement has an official fixed wording.

```text
Llama 4 is licensed under the Llama 4 Community License,
Copyright © Meta Platforms, Inc. All Rights Reserved.
```
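
Two of these items are mechanical enough to check before a release. The Python sketch below is one way to do that, under the assumption that a release directory holds files named `LICENSE` and `NOTICE` as in the table above. `release_problems` is a hypothetical helper. It covers only naming and file presence, not the attribution display, the MAU threshold, or the AUP.

```python
from pathlib import Path
from typing import List

# The Notice wording quoted above, on one line.
NOTICE_TEXT = (
    "Llama 4 is licensed under the Llama 4 Community License, "
    "Copyright © Meta Platforms, Inc. All Rights Reserved."
)


def release_problems(model_name: str, release_dir: Path) -> List[str]:
    """Return a list of mechanical license problems; an empty list means none found."""
    problems: List[str] = []
    if not model_name.startswith("Llama"):
        problems.append(f"derived model name {model_name!r} does not start with 'Llama'")
    if not (release_dir / "LICENSE").is_file():
        problems.append("LICENSE file is missing")
    notice = release_dir / "NOTICE"
    if not notice.is_file():
        problems.append("NOTICE file is missing")
    else:
        # Collapse whitespace so line wrapping inside the file does not matter.
        normalized = " ".join(notice.read_text(encoding="utf-8").split())
        if NOTICE_TEXT not in normalized:
            problems.append("NOTICE file does not contain the required statement")
    return problems
```

Running a check like this in CI turns the naming rule and the bundled files into a failing build instead of a post-release discovery.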

> ⚖️ **This is where trust is won or lost**: delivering a contract on the pitch "I made it cheap with Llama" while **not knowing about the `Built with Llama` attribution or the derived-model naming rule** is an accident waiting to happen. When you take open weights into commercial use, **license compliance is part of the implementation.** In my projects, I build the attribution requirement, the naming rule, the AUP, and (where needed) data handling into the design. The license gets updated, so **always check the primary sources** ([Llama 4 License](https://www.llama.com/llama4/license/) / [Acceptable Use Policy](https://www.llama.com/llama4/use-policy/)) **before shipping to production**.

---

## Comparison with Other Models (Honest Version)

To the question "so which one should I use?", my answer is **by use**. Choose on raw performance alone and you will choose wrong.

| Model | Type | Strengths | Weaknesses | Best suited for |
| --- | --- | --- | --- | --- |
| **Llama 4 (Scout/Maverick)** | Open weight | **Ownership, fine-tuning, ultra-long context, cost control, data sovereignty** | The burden of self-hosting. A step behind proprietary models on frontier reasoning | Data sovereignty / fine-tuning / on-prem / bulk volume |
| **Claude (Opus/Sonnet)** | Closed API | **Frontier reasoning, coding, agents, safety** | Weights not public, lock-in | High-difficulty reasoning, implementation support, the hard parts |
| **GPT-5.x** | Closed API | Overall capability, ecosystem | Same as above | General-purpose |
| **Gemini** | Closed API | Long context, Google integration | Same as above | GCP, long context |
| **Mistral / Qwen, etc.** | Open weight | Lightweight, multilingual, permissive license | Smaller ecosystem | Lightweight self-hosting, specific languages |
| **Muse Spark (reference)** | **Closed** (Meta's new line) | Compute-efficient reasoning | **Weights not public**, so it doesn't meet the ownership requirement | **Not** a substitute for Llama (a different evaluation axis) |

In practice, the decision goes roughly like this.

- **You want to own the weights, fine-tune, or can't send data outside** → **Llama 4**.
- **You want to hand off the hardest reasoning and coding** → **Claude** (I use Claude Code as my main implementation tool).
- **Bulk preprocessing, extraction, and classification on open weights, only the hard parts on closed** → **using both** is the sweet spot for cost and quality.

> Benchmark numbers (Artificial Analysis and others) and each vendor's model lineup **change every month**. Don't decide on fixed figures; **run a small evaluation on your own task** before adopting.

---

## Frequently Asked Questions (FAQ)

**Q. Can I use Llama commercially?**
A. You can. But it's not OSI "open source"; it's the **Llama Community License**. You need to follow the **700M MAU threshold, the `Built with Llama` display, the "Llama" prefix on derived model names, and the AUP**. 700M MAU is irrelevant to most companies, but **the display and naming normally apply**. Always confirm the [primary source](https://www.llama.com/llama4/license/).

**Q. Can I use it in Japanese?**
A. You can. Multilingual performance improved from Llama 3.x onward, and Llama 4 is natively multilingual too. That said, there are cases where output is less stable than in English, so **verify important Japanese tasks against your own evaluation set** before rolling them out.

**Q. How much GPU do I need?**
A. The official guidance is **a single H100 for Scout (Int4 quantized)** and **a single H100 DGX host for Maverick**. If you have no GPU on hand, the standard route is to start with the **Llama API or Bedrock** (no GPU needed) and move to self-hosting once volume grows.
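
A rough, weights-only estimate shows why those hardware figures are plausible. The sketch below uses this article's total parameter counts (Scout 109B, Maverick 400B). The 80 GB per H100, the 8 GPUs per DGX host, and the choice of 8-bit weights for Maverick are assumptions made here. KV cache, activations, and runtime overhead are ignored, so real requirements are higher.

```python
H100_MEMORY_GB = 80  # assumption: 80 GB H100 variant
DGX_H100_GPUS = 8    # assumption: 8 GPUs per DGX H100 host


def weight_memory_gb(total_params_billion: float, bits_per_param: int) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return total_params_billion * bits_per_param / 8


def fits(weights_gb: float, gpu_count: int) -> bool:
    """True if the weights alone fit in the combined GPU memory."""
    return weights_gb < gpu_count * H100_MEMORY_GB


scout_int4 = weight_memory_gb(109, 4)      # 54.5 GB
maverick_8bit = weight_memory_gb(400, 8)   # 400.0 GB
maverick_16bit = weight_memory_gb(400, 16) # 800.0 GB

print(scout_int4, fits(scout_int4, 1))                    # fits on one GPU
print(maverick_8bit, fits(maverick_8bit, DGX_H100_GPUS))  # fits on one host
print(maverick_16bit, fits(maverick_16bit, DGX_H100_GPUS))  # does not fit
```

Note that all experts must be resident in memory even though only 17B parameters are active per token, which is why the total parameter count, not the active count, drives the sizing.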

**Q. Can I fine-tune?**
A. You can. Having the weights in hand is the core of open weight. It's realistic to start with lightweight methods like LoRA. **If you distribute the result, follow the naming convention**: the derived model's name must start with "Llama."

**Q. Which is cheaper, an API or self-operation?**
A. **For small volume, the API** (zero environment setup). **For steady large volume**, self-hosting, where you keep a GPU saturated, tends to win on per-request cost. **First verify quality and demand with the API, then move to dedicated Bedrock capacity or self-hosting once volume grows**. That is the usual break-even rule of thumb.
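
The break-even point in this answer is simple arithmetic: the fixed monthly cost of the GPU divided by what each request saves compared with the API. The sketch below is a simplified model with illustrative numbers only. The prices are made up, and it ignores engineering time, idle capacity, and whether a single GPU can actually serve the break-even volume.

```python
def break_even_requests_per_month(
    gpu_cost_per_month: float,
    api_cost_per_request: float,
    self_host_marginal_cost_per_request: float = 0.0,
) -> float:
    """Monthly request count above which self-hosting is cheaper than the API."""
    saving_per_request = api_cost_per_request - self_host_marginal_cost_per_request
    if saving_per_request <= 0:
        raise ValueError("self-hosting never breaks even at these prices")
    return gpu_cost_per_month / saving_per_request


# Illustrative figures, not real pricing.
requests = break_even_requests_per_month(
    gpu_cost_per_month=2000.0,
    api_cost_per_request=0.002,
)
print(f"break-even: about {requests:,.0f} requests/month")
```

With these made-up figures the crossover sits at about one million requests per month. Below that, the API's zero fixed cost wins; above it, the saturated GPU does, provided the hardware can sustain that throughput.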

**Q. Muse Spark came out, so should I quit Llama?**
A. It depends on the requirement. If you **want to own / fine-tune / self-host the weights**, Muse Spark **cannot meet that requirement because its weights are not public**, so the mainstay is still Llama 4. Evaluate Muse Spark as "another closed API," on the same footing as Claude / GPT.

**Q. Should I choose Claude or Llama?**
A. **Frontier reasoning, coding, and agents to Claude; ownership, fine-tuning, data sovereignty, and cost optimization to Llama.** I run both behind the same Bedrock API, routing **the hard parts to Claude and the volume to Llama**. The right answer is not either-or but **using each where it fits**.

---

## Summary: Moving Llama from "Ownership" to "Earning in Production"

The essence of Llama lies not in the performance numbers but in "**you can own the weights, modify them, and run them in your own environment**." That's exactly why, in projects where data sovereignty, fine-tuning, cost, or lock-in avoidance is a requirement, it is an option that a closed API cannot offer.

The path to implementation is simple.

1. First gauge model quality with the **Llama API (OpenAI-compatible)** (just swap `baseURL`, no GPU needed).
2. For production, go to **AWS Bedrock's Converse API** (swappable with Claude in one line). Once volume grows, move to **vLLM / dedicated capacity**.
3. Weave **type safety, idempotency, observability, resilience, and cost** into the design, and **carry the implementation through to license compliance (`Built with Llama`, naming, AUP)**.
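
Part of step 3 can be turned into a mechanical pre-release check. The sketch below is a hypothetical helper, not a legal tool: `license_preflight` and its parameters are names introduced here, the exact-string matching is a simplifying assumption, and the AUP in particular still needs human review against the license text.

```python
from typing import List, Optional

REQUIRED_ATTRIBUTION = "Built with Llama"
REQUIRED_NAME_PREFIX = "Llama"


def license_preflight(
    attribution_text: str,
    derived_model_name: Optional[str] = None,
) -> List[str]:
    """Return a list of problems; an empty list means the checks passed.

    attribution_text: the text shown in your product or docs (assumed input).
    derived_model_name: set only if you distribute a fine-tuned model.
    """
    problems = []
    if REQUIRED_ATTRIBUTION not in attribution_text:
        problems.append(f'missing "{REQUIRED_ATTRIBUTION}" attribution')
    if derived_model_name is not None and not derived_model_name.startswith(
        REQUIRED_NAME_PREFIX
    ):
        problems.append(
            f'derived model name must start with "{REQUIRED_NAME_PREFIX}"'
        )
    return problems


print(license_preflight("Built with Llama", "Llama-Support-JA"))
print(license_preflight("Powered by AI", "SupportBot-v1"))
```

Wiring a check like this into CI keeps the attribution and naming items from being forgotten at release time, while the 700M MAU threshold and the AUP remain decisions for a person reading the primary source.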

Only at that point does it stop being a demo and become a product that "**doesn't break on the customer's constraints (can't send data out, want to cut cost, want to specialize)**." The point I most want to convey: anyone can build a **"just wire up the model"** demo, but **choosing between open weight and a closed API case by case, and designing all the way through license, cost, and data sovereignty**, is where the judgment accumulated in real operation translates directly into quality.

> I **run generative-AI systems, Llama included, in production** on a foundation of **AWS Bedrock × Vercel AI SDK**. If you want to work through open / closed model selection, RAG, agents, cost optimization, and license handling end to end, see my [portfolio](/case-studies/ai-voice-chatbot) and feel free to get in touch. Working as **one person × generative AI**, I build from PoC through production operation: fast, cheap, and safe.

---

## Sources and Official Resources

- **Official site**: [llama.com](https://www.llama.com/) — model list, downloads, documentation
- **Llama 4 announcement (Meta AI Blog)**: [The Llama 4 herd](https://ai.meta.com/blog/llama-4-multimodal-intelligence/) — the specs of Scout/Maverick/Behemoth, MoE, multimodal
- **Llama Developer Docs**: [overview](https://llama.developer.meta.com/docs/overview/) / [models](https://llama.developer.meta.com/docs/models) — the Llama API's model IDs, endpoints, OpenAI compatibility
- **Hugging Face (Meta official)**: [meta-llama](https://huggingface.co/meta-llama) — weight distribution, `transformers` use
- **AWS Bedrock (Meta)**: [Meta's Llama in Amazon Bedrock](https://aws.amazon.com/bedrock/meta/) — Bedrock's model IDs, the Converse API
- **License**: [Llama 4 Community License](https://www.llama.com/llama4/license/) / [Acceptable Use Policy](https://www.llama.com/llama4/use-policy/) — 700M MAU, Built with Llama, the naming convention, the AUP
- **LlamaCon 2025 wrap-up**: [Everything we announced at LlamaCon](https://ai.meta.com/blog/llamacon-llama-news/) — the Llama API announcement
- **Muse Spark (reference)**: [Introducing Muse Spark](https://ai.meta.com/blog/introducing-muse-spark-msl/) — Meta's pivot (proprietary)

* Versions, model IDs, context lengths, pricing, and licenses get updated. **Always confirm the primary source before implementing.** This article's numbers (Scout 17B active/16 experts/total 109B/up to 10M, Maverick 17B/128 experts/total 400B, Bedrock effective Scout about 3.5M / Maverick 1M, Llama API 128k, etc.) are based on the official information at the time of writing (June 2026). **This site is "Built with Llama"–aware: when you ship a product containing Llama, display the required attribution.**