The goal of this article
As the overview article on putting Llama into production notes, one of the biggest reasons to choose open weights is that "you can fine-tune (FT) the weights on your own data." This article carries FT all the way to production rather than stopping at a demo, covering reproducibility, evaluation, cost, and licensing, with real code.
By the end, you should be able to do the following three things.
- Judge whether FT is even necessary (including "in many cases RAG suffices").
- Actually run it with LoRA/QLoRA (both the torchtune and Hugging Face TRL routes).
- Deploy safely only after passing an evaluation gate, and observe the license naming convention.
Disclosure: I run generative-AI systems in production on AWS Bedrock and the Vercel AI SDK, and domain specialization (pinning part numbers, domain vocabulary, and output formats) was a central challenge in the voice customer-service case study. This article follows the practical ordering in which FT is the last resort, and it does not present inflated accuracy-improvement numbers.
Doubt first: does that problem really need FT?
FT is powerful, but it's not the first tool you should reach for. Its cost, operational burden, and risk of going stale are all high, so consider the options in the following order.
| Method | Effective for | Not effective for | Cost/operation |
|---|---|---|---|
| Prompt/Few-shot | format instructions, simple behavior | large knowledge, strong style fixing | minimal |
| RAG (retrieval-augmented) | referencing latest knowledge / internal docs, citing sources | fundamental change of tone or output style | medium (pgvector RAG) |
| Fine-tuning | fixing behavior, output format, tone, domain vocabulary, reducing latency | freshness of knowledge (frozen at training time) | large (data, training, evaluation, re-training) |
Iron rules:
- Want to add knowledge → start with RAG. Baking facts into the model breeds stale answers and hallucination.
- Need consistent format, tone, and judgment; the prompt has grown too long; or domain vocabulary keeps getting missed → time for FT.
- In practice, combining RAG with light FT is the most cost-effective setup (knowledge in RAG, style in FT).
💡 "FT makes it smarter" is a misconception. What FT changes is mainly the distribution of behavior, not reliable memory of new facts. Holding facts in RAG is the production standard.
What to fine-tune and how (realistic targets and LoRA/QLoRA)
Target model: start from dense models
Llama 4 is a mixture-of-experts (MoE) family with huge total parameter counts (Scout 109B / Maverick 400B). Full FT of these requires substantial distributed-training infrastructure and isn't a good first step. The realistic practical target is a dense model, such as Llama 3.1 8B, Llama 3.3 70B, or a small task-specific variant, and applying LoRA/QLoRA to it is the standard path.
LoRA and QLoRA in a sentence
- LoRA (Low-Rank Adaptation): freeze the huge base weights and train only small low-rank matrices. Because only a tiny fraction of parameters is updated, it is light on memory, fast, and the artifact (the adapter) is small: tens of MB for an 8B model, up to a few hundred MB for a 70B model depending on rank and target modules.
- QLoRA: load the base model quantized to 4 bits and train LoRA adapters on top of it. The advantage is that a single high-memory GPU can reach the 70B class. The accuracy loss relative to LoRA is often small in practice.
| Method | Memory | Speed | Artifact | Best for |
|---|---|---|---|---|
| Full FT | largest | slowest | whole model | large-scale work, rebuilding the foundation |
| LoRA | small | fast | adapter (small) | the default; fixing style/vocabulary |
| QLoRA | smallest | fast | adapter (small) | a larger model on a single GPU |
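To make "small adapter" concrete, here is a rough parameter count for LoRA applied to the four attention projections. It is a sketch under stated assumptions: the layer shapes are the published Llama 3.1 8B and Llama 3.3 70B attention dimensions, the adapter is stored in bf16 (2 bytes per weight), and `AttnShape`, `lora_params`, and `adapter_megabytes` are illustrative names, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnShape:
    """Attention dimensions of a decoder-only model (assumed values, see below)."""
    hidden: int   # model width
    kv_dim: int   # width of the key/value projections (smaller with grouped-query attention)
    layers: int   # number of transformer blocks


def lora_params(shape: AttnShape, rank: int) -> int:
    """Trainable weights when LoRA wraps q_proj, k_proj, v_proj and o_proj.

    A LoRA pair on a (d_in -> d_out) linear layer adds rank * (d_in + d_out) weights.
    """
    q = rank * (shape.hidden + shape.hidden)
    k = rank * (shape.hidden + shape.kv_dim)
    v = rank * (shape.hidden + shape.kv_dim)
    o = rank * (shape.hidden + shape.hidden)
    return (q + k + v + o) * shape.layers


def adapter_megabytes(params: int, bytes_per_param: int = 2) -> float:
    """Approximate adapter file size in MB (2 bytes per weight assumes bf16)."""
    return params * bytes_per_param / 1e6


# Assumed shapes: 8B-class and 70B-class Llama 3 models.
LLAMA_8B = AttnShape(hidden=4096, kv_dim=1024, layers=32)
LLAMA_70B = AttnShape(hidden=8192, kv_dim=1024, layers=80)

for name, shape in [("8B", LLAMA_8B), ("70B", LLAMA_70B)]:
    n = lora_params(shape, rank=16)
    print(f"{name}: {n:,} trainable params, about {adapter_megabytes(n):.0f} MB in bf16")
```

Adapter size grows with rank, with the number of wrapped modules, and with the width of the base model, so treat any single figure as an order of magnitude.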
Data preparation: 90% of success is here
FT quality is determined by data quality. By comparison, differences in model or hyperparameters are noise.
- Format: instruction/chat-format JSONL. 1 line = 1 sample.
- Quality > quantity: 1,000 hand-polished samples beat 100,000 noisy ones. The model faithfully learns the errors too.
- Dedup: remove near-duplicates. Repeated copies of the same example invite overfitting.
- train/eval split: always carve out an eval set. Mixing it into training means "grading your own homework" (leakage), and you will overestimate accuracy.
- Diversity: cover the input distribution that comes in production. Include edge cases and ways of declining ("I don't know") in the training target too.
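{"messages":[
  {"role":"system","content":"You are our company's quoting assistant. Never guess numbers; if information is missing, answer 'unknown'."},
  {"role":"user","content":"What is the standard lead time for part number TX-200?"},
  {"role":"assistant","content":"The standard lead time for TX-200 is 5 business days. It may vary with stock levels."}
]}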
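A format check before training catches malformed samples early. The following is a minimal sketch, assuming each sample has been serialized onto a single line in the `messages` layout shown above; `validate_sample` and `ALLOWED_ROLES` are illustrative names, and the rules it enforces are one reasonable choice rather than a requirement of any trainer.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_sample(line: str) -> list[str]:
    """Return the problems found in one JSONL line (an empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]

    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unknown role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty content")

    # The final turn is what the model is trained to produce.
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message should be the assistant turn")
    return problems
```

Running it over every line of `train.jsonl` and `eval.jsonl` and refusing to train on any non-empty result keeps format errors out of the run.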
⚠️ Leakage is the worst kind of bug because it "quietly" inflates accuracy. If part of the eval set leaks into training, the benchmark rises but performance falls short in production. Do the split first, and do it deterministically (e.g., assign each sample by hash).
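A hash-based split can be sketched as follows. The assumptions: samples are keyed by their user input, normalization is just whitespace and case folding, and 10% goes to eval. The function names are illustrative. Note that this only removes duplicates that are identical after normalization; true near-duplicate detection (for example MinHash) is a separate step.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies collide."""
    return re.sub(r"\s+", " ", text).strip().lower()


def sample_key(user_input: str) -> str:
    """Stable identifier for a sample, independent of file order or run."""
    return hashlib.sha256(normalize(user_input).encode("utf-8")).hexdigest()


def assign_split(user_input: str, eval_percent: int = 10) -> str:
    """Map a sample to 'train' or 'eval' from its hash alone."""
    bucket = int(sample_key(user_input)[:8], 16) % 100
    return "eval" if bucket < eval_percent else "train"


def dedup_and_split(inputs: list[str], eval_percent: int = 10) -> dict[str, list[str]]:
    """Drop normalized duplicates, then route each remaining sample by hash."""
    seen: set[str] = set()
    splits: dict[str, list[str]] = {"train": [], "eval": []}
    for text in inputs:
        key = sample_key(text)
        if key in seen:
            continue
        seen.add(key)
        splits[assign_split(text, eval_percent)].append(text)
    return splits
```

Because the assignment depends only on the content, a sample that lands in eval today still lands in eval after the dataset grows or is reshuffled, which is what prevents it from drifting into training later.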
Implementation A: torchtune (PyTorch-native, official)
torchtune, the official PyTorch-native library from Meta/PyTorch, follows a philosophy of "bare PyTorch with no trainer or abstraction in between." You configure a recipe in YAML and run it via the CLI.
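pip install torchtune
# Fetch the weights (requires a Hugging Face account that has accepted the license)
tune download meta-llama/Llama-3.3-70B-Instruct \
  --output-dir ./Llama-3.3-70B-Instruct
# LoRA training: you only name a recipe and a config.
# Recipe and config names differ between torchtune versions; list yours with `tune ls`.
# The 70B LoRA config ships for the distributed recipe; single-device configs exist for smaller models.
tune run --nproc_per_node 8 lora_finetune_distributed \
  --config llama3_3/70B_lora \
  dataset._component_=torchtune.datasets.chat_dataset \
  dataset.source=json \
  dataset.data_files=./data/train.jsonl \
  dataset.conversation_column=messages \
  dataset.conversation_style=openai
# The dataset overrides assume the "messages" format shown earlier; verify them against the torchtune docs for your version.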
torchtune keeps the learning rate, LoRA rank, epochs, quantization, and other settings declaratively in the config, so you can diff runs and see who trained with which settings. This reproducibility is the lifeline of production operation.
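One way to make "who ran with which settings" auditable is to store a small manifest next to each adapter. This is a hypothetical sketch, not a torchtune feature: `file_sha256` and `build_manifest` are illustrative helpers, and the manifest fields are an assumption about what is worth recording.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: str) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(config_path: str, data_paths: list[str], base_model: str) -> dict:
    """Fingerprint the inputs of a training run: base model, config, and data files."""
    return {
        "base_model": base_model,
        "config_sha256": file_sha256(config_path),
        "data_sha256": {p: file_sha256(p) for p in sorted(data_paths)},
    }


def write_manifest(manifest: dict, out_path: str) -> None:
    """Persist the manifest as stable, diff-friendly JSON."""
    Path(out_path).write_text(json.dumps(manifest, indent=2, sort_keys=True) + "\n")
```

Calling `build_manifest("config.yaml", ["data/train.jsonl"], "meta-llama/Llama-3.3-70B-Instruct")` before training and committing the result gives a record that changes whenever the config or the data does.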
Implementation B: Hugging Face TRL + PEFT (QLoRA)
The most widespread route. Passing PEFT's LoraConfig to TRL's SFTTrainer gives you LoRA; loading the base model in 4-bit with BitsAndBytesConfig on top of that makes it QLoRA.
# pip install trl peft transformers bitsandbytes datasets
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

MODEL_ID = "meta-llama/Llama-3.3-70B-Instruct"  # a dense model is the realistic target

# QLoRA: load the base in 4-bit (NF4) so a single GPU can reach larger models
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Only the low-rank adapter is trained. Targeting the attention projections is the usual choice.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

dataset = load_dataset("json", data_files={"train": "data/train.jsonl"})["train"]

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    processing_class=tokenizer,  # recent TRL releases; older ones take tokenizer=tokenizer
    peft_config=lora,
    args=SFTConfig(
        output_dir="out/llama-domain-lora",
        num_train_epochs=2,              # too many epochs overfit; try 1 to 3 and watch eval
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,   # raises the effective batch size
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
        save_strategy="epoch",
    ),
)
trainer.train()
trainer.save_model("out/llama-domain-lora")  # the artifact is the adapter (small)
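The point is that only the adapter is trained. The 70B base stays frozen and in 4-bit, so its weights take roughly 35 GB (70B parameters at 0.5 bytes each) instead of about 140 GB in bf16. Activations, adapter gradients, and optimizer state come on top of that, so plan for a single GPU with well over 40 GB of memory and keep sequences short.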
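The memory claim and the batch settings in the config above can be checked with back-of-the-envelope arithmetic. This sketch counts weights only; it deliberately ignores activations, quantization constants, adapter gradients, and optimizer state, and the helper names are illustrative.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def effective_batch_size(per_device: int, grad_accum: int, n_devices: int = 1) -> int:
    """Samples contributing to each optimizer step."""
    return per_device * grad_accum * n_devices


PARAMS_70B = 70e9

print(f"70B in bf16 : {weight_memory_gb(PARAMS_70B, 16):.0f} GB")
print(f"70B in 4-bit: {weight_memory_gb(PARAMS_70B, 4):.0f} GB")
print(f"effective batch: {effective_batch_size(per_device=1, grad_accum=8)}")
```

A per-device batch of 1 with 8 accumulation steps behaves like a batch of 8 for the optimizer while only ever holding one sample's activations in memory.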
Implementation C: managed (offloading operations to AWS)
If you don't want to own training infrastructure, use AWS's managed options.
- Bedrock managed FT: as of writing, targets include Llama 2 / Llama 3.2 (including Vision). Put the training data in S3, run a job from the console/API, and you get a custom model, with no GPUs to operate.
- SageMaker + Bedrock Custom Model Import: import the weights you made with torchtune/TRL (Llama 2/3/3.1/3.2-family architecture) into Bedrock and call them as-is via the existing Converse API.
📌 A note on accuracy: the models and regions supported by managed FT change over time. Always check the current Bedrock model-customization support table. Confirming up front whether the model you want is a managed-FT target avoids surprises.
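Calling a customized or imported model through the Converse API looks roughly like this with boto3. It is a sketch under assumptions: the model identifier (typically the ARN of your custom or imported model), the region, and the inference settings are placeholders, and `build_converse_request` and `ask` are illustrative helpers. Check that your model and region support Converse before relying on it.

```python
def build_converse_request(model_id: str, system_prompt: str, user_text: str) -> dict:
    """Assemble keyword arguments for the bedrock-runtime `converse` call."""
    return {
        "modelId": model_id,  # placeholder: ARN or ID of your model
        "system": [{"text": system_prompt}],
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
        "inferenceConfig": {"maxTokens": 256, "temperature": 0.0},
    }


def ask(model_id: str, system_prompt: str, user_text: str, region: str = "us-east-1") -> str:
    """Send one user turn and return the text of the first content block."""
    import boto3  # imported lazily so the request builder works without AWS installed

    client = boto3.client("bedrock-runtime", region_name=region)
    response = client.converse(**build_converse_request(model_id, system_prompt, user_text))
    return response["output"]["message"]["content"][0]["text"]
```

Keeping the request builder separate from the network call makes it easy to reuse the same `ask` function as the `generate` callable of an evaluation script.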
Evaluation gate: don't ship to production without this
The most dangerous thing in FT is shipping on a vague feeling that it got better. Compare the pre-FT and post-FT models automatically on a holdout set, detect regressions, and only then deploy.
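# evaluate.py: compare base vs fine-tuned on the same eval set and decide pass/fail
import json
import statistics
from typing import Callable


def score(output: str, expected: str) -> float:
    """Placeholder scorer (exact match). Swap in numeric match, schema checks, or an LLM judge."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_eval(generate: Callable[[str], str], eval_path: str) -> float:
    """Score every sample and return the mean score (each score is 0.0 to 1.0)."""
    scores = []
    with open(eval_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines in the JSONL file
            ex = json.loads(line)
            scores.append(score(generate(ex["input"]), ex["expected"]))
    return statistics.mean(scores)


# generate_base / generate_tuned are callables (prompt -> completion) that you supply per model.
base = run_eval(generate_base, "data/eval.jsonl")
tuned = run_eval(generate_tuned, "data/eval.jsonl")

# Fail and block the deploy when the improvement is below the threshold.
# Run the same comparison on a general-task eval set to catch regressions in existing ability.
if tuned < base + 0.03:
    raise SystemExit(f"FAIL insufficient improvement: base={base:.3f} tuned={tuned:.3f}")
print(f"PASS base={base:.3f} -> tuned={tuned:.3f}")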
Evaluation hinges on a scoring function matched to the use. Exact match / schema-conformance rate for structured output, LLM-judge for summarization, F1 for classification. Only once you can quantify "what is correct in production" does FT become engineering.
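The scoring styles named above can be sketched in a few lines each. These are minimal illustrations with hypothetical names: `schema_conformance` only checks that the output is a JSON object containing the required keys, and `f1_score` is the binary F1 for one positive label. An LLM judge is omitted because it depends on the judging model you choose.

```python
import json


def exact_match(output: str, expected: str) -> float:
    """1.0 when the strings agree after trimming whitespace, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def schema_conformance(output: str, required_keys: set[str]) -> float:
    """1.0 when the output parses as a JSON object that has every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0
    return 1.0 if required_keys.issubset(data) else 0.0


def f1_score(predictions: list[str], labels: list[str], positive: str) -> float:
    """Binary F1 for one positive class over paired predictions and gold labels."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == positive and y == positive)
    fp = sum(1 for p, y in pairs if p == positive and y != positive)
    fn = sum(1 for p, y in pairs if p != positive and y == positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Whichever scorer you pick, apply the same one to the base and the tuned model so the comparison stays meaningful.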
Deploy and license (mind the naming convention)
- Merge the adapter: fold the LoRA adapter into the base model to produce a single set of weights, which removes the adapter overhead at inference time.
- Serve: either self-host with vLLM or call the model through the Converse API via Bedrock Custom Model Import.
- Estimate cost: for self-hosting, work out the break-even point as part of your inference-cost design before committing.
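The break-even estimate is simple arithmetic: a fixed monthly cost for self-hosted GPUs against a per-token price for a hosted API. The sketch below uses made-up prices purely for illustration; substitute your own quotes. It also ignores engineering time, idle capacity, and input/output price differences.

```python
def self_host_monthly_usd(gpu_hourly_usd: float, gpus: int, hours_per_month: float = 730) -> float:
    """Fixed monthly cost of keeping the GPUs running around the clock."""
    return gpu_hourly_usd * gpus * hours_per_month


def break_even_tokens(self_host_usd: float, api_usd_per_million_tokens: float) -> float:
    """Monthly token volume above which self-hosting is cheaper than the API."""
    return self_host_usd / api_usd_per_million_tokens * 1_000_000


# Hypothetical prices, not quotes from any provider.
monthly = self_host_monthly_usd(gpu_hourly_usd=2.0, gpus=1)
tokens = break_even_tokens(monthly, api_usd_per_million_tokens=0.70)
print(f"self-hosting: ${monthly:,.0f}/month, break-even near {tokens / 1e9:.1f}B tokens/month")
```

If your expected traffic sits well below the break-even volume, the managed route is usually the cheaper starting point.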
⚖️ Read the license: the Llama Community License requires that when you distribute or provide a derivative model built using Llama, you prepend "Llama" to the model name (e.g., Llama-YourCompany-Support-8B). It also requires a "Built with Llama" notice. Internal-only use isn't distribution, but the moment you provide it externally, the naming/notice obligations arise. For details, see the license chapter of the pillar article.
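A release checklist can encode the two conditions described above so they are not forgotten at publish time. This is a hypothetical pre-release check, not a compliance tool: `check_release` covers only the name prefix and the notice text mentioned in this article, and the license itself remains the authority.

```python
def check_release(model_name: str, notice_text: str) -> list[str]:
    """Return the naming/notice problems found for an externally provided model."""
    problems = []
    if not model_name.startswith("Llama"):
        problems.append('model name should begin with "Llama"')
    if "Built with Llama" not in notice_text:
        problems.append('notice should include "Built with Llama"')
    return problems


print(check_release("Llama-YourCompany-Support-8B", "Built with Llama"))
print(check_release("YourCompany-Support-8B", "Powered by our model"))
```

Wiring a check like this into the same pipeline as the evaluation gate means a model cannot be published externally without passing both.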
Common pitfalls
- Catastrophic forgetting: over-specializing drops the original general ability. Mix general tasks into the eval set too and monitor regressions.
- Overfitting: too many epochs or too little data makes evaluation scores worse. Monitor eval metrics from early on and stop when they stop improving.
- Data leakage: evaluation mixes into training. Decide the split first, deterministically.
- Too-small data: a few dozen examples won't change the style. Aim for hundreds to thousands of quality examples.
- Doing FT when RAG sufficed: adding knowledge is RAG by default. Baking facts via FT goes obsolete.
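For the overfitting pitfall, "watch eval and stop" can be expressed as a patience rule over per-epoch eval losses. This is a minimal sketch with an illustrative name, `best_checkpoint`; the patience value is an assumption to tune, and trainers such as the Hugging Face one offer their own early-stopping callbacks.

```python
def best_checkpoint(eval_losses: list[float], patience: int = 2) -> tuple[int, bool]:
    """Return (index of the best eval loss, whether early stopping triggered).

    Training stops once `patience` consecutive evaluations fail to beat the best loss.
    """
    if not eval_losses:
        raise ValueError("eval_losses must contain at least one value")
    best_idx = 0
    for i, loss in enumerate(eval_losses):
        if loss < eval_losses[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            return best_idx, True
    return best_idx, False


# Eval loss improves for three evaluations, then gets worse twice in a row.
print(best_checkpoint([1.0, 0.8, 0.7, 0.75, 0.9, 0.6], patience=2))
```

The checkpoint to deploy is the one at the returned index, not the last one written.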
Frequently asked questions (FAQ)
Q. RAG or fine-tuning — which should I do first? A. RAG first. Knowledge and freshness in RAG, behavior and style in FT. Many projects meet requirements with RAG + prompting, and adding just a light LoRA for the missing style fixing is the cost-effectiveness optimum.
Q. How much data is needed? A. For fixing style and tone, hundreds to thousands of quality samples is the rough guide. Quality over quantity. A hand-polished few beats a miscellaneous many.
Q. Can Llama 4 (MoE) be fine-tuned too? A. Technically yes, but FT of a huge MoE requires distributed infrastructure and is a poor first step. It's realistic to start from a dense model such as Llama 3.1 8B or Llama 3.3 70B with LoRA/QLoRA.
Q. Can I do it without a GPU? A. Yes, without owning one. Bedrock managed FT (check the target models) means zero GPU operation. For verification only, you can run QLoRA on a small model on Colab or a single GPU.
Q. How big is the artifact (the adapter)? A. A LoRA adapter ranges from tens of MB for an 8B model to a few hundred MB for a 70B model, depending on rank and target modules. It can be distributed and swapped separately from the base, and you can also switch among adapters for multiple domains in operation.
Conclusion
Fine-tuning is not "magic that makes it smart" but "engineering that fixes behavior to spec." That is exactly why this discipline — doubt first whether RAG suffices, invest in data quality, pass an evaluation gate, and observe the license naming — separates a demo from production.
If you want to put a domain-specialized Llama into production — including its division of roles with RAG, an evaluation harness, cost design, and license compliance — see the case study and reach out. With one person × generative AI, I design end-to-end from PoC to production operation.
Sources / official resources
- torchtune (meta-pytorch/torchtune) — PyTorch-native LoRA/QLoRA recipes
- Llama official Fine-tuning guide
- Hugging Face TRL / PEFT
- Amazon Bedrock model-customization support table
- Llama Community License — derivative-model naming convention / Built with Llama
- Supported models, regions, and the license are updated. Always confirm the primary sources before implementing.