# Practical Llama fine-tuning: specializing to your own data with LoRA/QLoRA and putting it into production

> The strength of open weights is that you can fine-tune them on your own data. From a production-operations viewpoint and with real code, this article walks through fine-tuning Llama with LoRA/QLoRA: deciding whether it is necessary at all (RAG vs. FT), data preparation, torchtune/TRL implementation, an evaluation gate, merge and deploy, and the license naming convention.

- Published: 2026-06-25
- Author: 友田 陽大
- Tags: Llama, Fine-tuning, LoRA, Generative AI, AWS Bedrock, Python, MLOps
- URL: https://tomodahinata.com/en/blog/llama-fine-tuning-lora-qlora-production-guide
- Category: Llama & open-weight LLMs
- Pillar guide: https://tomodahinata.com/en/blog/meta-llama-open-weight-llm-production-guide

## Key points

- Doubt before fine-tuning: adding knowledge is RAG; fixing behavior, output format, tone, and domain vocabulary is FT. In many cases RAG + prompting suffices.
- The realistic target is a dense model (Llama 3.1 8B, Llama 3.3 70B, etc.). With LoRA/QLoRA, 4-bit quantization lets even large models train on a single GPU. A full FT of the Llama 4 MoE models is a far heavier undertaking.
- 90% of success is data. Use instruction-format JSONL, put quality first, deduplicate, split train/eval, and prevent leakage.
- The official tool is torchtune (PyTorch-native). HF TRL + PEFT is the other standard. Managed options are Bedrock (Llama 2/3.2) and SageMaker + Custom Model Import.
- Don't ship without an evaluation gate. Compare pre/post-FT on a holdout to detect regressions. A distributed derivative model name must carry the 'Llama' prefix (license).

---

## The goal of this article

As stated in [the full picture of putting Llama into production](/blog/meta-llama-open-weight-llm-production-guide), one of the biggest reasons to choose open weights is that "**you can fine-tune (FT) the weights on your own data.**" This article takes that FT all the way through — **not as a demo but for production** — from the viewpoint of reproducibility, evaluation, cost, and license, with real code.

By the end, you should be able to do the following three things.

1. **Judge whether FT is even necessary** (including "in many cases RAG suffices").
2. **Actually run it with LoRA/QLoRA** (both the torchtune and Hugging Face TRL routes).
3. **Deploy safely only after passing an evaluation gate**, and **observe the license naming convention.**

> **Reliability disclosure**: I operate generative-AI systems in production on AWS Bedrock / Vercel AI SDK, and domain specialization (fixing part numbers, domain vocabulary, output formats) was also a central challenge in the [voice-customer-service case study](/blog/production-voice-ai-sales-agent-bedrock-pgvector). This article follows the order that works in practice: "**FT is the last resort.**" I won't present inflated accuracy-improvement numbers.

---

## Doubt first: does that problem really need FT?

FT is powerful, but **it's not the first tool you should reach for.** Cost, operational burden, and obsolescence risk are large. Consider it in the following order.

| Method | Effective for | Not effective for | Cost/operation |
| --- | --- | --- | --- |
| **Prompt/Few-shot** | format instructions, simple behavior | large knowledge, strong style fixing | minimal |
| **RAG (retrieval-augmented)** | **referencing latest knowledge / internal docs**, citing sources | fundamental change of tone or output style | medium ([pgvector RAG](/blog/pgvector-postgres-production-rag-hybrid-search)) |
| **Fine-tuning** | **fixing behavior, output format, tone, domain vocabulary**, reducing latency | freshness of knowledge (frozen at training time) | large (data, training, evaluation, re-training) |

**Iron rules**:
- **Want to add knowledge** → first **RAG.** Baking facts into the model is a hotbed of obsolescence and hallucination.
- **Want it to always answer in the same format, tone, and judgment**, **the prompt got too long**, **want to eliminate misses of domain vocabulary** → time for **FT.**
- In practice, the combination of **RAG + light FT** has the best cost-effectiveness (knowledge in RAG, style in FT).

> 💡 "FT makes it smarter" is a misconception. What FT changes is mainly the **distribution of behavior**, not **reliable memory of new facts.** Holding facts in RAG is the production standard.

---

## What to fine-tune and how (realistic targets and LoRA/QLoRA)

### Target model: start from dense models

[Llama 4 is MoE](/blog/meta-llama-open-weight-llm-production-guide#llama-4-とは何かネイティブマルチモーダル-moe) with a huge total parameter count (Scout 109B / Maverick 400B). A **full FT** of it requires substantial distributed-training infrastructure and is not a sensible first step. **The realistic practical target is a dense model**, such as **Llama 3.1 8B**, **Llama 3.3 70B**, or a small use-specific variant, and applying **LoRA/QLoRA** to it is the standard route.

### LoRA and QLoRA in a sentence

- **LoRA (Low-Rank Adaptation)**: freeze the huge weights and **train only small low-rank matrices.** Because only a tiny fraction of the parameters is updated, it's **memory-thrifty and fast, and the artifact (the adapter) is small**: tens of MB for an 8B-class model at a typical rank, more for larger models or higher ranks.
- **QLoRA**: load the base model **4-bit quantized** and train LoRA on top of it. The advantage is that **even the 70B class comes within reach of a single high-memory GPU.** The accuracy loss from quantization is often small in practice.

| Method | Memory | Speed | Artifact | Best scene |
| --- | --- | --- | --- | --- |
| Full FT | largest | slowest | whole model | large-scale, rebuilding the foundation |
| **LoRA** | small | fast | adapter (small) | standard. fixing style/vocabulary |
| **QLoRA** | **smallest** | fast | adapter (small) | **a larger model on a single GPU** |
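
To make "small adapter" concrete, here is a rough size estimate. It is a sketch under stated assumptions: LoRA adds two matrices of shape (r × d_in) and (d_out × r) to each adapted linear layer, the dimensions below are the published attention shapes of Llama 3.1 8B and Llama 3.3 70B (verify them against the model's `config.json`), and `AttnShape`, `lora_param_count`, and `adapter_megabytes` are hypothetical helpers, not library functions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnShape:
    """Attention projection sizes of a dense Llama model (assumed values)."""
    hidden: int   # model hidden size
    kv_dim: int   # num_key_value_heads * head_dim (grouped-query attention)
    layers: int   # number of transformer blocks


LLAMA_3_1_8B = AttnShape(hidden=4096, kv_dim=1024, layers=32)
LLAMA_3_3_70B = AttnShape(hidden=8192, kv_dim=1024, layers=80)


def lora_param_count(shape: AttnShape, r: int) -> int:
    """Trainable parameters when LoRA targets q_proj, k_proj, v_proj and o_proj."""
    # Each adapted Linear(d_in, d_out) gains A (r x d_in) and B (d_out x r),
    # which is r * (d_in + d_out) extra parameters.
    q = r * (shape.hidden + shape.hidden)
    k = r * (shape.hidden + shape.kv_dim)
    v = r * (shape.hidden + shape.kv_dim)
    o = r * (shape.hidden + shape.hidden)
    return (q + k + v + o) * shape.layers


def adapter_megabytes(params: int, bytes_per_param: int = 2) -> float:
    """Approximate adapter size in MB (2 bytes per value for bf16, 4 for fp32)."""
    return params * bytes_per_param / 1e6


for name, shape in [("Llama 3.1 8B", LLAMA_3_1_8B), ("Llama 3.3 70B", LLAMA_3_3_70B)]:
    n = lora_param_count(shape, r=16)
    print(f"{name}: {n:,} trainable params, ~{adapter_megabytes(n):.0f} MB in bf16")
```

At rank 16 on the four attention projections, this works out to about 13.6M trainable parameters (roughly 27 MB in bf16) for the 8B model and about 65.5M (roughly 131 MB) for the 70B model. Adding the MLP projections to the targets or saving in fp32 increases these figures.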

---

## Data preparation: 90% of success is here

FT quality is **determined by data quality.** Differences in model or hyperparameters matter far less by comparison.

- **Format**: instruction/chat-format **JSONL.** 1 line = 1 sample (the sample below is pretty-printed for readability; in the file it sits on a single line).
- **Quality > quantity**: 1,000 hand-polished samples beat 100,000 noisy ones. The model learns the errors in the data just as faithfully as the correct answers.
- **Dedup**: remove near-duplicates. Repeats of the same example invite overfitting.
- **train/eval split**: always **carve out an eval set.** Mixing the two amounts to "grading your own homework" (leakage), and you overestimate accuracy.
- **Diversity**: cover the input distribution that comes in production. Include edge cases and ways of declining ("I don't know") in the training target too.

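```json
{"messages":[
  {"role":"system","content":"You are our company's quotation assistant. Do not guess numbers; if information is missing, answer 'unknown'."},
  {"role":"user","content":"What is the standard lead time for model number TX-200?"},
  {"role":"assistant","content":"The standard lead time for TX-200 is 5 business days. It may vary depending on stock availability."}
]}
```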
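
The format and dedup rules above can be enforced with a small cleaning pass. The following is a minimal sketch, not a complete pipeline: `is_valid_sample`, `fingerprint`, and `clean` are hypothetical helpers, the validity rule (a non-empty `messages` list ending with an assistant turn) is an assumption about what your trainer expects, and the dedup only catches samples that match after lowercasing and whitespace normalization. True near-duplicate detection (MinHash, embeddings) is beyond this sketch.

```python
import hashlib
import json
import re
from typing import Iterable, Iterator

VALID_ROLES = {"system", "user", "assistant"}


def is_valid_sample(sample: dict) -> bool:
    """A sample needs a non-empty 'messages' list that ends with an assistant turn."""
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    for m in messages:
        if not isinstance(m, dict) or m.get("role") not in VALID_ROLES:
            return False
        if not isinstance(m.get("content"), str) or not m["content"].strip():
            return False
    return messages[-1]["role"] == "assistant"


def fingerprint(sample: dict) -> str:
    """Hash of the lowercased, whitespace-collapsed conversation text."""
    text = "\n".join(f'{m["role"]}:{m["content"]}' for m in sample["messages"])
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def clean(lines: Iterable[str]) -> Iterator[dict]:
    """Yield each valid sample once, dropping malformed rows and normalized duplicates."""
    seen: set[str] = set()
    for line in lines:
        if not line.strip():
            continue
        try:
            sample = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(sample, dict) or not is_valid_sample(sample):
            continue
        fp = fingerprint(sample)
        if fp not in seen:
            seen.add(fp)
            yield sample
```

Because `clean` accepts any iterable of lines, it can be fed an open file object directly and its output written back out one JSON object per line.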

> ⚠️ **Leakage is the worst bug because it "quietly" inflates accuracy.** If part of the eval set leaks into training, the benchmark score rises while production quality does not. Do the **split first, deterministically** (e.g., assign by hash).
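
One way to make the split deterministic is to hash a stable key for each sample and map the hash to a bucket. This is a sketch under assumptions: `assign_split` is a hypothetical helper, the `salt` value is arbitrary, and choosing the key is up to you. Keying on a source-document ID or a normalized user question, rather than on the row index, keeps variants of the same case on the same side of the split.

```python
import hashlib


def assign_split(key: str, eval_percent: int = 10, salt: str = "split-v1") -> str:
    """Map a stable sample key to 'train' or 'eval' via a hash, not a random shuffle."""
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return "eval" if bucket < eval_percent else "train"


# Illustrative keys only: the same key always lands in the same split,
# no matter when or on which machine the script runs.
keys = [f"ticket-{i}" for i in range(1000)]
splits = {k: assign_split(k) for k in keys}
print(sum(s == "eval" for s in splits.values()), "of", len(keys), "assigned to eval")
```

Since the assignment depends only on the key and the salt, adding new data later never moves an existing sample from eval into train.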

---

## Implementation A: torchtune (PyTorch-native, official)

The Meta/PyTorch-official **[torchtune](https://github.com/meta-pytorch/torchtune)** has the philosophy of "**bare PyTorch with no trainer or abstraction in between.**" You configure a recipe in YAML and run it via CLI.

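```bash
pip install torchtune

# Fetch the weights (requires an HF account that has accepted the license)
tune download meta-llama/Llama-3.3-70B-Instruct \
  --output-dir ./Llama-3.3-70B-Instruct

# Single-device LoRA training: you only specify a recipe and a config.
# Config names vary by torchtune version; list the available ones with `tune ls`.
# The dataset overrides must match the dataset builder used by the chosen config.
tune run lora_finetune_single_device \
  --config llama3_3/70B_lora_single_device \
  dataset.source=json \
  dataset.data_files=./data/train.jsonl
```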

torchtune **declaratively bundles** the learning rate, LoRA rank, epochs, quantization, etc. into the config, so you can diff and version-control who ran what with which settings. This reproducibility is the lifeline of production operation.

---

## Implementation B: Hugging Face TRL + PEFT (QLoRA)

The most widespread route. Loading the base model in 4-bit with `BitsAndBytesConfig` and passing `PEFT`'s `LoraConfig` to `TRL`'s `SFTTrainer` gives you QLoRA.

```python
# pip install trl peft transformers bitsandbytes datasets
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

MODEL_ID = "meta-llama/Llama-3.3-70B-Instruct"  # a dense model is the realistic target

# QLoRA: load the base in 4-bit (NF4) so that even a large model fits on a single GPU
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Only the low-rank adapter is trained. Targeting the attention projection matrices is the standard choice.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

dataset = load_dataset("json", data_files={"train": "data/train.jsonl"})["train"]

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora,
    args=SFTConfig(
        output_dir="out/llama-domain-lora",
        num_train_epochs=2,                 # too many epochs overfit; try 1 to 3 and watch
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,      # raises the effective batch size
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
        save_strategy="epoch",
    ),
)
trainer.train()
trainer.save_model("out/llama-domain-lora")  # the artifact is the adapter (small)
```

The point is that **only the adapter is trained.** The base 70B stays frozen in 4-bit, so its weights take roughly a quarter of the memory they would in bf16, which is what brings the required VRAM down to a realistic level.
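
As a back-of-the-envelope check on that claim, the weight memory can be estimated from the parameter count and the bits per parameter. This is a rough sketch with assumptions: the parameter count is the nominal 70 billion, NF4 is taken as 4 bits plus about 0.5 bit per parameter for block-wise scaling constants (block size 64, no double quantization), and every layer is treated as quantized, which slightly understates real usage. `weight_gigabytes` is a hypothetical helper.

```python
def weight_gigabytes(n_params: float, bits_per_param: float) -> float:
    """Memory for the model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS_70B = 70e9   # nominal size; the exact count is slightly higher
BF16_BITS = 16.0
NF4_BITS = 4.0 + 0.5  # assumed: 4-bit values plus ~0.5 bit/param of scaling constants

print(f"bf16 weights: ~{weight_gigabytes(N_PARAMS_70B, BF16_BITS):.0f} GB")
print(f"NF4 weights:  ~{weight_gigabytes(N_PARAMS_70B, NF4_BITS):.0f} GB")
```

Under these assumptions the weights drop from about 140 GB to about 39 GB. Activations, adapter gradients, optimizer state, and framework overhead come on top of that, so the accelerator needs clearly more memory than the weight figure alone.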

---

## Implementation C: managed (offloading operations to AWS)

If you don't want to own training infrastructure, **use the managed options on AWS.**

- **Bedrock managed FT**: as of writing, targets include **Llama 2 / Llama 3.2** (including Vision). Put the training data in S3, run a job from the console/API, and you get a **custom model** back. There are no GPUs to operate yourself.
- **SageMaker + Bedrock Custom Model Import**: **import** the weights you made with torchtune/TRL (Llama 2/3/3.1/3.2-family architecture) **into Bedrock** and call them as-is via the existing Converse API.

> 📌 **A note on accuracy**: the **models and regions that managed FT supports change over time.** Always check the latest in the [Bedrock model-customization support table](https://docs.aws.amazon.com/bedrock/latest/userguide/custom-model-supported.html). Confirming **first** whether the model you want is a managed-FT target prevents surprises.

---

## Evaluation gate: don't ship to production without this

The most dangerous thing in FT is shipping on a "**vague feeling it got better.**" **Compare pre- and post-FT programmatically on a holdout set**, detect regressions, and only then deploy.

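```python
# evaluate.py: compare the base and the fine-tuned model on the same eval set and decide pass/fail
import json
import statistics
from typing import Callable


def score(output: str, expected: str) -> float:
    """Placeholder scorer (exact match). Swap in numeric match, LLM-judge, etc. for your use."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_eval(generate: Callable[[str], str], eval_path: str) -> float:
    """Score every sample and return the mean score (0.0 to 1.0)."""
    scores = []
    with open(eval_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            ex = json.loads(line)  # eval rows are assumed to look like {"input": ..., "expected": ...}
            scores.append(score(generate(ex["input"]), ex["expected"]))
    return statistics.mean(scores)


def gate(generate_base: Callable[[str], str], generate_tuned: Callable[[str], str],
         eval_path: str = "data/eval.jsonl", min_gain: float = 0.03) -> None:
    """Stop the deploy unless the tuned model beats the base by at least min_gain.

    This checks the gain on the target task only; regressions in existing
    abilities need a second run on a general-task eval set.
    """
    base = run_eval(generate_base, eval_path)
    tuned = run_eval(generate_tuned, eval_path)
    if tuned < base + min_gain:  # explicit check: `assert` is skipped under `python -O`
        raise SystemExit(f"FAIL insufficient improvement: base={base:.3f} tuned={tuned:.3f}")
    print(f"PASS base={base:.3f} -> tuned={tuned:.3f}")
```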

Evaluation hinges on a **scoring function matched to the use.** Exact match / schema-conformance rate for structured output, LLM-judge for summarization, F1 for classification. Only once you can **quantify "what is correct in production"** does FT become engineering.
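
The scorers named above can be quite small. The following sketches show one possible shape for each of the non-LLM options; the function names and the required-keys approach to schema checking are illustrative assumptions, and a production setup would more likely validate against a full JSON Schema.

```python
import json
from typing import Sequence


def exact_match(output: str, expected: str) -> float:
    """1.0 if the strings match after trimming surrounding whitespace."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def schema_conformance(output: str, required_keys: Sequence[str]) -> float:
    """1.0 if the output parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    return 1.0 if all(k in parsed for k in required_keys) else 0.0


def binary_f1(predicted: Sequence[str], gold: Sequence[str], positive: str) -> float:
    """F1 of the positive class over paired predicted/gold labels."""
    pairs = list(zip(predicted, gold))
    tp = sum(p == positive and g == positive for p, g in pairs)
    fp = sum(p == positive and g != positive for p, g in pairs)
    fn = sum(p != positive and g == positive for p, g in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

The first two return a per-sample score that can be averaged over the eval set, while F1 is computed once over the whole set of predictions.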

---

## Deploy and license (mind the naming convention)

1. **Merge the adapter**: fold the LoRA adapter into the base to produce a single set of weights (this removes the adapter's extra computation at inference time and simplifies serving).
2. **Serve**: either [self-serve with vLLM](/blog/vllm-llama-self-hosting-production-inference-server) or call from Converse via Bedrock Custom Model Import.
3. **Estimate cost**: for self-operation, derive the break-even with [inference-cost design](/blog/llama-inference-cost-optimization-self-host-vs-api) first.
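
For the TRL/PEFT route, the merge step can be sketched as follows. The assumptions: the adapter was saved to the output directory used in Implementation B, the base is reloaded in bf16 instead of 4-bit (a common recommendation, since merging into quantized weights introduces rounding error), and the machine has enough memory to hold the unquantized base. `merge_adapter` and the output path are hypothetical names.

```python
def merge_adapter(base_id: str, adapter_dir: str, out_dir: str) -> None:
    """Fold a LoRA adapter into its base weights and save a standalone checkpoint."""
    # Heavy dependencies are imported lazily so this module loads without a GPU stack.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Reload the base unquantized (bf16) and attach the trained adapter to it.
    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
    model = PeftModel.from_pretrained(base, adapter_dir)

    # merge_and_unload() applies the adapter deltas and returns a plain model.
    merged = model.merge_and_unload()
    merged.save_pretrained(out_dir, safe_serialization=True)

    # Ship the tokenizer alongside so the directory is directly servable.
    AutoTokenizer.from_pretrained(base_id).save_pretrained(out_dir)


# Example call (paths are illustrative):
# merge_adapter("meta-llama/Llama-3.3-70B-Instruct",
#               "out/llama-domain-lora",
#               "out/Llama-domain-merged")
```

If you plan to swap adapters per domain at serving time, skip the merge and keep the adapters separate instead.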

> ⚖️ **Read the license**: the [Llama Community License](https://www.llama.com/llama4/license/) requires that **when you distribute or provide a derivative model built using Llama, you prepend "Llama" to the model name** (e.g., `Llama-YourCompany-Support-8B`). It also requires a `Built with Llama` notice. Each Llama release ships under its own version of the license, so read the one that matches your base model. **Internal-only use isn't distribution, but the moment you provide it externally, the naming/notice obligations arise.** For details, see the [license chapter of the pillar article](/blog/meta-llama-open-weight-llm-production-guide#ライセンスの落とし穴商用前に必読).
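
The naming rule is easy to forget at release time, so it can be worth a one-line guard in the publishing script. This is a minimal sketch: `check_llama_model_name` is a hypothetical helper that covers only the name prefix, not the notice requirement or any other license term.

```python
def check_llama_model_name(name: str) -> None:
    """Raise if a model name about to be published does not begin with 'Llama'."""
    if not name.startswith("Llama"):
        raise ValueError(f"derivative model name must start with 'Llama': {name!r}")


check_llama_model_name("Llama-YourCompany-Support-8B")  # passes silently
```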

---

## Common pitfalls

- **Catastrophic forgetting**: over-specializing drops the original general ability. **Mix general tasks into the eval set too** and monitor regressions.
- **Overfitting**: too many epochs / too little data worsens evaluation. **Watch eval early** and stop.
- **Data leakage**: evaluation mixes into training. **Decide the split first, deterministically.**
- **Too-small data**: a few dozen examples won't change the style. **Aim for hundreds to thousands of quality examples.**
- **Doing FT when RAG sufficed**: adding knowledge is RAG by default. Baking facts via FT goes obsolete.
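
The first pitfall suggests a gate with two conditions: enough gain on the domain set and no meaningful drop on a general set. A minimal sketch of that decision follows. `EvalScores` and `gate_decision` are hypothetical names, the 0.03 gain threshold mirrors the one used in the evaluation gate, and the 0.01 regression tolerance is an assumed value to tune for your own metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalScores:
    domain: float   # mean score on the domain holdout set
    general: float  # mean score on a general-ability set


def gate_decision(base: EvalScores, tuned: EvalScores,
                  min_domain_gain: float = 0.03,
                  max_general_drop: float = 0.01) -> tuple[bool, str]:
    """Return (passed, reason) for a fine-tuned model compared with its base."""
    if tuned.domain < base.domain + min_domain_gain:
        return False, "insufficient domain gain"
    if tuned.general < base.general - max_general_drop:
        return False, "general-ability regression"
    return True, "pass"


# Illustrative numbers only: a large domain gain does not excuse a general regression.
print(gate_decision(EvalScores(0.70, 0.80), EvalScores(0.85, 0.70)))
```

Returning the reason alongside the verdict makes it straightforward to log why a candidate was rejected.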

---

## Frequently asked questions (FAQ)

**Q. RAG or fine-tuning — which should I do first?**
A. **RAG first.** Knowledge and freshness belong in RAG, behavior and style in FT. Many projects meet their requirements with RAG + prompting, and adding just a light LoRA for whatever style fixing is still missing is usually the most cost-effective choice.

**Q. How much data is needed?**
A. For fixing style and tone, **hundreds to thousands of quality samples** is the rough guide. Quality over quantity: a small hand-polished set beats a large miscellaneous one.

**Q. Can Llama 4 (MoE) be fine-tuned too?**
A. Technically possible, but FT of a huge MoE requires distributed infrastructure and is **unsuited for the first step.** It's realistic to start from a dense model such as **Llama 3.1 8B or Llama 3.3 70B** + LoRA/QLoRA.

**Q. Can I do it without a GPU?**
A. You can. **Bedrock managed FT** (check the target models) means zero GPU operation. For verification only, you can run QLoRA of a small model on Colab / a single GPU.

**Q. How big is the artifact (the adapter)?**
A. For an 8B-class model at a typical rank, a LoRA adapter is in the **tens-of-MB class**; for a 70B-class model it lands in the low hundreds of MB. It can be distributed/swapped separately from the base, and switching among adapters for multiple domains is also possible.

---

## Conclusion

Fine-tuning is not "magic that makes it smart" but "**engineering that fixes behavior to spec.**" That is exactly why this discipline — **doubt first whether RAG suffices**, **invest in data quality**, **pass an evaluation gate**, and **observe the license naming** — separates a demo from production.

> If you want to put a domain-specialized Llama into production — including its division of roles with RAG, an evaluation harness, cost design, and license compliance — see the [case study](/case-studies/ai-voice-chatbot) and reach out. With **one person × generative AI**, I design end-to-end from PoC to production operation.

### Sources / official resources

- [torchtune (meta-pytorch/torchtune)](https://github.com/meta-pytorch/torchtune) — PyTorch-native LoRA/QLoRA recipes
- [Llama official Fine-tuning guide](https://www.llama.com/docs/how-to-guides/fine-tuning/)
- [Hugging Face TRL](https://huggingface.co/docs/trl) / [PEFT](https://huggingface.co/docs/peft)
- [Amazon Bedrock model-customization support table](https://docs.aws.amazon.com/bedrock/latest/userguide/custom-model-supported.html)
- [Llama Community License](https://www.llama.com/llama4/license/) — derivative-model naming convention / Built with Llama

* Supported models, regions, and license terms change over time. Always confirm the primary sources before implementing.
