Chapter 8 · Fine-tuning, LoRA, and PEFT

When to bend a model to your will — and how to do it cheaply.


Fine-tuning used to mean "rent a cluster, pay $50k, hope for the best." In 2026, it means "a few hours on a single GPU, cost in the low hundreds of dollars, reproducible." The methods that made this possible — LoRA, QLoRA, DPO, and their cousins — deserve a chapter.

In plain English. Fine-tuning is teaching the model new habits. RAG is handing the model new information. They solve different problems and they compose beautifully.

Cost vs control: choosing your approach

quadrantChart
    title Adaptation methods - cost vs control
    x-axis Low control --> High control
    y-axis Cheap --> Expensive
    quadrant-1 Heavy customization
    quadrant-2 Wasteful
    quadrant-3 Quick wins
    quadrant-4 Surgical
    Prompting: [0.15, 0.05]
    Few-shot: [0.30, 0.10]
    RAG: [0.40, 0.20]
    Tool use: [0.55, 0.30]
    LoRA: [0.70, 0.40]
    QLoRA: [0.65, 0.30]
    DPO: [0.75, 0.50]
    Full fine-tune: [0.90, 0.95]
    Pretrain from scratch: [0.95, 1.00]

But first, the most important question: should you fine-tune at all?

8.1 The decision tree

flowchart TD
    A[Task underperforming] --> B{Knowledge gap?}
    B -- yes --> C["Use RAG<br>Ch 7"]
    B -- no --> D{Format / style / tone?}
    D -- yes --> E{Few-shot fixes it?}
    E -- yes --> F[Few-shot prompting]
    E -- no --> G[Fine-tune]
    D -- no --> H{Reasoning?}
    H -- yes --> I["Reasoning model<br>or CoT"]
    H -- no --> J{Too slow / expensive?}
    J -- yes --> K["Distill big -> small<br>fine-tune the small one"]
    J -- no --> L{Tool-use errors?}
    L -- yes --> M[Fine-tune on tool traces]
    L -- no --> G

The rule that matters: fine-tuning is for behavior, RAG is for knowledge. If your model doesn't know a fact, don't fine-tune — retrieve. If your model won't output JSON the way you want, or won't match your brand voice, or won't route tool calls correctly, fine-tune.

8.2 Why full fine-tuning fell out of favor

A frontier model might have hundreds of billions of parameters. Full fine-tuning updates all of them. Problems:

  1. Memory: gradients and optimizer state for every parameter mean you need a multi-GPU cluster just to start training.
  2. Storage: every fine-tune produces a full copy of the model, hundreds of gigabytes per task.
  3. Cost and iteration speed: each run is expensive enough that you can't afford to experiment.
  4. Catastrophic forgetting: updating every weight risks degrading general capabilities the base model already had.

Parameter-Efficient Fine-Tuning (PEFT) methods fix all four.

8.3 LoRA — the technique that changed the game

Low-Rank Adaptation (Hu et al., 2021) is embarrassingly simple and unreasonably effective.

Instead of updating the full weight matrix W, LoRA freezes W and trains two small matrices A and B whose product A × B is added as a delta:

W_new =   W   +  A × B
          ↑        ↑
        frozen   trained
        (big)    (tiny, low rank)

flowchart LR
    subgraph Standard
    X[x] --> W["W: big dense matrix<br>update all entries"]
    W --> Y[y]
    end
    subgraph LoRA
    X2[x] --> Wf[W frozen] --> Sum
    X2 --> A["A: d x r"] --> B["B: r x d"] --> Sum
    Sum --> Y2[y]
    end
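
To make the shapes concrete, here is a minimal PyTorch sketch of the forward pass with a LoRA delta (illustrative only; real implementations live in libraries like peft and also scale the delta by alpha / r):

import torch

d, r = 4096, 16                  # hidden size, LoRA rank
W = torch.randn(d, d)            # frozen pretrained weight (no gradients)
A = torch.randn(d, r) * 0.01     # trained, d x r
B = torch.zeros(r, d)            # trained, r x d (zero-init, so the delta starts at 0)

x = torch.randn(1, d)
y = x @ (W + A @ B)              # W_new = W + A × B, computed on the fly
# Only A and B receive gradients: 2 * d * r parameters instead of d * d.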

Typical settings: r = 8 or r = 16. You train ~0.1% of the original parameters. You get ~95%+ of full fine-tuning quality. Storage is KB to MB. You can maintain dozens of LoRA adapters per base model and swap at serve time.
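
In practice you rarely write that by hand; Hugging Face's peft library wires it up. A minimal sketch, where the model name and target modules are illustrative (pick the attention projections of your actual base model):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                 # the low rank from above
    lora_alpha=32,                        # delta is scaled by lora_alpha / r
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which weight matrices get adapters
)
model = get_peft_model(base, config)
model.print_trainable_parameters()        # confirms the tiny trainable fraction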

QLoRA (Dettmers et al., 2023) adds 4-bit quantization of the frozen base model. Result: fine-tune a 70B model on a single A100 (80 GB). This turned fine-tuning into something a developer could do over lunch.
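
The usual recipe pairs transformers with bitsandbytes. A sketch (the model name is illustrative, and exact flags vary by library version):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4, from the QLoRA paper
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb,
    device_map="auto",
)
# Attach a LoraConfig exactly as above: adapters train in bf16 while the
# frozen base stays in 4-bit.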

8.4 Preference optimization: RLHF, DPO, and friends

The second fine-tuning axis is not "imitate good answers" (supervised) but "prefer A over B" (preference-based). This is what turns base models into chat models.

flowchart LR
    subgraph RLHF
    A1[SFT model] --> A2[Sample pairs]
    A2 --> A3[Human ranks]
    A3 --> A4[Reward model]
    A1 --> A5[PPO] 
    A4 --> A5
    A5 --> A6[Aligned model]
    end
    subgraph DPO
    B1[SFT model] --> B2["Preference pairs: (prefer, reject)"]
    B2 --> B3[DPO loss]
    B1 --> B3
    B3 --> B4[Aligned model]
    end

For most backend engineers: start with SFT on domain data, graduate to DPO if you need tone or preference shaping. Skip RLHF unless you're running an ML team.
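
The TRL library reduces the DPO path to a trainer call. A sketch, assuming an SFT model and a preference dataset shaped like the pairs in Section 8.7 (argument names shift between TRL versions, so check the docs for yours):

from trl import DPOConfig, DPOTrainer

# pref_ds rows look like: {"prompt": ..., "chosen": ..., "rejected": ...}
args = DPOConfig(
    output_dir="dpo-out",
    beta=0.1,                    # how hard to stay close to the SFT reference
    per_device_train_batch_size=2,
)
trainer = DPOTrainer(
    model=model,                 # your SFT model
    args=args,
    train_dataset=pref_ds,
    processing_class=tokenizer,  # `tokenizer=` in older TRL versions
)
trainer.train()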

8.5 When fine-tuning is clearly worth it

  1. Strict output formats: the model must emit exact JSON or a fixed schema, every time.
  2. Brand voice and tone that prompting can't hold consistently at scale.
  3. Tool-call routing in agents, trained on traces of correct tool use.
  4. Distillation: teaching a small, cheap model to match a big one on your narrow task (Section 8.9 walks through this).
  5. High-volume, narrow tasks like classification, where per-token API costs dominate.

8.6 When fine-tuning is the wrong answer

  1. Knowledge gaps: if the model doesn't know a fact, retrieve it with RAG (Ch 7); facts baked into weights go stale.
  2. Anything few-shot prompting already fixes; try that first.
  3. Reasoning failures: switch to a reasoning model or chain-of-thought prompting instead.
  4. Tiny or noisy datasets: a fine-tune will faithfully learn your mistakes.

8.7 The data question

Fine-tuning quality is overwhelmingly determined by training data quality. Practical rules:

  1. Quality beats quantity: a thousand clean, consistent examples beat ten thousand noisy ones.
  2. Match the serve-time format exactly: same system prompt, same schema, same chat template.
  3. Deduplicate and fix label errors before training; the model will learn your mistakes verbatim.
  4. Carve out a held-out eval set before you start, and never train on it.

A minimal SFT dataset shape:

{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [...]}

For DPO:

{"prompt": "...", "chosen": "good answer", "rejected": "bad answer"}

8.8 The cloud-native fine-tuning path

You don't need to write PyTorch to fine-tune in 2026. Each hyperscaler ships managed flows.

For serious work, Axolotl and LLaMA-Factory are the open-source go-tos: YAML-configured, multi-method (SFT / DPO / ORPO / QLoRA), backed by the Hugging Face TRL library.

flowchart LR
    A["Training data<br>JSONL"] --> B["Axolotl / LLaMA-Factory<br>config.yaml"]
    B --> C["Fine-tune job<br>single A100 / H100"]
    C --> D["LoRA adapter<br>10s of MB"]
    D --> E["vLLM serve<br>base + adapter"]
    D --> F["Merge + quantize<br>deploy standalone"]
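
If you would rather stay in Python than write YAML, the same TRL machinery these tools sit on is a few lines. A sketch with illustrative model and file names:

from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

ds = load_dataset("json", data_files="train.jsonl", split="train")
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    args=SFTConfig(output_dir="sft-out", num_train_epochs=2),
    train_dataset=ds,
    peft_config=LoraConfig(r=16, lora_alpha=32),
)
trainer.train()                  # saves a LoRA adapter, not a full model copy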

8.9 A realistic end-to-end example

Task: classify customer support tickets into 40 internal categories. Currently using Claude Haiku, costing $2,500/month at 5M tickets.

  1. Export 10k labeled tickets from your DB.
  2. Clean labels; fix obvious mistakes; stratify train/eval.
  3. Generate 5k additional synthetic examples with Opus 4.7, filtered by confidence.
  4. Train a Qwen 2.5 3B with QLoRA on a single A100 for 3 hours.
  5. Eval on a held-out 1k tickets — target F1 ≥ 0.9.
  6. Serve with vLLM on a small GPU instance (sketched below) — cost drops to ~$150/month.
  7. Drift-monitor weekly; re-train quarterly as categories evolve.

Total project: one engineer, one week. ROI: first month.
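
Step 6, sketched with vLLM's multi-LoRA support (the adapter name and path are made up; point them at your trained adapter):

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="Qwen/Qwen2.5-3B-Instruct", enable_lora=True)
adapter = LoRARequest("ticket-classifier", 1, "/models/adapters/tickets-v1")

out = llm.generate(
    ["Classify this ticket: 'I was double-charged on my last invoice.'"],
    SamplingParams(temperature=0.0, max_tokens=8),
    lora_request=adapter,
)
print(out[0].outputs[0].text)    # expected: one of the 40 category labels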

8.10 What you don't need to learn

For most backend engineers, these are not required:

  1. Writing custom CUDA kernels or hand-optimizing attention.
  2. Multi-node distributed training (FSDP, DeepSpeed, tensor parallelism).
  3. Full RLHF pipelines: reward modeling and PPO tuning (Section 8.4's advice stands).
  4. Pretraining from scratch, the top-right corner of the quadrant chart.

These are specialist skills. Know what they are. Hire or partner for them.

Further reading & watching