Chapter 8 · Fine-tuning, LoRA, and PEFT

When to bend a model to your will — and how to do it cheaply.


Fine-tuning used to mean "rent a cluster, pay $50k, hope for the best." In 2026, it means "a few hours on a single GPU, cost in the low hundreds of dollars, reproducible." The methods that made this possible — LoRA, QLoRA, DPO, and their cousins — deserve a chapter.

In plain English. Fine-tuning is teaching the model new habits. RAG is handing the model new information. They solve different problems and they compose beautifully.

Cost vs control: choosing your approach

quadrantChart
    title Adaptation methods - cost vs control
    x-axis Low control --> High control
    y-axis Cheap --> Expensive
    quadrant-1 Heavy customization
    quadrant-2 Wasteful
    quadrant-3 Quick wins
    quadrant-4 Surgical
    Prompting: [0.15, 0.05]
    Few-shot: [0.30, 0.10]
    RAG: [0.40, 0.20]
    Tool use: [0.55, 0.30]
    LoRA: [0.70, 0.40]
    QLoRA: [0.65, 0.30]
    DPO: [0.75, 0.50]
    Full fine-tune: [0.90, 0.95]
    Pretrain from scratch: [0.95, 1.00]

But first, the most important question: should you fine-tune at all?

8.1 The decision tree

flowchart TD
    A[Task underperforming] --> B{Knowledge gap?}
    B -- yes --> C["Use RAG<br>Ch 7"]
    B -- no --> D{Format / style / tone?}
    D -- yes --> E{Few-shot fixes it?}
    E -- yes --> F[Few-shot prompting]
    E -- no --> G[Fine-tune]
    D -- no --> H{Reasoning?}
    H -- yes --> I["Reasoning model<br>or CoT"]
    H -- no --> J{Too slow / expensive?}
    J -- yes --> K["Distill big -> small<br>fine-tune the small one"]
    J -- no --> L{Tool-use errors?}
    L -- yes --> M[Fine-tune on tool traces]
    L -- no --> G

The rule that matters: fine-tuning is for behavior, RAG is for knowledge. If your model doesn't know a fact, don't fine-tune — retrieve. If your model won't output JSON the way you want, or won't match your brand voice, or won't route tool calls correctly, fine-tune.

8.2 Why full fine-tuning fell out of favor

A frontier model might have hundreds of billions of parameters. Full fine-tuning updates all of them. Problems:

  1. Memory: gradients and optimizer state for every parameter mean you need a multi-GPU cluster just to start training.
  2. Storage: every fine-tune produces a full copy of the model, hundreds of gigabytes per task.
  3. Cost and iteration speed: each run is expensive enough that you can't afford to experiment.
  4. Catastrophic forgetting: updating every weight risks degrading general capabilities the base model already had.

Parameter-Efficient Fine-Tuning (PEFT) methods fix all four.

8.3 LoRA — the technique that changed the game

Low-Rank Adaptation (Hu et al., 2021) is embarrassingly simple and unreasonably effective.

Instead of updating the full weight matrix W, LoRA freezes W and trains two small matrices A and B whose product A × B is added as a delta:

W_new =   W   +  A × B
          ↑        ↑
        frozen   trained
        (big)    (tiny, low rank)

flowchart LR
    subgraph Standard
    X[x] --> W["W: big dense matrix<br>update all entries"]
    W --> Y[y]
    end
    subgraph LoRA
    X2[x] --> Wf[W frozen] --> Sum
    X2 --> A["A: d x r"] --> B["B: r x d"] --> Sum
    Sum --> Y2[y]
    end
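
To make the shapes concrete, here is a minimal PyTorch sketch of the forward pass with a LoRA delta (illustrative only; real implementations live in libraries like peft and also scale the delta by alpha / r):

import torch

d, r = 4096, 16                  # hidden size, LoRA rank
W = torch.randn(d, d)            # frozen pretrained weight (no gradients)
A = torch.randn(d, r) * 0.01     # trained, d x r
B = torch.zeros(r, d)            # trained, r x d (zero-init, so the delta starts at 0)

x = torch.randn(1, d)
y = x @ (W + A @ B)              # W_new = W + A × B, computed on the fly
# Only A and B receive gradients: 2 * d * r parameters instead of d * d.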

Typical settings: r = 8 or r = 16. You train ~0.1% of the original parameters. You get ~95%+ of full fine-tuning quality. Storage is KB to MB. You can maintain dozens of LoRA adapters per base model and swap at serve time.
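
In practice you rarely write that by hand; Hugging Face's peft library wires it up. A minimal sketch, where the model name and target modules are illustrative (pick the attention projections of your actual base model):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                 # the low rank from above
    lora_alpha=32,                        # delta is scaled by lora_alpha / r
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which weight matrices get adapters
)
model = get_peft_model(base, config)
model.print_trainable_parameters()        # confirms the tiny trainable fraction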

QLoRA (Dettmers et al., 2023) adds 4-bit quantization of the frozen base model. Result: fine-tune a 70B model on a single A100 (80 GB). This turned fine-tuning into something a developer could do over lunch.
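
The usual recipe pairs transformers with bitsandbytes. A sketch (the model name is illustrative, and exact flags vary by library version):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4, from the QLoRA paper
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb,
    device_map="auto",
)
# Attach a LoraConfig exactly as above: adapters train in bf16 while the
# frozen base stays in 4-bit.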

8.4 Preference optimization: RLHF, DPO, and friends

The second fine-tuning axis is not "imitate good answers" (supervised) but "prefer A over B" (preference-based). This is what turns base models into chat models.

flowchart LR
    subgraph RLHF
    A1[SFT model] --> A2[Sample pairs]
    A2 --> A3[Human ranks]
    A3 --> A4[Reward model]
    A1 --> A5[PPO] 
    A4 --> A5
    A5 --> A6[Aligned model]
    end
    subgraph DPO
    B1[SFT model] --> B2["Preference pairs: (prefer, reject)"]
    B2 --> B3[DPO loss]
    B1 --> B3
    B3 --> B4[Aligned model]
    end

For most backend engineers: start with SFT on domain data, graduate to DPO if you need tone or preference shaping. Skip RLHF unless you're running an ML team.
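
The TRL library reduces the DPO path to a trainer call. A sketch, assuming an SFT model and a preference dataset shaped like the pairs in Section 8.7 (argument names shift between TRL versions, so check the docs for yours):

from trl import DPOConfig, DPOTrainer

# pref_ds rows look like: {"prompt": ..., "chosen": ..., "rejected": ...}
args = DPOConfig(
    output_dir="dpo-out",
    beta=0.1,                    # how hard to stay close to the SFT reference
    per_device_train_batch_size=2,
)
trainer = DPOTrainer(
    model=model,                 # your SFT model
    args=args,
    train_dataset=pref_ds,
    processing_class=tokenizer,  # `tokenizer=` in older TRL versions
)
trainer.train()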

8.5 When fine-tuning is clearly worth it

  1. Strict output formats: the model must emit exact JSON or a fixed schema, every time.
  2. Brand voice and tone that prompting can't hold consistently at scale.
  3. Tool-call routing in agents, trained on traces of correct tool use.
  4. Distillation: teaching a small, cheap model to match a big one on your narrow task (Section 8.9 walks through this).
  5. High-volume, narrow tasks like classification, where per-token API costs dominate.

8.6 When fine-tuning is the wrong answer

  1. Knowledge gaps: if the model doesn't know a fact, retrieve it with RAG (Ch 7); facts baked into weights go stale.
  2. Anything few-shot prompting already fixes; try that first.
  3. Reasoning failures: switch to a reasoning model or chain-of-thought prompting instead.
  4. Tiny or noisy datasets: a fine-tune will faithfully learn your mistakes.

8.7 The data question

Fine-tuning quality is overwhelmingly determined by training data quality. Practical rules:

  1. Quality beats quantity: a thousand clean, consistent examples beat ten thousand noisy ones.
  2. Match the serve-time format exactly: same system prompt, same schema, same chat template.
  3. Deduplicate and fix label errors before training; the model will learn your mistakes verbatim.
  4. Carve out a held-out eval set before you start, and never train on it.

A minimal SFT dataset shape:

{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [...]}

For DPO:

{"prompt": "...", "chosen": "good answer", "rejected": "bad answer"}

8.8 The cloud-native fine-tuning path

You don't need to write PyTorch to fine-tune in 2026. Each hyperscaler ships managed flows.

For serious work, Axolotl and LLaMA-Factory are the open-source go-tos: YAML-configured, multi-method (SFT / DPO / ORPO / QLoRA), backed by the Hugging Face TRL library.

flowchart LR
    A["Training data<br>JSONL"] --> B["Axolotl / LLaMA-Factory<br>config.yaml"]
    B --> C["Fine-tune job<br>single A100 / H100"]
    C --> D["LoRA adapter<br>10s of MB"]
    D --> E["vLLM serve<br>base + adapter"]
    D --> F["Merge + quantize<br>deploy standalone"]
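
If you would rather stay in Python than write YAML, the same TRL machinery these tools sit on is a few lines. A sketch with illustrative model and file names:

from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

ds = load_dataset("json", data_files="train.jsonl", split="train")
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    args=SFTConfig(output_dir="sft-out", num_train_epochs=2),
    train_dataset=ds,
    peft_config=LoraConfig(r=16, lora_alpha=32),
)
trainer.train()                  # saves a LoRA adapter, not a full model copy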

8.9 A realistic end-to-end example

Task: classify customer support tickets into 40 internal categories. Currently using Claude Haiku, costing $2,500/month at 5M tickets.

  1. Export 10k labeled tickets from your DB.
  2. Clean labels; fix obvious mistakes; stratify train/eval.
  3. Generate 5k additional synthetic examples with Opus 4.7, filtered by confidence.
  4. Train a Qwen 2.5 3B with QLoRA on a single A100 for 3 hours.
  5. Eval on a held-out 1k tickets — target F1 ≥ 0.9.
  6. Serve with vLLM on a small GPU instance (sketched below) — cost drops to ~$150/month.
  7. Drift-monitor weekly; re-train quarterly as categories evolve.

Total project: one engineer, one week. ROI: first month.
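
Step 6, sketched with vLLM's multi-LoRA support (the adapter name and path are made up; point them at your trained adapter):

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="Qwen/Qwen2.5-3B-Instruct", enable_lora=True)
adapter = LoRARequest("ticket-classifier", 1, "/models/adapters/tickets-v1")

out = llm.generate(
    ["Classify this ticket: 'I was double-charged on my last invoice.'"],
    SamplingParams(temperature=0.0, max_tokens=8),
    lora_request=adapter,
)
print(out[0].outputs[0].text)    # expected: one of the 40 category labels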

8.10 What you don't need to learn

For most backend engineers, these are not required:

  1. Writing custom CUDA kernels or hand-optimizing attention.
  2. Multi-node distributed training (FSDP, DeepSpeed, tensor parallelism).
  3. Full RLHF pipelines: reward modeling and PPO tuning (Section 8.4's advice stands).
  4. Pretraining from scratch, the top-right corner of the quadrant chart.

These are specialist skills. Know what they are. Hire or partner for them.

Further reading & watching