When to bend a model to your will — and how to do it cheaply.
Fine-tuning used to mean "rent a cluster, pay $50k, hope for the best." In 2026, it means "a few hours on a single GPU, cost in the low hundreds of dollars, reproducible." The methods that made this possible — LoRA, QLoRA, DPO, and their cousins — deserve a chapter.
In plain English. Fine-tuning is teaching the model new habits. RAG is handing the model new information. They solve different problems and they compose beautifully.
```mermaid
quadrantChart
    title Adaptation methods - cost vs control
    x-axis Low control --> High control
    y-axis Cheap --> Expensive
    quadrant-1 Heavy customization
    quadrant-2 Wasteful
    quadrant-3 Quick wins
    quadrant-4 Surgical
    Prompting: [0.15, 0.05]
    Few-shot: [0.30, 0.10]
    RAG: [0.40, 0.20]
    Tool use: [0.55, 0.30]
    LoRA: [0.70, 0.40]
    QLoRA: [0.65, 0.30]
    DPO: [0.75, 0.50]
    Full fine-tune: [0.90, 0.95]
    Pretrain from scratch: [0.95, 1.00]
```
But first, the most important question: should you fine-tune at all?
```mermaid
flowchart TD
    A[Task underperforming] --> B{Knowledge gap?}
    B -- yes --> C["Use RAG (Ch 7)"]
    B -- no --> D{Format / style / tone?}
    D -- yes --> E{Few-shot fixes it?}
    E -- yes --> F[Few-shot prompting]
    E -- no --> G[Fine-tune]
    D -- no --> H{Reasoning?}
    H -- yes --> I["Reasoning model or CoT"]
    H -- no --> J{Too slow / expensive?}
    J -- yes --> K["Distill big to small, fine-tune the small one"]
    J -- no --> L{Tool-use errors?}
    L -- yes --> M[Fine-tune on tool traces]
    L -- no --> G
```
The rule that matters: fine-tuning is for behavior, RAG is for knowledge. If your model doesn't know a fact, don't fine-tune — retrieve. If your model won't output JSON the way you want, or won't match your brand voice, or won't route tool calls correctly, fine-tune.
A frontier model might have hundreds of billions of parameters. Full fine-tuning updates all of them. Problems:

- Memory: gradients and optimizer states blow the footprint up to several times the size of the weights themselves.
- Cost: you need a multi-GPU cluster for hours or days per run.
- Storage: every fine-tuned variant is a complete copy of the model.
- Iteration speed: when each experiment is expensive, you run fewer experiments.

Parameter-Efficient Fine-Tuning (PEFT) methods fix all four.
Low-Rank Adaptation, or LoRA (Hu et al., 2021), is embarrassingly simple and unreasonably effective.
Instead of updating the full weight matrix W, LoRA freezes W and trains two small matrices A and B whose product A × B is added as a delta:
```
W_new = W + A × B
        ↑     ↑
     frozen  trained
      (big)  (tiny, low rank)
```
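The delta is a few lines of NumPy. A minimal sketch with toy dimensions (a real model applies this per attention projection matrix, and scales the delta by a factor alpha/r):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8                          # model dim and LoRA rank (toy sizes)

W = rng.normal(size=(d, d))           # pretrained weight: frozen, never updated
A = rng.normal(size=(d, r)) * 0.01    # trained, d x r
B = np.zeros((r, d))                  # trained, r x d; zero-init so the delta starts at 0

x = rng.normal(size=(d,))

# Forward pass: frozen path plus low-rank delta, i.e. x @ (W + A @ B),
# computed without ever materializing the d x d delta matrix
y = x @ W + (x @ A) @ B

# At initialization B is zero, so the adapter is a no-op
assert np.allclose(y, x @ W)
```

The zero-initialized B matters in practice: training starts from exactly the pretrained model's behavior and drifts from there.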
```mermaid
flowchart LR
    subgraph Standard
        X --> W["W: big dense matrix, update all entries"]
        W --> Y
    end
    subgraph LoRA
        X2[x] --> Wf[W frozen] --> Sum
        X2 --> A["A: d × r"] --> B["B: r × d"] --> Sum
        Sum --> Y2
    end
```
Typical settings: r = 8 or r = 16. You train ~0.1% of the original parameters. You get ~95%+ of full fine-tuning quality. Storage is KB to MB. You can maintain dozens of LoRA adapters per base model and swap at serve time.
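The "~0.1%" figure is easy to sanity-check. Back-of-envelope, assuming a 7B model with 32 layers and hidden size 4096, adapting the four attention projections at r = 8 (shapes are assumptions, roughly Llama-7B-like):

```python
layers, d, r, n_proj = 32, 4096, 8, 4          # assumed Llama-7B-like shapes
lora_params = layers * n_proj * 2 * d * r      # A (d x r) + B (r x d) per matrix
total_params = 7e9

print(f"{lora_params:,} trainable "
      f"({lora_params / total_params:.2%} of the model)")
# ~8.4M trainable parameters, about 0.12% of 7B
```

Bump r to 16 or add the MLP projections and you land closer to 1%; either way, the adapter checkpoint stays in the tens of megabytes.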
QLoRA (Dettmers et al., 2023) adds 4-bit quantization of the frozen base model. Result: fine-tune a 70B model on a single A100 (80 GB). This turned fine-tuning into something a developer could do over lunch.
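Why a single 80 GB card is enough is also just arithmetic. A rough sketch, assuming Llama-70B-like shapes (80 layers, hidden size 8192) and ignoring activations and quantization-constant overhead:

```python
params = 70e9
weights_4bit_gb = params * 0.5 / 1e9          # 4 bits = 0.5 bytes per parameter

# Only the LoRA adapter is trained, so only it needs gradients and optimizer state
adapter_params = 80 * 4 * 2 * 8192 * 16       # 4 attn projections, r=16 (assumed)
adapter_gb = adapter_params * 2 / 1e9         # bf16 adapter weights
optimizer_gb = adapter_params * 8 / 1e9       # two fp32 Adam moments

print(weights_4bit_gb, adapter_gb, optimizer_gb)
# ~35 GB of frozen weights; adapter and optimizer state are under 1 GB combined
```

Full fine-tuning would instead need optimizer state for all 70B parameters, which is why it doesn't fit on one GPU.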
The second fine-tuning axis is not "imitate good answers" (supervised) but "prefer A over B" (preference-based). This is what turns base models into chat models.
```mermaid
flowchart LR
    subgraph RLHF
        A1[SFT model] --> A2[Sample pairs]
        A2 --> A3[Human ranks]
        A3 --> A4[Reward model]
        A1 --> A5[PPO]
        A4 --> A5
        A5 --> A6[Aligned model]
    end
    subgraph DPO
        B1[SFT model] --> B2["Preference pairs: (prefer, reject)"]
        B2 --> B3[DPO loss]
        B1 --> B3
        B3 --> B4[Aligned model]
    end
```
For most backend engineers: start with SFT on domain data, graduate to DPO if you need tone or preference shaping. Skip RLHF unless you're running an ML team.
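What makes DPO simple is that the whole reward-model-plus-PPO loop collapses into one loss on log-probabilities. A sketch for a single preference pair (the log-probs here are hand-picked stand-ins for real sequence log-probabilities; beta = 0.1 is a common default):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * margin), where margin measures how much more the
    policy prefers 'chosen' over 'rejected' than the frozen reference does."""
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1 / (1 + math.exp(-beta * margin)))

# Policy already leans the right way -> small loss; leans the wrong way -> large loss
low = dpo_loss(-5.0, -9.0, -7.0, -7.0)    # policy margin = +4
high = dpo_loss(-9.0, -5.0, -7.0, -7.0)   # policy margin = -4
assert low < high
```

The reference model acts as an anchor: the loss rewards preferring the chosen answer more than the reference does, not just preferring it in absolute terms.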
Fine-tuning quality is overwhelmingly determined by training data quality. Practical rules: a few thousand clean, deduplicated examples beat a hundred thousand noisy ones; match the distribution you'll see in production; and carve out a held-out eval set before you train anything.
A minimal SFT dataset shape (JSONL, one conversation per line):
```jsonl
{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [...]}
```
For DPO:
```jsonl
{"prompt": "...", "chosen": "good answer", "rejected": "bad answer"}
```
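A cheap sanity check before launching a training job catches most format bugs. A sketch using the two shapes above (the ticket-classification strings are invented placeholders):

```python
import json

sft_example = {
    "messages": [
        {"role": "system", "content": "You are a support-ticket classifier."},
        {"role": "user", "content": "My invoice shows the wrong amount."},
        {"role": "assistant", "content": "billing/invoice-dispute"},
    ]
}
dpo_example = {
    "prompt": "Classify: 'App crashes on login.'",
    "chosen": "bugs/crash-login",
    "rejected": "It sounds like you might be having trouble...",
}

def check_sft(rec):
    roles = [m["role"] for m in rec["messages"]]
    # The last turn is the completion the model learns to produce
    assert roles[-1] == "assistant", "example must end with an assistant turn"

def check_dpo(rec):
    assert {"prompt", "chosen", "rejected"} <= rec.keys()
    assert rec["chosen"] != rec["rejected"], "identical pair teaches nothing"

check_sft(sft_example)
check_dpo(dpo_example)
line = json.dumps(sft_example)   # JSONL: exactly one JSON object per line
assert "\n" not in line
```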
You don't need to write PyTorch to fine-tune in 2026. Each hyperscaler ships managed flows.
For serious work, Axolotl and LLaMA-Factory are the open-source go-tos: YAML-configured, multi-method (SFT / DPO / ORPO / QLoRA), backed by the Hugging Face TRL library.
```mermaid
flowchart LR
    A["Training data (JSONL)"] --> B["Axolotl / LLaMA-Factory config.yaml"]
    B --> C["Fine-tune job: single A100 / H100"]
    C --> D["LoRA adapter: 10s of MB"]
    D --> E["vLLM serve: base + adapter"]
    D --> F["Merge + quantize, deploy standalone"]
```
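The config.yaml in the middle of that pipeline is short. A minimal QLoRA config in Axolotl's style might look like this (the model name and dataset path are placeholders; field names follow Axolotl's documented schema, but verify against the current docs before running):

```yaml
# Illustrative QLoRA config -- not a tested recipe
base_model: meta-llama/Llama-3.1-8B-Instruct   # placeholder base model
load_in_4bit: true                              # QLoRA: 4-bit frozen base
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules: [q_proj, k_proj, v_proj, o_proj]
datasets:
  - path: tickets.jsonl                         # placeholder dataset
    type: chat_template
sequence_len: 2048
micro_batch_size: 2
gradient_accumulation_steps: 8
num_epochs: 3
learning_rate: 0.0002
output_dir: ./outputs/ticket-classifier
```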
Task: classify customer support tickets into 40 internal categories. Currently using Claude Haiku, costing $2,500/month at 5M tickets.
Total project: one engineer, one week. ROI: first month.
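The payback math is worth making explicit. Every number below except the $2,500/month bill is invented for illustration; substitute your own:

```python
# Hypothetical payback calculation for the ticket-classification scenario
api_cost_per_month = 2_500.0   # current hosted-model spend (from the scenario)
serving_per_month = 400.0      # assumed: self-hosting the fine-tuned small model
project_cost = 2_000.0         # assumed: cloud GPU hours + eval runs
                               # (engineer time treated as internal, not counted)

monthly_savings = api_cost_per_month - serving_per_month
payback_months = project_cost / monthly_savings
print(f"saves ${monthly_savings:,.0f}/month, "
      f"pays back in {payback_months:.1f} months")
```

If you do charge the engineer-week to the project, payback stretches to a few months rather than one; the conclusion depends heavily on what you count.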
For most backend engineers, these are not required: pretraining from scratch, building RLHF/PPO reward-model pipelines, and multi-node distributed training. These are specialist skills. Know what they are. Hire or partner for them.