Appendix B · Canonical Papers

The 25 papers that built modern AI, with a one-paragraph "why this matters" for each. If you read five of these a month for five months, you'll be in the top 1% of practitioners.


```mermaid
timeline
    title Papers that moved the field
    2012 : AlexNet
    2015 : Deep Residual Learning (ResNet)
    2017 : Attention Is All You Need (Transformer)
    2018 : BERT, GPT-1
    2019 : GPT-2 (staged release)
    2020 : GPT-3, Scaling Laws, RAG
    2021 : LoRA
    2022 : Chain of Thought, InstructGPT, Chinchilla, ReAct, FlashAttention
    2023 : GPT-4 TR, LLaMA, QLoRA, DPO, Toolformer, Tree of Thoughts
    2024 : Mixtral, o1, Constitutional AI scaling
    2025 : Frontier reasoning, MoE refinements
```

Foundations

1. Attention Is All You Need (Vaswani et al., 2017) · https://arxiv.org/abs/1706.03762
Introduced the Transformer. Replaced RNNs with self-attention, enabled massive parallel training, and became the skeleton of every LLM today. The most important paper in the saga.
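
The core operation replaces recurrence with pairwise token interactions. A minimal single-head sketch in NumPy (the projection matrices Wq, Wk, Wv are the learned parameters; real models add multiple heads, masking, and batching):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (seq, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (seq, seq) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # each position mixes all others
```

Every position attends to every other in one matrix multiply, which is why training parallelizes so well compared to an RNN.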

2. Deep Residual Learning for Image Recognition (He et al., 2015) · https://arxiv.org/abs/1512.03385
Skip connections. Made training deep networks stable. Not LLM-specific, but the trick appears inside every Transformer block.
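
The trick itself is one line: instead of learning y = F(x) directly, learn the residual and output x + F(x). A framework-free sketch (in a real network F is a couple of conv or linear layers):

```python
def residual_block(x, f):
    """y = x + F(x). The skip path lets the block default to the identity,
    so gradients flow through unchanged and very deep stacks stay trainable."""
    return x + f(x)
```

If F learns nothing useful, the block passes its input through untouched, which is exactly what makes depth cheap to add.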

3. Improving Language Understanding by Generative Pre-Training (Radford et al., 2018) · https://openai.com/research/language-unsupervised
GPT-1. The decoder-only recipe: pretrain on text, fine-tune on tasks.

4. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018) · https://arxiv.org/abs/1810.04805
The encoder-only counterpoint to GPT. Powered Google Search and almost every NLP production system until GPT-3 arrived.


Scaling

5. Language Models are Few-Shot Learners (Brown et al., 2020 — GPT-3) · https://arxiv.org/abs/2005.14165
Showed that sufficiently large LMs learn new tasks from examples in the prompt, with no gradient updates. In-context learning is born.

6. Scaling Laws for Neural Language Models (Kaplan et al., 2020) · https://arxiv.org/abs/2001.08361
Loss decreases predictably with compute, data, and parameters. Made training a 100B-parameter model a planning exercise rather than a gamble.

7. Training Compute-Optimal Large Language Models (Hoffmann et al., 2022 — Chinchilla) · https://arxiv.org/abs/2203.15556
Corrected Kaplan et al.: most models up to 2022 were over-parameterized and under-trained. For a fixed compute budget, use a smaller model with more tokens.
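
Two rules of thumb make the planning exercise concrete: training compute is roughly C ≈ 6·N·D FLOPs for N parameters and D tokens, and the compute-optimal ratio is roughly D ≈ 20·N. A sketch using those popular approximations (they are community simplifications of the paper's fitted laws, not its exact constants):

```python
def chinchilla_plan(compute_flops):
    """Split a FLOP budget into a parameter count and a token count.

    Assumes C ~= 6 * N * D and the compute-optimal rule D ~= 20 * N.
    Both constants are rough rules of thumb, not the paper's exact fits.
    """
    # C = 6 * N * (20 * N)  =>  N = sqrt(C / 120)
    n_params = (compute_flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Chinchilla's own budget: 6 * 70e9 params * 1.4e12 tokens ≈ 5.9e23 FLOPs
n, d = chinchilla_plan(5.88e23)    # n ≈ 70e9, d ≈ 1.4e12
```

Plugging in Chinchilla's own compute budget recovers roughly its 70B-parameter, 1.4T-token configuration, which is the sanity check the approximation has to pass.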

8. Emergent Abilities of Large Language Models (Wei et al., 2022) · https://arxiv.org/abs/2206.07682
Argues that certain capabilities appear suddenly at scale. Controversial but influential; later work questioned the "emergence" framing.


Alignment

9. Training Language Models to Follow Instructions with Human Feedback (Ouyang et al., 2022 — InstructGPT) · https://arxiv.org/abs/2203.02155
The RLHF recipe behind ChatGPT. SFT + reward model + PPO. This is how you turn a next-token predictor into something shippable to a hundred million people.

10. Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022) · https://arxiv.org/abs/2212.08073
Anthropic's alternative to raw RLHF: train against a written "constitution" of principles, with AI-generated critiques. The core of how Claude is aligned.

11. Direct Preference Optimization (Rafailov et al., 2023) · https://arxiv.org/abs/2305.18290
RLHF without the RL. A clever derivation lets you skip the reward model and optimize preferences directly. Simpler and often stronger.
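
The resulting objective is compact enough to state directly. A per-example sketch using plain floats (variable names are mine; real implementations work on batched sequence log-probabilities and use a numerically stable log-sigmoid):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss from sequence log-probabilities.

    loss = -log sigmoid(beta * [(pi_w - ref_w) - (pi_l - ref_l)])
    The margin rewards the policy for ranking the chosen response above
    the rejected one, relative to the frozen reference model.
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

At a zero margin the loss is log 2; it falls as the policy widens the preference gap, with beta controlling how hard the policy is pushed away from the reference.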


Reasoning

12. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022) · https://arxiv.org/abs/2201.11903
Show the model a few worked examples that reason step by step, and it starts reasoning too. A trivially small prompt change that unlocked real gains on math and logic. (The famous zero-shot "Let's think step by step" came shortly after, from Kojima et al.)

13. Self-Consistency Improves Chain of Thought Reasoning (Wang et al., 2022) · https://arxiv.org/abs/2203.11171
Sample many reasoning paths, take the majority vote. Simple, effective, expensive.
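
The method fits in a few lines. A sketch of the majority vote, with `sample_answer` standing in for one temperature-sampled chain-of-thought call (the names are mine, not the paper's):

```python
from collections import Counter

def self_consistency(sample_answer, question, n=20):
    """Sample n independent reasoning paths, return the majority answer.

    sample_answer(question) stands in for one stochastic chain-of-thought
    pass that returns only the final answer string.
    """
    votes = Counter(sample_answer(question) for _ in range(n))
    answer, count = votes.most_common(1)[0]
    return answer, count / n   # majority answer plus rough agreement rate
```

The agreement rate doubles as a cheap confidence signal; the cost is n model calls per question, which is where "expensive" comes from.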

14. Tree of Thoughts: Deliberate Problem Solving with Large Language Models (Yao et al., 2023) · https://arxiv.org/abs/2305.10601
Extends CoT into a search tree with heuristics. The conceptual ancestor of "reasoning models" like o1.

15. OpenAI o1 — Learning to Reason with LLMs (2024) · https://openai.com/index/learning-to-reason-with-llms/
The paradigm shift: train models to think before answering, spending serious test-time compute on hidden chain-of-thought. Produced a step-change on math, science, and code.


Retrieval and tools

16. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020) · https://arxiv.org/abs/2005.11401
The original RAG paper. Still remarkably aligned with how production systems are built.

17. Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023) · https://arxiv.org/abs/2307.03172
Empirical result: models attend best to the start and end of long contexts and less to the middle. Guides document ordering in RAG prompts.
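
A common mitigation in RAG pipelines: given retrieved documents sorted best-first, interleave them so the strongest evidence lands at the edges of the prompt and the weakest in the middle. A sketch (the function and its details are mine, not from the paper):

```python
from collections import deque

def order_for_long_context(docs_best_first):
    """Reorder docs so the most relevant sit at the edges of the context.

    Walk the list worst-first, alternating which end we place each doc on,
    so the best documents are placed last and end up at the two edges,
    where models attend most reliably.
    """
    ordered = deque()
    for i, doc in enumerate(reversed(docs_best_first)):
        if i % 2 == 0:
            ordered.append(doc)
        else:
            ordered.appendleft(doc)
    return list(ordered)
```

Passing five docs ranked 1 (best) through 5 yields the order 2, 4, 5, 3, 1: the two best at the edges, the worst dead center.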

18. Toolformer: Language Models Can Teach Themselves to Use Tools (Schick et al., 2023) · https://arxiv.org/abs/2302.04761
An early, elegant demonstration of self-supervised tool-use training. The ancestor of today's function calling.

19. ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2022) · https://arxiv.org/abs/2210.03629
Interleave reasoning and action. The pattern underlying every production agent.
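
The pattern is a short loop: the model alternates free-form thoughts and tool calls, each observation is appended to the transcript, and the loop ends when the model emits an answer. A stubbed sketch (the step format and names are mine, not from the paper):

```python
def react_loop(llm_step, tools, question, max_steps=8):
    """Minimal ReAct-style loop with stubbed components.

    llm_step(transcript) returns one of:
      ("think", text)             - a reasoning step
      ("act", tool_name, args)    - a tool call
      ("finish", answer)          - the final answer
    tools maps tool names to plain Python callables.
    """
    transcript = [("question", question)]
    for _ in range(max_steps):
        step = llm_step(transcript)
        if step[0] == "finish":
            return step[1]
        transcript.append(step)
        if step[0] == "act":
            _, name, args = step
            observation = tools[name](*args)      # run the tool
            transcript.append(("observe", observation))
    raise RuntimeError("agent did not finish within max_steps")
```

Production agents dress this up with structured function calling, retries, and guardrails, but the think/act/observe cycle is the same.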


Efficiency

20. LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021) · https://arxiv.org/abs/2106.09685
Fine-tune with a tiny delta instead of all weights. Made custom models cheap and common.
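
The delta is concrete: instead of updating a d×d weight W, learn a rank-r update BA, with B initialized to zero so training starts from the pretrained behavior. A toy-scale forward pass with the paper's alpha/r scaling (d = 512 here to keep it small; real hidden sizes are thousands):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8                              # hidden size, LoRA rank (toy scale)

W = rng.standard_normal((d, d))            # frozen pretrained weight
A = rng.standard_normal((r, d)) / d**0.5   # trainable down-projection
B = np.zeros((d, r))                       # trainable up-projection, starts at zero

def lora_forward(x, alpha=16):
    # Frozen path plus the low-rank update: W x + (alpha / r) * B (A x)
    return W @ x + (alpha / r) * (B @ (A @ x))

full_params = d * d                # what full fine-tuning would train
lora_params = d * r + r * d        # what LoRA trains: 2 * r / d of the above
```

Because B starts at zero, the adapted model is exactly the pretrained model at step one; and at r = 8, d = 4096 the trainable fraction is about 0.4% of the matrix, which is why fine-tunes became cheap.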

21. QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023) · https://arxiv.org/abs/2305.14314
LoRA on a 4-bit quantized base. Fine-tune a 65B model on a single 48 GB GPU.

22. FlashAttention (Dao et al., 2022) · https://arxiv.org/abs/2205.14135
A fused GPU kernel for attention that is faster and more memory-efficient. Nearly every LLM deployment now uses some variant.

23. Mixtral of Experts (Jiang et al., 2024) · https://arxiv.org/abs/2401.04088
A high-quality open sparse Mixture-of-Experts model. Popularized MoE beyond a few large labs.


Multimodality and open models

24. GPT-4 Technical Report (OpenAI, 2023) · https://arxiv.org/abs/2303.08774
The benchmarks, the safety section, and not much else. Worth reading for the shape of a flagship release.

25. Llama 2: Open Foundation and Fine-Tuned Chat Models (Touvron et al., 2023) · https://arxiv.org/abs/2307.09288
The paper that made capable open models the norm and launched thousands of fine-tunes.


How to read these

A reading order that works:

```mermaid
flowchart LR
    P1[Attention Is All You Need] --> P2[BERT]
    P1 --> P3[GPT-3]
    P3 --> P4[Scaling Laws]
    P4 --> P5[Chinchilla]
    P3 --> P6[InstructGPT]
    P6 --> P7[Constitutional AI]
    P6 --> P8[DPO]
    P3 --> P9[Chain of Thought]
    P9 --> P10[ReAct]
    P10 --> P11[Toolformer]
    P3 --> P12[RAG]
    P12 --> P13[Lost in the Middle]
    P6 --> P14[LoRA]
    P14 --> P15[QLoRA]
    P9 --> P16[Tree of Thoughts]
    P16 --> P17[o1]
```

For each paper, a 30-minute pass is usually enough: read the abstract, look at every figure and table, skim the method, read the conclusion. You are looking for the one idea that makes the paper famous, not a deep implementation guide.


Back to index.