The 25 papers that built modern AI, with a one-paragraph "why this matters" for each. Read five a month, and in five months you'll have worked through more of the primary literature than most practitioners ever do.
```mermaid
timeline
    title Papers that moved the field
    2012 : AlexNet
    2015 : Deep Residual Learning (ResNet)
    2017 : Attention Is All You Need (Transformer)
    2018 : BERT, GPT-1
    2019 : GPT-2 (staged, initially withheld release)
    2020 : GPT-3, Scaling Laws, RAG
    2021 : LoRA
    2022 : Chain of Thought, InstructGPT, Chinchilla, ReAct, FlashAttention, Constitutional AI
    2023 : GPT-4 TR, LLaMA, QLoRA, DPO, Toolformer, Tree of Thoughts
    2024 : Mixtral, o1
    2025 : Frontier reasoning, MoE refinements
```
1. Attention Is All You Need (Vaswani et al., 2017) — https://arxiv.org/abs/1706.03762 Introduced the Transformer. Replaced RNNs with self-attention, enabled massive parallel training, and became the skeleton of every LLM today. The most important paper on this list.
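For orientation, here is the core operation the paper stacks into multi-head layers: scaled dot-product self-attention, softmax(QKᵀ/√d_k)V. A single-head NumPy sketch with toy shapes, not the full masked, multi-head version:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention: softmax(QK^T / sqrt(d_k)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # every token scores every other token
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)          # softmax over key positions
    return w @ V                              # each output mixes all value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                   # 4 tokens, toy model width 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (4, 8)
```

Because every token attends to every other token in one matrix multiply, the whole sequence trains in parallel — the property RNNs lacked.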
2. Deep Residual Learning for Image Recognition (He et al., 2015) — https://arxiv.org/abs/1512.03385 Skip connections. Made training deep networks stable. Not LLM-specific, but the trick appears inside every Transformer block.
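The trick in one line: add the sublayer's output back to its input, so the identity path carries gradients through any depth. A minimal sketch, using the pre-norm placement common in modern LLMs (an assumption beyond the original paper, which predates layer norm in this role):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def residual(x, sublayer):
    # The ResNet idea: the sublayer only learns a *correction* to x, and the
    # identity path gives gradients a direct route past the sublayer.
    return x + sublayer(layer_norm(x))   # pre-norm variant, as in most modern LLMs

x = np.ones((4, 8))
toy_mlp = lambda h: np.tanh(h @ np.full((8, 8), 0.1))  # stand-in sublayer
print(residual(x, toy_mlp).shape)        # (4, 8)
```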
3. Improving Language Understanding by Generative Pre-Training (Radford et al., 2018) — https://openai.com/research/language-unsupervised GPT-1. The decoder-only recipe: pretrain on text, fine-tune on tasks.
4. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018) — https://arxiv.org/abs/1810.04805 The encoder-only counterpoint to GPT. Powered Google Search and almost every NLP production system until GPT-3 arrived.
5. Language Models are Few-Shot Learners (Brown et al., 2020 — GPT-3) — https://arxiv.org/abs/2005.14165 Showed that sufficiently large LMs learn new tasks from examples in the prompt, with no gradient updates. In-context learning is born.
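The paper's own English-to-French demonstrations make the mechanism concrete; as a Python string it is just a prompt. The model infers the task from the pattern and completes the last line, with no weight updates:

```python
# Few-shot prompt in the style of Brown et al.'s translation figure.
prompt = """\
Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
plush giraffe => girafe peluche
cheese =>"""
```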
6. Scaling Laws for Neural Language Models (Kaplan et al., 2020) — https://arxiv.org/abs/2001.08361 Loss decreases predictably with compute, data, and parameters. Made training a 100B-parameter model a planning exercise rather than a gamble.
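The headline result is a family of power laws. The parameter-count version, with the approximate constants reported in the paper (data- and compute-limited budgets follow the same functional form with their own constants):

```latex
% Kaplan et al.: test loss as a power law in non-embedding parameter count N
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad \alpha_N \approx 0.076,\quad N_c \approx 8.8 \times 10^{13}
```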
7. Training Compute-Optimal Large Language Models (Hoffmann et al., 2022 — Chinchilla) — https://arxiv.org/abs/2203.15556 Corrected Kaplan et al.: most models up to 2022 were over-parameterized and under-trained. For a fixed compute budget, use a smaller model with more tokens.
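The rule of thumb practitioners distilled from the paper is roughly 20 training tokens per parameter, and Chinchilla itself is the worked example:

```latex
% Chinchilla rule of thumb: optimal data scales with model size.
D_{\mathrm{opt}} \approx 20\,N
% Worked example: N = 70\text{B} \Rightarrow D \approx 1.4\text{T tokens},
% which beat the 4x-larger Gopher (280B parameters, only 300B tokens).
```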
8. Emergent Abilities of Large Language Models (Wei et al., 2022) — https://arxiv.org/abs/2206.07682 Argues that certain capabilities appear suddenly at scale. Controversial but influential; later work questioned the "emergence" framing.
9. Training Language Models to Follow Instructions with Human Feedback (Ouyang et al., 2022 — InstructGPT) — https://arxiv.org/abs/2203.02155 The RLHF recipe behind ChatGPT. SFT + reward model + PPO. This is how you turn a next-token predictor into something shippable to a hundred million people.
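Two of the moving parts are easy to state exactly. A NumPy sketch of the stage-2 pairwise reward-model loss and the stage-3 KL-shaped reward (the `kl_coef` value is illustrative, not the paper's tuned coefficient):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reward_model_loss(r_chosen, r_rejected):
    # Stage 2: Bradley-Terry pairwise loss -- the reward model must score the
    # human-preferred response above the rejected one.
    return -np.log(sigmoid(r_chosen - r_rejected)).mean()

def shaped_reward(rm_score, logp_policy, logp_sft, kl_coef=0.02):
    # Stage 3: PPO maximizes the RM score minus a KL penalty that keeps the
    # policy close to the SFT model.
    return rm_score - kl_coef * (logp_policy - logp_sft)

print(reward_model_loss(np.array([1.2]), np.array([0.3])))  # ~0.341
```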
10. Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022) — https://arxiv.org/abs/2212.08073 Anthropic's alternative to raw RLHF: train against a written "constitution" of principles, with AI-generated critiques. The core of how Claude is aligned.
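The supervised half of the recipe is a loop you can write down: draft, critique against each principle, revise, then fine-tune on the revisions. A sketch with a stand-in model so it runs (the paper's second stage, RLAIF, then trains a preference model on AI-generated comparisons):

```python
def critique_and_revise(llm, prompt, principles):
    """Sketch of the supervised CAI stage. `llm` is any text->text callable."""
    response = llm(prompt)
    for principle in principles:
        critique = llm(f"Critique this response against: {principle}\n{response}")
        response = llm(f"Revise the response given the critique.\n{critique}\n{response}")
    return response  # revisions become SFT data

stub = lambda text: text.splitlines()[-1]  # stand-in "model" for the demo
print(critique_and_revise(stub, "draft answer", ["choose the most harmless response"]))
```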
11. Direct Preference Optimization (Rafailov et al., 2023) — https://arxiv.org/abs/2305.18290 RLHF without the RL. A clever derivation lets you skip the reward model and optimize preferences directly. Simpler and often stronger.
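The whole method reduces to one loss. A PyTorch sketch, where each argument is the summed log-probability of the chosen (`w`) or rejected (`l`) response under the trainable policy or the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    chosen = pi_logp_w - ref_logp_w      # implicit reward of the preferred response
    rejected = pi_logp_l - ref_logp_l    # implicit reward of the rejected response
    return -F.logsigmoid(beta * (chosen - rejected)).mean()

t = torch.tensor
print(dpo_loss(t([-5.0]), t([-6.0]), t([-5.5]), t([-5.5])))
# ~0.644, below the log 2 baseline: the policy already prefers the chosen response
```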
12. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022) — https://arxiv.org/abs/2201.11903 Show the model a few worked examples that spell out their intermediate reasoning and it produces reasoning of its own. (The zero-shot cousin, "Let's think step by step," is Kojima et al., 2022.) A trivially small prompt change that unlocked real reasoning on math and logic.
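The paper's canonical exemplar shows the mechanism: the demonstration includes the reasoning, so the model imitates the reasoning, not just the answer format (the final question is a placeholder):

```python
prompt = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can
has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis
balls. 5 + 6 = 11. The answer is 11.
Q: <your question here>
A:"""
```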
13. Self-Consistency Improves Chain of Thought Reasoning (Wang et al., 2022) — https://arxiv.org/abs/2203.11171 Sample many reasoning paths, take the majority vote. Simple, effective, expensive.
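The whole recipe fits in a few lines: sample many chains of thought at temperature > 0, keep only each chain's final answer, and take the majority. A sketch with a stand-in sampler:

```python
import random
from collections import Counter

def self_consistency(sample_answer, prompt, n=20):
    """`sample_answer` is any stochastic prompt -> answer callable."""
    answers = [sample_answer(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]  # majority vote

stub = lambda p: random.choice(["11", "11", "11", "12"])  # stand-in model
print(self_consistency(stub, "Roger has 5 tennis balls..."))  # usually "11"
```

The cost is n full generations per question, which is why "expensive" is fair.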
14. Tree of Thoughts: Deliberate Problem Solving with Large Language Models (Yao et al., 2023) — https://arxiv.org/abs/2305.10601 Extends CoT into a search tree with heuristics. The conceptual ancestor of "reasoning models" like o1.
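A toy breadth-first version of the idea (the paper also uses DFS and backs both the proposer and the value function with LM prompts; the stand-ins below are illustrative):

```python
def tree_of_thoughts(propose, value, root, breadth=3, depth=2, keep=2):
    """Expand each partial solution into candidate next thoughts, score them,
    keep the best few, repeat -- search over thoughts rather than one chain."""
    frontier = [root]
    for _ in range(depth):
        candidates = [t for node in frontier for t in propose(node, breadth)]
        frontier = sorted(candidates, key=value, reverse=True)[:keep]
    return max(frontier, key=value)

propose = lambda node, k: [f"{node}.{i}" for i in range(k)]  # toy expander
value = lambda thought: len(thought)                         # toy value function
print(tree_of_thoughts(propose, value, "start"))             # "start.0.0" (toy scores tie)
```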
15. OpenAI o1 — Learning to Reason with LLMs (2024) — https://openai.com/index/learning-to-reason-with-llms/ The paradigm shift: train models to think before answering, spending serious test-time compute on hidden chain-of-thought. Produced a step-change on math, science, and code.
16. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020) — https://arxiv.org/abs/2005.11401 The original RAG paper. Still remarkably aligned with how production systems are built.
17. Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023) — https://arxiv.org/abs/2307.03172 Empirical result: models attend best to the start and end of long contexts and less to the middle. Guides document ordering in RAG prompts.
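One practical response — a common heuristic motivated by the finding, not something the paper itself prescribes — is to place retrieved documents so the strongest sit at both edges of the context:

```python
def order_for_long_context(docs_by_relevance):
    """Input is most-relevant-first; alternate placements so the best documents
    land at the start and end of the prompt, the weakest in the middle."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

print(order_for_long_context(["d1", "d2", "d3", "d4", "d5"]))
# ['d1', 'd3', 'd5', 'd4', 'd2']  -> top two docs at the two ends
```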
18. Toolformer: Language Models Can Teach Themselves to Use Tools (Schick et al., 2023) — https://arxiv.org/abs/2302.04761 An early, elegant demonstration of self-supervised tool-use training. The ancestor of today's function calling.
19. ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2022) — https://arxiv.org/abs/2210.03629 Interleave reasoning and action. The pattern underlying every production agent.
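The loop itself is small. A minimal sketch with a scripted stand-in model and a stand-in tool (the "Colorado orogeny" question is the paper's running example; `parse_action` is a helper invented here):

```python
import re

def parse_action(step):
    # Expects a line like: "Action: search[Colorado orogeny]"
    m = re.search(r"Action:\s*(\w+)\[(.*?)\]", step)
    return m.group(1), m.group(2)

def react_agent(llm, tools, question, max_steps=5):
    """Thought -> Action -> Observation, repeated until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[1].strip()
        name, arg = parse_action(step)
        transcript += f"Observation: {tools[name](arg)}\n"  # feed tool result back
    return None

scripted = iter([
    "Thought: I should look this up.\nAction: search[Colorado orogeny]",
    "Thought: I have enough.\nFinal Answer: a mountain-building episode in Colorado",
])
llm = lambda _transcript: next(scripted)                  # stand-in model
tools = {"search": lambda q: f"(top snippet for {q!r})"}  # stand-in tool
print(react_agent(llm, tools, "What is the Colorado orogeny?"))
```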
20. LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021) — https://arxiv.org/abs/2106.09685 Fine-tune with a tiny delta instead of all weights. Made custom models cheap and common.
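The mechanism in a dozen lines: freeze the pretrained weight W and train only a low-rank delta (α/r)·BA on top of it. A PyTorch sketch:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + (alpha/r) * B A x, with W frozen and only A, B trainable.
    For a 4096x4096 layer at r=8 that is ~65K trainable params vs ~16.8M."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                               # freeze pretrained W
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # delta starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(8, 8))
print(layer(torch.randn(2, 8)).shape)  # torch.Size([2, 8])
```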
21. QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023) — https://arxiv.org/abs/2305.14314 LoRA on a 4-bit quantized base. Fine-tune a 70B model on a single consumer GPU.
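In practice this usually means the Hugging Face stack (transformers + peft + bitsandbytes). A sketch of one common setup — exact argument names can drift across library versions, and the model ID is just an example:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",        # NormalFloat4, introduced by the paper
    bnb_4bit_use_double_quant=True,   # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
```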
22. FlashAttention (Dao et al., 2022) — https://arxiv.org/abs/2205.14135 A fused GPU kernel for attention that is faster and more memory-efficient. Nearly every LLM deployment now uses some variant.
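You rarely call it directly: PyTorch 2.x routes `F.scaled_dot_product_attention` to a FlashAttention-style fused kernel when the device, dtype, and shapes allow. A sketch (assumes a CUDA device):

```python
import torch
import torch.nn.functional as F

# (batch, heads, sequence, head_dim); fp16 on GPU makes the fused path eligible
q = k = v = torch.randn(1, 8, 1024, 64, dtype=torch.float16, device="cuda")
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # (1, 8, 1024, 64)
```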
23. Mixtral of Experts (Jiang et al., 2024) — https://arxiv.org/abs/2401.04088 A high-quality open sparse Mixture-of-Experts model. Popularized MoE beyond a few large labs.
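The routing idea in miniature: a learned router sends each token to 2 of 8 experts, so only a fraction of the total parameters are active per token. A toy dense-loop version (real implementations batch tokens per expert instead of looping):

```python
import torch

def top2_moe(x, router_w, experts):
    logits = x @ router_w                     # (tokens, n_experts) router scores
    weights, idx = logits.topk(2, dim=-1)     # pick 2 of 8 experts per token
    weights = torch.softmax(weights, dim=-1)  # renormalize over the chosen 2
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):
        for slot in range(2):                 # only the 2 chosen experts run
            out[t] += weights[t, slot] * experts[idx[t, slot]](x[t])
    return out

experts = [torch.nn.Linear(16, 16) for _ in range(8)]
x, router_w = torch.randn(4, 16), torch.randn(16, 8)
print(top2_moe(x, router_w, experts).shape)   # torch.Size([4, 16])
```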
24. GPT-4 Technical Report (OpenAI, 2023) — https://arxiv.org/abs/2303.08774 The benchmarks, the safety section, and not much else. Worth reading for the shape of a flagship release.
25. LLaMA 2: Open Foundation and Fine-Tuned Chat Models (Touvron et al., 2023) — https://arxiv.org/abs/2307.09288 The paper that made capable open models the norm and launched thousands of fine-tunes.
A reading order that works:
```mermaid
flowchart LR
    P1[Attention Is All You Need] --> P2[BERT]
    P1 --> P3[GPT-3]
    P3 --> P4[Scaling Laws]
    P4 --> P5[Chinchilla]
    P3 --> P6[InstructGPT]
    P6 --> P7[Constitutional AI]
    P6 --> P8[DPO]
    P3 --> P9[Chain of Thought]
    P9 --> P10[ReAct]
    P10 --> P11[Toolformer]
    P3 --> P12[RAG]
    P12 --> P13[Lost in the Middle]
    P6 --> P14[LoRA]
    P14 --> P15[QLoRA]
    P9 --> P16[Tree of Thoughts]
    P16 --> P17[o1]
```
For each paper, a 30-minute pass is usually enough: read the abstract, look at every figure and table, skim the method, read the conclusion. You are looking for the one idea that makes the paper famous, not a deep implementation guide.