Ten years of quiet work that made November 2022 possible.
"It is not the strongest of the species that survives, nor the most intelligent. It is the one most adaptable to change." — often attributed to Darwin
To understand why ChatGPT felt like lightning, you have to see the storm clouds. The breakthrough wasn't a miracle. It was the last step in a ladder that had been built, one rung at a time, by thousands of researchers since roughly 2012.
This chapter walks that ladder. You don't need to memorize it, but knowing the shape of the climb makes every later idea — attention, embeddings, scaling, alignment — feel inevitable rather than magical.
mindmap
  root((Pre-ChatGPT foundations))
    Hardware
      GPUs and CUDA
      A100 and H100
      TPU pods
    Data
      ImageNet
      Common Crawl
      The Pile
      RefinedWeb
    Architectures
      CNN
      RNN and LSTM
      Transformer
      Encoder vs Decoder
    Training methods
      Backprop
      Adam
      Mixed precision
      RLHF
    Capabilities
      Vision
      Translation
      Summarization
      Few-shot learning
For most of the 2000s, "neural networks" was a dirty phrase in serious ML circles. Support vector machines and random forests won the benchmarks (and, once it existed, Kaggle). Neural nets were overhyped and undercooked.
Three things slowly changed:

- Compute. GPUs, built for graphics, turned out to be ideal for the dense matrix math of neural nets, and CUDA (2007) made them programmable for general workloads.
- Data. ImageNet (2009) provided over a million labeled images, enough to train large models without immediately overfitting.
- Technique. Practical advances such as ReLU activations, better weight initialization, and dropout made deep networks trainable at all.
Then in 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered AlexNet in the ImageNet competition and beat the field by a historic margin. Deep learning wasn't promising anymore. It worked.
flowchart LR
A[Hand-engineered features<br/>SIFT, HOG] --> B[Shallow classifiers<br/>SVM, Random Forest]
B --> C[AlexNet 2012<br/>deep convnet on GPU]
C --> D[Every vision task<br/>moves to deep learning]
Vision fell first; language was harder, because language has variable length and long-range dependencies.
The state of the art was RNNs (Recurrent Neural Networks), specifically LSTMs and GRUs. They processed tokens one at a time, passing hidden state forward. They worked, but:

- Training was inherently sequential: token t could not be processed until token t-1 was done, so GPUs sat mostly idle.
- Long-range dependencies were hard: information had to survive many hidden-state updates, and gradients faded over distance even with LSTM gating.
- In encoder-decoder setups, the entire input had to be squeezed into a single fixed-size vector.
In 2014, Bahdanau et al. introduced attention as an add-on to seq2seq translation. Instead of squeezing the whole source sentence into a single hidden vector, the decoder could look back at all the source positions and weight them. Translation quality jumped.
Attention was the key insight. But for three years, it lived as a supporting actor.
flowchart LR
subgraph RNN Era
A[Token 1] --> B[Token 2] --> C[Token 3] --> D[Token 4]
end
subgraph RNN with attention
E[Token 1] --> F[Token 2] --> G[Token 3] --> H[Token 4]
H -.looks back.-> E
H -.looks back.-> F
H -.looks back.-> G
end
In June 2017, eight researchers at Google Brain (Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin) published the single most important paper of the decade:
Attention Is All You Need.
Their claim was radical: you don't need RNNs at all. Self-attention alone — each token attending to every other token — can model sequences, and because it is one big parallel matrix operation rather than a step-by-step loop, it trains far faster on modern hardware.
The architecture they introduced, the Transformer, is the skeleton of every LLM today.
flowchart TB
I[Input tokens] --> E[Embeddings + positional encodings]
E --> MHA[Multi-head self-attention]
MHA --> AN1[Add & Norm]
AN1 --> FFN[Feed-forward network]
FFN --> AN2[Add & Norm]
AN2 --> R{Repeat N times}
R --> O[Output logits]
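The core of that diagram, self-attention, reduces to a few matrix multiplications. A toy single-head sketch in NumPy (real Transformers use multiple heads, learned weights, layer norm, and, in decoders, a causal mask):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """One attention head: every token attends to every other token."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # all token pairs at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted mix of values

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))        # 5 tokens, embedding dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)   # (5, 4): one output vector per token
```

Note that nothing here is sequential: the score matrix for all pairs is computed in one shot, which is exactly what GPUs are built for.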
Three things made it magical:

- Parallelism: attention over all token pairs is a few big matrix multiplications, exactly the workload GPUs are built for.
- Short paths: any token reaches any other token in one attention step, so long-range dependencies no longer decay with distance.
- Uniformity: the same block stacks cleanly, layer after layer, which made scaling up a matter of engineering rather than invention.
Within a few years, Transformers had conquered translation and speech, and eventually vision too (ViT, 2020). Language modeling was next.
In 2018, two labs took Transformers in different directions. OpenAI's GPT (June 2018) kept only the decoder stack and trained it to predict the next token, left to right. Google's BERT (October 2018) kept only the encoder stack and trained it to fill in masked tokens, reading in both directions at once.
These two architectures represent a fork that persists to this day:
flowchart TB
T[Transformer] --> B[Encoder-only<br/>BERT family]
T --> D[Decoder-only<br/>GPT family]
T --> E[Encoder-decoder<br/>T5, BART]
B --> B1[Classification]
B --> B2[Search / retrieval]
B --> B3[Embeddings]
D --> D1[Text generation]
D --> D2[Chat]
D --> D3[Code]
E --> E1[Translation]
E --> E2[Summarization]
For the next four years, BERT-style models powered Google Search, FAQ matching, content moderation, and most "AI" features in production. GPT-style models were research curiosities — until they weren't.
In early 2019, OpenAI released GPT-2 — 1.5B parameters, trained on 40 GB of web text. They withheld the largest version at first, worried about misuse. By November they had released the full model, and the world mostly yawned. Generation was still stilted.
Then in 2020, OpenAI published two things that changed the shape of the field:

- Scaling laws (Kaplan et al., January 2020): language-model loss falls as a smooth, predictable power law as parameters, data, and compute grow. Bigger was not just better; it was predictably better.
- GPT-3 (May 2020): 175B parameters, more than 100× GPT-2, and the living demonstration of those laws.
GPT-3 was the first "holy shit" moment for people inside the field. You could write a paragraph of English and it would complete it plausibly. Give it three examples of a pattern, and it would continue the pattern — without being fine-tuned. This capability, in-context learning or few-shot prompting, was unprecedented.
flowchart LR
subgraph Model size
A[GPT-1<br/>117M] --> B[GPT-2<br/>1.5B] --> C[GPT-3<br/>175B]
end
subgraph Capability
D[Coherent sentences] --> E[Coherent paragraphs] --> F[Few-shot anything]
end
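Few-shot prompting needs no special API: it is just string formatting. A hypothetical helper (the Q:/A: layout here is illustrative, not a GPT-3 requirement):

```python
def few_shot_prompt(examples, query):
    """Format (input, output) pairs plus one new input, GPT-3 style.

    The model is expected to continue the pattern after the final "A:".
    """
    lines = [f"Q: {q}\nA: {a}" for q, a in examples]
    lines.append(f"Q: {query}\nA:")
    return "\n\n".join(lines)

examples = [("2 + 2", "4"), ("7 + 5", "12"), ("10 + 3", "13")]
print(few_shot_prompt(examples, "6 + 9"))
```

The remarkable part was never the prompt itself; it was that a frozen model, with no gradient updates, could infer the task from three examples.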
In 2022, DeepMind's Chinchilla paper (Hoffmann et al.) refined the recipe further: most models up to that point were under-trained. For a given compute budget, you should use a smaller model with more data. This became the guiding heuristic for the next generation of open models.
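The Chinchilla heuristic turns into simple arithmetic. Assuming the common approximation that training cost is C ≈ 6·N·D FLOPs for N parameters and D tokens, and the paper's rough rule of thumb of about 20 tokens per parameter, compute-optimal sizes follow from solving C = 120·N²:

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Rough compute-optimal sizing under the Chinchilla rule of thumb.

    Assumes training cost C ~= 6 * N * D FLOPs (N params, D tokens)
    and the heuristic D ~= 20 * N, so C = 120 * N**2.
    """
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# e.g. a budget of 1e23 FLOPs, the order of a large 2022 training run:
n, d = chinchilla_optimal(1e23)
print(f"{n / 1e9:.1f}B params, {d / 1e9:.0f}B tokens")  # 28.9B params, 577B tokens
```

Plugging in Chinchilla's own budget (~5.9e23 FLOPs) recovers roughly its 70B parameters and 1.4T tokens, which is a quick sanity check on the formula.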
Three other threads were being woven at the same time, and they matter for what follows:
flowchart TB
subgraph Compute
C1[Fermi GPU 2010] --> C2[V100 2017] --> C3[A100 2020] --> C4[H100 2022]
end
subgraph Data
D1[Wikipedia] --> D2[Common Crawl] --> D3[The Pile] --> D4[RefinedWeb / FineWeb]
end
subgraph Methods
M1[Backprop + SGD] --> M2[Adam] --> M3[Mixed precision] --> M4[FSDP / ZeRO]
end
C4 --> R[GPT-3-scale training<br/>routine by 2022]
D4 --> R
M4 --> R
Different model families serve different jobs. A useful mental map:
quadrantChart
title Architecture vs primary use
x-axis Understanding --> Generation
y-axis Small / Cheap --> Large / Capable
quadrant-1 Frontier generators
quadrant-2 Frontier encoders
quadrant-3 Lightweight encoders
quadrant-4 Lightweight generators
BERT: [0.15, 0.45]
DistilBERT: [0.10, 0.20]
RoBERTa: [0.18, 0.55]
T5: [0.55, 0.60]
GPT-2: [0.80, 0.35]
GPT-3: [0.90, 0.85]
sentence-transformers: [0.20, 0.30]
In plain English: encoders are good at reading, decoders are good at writing, and encoder-decoders translate between the two. By 2020, decoders were winning every interesting benchmark — partly because generation contains understanding, but the reverse is not true: to continue a text well, a model must first have understood it.
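The reading/writing split comes down to one matrix: the attention mask. Encoders let every token see every token; decoders zero out the future so the model can only write left to right. A small NumPy illustration:

```python
import numpy as np

# Encoder (BERT-style): full visibility, every token attends to every token.
# Decoder (GPT-style): lower-triangular mask, token i sees only tokens <= i.
n = 4
bidirectional = np.ones((n, n))      # encoder mask
causal = np.tril(np.ones((n, n)))    # decoder mask

print(causal)
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]
```

During attention, masked-out positions get their scores set to negative infinity before the softmax, so they receive zero weight.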
Every later chapter of this handbook is a variation on a theme introduced here:
| Later idea | Its root here |
|---|---|
| Long context (100k+ tokens) | Self-attention's lack of explicit length limit |
| Embeddings for search / RAG | BERT-style encoder representations |
| Chain-of-thought reasoning | Few-shot in-context learning from GPT-3 |
| RLHF and constitutional AI | Transfer learning + human preference data |
| Agents, tools, MCP | Decoder-only generation looped against real systems |
| Open-source models | Architecture + methods being public by 2020 |
Transformers, pretraining, scale — everything after is an engineering elaboration.