Chapter 1 · The Prologue: Before ChatGPT

Ten years of quiet work that made November 2022 possible.


"It is not the strongest of the species that survives, nor the most intelligent. It is the one most adaptable to change." — often attributed to Darwin

To understand why ChatGPT felt like lightning, you have to see the storm clouds. The breakthrough wasn't a miracle. It was the last step in a ladder that had been built, one rung at a time, by thousands of researchers since roughly 2012.

This chapter walks that ladder. You don't need to memorize it, but knowing the shape of the climb makes every later idea — attention, embeddings, scaling, alignment — feel inevitable rather than magical.

A bird's-eye view

mindmap
  root((Pre-ChatGPT foundations))
    Hardware
      GPUs and CUDA
      A100 and H100
      TPU pods
    Data
      ImageNet
      Common Crawl
      The Pile
      RefinedWeb
    Architectures
      CNN
      RNN and LSTM
      Transformer
      Encoder vs Decoder
    Training methods
      Backprop
      Adam
      Mixed precision
      RLHF
    Capabilities
      Vision
      Translation
      Summarization
      Few-shot learning

1.1 The thaw (2006–2012)

For most of the 2000s, "neural networks" was a dirty phrase in serious ML circles. Support vector machines and random forests won the benchmarks and the competitions. Neural nets were seen as overhyped and undercooked.

Three things slowly changed:

  1. GPUs. NVIDIA's CUDA (2007) made matrix multiplication cheap. Neural nets are mostly matrix multiplications.
  2. Data. ImageNet (2009, Fei-Fei Li) gave vision researchers a labeled sandbox of over a million images to compete on.
  3. Ideas. Hinton, Bengio, LeCun quietly kept publishing — unsupervised pretraining, ReLUs, dropout.

Then in 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered AlexNet in the ImageNet competition and beat the field by a historic margin. Deep learning wasn't promising anymore. It worked.

flowchart LR
    A[Hand-engineered features
SIFT, HOG] --> B[Shallow classifiers
SVM, Random Forest]
    B --> C[AlexNet 2012
deep convnet on GPU]
    C --> D[Every vision task
moves to deep learning]

1.2 Sequence models (2014–2016)

Vision fell first; language was harder, because language has variable length and long-range dependencies.

The state of the art was RNNs (Recurrent Neural Networks), specifically LSTMs and GRUs. They processed tokens one at a time, passing hidden state forward. They worked, but with two nagging problems: training was slow because each step depends on the previous one (there is no parallelism across the sequence), and information from early tokens faded by the time the model reached the end of a long input.

In 2014, Bahdanau et al. introduced attention as an add-on to seq2seq translation. Instead of squeezing the whole source sentence into a single hidden vector, the decoder could look back at all the source positions and weight them. Translation quality jumped.

Attention was the key insight. But for three years, it lived as a supporting actor.
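
To make "look back at all the source positions and weight them" concrete, here is a minimal sketch of one decoding step with attention, in plain NumPy. The dot-product scoring is a simplification (Bahdanau et al. use a small learned scoring network), and the shapes and names are illustrative only, not the paper's implementation.

import numpy as np

def attention_step(decoder_state, encoder_states):
    # decoder_state:  (d,)   current decoder hidden state
    # encoder_states: (T, d) one hidden state per source token
    # Score every source position against the current decoder state.
    # (Bahdanau et al. use a small learned network here; a dot product
    # keeps the sketch short.)
    scores = encoder_states @ decoder_state            # (T,)
    # Softmax turns the scores into weights that sum to 1.
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                  # (T,)
    # The context vector is a weighted average of the source states.
    context = weights @ encoder_states                 # (d,)
    return context, weights

# Toy usage: 4 source tokens, hidden size 8.
rng = np.random.default_rng(0)
context, weights = attention_step(rng.normal(size=8), rng.normal(size=(4, 8)))
print(weights)  # one weight per source token, summing to 1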

flowchart LR
    subgraph RNN Era
    A[Token 1] --> B[Token 2] --> C[Token 3] --> D[Token 4]
    end
    subgraph RNN with attention
    E[Token 1] --> F[Token 2] --> G[Token 3] --> H[Token 4]
    H -.looks back.-> E
    H -.looks back.-> F
    H -.looks back.-> G
    end

1.3 The Transformer (2017)

In June 2017, eight researchers at Google Brain (Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin) published the single most important paper of the decade:

Attention Is All You Need.

Their claim was radical: you don't need RNNs at all. Self-attention alone — each token attending to every other token — can model sequences, and because it's one big parallel matrix operation, it trains dramatically faster on modern hardware.
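
A minimal sketch of scaled dot-product self-attention over a whole sequence (single head, no masking, no learned projections; plain NumPy, for illustration only) shows why it is one big parallel matrix operation: every token-to-token interaction is computed in a single matrix product.

import numpy as np

def self_attention(X):
    # X: (T, d) token embeddings for a sequence of length T.
    # A real Transformer first projects X into separate query/key/value
    # matrices and uses several heads; here Q = K = V = X for brevity.
    d = X.shape[1]
    # Every pairwise token-to-token score in one matrix multiply: (T, T).
    scores = X @ X.T / np.sqrt(d)
    # Row-wise softmax: each token's attention over all tokens.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted mix of every token's vector, and
    # nothing here requires processing tokens one at a time.
    return weights @ X                                  # (T, d)

# Toy usage: a 5-token sequence with embedding size 16.
X = np.random.default_rng(1).normal(size=(5, 16))
print(self_attention(X).shape)  # (5, 16): one updated vector per token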

The architecture they introduced, the Transformer, is the skeleton of every LLM today.

flowchart TB
    I[Input tokens] --> E[Embeddings + positional encodings]
    E --> MHA[Multi-head self-attention]
    MHA --> AN1[Add & Norm]
    AN1 --> FFN[Feed-forward network]
    FFN --> AN2[Add & Norm]
    AN2 --> R{Repeat N times}
    R --> O[Output logits]

Three things made it magical:

  1. Parallelism. With no recurrence, every position in the sequence is processed at once, so training keeps GPUs busy instead of waiting on the previous token.
  2. Direct long-range connections. Any token can attend to any other token in a single step, so distant dependencies no longer have to survive a long chain of hidden states.
  3. Scalability. The same simple block, stacked deeper and made wider, kept getting better as more data and compute were poured in.

Within a few years, Transformers had conquered translation and speech, and then vision (ViT, 2020). Language modeling was next.

1.4 The two families: BERT and GPT (2018)

In 2018, two labs took Transformers in different directions. OpenAI's GPT (June 2018) kept only the decoder stack and trained it to predict the next token, reading strictly left to right. Google's BERT (October 2018) kept only the encoder stack and trained it to fill in masked tokens, reading the whole sentence in both directions at once. The first is built for generating text, the second for understanding it.

These two architectures represent a fork that persists to this day:

flowchart TB
    T[Transformer] --> B[Encoder-only
BERT family]
    T --> D[Decoder-only
GPT family]
    T --> E[Encoder-decoder
T5, BART]
    B --> B1[Classification]
    B --> B2[Search / retrieval]
    B --> B3[Embeddings]
    D --> D1[Text generation]
    D --> D2[Chat]
    D --> D3[Code]
    E --> E1[Translation]
    E --> E2[Summarization]

For the next four years, BERT-style models powered Google Search, FAQ matching, content moderation, and most "AI" features in production. GPT-style models were research curiosities — until they weren't.
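
To see the fork concretely, here is a toy sketch of the two pretraining objectives. The sentence, the token split, and the mask positions are invented for illustration; this is not real training code.

# Toy illustration of the two pretraining objectives.
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# GPT-style (decoder-only, causal): predict each next token from its prefix.
causal_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for prefix, target in causal_pairs:
    print(" ".join(prefix), "->", target)   # "the -> cat", "the cat -> sat", ...

# BERT-style (encoder-only, masked): hide some tokens and predict them from
# the surrounding context on both sides at once.
masked_input = ["the", "[MASK]", "sat", "on", "the", "[MASK]"]
masked_targets = {1: "cat", 5: "mat"}
print(masked_input, "->", masked_targets)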

1.5 Scale is all you need (2019–2020)

In early 2019, OpenAI released GPT-2 — 1.5B parameters, trained on 40 GB of web text. They withheld the largest version at first, worried about misuse. When they released it in full later that year, the world mostly yawned. Generation was still stilted.

Then in 2020, OpenAI published two things that changed the shape of the field:

  1. Scaling laws (Kaplan et al., January 2020). Language-model loss falls predictably, as a power law, with more parameters, more data, and more compute, which turned "make it bigger" from a hunch into a plan.
  2. GPT-3 (Brown et al., "Language Models are Few-Shot Learners", May 2020). A 175-billion-parameter decoder-only model that put those laws into practice.

GPT-3 was the first "holy shit" moment for people inside the field. You could write a paragraph of English and it would complete it plausibly. Give it three examples of a pattern, and it would continue the pattern — without being fine-tuned. This capability, in-context learning or few-shot prompting, was unprecedented.
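
What "three examples of a pattern" looked like in practice was nothing more than a prompt. A hypothetical sketch follows; the translation pairs are illustrative and not taken from the GPT-3 paper.

# A hypothetical few-shot prompt: the model is never fine-tuned; the
# "training" is just a handful of worked examples placed in the context window.
prompt = """English: cheese
French: fromage

English: house
French: maison

English: thank you
French: merci

English: good morning
French:"""

# Sent to a completion-style model, the expected continuation is " bonjour".
print(prompt)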

flowchart LR
    subgraph Model size
    A[GPT-1
117M] --> B[GPT-2
1.5B] --> C[GPT-3
175B]
    end
    subgraph Capability
    D[Coherent sentences] --> E[Coherent paragraphs] --> F[Few-shot anything]
    end

In 2022, DeepMind's Chinchilla paper (Hoffmann et al.) refined the recipe further: most models up to that point were under-trained. For a given compute budget, you should use a smaller model with more data. This became the guiding heuristic for the next generation of open models.
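
A back-of-the-envelope sketch of that heuristic, assuming the common C ≈ 6·N·D approximation for training FLOPs and Chinchilla's roughly 20-tokens-per-parameter rule of thumb (the paper's fitted constants differ slightly):

def chinchilla_optimal(compute_flops, tokens_per_param=20):
    # Two standard approximations:
    #   training cost C ~= 6 * N * D floating-point operations
    #   Chinchilla rule of thumb: D ~= 20 * N (tokens per parameter)
    # Solving them together gives N = sqrt(C / (6 * tokens_per_param)).
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: GPT-3's reported training budget (~3.14e23 FLOPs), reallocated
# the Chinchilla way: roughly a 50B-parameter model on about a trillion
# tokens, instead of 175B parameters on ~300B tokens.
n, d = chinchilla_optimal(3.14e23)
print(f"~{n / 1e9:.0f}B parameters, ~{d / 1e12:.1f}T tokens")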

1.6 Meanwhile, the quiet foundations

Three other threads were being woven at the same time, and they matter for what follows:

flowchart TB
    subgraph Compute
    C1[Fermi GPU 2010] --> C2[V100 2017] --> C3[A100 2020] --> C4[H100 2022]
    end
    subgraph Data
    D1[Wikipedia] --> D2[Common Crawl] --> D3[The Pile] --> D4[RefinedWeb / FineWeb]
    end
    subgraph Methods
    M1[Backprop + SGD] --> M2[Adam] --> M3[Mixed precision] --> M4[FSDP / ZeRO]
    end
    C4 --> R[GPT-3-scale training
routine by 2022]
    D4 --> R
    M4 --> R

1.7 Where the families landed (a quadrant)

Different model families serve different jobs. A useful mental map:

quadrantChart
    title Architecture vs primary use
    x-axis Understanding --> Generation
    y-axis Small / Cheap --> Large / Capable
    quadrant-1 Frontier generators
    quadrant-2 Frontier encoders
    quadrant-3 Lightweight encoders
    quadrant-4 Lightweight generators
    BERT: [0.15, 0.45]
    DistilBERT: [0.10, 0.20]
    RoBERTa: [0.18, 0.55]
    T5: [0.55, 0.60]
    GPT-2: [0.80, 0.35]
    GPT-3: [0.90, 0.85]
    sentence-transformers: [0.20, 0.30]

In plain English: encoders are good at reading, decoders are good at writing, and encoder-decoders translate between the two. By 2020, decoders were winning every interesting benchmark — partly because generation contains understanding, but the reverse is not true.

1.8 Why this prologue matters

Every later chapter of this handbook is a variation on a theme introduced here:

| Later idea | Its root here |
| --- | --- |
| Long context (100k+ tokens) | Self-attention's lack of explicit length limit |
| Embeddings for search / RAG | BERT-style encoder representations |
| Chain-of-thought reasoning | Few-shot in-context learning from GPT-3 |
| RLHF and constitutional AI | Transfer learning + human preference data |
| Agents, tools, MCP | Decoder-only generation looped against real systems |
| Open-source models | Architecture + methods being public by 2020 |

Transformers, pretraining, scale — everything after is an engineering elaboration.

Further reading & watching