Chapter 1 · The Prologue: Before ChatGPT

Ten years of quiet work that made November 2022 possible.


"It is not the strongest of the species that survives, nor the most intelligent. It is the one most adaptable to change." — often attributed to Darwin

To understand why ChatGPT felt like lightning, you have to see the storm clouds. The breakthrough wasn't a miracle. It was the last step in a ladder that had been built, one rung at a time, by thousands of researchers since roughly 2012.

This chapter walks that ladder. You don't need to memorize it, but knowing the shape of the climb makes every later idea — attention, embeddings, scaling, alignment — feel inevitable rather than magical.

A bird's-eye view

mindmap
  root((Pre-ChatGPT foundations))
    Hardware
      GPUs and CUDA
      A100 and H100
      TPU pods
    Data
      ImageNet
      Common Crawl
      The Pile
      RefinedWeb
    Architectures
      CNN
      RNN and LSTM
      Transformer
      Encoder vs Decoder
    Training methods
      Backprop
      Adam
      Mixed precision
      RLHF
    Capabilities
      Vision
      Translation
      Summarization
      Few-shot learning

1.1 The thaw (2006–2012)

For most of the 2000s, "neural networks" was a dirty phrase in serious ML circles. Support vector machines and random forests won the benchmarks and the competitions. Neural nets were seen as overhyped and undercooked.

Three things slowly changed:

  1. GPUs. NVIDIA's CUDA (2007) made matrix multiplication cheap. Neural nets are mostly matrix multiplications.
  2. Data. ImageNet (2009, Fei-Fei Li) gave vision researchers a labeled sandbox of over a million images to compete on.
  3. Ideas. Hinton, Bengio, LeCun quietly kept publishing — unsupervised pretraining, ReLUs, dropout.

Then in 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered AlexNet in the ImageNet competition and beat the field by a historic margin. Deep learning wasn't promising anymore. It worked.

flowchart LR
    A[Hand-engineered features
SIFT, HOG] --> B[Shallow classifiers
SVM, Random Forest]
    B --> C[AlexNet 2012
deep convnet on GPU]
    C --> D[Every vision task
moves to deep learning]

1.2 Sequence models (2014–2016)

Vision fell first; language was harder, because language has variable length and long-range dependencies.

The state of the art was RNNs (Recurrent Neural Networks), specifically LSTMs and GRUs. They processed tokens one at a time, passing hidden state forward. They worked, but with two nagging problems: training was slow because each step depends on the previous one (there is no parallelism across the sequence), and information from early tokens faded by the time the model reached the end of a long input.

In 2014, Bahdanau et al. introduced attention as an add-on to seq2seq translation. Instead of squeezing the whole source sentence into a single hidden vector, the decoder could look back at all the source positions and weight them. Translation quality jumped.

Attention was the key insight. But for three years, it lived as a supporting actor.
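
To make "look back at all the source positions and weight them" concrete, here is a minimal sketch of one decoding step with attention, in plain NumPy. The dot-product scoring is a simplification (Bahdanau et al. use a small learned scoring network), and the shapes and names are illustrative only, not the paper's implementation.

import numpy as np

def attention_step(decoder_state, encoder_states):
    # decoder_state:  (d,)   current decoder hidden state
    # encoder_states: (T, d) one hidden state per source token
    # Score every source position against the current decoder state.
    # (Bahdanau et al. use a small learned network here; a dot product
    # keeps the sketch short.)
    scores = encoder_states @ decoder_state            # (T,)
    # Softmax turns the scores into weights that sum to 1.
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                  # (T,)
    # The context vector is a weighted average of the source states.
    context = weights @ encoder_states                 # (d,)
    return context, weights

# Toy usage: 4 source tokens, hidden size 8.
rng = np.random.default_rng(0)
context, weights = attention_step(rng.normal(size=8), rng.normal(size=(4, 8)))
print(weights)  # one weight per source token, summing to 1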

flowchart LR
    subgraph RNN Era
    A[Token 1] --> B[Token 2] --> C[Token 3] --> D[Token 4]
    end
    subgraph RNN with attention
    E[Token 1] --> F[Token 2] --> G[Token 3] --> H[Token 4]
    H -.looks back.-> E
    H -.looks back.-> F
    H -.looks back.-> G
    end

1.3 The Transformer (2017)

In June 2017, eight researchers at Google Brain (Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin) published the single most important paper of the decade:

Attention Is All You Need.

Their claim was radical: you don't need RNNs at all. Self-attention alone — each token attending to every other token — can model sequences, and because it's one big parallel matrix operation, it trains dramatically faster on modern hardware.
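
A minimal sketch of scaled dot-product self-attention over a whole sequence (single head, no masking, no learned projections; plain NumPy, for illustration only) shows why it is one big parallel matrix operation: every token-to-token interaction is computed in a single matrix product.

import numpy as np

def self_attention(X):
    # X: (T, d) token embeddings for a sequence of length T.
    # A real Transformer first projects X into separate query/key/value
    # matrices and uses several heads; here Q = K = V = X for brevity.
    d = X.shape[1]
    # Every pairwise token-to-token score in one matrix multiply: (T, T).
    scores = X @ X.T / np.sqrt(d)
    # Row-wise softmax: each token's attention over all tokens.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted mix of every token's vector, and
    # nothing here requires processing tokens one at a time.
    return weights @ X                                  # (T, d)

# Toy usage: a 5-token sequence with embedding size 16.
X = np.random.default_rng(1).normal(size=(5, 16))
print(self_attention(X).shape)  # (5, 16): one updated vector per token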

The architecture they introduced, the Transformer, is the skeleton of every LLM today.

flowchart TB
    I[Input tokens] --> E[Embeddings + positional encodings]
    E --> MHA[Multi-head self-attention]
    MHA --> AN1[Add & Norm]
    AN1 --> FFN[Feed-forward network]
    FFN --> AN2[Add & Norm]
    AN2 --> R{Repeat N times}
    R --> O[Output logits]

Three things made it magical:

  1. Parallelism. With no recurrence, every position in the sequence is processed at once, so training keeps GPUs busy instead of waiting on the previous token.
  2. Direct long-range connections. Any token can attend to any other token in a single step, so distant dependencies no longer have to survive a long chain of hidden states.
  3. Scalability. The same simple block, stacked deeper and made wider, kept getting better as more data and compute were poured in.

Within a few years, Transformers had conquered translation and speech, and then vision (ViT, 2020). Language modeling was next.

1.4 The two families: BERT and GPT (2018)

In 2018, two labs took Transformers in different directions. OpenAI's GPT (June 2018) kept only the decoder stack and trained it to predict the next token, reading strictly left to right. Google's BERT (October 2018) kept only the encoder stack and trained it to fill in masked tokens, reading the whole sentence in both directions at once. The first is built for generating text, the second for understanding it.

These two architectures represent a fork that persists to this day:

flowchart TB
    T[Transformer] --> B[Encoder-only
BERT family]
    T --> D[Decoder-only
GPT family]
    T --> E[Encoder-decoder
T5, BART]
    B --> B1[Classification]
    B --> B2[Search / retrieval]
    B --> B3[Embeddings]
    D --> D1[Text generation]
    D --> D2[Chat]
    D --> D3[Code]
    E --> E1[Translation]
    E --> E2[Summarization]

For the next four years, BERT-style models powered Google Search, FAQ matching, content moderation, and most "AI" features in production. GPT-style models were research curiosities — until they weren't.
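
To see the fork concretely, here is a toy sketch of the two pretraining objectives. The sentence, the token split, and the mask positions are invented for illustration; this is not real training code.

# Toy illustration of the two pretraining objectives.
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# GPT-style (decoder-only, causal): predict each next token from its prefix.
causal_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for prefix, target in causal_pairs:
    print(" ".join(prefix), "->", target)   # "the -> cat", "the cat -> sat", ...

# BERT-style (encoder-only, masked): hide some tokens and predict them from
# the surrounding context on both sides at once.
masked_input = ["the", "[MASK]", "sat", "on", "the", "[MASK]"]
masked_targets = {1: "cat", 5: "mat"}
print(masked_input, "->", masked_targets)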

1.5 Scale is all you need (2019–2020)

In early 2019, OpenAI released GPT-2 — 1.5B parameters, trained on 40 GB of web text. They withheld the largest version at first, worried about misuse. When they released it in full later that year, the world mostly yawned. Generation was still stilted.

Then in 2020, OpenAI published two things that changed the shape of the field:

  1. Scaling laws (Kaplan et al., January 2020). Language-model loss falls predictably, as a power law, with more parameters, more data, and more compute, which turned "make it bigger" from a hunch into a plan.
  2. GPT-3 (Brown et al., "Language Models are Few-Shot Learners", May 2020). A 175-billion-parameter decoder-only model that put those laws into practice.

GPT-3 was the first "holy shit" moment for people inside the field. You could write a paragraph of English and it would complete it plausibly. Give it three examples of a pattern, and it would continue the pattern — without being fine-tuned. This capability, in-context learning or few-shot prompting, was unprecedented.
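
What "three examples of a pattern" looked like in practice was nothing more than a prompt. A hypothetical sketch follows; the translation pairs are illustrative and not taken from the GPT-3 paper.

# A hypothetical few-shot prompt: the model is never fine-tuned; the
# "training" is just a handful of worked examples placed in the context window.
prompt = """English: cheese
French: fromage

English: house
French: maison

English: thank you
French: merci

English: good morning
French:"""

# Sent to a completion-style model, the expected continuation is " bonjour".
print(prompt)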

flowchart LR
    subgraph Model size
    A[GPT-1
117M] --> B[GPT-2
1.5B] --> C[GPT-3
175B]
    end
    subgraph Capability
    D[Coherent sentences] --> E[Coherent paragraphs] --> F[Few-shot anything]
    end

In 2022, DeepMind's Chinchilla paper (Hoffmann et al.) refined the recipe further: most models up to that point were under-trained. For a given compute budget, you should use a smaller model with more data. This became the guiding heuristic for the next generation of open models.
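
A back-of-the-envelope sketch of that heuristic, assuming the common C ≈ 6·N·D approximation for training FLOPs and Chinchilla's roughly 20-tokens-per-parameter rule of thumb (the paper's fitted constants differ slightly):

def chinchilla_optimal(compute_flops, tokens_per_param=20):
    # Two standard approximations:
    #   training cost C ~= 6 * N * D floating-point operations
    #   Chinchilla rule of thumb: D ~= 20 * N (tokens per parameter)
    # Solving them together gives N = sqrt(C / (6 * tokens_per_param)).
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: GPT-3's reported training budget (~3.14e23 FLOPs), reallocated
# the Chinchilla way: roughly a 50B-parameter model on about a trillion
# tokens, instead of 175B parameters on ~300B tokens.
n, d = chinchilla_optimal(3.14e23)
print(f"~{n / 1e9:.0f}B parameters, ~{d / 1e12:.1f}T tokens")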

1.6 Meanwhile, the quiet foundations

Three other threads were being woven at the same time, and they matter for what follows:

flowchart TB
    subgraph Compute
    C1[Fermi GPU 2010] --> C2[V100 2017] --> C3[A100 2020] --> C4[H100 2022]
    end
    subgraph Data
    D1[Wikipedia] --> D2[Common Crawl] --> D3[The Pile] --> D4[RefinedWeb / FineWeb]
    end
    subgraph Methods
    M1[Backprop + SGD] --> M2[Adam] --> M3[Mixed precision] --> M4[FSDP / ZeRO]
    end
    C4 --> R[GPT-3-scale training
routine by 2022]
    D4 --> R
    M4 --> R

1.7 Where the families landed (a quadrant)

Different model families serve different jobs. A useful mental map:

quadrantChart
    title Architecture vs primary use
    x-axis Understanding --> Generation
    y-axis Small / Cheap --> Large / Capable
    quadrant-1 Frontier generators
    quadrant-2 Frontier encoders
    quadrant-3 Lightweight encoders
    quadrant-4 Lightweight generators
    BERT: [0.15, 0.45]
    DistilBERT: [0.10, 0.20]
    RoBERTa: [0.18, 0.55]
    T5: [0.55, 0.60]
    GPT-2: [0.80, 0.35]
    GPT-3: [0.90, 0.85]
    sentence-transformers: [0.20, 0.30]

In plain English: encoders are good at reading, decoders are good at writing, and encoder-decoders translate between the two. By 2020, decoders were winning every interesting benchmark — partly because generation contains understanding, but the reverse is not true.

1.8 Why this prologue matters

Every later chapter of this handbook is a variation on a theme introduced here:

| Later idea | Its root here |
| --- | --- |
| Long context (100k+ tokens) | Self-attention's lack of explicit length limit |
| Embeddings for search / RAG | BERT-style encoder representations |
| Chain-of-thought reasoning | Few-shot in-context learning from GPT-3 |
| RLHF and constitutional AI | Transfer learning + human preference data |
| Agents, tools, MCP | Decoder-only generation looped against real systems |
| Open-source models | Architecture + methods being public by 2020 |

Transformers, pretraining, scale — everything after is an engineering elaboration.

Further reading & watching