Chapter 4 · GPT-4 and the Frontier Race

Multimodality, reasoning models, and the five-lab skyline.


"Compute is the new oil. We are in the middle of a gold rush, and the picks and shovels are GPUs." — a Silicon Valley VC, circa 2023

The 104 days between ChatGPT's launch (November 30, 2022) and GPT-4's release (March 14, 2023) were the most intense research-to-product compression in tech history. Then it got faster.

How the labs line up

quadrantChart
    title Frontier labs: openness vs capability, April 2026
    x-axis Closed weights --> Open weights
    y-axis Niche / Small --> Flagship / Large
    quadrant-1 Open frontier
    quadrant-2 Closed frontier
    quadrant-3 Closed niche
    quadrant-4 Open niche
    OpenAI GPT-5: [0.10, 0.95]
    Anthropic Opus 4.7: [0.12, 0.93]
    Google Gemini 3: [0.15, 0.90]
    Meta LLaMA 4: [0.85, 0.80]
    Mistral Large: [0.70, 0.65]
    DeepSeek V3-R1: [0.88, 0.75]
    Qwen 3: [0.80, 0.78]
    xAI Grok 4: [0.35, 0.70]

4.1 March 14, 2023 — GPT-4

OpenAI released GPT-4 with a roughly 100-page technical report that notably declined to disclose the architecture, parameter count, or training data. What it did disclose was benchmarks: GPT-4 passed a simulated bar exam in the top 10% of test-takers, where GPT-3.5 had scored in the bottom 10%, with similar jumps on the LSAT, GRE, and a battery of AP exams.

The underlying dynamic had changed: GPT-4 wasn't an incremental improvement; it was good enough to automate real work. Legal associates, junior analysts, tutors, translators — every white-collar task suddenly had a GPT-4 question mark hanging over it.

flowchart LR
    A[GPT-3.5<br/>bar exam bottom 10%] --> B[GPT-4<br/>bar exam top 10%]
    B --> C[GPT-4 Turbo<br/>128k ctx, cheaper]
    C --> D[GPT-4o<br/>text+image+audio native]
    D --> E[o1 / o3<br/>reasoning models]
    E --> F[GPT-5<br/>2025 frontier]

4.2 The five frontier labs

For the next three years, five labs traded the crown every few months:

| Lab | Flagships (2023–2026) | Superpower |
|-----|------------------------|------------|
| OpenAI | GPT-4 → 4 Turbo → 4o → o1 → o3 → GPT-5 | Product velocity, multimodality, first reasoning models |
| Anthropic | Claude 2 → 3 → 3.5 → 4 → 4.5 → Opus 4.7 | Safety, long context, coding, agentic reliability |
| Google DeepMind | Bard → Gemini 1 → 1.5 → 2 → 3 | Ultra-long context (1M+), deep product integration |
| Meta | LLaMA → 2 → 3 → 4 | Open weights, community catalyst |
| xAI / Mistral / DeepSeek | Grok, Mistral Large, DeepSeek V3 / R1 | Efficient architectures, open reasoning |

timeline
    title Frontier model releases
    2023 Mar : GPT-4
    2023 Jul : Claude 2 (100k ctx)
    2023 Dec : Gemini 1.0 Ultra
    2024 Feb : Gemini 1.5 (1M ctx)
    2024 Mar : Claude 3 Opus
    2024 May : GPT-4o (omni)
    2024 Jun : Claude 3.5 Sonnet
    2024 Sep : o1 (first reasoning model)
    2024 Dec : o3 preview, Gemini 2
    2025 Feb : Claude 3.7 (extended thinking)
    2025 May : Claude 4 family
    2025 Q3  : GPT-5
    2025 Q4  : Claude 4.5, Opus 4.6, Gemini 3
    2026 Q2  : Claude Opus 4.7

4.3 The multimodality leap

By late 2024, "the model" and "the input modality" had decoupled. Any frontier model could consume text, images, audio, video, and documents, and emit text, structured data, speech, images, or tool calls:

flowchart TB
    subgraph Inputs
    I1[Text]
    I2[Image]
    I3[Audio]
    I4[Video]
    I5[PDF/docs]
    end
    subgraph Model["Frontier model (2026)"]
    M[Multimodal Transformer]
    end
    subgraph Outputs
    O1[Text]
    O2[Structured data]
    O3[Audio / speech]
    O4[Image generation]
    O5[Tool calls]
    end
    I1 --> M
    I2 --> M
    I3 --> M
    I4 --> M
    I5 --> M
    M --> O1
    M --> O2
    M --> O3
    M --> O4
    M --> O5

For backend engineers, this had a concrete consequence: you stopped building separate OCR / vision / ASR pipelines. One API call could take a photo of a receipt and return structured JSON.
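
A minimal sketch of that call, using the OpenAI Python SDK; the model name, file name, and prompt are illustrative, and any multimodal endpoint with JSON output works the same way.

```python
import base64
import json

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode the receipt photo as a data URL (file name is illustrative).
with open("receipt.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

resp = client.chat.completions.create(
    model="gpt-4o",  # any multimodal model with JSON output works here
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Extract merchant, date, currency, total, and "
                            "line items from this receipt as JSON.",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
)

receipt = json.loads(resp.choices[0].message.content)
print(receipt["merchant"], receipt["total"])  # keys come from the prompt above
```

No separate OCR service, no vision pipeline: the model is the pipeline. Note that `response_format` only guarantees syntactically valid JSON; the schema itself is still enforced by the prompt.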

4.4 Reasoning models — thinking before speaking

The next paradigm shift came in September 2024 when OpenAI released o1. Instead of answering immediately, o1 would generate thousands of hidden "thinking" tokens — chains of reasoning, self-critique, retries — before producing the final answer.

On math and competitive-programming benchmarks, o1 didn't just beat GPT-4. It beat it by margins that looked like a separate paradigm: roughly 83% on AIME 2024 (with consensus sampling) where GPT-4o managed about 12%, and 89th-percentile performance on Codeforces.

flowchart LR
    A[Question] --> B[Hidden reasoning<br/>thousands of tokens]
    B --> C[Self-check]
    C --> D[Retry / branch]
    D --> B
    C --> E[Final answer<br/>concise]

By 2025, every major lab had a reasoning mode: o3, Claude with extended thinking, Gemini 2 Thinking, DeepSeek R1 (fully open). The trade-off: latency goes from seconds to minutes, cost goes up 5–20×. For hard problems, it's worth it; for routine chat, you'd skip it.
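
In code, the trade-off is just a routing decision. A minimal sketch, assuming OpenAI-style chat completions; `o3-mini` and `gpt-4o-mini` are stand-ins for whatever reasoning and fast models you actually have access to, and the `hard` flag is a placeholder for a real routing signal.

```python
from openai import OpenAI

client = OpenAI()

def answer(question: str, hard: bool) -> str:
    """Route easy questions to a fast model, hard ones to a reasoning model.

    In practice the routing signal might be a classifier, a user toggle,
    or a retry-on-failure policy rather than a boolean.
    """
    if hard:
        resp = client.chat.completions.create(
            model="o3-mini",            # reasoning model (assumed available)
            reasoning_effort="high",    # spend more hidden thinking tokens
            messages=[{"role": "user", "content": question}],
        )
    else:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",        # cheap, low-latency model
            messages=[{"role": "user", "content": question}],
        )
    return resp.choices[0].message.content
```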

4.5 Context windows: from 2k to 2M

A less glamorous but equally important axis was context length.

flowchart LR
    A[GPT-3<br/>2k tokens] --> B[GPT-4<br/>8k-32k]
    B --> C[Claude 2<br/>100k]
    C --> D[GPT-4 Turbo<br/>128k]
    D --> E[Claude 3.5<br/>200k]
    E --> F[Gemini 1.5 Pro<br/>1M-2M]
    F --> G[Claude 4.x<br/>ultra-long + reliable]

With 1M tokens, you can drop an entire mid-size codebase, several novels, or hours of transcribed meetings into a single prompt and ask questions across all of it.

This partially threatens RAG (Chapter 7) — why retrieve if you can fit everything? — but cost, latency, and unreliable recall deep inside long contexts mean RAG is still the right default for most real workloads in 2026.
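
The economics are easy to sanity-check. A back-of-envelope sketch; the price is an assumption, so substitute your provider's real per-token rates.

```python
# Back-of-envelope: stuff the full corpus vs retrieve a slice.
# The price is an assumption for illustration, in $ per million input tokens.
PRICE_PER_MTOK = 3.00          # assumed long-context model input price
CORPUS_TOKENS = 1_000_000      # "fit everything" approach
RAG_CONTEXT_TOKENS = 8_000     # retrieved chunks + question

full_context_cost = CORPUS_TOKENS / 1e6 * PRICE_PER_MTOK
rag_cost = RAG_CONTEXT_TOKENS / 1e6 * PRICE_PER_MTOK

print(f"full context: ${full_context_cost:.2f} per request")   # $3.00
print(f"RAG:          ${rag_cost:.4f} per request")             # $0.0240
print(f"ratio:        {full_context_cost / rag_cost:.0f}x")     # 125x
```

Prompt caching narrows the gap when the same corpus repeats across requests, but that per-request ratio is why retrieval keeps winning for high-volume workloads.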

4.6 Pricing — the quiet revolution

Less sexy but more consequential: models got 100× cheaper between 2023 and 2026 at constant capability.

flowchart LR
    A[GPT-4 Mar 2023<br/>$30 in / $60 out per 1M tokens] --> B[GPT-4 Turbo Nov 2023<br/>$10 / $30]
    B --> C[GPT-4o May 2024<br/>$5 / $15]
    C --> D[GPT-4o mini<br/>$0.15 / $0.60]
    D --> E[Haiku / Flash 2026<br/>pennies per million]

The macro story: inference cost dropped roughly 10× per year. Any workflow that was "too expensive" in 2023 is a trivial line item in 2026. Plan for it: assume whatever model you choose today will be a commodity in twelve months.

In plain English. Inference (the cost of using the model) is collapsing; training cost is exploding. Net effect: only the labs that can fund the training race stay at the frontier, but everyone else gets to use near-frontier models for almost nothing. Plan your products on the assumption that the smart bit is free in two years.
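
As a sketch, the compounding looks like this; the 10× annual decline is the historical trend extrapolated forward, not a guarantee.

```python
# Project inference cost at constant capability, assuming ~10x/year decline.
def projected_price(price_today: float, years: float, decline: float = 10.0) -> float:
    return price_today / decline ** years

# A workflow costing $1,000/month today, under this assumption:
for years in (1, 2, 3):
    print(f"in {years}y: ${projected_price(1000, years):,.2f}/month")
# in 1y: $100.00/month
# in 2y: $10.00/month
# in 3y: $1.00/month
```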

4.7 Benchmarks and their discontents

Benchmarks drove the race, and then Goodhart's Law took over: every benchmark got saturated or gamed. The most-watched leaderboards in 2026 are the human-preference arenas (LMArena), agentic coding suites (SWE-bench Verified), and expert-level question sets (GPQA Diamond), each with its own contamination and gaming problems.

Trust benchmarks for direction, not truth. Write your own evals (Chapter 6) for anything you actually ship.
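
Ahead of Chapter 6, the minimum viable version looks like this; the cases, the substring check, and the stub model are all placeholders for your real prompts, your real assertions, and your real client.

```python
from typing import Callable

CASES = [
    # (prompt, substring the answer must contain) -- illustrative cases
    ("Extract the total from: 'Subtotal 9.00, Tax 1.00, Total 10.00'", "10.00"),
    ("What currency does the symbol '¥' denote?", "yen"),
]

def run_eval(call_model: Callable[[str], str]) -> float:
    """Return the fraction of cases whose answer contains the expected substring."""
    passed = sum(
        expected.lower() in call_model(prompt).lower()
        for prompt, expected in CASES
    )
    return passed / len(CASES)

if __name__ == "__main__":
    # Stand-in model for a dry run; swap in your provider's SDK call.
    fake = lambda prompt: "The total is 10.00"
    print(f"pass rate: {run_eval(fake):.0%}")  # 50% with this stub
```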

4.8 What this race means for you

As a backend engineer, you do not have to track every release. You have to track three things:

  1. Capability frontier. What's newly feasible? (E.g., multi-hour agents, real-time voice.)
  2. Price per capability. When does a workflow become trivially cheap?
  3. The open gap. How close are open-weights models to the frontier? That gap governs your self-hosting and compliance options.

The correct defaults for most production work in 2026 follow from those three: a hosted frontier model where output quality pays for itself, a cheap small model for high-volume paths, and an open-weights fallback wherever data can't leave your infrastructure.

Further reading & watching