Chapter 4 · GPT-4 and the Frontier Race

Multimodality, reasoning models, and the five-lab skyline.


"Compute is the new oil. We are in the middle of a gold rush, and the picks and shovels are GPUs." — a Silicon Valley VC, circa 2023

The 104 days between ChatGPT's launch (November 30, 2022) and GPT-4's release (March 14, 2023) were the most intense research-to-product compression in tech history. Then it got faster.

How the labs line up

quadrantChart
    title Frontier labs: openness vs capability, April 2026
    x-axis Closed weights --> Open weights
    y-axis Niche / Small --> Flagship / Large
    quadrant-1 Open frontier
    quadrant-2 Closed frontier
    quadrant-3 Closed niche
    quadrant-4 Open niche
    OpenAI GPT-5: [0.10, 0.95]
    Anthropic Opus 4.7: [0.12, 0.93]
    Google Gemini 3: [0.15, 0.90]
    Meta LLaMA 4: [0.85, 0.80]
    Mistral Large: [0.70, 0.65]
    DeepSeek V3-R1: [0.88, 0.75]
    Qwen 3: [0.80, 0.78]
    xAI Grok 4: [0.35, 0.70]

4.1 March 14, 2023 — GPT-4

OpenAI released GPT-4 with a roughly 100-page technical report that notably declined to disclose the architecture, parameter count, or training data. What it did disclose was benchmarks: GPT-4 passed a simulated bar exam in the top 10% of test-takers, where GPT-3.5 had scored in the bottom 10%, with similar jumps on the LSAT, GRE, and a battery of AP exams.

The underlying dynamic had changed: GPT-4 wasn't an incremental improvement; it was good enough to automate real work. Legal associates, junior analysts, tutors, translators — every white-collar task suddenly had a GPT-4 question mark hanging over it.

flowchart LR
    A[GPT-3.5<br/>bar exam bottom 10%] --> B[GPT-4<br/>bar exam top 10%]
    B --> C[GPT-4 Turbo<br/>128k ctx, cheaper]
    C --> D[GPT-4o<br/>text+image+audio native]
    D --> E[o1 / o3<br/>reasoning models]
    E --> F[GPT-5<br/>2025 frontier]

4.2 The five frontier labs

For the next three years, five labs traded the crown every few months:

| Lab | Flagships (2023–2026) | Superpower |
|-----|------------------------|------------|
| OpenAI | GPT-4 → 4 Turbo → 4o → o1 → o3 → GPT-5 | Product velocity, multimodality, first reasoning models |
| Anthropic | Claude 2 → 3 → 3.5 → 4 → 4.5 → Opus 4.7 | Safety, long context, coding, agentic reliability |
| Google DeepMind | Bard → Gemini 1 → 1.5 → 2 → 3 | Ultra-long context (1M+), deep product integration |
| Meta | LLaMA → 2 → 3 → 4 | Open weights, community catalyst |
| xAI / Mistral / DeepSeek | Grok, Mistral Large, DeepSeek V3 / R1 | Efficient architectures, open reasoning |

timeline
    title Frontier model releases
    2023 Mar : GPT-4
    2023 Jul : Claude 2 (100k ctx)
    2023 Dec : Gemini 1.0 Ultra
    2024 Feb : Gemini 1.5 (1M ctx)
    2024 Mar : Claude 3 Opus
    2024 May : GPT-4o (omni)
    2024 Jun : Claude 3.5 Sonnet
    2024 Sep : o1 (first reasoning model)
    2024 Dec : o3 preview, Gemini 2
    2025 Feb : Claude 3.7 (extended thinking)
    2025 May : Claude 4 family
    2025 Q3  : GPT-5
    2025 Q4  : Claude 4.5, Opus 4.6, Gemini 3
    2026 Q2  : Claude Opus 4.7

4.3 The multimodality leap

By late 2024, "the model" and "the input modality" had decoupled. Any frontier model could consume text, images, audio, video, and documents, and emit text, structured data, speech, images, or tool calls:

flowchart TB
    subgraph Inputs
    I1[Text]
    I2[Image]
    I3[Audio]
    I4[Video]
    I5[PDF/docs]
    end
    subgraph Model["Frontier model (2026)"]
    M[Multimodal Transformer]
    end
    subgraph Outputs
    O1[Text]
    O2[Structured data]
    O3[Audio / speech]
    O4[Image generation]
    O5[Tool calls]
    end
    I1 --> M
    I2 --> M
    I3 --> M
    I4 --> M
    I5 --> M
    M --> O1
    M --> O2
    M --> O3
    M --> O4
    M --> O5

For backend engineers, this had a concrete consequence: you stopped building separate OCR / vision / ASR pipelines. One API call could take a photo of a receipt and return structured JSON.
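
A minimal sketch of that call, using the OpenAI Python SDK; the model name, file name, and prompt are illustrative, and any multimodal endpoint with JSON output works the same way.

```python
import base64
import json

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode the receipt photo as a data URL (file name is illustrative).
with open("receipt.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

resp = client.chat.completions.create(
    model="gpt-4o",  # any multimodal model with JSON output works here
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Extract merchant, date, currency, total, and "
                            "line items from this receipt as JSON.",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
)

receipt = json.loads(resp.choices[0].message.content)
print(receipt["merchant"], receipt["total"])  # keys come from the prompt above
```

No separate OCR service, no vision pipeline: the model is the pipeline. Note that `response_format` only guarantees syntactically valid JSON; the schema itself is still enforced by the prompt.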

4.4 Reasoning models — thinking before speaking

The next paradigm shift came in September 2024 when OpenAI released o1. Instead of answering immediately, o1 would generate thousands of hidden "thinking" tokens — chains of reasoning, self-critique, retries — before producing the final answer.

On math and competitive-programming benchmarks, o1 didn't just beat GPT-4. It beat it by margins that looked like a separate paradigm: roughly 83% on AIME 2024 (with consensus sampling) where GPT-4o managed about 12%, and 89th-percentile performance on Codeforces.

flowchart LR
    A[Question] --> B[Hidden reasoning<br/>thousands of tokens]
    B --> C[Self-check]
    C --> D[Retry / branch]
    D --> B
    C --> E[Final answer<br/>concise]

By 2025, every major lab had a reasoning mode: o3, Claude with extended thinking, Gemini 2 Thinking, DeepSeek R1 (fully open). The trade-off: latency goes from seconds to minutes, cost goes up 5–20×. For hard problems, it's worth it; for routine chat, you'd skip it.
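
In code, the trade-off is just a routing decision. A minimal sketch, assuming OpenAI-style chat completions; `o3-mini` and `gpt-4o-mini` are stand-ins for whatever reasoning and fast models you actually have access to, and the `hard` flag is a placeholder for a real routing signal.

```python
from openai import OpenAI

client = OpenAI()

def answer(question: str, hard: bool) -> str:
    """Route easy questions to a fast model, hard ones to a reasoning model.

    In practice the routing signal might be a classifier, a user toggle,
    or a retry-on-failure policy rather than a boolean.
    """
    if hard:
        resp = client.chat.completions.create(
            model="o3-mini",            # reasoning model (assumed available)
            reasoning_effort="high",    # spend more hidden thinking tokens
            messages=[{"role": "user", "content": question}],
        )
    else:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",        # cheap, low-latency model
            messages=[{"role": "user", "content": question}],
        )
    return resp.choices[0].message.content
```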

4.5 Context windows: from 2k to 2M

A less glamorous but equally important axis was context length.

flowchart LR
    A[GPT-3<br/>2k tokens] --> B[GPT-4<br/>8k-32k]
    B --> C[Claude 2<br/>100k]
    C --> D[GPT-4 Turbo<br/>128k]
    D --> E[Claude 3.5<br/>200k]
    E --> F[Gemini 1.5 Pro<br/>1M-2M]
    F --> G[Claude 4.x<br/>ultra-long + reliable]

With 1M tokens, you can drop an entire mid-size codebase, several novels, or hours of transcribed meetings into a single prompt and ask questions across all of it.

This partially threatens RAG (Chapter 7) — why retrieve if you can fit everything? — but cost, latency, and unreliable recall deep inside long contexts mean RAG is still the right default for most real workloads in 2026.
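
The economics are easy to sanity-check. A back-of-envelope sketch; the price is an assumption, so substitute your provider's real per-token rates.

```python
# Back-of-envelope: stuff the full corpus vs retrieve a slice.
# The price is an assumption for illustration, in $ per million input tokens.
PRICE_PER_MTOK = 3.00          # assumed long-context model input price
CORPUS_TOKENS = 1_000_000      # "fit everything" approach
RAG_CONTEXT_TOKENS = 8_000     # retrieved chunks + question

full_context_cost = CORPUS_TOKENS / 1e6 * PRICE_PER_MTOK
rag_cost = RAG_CONTEXT_TOKENS / 1e6 * PRICE_PER_MTOK

print(f"full context: ${full_context_cost:.2f} per request")   # $3.00
print(f"RAG:          ${rag_cost:.4f} per request")             # $0.0240
print(f"ratio:        {full_context_cost / rag_cost:.0f}x")     # 125x
```

Prompt caching narrows the gap when the same corpus repeats across requests, but that per-request ratio is why retrieval keeps winning for high-volume workloads.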

4.6 Pricing — the quiet revolution

Less sexy but more consequential: models got 100× cheaper between 2023 and 2026 at constant capability.

flowchart LR
    A[GPT-4 Mar 2023<br/>$30 in / $60 out per 1M tokens] --> B[GPT-4 Turbo Nov 2023<br/>$10 / $30]
    B --> C[GPT-4o May 2024<br/>$5 / $15]
    C --> D[GPT-4o mini<br/>$0.15 / $0.60]
    D --> E[Haiku / Flash 2026<br/>pennies per million]

The macro story: inference cost dropped roughly 10× per year. Any workflow that was "too expensive" in 2023 is a trivial line item in 2026. Plan for it: assume whatever model you choose today will be a commodity in twelve months.

In plain English. Inference (the cost of using the model) is collapsing; training cost is exploding. Net effect: only the labs that can fund the training race stay at the frontier, but everyone else gets to use near-frontier models for almost nothing. Plan your products on the assumption that the smart bit is free in two years.
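
As a sketch, the compounding looks like this; the 10× annual decline is the historical trend extrapolated forward, not a guarantee.

```python
# Project inference cost at constant capability, assuming ~10x/year decline.
def projected_price(price_today: float, years: float, decline: float = 10.0) -> float:
    return price_today / decline ** years

# A workflow costing $1,000/month today, under this assumption:
for years in (1, 2, 3):
    print(f"in {years}y: ${projected_price(1000, years):,.2f}/month")
# in 1y: $100.00/month
# in 2y: $10.00/month
# in 3y: $1.00/month
```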

4.7 Benchmarks and their discontents

Benchmarks drove the race, and then Goodhart's Law took over: every benchmark got saturated or gamed. The most-watched leaderboards in 2026 are the human-preference arenas (LMArena), agentic coding suites (SWE-bench Verified), and expert-level question sets (GPQA Diamond), each with its own contamination and gaming problems.

Trust benchmarks for direction, not truth. Write your own evals (Chapter 6) for anything you actually ship.
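
Ahead of Chapter 6, the minimum viable version looks like this; the cases, the substring check, and the stub model are all placeholders for your real prompts, your real assertions, and your real client.

```python
from typing import Callable

CASES = [
    # (prompt, substring the answer must contain) -- illustrative cases
    ("Extract the total from: 'Subtotal 9.00, Tax 1.00, Total 10.00'", "10.00"),
    ("What currency does the symbol '¥' denote?", "yen"),
]

def run_eval(call_model: Callable[[str], str]) -> float:
    """Return the fraction of cases whose answer contains the expected substring."""
    passed = sum(
        expected.lower() in call_model(prompt).lower()
        for prompt, expected in CASES
    )
    return passed / len(CASES)

if __name__ == "__main__":
    # Stand-in model for a dry run; swap in your provider's SDK call.
    fake = lambda prompt: "The total is 10.00"
    print(f"pass rate: {run_eval(fake):.0%}")  # 50% with this stub
```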

4.8 What this race means for you

As a backend engineer, you do not have to track every release. You have to track three things:

  1. Capability frontier. What's newly feasible? (E.g., multi-hour agents, real-time voice.)
  2. Price per capability. When does a workflow become trivially cheap?
  3. The open gap. How close are open-weights models to the frontier? That gap governs your self-hosting and compliance options.

The correct defaults for most production work in 2026 follow from those three: a hosted frontier model where output quality pays for itself, a cheap small model for high-volume paths, and an open-weights fallback wherever data can't leave your infrastructure.

Further reading & watching