Chapter 9 · From Chatbots to Agents

A chatbot answers. An agent acts.

The leap from "a model that answers questions" to "a model that accomplishes tasks" is, arguably, the defining arc of 2023–2026. It is also where most production AI engineering effort now goes.

An agent is, in its simplest form, an LLM in a loop with tools and a goal. Nothing more.

In plain English. A chatbot is a pen pal. An agent is an intern with a credit card and a laptop.

The agent inside, as a state machine

stateDiagram-v2
    [*] --> Goal
    Goal --> Plan: read goal + memory
    Plan --> Act: choose tool
    Act --> Observe: execute
    Observe --> Reflect: read result
    Reflect --> Plan: not done
    Reflect --> Verify: looks done
    Verify --> Plan: failed checks
    Verify --> [*]: passed

Real systems add three things to that loop: memory (persistent state across turns), safety checks (don't email the customer, don't drop the table), and budgets (no more than N tool calls, no more than M dollars).
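The state diagram can be written down as a plain transition table. The sketch below is only an illustration: the `State` enum, the event strings (copied from the diagram's edge labels), and the `max_transitions` cap are assumptions made for this example, not part of any framework.

```python
from enum import Enum, auto


class State(Enum):
    GOAL = auto()
    PLAN = auto()
    ACT = auto()
    OBSERVE = auto()
    REFLECT = auto()
    VERIFY = auto()
    DONE = auto()


# (current state, event) -> next state, one entry per arrow in the diagram.
TRANSITIONS = {
    (State.GOAL, "read goal + memory"): State.PLAN,
    (State.PLAN, "choose tool"): State.ACT,
    (State.ACT, "execute"): State.OBSERVE,
    (State.OBSERVE, "read result"): State.REFLECT,
    (State.REFLECT, "not done"): State.PLAN,
    (State.REFLECT, "looks done"): State.VERIFY,
    (State.VERIFY, "failed checks"): State.PLAN,
    (State.VERIFY, "passed"): State.DONE,
}


def step(state, event):
    """Follow one arrow; reject events the diagram does not allow."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.name} on {event!r}") from None


def run(events, max_transitions=50):
    """Replay a sequence of events from GOAL, with a hard cap as a crude budget."""
    state = State.GOAL
    for count, event in enumerate(events):
        if count >= max_transitions:
            raise RuntimeError("transition budget exceeded")
        state = step(state, event)
    return state
```

Writing the loop this way makes illegal moves (for example, jumping from Plan straight to Verify) fail loudly instead of silently.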

flowchart LR
    G[Goal] --> P[Plan / think]
    P --> A[Act: call a tool]
    A --> O[Observe result]
    O --> D{Done?}
    D -- no --> P
    D -- yes --> R[Return result]

That loop is old. It echoes Boyd's OODA loop (Observe, Orient, Decide, Act), the feedback loops of classical control theory, and the act-and-observe cycle of reinforcement learning. The Transformer-era twist is that the deciding step ("Plan / think" in the diagram) can be an LLM reading natural language and writing natural language.

9.1 The paper that started it all: ReAct (2022)

Yao et al.'s ReAct paper (Oct 2022) showed that interleaving reasoning and action outperformed both pure reasoning (CoT) and pure action (tool use). The model alternates:

Thought: I need the current price of AAPL.
Action: get_quote("AAPL")
Observation: 198.42
Thought: The user asked in euros. I should convert.
Action: get_fx_rate("USD", "EUR")
Observation: 0.92
Thought: 198.42 * 0.92 = 182.55
Final Answer: AAPL is trading at roughly €182.55.

This Thought / Action / Observation format is the ancestor of the agent frameworks that followed. Modern agents dress it up (JSON instead of text, parallel tool calls, typed schemas), but the core is unchanged.
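To make the text format concrete, here is a rough sketch of how a harness might parse one line of a ReAct transcript. The regular expressions and the `parse_step` name are assumptions for illustration; they handle only double-quoted string arguments, which is exactly the kind of brittleness that later pushed frameworks toward JSON tool calls.

```python
import re

# Matches lines such as: Action: get_quote("AAPL")
ACTION_RE = re.compile(r'^Action:\s*(\w+)\((.*)\)\s*$')
# Pulls out double-quoted string arguments only (an intentional simplification).
ARG_RE = re.compile(r'"([^"]*)"')

LABELS = (
    ("Thought:", "thought"),
    ("Observation:", "observation"),
    ("Final Answer:", "final"),
)


def parse_step(line):
    """Classify one transcript line as (kind, payload)."""
    line = line.strip()
    match = ACTION_RE.match(line)
    if match:
        name, raw_args = match.group(1), match.group(2)
        return "action", (name, ARG_RE.findall(raw_args))
    for label, kind in LABELS:
        if line.startswith(label):
            return kind, line[len(label):].strip()
    raise ValueError(f"unrecognized ReAct line: {line!r}")


transcript = [
    "Thought: I need the current price of AAPL.",
    'Action: get_quote("AAPL")',
    "Observation: 198.42",
]
for raw in transcript:
    kind, payload = parse_step(raw)
    print(kind, payload)
```

A harness built on this would execute the tool named in each `action` step and append the result as the next `Observation:` line before calling the model again.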

9.2 The 2023 Cambrian explosion

By spring 2023, agent demos were everywhere:

AutoGPT (March 2023) — a viral Python script: give it a goal, watch it loop. Rarely finished real tasks, but showed everyone the vision.
BabyAGI — three agents: task creator, executor, prioritizer. Minimalism.
LangChain Agents — the first production framework, warts and all.
HuggingGPT / Toolformer papers explored the theory.

They were chaotic because:

Context windows were tiny — agents forgot the goal.
Tool use was string-parsed — fragile.
No evals — no way to iterate.
Loops without budgets — agents burned dollars chasing their tails.

But they proved the shape.

9.3 What modern agents actually look like (2024–2026)

flowchart TB
    subgraph Runtime
    L[LLM: reasoning core]
    ST[Short-term state<br/>conversation + scratch]
    MEM[Long-term memory<br/>vector + summaries]
    end
    subgraph Tools
    T1[Retrieval]
    T2[Code execution]
    T3[HTTP / API]
    T4[DB query]
    T5[Filesystem]
    T6[Browser]
    T7[Domain tools]
    end
    G[Goal] --> L
    L <--> ST
    L <--> MEM
    L --> T1
    L --> T2
    L --> T3
    L --> T4
    L --> T5
    L --> T6
    L --> T7
    T1 --> L
    T2 --> L
    T3 --> L
    T4 --> L
    T5 --> L
    T6 --> L
    T7 --> L
    L --> R[Result]

Common patterns:

Planner + executor. The model plans out loud, then executes each step.
Plan-and-reflect. The model pauses periodically to criticize its own progress.
Scratchpad memory. A rolling file the agent writes notes into.
Episodic memory. Past tasks stored in a vector DB, retrieved when relevant.
Budget-bounded loops. Stop after N steps, M seconds, or K dollars.
Checkpoints / approvals. Pause for human review on destructive actions.
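As one illustration of the last pattern, the sketch below wraps a tool registry so that destructive tools run only after a reviewer says yes. Everything here is hypothetical: the `GatedTools` class, the `DESTRUCTIVE` set, and the tool names are invented for the example, and in a real system `approve` would block on a review UI or chat prompt rather than a lambda.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical names of tools that must never run without a human saying yes.
DESTRUCTIVE = {"send_email", "drop_table"}


@dataclass
class GatedTools:
    tools: Dict[str, Callable]            # tool name -> implementation
    approve: Callable[[str, dict], bool]  # human decision hook
    log: List[Tuple[str, str]] = field(default_factory=list)

    def call(self, name, args):
        """Run a tool, pausing for approval first if it is destructive."""
        if name in DESTRUCTIVE and not self.approve(name, args):
            self.log.append((name, "denied"))
            return {
                "ok": False,
                "error": "human reviewer declined this action",
                "suggestion": "choose a non-destructive alternative",
            }
        self.log.append((name, "ran"))
        return {"ok": True, "result": self.tools[name](**args)}


gate = GatedTools(
    tools={
        "search": lambda q: f"results for {q}",
        "send_email": lambda to, body: f"sent to {to}",
    },
    approve=lambda name, args: False,  # stand-in reviewer that always declines
)
print(gate.call("search", {"q": "refund policy"}))
print(gate.call("send_email", {"to": "customer@example.com", "body": "hi"}))
```

The denial comes back as an ordinary tool result, so the model can read it and plan around it instead of crashing the run.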

9.4 Agents in the wild

Categories that matter in 2026:

Coding agents — Claude Code, Cursor agent, GitHub Copilot Workspace, Devin, Cognition's Cascade. Plan, edit, run tests, open PRs.
Computer-use agents — Claude computer use, OpenAI Operator, Google Project Mariner. Drive a browser or desktop.
Research agents — ChatGPT Deep Research, Gemini Deep Research, Perplexity Pro. Multi-hop web search + synthesis.
Customer-facing agents — Sierra (support), Decagon (support), 11x (sales), Paradigm (recruiting).
Infra agents — Resolve AI (on-call), Runbook-style SRE agents, autonomous security triage.
Data / analytics agents — NL-to-SQL agents, dashboarding agents, ETL generators.

flowchart LR
    subgraph Coding
    A1[Claude Code]
    A2[Cursor agent]
    A3[Devin]
    end
    subgraph Computer-use
    B1[Claude CU]
    B2[OpenAI Operator]
    B3[Project Mariner]
    end
    subgraph Research
    C1[Gemini DR]
    C2[ChatGPT DR]
    C3[Perplexity]
    end
    subgraph Ops
    D1[Resolve AI]
    D2[Runbooks]
    end
    subgraph Business
    E1[Sierra]
    E2[Decagon]
    E3[11x]
    end

9.5 The "just build an agent" temptation

Don't. The Anthropic guidance (Building Effective Agents, 2024) is the single best piece of advice here:

Use the simplest thing that works.

Often you don't need an agent — you need a well-structured workflow with an LLM call or two. Full agentic autonomy (the model decides the number of steps and the order) is the most expensive, least predictable, and hardest to debug pattern. Use it only when:

The task genuinely has an unbounded number of steps.
The path depends on intermediate results you can't enumerate.
The cost of a wrong step is low (or checkpointable).

When a deterministic workflow suffices, write the workflow. When an LLM routes between workflows, call it a "router." Reserve "agent" for the cases where the model really does need to make decisions inside a loop.

flowchart TD
    A[Task] --> B{Bounded steps<br/>enumerable?}
    B -- yes --> C[Workflow<br/>with LLM calls]
    B -- no --> D{Safe to explore?}
    D -- yes --> E[Agent]
    D -- no --> F[Workflow with<br/>human checkpoints]
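The decision tree is small enough to write as a function, and the "router" case is just a classifier in front of a dictionary of workflows. The sketch below is illustrative only: `choose_pattern`, `route`, and the example labels are made-up names, and `classify` stands in for whatever LLM call (or rule) produces the label.

```python
from typing import Callable, Dict


def choose_pattern(steps_enumerable: bool, safe_to_explore: bool) -> str:
    """Mirror the flowchart: prefer a workflow, fall back to an agent only when safe."""
    if steps_enumerable:
        return "workflow with LLM calls"
    if safe_to_explore:
        return "agent"
    return "workflow with human checkpoints"


def route(
    task: str,
    classify: Callable[[str], str],
    workflows: Dict[str, Callable[[str], str]],
) -> str:
    """A router: one classification step picks a deterministic workflow."""
    label = classify(task)
    handler = workflows.get(label)
    if handler is None:
        raise KeyError(f"no workflow registered for label {label!r}")
    return handler(task)


# Stand-in classifier; in practice this would be a single LLM call.
def keyword_classifier(task: str) -> str:
    return "refund" if "refund" in task.lower() else "general"


workflows = {
    "refund": lambda task: "ran refund workflow",
    "general": lambda task: "ran general workflow",
}
print(choose_pattern(steps_enumerable=False, safe_to_explore=True))
print(route("Please refund order 1042", keyword_classifier, workflows))
```

The router makes exactly one model-driven decision and then hands off to code you can test line by line, which is why it is cheaper to debug than a full loop.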

9.6 What makes an agent reliable

The production art of agents, as learned the hard way:

Clear, narrow tools. 5–15 well-described tools. Not 100.
Structured errors. Tools return {ok: false, error: "...", suggestion: "..."}; the model self-corrects.
Budgets. Max steps, max tokens, max wall clock, max dollars. All enforced outside the model.
Observability. Every step, tool call, and token logged with a trace ID.
Reflection. Periodic "what have I learned, what's next?" calls keep long runs coherent.
Checkpoints. Long-running agents persist state (DB / filesystem) so you can resume after failure.
Evals. A golden set of tasks with expected outcomes; run on every prompt or model change.
Human gates. For any action that spends money, touches prod, or sends a customer message.
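Two items from the list above, budgets and structured errors, fit in a few lines each. This is a minimal sketch under stated assumptions: the `Budget` class, `BudgetExceeded`, and `safe_call` are names invented here, and the per-step dollar figure is whatever your own cost accounting reports.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised by the harness, not the model, when any limit is crossed."""


@dataclass
class Budget:
    max_steps: int
    max_seconds: float
    max_dollars: float
    steps: int = 0
    dollars: float = 0.0
    started: float = field(default_factory=time.monotonic)

    def charge(self, dollars: float = 0.0) -> None:
        """Record one step and its cost; raise once a limit is exceeded."""
        self.steps += 1
        self.dollars += dollars
        if self.steps > self.max_steps:
            raise BudgetExceeded("step limit")
        if self.dollars > self.max_dollars:
            raise BudgetExceeded("dollar limit")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("wall-clock limit")


def safe_call(tool, args):
    """Run a tool and turn any exception into a structured error the model can read."""
    try:
        return {"ok": True, "result": tool(**args)}
    except Exception as exc:
        return {
            "ok": False,
            "error": f"{type(exc).__name__}: {exc}",
            "suggestion": "check the arguments against the tool schema and retry",
        }


budget = Budget(max_steps=3, max_seconds=30.0, max_dollars=0.50)
budget.charge(dollars=0.02)                       # one model call
print(safe_call(lambda x: 1 / x, {"x": 0}))       # structured error, no traceback
```

Calling `budget.charge()` once per loop iteration keeps enforcement in the harness, so a confused model cannot talk its way past the limit.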

9.7 A minimal agent in Python

Here is the shape. In production you'd use LangGraph, OpenAI Agents SDK, or Claude Agent SDK — but seeing it bare is worth a page.

from anthropic import Anthropic

client = Anthropic()

TOOLS = [
    {
        "name": "search",
        "description": "Search the company wiki.",
        "input_schema": {
            "type": "object",
            "properties": {"q": {"type": "string"}},
            "required": ["q"],
        },
    },
    {
        "name": "finish",
        "description": "Return the final answer to the user.",
        "input_schema": {
            "type": "object",
            "properties": {"answer": {"type": "string"}},
            "required": ["answer"],
        },
    },
]

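def wiki_search(q):
    # Hypothetical stand-in: replace with a real call to your wiki's search API.
    return [{"title": "placeholder", "snippet": f"no index wired up for {q!r}"}]

def run_tool(name, args):
    # Dispatch a tool call by name; unknown names come back as an error the model can read.
    if name == "search":
        return wiki_search(args["q"])
    if name == "finish":
        return {"done": True, "answer": args["answer"]}
    return {"error": f"unknown tool {name}"}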

def agent(goal, max_steps=10):
    messages = [{"role": "user", "content": goal}]
    for step in range(max_steps):
        resp = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=1024,
            tools=TOOLS,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": resp.content})

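        tool_uses = [b for b in resp.content if b.type == "tool_use"]
        if not tool_uses:
            # Model answered in plain text without calling finish. Return the text,
            # so agent() always returns a string rather than a Message object.
            return "".join(b.text for b in resp.content if b.type == "text")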

        results = []
        for tu in tool_uses:
            result = run_tool(tu.name, tu.input)
            if tu.name == "finish":
                return result["answer"]
            results.append({
                "type": "tool_result",
                "tool_use_id": tu.id,
                "content": str(result),
            })
        messages.append({"role": "user", "content": results})

    raise RuntimeError("agent exceeded step budget")

Everything else — multi-agent, MCP, long-running workflows — is a variation of this loop.
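One such variation is worth showing because it makes the loop testable offline: pass the model call in as a parameter and replay a scripted conversation. The sketch below is an assumption-laden illustration, not an SDK feature. `run_agent`, `scripted_model`, and `tool_use` are invented helpers, and the fake blocks only imitate the `type`, `id`, `name`, `input`, and `text` attributes the loop reads.

```python
from types import SimpleNamespace


def run_agent(call_model, tools, goal, max_steps=10):
    """Same loop shape as the listing above, with model and tools injected."""
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        blocks = call_model(messages)
        messages.append({"role": "assistant", "content": blocks})
        tool_uses = [b for b in blocks if b.type == "tool_use"]
        if not tool_uses:
            return "".join(b.text for b in blocks if b.type == "text")
        results = []
        for tu in tool_uses:
            if tu.name == "finish":
                return tu.input["answer"]
            results.append({
                "type": "tool_result",
                "tool_use_id": tu.id,
                "content": str(tools[tu.name](**tu.input)),
            })
        messages.append({"role": "user", "content": results})
    raise RuntimeError("agent exceeded step budget")


def scripted_model(script):
    """Fake call_model that replays a fixed list of responses, one per turn."""
    replies = iter(script)
    return lambda messages: next(replies)


def tool_use(id_, name, **input_):
    """Build a fake tool_use block with the attributes the loop reads."""
    return SimpleNamespace(type="tool_use", id=id_, name=name, input=input_)


script = [
    [tool_use("t1", "search", q="vacation policy")],
    [tool_use("t2", "finish", answer="20 days")],
]
tools = {"search": lambda q: f"stub result for {q}"}
answer = run_agent(scripted_model(script), tools, "How many vacation days?")
print(answer)
```

The same injection point is where a golden-set eval would plug in: swap the scripted model for the real client and compare each returned answer against the expected outcome.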

9.8 The honest state of agents in 2026

What works well: bounded multi-step tasks (2–20 steps), with 5–15 clean tools, in domains where mistakes are correctable. Code agents, research agents, internal-ops agents.
What's improving fast: long-horizon autonomy (hours, days), computer use, agents that learn from their mistakes across runs.
What still struggles: agents without clear goals, agents asked to handle real money without human oversight, agents with massively overlapping tools, agents without any evals.

If you're building an agent, the question to keep asking is: "would a junior engineer with this toolkit and these instructions succeed?" If the answer is "probably not," your agent won't either.