Coding Agents Internals Series — Presentation 03

Components of a Coding Agent

A companion to Sebastian Raschka’s essay — the harness mental model, the four-step loop, and the six components that turn an LLM into Claude Code or Codex CLI. The first of five decks unpacking the article in detail.

Read the original article →

00

Topics We’ll Cover

How this series maps onto Raschka’s article

This deck covers the framing and the high-level architecture. Decks 04–07 take each of the six components in turn, with diagrams, code, and reflective exercises. The whole series is a companion — read alongside the article, not instead of it.

01

Why “Just Call the LLM” Stops Working

The first time you wire messages.create() to your terminal, the experience is intoxicating. You type “refactor the auth flow to use JWT” and the model produces beautiful code. Then you actually try to use it on a real codebase, and something breaks. The model patches a function that no longer exists. It hallucinates an import path. It edits the wrong file. It re-reads the same 4,000-line config on every turn.
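
For concreteness, here is roughly what that first version looks like: a minimal sketch using the Anthropic Python SDK, with the model name purely illustrative.

  import anthropic

  client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

  def ask(task: str) -> str:
      # One forward pass: no repo context, no tools, no memory across calls.
      response = client.messages.create(
          model="claude-sonnet-4-5",  # illustrative; any chat model slots in
          max_tokens=4096,
          messages=[{"role": "user", "content": task}],
      )
      return response.content[0].text

  print(ask("refactor the auth flow to use JWT"))

Every failure mode described above traces back to the fact that this function sees nothing but the user's sentence.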

Raschka opens with a deceptively simple observation: much of the recent progress in practical LLM systems is not just about better models, but about how we use them. The implication runs deep — the gap between “LLM with code in chat” and “Claude Code” is not mostly the model. It is the surrounding software.

What you actually build over time

  • A way to summarise the repo upfront so the model isn’t starting cold every turn.
  • A structured tool layer that turns the model’s “run this command” into a validated, approvable, sandboxed action.
  • A memory system that compresses, deduplicates, and prioritises what stays in the prompt.
  • A way to delegate bounded subtasks rather than overloading one prompt.

What this software collectively becomes

Raschka calls it the agent harness: “a layer on top, which can be understood as a control loop around the model” — software scaffolding that manages context, tool use, prompts, state, and control flow. When the harness is specialised for software-engineering tasks, it becomes a coding harness. Claude Code and Codex CLI are coding harnesses; their underlying models are not unique.

Pause and ask yourself

If you had to write down everything Cursor or Claude Code does between your keystroke and the LLM’s reply, how many discrete steps would you list? Try this before reading on. The article’s six components are essentially the answer — but seeing them yourself first makes the rest land harder.

02

LLM vs Reasoning Model vs Agent

Three terms that get used interchangeably in marketing copy. Raschka draws sharp lines between them, because each line marks a real difference in how the system is built.

Term | What it is | Where the work happens
LLM | The raw model — a next-token engine. Trained on a mix of language, code and human feedback. | One forward pass produces one stream of tokens. No memory across calls. No tools.
Reasoning Model | An LLM further trained for “intermediate reasoning, verification, or search” — o1, R1, Claude with extended thinking, Gemini Thinking. | Still a forward pass, but the model is incentivised to spend many tokens on hidden reasoning before producing the visible answer.
Agent | An LLM (reasoning or not) wrapped in a control loop that calls tools, manages memory, and iterates until a task is done. | Across many forward passes — with software in between deciding what to feed the model next.

An engine metaphor that scales

Raschka offers a useful image: think of the LLM as an engine, the reasoning model as a beefed-up engine, and the agent harness as the rest of the car — the chassis, transmission, dashboard, and driver-assist features that turn the engine into something useful. A bigger engine matters, but a great chassis around a smaller engine often outperforms a bare engine alone.

[Figure: a spectrum from LLM (engine: next-token prediction, no state) to reasoning model (engine + thinking: trained to spend tokens on internal verification) to agent harness (control loop, tools, memory, prompt construction, sandbox, approval gates, subagents; wraps a model of any kind, domain-tuned for software work). Specialisation increases left to right.]

Raschka’s sharper version of the claim

“A good coding harness can make a reasoning and a non-reasoning model feel much stronger than it does in a plain chat box.” The harness changes what task the model is being asked to do on each turn — from “solve this entire engineering problem in your head” to “take one careful step on a problem someone has already laid out for you”.

03

The Engine Metaphor — Where the Harness Sits

Zooming in: a coding harness is not a single algorithm. It is a stack of small, mostly independent pieces, each designed to make the next forward pass cheaper, safer, or more accurate.

User intent — “fix the failing tests”
Workspace summary — what kind of project, branch, key files
Stable prompt prefix — system, tools, repo facts (cacheable)
Session memory — compact transcript & working notes
Tool layer — validate, approve, run, return bounded results
Optional subagent — spawn for side question, return summary
LLM forward pass — reads context, emits next action

Each box is a software component the harness owns. The model itself appears once at the bottom — everything above it is decision-making about what should reach the model next. That is what Raschka means by “a control loop around the model”.
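
A minimal sketch of how a harness might assemble that stack into a single request. Every name here (STABLE_PREFIX, memory_notes, observation, build_request) is illustrative rather than taken from the article.

  def build_request(stable_prefix: str, memory_notes: list[str], observation: str) -> dict:
      # The cacheable head: system prompt, tool specs, workspace summary.
      # The changing tail: distilled session memory plus the newest observation.
      tail = "\n".join(memory_notes[-20:])  # keep only recent working notes
      user_turn = f"{tail}\n\nLatest observation:\n{observation}"
      return {
          "system": stable_prefix,
          "messages": [{"role": "user", "content": user_turn}],
      }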

The runtime supports

Raschka’s Figure 3 separates three layers of responsibility: the model family (the engine), the agent loop (the iterative driver), and the runtime supports (sandbox, file system, shell, network, IDE bindings, telemetry). The runtime supports often get hand-waved past, but they are what gives the agent its powers and its safety properties — the difference between “the agent suggested running rm -rf” and “the agent ran rm -rf in a sandboxed worktree”.

Model family

Which Claude / GPT / open-weight model. Provider, version, pricing, latency, context window. Often pluggable: claude-opus-4-7 today, claude-opus-5 tomorrow.

Agent loop

The control structure: how the harness decides when to call the model, when to call a tool, when to ask the user, when to stop. The loop also chooses which tool result to pass back.

Runtime supports

Filesystem, shell, network access, sandbox, approval UI, IDE integration, telemetry. The boring infrastructure that determines what the agent can do safely.
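
To make “sandbox plus approval UI” concrete, here is a toy sketch. The allow-list, the sandbox directory, and the y/N prompt are all assumptions for illustration, not the article's design.

  import shlex
  import subprocess

  ALLOWED = {"git", "ls", "cat", "pytest"}  # toy allow-list of safe commands

  def run_in_sandbox(command: str, sandbox_dir: str) -> str:
      argv = shlex.split(command)
      if argv[0] not in ALLOWED:
          # Anything off the allow-list is gated behind an explicit human yes.
          if input(f"Agent wants to run {command!r}. Allow? [y/N] ").lower() != "y":
              return "denied by user"
      result = subprocess.run(
          argv, cwd=sandbox_dir, capture_output=True, text=True, timeout=60
      )
      return (result.stdout + result.stderr)[:4000]  # bound what flows back

The difference between “suggested rm -rf” and “ran rm -rf safely” lives in the cwd=sandbox_dir line and in whatever worktree isolation sits behind it.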

04

The Four-Step Loop — observe, inspect, choose, act

At the heart of every coding harness is a tight, iterative loop. Raschka summarises it in four words: observe · inspect · choose · act. Cursor, Aider, Claude Code, and Codex CLI all implement variants of this same loop.

[Figure: the harness control loop: observe (read what changed) → inspect (prompt + tool results) → choose (next action) → act (tool / edit / spawn), cycling back to observe.]

Mapping each step

  1. observe — the harness gathers what has changed since the last turn: file diffs, last tool result, user message, build status, test failures. This is harness work, not model work.
  2. inspect — the model receives a carefully shaped prompt: the stable prefix, the relevant slice of memory, the latest observation. This is the only step where the model reasons about “what now?”.
  3. choose — the model emits a structured action: a tool call, a code edit, a question for the user, or a final answer. The harness validates that action.
  4. act — the harness executes the action in a sandbox: writes a file, runs a test, spawns a subagent. The result becomes the next observation, and the loop iterates.
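
Putting the four steps into code: a minimal sketch in which call_model, validate, run_tool, and summarise are hypothetical stand-ins for the harness's real machinery (build_request is the sketch from the previous slide).

  def agent_loop(task: str, workspace) -> str:
      observation = workspace.snapshot(task)    # observe: diffs, tests, user message
      memory: list[str] = []
      while True:
          prompt = build_request(STABLE_PREFIX, memory, observation)  # inspect
          action = call_model(prompt)           # choose: the model's only step
          if action.kind == "final_answer":
              return action.text
          validate(action)                      # harness checks the schema
          result = run_tool(action)             # act: execute in the sandbox
          memory.append(summarise(action, result))
          observation = result                  # the result becomes the next observation
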
A subtle point about agency

Notice that the model is responsible only for the choose step. Everything else is software the harness author wrote. When people say “the agent decided to do X”, they almost always mean “the model emitted an action and the harness ran it”. Mistakes attributed to the model are often mistakes in the harness — in what the model was shown or in how its action was interpreted.

05

The Six Components at a Glance

Raschka decomposes a working coding harness into six components. Each one is a focused engineering problem with its own design space, failure modes, and trade-offs. The rest of this series treats them in detail; this slide gives you the map.

COMPONENT 01

Live Repo Context

Build a workspace summary up front: project layout, key files, branch, test command. Stable facts the agent can reuse instead of starting from zero every turn. Covered in deck 04.
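
A sketch of what “build a workspace summary up front” might mean in practice; the exact fields and the pytest default are assumptions.

  import subprocess
  from pathlib import Path

  def workspace_summary(repo: Path) -> str:
      branch = subprocess.run(
          ["git", "-C", str(repo), "branch", "--show-current"],
          capture_output=True, text=True,
      ).stdout.strip()
      entries = sorted(p.name for p in repo.iterdir() if not p.name.startswith("."))
      return (
          f"Branch: {branch}\n"
          f"Top-level entries: {', '.join(entries[:30])}\n"
          f"Test command: pytest"  # discovered from config in a real harness, not guessed
      )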

COMPONENT 02

Prompt Shape & Cache Reuse

Separate the stable prefix (system, tools, repo summary) from the changing tail (memory, latest message). Lets the harness reuse a multi-thousand-token prefix at roughly a tenth of the cost. Covered in deck 04.
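
With the Anthropic API, for example, the split can be expressed by marking the stable block with cache_control; SYSTEM_PROMPT_AND_REPO_SUMMARY and changing_tail are placeholders.

  response = client.messages.create(
      model="claude-sonnet-4-5",  # illustrative
      max_tokens=4096,
      system=[
          {
              "type": "text",
              "text": SYSTEM_PROMPT_AND_REPO_SUMMARY,  # stable multi-thousand-token prefix
              "cache_control": {"type": "ephemeral"},  # cache everything up to here
          }
      ],
      messages=changing_tail,  # session memory + latest observation
  )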

COMPONENT 03

Tool Access & Use

Move from prose suggestions to bounded, validated actions. Each tool is named, schema-checked, optionally approved, and sandboxed. Less freedom, more reliability. Covered in deck 05.
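
In the Anthropic tool-use API, “named and schema-checked” looks roughly like this; the read_file tool is the one the exercises at the end of this deck suggest building, and the schema details are my assumption.

  READ_FILE_TOOL = {
      "name": "read_file",
      "description": "Read a file from the repository and return its contents.",
      "input_schema": {
          "type": "object",
          "properties": {
              "path": {"type": "string", "description": "Path relative to the repo root"},
          },
          "required": ["path"],
      },
  }
  # Passed as tools=[READ_FILE_TOOL]: the model can then emit only well-formed
  # {"path": ...} calls, which the harness still validates and sandboxes before running.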

COMPONENT 04

Minimising Context Bloat

Long contexts get expensive and noisy. Clip large outputs, dedupe repeated reads, summarise older turns, weight by recency. “A lot of apparent model quality is really context quality.” Covered in deck 06.
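
Even the crudest version of “clip and dedupe” is worth sketching; the thresholds here are arbitrary.

  def clip(output: str, limit: int = 4000) -> str:
      # Keep the head and tail; the middle of a long log is rarely load-bearing.
      if len(output) <= limit:
          return output
      half = limit // 2
      return output[:half] + f"\n... [{len(output) - limit} chars clipped] ...\n" + output[-half:]

  seen_reads: dict[str, str] = {}

  def dedupe_read(path: str, contents: str) -> str:
      # Re-reading an unchanged file should cost one line, not thousands.
      if seen_reads.get(path) == contents:
          return f"[{path} unchanged since last read]"
      seen_reads[path] = contents
      return contents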

COMPONENT 05

Structured Session Memory

A small, distilled working memory (current task, important files, notes) sits beside the full transcript on disk. Storage-time discipline, not just prompt-time compression. Covered in deck 06.
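
A working-memory file can be as simple as a small JSON document the harness rewrites each turn; the location and fields here are illustrative.

  import json
  from pathlib import Path

  MEMORY_FILE = Path(".agent/working_memory.json")  # hypothetical location

  def save_memory(current_task: str, important_files: list[str], notes: list[str]) -> None:
      MEMORY_FILE.parent.mkdir(parents=True, exist_ok=True)
      MEMORY_FILE.write_text(json.dumps({
          "current_task": current_task,
          "important_files": important_files,
          "notes": notes[-10:],  # storage-time discipline: keep it distilled
      }, indent=2))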

COMPONENT 06

Bounded Subagents

Delegate side questions and parallelisable work to subagents that inherit context but operate within tighter boundaries. Spawn carefully — how to bind a subagent matters as much as how to start one. Covered in deck 07.
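
A toy illustration of “inherit context but operate within tighter boundaries”. The Agent class and its arguments are assumptions; the point is what is inherited, what is narrowed, and what flows back.

  def spawn_subagent(question: str, parent_summary: str) -> str:
      sub = Agent(                        # hypothetical subagent constructor
          context=parent_summary,         # inherited: the compact workspace summary
          tools=["read_file", "grep"],    # narrowed: a subset of the parent's tools
          max_turns=8,                    # bounded: a hard budget, so it cannot wander
      )
      return sub.run(question)            # only the distilled answer returns to the parent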

Why this is a list, not a hierarchy

None of the six components subsumes another — they are roughly orthogonal axes of harness design. You can have a great tool layer and a terrible memory system, or a tight memory system with no subagents. Real systems differ on which they prioritise: Claude Code emphasises tool-driven retrieval and bounded subagents; Aider leans on a precomputed repo map; Codex sandboxes hard. Raschka’s contribution is the vocabulary for talking about these choices.

06

Why the Harness Can Beat the Bigger Model

One of Raschka’s more provocative claims: “the harness can often be the distinguishing factor that makes one LLM work better than another”. He goes further and speculates that capable open-weight models, given equivalent harnesses, would produce equivalent end-user experiences. Why might that be true?

Per-step cognitive load

Without a harness, the model is asked to plan, read, decide, write, verify — all in one forward pass. With a harness, each pass does one tiny step on a well-bounded problem. A 70B model doing one easy step often beats a 400B model doing one hard step.

Recovery from mistakes

A bare-LLM mistake is final — the wrong answer is the answer. A harness mistake is recoverable — the next turn sees the failed test, the bad diff, the missing file. The loop turns one-shot accuracy into something closer to a search procedure with feedback.

Externalised state

The model has no working memory across calls; the harness does. Decisions taken three turns ago are still in the working-memory file. The model contributes intuition; the harness contributes continuity.

Tools widen the action space

An LLM can suggest git status; a harness can run it and feed back the actual output. The agent observes ground truth, not its own predictions about ground truth. This turns out to matter enormously for correctness on real codebases.

Implication for system designers

If you find yourself reaching for a more capable (and more expensive) model to fix a quality issue, ask first: can I shape the next prompt better, or break this turn into smaller pieces, or feed it real tool output instead of asking it to guess? Raschka’s essay is, in part, a long argument that those harness-level levers are usually cheaper and almost always more durable.

Where the harness cannot save you

If the workspace summary is accurate, the tools are sound, the memory is clean, and the task is well scoped, yet the model still produces the wrong code, then the bottleneck really is the engine. The harness does not make the model smarter; it ensures that when a failure happens, the failure is genuinely the model's.

07

Reading the Rest of the Series

Each of the next four decks expands one or two components in depth, with diagrams, code, and prompts to take into your own work. Read them in order or jump to whichever component you’re actively working on.

  1. Live Repo Context & Prompt-Cache Reuse →
    Components 1 + 2. What goes into a workspace summary; the prompt-prefix architecture; how the Anthropic / OpenAI prompt cache turns a 6k-token system prompt from a recurring tax into a one-time cost.
  2. Tool Access & Use →
    Component 3. From prose suggestions to bounded actions: tool-call schemas, validation gates, approval UX, sandbox boundaries, and the design discipline that keeps the model honest.
  3. Minimising Context Bloat & Structured Session Memory →
    Components 4 + 5. Clipping, deduplication, recency-weighted summarisation. Working memory vs full transcript — storage-time structure that survives across turns and even sessions.
  4. Bounded Subagents & Synthesis →
    Component 6 plus a closing synthesis. Why delegate, how to bind rather than just spawn, sandboxing patterns, Claude Code vs Codex differences, and a reading list to take you further.

08

Things to Try Yourself

The article is much more useful if you do something with it — even a small thing. Here are six exercises, in increasing depth, that you can do alongside the series.

30 minutes — observe Claude Code

Open Claude Code on a small repo and run a single task. Note every tool call you see in the transcript. Try to label each call as observe, inspect, choose, or act. Where does the harness end and the model begin?

1 hour — read the mini coding agent

Raschka’s reference implementation lives at github.com/rasbt/mini-coding-agent. Walk the code with the article open in another tab. Find the line where each of the six components is implemented.

2 hours — profile your own prompt

If you’ve built any agentic system, dump a real prompt and label each section. What fraction is stable prefix? Changing tail? How much of the changing tail is genuinely new vs duplicated content? You’ll find waste.

An afternoon — build a 200-line harness

Wire messages.create() + a single read_file tool + a working-memory JSON file. Make the loop iterate until the model says “done”. The discomfort of choosing context shapes will teach you more than any article.
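
A starting skeleton for that exercise, stitching together the earlier sketches. Everything beyond the messages.create call (SYSTEM_PROMPT, READ_FILE_TOOL, build_tail, read_file, update_memory, load_memory) is an assumption, not a prescription.

  def toy_harness(task: str) -> str:
      memory = load_memory()                       # the working-memory JSON file
      observation = task
      while True:
          response = client.messages.create(
              model="claude-sonnet-4-5",           # illustrative
              max_tokens=2048,
              system=SYSTEM_PROMPT,
              tools=[READ_FILE_TOOL],
              messages=build_tail(memory, observation),
          )
          if response.stop_reason != "tool_use":
              return response.content[0].text      # the model says it is done
          call = next(b for b in response.content if b.type == "tool_use")
          observation = read_file(call.input["path"])  # the single tool
          memory = update_memory(memory, call, observation)
  # A production harness would return a proper tool_result block to the API;
  # build_tail stands in for that bookkeeping here.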

A weekend — replace one component

In your toy harness, swap one component for a smarter version: add prompt caching, or a clipping rule, or a validating tool layer. Measure cost and accuracy before and after on a fixed task suite of your own.

Ongoing — pattern-spot in others’ agents

Whenever you use Cursor, Aider, Codex, or any new agent, identify which of the six components it leans on hardest and which it skimps on. The vocabulary turns marketing copy into reverse-engineering.

A meta-prompt to take into your team

Next time someone says “the model isn’t smart enough”, ask the six-component question: did we give it the right context, the right tools, the right memory, and the right task scope? If all of those are honestly “yes”, only then is it the model’s fault.