A companion to Sebastian Raschka’s essay — the harness mental model, the four-step loop, and the six components that turn an LLM into Claude Code or Codex CLI. The first of five decks unpacking the article in detail.
This deck covers the framing and the high-level architecture. Decks 04–07 take each of the six components in turn, with diagrams, code, and reflective exercises. The whole series is a companion — read alongside the article, not instead of it.
The first time you wire messages.create() to your terminal, the experience is intoxicating. You type “refactor the auth flow to use JWT” and the model produces beautiful code. Then you actually try to use it on a real codebase, and something breaks. The model patches a function that no longer exists. It hallucinates an import path. It edits the wrong file. It re-reads the same 4,000-line config on every turn.
Raschka opens with a deceptively simple observation: much of the recent progress in practical LLM systems is not just about better models, but about how we use them. The implication runs deep — the gap between “LLM with code in chat” and “Claude Code” is not mostly the model. It is the surrounding software.
Raschka calls it the agent harness: “a layer on top, which can be understood as a control loop around the model” — software scaffolding that manages context, tool use, prompts, state, and control flow. When the harness is specialised for software-engineering tasks, it becomes a coding harness. Claude Code and Codex CLI are coding harnesses; their underlying models are not unique.
If you had to write down everything Cursor or Claude Code does between your keystroke and the LLM’s reply, how many discrete steps would you list? Try this before reading on. The article’s six components are essentially the answer — but seeing them yourself first makes the rest land harder.
Three terms that marketing copy uses interchangeably. Raschka draws sharp lines between them, because each marks a real distinction in how the system is built.
| Term | What it is | Where the work happens |
|---|---|---|
| LLM | The raw model — a next-token engine. Trained on a mix of language, code and human feedback. | One forward pass produces one stream of tokens. No memory across calls. No tools. |
| Reasoning Model | An LLM further trained for “intermediate reasoning, verification, or search” — o1, R1, Claude with extended thinking, Gemini Thinking. | Still a forward pass, but the model is incentivised to spend many tokens on hidden reasoning before producing the visible answer. |
| Agent | An LLM (reasoning or not) wrapped in a control loop that calls tools, manages memory, and iterates until a task is done. | Across many forward passes — with software in between deciding what to feed the model next. |
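To make the first two rows concrete: both are a single messages.create() call, and the reasoning variant only adds a budget of hidden tokens spent before the visible answer. A minimal sketch, assuming the Anthropic Python SDK; the model id and the traceback prompt are placeholders. The agent row is sketched under the four-step loop later in this deck.

```python
# Rows 1 and 2 are the same single forward pass; row 2 only adds a budget of hidden
# reasoning tokens before the visible answer. Sketch assumes the Anthropic Python SDK;
# the model id is a placeholder. Row 3 (the agent) is sketched under the four-step
# loop later in this deck.
import anthropic

client = anthropic.Anthropic()
question = [{"role": "user", "content": "Explain this traceback: ..."}]

# Row 1 — plain LLM: one request, one stream of tokens, no memory across calls, no tools.
plain = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; swap per your provider
    max_tokens=1024,
    messages=question,
)

# Row 2 — reasoning model: still one forward pass, but with a token budget the model
# may spend on intermediate reasoning before answering.
reasoned = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=8000,                      # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 4000},
    messages=question,
)
```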
Raschka offers a useful image: think of the LLM as an engine, the reasoning model as a beefed-up engine, and the agent harness as the rest of the car — the chassis, transmission, dashboard, and driver-assist features that turn the engine into something useful. A bigger engine matters, but a great chassis around a smaller engine often outperforms a bare engine alone.
“A good coding harness can make a reasoning and a non-reasoning model feel much stronger than it does in a plain chat box.” The harness changes what task the model is being asked to do on each turn — from “solve this entire engineering problem in your head” to “take one careful step on a problem someone has already laid out for you”.
Zooming in: a coding harness is not a single algorithm. It is a stack of small, mostly independent pieces, each designed to make the next forward pass cheaper, safer, or more accurate.
Each box is a software component the harness owns. The model itself appears once at the bottom — everything above it is decision-making about what should reach the model next. That is what Raschka means by “a control loop around the model”.
Raschka’s Figure 3 separates three layers of responsibility: the model family (the engine), the agent loop (the iterative driver), and the runtime supports (sandbox, file system, shell, network, IDE bindings, telemetry). The runtime supports often get hand-waved past, but they are what gives the agent its powers and its safety properties — the difference between “the agent suggested running rm -rf” and “the agent ran rm -rf in a sandboxed worktree”.
Which Claude / GPT / open-weight model. Provider, version, pricing, latency, context window. Often pluggable: claude-opus-4-7 today, claude-opus-5 tomorrow.
The control structure: how the harness decides when to call the model, when to call a tool, when to ask the user, when to stop. The loop also chooses which tool result to pass back.
Filesystem, shell, network access, sandbox, approval UI, IDE integration, telemetry. The boring infrastructure that determines what the agent can do safely.
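One hypothetical way to internalise the split is to write the three layers down as plain data. Every field name below is invented for this sketch, and real harnesses organise the layers differently; the point is that only the first block names a model, and everything else is ordinary software.

```python
# The three layers as plain data. All names here are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class ModelFamily:                 # layer 1: the engine
    provider: str = "anthropic"
    model_id: str = "claude-sonnet-4-20250514"   # pluggable without touching the loop
    context_window: int = 200_000

@dataclass
class RuntimeSupports:             # layer 3: what the agent may actually touch
    workspace_root: str = "./repo"
    sandboxed: bool = True
    network_access: bool = False
    shell_allowlist: tuple[str, ...] = ("git", "rg", "pytest")
    approve_writes: bool = True    # surface an approval UI before edits land

@dataclass
class Harness:                     # layer 2: the agent loop consumes both of the above
    model: ModelFamily = field(default_factory=ModelFamily)
    runtime: RuntimeSupports = field(default_factory=RuntimeSupports)
    max_turns: int = 40
```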
At the heart of every coding harness is a tight, iterative loop. Raschka summarises it in four words: observe · inspect · choose · act. Cursor, Aider, Claude Code, and Codex CLI all implement variants of this same loop.
Notice that the model is responsible only for the choose step. Everything else is software the harness author wrote. When people say “the agent decided to do X”, they almost always mean “the model emitted an action and the harness ran it”. Mistakes attributed to the model are often mistakes in the harness — in what the model was shown or in how its action was interpreted.
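As a concrete anchor, here is a minimal sketch of that loop, assuming the Anthropic Python SDK and a single read_file tool. The model id, the turn cap, and the mapping of lines onto the four steps are choices made for this sketch, not the article's implementation; notice that the model appears exactly once.

```python
# A minimal observe · inspect · choose · act loop with one tool. Placeholder names
# throughout; only the messages.create() call is the model, every other line is harness.
import anthropic

client = anthropic.Anthropic()

TOOLS = [{
    "name": "read_file",
    "description": "Return the contents of a file in the working tree.",
    "input_schema": {"type": "object",
                     "properties": {"path": {"type": "string"}},
                     "required": ["path"]},
}]

def read_file(path: str) -> str:
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()

def run_agent(task: str, max_turns: int = 20) -> str:
    messages = [{"role": "user", "content": task}]        # observe: the task and state so far
    for _ in range(max_turns):
        reply = client.messages.create(                   # choose: the only step the model owns
            model="claude-sonnet-4-20250514",             # placeholder model id
            max_tokens=2048,
            tools=TOOLS,
            messages=messages,
        )
        if reply.stop_reason != "tool_use":               # model answered directly: stop
            return "".join(b.text for b in reply.content if b.type == "text")
        messages.append({"role": "assistant", "content": reply.content})
        for block in reply.content:
            if block.type == "tool_use":                  # act: the harness executes the tool
                messages.append({"role": "user", "content": [{
                    "type": "tool_result",                # inspect: the real result becomes
                    "tool_use_id": block.id,              # the next turn's observation
                    "content": read_file(**block.input),
                }]})
    return "Stopped: hit max_turns without a final answer."
```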
Raschka decomposes a working coding harness into six components. Each one is a focused engineering problem with its own design space, failure modes, and trade-offs. The rest of this series treats them in detail; this slide gives you the map.
Build a workspace summary up front: project layout, key files, branch, test command. Stable facts the agent can reuse instead of starting from zero every turn. Covered in deck 04.
Separate the stable prefix (system, tools, repo summary) from the changing tail (memory, latest message). Lets the harness reuse a multi-thousand-token prefix at roughly a tenth of the cost; a minimal caching sketch follows these six components. Covered in deck 04.
Move from prose suggestions to bounded, validated actions. Each tool is named, schema-checked, optionally approved, and sandboxed. Less freedom, more reliability. Covered in deck 05.
Long contexts get expensive and noisy. Clip large outputs, dedupe repeated reads, summarise older turns, weight by recency. “A lot of apparent model quality is really context quality.” Covered in deck 06.
A small, distilled working memory (current task, important files, notes) sits beside the full transcript on disk. Storage-time discipline, not just prompt-time compression. Covered in deck 06.
Delegate side questions and parallelisable work to subagents that inherit context but with tighter boundaries. Spawn carefully — how to bind a subagent matters as much as how to start one. Covered in deck 07.
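To ground the prompt-caching component above: a minimal sketch of the stable-prefix / changing-tail split, assuming the Anthropic SDK's prompt caching via cache_control. The repo summary, working-memory string, and model id are placeholders, and real providers only cache prefixes above a minimum token length, so a summary this small would not actually qualify.

```python
# Stable prefix (system rules + repo summary) vs changing tail (memory + latest messages).
# Assumes Anthropic prompt caching; contents are placeholders. Providers only cache
# prefixes above a minimum length, so a toy summary like this one would not qualify.
import anthropic

client = anthropic.Anthropic()

STABLE_SYSTEM = [
    {"type": "text",
     "text": "You are a careful coding agent. Propose one bounded step per turn."},
    {"type": "text",
     "text": "Repo summary:\n- src/auth/ handles sessions\n- tests: pytest -q\n- branch: main",
     "cache_control": {"type": "ephemeral"}},   # cache breakpoint: everything up to here is reused
]

def next_turn(history: list[dict], working_memory: str):
    # Changing tail: distilled working memory plus the recent messages, rebuilt each turn.
    tail = [{"role": "user", "content": f"Working memory:\n{working_memory}"}] + history
    return client.messages.create(
        model="claude-sonnet-4-20250514",        # placeholder model id
        max_tokens=2048,
        system=STABLE_SYSTEM,                    # identical every turn, so cacheable
        messages=tail,
    )
```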
None of the six components subsumes another — they are roughly orthogonal axes of harness design. You can have a great tool layer and a terrible memory system, or a tight memory system with no subagents. Real systems differ on which they prioritise: Claude Code emphasises tool-driven retrieval and bounded subagents; Aider leans on a precomputed repo map; Codex sandboxes hard. Raschka’s contribution is the vocabulary for talking about these choices.
One of Raschka’s more provocative claims: “the harness can often be the distinguishing factor that makes one LLM work better than another”. He goes further and speculates that capable open-weight models, given equivalent harnesses, would produce equivalent end-user experiences. Why might that be true?
Without a harness, the model is asked to plan, read, decide, write, verify — all in one forward pass. With a harness, each pass does one tiny step on a well-bounded problem. A 70B model doing one easy step often beats a 400B model doing one hard step.
A bare-LLM mistake is final — the wrong answer is the answer. A harness mistake is recoverable — the next turn sees the failed test, the bad diff, the missing file. The loop turns one-shot accuracy into something closer to a search procedure with feedback.
The model has no working memory across calls; the harness does. Decisions taken three turns ago are still in the working-memory file. The model contributes intuition; the harness contributes continuity.
An LLM can suggest git status; a harness can run it and feed back the actual output. The agent observes ground truth, not its own predictions about ground truth. This turns out to matter enormously for correctness on real codebases.
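Here is what observing ground truth can look like in code: a sketch of a run_command tool that executes the command and hands its real, clipped output back to the next turn. The allowlist, timeout, and clip limit are illustrative choices for the sketch; the clipping is the same idea as the context-engineering component earlier.

```python
# Run the command the model asked for and return what actually happened, clipped so one
# noisy command cannot flood the next prompt. Allowlist, timeout, and clip size are
# illustrative, not anyone's fixed policy.
import subprocess

ALLOWED = {"git", "ls", "rg", "pytest"}   # minimal stand-in for a real sandbox policy
MAX_CHARS = 4_000

def run_command(command: list[str], cwd: str = ".") -> str:
    if not command or command[0] not in ALLOWED:
        return f"refused: {command[:1]} is not on the allowlist"
    proc = subprocess.run(command, cwd=cwd, capture_output=True, text=True, timeout=60)
    output = (proc.stdout + proc.stderr).strip() or "(no output)"
    if len(output) > MAX_CHARS:               # clip large outputs, and say so explicitly
        output = output[:MAX_CHARS] + f"\n[clipped to first {MAX_CHARS} characters]"
    return f"exit code {proc.returncode}\n{output}"

# The model suggests `git status`; the harness runs it and feeds back the real answer.
print(run_command(["git", "status", "--short"]))
```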
If you find yourself reaching for a more capable (and more expensive) model to fix a quality issue, ask first: can I shape the next prompt better, or break this turn into smaller pieces, or feed it real tool output instead of asking it to guess? Raschka’s essay is, in part, a long argument that those harness-level levers are usually cheaper and almost always more durable.
Each of the next four decks expands one or two components in depth, with diagrams, code, and prompts to take into your own work. Read them in order or jump to whichever component you’re actively working on.
The article is much more useful if you do something with it — even a small thing. Here are five exercises, in increasing depth, that you can do alongside the series.
Open Claude Code on a small repo and run a single task. Note every tool call you see in the transcript. Try to label each call as observe, inspect, choose, or act. Where does the harness end and the model begin?
Raschka’s reference implementation lives at github.com/rasbt/mini-coding-agent. Walk the code with the article open in another tab. Find the line where each of the six components is implemented.
If you’ve built any agentic system, dump a real prompt and label each section. What fraction is stable prefix? Changing tail? How much of the changing tail is genuinely new vs duplicated content? You’ll find waste.
Wire messages.create() + a single read_file tool + a working-memory JSON file. Make the loop iterate until the model says “done”. The discomfort of choosing context shapes will teach you more than any article. A starter sketch follows these exercises.
In your toy harness, swap one component for a smarter version: add prompt caching, or a clipping rule, or a validating tool layer. Measure cost and accuracy before and after on a fixed task suite of your own.
Whenever you use Cursor, Aider, Codex, or any new agent, identify which of the six components it leans on hardest and which it skimps on. The vocabulary turns marketing copy into reverse-engineering.
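A starter sketch for the toy-harness exercise above: the working-memory JSON file and the stop condition, with the model call and read_file plumbing left as a stub to wire in yourself (the four-step loop sketch earlier is one way). The memory schema, file path, and DONE convention are all invented for this sketch.

```python
# Toy-harness starter: a distilled working-memory file beside the full transcript, and a
# loop that runs until the model says it is done. Schema, path, and the DONE convention
# are invented here; fill in one_turn() with messages.create() plus a read_file tool.
import json
from pathlib import Path

MEMORY_PATH = Path(".agent/working_memory.json")

def load_memory() -> dict:
    if MEMORY_PATH.exists():
        return json.loads(MEMORY_PATH.read_text())
    return {"current_task": "", "important_files": [], "notes": []}

def save_memory(memory: dict) -> None:
    MEMORY_PATH.parent.mkdir(parents=True, exist_ok=True)
    MEMORY_PATH.write_text(json.dumps(memory, indent=2))

def one_turn(task: str, memory: dict) -> str:
    """Stub: build the prompt from task + memory, call the model, run any read_file requests."""
    raise NotImplementedError("wire messages.create() and read_file here")

def toy_harness(task: str, max_turns: int = 15) -> str:
    memory = load_memory()
    memory["current_task"] = task
    for _ in range(max_turns):
        reply = one_turn(task, memory)
        memory["notes"].append(reply[:200])   # keep only a short, distilled note per turn
        save_memory(memory)                   # storage-time discipline: persist every turn
        if "DONE" in reply.upper():           # the agreed stop signal
            return reply
    return "Stopped: hit max_turns before the model said DONE."
```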
Next time someone says “the model isn’t smart enough”, ask the six-component question: did we give it the right context, the right tools, the right memory, and the right task scope? If all four of those are honestly “yes”, only then is it the model’s fault.