Coding Agents Internals Series — Presentation 02

How Agents Make Changes

Search-replace, unified diff, patch files — wire formats. Multi-file atomicity and rollback. Context-window management. The Read/Edit/Write tool trio. Aider vs Cursor vs Claude Code architectures. Test-driven edits. Failure recovery when the model produces a bad diff.

Unified Diff · Search-Replace · Multi-File Edits · Context Window · Read/Edit/Write · Aider · Cursor · Claude Code · TDD Agents
Load Context → Model Edit → Parse Diff → Apply → Test → Verify / Retry
00

Topics We’ll Cover

01

How Agents Make Changes

A coding agent ultimately has to write bytes to files. The mechanics of doing so are more varied than they appear: there are at least five distinct edit formats in use across production agents, each with a different precision/robustness trade-off.

| Edit format | Used by | Pros | Cons |
| --- | --- | --- | --- |
| Whole-file rewrite | Claude Code (small files), early GPT-4 coding assistants | Unambiguous — model outputs the complete new file | Wastes tokens on unchanged content; context-window hostile for large files |
| Search-replace block | Aider EDITOR format, Claude Code Edit tool | Compact; easy for models to generate; tolerates fuzz | Fails if the search string appears multiple times; brittle on whitespace |
| Unified diff (GNU patch) | Aider DIFF format, GPT-Engineer | Standard; tooling support; git-compatible; compact for large files | Models frequently generate invalid context lines; off-by-one line numbers |
| AST-level patch | Research tools (Mentat, some IDE plugins) | Semantically correct; rename-safe | Requires a language-specific AST; hard to generate from an LLM |
| Tool-call sequence (Read→Edit→Write) | Claude Code, OpenAI function-calling agents | Explicit, auditable; each step verifiable | Multiple round-trips; latency adds up for large edits |
The central tension

A model that emits a complete file is never ambiguous about what it wants — but for a 1 000-line file, 95% of the output is unchanged content. A model that emits a diff is compact, but must get the context lines exactly right or patch rejects it. Most production agents settle on search-replace blocks because they hit the sweet spot: they are small yet unambiguous, and fuzzy-matching can be applied at the application layer to handle minor whitespace drift.

02

Search-Replace, Unified Diff, Patch Files — Wire Formats

Search-replace block

Aider popularised this format for its EDITOR mode. The agent emits a fenced block containing the text to find and the text to substitute. Claude Code’s Edit tool is a structured variant of the same idea.

Aider EDITOR search-replace format
src/db/pool.py
<<<<<<< SEARCH
    def acquire(self):
        conn = self._pool.pop()
        return conn
=======
    async def acquire(self) -> Connection:
        if not self._pool:
            raise PoolExhausted(
                f"pool exhausted after {self._max_size} conns"
            )
        return self._pool.pop()
>>>>>>> REPLACE
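
A minimal parser and applier for this block format, as a sketch only (Aider’s real implementation is more forgiving about fence variants and whitespace; BLOCK_RE and the helper names here are illustrative):

Python: parse and apply a search-replace block
import re

BLOCK_RE = re.compile(
    r"^(?P<path>\S+)\n"
    r"<<<<<<< SEARCH\n(?P<search>.*?)\n=======\n(?P<replace>.*?)\n>>>>>>> REPLACE",
    re.DOTALL | re.MULTILINE,
)

def parse_blocks(text: str) -> list[tuple[str, str, str]]:
    """Extract (path, search, replace) triples from model output."""
    return [(m['path'], m['search'], m['replace']) for m in BLOCK_RE.finditer(text)]

def apply_block(source: str, search: str, replace: str) -> str:
    # Enforce uniqueness, as Claude Code's Edit tool does, to avoid mass-replace.
    if source.count(search) != 1:
        raise ValueError('search text must match exactly once')
    return source.replace(search, replace)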

Unified diff — wire format

GNU unified diff (the format patch(1) expects)
--- a/src/db/pool.py   2025-04-24 10:12:00.000000000 +0000
+++ b/src/db/pool.py   2025-04-24 11:00:00.000000000 +0000
@@ -47,8 +47,11 @@
     def _evict_idle(self, max_idle_secs):
         now = time.monotonic()

-    def acquire(self):
-        conn = self._pool.pop()
-        return conn
+    async def acquire(self) -> Connection:
+        if not self._pool:
+            raise PoolExhausted(
+                f"pool exhausted after {self._max_size} conns"
+            )
+        return self._pool.pop()

     def release(self, conn):
Why models struggle with unified diffs

The hunk header @@ -47,8 +47,11 @@ encodes the start line and line count for both the before and after blocks. Models frequently miscalculate the count fields: they count the replaced lines but forget to add the context lines. patch --fuzz=3 ignores up to three mismatching context lines at each end of a hunk; Aider’s DIFF format relies on this. Claude Code avoids the problem entirely by using the Read→Edit→Write sequence, where line numbers are never embedded in the output.
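
Because the arithmetic is mechanical, an agent can recompute the counts and reject a miscounted hunk before ever invoking patch. A sketch, assuming both count fields are present in each hunk header:

Python: validate hunk line counts before applying
import re

HUNK_RE = re.compile(r'^@@ -(\d+),(\d+) \+(\d+),(\d+) @@')

def hunk_counts_valid(diff_text: str) -> bool:
    lines = diff_text.splitlines()
    i = 0
    while i < len(lines):
        m = HUNK_RE.match(lines[i])
        i += 1
        if not m:
            continue
        before = after = 0
        while i < len(lines) and not lines[i].startswith(('@@', '--- ', '+++ ')):
            tag = lines[i][:1]
            if tag in (' ', ''):          # context (a blank line counts as context)
                before += 1
                after += 1
            elif tag == '-':
                before += 1
            elif tag == '+':
                after += 1
            i += 1
        if (before, after) != (int(m.group(2)), int(m.group(4))):
            return False                  # header disagrees with the hunk body
    return True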

Applying patches programmatically

Python: apply a unified diff with fallback fuzz
import subprocess, tempfile, pathlib

def apply_diff(diff_text: str, cwd: str, fuzz: int = 3) -> bool:
    # patch(1) reads the diff from a file, so stage it in a temp file first.
    with tempfile.NamedTemporaryFile(suffix='.patch', mode='w', delete=False) as f:
        f.write(diff_text)
        patch_path = f.name
    # --forward skips already-applied hunks; --fuzz tolerates drifted context.
    result = subprocess.run(
        ['patch', '--unified', '--forward',
         f'--fuzz={fuzz}', '-p1', '-i', patch_path],
        cwd=cwd, capture_output=True, text=True
    )
    pathlib.Path(patch_path).unlink(missing_ok=True)  # clean up the temp file
    return result.returncode == 0
03

Multi-File Edits — Atomicity, Rollback

Many real edits touch multiple files simultaneously: rename a function and update all its callers, change an API contract and update both the implementation and its tests. Applying these changes partially — some files updated, some not — leaves the repo in a broken state. Agents need a transactional strategy.

Two-phase commit pattern for multi-file edits:

  1. Snapshot originals (git stash or copy to /tmp)
  2. Apply all patches in memory (dry run)
  3. Validate: all patches parse, lint checks pass
  4. Commit; if validation FAILED, abort all writes and roll back

Git-based rollback (the simplest approach): git stash, apply patches, run tests.
  • pass: git stash drop + git commit -m "..."
  • fail: git checkout -- . (restore all modified files instantly)
In-memory staging before any writes

The safest pattern (used by Aider): build all edited file contents in memory, run a syntax-check pass on every file (e.g. py_compile for Python, rustfmt --check for Rust), and only if all pass, write to disk atomically using os.replace() (which is an atomic rename on POSIX). No partial state ever hits disk. If a write fails mid-way due to disk full or permissions, the un-replaced files are unchanged — repair by writing the remainder.
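
A minimal sketch of the staging pattern, using the builtin compile for the Python syntax check (the stage_and_write helper, its path→content input shape, and the .staged suffix are illustrative, not Aider’s actual code):

Python: validate everything in memory, then write atomically
import os

def stage_and_write(edits: dict[str, str]) -> None:
    # Phase 1: validate every file in memory; abort before any write on failure.
    for path, content in edits.items():
        if path.endswith('.py'):
            compile(content, path, 'exec')  # raises SyntaxError; nothing written yet
    # Phase 2: write each file via a sibling temp file, then atomic rename.
    for path, content in edits.items():
        tmp = f'{path}.staged'
        with open(tmp, 'w') as f:
            f.write(content)
            f.flush()
            os.fsync(f.fileno())            # ensure bytes are on disk before renaming
        os.replace(tmp, path)               # atomic rename on POSIX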

Conflict detection for concurrent edits

In multi-agent or long-session scenarios, a file may have been modified externally since it was read. Agents should compare the SHA-256 of the file at read time with the file at write time. If they differ, the agent must re-read and re-apply its edit rather than clobber an intervening change.
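
The check is a few lines; the file path here is illustrative:

Python: detect external modification between read and write
import hashlib

def sha256_of(path: str) -> str:
    with open(path, 'rb') as f:
        return hashlib.sha256(f.read()).hexdigest()

digest_at_read = sha256_of('src/db/pool.py')   # recorded when the agent reads
# ... model generates the edit ...
if sha256_of('src/db/pool.py') != digest_at_read:
    raise RuntimeError('file changed externally; re-read and re-apply the edit')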

04

Context Window Management — Load, Summarise, Forget

Even a 200 k-token context window fills quickly during a multi-step edit session: conversation turns, file reads, test output, error messages, and the agent’s own reasoning all compete for the same budget. Managing what stays in context is as important as retrieval.

  • Fixed: system prompt + CLAUDE.md; repo map (ctags / Aider); key config files
  • Working set: current task file(s); active test file; recent error output
  • Summarised: earlier conversation turns; previously edited files (as summaries)
  • Forgotten: irrelevant retrieved files; verbose tool output (truncated)

Concrete techniques

Sliding context window

Keep only the last K conversation turns verbatim. Earlier turns are replaced with a one-paragraph summary generated by a smaller/cheaper model (e.g. Haiku or GPT-4o-mini). Claude Code uses a compaction step that fires when context utilisation exceeds a threshold (~85%).
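
A compaction sketch; summarise stands in for a call to the cheaper model, and conversation turns are simplified to plain strings:

Python: keep the last K turns verbatim, summarise the rest
from typing import Callable

def compact(turns: list[str], keep_last: int,
            summarise: Callable[[str], str]) -> list[str]:
    if len(turns) <= keep_last:
        return turns                       # history still fits verbatim
    summary = summarise('\n'.join(turns[:-keep_last]))
    return [f'[summary of earlier turns] {summary}'] + turns[-keep_last:]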

Tool-output truncation

Bash command output is capped at a configurable limit (default ~8 k tokens in Claude Code). Test runners that emit 50 k lines of verbose output are truncated to the first N lines plus the summary line. The agent sees enough to diagnose the failure without wasting budget.
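
A truncation sketch along those lines (the limits here are illustrative, not Claude Code’s actual defaults):

Python: cap tool output, keeping the head plus the final summary line
def truncate_output(text: str, max_lines: int = 200) -> str:
    lines = text.splitlines()
    if len(lines) <= max_lines + 2:
        return text
    omitted = len(lines) - max_lines - 1
    return '\n'.join(lines[:max_lines]
                     + [f'... [{omitted} lines truncated] ...', lines[-1]])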

Prompt caching changes the calculus

Anthropic’s prompt cache (as of claude-3-5-sonnet-20241022 and later) caches the KV state of the first N tokens. If the system prompt + repo map is stable turn-to-turn, those tokens are billed at 10% of normal input cost after the first hit. This flips the optimisation: it is now cheaper to keep a large stable prefix (repo map, documentation) than to compute it each turn. Aider 0.55+ and Claude Code both exploit this. The practical implication: put the stable repo map near the top of context, not at the bottom.

05

Tool Design for Editing — The Read/Edit/Write Trio

Claude Code exposes three core editing tools. Their design is not arbitrary: each enforces a constraint that prevents a class of common failure.

Read

Returns the current file contents with line numbers. The agent must call Read before Edit — the tool system enforces this. This prevents the model from editing a stale mental image of the file from its training data.

  • file_path: absolute path
  • offset: start line (0-indexed)
  • limit: max lines (default 2 000)

Edit

Search-replace within the file loaded by a prior Read call. The tool verifies that old_string appears exactly once in the current file before applying. Uniqueness enforcement prevents accidental mass-replace.

  • file_path: must match the prior Read
  • old_string: must be unique
  • new_string: replacement text
  • replace_all: opt-in for all occurrences

Write

Writes a complete new file, also requiring a prior Read (or explicit new-file declaration). Used when the change is so large that search-replace is impractical, or when creating a new file entirely.

  • Requires Read first for existing files
  • Atomically replaces via os.replace()
  • Returns the bytes written for verification

Why the “must read first” constraint matters

The stale-context failure mode

Without the Read-before-Edit constraint, a model trained on code from 2024 might “remember” an older version of a file and generate an Edit with an old_string that no longer exists. The Edit tool will reject it (“old_string not found”) and the agent is forced to re-read. This friction is by design: it is far better to fail loudly at the tool layer than to silently apply a patch to the wrong content. In SWE-bench evaluations, eliminating this category of error accounts for ~5–10% of the pass@1 gap between agents.

Tool-call sequence for a typical single-function edit
// Turn 1: agent reads the file
Read({ file_path: "src/db/pool.py", offset: 40, limit: 20 })
// → returns lines 40–60 with line numbers

// Turn 2: agent issues the precise edit
Edit({
  file_path: "src/db/pool.py",
  old_string: "    def acquire(self):\n        conn = self._pool.pop()\n        return conn",
  new_string: "    async def acquire(self) -> Connection:\n        if not self._pool:\n            raise PoolExhausted(f\"pool exhausted\")\n        return self._pool.pop()"
})
// → success: { bytes_written: 412 }

// Turn 3: agent verifies by running the tests
Bash({ command: "pytest tests/test_pool.py -x -q 2>&1 | tail -20" })
06

Aider vs Cursor vs Claude Code — Three Architectures

The three most widely used coding agents have chosen fundamentally different edit architectures. Understanding these choices explains their respective strengths and failure modes.

| Dimension | Aider | Cursor | Claude Code |
| --- | --- | --- | --- |
| Edit format | Search-replace block (EDITOR) or unified diff (DIFF), model-selected per request | Inline streaming diff shown to the user; accept/reject per hunk | Read→Edit→Write tool calls; search-replace semantics |
| Context loading | Pre-built ctags repo map; user adds files explicitly | @-symbol retrieval; IDE document context always present | Tool-driven, lazy — reads only what is needed |
| Model API | Any litellm-compatible model; EDITOR mode requires strong models (Sonnet 3.7+, GPT-4o) | Proprietary hosted models; optional BYO-API | Anthropic API; claude-sonnet-4 / claude-opus-4 |
| Test loop | Optional --test-cmd; auto-iterates on test failure up to N rounds | Not automatic; user-driven | Bash tool; agent decides when to run tests |
| Multi-file atomicity | In-memory staging, then atomic writes per file; git used for rollback | User accepts/rejects each file diff independently | Sequential tool calls; no formal transaction; agent must detect partial failure |
| Open source | Yes (Apache 2.0, github.com/paul-gauthier/aider) | No (Cursor Inc., Electron app) | No (distributed via npm under Anthropic’s commercial terms; github.com/anthropics/claude-code hosts issues and docs) |

Aider’s strength

Terminal-native, scriptable, works in CI. The repo-map ensures context is comprehensive upfront. EDITOR format is the most reliable for large single-function edits. The --architect mode uses two models: a high-quality planner and a cheaper executor.

Cursor’s strength

IDE-native diff review gives the user fine-grained control before writes land. Background indexing means @codebase queries are fast even on cold starts. Multi-model routing (cheap for chat, expensive for apply) controls cost.

Claude Code’s strength

Zero pre-configuration — no index to build. Tool-call transparency: every file access is explicit and auditable. The agent reasons about what to read next rather than relying on a potentially stale index. Strong on novel codebases it has never seen.

07

Test-Driven Agentic Edits

A failing test is the highest-signal feedback an agent can receive. The TDD agentic loop is simple in principle but requires careful engineering to avoid infinite retry loops and token runaway.

TDD agentic loop with retry cap:

  1. Write or load the test; run pytest -x -q
  2. Read the failing test and retrieve the relevant source files
  3. Apply the edit (Read→Edit→Write)
  4. Run the tests again; capture the output
  5. FAIL and retries < N: analyse the failure (re-read the output), return to step 3
  6. PASS: commit. Retries ≥ N: escalate to the user

Practical engineering details

Retry budget and escalation

Set a hard cap of 3–5 retry rounds per edit task. After that, the agent should surface the failure to the user rather than spinning. Without a cap, a model that oscillates between two broken states will consume the entire token budget. Aider runs tests automatically with --auto-test and --test-cmd and caps its internal reflection loop at a small fixed number of rounds; Claude Code delegates the retry decision to the agent’s own reasoning (which means a well-prompted agent self-stops, and a poorly-prompted one does not).

Truncate test output

A failing pytest run with 200 test files can emit 100 k tokens of output. Pass only the first failed test’s traceback and the summary line to the next model call; do not dump all output into context.
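
A sketch of the loop with both safeguards; apply_edit stands in for the model round-trip, and the head-plus-tail truncation is deliberately crude:

Python: TDD retry loop with a hard cap and truncated feedback
import subprocess

MAX_RETRIES = 3  # hard cap; beyond this, escalate to the user

def tdd_loop(apply_edit, test_cmd: list[str]) -> bool:
    feedback = None
    for _ in range(MAX_RETRIES):
        apply_edit(feedback)                 # model proposes an edit given feedback
        result = subprocess.run(test_cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return True                      # PASS: the caller commits
        lines = (result.stdout + result.stderr).splitlines()
        # Forward only the head (first traceback) plus the summary line.
        feedback = ('\n'.join(lines[:60] + lines[-2:]) if len(lines) > 62
                    else '\n'.join(lines))
    return False                             # retries exhausted: escalate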

08

Failure Recovery — When the Model Produces Bad Diffs

Even the best models produce malformed or semantically incorrect diffs on a non-trivial fraction of requests. Production agents need a layered recovery strategy.

Taxonomy of edit failures

| Failure class | Example | Detection | Recovery |
| --- | --- | --- | --- |
| Parse failure | Model emits a malformed fence block, missing REPLACE marker | Regex parse of agent output | Re-prompt: “your last edit was malformed; please try again with the correct format” |
| No-match failure | old_string not found in the current file | Edit tool returns “old_string not found” | Force a Read call, then retry; if it fails again, escalate |
| Ambiguous match | old_string matches 3 locations | Edit tool returns “old_string not unique” | Re-prompt with surrounding context to make old_string unique |
| Syntax error | Python indentation broken, mismatched braces | py_compile / rustfmt --check | Roll back the file; re-prompt with the error message + original file content |
| Semantic error | Patch applies cleanly but the logic is wrong, e.g. the wrong function renamed | Test suite failure | Test-driven retry loop (slide 07); after N failures, escalate |
| Wrong file | Model edits a test file instead of the implementation | Code review (human or automated) | Human checkpoint; agents should not silently clobber unexpected files |

Fuzzy-matching as a first-line recovery

Python: fuzzy old_string matching with difflib
from difflib import SequenceMatcher

def find_best_match(source: str, pattern: str,
                    threshold: float = 0.85) -> tuple[int, int] | None:
    """Return (start, end) char offsets of best fuzzy match, or None."""
    lines = source.splitlines(True)
    pat_lines = pattern.splitlines(True)
    best_ratio, best_start, best_end = 0.0, 0, 0
    for i in range(len(lines) - len(pat_lines) + 1):
        window = lines[i : i + len(pat_lines)]
        ratio = SequenceMatcher(None, window, pat_lines).ratio()
        if ratio > best_ratio:
            best_ratio, best_start = ratio, i
            best_end = i + len(pat_lines)
    if best_ratio >= threshold:
        start_char = sum(len(l) for l in lines[:best_start])
        end_char   = sum(len(l) for l in lines[:best_end])
        return start_char, end_char
    return None
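
Wired into the no-match recovery path, the agent retries once against the fuzzy span before escalating (illustrative usage):

Python: splice the replacement into the best fuzzy match
span = find_best_match(source, old_string)
if span is None:
    raise ValueError('no sufficiently close match; re-read the file and retry')
start, end = span
source = source[:start] + new_string + source[end:]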
When to stop retrying and ask the human

A safe heuristic: if the same file has been patched and rolled back more than twice in a single task, pause and describe the failure to the user. Automated recovery from a third consecutive failure is unlikely to succeed and likely to produce confusing state. The user can then rephrase the task, split it into smaller steps, or make the initial change manually.

09

What to Take Away

Series complete

This deck completes the Coding Agents Internals series. Deck 01 covered how agents find relevant code; this deck covered how agents change it. Together they describe the two core loops of any coding agent: retrieve, then apply. The quality of both loops — not just the underlying model — determines task success rates on realistic codebases.