Coding Agents Internals Series — Presentation 02

How Agents Make Changes

Search-replace, unified diff, patch files — wire formats. Multi-file atomicity and rollback. Context-window management. The Read/Edit/Write tool trio. Aider vs Cursor vs Claude Code architectures. Test-driven edits. Failure recovery when the model produces a bad diff.

Unified Diff · Search-Replace · Multi-File Edits · Context Window · Read/Edit/Write · Aider · Cursor · Claude Code · TDD Agents
Load Context → Model Edit → Parse Diff → Apply → Test → Verify / Retry
00

Topics We’ll Cover

01

How Agents Make Changes

A coding agent ultimately has to write bytes to files. The mechanics of doing so are more varied than they appear: there are at least five distinct edit formats in use across production agents, each with a different precision/robustness trade-off.

| Edit format | Used by | Pros | Cons |
| --- | --- | --- | --- |
| Whole-file rewrite | Claude Code (small files), early GPT-4 coding assistants | Unambiguous — model outputs the complete new file | Wastes tokens on unchanged content; context-window hostile for large files |
| Search-replace block | Aider EDITOR format, Claude Code Edit tool | Compact; easy for models to generate; tolerates fuzz | Fails if the search string appears multiple times; brittle on whitespace |
| Unified diff (GNU patch) | Aider DIFF format, GPT-Engineer | Standard; tooling support; git-compatible; compact for large files | Models frequently generate invalid context lines; off-by-one line numbers |
| AST-level patch | Research tools (Mentat, some IDE plugins) | Semantically correct; rename-safe | Requires a language-specific AST; hard to generate from an LLM |
| Tool-call sequence (Read→Edit→Write) | Claude Code, OpenAI function-calling agents | Explicit, auditable; each step verifiable | Multiple round-trips; latency adds up for large edits |
The central tension

A model that emits a complete file is never ambiguous about what it wants — but for a 1 000-line file, 95% of the output is unchanged content. A model that emits a diff is compact, but must get the context lines exactly right or patch rejects it. Most production agents settle on search-replace blocks because they hit the sweet spot: they are small yet unambiguous, and fuzzy-matching can be applied at the application layer to handle minor whitespace drift.

02

Search-Replace, Unified Diff, Patch Files — Wire Formats

Search-replace block

Aider popularised this format for its EDITOR mode. The agent emits a fenced block containing the text to find and the text to substitute. Claude Code’s Edit tool is a structured variant of the same idea.

Aider EDITOR search-replace format
src/db/pool.py
<<<<<<< SEARCH
    def acquire(self):
        conn = self._pool.pop()
        return conn
=======
    async def acquire(self) -> Connection:
        if not self._pool:
            raise PoolExhausted(
                f"pool exhausted after {self._max_size} conns"
            )
        return self._pool.pop()
>>>>>>> REPLACE
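
A minimal parser and applier for this block format, as a sketch only (Aider’s real implementation is more forgiving about fence variants and whitespace; BLOCK_RE and the helper names here are illustrative):

Python: parse and apply a search-replace block
import re

BLOCK_RE = re.compile(
    r"^(?P<path>\S+)\n"
    r"<<<<<<< SEARCH\n(?P<search>.*?)\n=======\n(?P<replace>.*?)\n>>>>>>> REPLACE",
    re.DOTALL | re.MULTILINE,
)

def parse_blocks(text: str) -> list[tuple[str, str, str]]:
    """Extract (path, search, replace) triples from model output."""
    return [(m['path'], m['search'], m['replace']) for m in BLOCK_RE.finditer(text)]

def apply_block(source: str, search: str, replace: str) -> str:
    # Enforce uniqueness, as Claude Code's Edit tool does, to avoid mass-replace.
    if source.count(search) != 1:
        raise ValueError('search text must match exactly once')
    return source.replace(search, replace)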

Unified diff — wire format

GNU unified diff (the format patch(1) expects)
--- a/src/db/pool.py   2025-04-24 10:12:00.000000000 +0000
+++ b/src/db/pool.py   2025-04-24 11:00:00.000000000 +0000
@@ -47,8 +47,11 @@
     def _evict_idle(self, max_idle_secs):
         now = time.monotonic()

-    def acquire(self):
-        conn = self._pool.pop()
-        return conn
+    async def acquire(self) -> Connection:
+        if not self._pool:
+            raise PoolExhausted(
+                f"pool exhausted after {self._max_size} conns"
+            )
+        return self._pool.pop()

     def release(self, conn):
Why models struggle with unified diffs

The hunk header @@ -47,8 +47,11 @@ encodes the start line and line count for both the before and after blocks. Models frequently miscalculate the count fields: they count the replaced lines but forget to add the context lines. patch --fuzz=3 ignores up to three mismatching context lines at each end of a hunk; Aider’s DIFF format relies on this. Claude Code avoids the problem entirely by using the Read→Edit→Write sequence, where line numbers are never embedded in the output.
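
Because the arithmetic is mechanical, an agent can recompute the counts and reject a miscounted hunk before ever invoking patch. A sketch, assuming both count fields are present in each hunk header:

Python: validate hunk line counts before applying
import re

HUNK_RE = re.compile(r'^@@ -(\d+),(\d+) \+(\d+),(\d+) @@')

def hunk_counts_valid(diff_text: str) -> bool:
    lines = diff_text.splitlines()
    i = 0
    while i < len(lines):
        m = HUNK_RE.match(lines[i])
        i += 1
        if not m:
            continue
        before = after = 0
        while i < len(lines) and not lines[i].startswith(('@@', '--- ', '+++ ')):
            tag = lines[i][:1]
            if tag in (' ', ''):          # context (a blank line counts as context)
                before += 1
                after += 1
            elif tag == '-':
                before += 1
            elif tag == '+':
                after += 1
            i += 1
        if (before, after) != (int(m.group(2)), int(m.group(4))):
            return False                  # header disagrees with the hunk body
    return True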

Applying patches programmatically

Python: apply a unified diff with fallback fuzz
import subprocess, tempfile, pathlib

def apply_diff(diff_text: str, cwd: str, fuzz: int = 3) -> bool:
    # patch(1) reads the diff from a file, so stage it in a temp file first.
    with tempfile.NamedTemporaryFile(suffix='.patch', mode='w', delete=False) as f:
        f.write(diff_text)
        patch_path = f.name
    # --forward skips already-applied hunks; --fuzz tolerates drifted context.
    result = subprocess.run(
        ['patch', '--unified', '--forward',
         f'--fuzz={fuzz}', '-p1', '-i', patch_path],
        cwd=cwd, capture_output=True, text=True
    )
    pathlib.Path(patch_path).unlink(missing_ok=True)  # clean up the temp file
    return result.returncode == 0
03

Multi-File Edits — Atomicity, Rollback

Many real edits touch multiple files simultaneously: rename a function and update all its callers, change an API contract and update both the implementation and its tests. Applying these changes partially — some files updated, some not — leaves the repo in a broken state. Agents need a transactional strategy.

Two-phase commit pattern for multi-file edits:

  1. Snapshot originals (git stash or copy to /tmp)
  2. Apply all patches in memory (dry run)
  3. Validate: all patches parse, lint checks pass
  4. Commit; if validation FAILED, abort all writes and roll back

Git-based rollback (the simplest approach): git stash, apply patches, run tests.
  • pass: git stash drop + git commit -m "..."
  • fail: git checkout -- . (restore all modified files instantly)
In-memory staging before any writes

The safest pattern (used by Aider): build all edited file contents in memory, run a syntax-check pass on every file (e.g. py_compile for Python, rustfmt --check for Rust), and only if all pass, write to disk atomically using os.replace() (which is an atomic rename on POSIX). No partial state ever hits disk. If a write fails mid-way due to disk full or permissions, the un-replaced files are unchanged — repair by writing the remainder.
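
A minimal sketch of the staging pattern, using the builtin compile for the Python syntax check (the stage_and_write helper, its path→content input shape, and the .staged suffix are illustrative, not Aider’s actual code):

Python: validate everything in memory, then write atomically
import os

def stage_and_write(edits: dict[str, str]) -> None:
    # Phase 1: validate every file in memory; abort before any write on failure.
    for path, content in edits.items():
        if path.endswith('.py'):
            compile(content, path, 'exec')  # raises SyntaxError; nothing written yet
    # Phase 2: write each file via a sibling temp file, then atomic rename.
    for path, content in edits.items():
        tmp = f'{path}.staged'
        with open(tmp, 'w') as f:
            f.write(content)
            f.flush()
            os.fsync(f.fileno())            # ensure bytes are on disk before renaming
        os.replace(tmp, path)               # atomic rename on POSIX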

Conflict detection for concurrent edits

In multi-agent or long-session scenarios, a file may have been modified externally since it was read. Agents should compare the SHA-256 of the file at read time with the file at write time. If they differ, the agent must re-read and re-apply its edit rather than clobber an intervening change.
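
The check is a few lines; the file path here is illustrative:

Python: detect external modification between read and write
import hashlib

def sha256_of(path: str) -> str:
    with open(path, 'rb') as f:
        return hashlib.sha256(f.read()).hexdigest()

digest_at_read = sha256_of('src/db/pool.py')   # recorded when the agent reads
# ... model generates the edit ...
if sha256_of('src/db/pool.py') != digest_at_read:
    raise RuntimeError('file changed externally; re-read and re-apply the edit')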

04

Context Window Management — Load, Summarise, Forget

Even a 200 k-token context window fills quickly during a multi-step edit session: conversation turns, file reads, test output, error messages, and the agent’s own reasoning all compete for the same budget. Managing what stays in context is as important as retrieval.

  • Fixed: system prompt + CLAUDE.md; repo map (ctags / Aider); key config files
  • Working set: current task file(s); active test file; recent error output
  • Summarised: earlier conversation turns; previously edited files (as summaries)
  • Forgotten: irrelevant retrieved files; verbose tool output (truncated)

Concrete techniques

Sliding context window

Keep only the last K conversation turns verbatim. Earlier turns are replaced with a one-paragraph summary generated by a smaller/cheaper model (e.g. Haiku or GPT-4o-mini). Claude Code uses a compaction step that fires when context utilisation exceeds a threshold (~85%).
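
A compaction sketch; summarise stands in for a call to the cheaper model, and conversation turns are simplified to plain strings:

Python: keep the last K turns verbatim, summarise the rest
from typing import Callable

def compact(turns: list[str], keep_last: int,
            summarise: Callable[[str], str]) -> list[str]:
    if len(turns) <= keep_last:
        return turns                       # history still fits verbatim
    summary = summarise('\n'.join(turns[:-keep_last]))
    return [f'[summary of earlier turns] {summary}'] + turns[-keep_last:]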

Tool-output truncation

Bash command output is capped at a configurable limit (default ~8 k tokens in Claude Code). Test runners that emit 50 k lines of verbose output are truncated to the first N lines plus the summary line. The agent sees enough to diagnose the failure without wasting budget.
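
A truncation sketch along those lines (the limits here are illustrative, not Claude Code’s actual defaults):

Python: cap tool output, keeping the head plus the final summary line
def truncate_output(text: str, max_lines: int = 200) -> str:
    lines = text.splitlines()
    if len(lines) <= max_lines + 2:
        return text
    omitted = len(lines) - max_lines - 1
    return '\n'.join(lines[:max_lines]
                     + [f'... [{omitted} lines truncated] ...', lines[-1]])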

Prompt caching changes the calculus

Anthropic’s prompt cache (as of claude-3-5-sonnet-20241022 and later) caches the KV state of the first N tokens. If the system prompt + repo map is stable turn-to-turn, those tokens are billed at 10% of normal input cost after the first hit. This flips the optimisation: it is now cheaper to keep a large stable prefix (repo map, documentation) than to compute it each turn. Aider 0.55+ and Claude Code both exploit this. The practical implication: put the stable repo map near the top of context, not at the bottom.

05

Tool Design for Editing — The Read/Edit/Write Trio

Claude Code exposes three core editing tools. Their design is not arbitrary: each enforces a constraint that prevents a class of common failure.

Read

Returns the current file contents with line numbers. The agent must call Read before Edit — the tool system enforces this. This prevents the model from editing a stale mental image of the file from its training data.

  • file_path: absolute path
  • offset: start line (0-indexed)
  • limit: max lines (default 2 000)

Edit

Search-replace within the file loaded by a prior Read call. The tool verifies that old_string appears exactly once in the current file before applying. Uniqueness enforcement prevents accidental mass-replace.

  • file_path: must match the prior Read
  • old_string: must be unique
  • new_string: replacement text
  • replace_all: opt-in for all occurrences

Write

Writes a complete new file, also requiring a prior Read (or explicit new-file declaration). Used when the change is so large that search-replace is impractical, or when creating a new file entirely.

  • Requires Read first for existing files
  • Atomically replaces via os.replace()
  • Returns the bytes written for verification

Why the “must read first” constraint matters

The stale-context failure mode

Without the Read-before-Edit constraint, a model trained on code from 2024 might “remember” an older version of a file and generate an Edit with an old_string that no longer exists. The Edit tool will reject it (“old_string not found”) and the agent is forced to re-read. This friction is by design: it is far better to fail loudly at the tool layer than to silently apply a patch to the wrong content. In SWE-bench evaluations, eliminating this category of error accounts for ~5–10% of the pass@1 gap between agents.

Tool-call sequence for a typical single-function edit
// Turn 1: agent reads the file
Read({ file_path: "src/db/pool.py", offset: 40, limit: 20 })
// → returns lines 40–60 with line numbers

// Turn 2: agent issues the precise edit
Edit({
  file_path: "src/db/pool.py",
  old_string: "    def acquire(self):\n        conn = self._pool.pop()\n        return conn",
  new_string: "    async def acquire(self) -> Connection:\n        if not self._pool:\n            raise PoolExhausted(f\"pool exhausted\")\n        return self._pool.pop()"
})
// → success: { bytes_written: 412 }

// Turn 3: agent verifies by running the tests
Bash({ command: "pytest tests/test_pool.py -x -q 2>&1 | tail -20" })
06

Aider vs Cursor vs Claude Code — Three Architectures

The three most widely used coding agents have chosen fundamentally different edit architectures. Understanding these choices explains their respective strengths and failure modes.

| Dimension | Aider | Cursor | Claude Code |
| --- | --- | --- | --- |
| Edit format | Search-replace block (EDITOR) or unified diff (DIFF), model-selected per request | Inline streaming diff shown to the user; accept/reject per hunk | Read→Edit→Write tool calls; search-replace semantics |
| Context loading | Pre-built ctags repo map; user adds files explicitly | @-symbol retrieval; IDE document context always present | Tool-driven, lazy — reads only what is needed |
| Model API | Any litellm-compatible model; EDITOR mode requires strong models (Sonnet 3.7+, GPT-4o) | Proprietary hosted models; optional BYO-API | Anthropic API; claude-sonnet-4 / claude-opus-4 |
| Test loop | Optional --test-cmd; auto-iterates on test failure up to N rounds | Not automatic; user-driven | Bash tool; agent decides when to run tests |
| Multi-file atomicity | In-memory staging, then atomic writes per file; git used for rollback | User accepts/rejects each file diff independently | Sequential tool calls; no formal transaction; agent must detect partial failure |
| Open source | Yes (Apache 2.0, github.com/paul-gauthier/aider) | No (Cursor Inc., Electron app) | No (distributed via npm under Anthropic’s commercial terms; github.com/anthropics/claude-code hosts issues and docs) |

Aider’s strength

Terminal-native, scriptable, works in CI. The repo-map ensures context is comprehensive upfront. EDITOR format is the most reliable for large single-function edits. The --architect mode uses two models: a high-quality planner and a cheaper executor.

Cursor’s strength

IDE-native diff review gives the user fine-grained control before writes land. Background indexing means @codebase queries are fast even on cold starts. Multi-model routing (cheap for chat, expensive for apply) controls cost.

Claude Code’s strength

Zero pre-configuration — no index to build. Tool-call transparency: every file access is explicit and auditable. The agent reasons about what to read next rather than relying on a potentially stale index. Strong on novel codebases it has never seen.

07

Test-Driven Agentic Edits

A failing test is the highest-signal feedback an agent can receive. The TDD agentic loop is simple in principle but requires careful engineering to avoid infinite retry loops and token runaway.

TDD agentic loop with retry cap:

  1. Write or load the test; run pytest -x -q
  2. Read the failing test and retrieve the relevant source files
  3. Apply the edit (Read→Edit→Write)
  4. Run the tests again; capture the output
  5. FAIL and retries < N: analyse the failure (re-read the output), return to step 3
  6. PASS: commit. Retries ≥ N: escalate to the user

Practical engineering details

Retry budget and escalation

Set a hard cap of 3–5 retry rounds per edit task. After that, the agent should surface the failure to the user rather than spinning. Without a cap, a model that oscillates between two broken states will consume the entire token budget. Aider runs tests automatically with --auto-test and --test-cmd and caps its internal reflection loop at a small fixed number of rounds; Claude Code delegates the retry decision to the agent’s own reasoning (which means a well-prompted agent self-stops, and a poorly-prompted one does not).

Truncate test output

A failing pytest run with 200 test files can emit 100 k tokens of output. Pass only the first failed test’s traceback and the summary line to the next model call; do not dump all output into context.
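
A sketch of the loop with both safeguards; apply_edit stands in for the model round-trip, and the head-plus-tail truncation is deliberately crude:

Python: TDD retry loop with a hard cap and truncated feedback
import subprocess

MAX_RETRIES = 3  # hard cap; beyond this, escalate to the user

def tdd_loop(apply_edit, test_cmd: list[str]) -> bool:
    feedback = None
    for _ in range(MAX_RETRIES):
        apply_edit(feedback)                 # model proposes an edit given feedback
        result = subprocess.run(test_cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return True                      # PASS: the caller commits
        lines = (result.stdout + result.stderr).splitlines()
        # Forward only the head (first traceback) plus the summary line.
        feedback = ('\n'.join(lines[:60] + lines[-2:]) if len(lines) > 62
                    else '\n'.join(lines))
    return False                             # retries exhausted: escalate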

08

Failure Recovery — When the Model Produces Bad Diffs

Even the best models produce malformed or semantically incorrect diffs on a non-trivial fraction of requests. Production agents need a layered recovery strategy.

Taxonomy of edit failures

| Failure class | Example | Detection | Recovery |
| --- | --- | --- | --- |
| Parse failure | Model emits a malformed fence block, missing REPLACE marker | Regex parse of agent output | Re-prompt: “your last edit was malformed; please try again with the correct format” |
| No-match failure | old_string not found in the current file | Edit tool returns “old_string not found” | Force a Read call, then retry; if it fails again, escalate |
| Ambiguous match | old_string matches 3 locations | Edit tool returns “old_string not unique” | Re-prompt with surrounding context to make old_string unique |
| Syntax error | Python indentation broken, mismatched braces | py_compile / rustfmt --check | Roll back the file; re-prompt with the error message + original file content |
| Semantic error | Patch applies cleanly but the logic is wrong, e.g. the wrong function renamed | Test suite failure | Test-driven retry loop (slide 07); after N failures, escalate |
| Wrong file | Model edits a test file instead of the implementation | Code review (human or automated) | Human checkpoint; agents should not silently clobber unexpected files |

Fuzzy-matching as a first-line recovery

Python: fuzzy old_string matching with difflib
from difflib import SequenceMatcher

def find_best_match(source: str, pattern: str,
                    threshold: float = 0.85) -> tuple[int, int] | None:
    """Return (start, end) char offsets of best fuzzy match, or None."""
    lines = source.splitlines(True)
    pat_lines = pattern.splitlines(True)
    best_ratio, best_start, best_end = 0.0, 0, 0
    for i in range(len(lines) - len(pat_lines) + 1):
        window = lines[i : i + len(pat_lines)]
        ratio = SequenceMatcher(None, window, pat_lines).ratio()
        if ratio > best_ratio:
            best_ratio, best_start = ratio, i
            best_end = i + len(pat_lines)
    if best_ratio >= threshold:
        start_char = sum(len(l) for l in lines[:best_start])
        end_char   = sum(len(l) for l in lines[:best_end])
        return start_char, end_char
    return None
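
Wired into the no-match recovery path, the agent retries once against the fuzzy span before escalating (illustrative usage):

Python: splice the replacement into the best fuzzy match
span = find_best_match(source, old_string)
if span is None:
    raise ValueError('no sufficiently close match; re-read the file and retry')
start, end = span
source = source[:start] + new_string + source[end:]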
When to stop retrying and ask the human

A safe heuristic: if the same file has been patched and rolled back more than twice in a single task, pause and describe the failure to the user. Automated recovery from a third consecutive failure is unlikely to succeed and likely to produce confusing state. The user can then rephrase the task, split it into smaller steps, or make the initial change manually.

09

What to Take Away

Series complete

This deck completes the Coding Agents Internals series. Deck 01 covered how agents find relevant code; this deck covered how agents change it. Together they describe the two core loops of any coding agent: retrieve, then apply. The quality of both loops — not just the underlying model — determines task success rates on realistic codebases.