Search-replace, unified diff, patch files — wire formats. Multi-file atomicity and rollback. Context-window management. The Read/Edit/Write tool trio. Aider vs Cursor vs Claude Code architectures. Test-driven edits. Failure recovery when the model produces a bad diff.
A coding agent ultimately has to write bytes to files. The mechanism through which it does so is more varied than it appears: there are at least five distinct edit formats in use across production agents, each with a different precision/robustness trade-off.
| Edit format | Used by | Pros | Cons |
|---|---|---|---|
| Whole-file rewrite | Claude Code (small files), early GPT-4 coding assistants | Unambiguous — model outputs the complete new file | Wastes tokens on unchanged content; context-window hostile for large files |
| Search-replace block | Aider EDITOR format, Claude Code Edit tool | Compact; easy for models to generate; tolerates fuzz | Fails if search string appears multiple times; brittle on whitespace |
| Unified diff (GNU patch) | Aider DIFF format, GPT-Engineer | Standard; tooling support; git-compatible; compact for large files | Models frequently generate invalid context lines; off-by-one line numbers |
| AST-level patch | Research tools (Mentat, some IDE plugins) | Semantically correct; rename-safe | Requires language-specific AST; hard to generate from an LLM |
| Tool-call sequence (Read→Edit→Write) | Claude Code, OpenAI function-calling agents | Explicit, auditable; each step verifiable | Multiple round-trips; latency adds up for large edits |
A model that emits a complete file is never ambiguous about what it wants — but for a 1 000-line file, 95% of the output is unchanged content. A model that emits a diff is compact, but must get the context lines exactly right or patch rejects it. Most production agents settle on search-replace blocks because they hit the sweet spot: they are small yet unambiguous, and fuzzy-matching can be applied at the application layer to handle minor whitespace drift.
Aider popularised this format for its EDITOR mode. The agent emits a fenced block containing the text to find and the text to substitute. Claude Code's Edit tool is a structured variant of the same idea:

src/db/pool.py
```
<<<<<<< SEARCH
    def acquire(self):
        conn = self._pool.pop()
        return conn
=======
    async def acquire(self) -> Connection:
        if not self._pool:
            raise PoolExhausted(
                f"pool exhausted after {self._max_size} conns"
            )
        return self._pool.pop()
>>>>>>> REPLACE
```
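Applying such a block is exact string replacement plus a uniqueness check. A minimal sketch (apply_search_replace is illustrative, not Aider's actual implementation, and omits the fuzzy-matching fallback discussed later):

```python
from pathlib import Path

def apply_search_replace(path: Path, search: str, replace: str) -> None:
    """Apply one SEARCH/REPLACE block, enforcing a unique exact match."""
    text = path.read_text()
    count = text.count(search)
    if count == 0:
        raise ValueError("search text not found; file may have changed since read")
    if count > 1:
        raise ValueError(f"search text matches {count} locations; refusing ambiguous edit")
    path.write_text(text.replace(search, replace, 1))
```

For comparison, here is the same edit expressed as a unified diff: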
```diff
--- a/src/db/pool.py	2025-04-24 10:12:00.000000000 +0000
+++ b/src/db/pool.py	2025-04-24 11:00:00.000000000 +0000
@@ -47,6 +47,9 @@
     def _evict_idle(self, max_idle_secs):
         now = time.monotonic()
-    def acquire(self):
-        conn = self._pool.pop()
-        return conn
+    async def acquire(self) -> Connection:
+        if not self._pool:
+            raise PoolExhausted(
+                f"pool exhausted after {self._max_size} conns"
+            )
+        return self._pool.pop()
     def release(self, conn):
```
The hunk header @@ -47,6 +47,9 @@ encodes the start line and line count for both the before and after blocks. Models frequently miscalculate the +N,M count (they count the replaced lines but forget to add the context lines). patch --fuzz=3 tolerates up to three lines of context mismatch, and Aider's DIFF format relies on this. Claude Code sidesteps the problem entirely with the Read→Edit→Write sequence, where line numbers are never embedded in the model's output.
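A minimal application wrapper around GNU patch (a sketch; it assumes the diff carries a/ and b/ path prefixes, hence -p1):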
```python
import pathlib
import subprocess
import tempfile

def apply_diff(diff_text: str, cwd: str, fuzz: int = 3) -> bool:
    """Apply a unified diff with GNU patch; True on clean application."""
    # Write the diff to a temp file so patch can read it via -i.
    with tempfile.NamedTemporaryFile(suffix='.patch', mode='w', delete=False) as f:
        f.write(diff_text)
        patch_path = f.name
    result = subprocess.run(
        ['patch', '--unified', '--forward',
         f'--fuzz={fuzz}', '-p1', '-i', patch_path],
        cwd=cwd, capture_output=True, text=True,
    )
    pathlib.Path(patch_path).unlink(missing_ok=True)
    return result.returncode == 0
```
Many real edits touch multiple files simultaneously: rename a function and update all its callers, change an API contract and update both the implementation and its tests. Applying these changes partially — some files updated, some not — leaves the repo in a broken state. Agents need a transactional strategy.
The safest pattern (used by Aider): build all edited file contents in memory, run a syntax-check pass on every file (e.g. py_compile for Python, rustfmt --check for Rust), and only if all pass, write to disk using os.replace() (an atomic rename on POSIX). No partial state ever hits disk. If a write fails midway due to disk full or permissions, the files not yet replaced are untouched; recovery means writing the remainder, never repairing a half-written file.
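A sketch of this stage-validate-commit pattern, with Python-only validation for brevity (commit_edits and the temp-file naming are illustrative choices, not Aider's actual code):

```python
import os
import py_compile
import tempfile

def commit_edits(edits: dict[str, str]) -> None:
    """Write {path: new_content} only if every file passes a syntax check."""
    # Stage 1: validate everything before touching the repo.
    for path, content in edits.items():
        if path.endswith(".py"):
            with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
                tmp.write(content)
            try:
                py_compile.compile(tmp.name, doraise=True)  # raises on syntax error
            finally:
                os.unlink(tmp.name)
    # Stage 2: commit each file via atomic rename.
    for path, content in edits.items():
        tmp_path = path + ".tmp"          # same directory => same filesystem
        with open(tmp_path, "w") as f:
            f.write(content)
        os.replace(tmp_path, path)        # atomic on POSIX; no partial content
```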
In multi-agent or long-session scenarios, a file may have been modified externally since it was read. Agents should compare the SHA-256 of the file at read time with the file at write time. If they differ, the agent must re-read and re-apply its edit rather than clobber an intervening change.
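A sketch of that staleness guard (the snapshot structure is illustrative):

```python
import hashlib

def file_sha256(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# At read time, remember the hash alongside the content.
snapshot = {"path": "src/db/pool.py", "sha256": file_sha256("src/db/pool.py")}

# At write time, refuse to clobber an intervening change.
if file_sha256(snapshot["path"]) != snapshot["sha256"]:
    raise RuntimeError("file changed since read; re-read and re-apply the edit")
```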
Even a 200 k-token context window fills quickly during a multi-step edit session: conversation turns, file reads, test output, error messages, and the agent’s own reasoning all compete for the same budget. Managing what stays in context is as important as retrieval.
Keep only the last K conversation turns verbatim. Earlier turns are replaced with a one-paragraph summary generated by a smaller/cheaper model (e.g. Haiku or GPT-4o-mini). Claude Code uses a compaction step that fires when context utilisation exceeds a threshold (~85%).
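A sketch of that shape (the keep_last window, the 4-chars-per-token estimate, and the summarize callback are all illustrative choices):

```python
def compact(messages: list[dict], summarize, keep_last: int = 8,
            budget_tokens: int = 170_000) -> list[dict]:
    """Replace old turns with a summary once the estimated size nears the budget."""
    est_tokens = sum(len(m["content"]) for m in messages) // 4  # rough heuristic
    if est_tokens < budget_tokens:
        return messages
    old, recent = messages[:-keep_last], messages[-keep_last:]
    summary = summarize(old)  # one paragraph from a cheaper model
    return [{"role": "user", "content": f"[Summary of earlier turns]\n{summary}"}] + recent
```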
Bash command output is capped at a configurable limit (default ~8 k tokens in Claude Code). Test runners that emit 50 k lines of verbose output are truncated to the first N lines plus the summary line. The agent sees enough to diagnose the failure without wasting budget.
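The truncation itself is a few lines; keeping the head plus the trailing summary lines is one reasonable shape (the limits are illustrative):

```python
def truncate_output(output: str, head_lines: int = 200, tail_lines: int = 5) -> str:
    """Keep the first N lines plus the trailing summary; note what was dropped."""
    lines = output.splitlines()
    if len(lines) <= head_lines + tail_lines:
        return output
    dropped = len(lines) - head_lines - tail_lines
    return "\n".join(lines[:head_lines]
                     + [f"... [{dropped} lines truncated] ..."]
                     + lines[-tail_lines:])
```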
Anthropic’s prompt cache (as of claude-3-5-sonnet-20241022 and later) caches the KV state of the first N tokens. If the system prompt + repo map is stable turn-to-turn, those tokens are billed at 10% of normal input cost after the first hit. This flips the optimisation: it is now cheaper to keep a large stable prefix (repo map, documentation) than to compute it each turn. Aider 0.55+ and Claude Code both exploit this. The practical implication: put the stable repo map near the top of context, not at the bottom.
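With the Anthropic API, the cache breakpoint is expressed via cache_control on the last stable content block. A sketch (SYSTEM_PROMPT, REPO_MAP, and the conversation variable are placeholders; the model id is one example):

```python
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    system=[{
        "type": "text",
        # Stable turn-to-turn; must exceed the model's minimum cacheable length.
        "text": SYSTEM_PROMPT + "\n\n" + REPO_MAP,
        "cache_control": {"type": "ephemeral"},  # cache everything up to here
    }],
    messages=conversation,  # volatile suffix: not cached
)
```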
Claude Code exposes three core editing tools. Their design is not arbitrary: each enforces a constraint that prevents a class of common failure.
Read. Returns the current file contents with line numbers. The agent must call Read before Edit; the tool system enforces this, preventing the model from editing a stale mental image of the file remembered from training data. Parameters:

- file_path: absolute path
- offset: start line (0-indexed)
- limit: max lines (default 2 000)

Edit. Search-replace within the file loaded by a prior Read call. The tool verifies that old_string appears exactly once in the current file before applying; uniqueness enforcement prevents accidental mass-replace. Parameters:

- file_path: must match the prior Read
- old_string: must be unique
- new_string: replacement text
- replace_all: opt-in to replace all occurrences

Write. Writes a complete new file, also requiring a prior Read (or an explicit new-file declaration). Used when the change is so large that search-replace is impractical, or when creating a new file entirely.
Without the Read-before-Edit constraint, a model trained on code from 2024 might "remember" an older version of a file and generate an Edit whose old_string no longer exists. The Edit tool rejects it ("old_string not found") and the agent is forced to re-read. This friction is by design: it is far better to fail loudly at the tool layer than to silently apply a patch to the wrong content. In SWE-bench evaluations, eliminating this category of error accounts for ~5–10% of the pass@1 gap between agents. A typical three-turn sequence:
```js
// Turn 1: agent reads the file
Read({ file_path: "src/db/pool.py", offset: 40, limit: 20 })
// → returns lines 40–60 with line numbers

// Turn 2: agent issues the precise edit
Edit({
  file_path: "src/db/pool.py",
  old_string: "    def acquire(self):\n        conn = self._pool.pop()\n        return conn",
  new_string: "    async def acquire(self) -> Connection:\n        if not self._pool:\n            raise PoolExhausted(f\"pool exhausted\")\n        return self._pool.pop()"
})
// → success: { bytes_written: 412 }

// Turn 3: agent verifies by running the tests
Bash({ command: "pytest tests/test_pool.py -x -q 2>&1 | tail -20" })
```
The three most widely used coding agents have chosen fundamentally different edit architectures. Understanding these choices explains their respective strengths and failure modes.
| Dimension | Aider | Cursor | Claude Code |
|---|---|---|---|
| Edit format | Search-replace block (EDITOR) or unified diff (DIFF), model-selected per request | Inline streaming diff shown to user; accepts/rejects per hunk | Read→Edit→Write tool calls; search-replace semantics |
| Context loading | Pre-built ctags repo-map; user adds files explicitly | @-symbol retrieval; IDE document context always present | Tool-driven, lazy — reads only what is needed |
| Model API | Any litellm-compatible model; EDITOR mode requires strong models (Sonnet 3.7+, GPT-4o) | Proprietary hosted models; optional BYO-API | Anthropic API; claude-sonnet-4 / claude-opus-4 |
| Test loop | Optional --test-cmd; auto-iterates on test failure up to N rounds | Not automatic; user-driven | Bash tool; agent decides when to run tests |
| Multi-file atomicity | In-memory staging, then atomic writes per file; git used for rollback | User accepts/rejects each file diff independently | Sequential tool calls; no formal transaction; agent must detect partial failure |
| Open source | Yes (MIT, github.com/paul-gauthier/aider) | No (Cursor Inc., Electron app) | Yes (Apache 2.0, github.com/anthropics/claude-code) |
Aider. Terminal-native, scriptable, works in CI. The repo-map ensures context is comprehensive upfront. The EDITOR format is the most reliable for large single-function edits. The --architect mode uses two models: a high-quality planner and a cheaper executor.

Cursor. IDE-native diff review gives the user fine-grained control before writes land. Background indexing means @codebase queries are fast even on cold starts. Multi-model routing (cheap for chat, expensive for apply) controls cost.

Claude Code. Zero pre-configuration: no index to build. Tool-call transparency: every file access is explicit and auditable. The agent reasons about what to read next rather than relying on a potentially stale index. Strong on novel codebases it has never seen.
A failing test is the highest-signal feedback an agent can receive. The TDD agentic loop is simple in principle but requires careful engineering to avoid infinite retry loops and token runaway.
Set a hard cap of 3–5 retry rounds per edit task. After that, the agent should surface the failure to the user rather than spinning; without a cap, a model that oscillates between two broken states will consume the entire token budget. Aider's --auto-test / --test-cmd loop caps its reflection rounds internally at a small fixed number; Claude Code delegates the retry decision to the agent's own reasoning, which means a well-prompted agent stops itself and a poorly-prompted one does not.
Truncate test output. A failing pytest run with 200 test files can emit 100 k tokens of output. Pass only the first failed test’s traceback and summary line to the next model call; do not dump all output into context.
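The loop itself, with the cap and truncation wired in (a sketch reusing the truncate_output function above; run_tests, ask_model_for_fix, and apply_edits are placeholders):

```python
MAX_ROUNDS = 4  # hard cap: surface the failure to the user after this

def tdd_loop(task: str) -> bool:
    edits = ask_model_for_fix(task, feedback=None)   # initial attempt
    for _ in range(MAX_ROUNDS):
        apply_edits(edits)
        result = run_tests()                          # e.g. pytest -x -q
        if result.passed:
            return True
        # Feed back only a bounded slice of the failure output.
        feedback = truncate_output(result.output)
        edits = ask_model_for_fix(task, feedback=feedback)
    return False  # escalate: describe the failure instead of spinning
```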
Even the best models produce malformed or semantically incorrect diffs on a non-trivial fraction of requests. Production agents need a layered recovery strategy.
| Failure class | Example | Detection | Recovery |
|---|---|---|---|
| Parse failure | Model emits malformed fence block, missing REPLACE marker | Regex parse of agent output | Re-prompt: “your last edit was malformed; please try again with the correct format” |
| No-match failure | old_string not found in the current file | Edit tool returns old_string not found | Force a Read call, then retry; if it fails again, escalate |
| Ambiguous match | old_string matches 3 locations | Edit tool returns old_string not unique | Re-prompt with surrounding context to make old_string unique |
| Syntax error | Python indentation broken, mismatched braces | py_compile / rustfmt --check | Roll back file; re-prompt with error message + original file content |
| Semantic error | Tests pass but logic is wrong; wrong function renamed | Test suite failure | Test-driven retry loop (slide 07); after N failures escalate |
| Wrong file | Model edits a test file instead of the implementation | Code review (human or automated) | Human checkpoint; agents should not silently clobber unexpected files |
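For no-match failures, one application-layer fallback (mentioned earlier) is fuzzy matching the search text before giving up; a sketch with difflib: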
```python
from difflib import SequenceMatcher

def find_best_match(source: str, pattern: str,
                    threshold: float = 0.85) -> tuple[int, int] | None:
    """Return (start, end) char offsets of best fuzzy match, or None."""
    lines = source.splitlines(True)
    pat_lines = pattern.splitlines(True)
    best_ratio, best_start, best_end = 0.0, 0, 0
    # Slide a window the size of the pattern over the source, line by line.
    for i in range(len(lines) - len(pat_lines) + 1):
        window = lines[i : i + len(pat_lines)]
        ratio = SequenceMatcher(None, window, pat_lines).ratio()
        if ratio > best_ratio:
            best_ratio, best_start = ratio, i
            best_end = i + len(pat_lines)
    if best_ratio >= threshold:
        start_char = sum(len(l) for l in lines[:best_start])
        end_char = sum(len(l) for l in lines[:best_end])
        return start_char, end_char
    return None
```
A safe heuristic: if the same file has been patched and rolled back more than twice in a single task, pause and describe the failure to the user. Automated recovery from a third consecutive failure is unlikely to succeed and likely to produce confusing state. The user can then rephrase the task, split it into smaller steps, or make the initial change manually.
- Requiring that old_string appear exactly once is not a limitation; it is a safety invariant. Never relax it silently, and use replace_all only when replacing every occurrence is genuinely the intent.
- Use git checkout -- . or os.replace() semantics for rollback. Never leave the repo in a partial state.

This deck completes the Coding Agents Internals series. Deck 01 covered how agents find relevant code; this deck covered how agents change it. Together they describe the two core loops of any coding agent: retrieve, then apply. The quality of both loops, not just the underlying model, determines task success rates on realistic codebases.