Coding Agents Internals Series — Presentation 04

Live Repo Context & the Prompt Cache

Companion deck for Components 1 + 2 of Sebastian Raschka’s essay. What goes into a workspace summary, why the prompt is split into a stable prefix and a changing tail, and how the Anthropic / OpenAI prompt cache turns thousands of repeated tokens into a near-zero recurring cost.

Workspace summary · Prompt prefix · Anthropic cache · 5-min TTL · CLAUDE.md · AGENTS.md
Repo → Workspace summary → Stable prefix → Cache breakpoint → Session state → User turn → LLM
00

Topics We’ll Cover

  • The “Fix the Tests” Problem
  • Anatomy of a Workspace Summary
  • Stable Facts vs Changing Facts
  • Prompt-Shape Architecture
  • Interactive: Walk Through a Real Prompt
  • Prompt-Cache Mechanics — What You Actually Pay
  • Interactive: Cache-Hit Cost Calculator
  • Pitfalls — What Breaks the Cache
  • Practical Patterns You Can Steal
  • Things to Try Yourself

01

The “Fix the Tests” Problem

Raschka opens Component 1 with a small but devastating example. A user types “fix the tests”. To a competent human colleague this is enough: they know the language, the test runner, the recent changes, the project conventions. To an LLM with nothing but the message, it is unsolvable.

What the LLM is missing

  • Which tests are failing — or how to find them.
  • The test command — pytest? cargo test? npm test? something custom?
  • The repo layout — where is the source vs the tests?
  • Project conventions — lint rules, type checking, commit style.
  • The current Git branch and what just changed.

What the agent needs

A workspace summary — what Raschka calls “stable facts”. Built once, at session start. Cheap to keep around. Carries most of what a colleague would know without being told.

The user message then becomes a delta on top of a known context, not a complete brief in itself.

A telling phrase from the article

Raschka writes that the agent “collects info (‘stable facts’ as a workspace summary) upfront before doing any work” precisely so that it does not start “from zero, without context, on every prompt”. Most of the engineering effort in Component 1 is choosing what counts as a stable fact and how to surface it cheaply.

02

Anatomy of a Workspace Summary

What does a real coding agent put in its workspace summary? Let’s reverse-engineer it from observable behaviour. Here is roughly what Claude Code reads on session start in a typical Python repo (~2 kLOC, mixed src/tests).

workspace_summary.md — observed Claude Code session-start payload
# Workspace summary — built once at session start

project:
  name: connection-pool
  root:  /home/user/code/connection-pool
  vcs:   git, branch=feat/idle-eviction, ahead 3 commits of main

tooling:
  language: Python 3.12
  build:    uv (pyproject.toml)
  test:     pytest -q
  lint:     ruff check, ruff format
  types:    pyright (strict)

layout:
  - src/                 # 14 Python files, 1 821 LOC
  - src/db/pool.py       # core class, 412 LOC
  - src/db/migrations.py
  - tests/               # 23 test files, 2 104 LOC
  - docs/architecture.md # 180 LOC, last modified 8 days ago

conventions:                 # pulled from CLAUDE.md / AGENTS.md / README
  - "All public methods must have type hints and a 1-line docstring."
  - "PRs run pytest + ruff + pyright; CI fails on any error."
  - "Avoid mocking the database; use a real Postgres in test fixtures."

recent_change:                # last commit summary
  sha: 3a91f0c
  title: "add idle-connection eviction; tests pending"

What goes in, what stays out

In — cheap, stable, useful

  • Project name, language, build/test commands.
  • Top-level directory structure to depth 2–3.
  • Identified key files (largest, most-imported, recently changed).
  • Conventions from CLAUDE.md / AGENTS.md / CONTRIBUTING.md.
  • Branch and most-recent commit, succinctly.

Out — expensive, unstable, or noisy

  • Full file contents (load on demand via tools).
  • Long Git history (the harness can run git log later).
  • Build artefacts, node_modules, generated code.
  • Secrets, env vars, anything from .env or password managers.
  • Output of any command that takes >1 s — defer to a tool call.
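To make the in/out split concrete, here is a minimal sketch of a session-start summary builder. It is not Claude Code's actual implementation; the function name, the depth limit, and the choice of git commands are assumptions, but each field maps to one of the "in" bullets above.

Python — hypothetical session-start summary builder
# build_summary.py — sketch of a session-start workspace summary.
# Hypothetical helper, not Claude Code's actual implementation.
import subprocess
from pathlib import Path

MAX_DEPTH = 2                                   # only walk the top 2–3 levels
SKIP = {".git", "node_modules", "__pycache__", ".venv", "dist"}

def _git(*args: str, cwd: Path) -> str:
    """Run a fast git command; anything slow is deferred to a later tool call."""
    out = subprocess.run(["git", *args], cwd=cwd, capture_output=True, text=True)
    return out.stdout.strip()

def build_workspace_summary(root: Path) -> str:
    lines = [f"project: {root.name}", f"root: {root}"]

    # Branch and most recent commit, succinctly (cheap, stable for the session).
    branch = _git("rev-parse", "--abbrev-ref", "HEAD", cwd=root)
    last = _git("log", "-1", "--format=%h %s", cwd=root)
    lines += [f"branch: {branch}", f"last_commit: {last}"]

    # Top-level layout to a shallow depth; skip artefacts and vendored code.
    lines.append("layout:")
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if any(part in SKIP for part in rel.parts):
            continue
        if len(rel.parts) > MAX_DEPTH or not path.is_dir():
            continue                            # full file contents stay out
        lines.append(f"  - {rel}/")

    # Conventions files are inlined verbatim; source files are loaded on demand.
    for name in ("CLAUDE.md", "AGENTS.md", "CONTRIBUTING.md"):
        f = root / name
        if f.is_file():
            lines.append(f"conventions ({name}):")
            lines.append(f.read_text(encoding="utf-8").strip())

    return "\n".join(lines)

if __name__ == "__main__":
    print(build_workspace_summary(Path.cwd()))
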
Where harnesses differ

Aider precomputes a repo map (ctags + PageRank) and inlines it. Cursor builds an embedding index server-side and exposes @codebase. Claude Code does almost no precomputation — it relies on CLAUDE.md + a few directory listings + tool-call retrieval. Codex emphasises an AGENTS.md at the repo root. All four are different bets about how much to do at session start vs on demand.

03

Stable Facts vs Changing Facts

Component 2 in Raschka’s essay zooms in on a critical detail: not all of the prompt changes between turns. Some of it is set once and reused for the entire session; some of it accumulates with every new tool call; some of it is regenerated every turn. Sorting these into the right order is the single biggest cost-and-quality lever in a coding harness.

System & instructions — stable for the entire deployment
Tool definitions — stable until tools change (rarely)
Workspace summary — stable for the session
Compact transcript — grows with each turn
Working memory — updates each turn
Latest tool result — only this turn
User message — only this turn

Cache lifetimes for each layer

Layer | Lifetime | Typical size | Cacheable?
System & instructions | Days–months (rarely changes) | 500–2 000 tokens | Yes — place first (hot)
Tool definitions | Days–weeks | 1 000–4 000 tokens | Yes — place after system (hot)
Workspace summary | Per-session (~hours) | 500–2 000 tokens | Yes — place at end of prefix (hot)
Compact transcript | Grows each turn | 500–5 000 tokens | Partial — cached up to the last breakpoint (warm)
Working memory | Rewritten each turn | 200–800 tokens | No — changes every turn (cold)
Latest tool result | This turn only | Variable, often clipped to ~1 500 tokens | No (cold)
User message | This turn only | 20–500 tokens | No (cold)
The architectural rule

Sort by stability, descending. Stable layers go first; volatile layers go last. Place a cache breakpoint immediately after the last stable layer. Anything before the breakpoint is paid for once per session; anything after is paid for every turn. Get this ordering wrong and you pay full price on every turn for content the cache could have served.
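A sketch of how a harness can make that rule mechanical rather than hand-tuned: tag each layer with a stability flag and derive the breakpoint position from it. The Layer type and assemble_prompt helper below are invented for illustration; the invariant they encode (stable layers first, breakpoint right after the last one) is the point.

Python — toy layer-ordering sketch (hypothetical)
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    text: str
    stable: bool                      # True = unchanged across turns this session

def assemble_prompt(layers: list[Layer]) -> tuple[str, int]:
    """Return (prompt_text, breakpoint_index).

    Stable layers first, volatile layers last; the cache breakpoint sits
    immediately after the last stable layer, so everything before it is paid
    once per session and everything after is paid every turn.
    """
    ordered = sorted(layers, key=lambda l: not l.stable)   # stable layers sort first
    breakpoint_index = sum(1 for l in ordered if l.stable)
    return "\n\n".join(l.text for l in ordered), breakpoint_index

layers = [
    Layer("user message", "fix the tests", stable=False),
    Layer("system & instructions", "...", stable=True),
    Layer("latest tool result", "...", stable=False),
    Layer("tool definitions", "...", stable=True),
    Layer("workspace summary", "...", stable=True),
]
prompt, bp = assemble_prompt(layers)
print(f"cache breakpoint after layer #{bp}")               # after the 3 stable layers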

04

Prompt-Shape Architecture

This is Raschka’s Figure 7 redrawn. A single turn’s prompt has three regions, separated by where the cache breakpoint sits.

Stable prefix — cached
  system · tool defs · workspace summary · conventions
  ~3 000–8 000 tokens · written once, read every turn at 10% cost
  cache key = exact prefix bytes, including order
cache breakpoint ↑
Session state — partially cached
  compact transcript · working memory file
  grows or rewrites each turn · benefits from a secondary breakpoint
Turn-local — uncached
  latest tool result · user message · control directives
  paid at full price every turn — keep it small

How a real Anthropic API call expresses this

Python — using cache_control to mark the breakpoint
from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    system=[
        {"type": "text", "text": SYSTEM_INSTRUCTIONS},        # stable
        {"type": "text", "text": TOOL_DESCRIPTIONS},          # stable
        {"type": "text", "text": WORKSPACE_SUMMARY,           # stable for session
         "cache_control": {"type": "ephemeral"}},    # ← cache breakpoint
    ],
    messages=[
        *compact_transcript,                                 # warm
        {"role": "user",
         "content": working_memory_block + tool_result_block + user_msg},  # cold
    ],
    tools=TOOL_SCHEMAS,
)

# Subsequent turns hit the cache and pay 0.10x for the prefix tokens.
Why this is harness work, not user work

End users will never write cache_control directly. The harness arranges the prompt this way every turn so that the user just types “fix the tests” and the right architecture happens automatically. Raschka’s point in Component 2 is that this layout is not optional; it is what separates a usable harness from a wasteful one.

05

Interactive: Walk Through a Real Prompt

Click any layer below to see what it contains, why it sits where it does, and whether it benefits from the cache. The colours match the diagram on the previous slide: green = cache hot, orange = cache warm, red = cache cold.

SYSTEM · instructions, persona, refusals · ~1 200 tok
TOOLS · JSON schemas for read_file, write_file, bash, … · ~2 800 tok
WORKSPACE SUMMARY · layout, conventions, branch (CACHE BREAKPOINT after) · ~1 600 tok
COMPACT TRANSCRIPT · previous turns, summarised · ~2 100 tok
WORKING MEMORY · current task, important files, notes · ~600 tok
LATEST TOOL RESULT · output of last command · ~900 tok
USER MESSAGE · this turn’s instruction · ~80 tok
Click a layer above to see its role, lifetime, and cache status. Try them all — the difference between layers is the entire reason this architecture matters.
cache hot — paid once, reused for session · cache warm — partially reusable · cache cold — full price every turn

Token estimates are illustrative for a typical Claude Code session on a small Python repo. Real values depend on the harness, model, and project size.

06

Prompt-Cache Mechanics — What You Actually Pay

The economics behind Component 2 only make sense if you know how the cache actually bills. The Anthropic API uses ephemeral caching with a default 5-minute TTL (an extended 1-hour TTL is also available with a higher write multiplier). OpenAI’s mechanism is similar in spirit.

The price multipliers (Anthropic, default 5-min TTL)

Operation | Cost multiplier | What it means
Cache write — first time you submit a prefix | 1.25× standard input price | You pay a 25% premium to populate the cache.
Cache read — subsequent turns within TTL | 0.10× standard input price | Cached prefix tokens cost 10% of normal — a 10× reduction.
Cache miss — prefix changed by even 1 byte | 1.00× standard input price | Full re-tokenisation; no benefit.
Cache refresh — new request resets the TTL | Free if prefix matches | Active sessions stay hot; idle ones expire after 5 min.

Break-even analysis

The 25% write premium is paid up front; the 10× reduction kicks in on every subsequent turn. The break-even is therefore very low: a single cache read within the TTL already more than repays the write premium. Concretely:

Figure: cumulative cost of a 5 000-token stable prefix vs turn count, no cache vs with cache. With the 1.25×/0.10× multipliers the cached line is roughly 4–5× cheaper by turn 10.
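The same numbers fall out of a few lines of arithmetic. This is a simplified model (stable prefix only, cache never expiring mid-session); real bills also include the uncached tail and output tokens.

Python — cumulative prefix cost, cached vs uncached
WRITE, READ = 1.25, 0.10              # Anthropic default-TTL multipliers

def cumulative_cost(prefix_tokens: int, turns: int, cached: bool) -> float:
    """Cost in token-equivalents (multiply by the per-token input price)."""
    if not cached:
        return prefix_tokens * turns
    # Turn 1 pays the 25% write premium; later turns read at 10%.
    return prefix_tokens * (WRITE + READ * (turns - 1))

for n in (1, 2, 5, 10):
    no_cache = cumulative_cost(5_000, n, cached=False)
    with_cache = cumulative_cost(5_000, n, cached=True)
    print(f"turn {n:>2}: no cache {no_cache:>8.0f}  "
          f"cached {with_cache:>8.0f}  ratio {no_cache / with_cache:4.1f}x")
# turn 1: cached costs 1.25x (the only losing case)
# turn 2: already ~1.5x cheaper; by turn 10, roughly 4-5x cheaper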
A useful rule of thumb

For any system prompt + workspace summary >1 000 tokens that is reused 2 or more times, prompt caching is essentially free money. The interesting design question is not whether to cache — it is where to put the breakpoint and how to keep the prefix bit-stable across turns.

07

Interactive: Cache-Hit Cost Calculator

Slide the inputs to see how cache geometry shapes a typical session. The calculator uses Anthropic’s default multipliers (1.25× write, 0.10× read) on a 5-minute TTL.

Calculator outputs (without caching, with caching, saved per session) update live as you move the sliders.
What the calculator tells you

For a Claude Code-style session (5 000-tok prefix, ~12 turns, ~1 200 tok tail at $15/M input) caching saves roughly 70–80% of the input bill. Long sessions amortise the 25% write premium completely; short sessions still come out ahead on the second turn. Sessions of length 1 are the only case where caching costs slightly more — and even those typically don’t happen in practice for coding agents.

08

Pitfalls — What Breaks the Cache

The cache is content-addressed: the cache key is the byte-for-byte prefix. Any of the following silently turn cache reads back into cache writes — an order of magnitude price increase, with no warning.

Timestamps in the system prompt

“Today is 2026-04-28T09:35:12Z” injected at the top of every turn destroys the cache. If you need a date, put it after the cache breakpoint, or round to the day.

Random JSON key ordering

If your tool schema is regenerated each turn from a Python dict (which had its key order shuffled by a refactor), the prefix changes. Sort keys deterministically or freeze schemas at startup.
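Assuming the schemas are plain JSON-serialisable dicts, canonical serialisation is a one-liner; compute it once at startup and reuse the string every turn.

Python — byte-stable schema serialisation
import json

tool_schemas = {"read_file": {"type": "object",
                              "properties": {"path": {"type": "string"}}}}

# Canonical, byte-stable output: sorted keys, fixed separators, ASCII only.
# Compute once at startup and reuse the resulting string on every turn.
TOOL_SCHEMA_TEXT = json.dumps(tool_schemas, sort_keys=True,
                              separators=(",", ":"), ensure_ascii=True)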

User-message-dependent prefix

Tempting to do “if user mentions tests, add a tests-related instruction to the system prompt”. This branches the prefix per turn and defeats the cache. Push the conditional content into the cold tail.

Whitespace and Unicode drift

A trailing newline added by an editor, a tab vs spaces change, NFC/NFD normalisation differences — all change the byte stream. Diff your prefix between turns when caching mysteriously stops working.

5-minute TTL idle

If the user goes for coffee, the cache expires. The next turn pays the 1.25× write premium again. Long-form sessions benefit from the 1-hour TTL option even with its higher write multiplier.

Wrong breakpoint placement

Putting the breakpoint before the workspace summary means the summary sits in the cold tail and is billed at full price every turn. Putting it after session state buys nothing, because session state changes every turn and the prefix never matches. The right place is just after the last stable layer.

A diagnostic worth running

Anthropic’s API returns usage.cache_creation_input_tokens and usage.cache_read_input_tokens on every response. Log both per turn. If the read counter doesn’t climb during a session, something is invalidating your prefix — usually one of the bullets above.
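In the Python SDK those counters sit on response.usage, so the logging is a few lines. The logger name and the warning rule below are arbitrary choices, not part of the SDK.

Python — per-turn cache telemetry
import logging

log = logging.getLogger("harness.cache")

def log_cache_usage(turn: int, response) -> None:
    """Log per-turn cache telemetry from an Anthropic Messages API response."""
    usage = response.usage
    written = usage.cache_creation_input_tokens or 0
    read = usage.cache_read_input_tokens or 0
    log.info("turn %d: cache_write=%d cache_read=%d uncached=%d",
             turn, written, read, usage.input_tokens)
    if turn > 1 and read == 0:
        # After the first turn the prefix should be hot; a zero here usually
        # means something is mutating the prefix bytes (see the pitfalls above).
        log.warning("turn %d: no cache reads; prefix is being invalidated", turn)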

09

Practical Patterns You Can Steal

Translating Components 1 + 2 from concept to code. These are patterns I’ve seen work in real harnesses; each maps to a specific design decision the article calls out.

Pattern 1 — the CLAUDE.md / AGENTS.md root file

A markdown file at the repo root that the harness inlines into the workspace summary. Costs nothing to maintain (it’s just a doc), survives across sessions, and lives in version control where it can be reviewed alongside code changes. Treat it as the harness’s contract with the project: conventions, commands, do-and-don’ts.

CLAUDE.md — minimal but high-signal example
# CLAUDE.md

## Repo conventions
- Test:  pytest -q --maxfail=1
- Lint:  ruff check . && ruff format --check .
- Types: pyright

## Style
- Type-hint every public function.
- Docstrings: 1 line for simple, full Google style for complex.
- No mocking the database. Use the postgres fixture in tests/conftest.py.

## Where to look first
- Connection logic:  src/db/pool.py
- Migrations:        src/db/migrations.py
- Architecture doc:  docs/architecture.md

## Don’t
- Don’t edit generated/ — regenerated by make gen.
- Don’t add new top-level packages without proposing in a comment first.

Pattern 2 — deterministic prefix builder

A pure function build_prefix(workspace) -> bytes that always produces the same bytes for the same workspace. Sort dicts, normalise newlines, strip trailing whitespace. Run it through a SHA-256 and log the digest each turn — if it changes, your cache will too.
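A minimal sketch of that builder, with the normalisation choices as assumptions: the property that matters is that the same workspace dict always yields the same bytes, so the digest doubles as a regression check.

Python — deterministic prefix builder (sketch)
import hashlib
import json
import unicodedata

def build_prefix(workspace: dict) -> tuple[bytes, str]:
    """Pure function: same workspace dict in, same prefix bytes (and digest) out."""
    text = json.dumps(workspace, sort_keys=True, ensure_ascii=False, indent=2)
    text = unicodedata.normalize("NFC", text)                     # pin one Unicode form
    text = "\n".join(line.rstrip() for line in text.splitlines()) # strip trailing ws
    data = (text + "\n").encode("utf-8")                          # exactly one final newline
    return data, hashlib.sha256(data).hexdigest()

prefix, digest = build_prefix({"project": "connection-pool", "test": "pytest -q"})
print(digest)   # log this every turn; if it changes, your cache will too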

Pattern 3 — secondary breakpoint mid-session

Anthropic supports up to four cache breakpoints. Put the second one immediately after the compact transcript: that way, even though the transcript grows, the part of it that hasn’t changed since last turn is also cached. This is how long sessions stay cheap.
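In the Anthropic API that second breakpoint is just another cache_control marker, placed on the last content block of the already-summarised transcript; the first breakpoint stays at the end of the system prefix, as on slide 04. A rough sketch, with invented variable names and assuming the transcript messages carry plain string content:

Python — secondary breakpoint on the compact transcript (sketch)
def build_messages(compact_transcript: list[dict], turn_tail: str) -> list[dict]:
    messages = [dict(m) for m in compact_transcript]     # summarised prior turns
    if messages:
        last = messages[-1]
        # Mark the last transcript block so everything up to here is cached too.
        # Assumes last["content"] is a plain string, not already a block list.
        last["content"] = [{"type": "text", "text": last["content"],
                            "cache_control": {"type": "ephemeral"}}]
    # Turn-local tail stays after the breakpoint and is paid at full price.
    messages.append({"role": "user", "content": turn_tail})
    return messages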

Pattern 4 — round timestamps to coarse buckets

If your prompt genuinely needs a date (auditing, prompt-injection-aware grounding), round to the day. "today is 2026-04-28" survives caching for 24 hours; "now is 2026-04-28T09:35:12.741Z" survives for one millisecond.
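Day-level rounding is one line; assuming UTC is acceptable for the harness, the prefix then only changes once per day.

Python — day-granular date for the prefix
from datetime import datetime, timezone

# Day-granular: the prefix bytes only change once per day, so the cache survives.
today = datetime.now(timezone.utc).date().isoformat()    # e.g. "2026-04-28"
DATE_LINE = f"today is {today}"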

10

Things to Try Yourself

Audit one of your prompts

Take a real prompt your code already builds. Print it. Highlight in three colours: hot (truly stable), warm (reusable until session changes), cold (changes every turn). Most prompts have at least one cold thing wedged into a hot region.

Write a CLAUDE.md for a small project

Pick a side project. Write the smallest CLAUDE.md you would actually rely on. Use Claude Code or Cursor against it for a week. Notice which lines you find yourself adding — those are the actual conventions of the project.

Plot your cache hit rate over time

Log cache_creation_input_tokens and cache_read_input_tokens per turn. Plot the ratio. A healthy session reaches >80% reads within 2–3 turns. If yours doesn’t, something is breaking the prefix.

Build the deterministic prefix builder

Write a pure function in 30 lines that takes a workspace dict and returns the prefix bytes plus a SHA-256. Make it part of your test suite: same input ↠ same digest. This catches subtle regressions instantly.

Where this leads

Once you have a stable prefix and a working cache, the next bottleneck moves to what the model is allowed to do in each turn. That is Component 3 — tool access — covered in the next deck.

Next deck → Tool Access & Use