Companion deck for Components 1 + 2 of Sebastian Raschka’s essay. What goes into a workspace summary, why the prompt is split into a stable prefix and a changing tail, and how the Anthropic / OpenAI prompt cache turns thousands of repeated tokens into a near-zero recurring cost.
Raschka opens Component 1 with a small but devastating example. A user types “fix the tests”. To a competent human colleague this is enough: they know the language, the test runner, the recent changes, the project conventions. To an LLM with nothing but the message, it is unsolvable.
pytest? cargo test? npm test? Something custom? The model cannot know. The fix is a workspace summary, what Raschka calls “stable facts”. Built once, at session start. Cheap to keep around. Carries most of what a colleague would know without being told.
The user message then becomes a delta on top of a known context, not a complete brief in itself.
Raschka writes that the agent “collects info (‘stable facts’ as a workspace summary) upfront before doing any work” precisely so that it does not start “from zero, without context, on every prompt”. Most of the engineering effort in Component 1 is choosing what counts as a stable fact and how to surface it cheaply.
What does a real coding agent put in its workspace summary? Let’s reverse-engineer it from observable behaviour. Here is roughly what Claude Code reads on session start in a typical Python repo (~4 kLOC across src/ and tests/).
```yaml
# Workspace summary — built once at session start
project:
  name: connection-pool
  root: /home/user/code/connection-pool
  vcs: git, branch=feat/idle-eviction, ahead 3 commits of main
tooling:
  language: Python 3.12
  build: uv (pyproject.toml)
  test: pytest -q
  lint: ruff check, ruff format
  types: pyright (strict)
layout:
  - src/                    # 14 Python files, 1 821 LOC
  - src/db/pool.py          # core class, 412 LOC
  - src/db/migrations.py
  - tests/                  # 23 test files, 2 104 LOC
  - docs/architecture.md    # 180 LOC, last modified 8 days ago
conventions:                # pulled from CLAUDE.md / AGENTS.md / README
  - "All public methods must have type hints and a 1-line docstring."
  - "PRs run pytest + ruff + pyright; CI fails on any error."
  - "Avoid mocking the database; use a real Postgres in test fixtures."
recent_change:              # last commit summary
  sha: 3a91f0c
  title: "add idle-connection eviction; tests pending"
```
What goes in: the convention files at the repo root (CLAUDE.md / AGENTS.md / CONTRIBUTING.md), tooling, layout, and the latest commit (deeper history can always be fetched with git log later). What stays out: node_modules and generated code. What must never be read: .env files or password managers.

Aider precomputes a repo map (ctags + PageRank) and inlines it. Cursor builds an embedding index server-side and exposes @codebase. Claude Code does almost no precomputation: it relies on CLAUDE.md plus a few directory listings and tool-call retrieval. Codex emphasises an AGENTS.md at the repo root. All four are different bets about how much to do at session start versus on demand.
Component 2 in Raschka’s essay zooms in on a critical detail: not all of the prompt changes between turns. Some of it is set once and reused for the entire session; some of it accumulates with every new tool call; some of it is regenerated every turn. Sorting these into the right order is the single biggest cost-and-quality lever in a coding harness.
| Layer | Lifetime | Typical size | Cacheable? |
|---|---|---|---|
| System & instructions | Days–months (rarely changes) | 500–2 000 tokens | Yes (hot): place first |
| Tool definitions | Days–weeks | 1 000–4 000 tokens | Yes (hot): place after system |
| Workspace summary | Per-session (~hours) | 500–2 000 tokens | Yes (hot): place at end of prefix |
| Compact transcript | Per-turn growth | 500–5 000 tokens | Partial (warm): up to the last cache breakpoint |
| Working memory | Per-turn rewrite | 200–800 tokens | No (cold): rewritten every turn |
| Latest tool result | This turn only | Variable, often clipped to ~1 500 tokens | No (cold) |
| User message | This turn only | 20–500 tokens | No (cold) |
Sort by stability, descending. Stable layers go first; volatile layers go last. Place a cache breakpoint immediately after the last stable layer. Anything before the breakpoint is paid for once per session; anything after is paid for every turn. Get this ordering wrong and you pay full price on every turn for content the cache could have served.
This is Raschka’s Figure 7 redrawn. A single turn’s prompt has three regions (hot, warm, cold), separated by where the cache breakpoints sit.
```python
from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-opus-4-1",  # illustrative; any cache-supporting model works
    max_tokens=2048,
    system=[
        {"type": "text", "text": SYSTEM_INSTRUCTIONS},   # stable
        {"type": "text", "text": TOOL_DESCRIPTIONS},     # stable
        {"type": "text", "text": WORKSPACE_SUMMARY,      # stable for session
         "cache_control": {"type": "ephemeral"}},        # ← cache breakpoint
    ],
    messages=[
        *compact_transcript,                                              # warm
        {"role": "user",
         "content": working_memory_block + tool_result_block + user_msg}, # cold
    ],
    tools=TOOL_SCHEMAS,
)
# Subsequent turns hit the cache and pay 0.10x for the prefix tokens.
```
End users will never write cache_control directly. The harness arranges the prompt this way every turn so that the user just types “fix the tests” and the right architecture happens automatically. Raschka’s point in Component 2 is that this layout is not optional; it is what separates a usable harness from a wasteful one.
Click any layer below to see what it contains, why it sits where it does, and whether it benefits from the cache. The colours match the diagram on the previous slide: green = cache hot, orange = cache warm, red = cache cold.
Token estimates are illustrative for a typical Claude Code session on a small Python repo. Real values depend on the harness, model, and project size.
The economics behind Component 2 only make sense if you know how the cache actually bills. The Anthropic API uses ephemeral caching with a default 5-minute TTL (an extended 1-hour TTL is also available with a higher write multiplier). OpenAI’s mechanism is similar in spirit.
| Operation | Cost multiplier | What it means |
|---|---|---|
| Cache write — first time you submit a prefix | 1.25× standard input price | You pay a 25% premium to populate the cache. |
| Cache read — subsequent turns within TTL | 0.10× standard input price | Cached prefix tokens cost 10% of normal — a 10× reduction. |
| Cache miss — prefix changed by even 1 byte | 1.00× standard input price | Full re-tokenisation; no benefit. |
| Cache refresh — new request resets the TTL | Free if prefix matches | Active sessions stay hot; idle ones expire after 5 min. |
The 25% write premium is paid up front; the 10× reduction kicks in on subsequent turns. Break-even is therefore almost immediate: the very first cache hit saves 0.90× the prefix price, far more than the 0.25× premium the write cost. Concretely:
For any system prompt + workspace summary >1 000 tokens that is reused 2 or more times, prompt caching is essentially free money. The interesting design question is not whether to cache — it is where to put the breakpoint and how to keep the prefix bit-stable across turns.
Slide the inputs to see how cache geometry shapes a typical session. The calculator uses Anthropic’s default multipliers (1.25× write, 0.10× read) on a 5-minute TTL.
For a Claude Code-style session (5 000-tok prefix, ~12 turns, ~1 200 tok tail at $15/M input) caching saves roughly 70–80% of the input bill. Long sessions amortise the 25% write premium completely; short sessions still come out ahead on the second turn. Sessions of length 1 are the only case where caching costs slightly more — and even those typically don’t happen in practice for coding agents.
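To make the calculator’s arithmetic inspectable, here is a minimal sketch of the same cost model in Python. The function name is mine, and the model is deliberately simplified (flat per-turn tail, no transcript caching), so it understates what a harness with a second breakpoint saves:

```python
# Sketch of the cache cost model, using Anthropic's default 5-minute-TTL
# multipliers (1.25x write, 0.10x read). All inputs are illustrative.
def session_input_cost(prefix_tok: int, tail_tok: int, turns: int,
                       price_per_mtok: float = 15.0,
                       cached: bool = True) -> float:
    """Total input cost in dollars for one session."""
    p = price_per_mtok / 1_000_000
    if not cached:
        return turns * (prefix_tok + tail_tok) * p   # full price every turn
    write = 1.25 * prefix_tok * p                    # turn 1 populates the cache
    reads = 0.10 * prefix_tok * p * (turns - 1)      # later turns read it back
    return write + reads + turns * tail_tok * p      # tails are always cold

# Example: 5,000-token prefix, 1,200-token tail, 12 turns at $15/M input.
base = session_input_cost(5_000, 1_200, 12, cached=False)
opt = session_input_cost(5_000, 1_200, 12, cached=True)
print(f"uncached ${base:.2f}  cached ${opt:.2f}  saved {1 - opt / base:.0%}")
```

On these inputs the sketch reports roughly 65% savings; the 70–80% figure above comes from additionally caching the growing transcript behind a second breakpoint.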
The cache is content-addressed: the cache key is the byte-for-byte prefix. Any of the following silently turn cache reads back into cache writes — an order of magnitude price increase, with no warning.
“Today is 2026-04-28T09:35:12Z” injected at the top of every turn destroys the cache. If you need a date, put it after the cache breakpoint, or round to the day.
If your tool schema is regenerated each turn from a Python dict (which had its key order shuffled by a refactor), the prefix changes. Sort keys deterministically or freeze schemas at startup.
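A minimal sketch of that freeze, under the assumption that tool schemas live in a plain list of dicts (the `raw_tool_schemas` example here is hypothetical):

```python
import json

# Hypothetical schema list; in a real harness this might be generated from
# function signatures, which is exactly where ordering drift creeps in.
raw_tool_schemas = [
    {"name": "run_tests",
     "description": "Run the project's test suite.",
     "input_schema": {"type": "object", "properties": {}}},
]

def freeze_tool_schemas(tools: list[dict]) -> list[dict]:
    """Return an order-stable, canonical copy of the tool schema list."""
    # Sort tools by name, then round-trip through JSON with sorted keys so
    # the serialised bytes are identical on every run.
    canonical = sorted(tools, key=lambda t: t["name"])
    return json.loads(json.dumps(canonical, sort_keys=True))

TOOL_SCHEMAS = freeze_tool_schemas(raw_tool_schemas)  # once, at startup
```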
Tempting to do “if user mentions tests, add a tests-related instruction to the system prompt”. This branches the prefix per turn and defeats the cache. Push the conditional content into the cold tail.
A trailing newline added by an editor, a tab vs spaces change, NFC/NFD normalisation differences — all change the byte stream. Diff your prefix between turns when caching mysteriously stops working.
If the user goes for coffee, the cache expires. The next turn pays the 1.25× write premium again. Long-form sessions benefit from the 1-hour TTL option even with its higher write multiplier.
Putting the breakpoint before the workspace summary means the summary is billed at full price every turn. Putting it after volatile session state is pointless, because that content changes every turn and breaks the prefix. The right place is just after the last stable layer.
Anthropic’s API returns usage.cache_creation_input_tokens and usage.cache_read_input_tokens on every response. Log both per turn. If the read counter doesn’t climb during a session, something is invalidating your prefix — usually one of the bullets above.
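A minimal sketch of that per-turn logging, assuming `response` is the return value of the `client.messages.create(...)` call shown earlier and `turn` is a hypothetical loop counter:

```python
import logging

logger = logging.getLogger("harness")

usage = response.usage
logger.info(
    "turn=%d cache_write=%d cache_read=%d uncached=%d",
    turn,
    usage.cache_creation_input_tokens or 0,  # tokens written to the cache
    usage.cache_read_input_tokens or 0,      # tokens served from the cache
    usage.input_tokens,                      # tokens billed at full price
)
# Healthy shape: a large write on turn 1, reads dominating from turn 2 on.
```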
Translating Components 1 + 2 from concept to code. These are patterns I’ve seen work in real harnesses; each maps to a specific design decision the article calls out.
A CLAUDE.md / AGENTS.md file at the repo root that the harness inlines into the workspace summary. It costs nothing to maintain (it’s just a doc), survives across sessions, and lives in version control where it can be reviewed alongside code changes. Treat it as the harness’s contract with the project: conventions, commands, do-and-don’ts.
```markdown
# CLAUDE.md

## Repo conventions
- Test: pytest -q --maxfail=1
- Lint: ruff check . && ruff format --check .
- Types: pyright

## Style
- Type-hint every public function.
- Docstrings: 1 line for simple, full Google style for complex.
- No mocking the database. Use the postgres fixture in tests/conftest.py.

## Where to look first
- Connection logic: src/db/pool.py
- Migrations: src/db/migrations.py
- Architecture doc: docs/architecture.md

## Don’t
- Don’t edit generated/ — regenerated by make gen.
- Don’t add new top-level packages without proposing in a comment first.
```
A pure function build_prefix(workspace) -> bytes that always produces the same bytes for the same workspace. Sort dicts, normalise newlines, strip trailing whitespace. Run it through a SHA-256 and log the digest each turn — if it changes, your cache will too.
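A minimal sketch of such a builder, assuming the workspace summary is held as a plain dict (the helper names are mine):

```python
import hashlib
import json

def build_prefix(workspace: dict) -> bytes:
    """Render the stable prompt prefix: same workspace, same bytes."""
    # Canonical JSON: sorted keys, fixed separators, no ASCII escaping noise.
    text = json.dumps(workspace, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)
    # Normalise newlines and strip trailing whitespace before encoding.
    text = text.replace("\r\n", "\n").rstrip()
    return text.encode("utf-8")

def prefix_digest(workspace: dict) -> str:
    """SHA-256 of the prefix bytes; log this once per turn."""
    return hashlib.sha256(build_prefix(workspace)).hexdigest()
```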
Anthropic supports up to four cache breakpoints. Put the second one immediately after the compact transcript: that way, even though the transcript grows, the part of it that hasn’t changed since last turn is also cached. This is how long sessions stay cheap.
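A sketch of that second breakpoint, assuming `compact_transcript` is the message list from the earlier example and each message’s `content` is a list of blocks:

```python
def mark_transcript_breakpoint(compact_transcript: list[dict]) -> None:
    """Pin a cache breakpoint at the end of the compact transcript."""
    if not compact_transcript:
        return
    last_block = compact_transcript[-1]["content"][-1]
    # Everything up to and including this block joins the cached prefix;
    # only this turn's new tail is billed at full price.
    last_block["cache_control"] = {"type": "ephemeral"}
```

A real harness would also clear the marker it set on the previous turn, since Anthropic caps a request at four breakpoints.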
If your prompt genuinely needs a date (auditing, prompt-injection-aware grounding), round to the day. "today is 2026-04-28" survives caching for 24 hours; "now is 2026-04-28T09:35:12.741Z" survives for one millisecond.
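For example, a day-granular stamp keeps the prefix byte-identical until midnight UTC:

```python
from datetime import datetime, timezone

# Day-granular date line; safe to place in the cached prefix.
today_line = f"Today is {datetime.now(timezone.utc).date().isoformat()}."
```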
Take a real prompt your code already builds. Print it. Highlight in three colours: hot (truly stable), warm (reusable until session changes), cold (changes every turn). Most prompts have at least one cold thing wedged into a hot region.
Pick a side project. Write the smallest CLAUDE.md you would actually rely on. Use Claude Code or Cursor against it for a week. Notice which lines you find yourself adding — those are the actual conventions of the project.
Log cache_creation_input_tokens and cache_read_input_tokens per turn. Plot the ratio. A healthy session reaches >80% reads within 2–3 turns. If yours doesn’t, something is breaking the prefix.
Write a pure function in 30 lines that takes a workspace dict and returns the prefix bytes plus a SHA-256. Make it part of your test suite: same input → same digest. This catches subtle regressions instantly.
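A minimal sketch of that test, reusing the hypothetical `build_prefix` / `prefix_digest` helpers sketched earlier:

```python
import json

def test_prefix_is_deterministic():
    ws = {"tooling": {"test": "pytest -q"},
          "project": {"name": "connection-pool"}}
    clone = json.loads(json.dumps(ws))  # same content, fresh objects
    assert build_prefix(ws) == build_prefix(clone)
    assert prefix_digest(ws) == prefix_digest(clone)
```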
Once you have a stable prefix and a working cache, the next bottleneck moves to what the model is allowed to do in each turn. That is Component 3 — tool access — covered in the next deck.