Beyond Standard LLMs — Companion to Sebastian Raschka — Presentation 03

Code World Models

A code model that learns to simulate execution, not just text. CWM 32B matches gpt-oss-20b, and with test-time scaling exceeds gpt-oss-120b — at a 4× smaller parameter count. The world-modelling mid-training stage is the secret.


00

Topics We’ll Cover

Where this deck sits in the series

Linear hybrids and diffusion are general-purpose architectural plays. CWM is different: it is a training innovation tied to a specific domain — code. The interesting question is whether the world-model trick generalises to other tasks, or whether code is uniquely suited because it has a free, executable, ground-truth simulator.

01

Why Standard Code LLMs Fail at Execution

Standard code LLMs are next-token predictors over source code. They learn to write beautifully fluent code that follows the patterns in their training data. They do not, in general, know what their code does when it runs. The model has read the function signature, the call sites, the docstring — but it has not run the program in its head.

What standard code LLMs know

  • How code is usually spelled: idioms, naming, structure.
  • What APIs commonly compose with what other APIs.
  • What error messages tend to follow what bugs in training-data tickets.
  • How to imitate “a unit test for this” or “a docstring for this”.

What they don’t reliably know

  • What the value of x is after line 5.
  • Whether the loop terminates.
  • Whether self.cache[key] is populated when line 12 executes.
  • What the program prints if you actually run it.
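
A concrete instance of the last two bullets — no amount of pattern-matching over source text settles what this prints, or even whether the loop halts; you have to simulate it:

def collatz_steps(n):
    """Number of Collatz steps from n down to 1."""
    steps = 0
    while n != 1:
        n = 3 * n + 1 if n % 2 else n // 2
        steps += 1
    return steps

print(collatz_steps(27))  # 111 — a fact about execution, not syntax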

Real engineering tasks — debugging, multi-step refactors, fixing test failures — require knowing what the program does. The standard-LLM workaround is to give it a tool: let it run the code in a sandbox and read the output. CWM’s bet is that you can also internalise some of that capability so the model needs the sandbox less.

Raschka’s framing of the difference

“A code world model learns to simulate what happens when the code runs by predicting program state evolution rather than just syntax patterns.” The clean-up of that statement: standard LLMs predict the next token; CWM predicts what the next token implies for the program’s state, and learns from the gap between its prediction and what really happened.

02

The CWM Bet — Predict State, Not Just Syntax

The training data has a new shape. For every code snippet, the training corpus also contains execution traces — structured records of variable values, function entry/exit, returned objects, side effects, exception paths.

CWM training-style record (illustrative)
# source
def tally(items):
    counts = {}
    for x in items:
        counts[x] = counts.get(x, 0) + 1
    return counts

# execution trace, interleaved at training time
<TRACE>
  call: tally(items=['a', 'b', 'a'])
  step 1: counts = {}
  step 2: x = 'a'; counts = {'a': 1}
  step 3: x = 'b'; counts = {'a': 1, 'b': 1}
  step 4: x = 'a'; counts = {'a': 2, 'b': 1}
  return: {'a': 2, 'b': 1}
</TRACE>

The model is trained to predict both the source code and the trace tokens. At inference time, the model can choose to emit a trace before answering — or you can prompt it to. The execution intuition lives in the model, not just in a tool the model calls.
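
A minimal sketch of what “trained to predict both” can look like — an ordinary next-token loss over the interleaved sequence, with an optional up-weighting of trace positions. The weighting and the mask interface are illustrative assumptions, not published CWM details:

import torch
import torch.nn.functional as F

def interleaved_lm_loss(logits, targets, trace_mask, trace_weight=1.0):
    # logits: (batch, seq, vocab); targets: inputs shifted by one,
    # as in any causal LM; trace_mask: True at <TRACE>-span positions
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    # trace_weight=1.0 recovers the plain next-token objective
    weights = 1.0 + (trace_weight - 1.0) * trace_mask.float()
    return (per_token * weights).mean()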

Why this is more than “chain-of-thought for code”

Raschka’s key claim

“Tokens can encode structured execution traces rather than plain text”. The structured-trace format is what gives the approach traction: ordinary CoT is a model’s guess about its own reasoning; an execution trace is the program’s actual behaviour. The training signal is far stronger.

03

The Architecture — A 32B Model With Sliding-Window Attention

The architecture is deliberately conservative: a dense decoder transformer, no MoE, no exotic attention, no recurrence. The novelty is in the training data and procedure — not the model. This is methodologically important: it isolates the contribution of the world-modelling stage.

Property | CWM 32B | Notes
Total parameters | 32B (dense) | No MoE; every parameter active per token.
Architecture | Decoder-only transformer | Standard pre-norm; no architectural exotica.
Context length | 131,072 tokens | Long enough to fit a meaningful slice of a real repo plus traces.
Attention | Sliding-window | Local attention bands rather than full O(n²) at full context.
Training data | Code + execution traces | The differentiating ingredient; size is in the same league as comparable code-LLM training mixes.
Tokenizer | Standard BPE-style | Trace markers are dedicated special tokens.

Why sliding-window matters here

Real repos plus their traces are long. Full O(n²) attention at 131k tokens would dominate inference cost. Sliding-window attention bounds each token’s attention span (e.g. the last few thousand tokens), which keeps inference tractable. The trade-off: information from very distant parts of the repo can’t be retrieved by attention — the model has to use the structured trace to carry that state forward, which is exactly what the training is teaching it to do.
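
A minimal sketch of the masking pattern, assuming a window of w tokens (the deck doesn’t restate CWM’s exact window sizes):

import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where attention is allowed: causal, and at most `window`
    # positions back (including the token itself)
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & (j > i - window)

# With an illustrative window of 8,192 at 131k context, each token
# attends to at most 8,192 keys instead of up to 131,072 — cost grows
# linearly with sequence length rather than quadratically.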

An understated point

The architectural conservatism is the methodological win. It means the SWE-bench gains can’t be attributed to clever attention tricks or routing networks — they have to be coming from the training data and procedure. That gives the result much more generalisability as evidence for the world-model thesis than a co-designed architecture would.

04

Four-Stage Training — Where the World Model Comes In

The training pipeline has four stages. The second — mid-training on world-modelling data — is the new one. Without it, you have a competent but unremarkable 32B code model. With it, you get the SWE-bench numbers on the next slide.

  1. Pre-training — general code + text; standard next-token loss; trillions of tokens. The foundation.
  2. Mid-training (world modelling) — code + execution traces; learn to predict state evolution. The differentiator.
  3. SFT — curated, SWE-bench-style multi-step debugging tasks. Behaviour shaping.
  4. RL — reward = pass/fail in a verifiable environment; enables test-time scaling. Final polish.

How the trace data is built

  1. Take a piece of code (function, snippet, or whole file). Have an executor instrument it to log every variable assignment, function entry/exit, and side effect.
  2. Run with synthetic or sampled inputs. Record the trace.
  3. Interleave source and trace into a single training sequence with structured tokens delineating the two — see the sketch after this list.
  4. Train the model to predict the trace tokens conditional on the source — so producing a trace becomes a learned skill, not a tool call.
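
A minimal sketch of step 3, assuming the <TRACE> markers from the illustrative record on slide 02 (the real special-token vocabulary isn’t specified here):

def build_training_sequence(source: str, trace_lines: list[str]) -> str:
    # interleave source and trace into one sequence; predicting the
    # trace then becomes ordinary next-token prediction over P(trace | source)
    trace_block = "\n".join(f"  {line}" for line in trace_lines)
    return f"# source\n{source}\n<TRACE>\n{trace_block}\n</TRACE>"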

The cost of generating this data is non-trivial: you need a working sandbox, instrumentation, and curated input distributions. That investment is a large part of why the technique is hard to reproduce without significant engineering.

A subtle methodological point

Mid-training, not pre-training, is where world-modelling lives. That choice means the model is not paying the cost of trace data on every pre-training token — it learns code first the standard way, then specialises into world-modelling once the basics are in place. Empirically this seems to work better than mixing trace data into the pre-training corpus from the start.

05

SWE-Bench Results — Punching Above the Weight Class

SWE-bench is the benchmark closest to what real coding agents do: take a real GitHub issue, produce a patch, see if the tests pass. The CWM numbers in the article are striking because they isolate capability per parameter.

Model | Params | SWE-bench tier | Notes
gpt-oss-20b (mid reasoning) | 20B | Baseline | Standard code-LLM training; smaller but solid.
CWM 32B | 32B | Matches gpt-oss-20b | Same league at modestly larger size; the world-modelling is “earning back” the parameter overhead.
gpt-oss-120b (high reasoning) | 120B | Standard frontier tier | 4× larger than CWM. The hard target.
CWM 32B + test-time scaling | 32B | Exceeds gpt-oss-120b | The headline result — with extended sampling, beats a 4×-larger model.

What “test-time scaling” means here

The CWM at inference can sample multiple candidate patches, check each against the trace it predicts, and pick the one that survives. The execution-trace prediction is itself a verifier: a candidate that produces a trace inconsistent with the test expectations can be filtered before submission. The standard LLM doesn’t have this; it has to ask a tool to run the code, and pays latency for every check. The CWM does much of the verification in-model.
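
A hypothetical sketch of that loop — generate_patch, predict_trace, and consistent_with are assumed interfaces for illustration, not the published CWM API:

def best_of_n_patch(model, issue, expectations, n=8):
    candidates = [model.generate_patch(issue) for _ in range(n)]
    survivors = [
        patch for patch in candidates
        # in-model rollout: no sandbox call, no per-check latency
        if model.predict_trace(patch).consistent_with(expectations)
    ]
    return survivors[0] if survivors else candidates[0]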

The compounding effect

Two things buy capability: (a) the world-model gives sharper per-sample candidates than next-token-only, and (b) the model can reject its own bad candidates by examining the predicted trace. Each effect alone might be small. Together they explain the 4×-parameter advantage.

Where the result is fragile

SWE-bench is one benchmark, on one task family. The result likely generalises to closely-related tasks (debugging, refactoring) and may not carry to distant ones (code review, design). Treat the headline number as a strong signal, not a proof of universal advantage.

The deeper claim worth taking seriously

If world-modelling can save 4× parameters on a code task, the same idea may save parameters on any task with a verifiable simulator. Mathematics has formal proof checkers. Database queries have query engines. Theorem proving has Lean. The CWM result is best read as evidence for a general approach, not just a one-off code optimisation.

06

Interactive: World-Model Rollout Stepper

Step through a small program. The left panel shows what a standard code LLM emits — just the next line of source. The right panel shows what a CWM emits — the next line plus its predicted state delta, and a verdict on whether the prediction matches actual execution. Watch how often the standard panel proceeds with no signal that anything has gone wrong.

What to notice

Press Inject bug. The standard model emits the same fluent code as before — no signal anything is wrong. The CWM’s state prediction diverges from actual execution, the verdict flips to red, and the model has internal evidence the candidate is bad before any test runs. That is the verification signal that lets test-time scaling do real work.
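
The verdict logic the stepper animates, reduced to its core — compare predicted state deltas against actual ones, step by step (plain dicts stand in for the model’s predictions):

def verdict(predicted_states, actual_states):
    # the first divergence between predicted and actual state is the
    # internal evidence a candidate is bad — before any test runs
    for step, (pred, actual) in enumerate(zip(predicted_states, actual_states), 1):
        if pred != actual:
            return f"step {step}: DIVERGED — predicted {pred}, actual {actual}"
    return "all steps match"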

07

Limits — Latency, Training Complexity, Scope

Inference latency

Predicting trace tokens means producing more tokens per task. Even with sliding-window attention, the model generates substantially more text than a same-size code model that emits source only. Latency-sensitive completions (auto-complete in an editor) are a poor fit.

Training complexity

You need a sandbox, instrumentation, an input-sampling distribution, and pipeline plumbing to interleave traces with source. This is significantly more engineering than “feed the model The Stack”. The barrier to reproducing the technique is not weights, it is data.

Scope — right now

The published evaluations are on Python-heavy SWE-bench-style tasks. The technique should generalise — the trace format is language-agnostic in principle — but published evidence beyond Python is thin. Treat the result as Python-strong, other-languages-plausible.

The open generalisation question

If the world-model trick works only when there is a clean executable simulator, then the technique applies to code, mathematics, formal verification, SQL, and a small handful of other domains. If it works for any domain where you can construct a structured trace — including domains where the trace is itself learned — the technique becomes far more general. Raschka explicitly flags this as an open question. The 2026 papers on this will tell us a lot.

A practical reading

For now: CWM is the right tool when you have a code-completion-or-debugging task, you want frontier-tier performance, and you don’t want frontier-tier inference cost. It is not yet the right tool for code review, code explanation, design conversations, or anything where the user wants to read what the model is “thinking” in plain prose.

08

Companion Decks — Where to Go Next

  1. Deck 01 — Linear-Attention Hybrids →
     MiniMax-M1, Qwen3-Next, DeepSeek V3.2, Kimi Linear. Gated DeltaNet, the 3:1 hybrid pattern. KV-cache calculator widget.
  2. Deck 02 — Text Diffusion Models →
     LLaDA, Gemini Diffusion. Iterative denoising, parallel decoding, conditional-dependency limits. Diffusion-vs-autoregressive visualiser.
  3. Deck 04 — Small Recursive Transformers →
     HRM, TRM (7M params). Iterative self-loops for ARC-AGI. The surprising attention-not-required ablation. Recursive trace viewer.
  4. Deck 05 — When to Reach for Non-Transformer →
     A practical decision-tree synthesising all four families. Walk your workload through the tree and see which architecture fits.

09

Things to Try Yourself

20 minutes — play with the rollout stepper

Press Inject bug at different points and watch the CWM verdict flip. Note where the standard panel gives no signal — that is the cost the standard model pays for not knowing what its code does.

1 hour — manually trace 5 buggy commits

From your own commit history, take five “fix bug” commits. Walk through the buggy code and write down the variable state at each line manually. The exercise builds the same intuition the CWM is being trained to internalise.

An afternoon — build a tiny tracer

Write a Python sys.settrace wrapper that produces structured traces in the CWM format from any input function. The amount of plumbing required is the data-engineering barrier the technique faces — doing it yourself makes that barrier viscerally clear.
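
A starting point, using only the standard library — this logs line-level locals and returns, which is a fraction of what a production pipeline instruments (side effects, exceptions, I/O):

import sys

def trace_calls(func, *args, **kwargs):
    """Run func and record a CWM-style execution trace of its frame."""
    trace = [f"call: {func.__name__}(args={args!r}, kwargs={kwargs!r})"]
    step = 0

    def tracer(frame, event, arg):
        nonlocal step
        if frame.f_code is not func.__code__:
            return None  # skip helper frames
        if event == "line":
            # f_locals is the state just before the upcoming line runs
            step += 1
            trace.append(f"step {step}: locals = {dict(frame.f_locals)}")
        elif event == "return":
            trace.append(f"return: {arg!r}")
        return tracer

    sys.settrace(tracer)
    try:
        func(*args, **kwargs)
    finally:
        sys.settrace(None)
    return trace

# e.g. trace_calls(tally, ['a', 'b', 'a']) reproduces the slide-02
# record, modulo step granularity (per line rather than per assignment)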

A weekend — pit CWM vs a comparable LLM

If CWM weights are open, run it on the easiest tier of SWE-bench. Compare to a Llama-3 70B or comparable open code model. Look not just at pass-rate but at the shape of the failures — the CWM should fail on tasks where the trace itself was misleading; the standard LLM should fail more uniformly.

Read the source

The CWM technical report is the primary source — pay attention to the data-construction section, which is the part that’s hardest to reproduce. Read the SWE-bench paper for context on what the benchmark actually tests.

Watch for —

A non-code domain getting the same treatment. Mathematics with Lean traces. SQL with query-engine traces. Whichever domain wins next will tell us whether CWM was a code-specific result or the start of a method.

A diagnostic for your team

Next time someone says “our code agent needs a bigger model”, ask: does it have a way to know whether its candidate is correct before we run the tests? If the answer is no, the problem may be a missing world-model, not a missing 100B parameters — and CWM is the proof-point that those are not the same lever.