Beyond Standard LLMs — Companion to Sebastian Raschka — Presentation 04

Small Recursive Transformers

7 million parameters that beat 70 billion at the task they care about. HRM and TRM solve ARC-AGI by iterating an answer rather than emitting it. The most counter-intuitive result in the article: attention may not be required.

(Diagram: input grid → latent z₀ → refine answer y → update latent z → halt? → final y)


00

Topics We’ll Cover

Where this deck sits in the series

The first three decks were about scaling-with-twists. This one is the opposite bet: scaling down. A 7M-parameter recursive model beats a 70B LLM on a benchmark designed to measure reasoning. Whether that’s a one-off (puzzle-shaped tasks favour iterative solvers) or a general principle (small specialised reasoners as components in larger systems) is the open question.

01

A Reasonable Question — What If Depth Is Just Iteration?

A standard transformer is a fixed stack: 24, 80, 100 layers deep. Each forward pass runs through every layer once, and the answer is whatever falls out the bottom. If the task needs more reasoning than your stack can do in one pass, the standard answer has been: train a bigger stack. Or: let the model emit chain-of-thought tokens so it can re-enter itself textually.

Recursive transformers ask a different question. What if a small model ran itself many times in a learned loop? The depth-equivalent comes from iteration count, not from layer count. A 4-block model run 16 times has, in some sense, 64 layers of computation — but with 16× fewer parameters and a learned halting behaviour that decides when to stop.

Standard transformer

  • Fixed 24–100 layers, run once per forward pass.
  • Width and depth are training-time choices, frozen at deploy.
  • Computation per token is a constant, regardless of difficulty.
  • Reasoning depth is bounded by physical layer count.

Recursive transformer

  • Small core (e.g. 2–4 layers), run K times in a loop.
  • K is variable per input; the model learns when to halt.
  • Easy inputs use few iterations; hard ones use many.
  • Total compute is adaptive; parameter count is tiny.
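The contrast above can be made concrete with a toy sketch (illustrative only, not the paper's code): one small weight-tied block applied in a loop, with a crude stand-in for a learned halt signal. Depth comes from the iteration count, not from stacking distinct layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# One shared block, reused every iteration. A real recursive transformer
# would use 2-4 transformer layers here; a Linear + tanh stands in.
W = rng.normal(0, 0.05, (16, 16))

def core(state, x):
    # The same parameters are applied at every step: a 1-block core run
    # 16 times gives 16 "layers" of computation with 1 block of weights.
    return np.tanh(state @ W + x)

def recursive_forward(x, max_iters=16, tol=1e-3):
    state = np.zeros(16)
    for k in range(1, max_iters + 1):
        new_state = core(state, x)
        # Stand-in for a learned halt head: stop once the state stabilises,
        # so easy inputs use few iterations and hard ones use the budget.
        if np.linalg.norm(new_state - state) < tol:
            return new_state, k
        state = new_state
    return state, max_iters

out, iters = recursive_forward(rng.normal(0, 1, 16))
print("iterations used:", iters)
```

The key property is that `max_iters` is an inference-time knob, not a training-time architecture choice.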
A historical echo

This is conceptually the “adaptive computation time” idea from 2016 (Graves), and the “universal transformer” from 2018 — but with a 2025 training recipe and a domain (ARC-AGI) where the iterative-refinement bias matches the problem structure. The architectural ideas are old; the result is new.

02

HRM — Hierarchical Reasoning Model

HRM is the model that re-validated this approach in 2025. Two small transformer modules (each 4 blocks deep), arranged hierarchically: an outer module that updates a high-level latent state, an inner module that operates on the answer conditioned on that state. The outer module is the “planner”; the inner is the “solver”.

(Diagram: high-level module, 4 blocks, updates latent z, the “planner”; low-level module, 4 blocks, updates answer y, the “solver”. The latent z carries the plan across iterations; the answer y is the grid being filled in, and it feeds back into the next iteration.)

The headline result

HRM topped the ARC-AGI leaderboard at the time of release — the benchmark explicitly designed to test out-of-distribution reasoning that LLMs struggle with. Two small modules iterating against a benchmark designed to break large frozen models. The architectural choice was the lever, not parameter count.

A useful comparison

Frontier LLMs running on ARC-AGI typically score in the single digits to low tens. HRM, with orders of magnitude fewer parameters, scored high enough to start a small industry of follow-up work. The next slide is the most important entry in that follow-up work.

03

TRM — Tiny Recursive Model, 7M Parameters

TRM is the simplification that ate HRM. The hierarchical structure turned out not to be necessary: a single 2-layer transformer, run recursively with the same alternating-update pattern, beats HRM on ARC at a quarter of the parameter count. Seven million parameters — about a hundredth of a percent of a 70B LLM — and a $500 training run.

HRM vs TRM at a glance:

  • Parameters: HRM ~27M; TRM ~7M (4× smaller)
  • Architecture: HRM two 4-block modules (hierarchical); TRM a single 2-layer transformer
  • Attention: HRM standard self-attention; TRM bidirectional self-attention, or no attention at all (see slide 05)
  • Update rule: HRM alternating high-/low-level updates; TRM alternating refinement of latent z and answer y
  • Backprop: HRM through the final few iterations; TRM through all recursive steps
  • Halting: HRM heuristic; TRM binary cross-entropy on a halt head
  • Training cost: HRM not published as a headline figure; TRM ~$500 on 4×H100, ~2 days
  • ARC-AGI score: HRM top at release; TRM beats HRM

The two design choices that mattered

  1. Backprop through all recursive steps. HRM only differentiated through a few of its iterations to keep memory tractable. TRM differentiates through every step, accepting the memory cost in exchange for cleaner gradients. The improvement is large enough to absorb the simplification of the architecture.
  2. Learned halting via BCE loss. A binary head decides per-input whether to run another iteration. Easy puzzles halt early (saving compute); hard ones get more iterations (spending compute where it matters). The model is doing test-time compute scaling adaptively.
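Both design choices show up directly in a training step. The sketch below (illustrative names and shapes, not the paper's code; the halt supervision here is a simplified stand-in) unrolls a shared core for every step so gradients reach all of them, and adds a BCE loss on a halt head.

```python
import torch
import torch.nn as nn

class TinyRecursive(nn.Module):
    """Toy TRM-style model: one shared core, fully unrolled."""
    def __init__(self, d=32, k_max=8):
        super().__init__()
        self.k_max = k_max
        self.core_z = nn.Linear(3 * d, d)   # updates latent z from (x, y, z)
        self.core_y = nn.Linear(3 * d, d)   # updates answer y from (x, y, z)
        self.halt = nn.Linear(d, 1)         # per-step halt logit, read off z

    def forward(self, x):
        y = torch.zeros_like(x)
        z = torch.zeros_like(x)
        halt_logits = []
        # Full unroll with no detach between steps: backprop reaches
        # every iteration, at the cost of storing all activations.
        for _ in range(self.k_max):
            z = torch.tanh(self.core_z(torch.cat([x, y, z], -1)))
            y = torch.tanh(self.core_y(torch.cat([x, y, z], -1)))
            halt_logits.append(self.halt(z))
        return y, torch.cat(halt_logits, -1)

model = TinyRecursive()
x = torch.randn(4, 32)
target = torch.randn(4, 32)
y, halt_logits = model(x)

# Answer loss plus BCE on the halt head. Here the halt target just says
# "halt at the last step"; the real recipe supervises halting differently.
halt_target = torch.zeros_like(halt_logits)
halt_target[:, -1] = 1.0
loss = nn.functional.mse_loss(y, target) \
     + nn.functional.binary_cross_entropy_with_logits(halt_logits, halt_target)
loss.backward()   # gradients flow through all unrolled steps
```

At inference, the halt logits are thresholded per input, which is where the adaptive test-time compute comes from.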
Raschka’s framing

“TRM uses a single 2-layer transformer with bidirectional attention. Two alternating updates: (1) compute latent reasoning state, (2) update answer. Backpropagates through all recursive steps. Binary cross-entropy loss determines iteration halting.” That is the entire architecture. If a description that short produces the result on the next slide, something interesting is going on in the loop, not in the layers.

04

How Recursion Replaces Depth

The mechanism is conceptually simple. The model maintains two state vectors: an answer y (the grid being filled in) and a latent z (the reasoning state). On each iteration, both get updated, conditioned on each other and on the input.

Update step

  1. Update z using (input, current y, current z). The latent absorbs new evidence about the partial answer.
  2. Update y using (input, current y, new z). The answer is refined given the updated reasoning state.
  3. Run halt head on z. If above threshold, return y. Otherwise loop.
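The three steps above map directly onto a short loop. A minimal sketch with frozen random weights standing in for the trained update functions (all names and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
# Illustrative stand-ins for the trained update functions and halt head.
Wz = rng.normal(0, 0.2, (3 * d, d))
Wy = rng.normal(0, 0.2, (3 * d, d))
wh = rng.normal(0, 0.2, d)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def solve(x, max_iters=16, threshold=0.9):
    y = np.zeros(d)   # answer being refined
    z = np.zeros(d)   # latent reasoning state
    for k in range(1, max_iters + 1):
        # 1. Update z from (input, current y, current z).
        z = np.tanh(np.concatenate([x, y, z]) @ Wz)
        # 2. Update y from (input, current y, new z).
        y = np.tanh(np.concatenate([x, y, z]) @ Wy)
        # 3. Halt head on z: return y once confident enough.
        if sigmoid(z @ wh) > threshold:
            return y, k
    return y, max_iters

answer, iters = solve(rng.normal(0, 1, d))
```

Note the ordering: y is refined against the already-updated z, so each iteration is one full plan-then-act cycle.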

What this resembles

An iterative solver doing belief propagation: each step accumulates evidence, refines a candidate, and decides whether to do more work. The model is not just “deeper by repetition” — the recursive structure aligns the architecture with the iterative nature of the puzzles being solved.

Why grid puzzles are the right target

A general lesson

Recursion-as-depth is most powerful when the task itself has an iterative structure that classical algorithms would also exploit. Recursive transformers don’t replace next-token prediction for general text — they replace fixed-depth attention for tasks that look like constraint satisfaction. Match the architectural inductive bias to the problem.

05

The Attention-Not-Required Ablation

The most shocking single finding in the article: “Attention is not required... Replacing self-attention with a pure MLP layer also improved accuracy (74.7% to 87.4%)”. On the recursive transformer’s grid-puzzle benchmark, removing attention entirely — replacing it with a feedforward MLP — not only worked but worked better.

Why this is so striking

For seven years the load-bearing claim of the field has been “attention is the mechanism that makes transformers work”. On this task, on this architecture, in this scale regime, that claim is empirically false. Whatever is doing the work, it isn’t the attention mechanism per se.

What is plausibly doing the work

The recursion — iterating any nonlinear transformation on a structured state — appears to be the load-bearing primitive. Attention was a useful enough mixing operation that nobody noticed it could be replaced with an MLP at this scale. On a fixed grid where positional structure is given by the input itself, the all-pairs token mixing attention provides may be redundant.
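The ablation is easy to picture in code: the per-step mixing operation is a plug-in module inside the recursive loop, so self-attention can be swapped for a token-mixing MLP over the fixed grid. This sketch is illustrative (not the paper's code) and only shows that both variants are drop-in compatible:

```python
import torch
import torch.nn as nn

def make_mixer(kind, n_tokens=16, d=32):
    if kind == "attention":
        return nn.MultiheadAttention(d, num_heads=4, batch_first=True)
    # On a fixed-size grid, an MLP applied across the token axis can mix
    # positions without computing any pairwise attention scores.
    return nn.Sequential(nn.Linear(n_tokens, n_tokens), nn.GELU(),
                         nn.Linear(n_tokens, n_tokens))

def step(mixer, kind, h):
    if kind == "attention":
        out, _ = mixer(h, h, h)
        return out
    # Transpose so the Linear layers act on the token dimension.
    return mixer(h.transpose(1, 2)).transpose(1, 2)

h = torch.randn(2, 16, 32)          # (batch, grid cells, width)
for kind in ("attention", "mlp"):
    mixer = make_mixer(kind)
    out = h
    for _ in range(8):              # identical recursive loop either way
        out = out + step(mixer, kind, out)
    print(kind, out.shape)
```

The MLP mixer only works because the grid size is fixed; variable-length text is exactly where this substitution stops being available.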

How seriously to take this

Read this one carefully

This is the kind of result that gets repeated badly. Do not let yourself or anyone else summarise it as “attention isn’t actually needed”. The right summary is: at small scale, on tasks with iterative structure, the recursive loop carries the capability and the per-step mixing operation can be much simpler than full attention. That is a much more interesting and less reckless claim.

06

Interactive: Recursive Trace Viewer

A 4×4 puzzle. Three panels: the input (with given cells), the current best answer (refined per iteration), and the target. The model starts with a wrong answer; each iteration refines it. Watch the latent state (bars below) evolve and the answer converge. The halt head fires when the latent stabilises.

What to notice

The answer doesn’t fill in a fixed order — the model attacks the most-constrained cells first, then the rest fall out. The latent bars start jagged (high-uncertainty reasoning state) and smooth out as the answer converges. The halt probability climbs as the latent stabilises; once it crosses 0.9, the model returns. Easy puzzles halt in 4–6 iterations; hard ones take 12–16. Press New puzzle to see the cadence change.

07

Scope — Grid Puzzles Now, Components Later

The honest scope is narrow. HRM and TRM solve grid-shaped puzzles: Sudoku, mazes, ARC. The input and output are fixed-size grids. The technique does not yet apply to general text, to long-form generation, or to reasoning tasks with variable-length outputs.

What it’s good at today

Constraint-satisfaction puzzles with structured input/output. Tasks where iterative refinement matches the problem. Anywhere a classical solver (Z3, A*, belief propagation) would also be a reasonable choice.

What it’s not

A general-purpose LLM. A chat assistant. A code generator. A document summariser. Treating it as such will produce disappointment. The article is explicit about this.

What it might become

A tool for an agent. The same way an LLM uses a calculator API, a future agent could use a recursive-puzzle module to handle constraint problems — faster, cheaper, and more reliable than asking the main LLM.

Raschka’s speculative reading

The article’s closing speculation about recursive transformers is worth quoting: “tiny reasoning models as ‘tools’ for tool-calling LLMs, similar to how systems use Python/calculator APIs”. The interesting future is composition. A frontier model that knows when to spawn a tiny specialised reasoner is more capable, in aggregate, than either alone — and substantially cheaper to deploy.

A 2026 question to track

What other narrow domains can be packaged as a 7M-parameter recursive solver? Theorem checking? Game playing? Constraint solving for scheduling, layout, query optimisation? The TRM result strongly suggests the answer is “more than we’ve looked for”, but the sociological barriers (papers want frontier-tier results, not 7M-parameter components) are real.

08

Companion Decks — Where to Go Next

  • Deck 01 · Linear-Attention Hybrids → MiniMax-M1, Qwen3-Next, DeepSeek V3.2, Kimi Linear. Gated DeltaNet, the 3:1 hybrid pattern. KV-cache calculator widget.
  • Deck 02 · Text Diffusion Models → LLaDA, Gemini Diffusion. Iterative denoising, parallel decoding, conditional-dependency limits. Diffusion-vs-autoregressive visualiser.
  • Deck 03 · Code World Models → CWM 32B. Mid-training that makes the model predict program state evolution. SWE-bench parity at 4× smaller. World-model rollout stepper.
  • Deck 05 · When to Reach for Non-Transformer → A practical decision tree synthesising all four families. Walk your workload through the tree and see which architecture fits.
09

Things to Try Yourself

15 minutes — play with the trace viewer

Try multiple puzzles. Notice how the iteration count and halting threshold differ between easy and hard ones. The model running fewer iterations on easy puzzles is the “adaptive compute” behaviour visible without instrumentation.

1 hour — read the TRM repo

The code is <1000 lines. Find the recursive update rule, the halt head, and the BCE loss. The whole thing is shorter than most production LLM training scripts.

An afternoon — train a TRM-clone on Sudoku

Sudoku is a small enough domain to fit on a laptop. Implement the alternating-update rule, the latent + answer buffer, the halt head. Watch it converge. The training loop is unusual because gradients flow through the full unrolled recursion.

A weekend — replace attention with an MLP

Reproduce the article’s ablation. On your own toy version, sweep: attention vs MLP, iteration count, model width. Confirm that the recursion, not the mixing operation, is doing the work. The result is one of the most informative experiments you can run on a transformer.

Read the source

Start with the TRM paper as the cleanest exposition. Then HRM for context on what was being simplified. Then the original 2018 Universal Transformer paper to see the architectural lineage.

Watch for —

The first credible “recursive solver as a tool inside a chat agent” deployment. Whichever team ships that first — LLM hands off a constraint problem to a 7M solver, gets the answer back, continues — will have demonstrated the composition story. None has, as of early 2026.

A diagnostic for your team

Next time someone proposes “use the bigger LLM” for a task that looks like constraint solving, ask: could a 7M-parameter recursive model do this faster, cheaper and better? The answer is sometimes yes — and recognising it is the value the recursive-transformer line of work has added to the toolbox, regardless of whether the technique becomes general-purpose.