Beyond Standard LLMs — Companion to Sebastian Raschka — Presentation 05

When to Reach for a Non-Transformer

A practical decision-tree across all four architecture families. The standard transformer is still the right default — this deck is about when, specifically, the cost of leaving it pays for itself.


Read the original article →

00

Topics We’ll Cover

Where this deck sits in the series

The first four decks introduced each architecture family in detail. This one is the synthesis. The point is not to argue that standard transformers are obsolete — they are emphatically not — but to give you a structured way to recognise the workloads where one of the alternatives genuinely earns its complexity.

01

The Synthesis Question — Which Architecture, When?

The four architecture families covered in the article each pick a different point on the same trade-off surface. None of them is “better” than the standard decoder; each is better at one specific thing, at the cost of being worse at the others. That makes the choice load-bearing, and worth the deliberation.

What standard transformers optimise for

  • Generality: one model can do summarisation, code, chat, reasoning, tool use.
  • Maturity: pre-training, RLHF, serving, evaluation are all well-understood.
  • Composability: every framework, every harness, every dataset assumes this architecture.
  • Fluent prose at moderate context.

What they don’t optimise for

  • Long-context efficiency — the n² tax is real and hurts at 100k+.
  • On-device latency for short outputs — serial decoding is slow at small batch sizes.
  • Verifiable code reasoning — the model has no built-in execution intuition.
  • Constraint-satisfaction puzzles — the iterative-refinement bias is not native.

Each non-transformer family covered in this series targets exactly one of those weaknesses. The deck’s job is to help you tell which weakness is actually blocking your workload, and which is just an architectural curiosity.
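
To make the first of those weaknesses concrete, here is a back-of-envelope KV-cache calculation, as a minimal Python sketch. The 70B-class hyperparameters are illustrative assumptions, not figures from the article:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Two tensors (K and V) per layer, per token, in fp16 (2 bytes each).
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed 70B-class shape: 80 layers, 8 KV heads (GQA), head_dim 128.
for ctx in (4_000, 32_000, 100_000):
    gib = kv_cache_bytes(80, 8, 128, ctx) / 2**30
    print(f"{ctx:>7,} tokens -> {gib:5.1f} GiB per sequence")
```

Under these assumptions, 100k tokens of context costs roughly 30 GiB of cache per concurrent sequence before a single weight is loaded. That is the shape of problem the linear-attention hybrids in this series exist to fix.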

The simplest version of the rule

If you can’t tell whether your workload is hitting one of the four named weaknesses — stay on standard transformers. The cost of switching architectures is large, and the savings only materialise when the original architecture was specifically what was hurting you.

02

All Four Families on One Page

| Architecture | What it’s for | Trade-off | Maturity | Pick when |
|---|---|---|---|---|
| Standard transformer | Everything — the SOTA generalist. | Quadratic attention, single-pass cost. | Production-mature. Default. | Anything that doesn’t match a row below. |
| Linear-Attention Hybrid | Long context with bounded KV cache. | Slight accuracy loss on exact-recall and reasoning. | MiniMax-M2 reverted; open weights from 2–4 frontier labs. | Workload routinely sees >32k tokens AND throughput / cost matters. |
| Text Diffusion | Parallel decoding for short outputs. | No streaming. Conditional-dependency failures. No tool use yet. | One open release (LLaDA), one production hint (Gemini Diffusion). | Short, single-turn outputs on latency-sensitive on-device deployments. |
| Code World Model | Code tasks needing execution intuition. | Higher inference latency. Training pipeline is hard to reproduce. | One major release (CWM 32B). | Code debugging / refactor / SWE-bench-style work where you want frontier code quality at non-frontier cost. |
| Small Recursive Transformer | Constraint-satisfaction puzzles. | Tiny scope: grid puzzles only, today. Not a general LM. | Research-grade (HRM, TRM). | You have a structured puzzle that classical solvers handle but you want a learned solver. |


03

The Cost-vs-Capability Frontier

It is useful to picture all five families on one chart. The x-axis is “capability” (broadly construed: SOTA-tier reasoning at one end, narrow specialist at the other). The y-axis is “cost per useful answer”. Each architecture sits at a different point.

[Chart: the five families plotted on capability breadth (narrow → general) vs. cost per answer (cheap → expensive). Standard transformer (SOTA generalist), linear hybrid (long-context efficient), code world model (code, with state), text diffusion (short on-device), recursive/TRM (grid puzzles).]


A practical reading

If the standard transformer is overkill for what you’re doing, you don’t need to replace it — you need to shrink the part of your workload it’s overkill for. A frontier model that delegates appropriate tasks to a 7M-parameter recursive solver is cheaper and more capable than either alone. That is the composition story on slide 7.

04

Why Standard Transformers Are Still the Default

Before we walk the decision tree, a deliberate caution: most workloads should still run on a standard decoder. The four alternatives are real, but the cost of switching is rarely paid back unless the workload has a specific shape.

Sociological reasons

  • The ecosystem assumes the standard architecture: vLLM, sglang, Triton kernels, RLHF tooling, eval frameworks, mature serving stacks.
  • Engineering mind-share is calibrated on the standard model, so hiring and onboarding are faster.
  • Your customer is probably using a standard-LLM-flavoured mental model.

Technical reasons

  • RLHF / DPO / GRPO recipes are tuned for standard decoders.
  • Tool-calling protocols (OpenAI / Anthropic / Google) assume sequential token generation.
  • Prompt-cache, KV-cache offload, speculative decoding — all the cost-cutting optimisations are mature on standard architectures and not yet on the others.
  • Standard models have a head start of a year or more on every benchmark you actually care about.


Be especially wary if

Your motivation is “the new thing is more interesting”. Switching architecture is not a tech-debt write-off — it is debt accrual. The decision tree on the next slide is structured to make you justify the switch, not to encourage it.

05

The Decision Tree — Five Questions

Five questions in order. The first one whose answer is “yes” selects an architecture; if you reach the end with no “yes”, you stay on a standard transformer. The widget on the next slide implements the same flow interactively.

Q1. Constraint-satisfaction puzzle?

Is your task a structured puzzle (Sudoku, ARC, scheduling, layout, planning over a fixed grid) where the input and output are bounded and the problem has the iterative-refinement shape?

Yes → Small Recursive Transformer. 7M parameters, $500 to train, beats LLMs on this task family. Not a chat model.

Q2. Code execution intuition needed?

Is your workload code that requires the model to reason about what the code does at runtime — debugging, multi-step refactoring, fixing test failures, SWE-bench-style?

Yes → Code World Model. CWM 32B matches gpt-oss-120b at 4× smaller. The world-modelling mid-training is what buys this.

Q3. Routinely >32k context AND cost-sensitive?

Do most of your inference calls process >32k tokens, AND is GPU memory or per-token cost a binding constraint? Both must be true.

Yes → Linear-Attention Hybrid. Qwen3-Next, Kimi Linear. 75% KV-cache reduction, up to 6× decoding throughput. Watch for accuracy regressions on multi-turn reasoning.
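
The 75% figure falls straight out of the 3:1 hybrid layer pattern covered in deck 01: if only one layer in four keeps full attention and its growing KV cache, while the other three carry constant-size linear-attention state, roughly three quarters of the cache disappears. A sketch of that arithmetic, treating the linear layers' fixed state as negligible (an assumption):

```python
def hybrid_kv_fraction(linear_per_full=3):
    # One full-attention layer per (1 + linear_per_full) layers keeps a KV cache.
    return 1 / (1 + linear_per_full)

retained = hybrid_kv_fraction()            # 0.25
print(f"reduction: {1 - retained:.0%}")    # 75%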

Q4. Short single-turn output, on-device, latency-bound?

Is the workload short outputs (a few hundred tokens) on a phone or low-power device, where time-to-final-answer dominates, with no tool use and no streaming UX?

Yes → Text Diffusion. LLaDA-class. Replaces distilled small models. Don’t use for chat or reasoning.

Q5 — the catch-all

If you reached Q5, the answer is standard transformer, possibly with one of: extended thinking / reasoning mode, prompt caching, RAG, agent-harness scaffolding. Most workloads land here. The four alternatives are not the future — they are a few specific futures.
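
The whole flow fits in a dozen lines. A minimal Python sketch, with hypothetical attribute names standing in for the honest answers the questions above ask for:

```python
from dataclasses import dataclass

@dataclass
class Workload:  # hypothetical descriptor; fill in honestly for your task
    is_constraint_puzzle: bool = False
    needs_execution_intuition: bool = False
    routine_ctx_over_32k: bool = False
    cost_bound: bool = False
    short_on_device_latency_bound: bool = False
    needs_tools_or_streaming: bool = True

def pick_architecture(w: Workload) -> str:
    """Walk the five questions in order; the first 'yes' selects."""
    if w.is_constraint_puzzle:                                   # Q1
        return "small recursive transformer"
    if w.needs_execution_intuition:                              # Q2
        return "code world model"
    if w.routine_ctx_over_32k and w.cost_bound:                  # Q3: both must be true
        return "linear-attention hybrid"
    if w.short_on_device_latency_bound and not w.needs_tools_or_streaming:  # Q4
        return "text diffusion"
    return "standard transformer"                                # Q5: the catch-all

print(pick_architecture(Workload(routine_ctx_over_32k=True, cost_bound=True)))
# -> linear-attention hybrid
```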

06

Interactive: Walk Your Workload Through the Tree

Answer the questions about your actual workload — not in the abstract. The tree only gives a useful recommendation when the inputs are honest about what your task is and isn’t.

A useful self-check

If the recommendation surprises you, the question to ask is “which question did I get wrong?” Most of the wrong-answer regret comes from over-claiming on Q3 (“long context” means the routine bulk of traffic, not edge cases) or Q1 (a chat task with structure isn’t a constraint-satisfaction puzzle).

07

Composition — Using These Together

The decision tree picks one architecture per workload. The richer story is composition: a system that uses several at once, each on the slice of the problem it’s best at.

Frontier LLM as orchestrator

A standard transformer takes the user’s request, decomposes it, decides what kind of subtask each piece is, and routes to the right specialist.

Specialist as tool

A 7M recursive solver is exposed as a tool. The orchestrator detects a constraint problem in the conversation, packages it as a grid, calls the tool, gets the answer.

Architecture per subtask

Long-document analysis: linear hybrid. Code edit: code world model. Final natural-language reply: standard. The user sees one assistant.
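
A hedged sketch of what the dispatch layer might look like. Every name here is illustrative scaffolding invented for this slide, not an API from Raschka's article or any real framework:

```python
# Illustrative routing table: subtask label -> specialist backend.
BACKENDS = {
    "grid_puzzle": "recursive-solver-7m",      # exposed to the orchestrator as a tool
    "code_edit":   "code-world-model-32b",
    "long_doc":    "linear-attention-hybrid",
}

def route(subtask_kind: str) -> str:
    # Anything the table doesn't claim stays on the standard transformer.
    return BACKENDS.get(subtask_kind, "standard-transformer")

# The orchestrator would produce these labels itself when decomposing a request;
# they are hard-coded here to show the dispatch shape.
for kind in ("long_doc", "code_edit", "grid_puzzle", "final_reply"):
    print(f"{kind:11} -> {route(kind)}")
```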

Raschka’s closing speculation

The article ends on this exact theme: tiny reasoning models as “tools” for tool-calling LLMs, similar to how systems use Python or calculator APIs. The four families covered in this series are not in competition with each other; they are likely to be deployed together. The interesting question is not which one wins; it is which orchestrator framework will package them together first.

A 2026 prediction worth holding lightly

The reference architecture for serious agentic systems by end of 2026 will probably look like this: a frontier transformer plus 3–5 small specialised solvers behind a tool interface. None of the individual components is novel; the orchestration is the engineering. Whoever builds the cleanest abstraction for “route this subtask to the right architecture” will own the next platform layer.

08

Companion Decks — Where to Go Next

This is the final deck of the five-deck series. To go deeper on any of the four architecture families, jump to the corresponding deck:

  1. Deck 01: Linear-Attention Hybrids →
     MiniMax-M1, Qwen3-Next, DeepSeek V3.2, Kimi Linear. Gated DeltaNet, the 3:1 hybrid pattern, and why MiniMax-M2 reverted. KV-cache calculator widget.
  2. Deck 02: Text Diffusion Models →
     LLaDA, Gemini Diffusion. Iterative denoising, parallel decoding, conditional-dependency limits. Diffusion-vs-autoregressive visualiser.
  3. Deck 03: Code World Models →
     CWM 32B. World-modelling mid-training. SWE-bench parity at 4× smaller. World-model rollout stepper.
  4. Deck 04: Small Recursive Transformers →
     HRM, TRM (7M params). Iterative self-loops for ARC-AGI. The surprising attention-not-required ablation. Recursive trace viewer.
Where this fits in the broader index

This series is part of the Modern Architectures sub-hub — itself a leaf of the LLMs hub covering transformer internals, training, hardware, retrieval, agents, evaluation and safety. This deck closes the Raschka companion; the rest of the modern-architectures sub-hub goes broader (MoE, state-space models, long-context techniques, hybrids).

09

Things to Try Yourself

10 minutes — walk one workload through the tree

Pick the most cost- or capability-binding production workload you have. Walk it through the widget. If the result is “standard transformer”, that is also a useful answer — it tells you the migrating-architecture conversation isn’t the highest-leverage one for that workload.

1 hour — do it for three workloads

One workload sometimes hides another. The customer’s “chat assistant” is often three workloads stitched together: short Q&A, long-document RAG, and occasional code edits. Each of those may pick a different architecture. The richest answers come from decomposing.

An afternoon — cost the switch

For each “not standard” recommendation, write down: training-data engineering, serving stack changes, eval-harness retooling, customer-visible regressions. The number is usually larger than expected. That is the bar the saving needs to clear.

A weekend — build a routing prototype

Wire a standard LLM with one specialist tool from this series — whichever the tree pointed you at. The discomfort of designing the “when should the orchestrator delegate?” logic teaches you more about composition than any article.
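
One concrete starting point for that delegation logic: a cheap gate the harness runs before committing to the tool call. The predicate below is our own illustrative heuristic, not something from the article, and a real system would likely let the orchestrator model make this judgment itself:

```python
def should_delegate_to_grid_solver(subtask: dict) -> bool:
    # Only hand off when the subtask has the bounded, grid-shaped form
    # the tiny recursive solver was actually trained on (illustrative check).
    return (
        subtask.get("has_bounded_grid", False)
        and subtask.get("is_constraint_problem", False)
        and not subtask.get("needs_prose_output", True)
    )

print(should_delegate_to_grid_solver({
    "has_bounded_grid": True,
    "is_constraint_problem": True,
    "needs_prose_output": False,
}))  # True -> package the grid and call the tool
```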

Read the source

Re-read Raschka’s essay with the four decks open in adjacent tabs. The article is shorter than the decks combined for a reason: it is dense, and you may want to revisit a section after seeing the technical detail.

What to watch for

The first agent framework that exposes “route to architecture X” as a first-class concept. LangGraph, the OpenAI Agents SDK, MCP-flavoured stacks — whichever lands the abstraction first will set the convention for years.

A diagnostic for your team, in two questions

What does our standard transformer do badly enough that switching architectures pays for itself? And could we keep it as the orchestrator while a specialist handles the bad parts? If both answers exist, you have a project. If neither does, you have a reasonable status quo — and that is also a valid finding.