A practical decision-tree across all four architecture families. The standard transformer is still the right default — this deck is about when, specifically, the cost of leaving it pays for itself.
The first four decks introduced each architecture family in detail. This one is the synthesis. The point is not to argue that standard transformers are obsolete — they are emphatically not — but to give you a structured way to recognise the workloads where one of the alternatives genuinely earns its complexity.
The four architecture families covered in the article each pick a different point on the same trade-off surface. None of them is “better” than the standard decoder; each is better at one specific thing, at the cost of being worse at the others. That makes the choice load-bearing, and worth the deliberation.
Each non-transformer family covered in this series targets exactly one weakness of the standard transformer: quadratic attention cost, serial token-by-token decoding, lack of grounding in runtime behaviour, or brute-force handling of constraint-satisfaction search. The deck’s job is to help you tell which weakness is actually blocking your workload, and which is just an architectural curiosity.
If you can’t tell whether your workload is hitting one of the four named weaknesses — stay on standard transformers. The cost of switching architectures is large, and the savings only materialise when the original architecture was specifically what was hurting you.
| Architecture | What it’s for | Trade-off | Maturity | Pick when |
|---|---|---|---|---|
| Standard transformer | Everything — the SOTA generalist. | Quadratic attention, single-pass cost. | Production-mature. | Default. Anything that doesn’t match a row below. |
| Linear-Attention Hybrid | Long context with bounded KV cache. | Slight accuracy loss on exact-recall and reasoning. MiniMax-M2 reverted. | Open weights from 2–4 frontier labs. | Workload routinely sees >32k tokens AND throughput / cost matters. |
| Text Diffusion | Parallel decoding for short outputs. | No streaming. Conditional-dependency failures. No tool use yet. | One open release (LLaDA), one production hint (Gemini Diffusion). | Short, single-turn outputs on latency-sensitive on-device deployments. |
| Code World Model | Code tasks needing execution intuition. | Higher inference latency. Training pipeline is hard to reproduce. | One major release (CWM 32B). | Code debugging / refactor / SWE-bench-style. You want frontier code quality at non-frontier cost. |
| Small Recursive Transformer | Constraint-satisfaction puzzles. | Tiny scope: grid puzzles only, today. Not a general LM. | Research-grade (HRM, TRM). | You have a structured puzzle that classical solvers handle but you want a learned solver. |
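To make the KV-cache column concrete, here is a back-of-envelope sizing sketch. The layer counts, head counts, and the 1-in-4 full-attention ratio are illustrative assumptions, not the published configuration of any model in the table:

```python
# Back-of-envelope KV-cache sizing. All model shapes below are
# invented for illustration, not real model configs.

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_param=2):
    """Size of the K and V caches for one sequence (fp16/bf16 by default)."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_param

# A hypothetical 32-layer dense model with 8 KV heads of dim 128, at 128k context:
full = kv_cache_bytes(tokens=128_000, layers=32, kv_heads=8, head_dim=128)

# A hybrid that swaps 3 of every 4 layers for linear attention keeps a
# KV cache for only the remaining quarter -- the "75% reduction" row.
hybrid = kv_cache_bytes(tokens=128_000, layers=8, kv_heads=8, head_dim=128)

print(f"full:   {full / 1e9:.1f} GB")   # → full:   16.8 GB
print(f"hybrid: {hybrid / 1e9:.1f} GB") # → hybrid: 4.2 GB
```

The point of the arithmetic: the full-attention cache grows linearly with context length, so at long contexts the 4× reduction translates directly into batch size and per-token cost.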
It is useful to picture all five families on one chart. The x-axis is “capability” (broadly construed: SOTA-tier reasoning at one end, narrow specialist at the other). The y-axis is “cost per useful answer”. Each architecture sits at a different point.
If the standard transformer is overkill for what you’re doing, you don’t need to replace it — you need to shrink the part of your workload it’s overkill for. A frontier model that delegates appropriate tasks to a 7M-parameter recursive solver is cheaper and more capable than either alone. That is the composition story on slide 7.
Before we walk the decision tree, a deliberate caution: most workloads should still run on a standard decoder. The four alternatives are real, but the cost of switching is rarely paid back unless the workload has a specific shape.
Be suspicious if your motivation is “the new thing is more interesting”. Switching architecture is not a tech-debt write-off — it is debt accrual. The decision tree on the next slide is structured to make you justify the switch, not to encourage it.
Five questions in order. The first one whose answer is “yes” selects an architecture; if you reach the end with no “yes”, you stay on a standard transformer. The widget on the next slide implements the same flow interactively.
Is your task a structured puzzle (Sudoku, ARC, scheduling, layout, planning over a fixed grid) where the input and output are bounded and the problem has the iterative-refinement shape?
Yes → Small Recursive Transformer. 7M parameters, $500 to train, beats LLMs on this task family. Not a chat model.
Is your workload code that requires the model to reason about what the code does at runtime — debugging, multi-step refactoring, fixing test failures, SWE-bench-style?
Yes → Code World Model. CWM 32B matches gpt-oss-120b at 4× smaller. The world-modelling mid-training is what buys this.
Do most of your inference calls process >32k tokens, AND is GPU memory or per-token cost a binding constraint? Both must be true.
Yes → Linear-Attention Hybrid. Qwen3-Next, Kimi Linear. 75% KV-cache reduction, up to 6× decoding throughput. Watch for accuracy regressions on multi-turn reasoning.
Is the workload short outputs (a few hundred tokens) on a phone or other low-power device, where time-to-final-answer dominates, AND with no tool use AND no streaming UX?
Yes → Text Diffusion. LLaDA-class. Replaces distilled small models. Don’t use for chat or reasoning.
If you reached Q5, the answer is standard transformer, possibly with one of: extended thinking / reasoning mode, prompt caching, RAG, agent-harness scaffolding. Most workloads land here. The four alternatives are not the future — they are a few specific futures.
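The five questions transcribe directly into code. This sketch invents a `Workload` record and its field names for illustration; the branching logic is the tree above:

```python
# A direct transcription of the five-question tree. The Workload
# fields are made-up names for this sketch.

from dataclasses import dataclass

@dataclass
class Workload:
    bounded_grid_puzzle: bool       # Q1
    runtime_code_reasoning: bool    # Q2
    long_context_routinely: bool    # Q3a: most calls >32k tokens
    cost_or_memory_bound: bool      # Q3b: both must be true
    short_on_device_output: bool    # Q4a
    needs_tools_or_streaming: bool  # Q4b: disqualifies diffusion

def pick_architecture(w: Workload) -> str:
    if w.bounded_grid_puzzle:
        return "small recursive transformer"
    if w.runtime_code_reasoning:
        return "code world model"
    if w.long_context_routinely and w.cost_or_memory_bound:
        return "linear-attention hybrid"
    if w.short_on_device_output and not w.needs_tools_or_streaming:
        return "text diffusion"
    return "standard transformer"   # Q5: most workloads land here
```

Note that Q3 and Q4 are conjunctions: a workload that is long-context but not cost-bound, or short-output but streaming, falls through to the default.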
Answer the questions about your actual workload — not in the abstract. The tree only gives a useful recommendation when the inputs are honest about what your task is and isn’t.
If the recommendation surprises you, the question to ask is “which question did I get wrong?” Most of the wrong-answer regret comes from over-claiming on Q3 (“long context” means the routine bulk of traffic, not edge cases) or Q1 (a chat task with structure isn’t a constraint-satisfaction puzzle).
The decision tree picks one architecture per workload. The richer story is composition: a system that uses several at once, each on the slice of the problem it’s best at.
A standard transformer takes the user’s request, decomposes it, decides what kind of subtask each piece is, and routes to the right specialist.
A 7M recursive solver is exposed as a tool. The orchestrator detects a constraint problem in the conversation, packages it as a grid, calls the tool, gets the answer.
Long-document analysis: linear hybrid. Code edit: code world model. Final natural-language reply: standard. The user sees one assistant.
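A minimal sketch of that routing loop, with a toy keyword classifier standing in for the orchestrator model's judgement. Every name here is invented for illustration; in a real system the frontier model does the decomposition and the specialists sit behind a tool-calling API:

```python
# Toy orchestrator: one request decomposes into subtasks, each routed
# to the specialist that handles that slice of the problem.

SPECIALISTS = {
    "constraint_puzzle": lambda task: f"[recursive solver] {task}",
    "long_document":     lambda task: f"[linear hybrid] {task}",
    "code_edit":         lambda task: f"[code world model] {task}",
    "general":           lambda task: f"[standard transformer] {task}",
}

def classify(task: str) -> str:
    """Keyword stand-in for the orchestrator model's routing decision."""
    t = task.lower()
    if "sudoku" in t:
        return "constraint_puzzle"
    if "contract" in t:
        return "long_document"
    if "refactor" in t:
        return "code_edit"
    return "general"

def handle(request: str) -> list[str]:
    # One user request may hide several subtasks (split on ";" here).
    subtasks = [t.strip() for t in request.split(";")]
    return [SPECIALISTS[classify(t)](t) for t in subtasks]
```

The user-facing reply would be composed from these results by the standard transformer, so the user still sees one assistant.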
The article ends on this exact theme: tiny reasoning models as “tools” for tool-calling LLMs, similar to how systems use Python or calculator APIs. The four families covered in this series are not in competition with each other; they are likely to be deployed together. The interesting question is not which one wins; it is which orchestrator framework will package them together first.
The reference architecture for serious agentic systems by end of 2026 will probably look like this: a frontier transformer plus 3–5 small specialised solvers behind a tool interface. None of the individual components is novel; the orchestration is the engineering. Whoever builds the cleanest abstraction for “route this subtask to the right architecture” will own the next platform layer.
This is the final deck of the five-deck series. To go deeper on any of the four architecture families, jump to the corresponding deck:
This series is part of the Modern Architectures sub-hub — itself a leaf of the LLMs hub covering transformer internals, training, hardware, retrieval, agents, evaluation and safety. This deck closes the Raschka companion; the rest of the modern-architectures sub-hub goes broader (MoE, state-space models, long-context techniques, hybrids).
Pick the most cost- or capability-binding production workload you have. Walk it through the widget. If the result is “standard transformer”, that is also a useful answer — it tells you the migrating-architecture conversation isn’t the highest-leverage one for that workload.
One workload sometimes hides another. The customer’s “chat assistant” is often three workloads stitched together: short Q&A, long-document RAG, and occasional code edits. Each of those may pick a different architecture. The richest answers come from decomposing.
For each “not standard” recommendation, estimate the full switching cost: training-data engineering, serving-stack changes, eval-harness retooling, customer-visible regressions. The total is usually larger than expected. That is the bar the saving needs to clear.
Wire a standard LLM with one specialist tool from this series — whichever the tree pointed you at. The discomfort of designing the “when should the orchestrator delegate?” logic teaches you more about composition than any article.
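As a starting point for that delegation logic, make the gate an explicit, testable predicate rather than a clause buried in a prompt. The grid check and confidence threshold below are invented assumptions, not a recommended policy:

```python
# One possible delegation gate, sketched. The 0.9 threshold and the
# square-grid requirement are arbitrary choices for illustration.

from typing import Optional

def should_delegate_to_solver(grid: Optional[list[list[int]]],
                              confidence: float) -> bool:
    """Delegate only when the subtask really is a bounded grid puzzle
    and the extractor is confident it parsed the grid correctly."""
    if grid is None:
        return False  # orchestrator found no structured puzzle
    rows = len(grid)
    square = rows > 0 and all(len(r) == rows for r in grid)
    return square and confidence >= 0.9  # hypothetical threshold
```

Writing the predicate down forces the uncomfortable questions: what counts as a puzzle, how is it extracted, and what happens on a false positive.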
Re-read Raschka’s essay with the four decks open in adjacent tabs. The article is shorter than the decks combined for a reason — it is dense, and you may want to revisit each section after seeing the technical detail.
Watch for the first agent framework that exposes “route to architecture X” as a first-class concept. LangGraph, the OpenAI Agents SDK, MCP-flavoured stacks — whichever lands the abstraction first will set the convention for years.
What does our standard transformer do badly enough that switching architectures pays for itself? And could we keep it as the orchestrator while a specialist handles the bad parts? If both answers exist, you have a project. If neither does, you have a reasonable status quo — and that is also a valid finding.