Beyond Standard LLMs — Companion to Sebastian Raschka — Presentation 02

Text Diffusion Models

Generate every token at once and refine, instead of left-to-right one at a time. LLaDA, Gemini Diffusion, and the question of whether parallel decoding can survive contact with reasoning, streaming, and tool use.

Tags: Companion · Sebastian Raschka · Diffusion LM · LLaDA · Gemini Diffusion · ParallelBench

[Diagram: all [MASK] → denoise step 1 → denoise step 2 → … → denoise step K → final text]

Read the original article →

00

Topics We’ll Cover

Where this deck sits in the series

Deck 01 covered linear-attention hybrids — an efficiency play within the autoregressive paradigm. This deck takes a sharper turn: text diffusion abandons next-token prediction entirely. The wins are real but the trade-offs are unfamiliar — especially for engineers who have internalised the autoregressive mental model.

01

From Next-Token to Denoising — The Paradigm Shift

Standard LLMs are autoregressive: produce token t, then condition on it to produce t+1. The whole stack — KV cache, streaming UX, tool-call protocols, RLHF reward shaping — was designed around this loop. Diffusion language models break the loop on purpose.

From the article, Raschka frames it cleanly: “Diffusion LLMs generate multiple tokens in parallel rather than sequentially”, through iterative denoising steps rather than next-token prediction. The mental model is closer to image diffusion or to BERT’s masked language modelling than to GPT.

Autoregressive (GPT-style)

  • One forward pass per token.
  • 2 000 tokens of output = 2 000 sequential passes.
  • Causal mask: each token sees only what came before.
  • Streaming “just works” — emit each token as it’s produced.

Diffusion (LLaDA-style)

  • Start with all positions masked; output length is fixed up front.
  • K denoising steps, each fills in some masked positions in parallel.
  • Bidirectional attention: every position sees every other.
  • No streaming: the answer only exists after the last step.
Raschka’s comparison, in numbers

“Even if a diffusion model needs 64 denoising steps to produce all tokens in parallel at each step, this is still computationally more efficient than performing 2 000 sequential generation steps.” The 64 vs 2 000 framing is the central marketing claim — and it is true to first order. The interesting questions are about everything else: latency to first useful output, accuracy on multi-step reasoning, and compatibility with the agent protocols the rest of the ecosystem has standardised on.
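
To make the arithmetic concrete, a toy back-of-the-envelope comparison (illustrative numbers only; it counts serial forward passes and ignores the extra per-pass work a diffusion step does):

```python
output_tokens = 2_000
denoising_steps = 64

ar_serial_passes = output_tokens      # one forward pass per generated token
diff_serial_passes = denoising_steps  # one forward pass per denoising step

print(f"serial passes: autoregressive={ar_serial_passes}, "
      f"diffusion={diff_serial_passes}, "
      f"ratio={ar_serial_passes / diff_serial_passes:.0f}x")

# Latency to first *useful* output differs in kind, not just degree:
# the autoregressive model emits its first token after one pass, while the
# diffusion model emits nothing usable until all denoising steps have run.
```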

02

How a Diffusion Language Model Generates Text

A diffusion LM is trained by masking text at random ratios and learning to recover it. At inference time, that learned recovery step is applied repeatedly: start with a target-length sequence of [MASK] tokens, then iteratively un-mask.

step 0 — [M] [M] [M] [M] [M] [M] [M] [M] [M] [M] [M] [M] — all positions masked; output length fixed up front
step k — The [M] cat sat [M] the [M] and [M] [M] [M] . — most-confident positions filled first
step K-1 — The fat cat sat [M] the mat and looked at me [M] — only a few tokens left
step K (done) — The fat cat sat on the mat and looked at me . — all positions set

The training procedure

  1. Sample text. Sample a random masking ratio r ∈ (0,1).
  2. Replace each token independently with [MASK] with probability r.
  3. Run the model with bidirectional attention — not the causal mask used for autoregressive training.
  4. Train the model to predict the original tokens at the masked positions, weighted by 1/r so the loss is well-calibrated across mask ratios.
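
A minimal sketch of that training recipe in PyTorch (the `model`, `MASK_ID`, and shapes here are illustrative assumptions, not LLaDA's actual code):

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical [MASK] token id

def diffusion_lm_training_step(model, tokens):
    """One masked-diffusion training step (sketch, not LLaDA's exact code).

    tokens: LongTensor (batch, seq_len) holding the clean text.
    model:  a Transformer with *bidirectional* attention returning
            logits of shape (batch, seq_len, vocab).
    """
    B, L = tokens.shape

    # 1. Sample a masking ratio r in (0, 1), one per sequence.
    r = torch.rand(B, 1, device=tokens.device).clamp(min=1e-3)

    # 2. Mask each token independently with probability r.
    is_masked = torch.rand(B, L, device=tokens.device) < r
    corrupted = tokens.masked_fill(is_masked, MASK_ID)

    # 3. Bidirectional forward pass: no causal mask.
    logits = model(corrupted)

    # 4. Cross-entropy on masked positions only, weighted by 1/r so the loss
    #    stays comparable across masking ratios.
    ce = F.cross_entropy(
        logits.view(-1, logits.size(-1)), tokens.view(-1), reduction="none"
    ).view(B, L)
    loss = ((ce * is_masked) / r).sum() / is_masked.sum().clamp(min=1)
    return loss
```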

The inference procedure

  1. Start with all positions masked. Run the model. Each position outputs a distribution over the vocabulary.
  2. Commit the most-confident k positions — un-mask those tokens. Leave the rest masked.
  3. Run the model again. With more committed context, the next round’s predictions are sharper.
  4. Repeat until all positions are set, or until a fixed step budget is exhausted.
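
And a matching sketch of the decode loop, using the same hypothetical `model` and mask id as above; real implementations differ mainly in how they schedule how many positions to commit per step:

```python
import torch

@torch.no_grad()
def diffusion_decode(model, prompt_ids, gen_len=64, steps=16, mask_id=0):
    """Confidence-based iterative un-masking (sketch, not a reference implementation).

    prompt_ids: LongTensor (1, prompt_len) of conditioning tokens.
    Returns a (1, prompt_len + gen_len) LongTensor.
    """
    device = prompt_ids.device
    seq = torch.cat(
        [prompt_ids, torch.full((1, gen_len), mask_id, device=device)], dim=1
    )
    commits_per_step = max(1, gen_len // steps)   # fixed schedule for simplicity

    while (seq == mask_id).any():
        logits = model(seq)                        # bidirectional forward pass
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)             # per-position top-1 confidence

        conf = conf.masked_fill(seq != mask_id, -1.0)  # only masked slots compete
        k = min(commits_per_step, int((seq == mask_id).sum()))
        top = conf.topk(k, dim=-1).indices         # most-confident masked positions

        seq.scatter_(1, top, pred.gather(1, top))  # commit those predictions
    return seq
```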
Why this is more than a re-skin of BERT

Raschka makes the connection explicit: diffusion LMs are “analogous to BERT’s masked language modelling, extended probabilistically”. The extension matters — rather than predicting one masked position with the rest of the sentence intact, the model learns to recover from any mask ratio, giving it the iterative-refinement behaviour. BERT learned the easy version; diffusion LMs learn the hard version.

03

LLaDA — The First Credible 8B Diffusion LM

LLaDA (Large Language Diffusion with mAsking) is the cleanest open-weights diffusion language model to date. It is what makes the “is this approach actually viable?” question concrete instead of theoretical.

What it actually is

  • 8B parameter dense decoder.
  • Built on the Llama 3 architecture with the causal mask removed and the prediction head retargeted at masked positions.
  • Standard SFT and RLHF on top of the diffusion-trained base.
  • Released as both base and instruct variants.

What it shows

  • You can lift Llama-3-tier capability into the diffusion regime without rewriting the architecture from scratch.
  • Standard benchmarks (MMLU, GSM8K) come within striking distance of comparable autoregressive 8B models.
  • The instruction-tuning recipe transfers; the model takes RLHF.
  • Inference can be run with 16–64 denoising steps on a generation up to 1 024 tokens.

The architectural punchline

The same neural-network skeleton serves both paradigms. The difference is what you train it to do and how you decode: causal-mask + next-token-loss + greedy/sample = autoregressive; no-mask + masked-LM-loss + iterative-refinement = diffusion. The cost of switching is in the training pipeline, not the model card.
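
A way to see how small that delta is: the attention mask is the main switch (a sketch; everything else about the skeleton can stay put):

```python
import torch

def attention_mask(seq_len: int, causal: bool) -> torch.Tensor:
    """True = attention allowed: causal for autoregressive, full for diffusion."""
    full = torch.ones(seq_len, seq_len)
    return (torch.tril(full) if causal else full).bool()

# Same Transformer skeleton, two regimes:
#   causal mask + next-token loss + left-to-right sampling -> autoregressive
#   full mask   + masked-LM loss  + iterative un-masking   -> diffusion
print(attention_mask(4, causal=True).int())
print(attention_mask(4, causal=False).int())
```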

A useful comparison point

If LLaDA-Instruct can hold its own at 8B, the question becomes: what would a frontier-scale diffusion LM look like? Does the “parallel decoding” advantage compound with scale, or does autoregressive training pull ahead at 70B+ where the data and compute budget is more efficiently spent on standard methods? At time of writing, no one has run the experiment in public.

04

Gemini Diffusion — Google’s Production Bet

Google announced Gemini Diffusion as a production diffusion LM — the first time a frontier lab has put diffusion text generation on a serving roadmap rather than treating it as a research demo. The interesting numbers are not on accuracy, but on speed.

The headline claim

Maintains parity with Gemini 2.0 Flash-Lite on standard benchmarks while being faster to generate. Google is positioning it for low-latency on-device and mobile use cases — the same niche Flash-Lite occupies, but with the parallel-decoding throughput advantage.

What is conspicuously absent

No reasoning-mode variant. No tool-calling story. No streaming UX. No public release of weights. The deployment scope is narrow and explicit: generate a complete answer, fast. The model is a replacement candidate for distilled small autoregressive models, not for the flagship.

Why this is interesting even if you don’t deploy Gemini

A pattern worth watching

Both major announcements in this family — LLaDA and Gemini Diffusion — sit at 8B-class capability. None of the published systems is a frontier-scale reasoning model. The small-model niche is where diffusion lives in 2026. Whether that’s a permanent ceiling or just the current capability frontier is the question.

05

Where the Wins Come From — And Where They Don’t

The diffusion approach delivers two distinct kinds of gain. They are easy to confuse, and treating them as the same thing leads to confused expectations.

Win 1 — throughput

16–64 forward passes vs 2 000 forward passes for an equivalent-length output. On hardware where batch parallelism is cheap and sequential dependencies are expensive, that is roughly a 30–125× reduction in serial steps. The wall-clock latency advantage is real and measured.

Win 2 — bidirectional context

Each token sees the entire output, not just what came before. For tasks where late tokens disambiguate early ones (constraint-satisfaction, structured output), that is a non-trivial inductive bias. The model can “plan ahead” in a way GPT cannot.

Loss 1 — no streaming

The answer doesn’t exist token-by-token; it materialises in the last denoising step. For chat UX, this kills the “tokens trickle in” experience users have come to expect. For agent loops where the next action depends on a prefix of the output, it is a hard architectural mismatch.

Loss 2 — conditional dependency

From the article: when generating the phrase “New York”, the autoregressive model can commit to “New” first and then strongly prefer “York”. The diffusion model has to sample the two positions jointly in one forward pass, and may end up with “New City” or “Newport York”. This is the central failure mode.
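
The effect is easy to reproduce with a two-position toy distribution (the probabilities below are invented for illustration):

```python
import random
random.seed(0)

# Toy joint distribution over two adjacent positions: the context makes
# "New York" and "Los Angeles" equally likely, and nothing else.
joint = {("New", "York"): 0.5, ("Los", "Angeles"): 0.5}

# A single parallel denoising pass predicts each position from its marginal.
first_marginal = {"New": 0.5, "Los": 0.5}
second_marginal = {"York": 0.5, "Angeles": 0.5}

def sample(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

# Committing both positions in the same step samples them independently:
pairs = [(sample(first_marginal), sample(second_marginal)) for _ in range(10_000)]
bad = sum(1 for p in pairs if p not in joint)
print(f"incoherent pairs ('New Angeles', 'Los York'): {bad / len(pairs):.0%}")

# Roughly half the draws are pairs with zero joint probability. An autoregressive
# model commits the first token, then conditions on it, so this never happens.
```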

What ParallelBench measured

The ParallelBench paper, cited by Raschka, isolates exactly this failure: “current parallel decoding strategies struggle to adapt their degree of parallelism based on task difficulty”. The model commits too many tokens too early on hard tasks, then can’t recover. Adaptive parallelism — deciding per-step how many tokens to commit — is an open research problem.
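
A naive version of that idea, just to fix intuition (a sketch, not ParallelBench's proposal or any published method): commit only the positions whose confidence clears a threshold, and fall back to near-sequential decoding when nothing does.

```python
import torch

def adaptive_commit_count(confidences: torch.Tensor,
                          threshold: float = 0.9,
                          min_commit: int = 1) -> int:
    """Naive adaptive-parallelism rule (illustrative only).

    confidences: top-1 probabilities at the still-masked positions.
    Commit every position whose confidence clears the threshold; if none do,
    commit at least `min_commit` so decoding still terminates.
    """
    return max(int((confidences > threshold).sum()), min_commit)

# Easy step (fluent filler tokens): commit many positions at once.
print(adaptive_commit_count(torch.tensor([0.99, 0.97, 0.95, 0.40])))  # -> 3
# Hard step (genuine ambiguity): fall back to near-sequential decoding.
print(adaptive_commit_count(torch.tensor([0.55, 0.52, 0.48, 0.40])))  # -> 1
```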

A helpful framing

Diffusion LMs are great when the joint distribution over the output is mostly captured by surface-level fluency. They struggle when there is a strong sequential dependency that a left-to-right pass would handle automatically. Knowing which side of that line your task sits on tells you most of what you need to know.

06

Interactive: Diffusion vs Autoregressive Visualiser

Run both decoding strategies side by side on a fixed target sentence. The autoregressive model fills tokens one at a time; the diffusion model fills the most-confident few per step in parallel. Watch the trade-off: more total forward passes vs more sequential dependencies.

[Interactive widget — the two decoders run side by side on the same target sentence:
Autoregressive (GPT-style) · one token per forward pass · left-to-right · counters: forward passes, tokens set
Diffusion (LLaDA-style) · commit top-k confident positions per step · bidirectional · counters: denoising steps, tokens set]
What to notice

Both panels end on the same finished sentence — but watch the forward pass counters. The autoregressive side ticks up to the full token count; the diffusion side stops well before. That ratio is the throughput win. Now look at the order the diffusion side fills tokens in — not strictly left-to-right, often the “easy” tokens (punctuation, common words) first. That is the bidirectional view at work.

07

The Hard Limits — Streaming, Conditional Dependency, Tool Use

Three load-bearing limitations that the article surfaces, none of them solved as of early 2026:

No streaming

The output exists only after the last denoising step. Chat UI expectations break. Agent loops that condition on partial output (early-exit, interrupt) cannot be retrofitted — the architecture doesn’t produce a partial output.

Conditional dependency

“New York” vs “New City”. Long-range factual chains (“Alice gave Bob the key, then Bob unlocked the door”) where committing the wrong noun early creates inconsistencies the rest of the sequence can’t fix in the remaining denoising budget.

Tool-call compatibility unclear

Tool calls are structured chunks with strict syntax. Are they emitted in one denoising pass? Across many? What happens if the JSON is half-committed and then the rest can’t close it? No production diffusion LM has shipped tool use yet.
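
The JSON worry is easy to make concrete: a half-committed tool call is not a prefix of valid JSON, so nothing downstream can parse or stream it (a toy illustration, not the behaviour of any shipped model):

```python
import json

# A tool call mid-denoising: some spans committed, others still [MASK].
half_committed = '{"tool": "search", "args": {"query": "[MASK]", "top_k": [MASK]}}'

try:
    json.loads(half_committed)
except json.JSONDecodeError as e:
    print(f"not yet parseable: {e}")

# An autoregressive model emits a syntactically valid prefix token by token, so a
# parser (or a constrained-decoding grammar) can track it incrementally. A diffusion
# model only has a checkable object after the final denoising step.
```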

Reasoning: the harder open question

Chain-of-thought, RLHF’d reasoning, the o1/R1 paradigm — all of these rely on the model emitting long sequential reasoning chains where each step depends on the previous one. That is the worst case for diffusion: maximum conditional dependency, at exactly the long generation lengths where the parallel-decoding advantage is supposed to pay off. None of the published diffusion LMs has a credible reasoning-mode story.

The verdict, for now

Diffusion text generation is plausible for small on-device models that produce short, fluent, single-turn answers. It is unproven for reasoning, tool-use, and agentic loops. Raschka’s framing is precisely this: it may replace distilled autoregressive models, not the frontier. That is a genuinely useful niche — not a paradigm shift.

08

Companion Decks — Where to Go Next

  1. 01
    Linear-Attention Hybrids →
    MiniMax-M1, Qwen3-Next, DeepSeek V3.2, Kimi Linear. Gated DeltaNet, the 3:1 hybrid pattern, and why MiniMax-M2 reverted. KV-cache calculator widget.
  2. 03
    Code World Models →
    CWM 32B. Mid-training that makes the model predict program state evolution. SWE-bench parity at 4× smaller. Interactive: world-model rollout stepper.
  3. 04
    Small Recursive Transformers →
    HRM, TRM (7M params). Iterative self-loops for ARC-AGI. The surprising attention-not-required ablation. Recursive trace viewer.
  4. 05
    When to Reach for Non-Transformer →
    A practical decision-tree synthesising all four families. Walk your workload through the tree and see which architecture fits.
09

Things to Try Yourself

15 minutes — play with the visualiser

Watch the order tokens get committed in. Predict, before pressing Run, where the diffusion model will stumble — positions that depend on later tokens. The widget seeds a fixed sentence so you can see the same trajectory repeatedly.

1 hour — run LLaDA-Instruct

The weights are open. Run a prompt that requires “New York” in a non-obvious place. Compare to a same-size autoregressive model. The error mode you hit (or don’t) is the one ParallelBench characterises.

An afternoon — vary the step count

LLaDA exposes the denoising step count as a knob. Sweep it from 4 to 256. Plot accuracy and latency. The trade-off curve is non-monotonic in interesting ways — too few steps causes commitment errors; too many wastes compute and approximates autoregressive cost.
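
A scaffold for that sweep, with the two stubs standing in for whatever inference call and task metric you wire up (they are placeholders, not LLaDA's actual API):

```python
import time

# Replace these two stubs with your real LLaDA inference call and task metric.
def generate(prompt: str, steps: int) -> str:   # placeholder, not LLaDA's API
    return prompt                               # stub so the scaffold runs end to end

def score(output: str) -> float:                # placeholder task metric
    return 0.0

prompts = ["Translate to French: the fat cat sat on the mat."]  # your eval set
for steps in [4, 8, 16, 32, 64, 128, 256]:
    t0 = time.perf_counter()
    outputs = [generate(p, steps=steps) for p in prompts]
    latency = (time.perf_counter() - t0) / len(prompts)
    accuracy = sum(score(o) for o in outputs) / len(outputs)
    print(f"steps={steps:>3}  latency={latency*1000:.1f} ms/prompt  accuracy={accuracy:.3f}")

# Expect commitment errors at the low end of the sweep and diminishing returns
# (cost creeping toward autoregressive) at the high end.
```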

A weekend — force it to do tool use

Try to wire LLaDA into an agent loop with a single bounded tool. Document where it breaks. The failure modes are not yet well-understood; even a small empirical study fills a real gap.

Read the source

Start with the LLaDA paper for the cleanest exposition. Then ParallelBench for the limitations. The Gemini Diffusion blog post is short; the most telling technical detail is what Google chose not to disclose.

Watch for —

The next public diffusion-LM release that ships tool-calling, or a reasoning-mode variant, or both. Whichever lab does it first will have made a non-trivial architectural breakthrough. None has, as of this writing.

A diagnostic prompt for your team

Next time a vendor pitches you on “parallel decoding for free”, ask them: does your model stream? Does it support tool calls? Does it run in your reasoning mode? If any of those is “no”, you are looking at an on-device-class deployment, not a frontier replacement — whether or not the marketing says otherwise.