Generate every token at once and refine, instead of left-to-right one at a time. LLaDA, Gemini Diffusion, and the question of whether parallel decoding can survive contact with reasoning, streaming, and tool use.
Deck 01 covered linear-attention hybrids — an efficiency play within the autoregressive paradigm. This deck takes a sharper turn: text diffusion abandons next-token prediction entirely. The wins are real but the trade-offs are unfamiliar — especially for engineers who have internalised the autoregressive mental model.
Standard LLMs are autoregressive: produce token t, then condition on it to produce t+1. The whole stack — KV cache, streaming UX, tool-call protocols, RLHF reward shaping — was designed around this loop. Diffusion language models break the loop on purpose.
From the article, Raschka frames it cleanly: “Diffusion LLMs generate multiple tokens in parallel rather than sequentially”, through iterative denoising steps rather than next-token prediction. The mental model is closer to image diffusion or to BERT’s masked language modelling than to GPT.
“Even if a diffusion model needs 64 denoising steps to produce all tokens in parallel at each step, this is still computationally more efficient than performing 2 000 sequential generation steps.” The 64 vs 2 000 framing is the central marketing claim — and it is true to first order. The interesting questions are about everything else: latency to first useful output, accuracy on multi-step reasoning, and compatibility with the agent protocols the rest of the ecosystem has standardised on.
A diffusion LM trains by progressively masking text and learning to recover it. At inference time, that process runs in reverse: start with a target-length sequence of [MASK] tokens, then iteratively un-mask.
Training samples a mask ratio r ∈ (0, 1), replaces each token with [MASK] independently with probability r, and weights the loss by 1/r so it is well-calibrated across mask ratios. Raschka makes the connection explicit: diffusion LMs are “analogous to BERT’s masked language modelling, extended probabilistically”. The extension matters — rather than predicting one masked position with the rest of the sentence intact, the model learns to recover from any mask ratio, giving it the iterative-refinement behaviour. BERT learned the easy version; diffusion LMs learn the hard version.
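The recipe fits in a few lines of plain Python. This is an illustrative sketch, not LLaDA's actual implementation; `MASK`, `corrupt`, and `loss_weight` are invented names:

```python
import random

MASK = -1  # hypothetical mask-token id

def corrupt(tokens, r, rng):
    """Forward process: mask each token independently with probability r."""
    return [MASK if rng.random() < r else t for t in tokens]

def loss_weight(r):
    """Weight the masked-LM cross-entropy by 1/r, so examples with few
    masked positions (small r) are not under-counted relative to heavily
    masked ones and the objective stays calibrated across mask ratios."""
    return 1.0 / r

rng = random.Random(0)
r = rng.uniform(1e-3, 1.0)            # sample a fresh mask ratio per example
noisy = corrupt([5, 9, 2, 7], r, rng)
# The model is then trained to predict the original ids at the MASK
# positions, with the loss scaled by loss_weight(r).
```

At r near 1 the model sees the fully masked input it will face at the first inference step; at small r it sees the nearly complete sequences of the final steps.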
LLaDA (Large Language Diffusion with mAsking) is the cleanest open-weights diffusion language model to date. It is what makes the “is this approach actually viable?” question concrete instead of theoretical.
The same neural-network skeleton serves both paradigms. The difference is what you train it to do and how you decode: causal-mask + next-token-loss + greedy/sample = autoregressive; no-mask + masked-LM-loss + iterative-refinement = diffusion. The cost of switching is in the training pipeline, not the model card.
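The contrast can be written down as two configurations of the same backbone. The dict keys and values here are illustrative labels, not any framework's real options:

```python
# Same transformer skeleton, two training/decoding regimes.
AUTOREGRESSIVE = dict(
    attention_mask="causal",          # position i attends only to j <= i
    loss="next_token_cross_entropy",  # predict token t+1 from prefix
    decode="one token per forward pass, left to right",
)

DIFFUSION = dict(
    attention_mask="full",            # every position attends to every other
    loss="masked_lm_cross_entropy_weighted_by_1_over_r",
    decode="iterative unmasking over a fixed-length canvas",
)
```

Nothing on the architecture side forces the choice; the switch lives entirely in the mask, the loss target, and the decode loop.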
If LLaDA-Instruct can hold its own at 8B, the question becomes: what would a frontier-scale diffusion LM look like? Does the “parallel decoding” advantage compound with scale, or does autoregressive training pull ahead at 70B+ where the data and compute budget is more efficiently spent on standard methods? At time of writing, no one has run the experiment in public.
Google announced Gemini Diffusion as a production diffusion LM — the first time a frontier lab has put diffusion text generation on a serving roadmap rather than treating it as a research demo. The interesting numbers are not on accuracy, but on speed.
Maintains parity with Gemini 2.0 Flash-Lite on standard benchmarks while generating output faster. Google is positioning it for low-latency on-device and mobile use cases — the same niche Flash-Lite occupies, but with the parallel-decoding throughput advantage.
No reasoning-mode variant. No tool-calling story. No streaming UX. No public release of weights. The deployment scope is narrow and explicit: generate a complete answer, fast. The model is a replacement candidate for distilled small autoregressive models, not for the flagship.
Both major announcements in this family — LLaDA and Gemini Diffusion — sit at 8B-class capability. None of the published systems is a frontier-scale reasoning model. The small-model niche is where diffusion lives in 2026. Whether that’s a permanent ceiling or just the current capability frontier is the question.
The diffusion approach delivers two distinct kinds of gain. They are easy to confuse, and treating them as the same thing leads to confused expectations.
16–64 forward passes vs 2 000 forward passes for an equivalent-length output. On hardware where batch parallelism is cheap and sequential dependencies are expensive, that is a 30–100× reduction in serial steps. The wall-clock latency advantage is real and measured.
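The first-order arithmetic behind that claim, under the simplifying assumption that a full-width parallel pass costs roughly the same wall-clock as one causal pass (in practice it does more work per pass):

```python
def serial_speedup(ar_tokens, denoise_steps):
    """Ratio of sequential forward passes between the two regimes:
    one pass per autoregressive token vs one per denoising step."""
    return ar_tokens / denoise_steps

serial_speedup(2000, 64)  # ~31x fewer serial steps
serial_speedup(2000, 32)  # ~62x
```

The ratio is a ceiling on the latency win, not a measurement: memory bandwidth, batch shape, and per-pass FLOPs all eat into it.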
Each token sees the entire output, not just what came before. For tasks where late tokens disambiguate early ones (constraint-satisfaction, structured output), that is a non-trivial inductive bias. The model can “plan ahead” in a way GPT cannot.
The answer doesn’t exist token-by-token; it materialises in the last denoising step. For chat UX, this kills the “tokens trickle in” experience users have come to expect. For agent loops where the next action depends on a prefix of the output, it is a hard architectural mismatch.
From the article: when generating the phrase “New York”, the autoregressive model can commit to “New” first and then strongly prefer “York”. The diffusion model has to sample the two positions jointly in one forward pass, and may end up with “New City” or “Newport York”. This is the central failure mode.
The ParallelBench paper, cited by Raschka, isolates exactly this failure: “current parallel decoding strategies struggle to adapt their degree of parallelism based on task difficulty”. The model commits too many tokens too early on hard tasks, then can’t recover. Adaptive parallelism — deciding per-step how many tokens to commit — is an open research problem.
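The “New York” failure can be made quantitative with a toy joint distribution; the probabilities below are invented purely for illustration:

```python
import itertools

# Toy joint distribution over a two-token span: the context supports
# exactly two coherent completions, each with probability 0.5.
joint = {("New", "York"): 0.5, ("Los", "Angeles"): 0.5}

def marginal(joint, pos):
    """The per-position distribution a parallel decoder samples from
    when it commits both positions in a single step."""
    m = {}
    for tokens, p in joint.items():
        m[tokens[pos]] = m.get(tokens[pos], 0.0) + p
    return m

m0, m1 = marginal(joint, 0), marginal(joint, 1)

# Sampling the two positions independently from their marginals puts
# 25% mass on "New Angeles" and 25% on "Los York" -- strings with zero
# probability under the joint distribution.
p_incoherent = sum(
    m0[a] * m1[b]
    for a, b in itertools.product(m0, m1)
    if (a, b) not in joint
)
```

Half the probability mass lands on outputs the model itself considers impossible. Sequential decoding avoids this for free, because sampling position 0 first conditions position 1 on it.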
Diffusion LMs are great when the joint distribution over the output is mostly captured by surface-level fluency. They struggle when there is a strong sequential dependency that a left-to-right pass would handle automatically. Knowing which side of that line your task sits on tells you most of what you need to know.
Run both decoding strategies side by side on a fixed target sentence. The autoregressive model fills tokens one at a time; the diffusion model fills the most-confident few per step in parallel. Watch the trade-off: more total forward passes vs more sequential dependencies.
Both finish in the same number of slides — but watch the forward pass counters. The autoregressive side ticks up to the full token count; the diffusion side stops well before. That ratio is the throughput win. Now look at the order the diffusion side fills tokens in — not strictly left-to-right, often the “easy” tokens (punctuation, common words) first. That is the bidirectional view at work.
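The widget's two sides can be sketched as a toy decoder. The `predict` function is a stand-in for real model logits, and `EASY`, `TARGET`, and the confidence values are invented for illustration:

```python
MASK = None
TARGET = "the quick brown fox jumps over the lazy dog".split()

# Stand-in "confidence": common function words are easy, content words
# harder -- a crude proxy for what real logits look like.
EASY = {"the", "over"}

def predict(canvas):
    """Fake model: for each masked slot, return (confidence, token)."""
    return {
        i: (0.9 if TARGET[i] in EASY else 0.4, TARGET[i])
        for i, tok in enumerate(canvas) if tok is MASK
    }

def diffusion_decode(length, per_step=3):
    """Commit the most-confident per_step slots each denoising step."""
    canvas, passes = [MASK] * length, 0
    while MASK in canvas:
        passes += 1
        preds = predict(canvas)
        for i in sorted(preds, key=lambda i: -preds[i][0])[:per_step]:
            canvas[i] = preds[i][1]          # not left-to-right order
    return canvas, passes

def autoregressive_decode(length):
    """One forward pass per token, strictly left to right."""
    canvas, passes = [], 0
    for i in range(length):
        passes += 1
        canvas.append(TARGET[i])
    return canvas, passes
```

Running both on the nine-token target gives the same output, but the diffusion side's forward-pass counter stops at 3 against the autoregressive side's 9, and its first step commits the two "the"s and "over" before any content word.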
Three load-bearing limitations that the article surfaces, none of them solved as of early 2026:
The output exists only after the last denoising step. Chat UI expectations break. Agent loops that condition on partial output (early-exit, interrupt) cannot be retrofitted — the architecture doesn’t produce a partial output.
“New York” vs “New City”. Long-range factual chains (“Alice gave Bob the key, then Bob unlocked the door”) where committing the wrong noun early creates inconsistencies the rest of the sequence can’t fix in the remaining denoising budget.
Tool calls are structured chunks with strict syntax. Are they emitted in one denoising pass? Across many? What happens if the JSON is half-committed and then the rest can’t close it? No production diffusion LM has shipped tool use yet.
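One way to see why this is awkward: JSON validity is a whole-sequence property, so there is no intermediate denoising state a runtime can safely act on. The trajectory below is hypothetical; real token boundaries would be finer-grained:

```python
import json

MASK = None

def is_valid_call(canvas):
    """A tool call is checkable only once every position is committed.
    Unlike autoregressive decoding, where a grammar can constrain each
    next token, there is no per-step prefix to validate here."""
    if MASK in canvas:
        return False                  # still holes -- nothing to validate
    try:
        json.loads("".join(canvas))
        return True
    except json.JSONDecodeError:
        return False

# Hypothetical denoising trajectory: structural tokens committed early,
# the argument value only in the final step.
steps = [
    [MASK, MASK, MASK, MASK],
    ['{"tool": "search", ', MASK, MASK, '}'],
    ['{"tool": "search", ', '"query": ', MASK, '}'],
    ['{"tool": "search", ', '"query": ', '"llada"', '}'],
]
```

Only the last state parses, which is exactly the streaming problem restated: the agent runtime gets nothing actionable until the full denoising budget is spent.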
Chain-of-thought, RLHF’d reasoning, the o1/R1 paradigm — all of these rely on the model emitting long sequential reasoning chains where each step depends on the previous one. That is the worst case for diffusion: maximum conditional dependency, at exactly the sequence lengths where the parallel-decoding advantage would otherwise matter most. None of the published diffusion LMs has a credible reasoning-mode story.
Diffusion text generation is plausible for small on-device models that produce short, fluent, single-turn answers. It is unproven for reasoning, tool-use, and agentic loops. Raschka’s framing is precisely this: it may replace distilled autoregressive models, not the frontier. That is a genuinely useful niche — not a paradigm shift.
Watch the order tokens get committed in. Predict, before pressing Run, where the diffusion model will stumble — positions that depend on later tokens. The widget seeds a fixed sentence so you can see the same trajectory repeatedly.
The weights are open. Run a prompt that requires “New York” in a non-obvious place. Compare to a same-size autoregressive model. The error mode you hit (or don’t) is the one ParallelBench characterises.
LLaDA exposes the denoising step count as a knob. Sweep it from 4 to 256. Plot accuracy and latency. The trade-off curve is non-monotonic in interesting ways — too few steps causes commitment errors; too many wastes compute and approximates autoregressive cost.
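A minimal harness for that sweep might look like the sketch below. `generate` and `evaluate` are placeholders for your model call and task metric, not a real LLaDA API:

```python
def sweep(generate, evaluate, prompts,
          step_counts=(4, 8, 16, 32, 64, 128, 256)):
    """For each denoising-step budget, run every prompt and record mean
    accuracy. Plug in your own generate(prompt, steps=...) wrapper and
    a task-specific evaluate(output) -> float in [0, 1]."""
    results = []
    for steps in step_counts:
        outputs = [generate(p, steps=steps) for p in prompts]
        acc = sum(map(evaluate, outputs)) / len(outputs)
        results.append((steps, acc))
    return results
```

Pairing each accuracy with wall-clock per call turns the result into the latency/quality trade-off curve directly; the low-step end is where the commitment errors described above should dominate.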
Try to wire LLaDA into an agent loop with a single bounded tool. Document where it breaks. The failure modes are not yet well-understood; even a small empirical study fills a real gap.
Start with the LLaDA paper for the cleanest exposition. Then ParallelBench for the limitations. The Gemini Diffusion blog post is short; the technical detail in it is what they chose not to disclose.
The next public diffusion-LM release that ships tool-calling, or a reasoning-mode variant, or both. Whichever lab does it first will have made a non-trivial architectural breakthrough. None has, as of this writing.
Next time a vendor pitches you on “parallel decoding for free”, ask them: does your model stream? Does it support tool calls? Does it run in your reasoning mode? If any of those is “no”, you are looking at an on-device-class deployment, not a frontier replacement — whether or not the marketing says otherwise.