
Can Diffusion Replace Autoregressive LLMs? The Complete LLaDA 2.X Guide

From DDPM to LLaDA 2.1 -- everything about diffusion-based LLMs. Masked Diffusion, Token Editing, and MoE scaling dissected across 4 parts.


ChatGPT, Claude, Gemini — every large language model (LLM) we use today is built on a single principle: autoregressive (AR) generation, producing text left to right, one token at a time, by predicting the next word.

This approach works remarkably well. But it has structural limitations.

  • Tokens must be produced one at a time in sequence, making parallel generation impossible
  • Even if the model learns "A is B," it often fails to infer "B is A" (the Reversal Curse)
  • Because it only looks left to right, it cannot leverage right-side context

But what if we built an LLM using Diffusion?

Just as Stable Diffusion and DALL-E demonstrated in image generation — starting from noise and progressively refining toward a clean result — what if we could apply the same Diffusion approach to text?

In February 2025, a research team from HKU/PKU published LLaDA (Large Language Diffusion with mAsking), turning this possibility into reality. Then in late 2025, Ant Group's InclusionAI scaled up to 100B parameters with LLaDA 2.0, and in February 2026, LLaDA 2.1 solved the speed problem with an innovation called Token Editing.

This series fully dissects everything from the fundamentals of Diffusion to the latest techniques in LLaDA 2.1, across 4 parts.

Why Diffusion LLMs?

The core premise of Autoregressive models is simple: text is generated left to right.

P(x) = P(x_1) * P(x_2|x_1) * P(x_3|x_1,x_2) * ...

This assumption keeps training straightforward and scaling clean. But the structural weaknesses are equally clear.

Speed bottleneck: To generate 1000 tokens, the model must be called 1000 times sequentially. Each step re-processes the entire context. KV-cache mitigates this, but fundamentally, O(n) sequential calls are unavoidable.
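To make the bottleneck concrete, here is a minimal sketch of the AR decoding loop. The stand-in "model" is a toy function, not a real LLM; the point is only that n new tokens require n strictly sequential forward passes:

```python
def toy_next_token(context):
    """Stand-in for a real LLM forward pass: picks the next token
    deterministically from the context length (illustration only)."""
    vocab = ["the", "cat", "sat", "on", "mat", "<eos>"]
    return vocab[len(context) % len(vocab)]

def generate_autoregressive(prompt, max_new_tokens):
    """Left-to-right decoding: one model call per token, strictly sequential.
    Each call re-reads the whole context (KV-cache only amortizes this)."""
    tokens = list(prompt)
    calls = 0
    for _ in range(max_new_tokens):
        nxt = toy_next_token(tokens)  # cannot be issued in parallel
        calls += 1
        tokens.append(nxt)
        if nxt == "<eos>":
            break
    return tokens, calls

out, calls = generate_autoregressive(["hello"], 10)
```

No matter how fast each forward pass is, the calls form a chain: call k+1 needs the output of call k, so wall-clock latency scales with sequence length.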

Unidirectional dependency: The model only sees left context. Even if it learns "Tom Cruise's mother is Mary Lee Pfeiffer," it struggles to answer "Who is Mary Lee Pfeiffer's son?" This is the Reversal Curse.

No revision: AR models cannot go back and fix tokens they have already generated. Even when they spot a mistake, they can only move forward.

Diffusion models approach all three of these differently.

  • All tokens are generated simultaneously and progressively refined (parallel generation)
  • Context is leveraged bidirectionally (mitigating the Reversal Curse)
  • Results are revised across multiple steps (iterative refinement)
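The three bullets above can be sketched as one decoding loop. This is a toy illustration of confidence-based parallel unmasking, not any model's actual sampler: the denoiser is a deterministic stand-in and the unmasking schedule is an assumption:

```python
import random

MASK = "[MASK]"
random.seed(0)

def toy_denoiser(tokens):
    """Stand-in for a bidirectional Transformer: for every masked position,
    return a (token, confidence) guess. A real model predicts each position
    from both left and right context simultaneously."""
    vocab = ["the", "cat", "sat", "on", "a", "mat"]
    return {i: (vocab[i % len(vocab)], random.random())
            for i, t in enumerate(tokens) if t == MASK}

def diffusion_decode(length, steps):
    """Start fully masked; each step predicts all masks in parallel and
    commits the highest-confidence fraction (iterative refinement)."""
    tokens = [MASK] * length
    for step in range(steps):
        preds = toy_denoiser(tokens)
        if not preds:
            break
        k = max(1, len(preds) // (steps - step))  # commit this many now
        best = sorted(preds.items(), key=lambda kv: -kv[1][1])[:k]
        for i, (tok, _) in best:
            tokens[i] = tok
    return tokens

result = diffusion_decode(length=6, steps=3)
```

Here 6 tokens are produced in 3 model calls instead of 6, and every prediction in a step is free to condition on tokens to its right.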

LLaDA Series Timeline

| Date | Model | Key Contribution |
|---|---|---|
| 2025.02 | LLaDA 8B | First large-scale Diffusion LLM. Masked Diffusion + Transformer |
| 2025.11 | LLaDA 2.0-mini (16B) | Introduces MoE architecture. 1.4B active parameters |
| 2025.11 | LLaDA 2.0-flash (100B) | First 100B Diffusion LLM. 6.1B active parameters |
| 2026.02 | LLaDA 2.1-mini (16B) | Token Editing (T2T + M2T). S-Mode / Q-Mode |
| 2026.02 | LLaDA 2.1-flash (100B) | RL Framework + Token Editing. 892 TPS achieved |

Series Overview

Part 1: Diffusion Fundamentals — From DDPM to Score Matching

We cover the core principles of Diffusion, proven in image generation. The forward process (adding noise), the reverse process (removing noise), the ELBO training objective, and the connection to score matching. After reading this part, you will have a mathematical understanding of why Diffusion works.

Key terms: DDPM, Forward/Reverse Process, ELBO, Score Function, SDE
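As a preview of the notation Part 1 builds on, the standard DDPM forward process adds Gaussian noise step by step, and admits a closed-form marginal that lets you jump to any noise level t directly:

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\bigl(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\bigr)
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\bigl(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\bigr),
\quad \bar\alpha_t = \prod_{s=1}^{t} \alpha_s,\ \ \alpha_s = 1-\beta_s
```

The reverse process learns to undo these steps, and the ELBO is what makes that learnable by maximum likelihood.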

Part 2: Discrete Diffusion — How Do You Add Noise to Text?

We explore how to apply continuous-space Diffusion to discrete tokens. D3PM's Transition Matrix, the point where Absorbing State meets BERT's [MASK], and MDLM's simplifications. This part explains how the gap between images and text is bridged.

Key terms: D3PM, Transition Matrix, Absorbing State, MDLM, Masked Diffusion
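The absorbing-state kernel at the heart of this part fits in a few lines. A hedged sketch (vocabulary size and noise rate are illustrative, and real implementations never materialize this matrix): each ordinary token stays put with probability 1 - beta or jumps to [MASK] with probability beta, and [MASK] never leaves — exactly BERT-style masking applied gradually:

```python
def absorbing_transition_matrix(vocab_size, beta):
    """D3PM absorbing-state kernel Q_t over vocab_size + 1 states, where the
    last index is [MASK]. Row i is the distribution of the next state given
    current state i."""
    mask_id = vocab_size            # append [MASK] as the final row/column
    size = vocab_size + 1
    Q = [[0.0] * size for _ in range(size)]
    for i in range(vocab_size):
        Q[i][i] = 1.0 - beta        # token survives this step
        Q[i][mask_id] = beta        # token is absorbed into [MASK]
    Q[mask_id][mask_id] = 1.0       # absorbing: once masked, always masked
    return Q

Q = absorbing_transition_matrix(vocab_size=4, beta=0.1)
# every row is a valid categorical distribution
assert all(abs(sum(row) - 1.0) < 1e-9 for row in Q)
```

Composing these kernels over many steps drives every token toward [MASK], which is why the fully-noised state of a Masked Diffusion model is an all-mask sequence.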

Part 3: LLaDA — Building an 8B LLM with Masked Diffusion

We examine how LLaDA scaled Masked Diffusion to an 8B-parameter LLM. The meaning of variable masking ratio, why In-Context Learning is possible, and the structural advantages that avoid the Reversal Curse. We analyze head-to-head comparison results against LLaMA3 8B.

Key terms: Variable Masking, ELBO Training, Scaling Law, Reversal Curse, In-Context Learning
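The variable masking ratio can be sketched as a one-sample Monte Carlo estimate of the objective: draw t uniformly, mask each token independently with probability t, and score only the masked positions with a 1/t reweighting. The stand-in predictor below is hypothetical; this is an illustration of the loss's shape, not LLaDA's training code:

```python
import math
import random

random.seed(0)

def llada_style_loss(tokens, predict_logprob):
    """One Monte-Carlo sample of the masked-diffusion training loss:
    t ~ U(0, 1) sets the masking ratio, each token is masked independently
    with prob t, and the masked-token NLL is reweighted by 1/t."""
    t = random.uniform(0.01, 1.0)   # floor avoids 1/t blowing up (assumption)
    masked = [i for i in range(len(tokens)) if random.random() < t]
    if not masked:
        return 0.0
    nll = -sum(predict_logprob(tokens, i) for i in masked)
    return nll / (t * len(tokens))

# hypothetical stand-in predictor: uniform over a 10-token vocabulary
uniform = lambda toks, i: math.log(1.0 / 10.0)
loss = llada_style_loss(["a", "b", "c", "d"], uniform)
```

Because t ranges over (0, 1] rather than being fixed at BERT's ~15%, the model learns to denoise at every corruption level, which is what turns a masked language model into a generative diffusion model.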

Part 4: LLaDA 2.0 -> 2.1 — Breaking 100B with MoE + Token Editing

We cover LLaDA 2.0's MoE scaling and LLaDA 2.1's Token Editing innovation. The T2T (Token-to-Token) + M2T (Mask-to-Token) hybrid, the speed-quality tradeoff of S-Mode/Q-Mode, and the first large-scale RL Framework for Diffusion LLMs.

Key terms: MoE, CAP Decoding, Token Editing, T2T+M2T, S-Mode/Q-Mode, RL for dLLMs
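To give a feel for the T2T + M2T hybrid before Part 4, here is a deliberately speculative sketch of one decoding step: M2T fills masked slots, while T2T revisits already-committed tokens and overwrites them only when the new prediction is confident enough. The confidence threshold and the replace-only-if-different rule are our own assumptions, not the paper's exact algorithm:

```python
def token_editing_step(tokens, predictions, mask="[MASK]", edit_threshold=0.5):
    """One hybrid decoding step over a partially generated sequence.
    `predictions` maps position -> (token, confidence), as a real model's
    per-position output might be summarized (hypothetical interface)."""
    out = list(tokens)
    for i, (tok, conf) in predictions.items():
        if out[i] == mask:
            out[i] = tok                      # M2T: mask -> token
        elif tok != out[i] and conf >= edit_threshold:
            out[i] = tok                      # T2T: token -> token revision
    return out

draft = ["[MASK]", "cat", "sat", "[MASK]"]
preds = {0: ("the", 0.9), 1: ("dog", 0.3), 2: ("sat", 0.9), 3: ("mat", 0.8)}
edited = token_editing_step(draft, preds)
# -> ["the", "cat", "sat", "mat"]; "cat" survives because 0.3 < threshold
```

The key idea the sketch captures: unlike a pure mask-filling sampler, a step can repair tokens committed in earlier steps, which is what lets LLaDA 2.1 commit aggressively (fewer steps, higher TPS) without locking in early mistakes.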

Benchmark Scorecard

LLaDA 2.0-flash (100B, 6.1B active) vs major Autoregressive models:

| Benchmark | LLaDA 2.0-flash | Qwen3-30B-A3B |
|---|---|---|
| MMLU | 87.69 | - |
| MMLU-Pro | 73.36 | - |
| GPQA | 61.98 | - |
| HumanEval | 94.51 | - |
| GSM8K | 96.06 | - |
| MATH | 95.44 | - |
| AIME 2025 | 60.00 | - |
| IFEval-strict | 81.70 | - |
| Overall Average | 79.32 | 79.47 |

The notable takeaway: a 100B Diffusion model has reached parity with similarly-sized AR models. The conventional wisdom that "Diffusion LLMs cannot match AR" is being overturned.

LLaDA 2.1-flash speed:

| Task | Throughput (TPS) |
|---|---|
| HumanEval+ | 892 |
| BigCodeBench | 801 |
| LiveCodeBench | 663 |

AR vs Diffusion: Key Differences at a Glance

| Property | Autoregressive | Diffusion (LLaDA) |
|---|---|---|
| Generation direction | Left -> right, sequential | Simultaneous generation, then refinement |
| Context utilization | Unidirectional (left only) | Bidirectional |
| Reversal Curse | Vulnerable | Structurally avoided |
| Parallel generation | Impossible (1 token/step) | Possible (all tokens at once) |
| KV-Cache | Required (speed optimization) | Not needed (different optimization path) |
| Training objective | Next-token prediction | ELBO (likelihood lower bound) |
| Output revision | Impossible (once generated, final) | Possible (iterative denoising) |

What This Series Does Not Cover

  • Detailed architectures of image Diffusion models such as Stable Diffusion and DALL-E
  • Latest optimization techniques for Autoregressive models (e.g., Speculative Decoding)
  • Diffusion LLMs other than LLaDA (PLAID, Diffusion-LM, etc.) are mentioned only as background

References

  • Ho, Jain, Abbeel. "Denoising Diffusion Probabilistic Models." NeurIPS 2020.
  • Song et al. "Score-Based Generative Modeling through Stochastic Differential Equations." ICLR 2021.
  • Austin et al. "Structured Denoising Diffusion Models in Discrete State-Spaces." NeurIPS 2021.
  • Sahoo et al. "Simple and Effective Masked Diffusion Language Models." NeurIPS 2024.
  • Nie et al. "Large Language Diffusion Models." arXiv:2502.09992, 2025.
  • InclusionAI. "LLaDA 2.0 Technical Report." 2025.
  • InclusionAI. "LLaDA 2.1: Speeding Up Text Diffusion via Token Editing." arXiv:2602.08676, 2026.
