Can Diffusion Replace Autoregressive LLMs? The Complete LLaDA 2.X Guide

ChatGPT, Claude, Gemini: every large language model (LLM) we use today is built on a single principle, Autoregressive (AR) generation, which produces text left to right, one token at a time, by predicting the next word.
This approach works remarkably well, but it has structural limitations:
- Tokens must be produced one at a time in sequence, making parallel generation impossible
- Even if the model knows "A is B," it cannot infer "B is A" — the Reversal Curse
- Because it only looks left to right, it cannot leverage right-side context
But what if we built an LLM using Diffusion?
Just as Stable Diffusion and DALL-E demonstrated in image generation — starting from noise and progressively refining toward a clean result — what if we could apply the same Diffusion approach to text?
In February 2025, a research team from Renmin University of China and Ant Group published LLaDA (Large Language Diffusion with mAsking), turning this possibility into reality. Then in late 2025, Ant Group's InclusionAI scaled up to 100B parameters with LLaDA 2.0, and in February 2026, LLaDA 2.1 tackled the speed problem with an innovation called Token Editing.
This series fully dissects everything from the fundamentals of Diffusion to the latest techniques in LLaDA 2.1, across 4 parts.
Why Diffusion LLMs?
The core premise of Autoregressive models is simple: text is generated left to right.
P(x) = P(x_1) * P(x_2|x_1) * P(x_3|x_1,x_2) * ...
This assumption keeps training straightforward and scaling clean. But the structural weaknesses are equally clear.
Speed bottleneck: To generate 1000 tokens, the model must be called 1000 times sequentially. Each step re-processes the entire context. KV-cache mitigates this, but fundamentally, O(n) sequential calls are unavoidable.
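To make the sequential cost concrete, here is a minimal greedy-decoding loop in Python. It is a sketch only: `toy_ar_model` is a random stand-in for a real Transformer (an assumption for illustration), but the loop structure shows why n new tokens require n dependent model calls.

```python
import torch

VOCAB_SIZE = 1000

# Placeholder for a trained AR model: maps a token prefix to next-token logits.
# (Illustrative stand-in only; a real model would be a Transformer forward pass.)
def toy_ar_model(token_ids: torch.Tensor) -> torch.Tensor:
    return torch.randn(VOCAB_SIZE)

def greedy_decode(prompt_ids: list, max_new_tokens: int) -> list:
    ids = list(prompt_ids)
    # One full model call per new token: max_new_tokens strictly sequential steps.
    for _ in range(max_new_tokens):
        logits = toy_ar_model(torch.tensor(ids))
        next_id = int(torch.argmax(logits))
        ids.append(next_id)  # the new token becomes context for the next call
    return ids

print(greedy_decode([1, 2, 3], max_new_tokens=10))
```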
Unidirectional dependency: The model only sees left context. Even if it learns "Tom Cruise's mother is Mary Lee Pfeiffer," it struggles to answer "Who is Mary Lee Pfeiffer's son?" This is the Reversal Curse.
No revision: AR models cannot go back and fix tokens they have already generated. Even when they spot a mistake, they can only move forward.
Diffusion models approach all three of these differently (see the sketch after this list):
- All tokens are generated simultaneously and progressively refined (parallel generation)
- Context is leveraged bidirectionally (mitigating the Reversal Curse)
- Results are revised across multiple steps (iterative refinement)
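As a rough illustration of how a masked-diffusion decoder differs, here is a short Python sketch of the general recipe: start from an all-[MASK] sequence, predict every position in parallel, commit the most confident predictions, and leave the rest masked for the next step. `toy_denoiser`, `MASK_ID`, and the unmasking schedule are illustrative assumptions, not LLaDA's exact sampler; Parts 3 and 4 cover the real decoding strategies.

```python
import torch

MASK_ID = 0        # assumed id of the [MASK] token (illustration only)
VOCAB_SIZE = 1000
SEQ_LEN = 16

# Placeholder denoiser: returns logits for every position in one parallel pass.
def toy_denoiser(ids: torch.Tensor) -> torch.Tensor:
    return torch.randn(ids.shape[0], VOCAB_SIZE)

def diffusion_decode(num_steps: int = 4) -> torch.Tensor:
    ids = torch.full((SEQ_LEN,), MASK_ID)            # start fully masked
    for step in range(num_steps):
        masked = ids == MASK_ID
        if not masked.any():
            break
        probs = torch.softmax(toy_denoiser(ids), dim=-1)
        conf, pred = probs.max(dim=-1)
        # Commit a fraction of the remaining masks, highest confidence first;
        # everything else stays masked and gets re-predicted in the next step.
        num_to_commit = max(1, int(masked.sum()) // (num_steps - step))
        conf = torch.where(masked, conf, torch.tensor(-1.0))
        commit = conf.topk(num_to_commit).indices
        ids[commit] = pred[commit]
    return ids

print(diffusion_decode().tolist())
```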
LLaDA Series Timeline
Series Overview
Part 1: Diffusion Fundamentals — From DDPM to Score Matching
We cover the core principles of Diffusion, proven in image generation. The forward process (adding noise), the reverse process (removing noise), the ELBO training objective, and the connection to score matching. After reading this part, you will have a mathematical understanding of why Diffusion works.
Key terms: DDPM, Forward/Reverse Process, ELBO, Score Function, SDE
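As a preview of the notation Part 1 builds on, these are the standard DDPM equations from Ho et al. (2020): the Gaussian forward step, its closed-form marginal, and the simplified ELBO-derived objective that trains a network to predict the injected noise.

```latex
% Forward (noising) step: a fixed Markov chain of small Gaussian corruptions
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)

% Closed-form marginal, with \alpha_t = 1-\beta_t and \bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right)

% Simplified training objective derived from the ELBO: predict the added noise
\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\!\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2\right]
```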
Part 2: Discrete Diffusion — How Do You Add Noise to Text?
We explore how to apply continuous-space Diffusion to discrete tokens. D3PM's Transition Matrix, the point where Absorbing State meets BERT's [MASK], and MDLM's simplifications. This part explains how the gap between images and text is bridged.
Key terms: D3PM, Transition Matrix, Absorbing State, MDLM, Masked Diffusion
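To preview the "Absorbing State meets [MASK]" idea, here is a minimal Python sketch of the absorbing-state forward process: at noise level t, each token has independently been replaced by [MASK] with probability t, and a masked token never transitions back. `MASK_ID` and the toy token ids are assumptions for illustration.

```python
import torch

MASK_ID = 0   # assumed id of the absorbing [MASK] state (illustration only)

def absorbing_forward(x0: torch.Tensor, t: float) -> torch.Tensor:
    """Absorbing-state forward process: at noise level t in [0, 1], each token
    has independently been replaced by [MASK] with probability t. Once masked,
    a token stays masked, which is what makes the state 'absorbing'."""
    corrupt = torch.rand(x0.shape) < t
    return torch.where(corrupt, torch.full_like(x0, MASK_ID), x0)

tokens = torch.tensor([17, 42, 5, 99, 23, 7])
for t in (0.25, 0.5, 0.9):
    print(f"t={t}: {absorbing_forward(tokens, t).tolist()}")
```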
Part 3: LLaDA — Building an 8B LLM with Masked Diffusion
We examine how LLaDA scaled Masked Diffusion to an 8B-parameter LLM. The meaning of variable masking ratio, why In-Context Learning is possible, and the structural advantages that avoid the Reversal Curse. We analyze head-to-head comparison results against LLaMA3 8B.
Key terms: Variable Masking, ELBO Training, Scaling Law, Reversal Curse, In-Context Learning
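As a preview of the training objective discussed in Part 3, here is a simplified sketch of a masked-diffusion training step: sample a masking ratio t, mask each token with probability t, and compute cross-entropy only on the masked positions, reweighted by 1/t so the loss acts as an ELBO-style bound. `toy_mask_predictor` and all constants are illustrative assumptions, not LLaDA's actual implementation.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0
VOCAB_SIZE = 1000

# Placeholder for the bidirectional mask predictor (in LLaDA, a full-attention Transformer).
def toy_mask_predictor(xt: torch.Tensor) -> torch.Tensor:
    return torch.randn(xt.shape[0], VOCAB_SIZE)

def masked_diffusion_loss(x0: torch.Tensor) -> torch.Tensor:
    # 1) Sample a masking ratio t ~ U(0, 1]  (the "variable masking ratio").
    t = torch.rand(()).clamp_min(1e-3)
    # 2) Mask each token independently with probability t.
    masked = torch.rand(x0.shape) < t
    xt = torch.where(masked, torch.full_like(x0, MASK_ID), x0)
    # 3) Cross-entropy on masked positions only, reweighted by 1/t so that the
    #    expected loss upper-bounds negative log-likelihood (an ELBO).
    ce = F.cross_entropy(toy_mask_predictor(xt), x0, reduction="none")
    return (ce * masked.float()).sum() / (t * x0.numel())

x0 = torch.randint(1, VOCAB_SIZE, (32,))
print(masked_diffusion_loss(x0))
```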
Part 4: LLaDA 2.0 -> 2.1 — Breaking 100B with MoE + Token Editing
We cover LLaDA 2.0's MoE scaling and LLaDA 2.1's Token Editing innovation. The T2T (Token-to-Token) + M2T (Mask-to-Token) hybrid, the speed-quality tradeoff of S-Mode/Q-Mode, and the first large-scale RL Framework for Diffusion LLMs.
Key terms: MoE, CAP Decoding, Token Editing, T2T+M2T, S-Mode/Q-Mode, RL for dLLMs
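To give a rough feel for the T2T + M2T hybrid described above, here is a purely illustrative Python sketch of one refinement step that both fills masked positions (Mask-to-Token) and re-predicts low-confidence committed tokens (Token-to-Token). This is a conceptual reading of the idea as summarized in this overview, not the LLaDA 2.1 algorithm; `toy_denoiser`, `unmask_k`, and `edit_k` are all assumptions.

```python
import torch

MASK_ID = 0
VOCAB_SIZE = 1000

def toy_denoiser(ids: torch.Tensor) -> torch.Tensor:
    # Placeholder: logits for every position in one parallel pass.
    return torch.randn(ids.shape[0], VOCAB_SIZE)

def hybrid_step(ids: torch.Tensor, unmask_k: int = 4, edit_k: int = 2) -> torch.Tensor:
    """One illustrative refinement step combining the two moves named in Part 4."""
    probs = torch.softmax(toy_denoiser(ids), dim=-1)
    conf, pred = probs.max(dim=-1)
    masked = ids == MASK_ID
    out = ids.clone()

    # M2T: commit the most confident predictions at still-masked positions.
    if masked.any():
        k = min(unmask_k, int(masked.sum()))
        idx = torch.where(masked, conf, torch.tensor(-1.0)).topk(k).indices
        out[idx] = pred[idx]

    # T2T ("Token Editing"): re-predict the least confident committed tokens.
    if (~masked).any():
        k = min(edit_k, int((~masked).sum()))
        idx = torch.where(~masked, -conf, torch.tensor(-float("inf"))).topk(k).indices
        out[idx] = pred[idx]
    return out

seq = torch.full((16,), MASK_ID)
for _ in range(6):
    seq = hybrid_step(seq)
print(seq.tolist())
```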
Benchmark Scorecard
LLaDA 2.0-flash (100B, 6.1B active) vs major Autoregressive models:
The notable takeaway: a 100B-parameter Diffusion model has reached parity with similarly sized AR models. The conventional wisdom that "Diffusion LLMs cannot match AR" is being overturned.
LLaDA 2.1-flash speed:
AR vs Diffusion: Key Differences at a Glance
- Generation order: AR produces tokens left to right, one per step; Diffusion refines all positions in parallel
- Model calls for n tokens: AR needs O(n) sequential forward passes; Diffusion uses a fixed number of denoising steps
- Context: AR conditions only on the left context; Diffusion uses bidirectional context
- Revision: AR cannot edit tokens it has already emitted; Diffusion revises across refinement steps
What This Series Does Not Cover
- Detailed architectures of image Diffusion models such as Stable Diffusion and DALL-E
- Latest optimization techniques for Autoregressive models (e.g., Speculative Decoding)
- Other Diffusion LLMs (Diffusion-LM, PLAID, etc.), which are mentioned only as background
References
- Ho, Jain, Abbeel. "Denoising Diffusion Probabilistic Models." NeurIPS 2020.
- Song et al. "Score-Based Generative Modeling through Stochastic Differential Equations." ICLR 2021.
- Austin et al. "Structured Denoising Diffusion Models in Discrete State-Spaces." NeurIPS 2021.
- Sahoo et al. "Simple and Effective Masked Diffusion Language Models." NeurIPS 2024.
- Nie et al. "Large Language Diffusion Models." arXiv:2502.09992, 2025.
- InclusionAI. "LLaDA 2.0 Technical Report." 2025.
- InclusionAI. "LLaDA 2.1: Speeding Up Text Diffusion via Token Editing." arXiv:2602.08676, 2026.