
Can Diffusion Replace Autoregressive LLMs? The Complete LLaDA 2.X Guide

From DDPM to LLaDA 2.1 -- everything about diffusion-based LLMs. Masked Diffusion, Token Editing, and MoE scaling dissected across 4 parts.


ChatGPT, Claude, Gemini — every large language model (LLM) we use today is built on a single principle: autoregressive (AR) generation, producing text left to right, one token at a time, by predicting the next word.

This approach works remarkably well. But it has structural limitations.

  • Tokens must be produced one at a time in sequence, making parallel generation impossible
  • Even if the model learns "A is B," it often fails to infer "B is A" (the Reversal Curse)
  • Because it only looks left to right, it cannot leverage right-side context

But what if we built an LLM using Diffusion?

Just as Stable Diffusion and DALL-E demonstrated in image generation — starting from noise and progressively refining toward a clean result — what if we could apply the same Diffusion approach to text?

In February 2025, a research team from HKU/PKU published LLaDA (Large Language Diffusion with mAsking), turning this possibility into reality. Then in late 2025, Ant Group's InclusionAI scaled up to 100B parameters with LLaDA 2.0, and in February 2026, LLaDA 2.1 solved the speed problem with an innovation called Token Editing.

This series fully dissects everything from the fundamentals of Diffusion to the latest techniques in LLaDA 2.1, across 4 parts.

Why Diffusion LLMs?

The core premise of Autoregressive models is simple: text is generated left to right.

P(x) = P(x_1) * P(x_2|x_1) * P(x_3|x_1,x_2) * ...

This assumption keeps training straightforward and scaling clean. But the structural weaknesses are equally clear.

Speed bottleneck: To generate 1000 tokens, the model must be called 1000 times sequentially. Each step re-processes the entire context. KV-cache mitigates this, but fundamentally, O(n) sequential calls are unavoidable.
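To make the bottleneck concrete, here is a minimal sketch of the AR decoding loop. The stand-in "model" is a toy function, not a real LLM; the point is only that n new tokens require n strictly sequential forward passes:

```python
def toy_next_token(context):
    """Stand-in for a real LLM forward pass: picks the next token
    deterministically from the context length (illustration only)."""
    vocab = ["the", "cat", "sat", "on", "mat", "<eos>"]
    return vocab[len(context) % len(vocab)]

def generate_autoregressive(prompt, max_new_tokens):
    """Left-to-right decoding: one model call per token, strictly sequential.
    Each call re-reads the whole context (KV-cache only amortizes this)."""
    tokens = list(prompt)
    calls = 0
    for _ in range(max_new_tokens):
        nxt = toy_next_token(tokens)  # cannot be issued in parallel
        calls += 1
        tokens.append(nxt)
        if nxt == "<eos>":
            break
    return tokens, calls

out, calls = generate_autoregressive(["hello"], 10)
```

No matter how fast each forward pass is, the calls form a chain: call k+1 needs the output of call k, so wall-clock latency scales with sequence length.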

Unidirectional dependency: The model only sees left context. Even if it learns "Tom Cruise's mother is Mary Lee Pfeiffer," it struggles to answer "Who is Mary Lee Pfeiffer's son?" This is the Reversal Curse.

No revision: AR models cannot go back and fix tokens they have already generated. Even when they spot a mistake, they can only move forward.

Diffusion models approach all three of these differently.

  • All tokens are generated simultaneously and progressively refined (parallel generation)
  • Context is leveraged bidirectionally (mitigating the Reversal Curse)
  • Results are revised across multiple steps (iterative refinement)
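The three bullets above can be sketched as one decoding loop. This is a toy illustration of confidence-based parallel unmasking, not any model's actual sampler: the denoiser is a deterministic stand-in and the unmasking schedule is an assumption:

```python
import random

MASK = "[MASK]"
random.seed(0)

def toy_denoiser(tokens):
    """Stand-in for a bidirectional Transformer: for every masked position,
    return a (token, confidence) guess. A real model predicts each position
    from both left and right context simultaneously."""
    vocab = ["the", "cat", "sat", "on", "a", "mat"]
    return {i: (vocab[i % len(vocab)], random.random())
            for i, t in enumerate(tokens) if t == MASK}

def diffusion_decode(length, steps):
    """Start fully masked; each step predicts all masks in parallel and
    commits the highest-confidence fraction (iterative refinement)."""
    tokens = [MASK] * length
    for step in range(steps):
        preds = toy_denoiser(tokens)
        if not preds:
            break
        k = max(1, len(preds) // (steps - step))  # commit this many now
        best = sorted(preds.items(), key=lambda kv: -kv[1][1])[:k]
        for i, (tok, _) in best:
            tokens[i] = tok
    return tokens

result = diffusion_decode(length=6, steps=3)
```

Here 6 tokens are produced in 3 model calls instead of 6, and every prediction in a step is free to condition on tokens to its right.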

LLaDA Series Timeline

| Date | Model | Key Contribution |
|---|---|---|
| 2025.02 | LLaDA 8B | First large-scale Diffusion LLM. Masked Diffusion + Transformer |
| 2025.11 | LLaDA 2.0-mini (16B) | Introduces MoE architecture. 1.4B active parameters |
| 2025.11 | LLaDA 2.0-flash (100B) | First 100B Diffusion LLM. 6.1B active parameters |
| 2026.02 | LLaDA 2.1-mini (16B) | Token Editing (T2T + M2T). S-Mode / Q-Mode |
| 2026.02 | LLaDA 2.1-flash (100B) | RL Framework + Token Editing. 892 TPS achieved |

Series Overview

Part 1: Diffusion Fundamentals — From DDPM to Score Matching

We cover the core principles of Diffusion, proven in image generation. The forward process (adding noise), the reverse process (removing noise), the ELBO training objective, and the connection to score matching. After reading this part, you will have a mathematical understanding of why Diffusion works.

Key terms: DDPM, Forward/Reverse Process, ELBO, Score Function, SDE
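As a preview of the notation Part 1 builds on, the standard DDPM forward process adds Gaussian noise step by step, and admits a closed-form marginal that lets you jump to any noise level t directly:

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\bigl(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\bigr)
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\bigl(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\bigr),
\quad \bar\alpha_t = \prod_{s=1}^{t} \alpha_s,\ \ \alpha_s = 1-\beta_s
```

The reverse process learns to undo these steps, and the ELBO is what makes that learnable by maximum likelihood.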

Part 2: Discrete Diffusion — How Do You Add Noise to Text?

We explore how to apply continuous-space Diffusion to discrete tokens. D3PM's Transition Matrix, the point where Absorbing State meets BERT's [MASK], and MDLM's simplifications. This part explains how the gap between images and text is bridged.

Key terms: D3PM, Transition Matrix, Absorbing State, MDLM, Masked Diffusion
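The absorbing-state kernel at the heart of this part fits in a few lines. A hedged sketch (vocabulary size and noise rate are illustrative, and real implementations never materialize this matrix): each ordinary token stays put with probability 1 - beta or jumps to [MASK] with probability beta, and [MASK] never leaves — exactly BERT-style masking applied gradually:

```python
def absorbing_transition_matrix(vocab_size, beta):
    """D3PM absorbing-state kernel Q_t over vocab_size + 1 states, where the
    last index is [MASK]. Row i is the distribution of the next state given
    current state i."""
    mask_id = vocab_size            # append [MASK] as the final row/column
    size = vocab_size + 1
    Q = [[0.0] * size for _ in range(size)]
    for i in range(vocab_size):
        Q[i][i] = 1.0 - beta        # token survives this step
        Q[i][mask_id] = beta        # token is absorbed into [MASK]
    Q[mask_id][mask_id] = 1.0       # absorbing: once masked, always masked
    return Q

Q = absorbing_transition_matrix(vocab_size=4, beta=0.1)
# every row is a valid categorical distribution
assert all(abs(sum(row) - 1.0) < 1e-9 for row in Q)
```

Composing these kernels over many steps drives every token toward [MASK], which is why the fully-noised state of a Masked Diffusion model is an all-mask sequence.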

Part 3: LLaDA — Building an 8B LLM with Masked Diffusion

We examine how LLaDA scaled Masked Diffusion to an 8B-parameter LLM. The meaning of variable masking ratio, why In-Context Learning is possible, and the structural advantages that avoid the Reversal Curse. We analyze head-to-head comparison results against LLaMA3 8B.

Key terms: Variable Masking, ELBO Training, Scaling Law, Reversal Curse, In-Context Learning
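The variable masking ratio can be sketched as a one-sample Monte Carlo estimate of the objective: draw t uniformly, mask each token independently with probability t, and score only the masked positions with a 1/t reweighting. The stand-in predictor below is hypothetical; this is an illustration of the loss's shape, not LLaDA's training code:

```python
import math
import random

random.seed(0)

def llada_style_loss(tokens, predict_logprob):
    """One Monte-Carlo sample of the masked-diffusion training loss:
    t ~ U(0, 1) sets the masking ratio, each token is masked independently
    with prob t, and the masked-token NLL is reweighted by 1/t."""
    t = random.uniform(0.01, 1.0)   # floor avoids 1/t blowing up (assumption)
    masked = [i for i in range(len(tokens)) if random.random() < t]
    if not masked:
        return 0.0
    nll = -sum(predict_logprob(tokens, i) for i in masked)
    return nll / (t * len(tokens))

# hypothetical stand-in predictor: uniform over a 10-token vocabulary
uniform = lambda toks, i: math.log(1.0 / 10.0)
loss = llada_style_loss(["a", "b", "c", "d"], uniform)
```

Because t ranges over (0, 1] rather than being fixed at BERT's ~15%, the model learns to denoise at every corruption level, which is what turns a masked language model into a generative diffusion model.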

Part 4: LLaDA 2.0 -> 2.1 — Breaking 100B with MoE + Token Editing

We cover LLaDA 2.0's MoE scaling and LLaDA 2.1's Token Editing innovation. The T2T (Token-to-Token) + M2T (Mask-to-Token) hybrid, the speed-quality tradeoff of S-Mode/Q-Mode, and the first large-scale RL Framework for Diffusion LLMs.

Key terms: MoE, CAP Decoding, Token Editing, T2T+M2T, S-Mode/Q-Mode, RL for dLLMs
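To give a feel for the T2T + M2T hybrid before Part 4, here is a deliberately speculative sketch of one decoding step: M2T fills masked slots, while T2T revisits already-committed tokens and overwrites them only when the new prediction is confident enough. The confidence threshold and the replace-only-if-different rule are our own assumptions, not the paper's exact algorithm:

```python
def token_editing_step(tokens, predictions, mask="[MASK]", edit_threshold=0.5):
    """One hybrid decoding step over a partially generated sequence.
    `predictions` maps position -> (token, confidence), as a real model's
    per-position output might be summarized (hypothetical interface)."""
    out = list(tokens)
    for i, (tok, conf) in predictions.items():
        if out[i] == mask:
            out[i] = tok                      # M2T: mask -> token
        elif tok != out[i] and conf >= edit_threshold:
            out[i] = tok                      # T2T: token -> token revision
    return out

draft = ["[MASK]", "cat", "sat", "[MASK]"]
preds = {0: ("the", 0.9), 1: ("dog", 0.3), 2: ("sat", 0.9), 3: ("mat", 0.8)}
edited = token_editing_step(draft, preds)
# -> ["the", "cat", "sat", "mat"]; "cat" survives because 0.3 < threshold
```

The key idea the sketch captures: unlike a pure mask-filling sampler, a step can repair tokens committed in earlier steps, which is what lets LLaDA 2.1 commit aggressively (fewer steps, higher TPS) without locking in early mistakes.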

Benchmark Scorecard

LLaDA 2.0-flash (100B, 6.1B active) vs major Autoregressive models:

| Benchmark | LLaDA 2.0-flash | Qwen3-30B-A3B |
|---|---|---|
| MMLU | 87.69 | - |
| MMLU-Pro | 73.36 | - |
| GPQA | 61.98 | - |
| HumanEval | 94.51 | - |
| GSM8K | 96.06 | - |
| MATH | 95.44 | - |
| AIME 2025 | 60.00 | - |
| IFEval-strict | 81.70 | - |
| Overall Average | 79.32 | 79.47 |

The notable takeaway: a 100B Diffusion model has reached parity with similarly-sized AR models. The conventional wisdom that "Diffusion LLMs cannot match AR" is being overturned.

LLaDA 2.1-flash speed:

| Task | Throughput (TPS) |
|---|---|
| HumanEval+ | 892 |
| BigCodeBench | 801 |
| LiveCodeBench | 663 |

AR vs Diffusion: Key Differences at a Glance

| Property | Autoregressive | Diffusion (LLaDA) |
|---|---|---|
| Generation direction | Left -> right, sequential | Simultaneous generation, then refinement |
| Context utilization | Unidirectional (left only) | Bidirectional |
| Reversal Curse | Vulnerable | Structurally avoided |
| Parallel generation | Impossible (1 token/step) | Possible (all tokens at once) |
| KV-Cache | Required (speed optimization) | Not needed (different optimization path) |
| Training objective | Next-token prediction | ELBO (likelihood lower bound) |
| Output revision | Impossible (once generated, final) | Possible (iterative denoising) |

What This Series Does Not Cover

  • Detailed architectures of image Diffusion models such as Stable Diffusion and DALL-E
  • Latest optimization techniques for Autoregressive models (e.g., Speculative Decoding)
  • Diffusion LLMs other than LLaDA (PLAID, Diffusion-LM, etc.) are mentioned only as background

References

  • Ho, Jain, Abbeel. "Denoising Diffusion Probabilistic Models." NeurIPS 2020.
  • Song et al. "Score-Based Generative Modeling through Stochastic Differential Equations." ICLR 2021.
  • Austin et al. "Structured Denoising Diffusion Models in Discrete State-Spaces." NeurIPS 2021.
  • Sahoo et al. "Simple and Effective Masked Diffusion Language Models." NeurIPS 2024.
  • Nie et al. "Large Language Diffusion Models." arXiv:2502.09992, 2025.
  • InclusionAI. "LLaDA 2.0 Technical Report." 2025.
  • InclusionAI. "LLaDA 2.1: Speeding Up Text Diffusion via Token Editing." arXiv:2602.08676, 2026.
