Diffusion LLM Part 4: LLaDA 2.0 -> 2.1 -- Breaking 100B with MoE + Token Editing
MoE scaling, Token Editing (T2T+M2T), S-Mode/Q-Mode, RL Framework -- how LLaDA 2.X makes diffusion LLMs practical.

In Part 3, LLaDA showed that diffusion LLMs are viable by scaling Masked Diffusion to the 8B-parameter range. But practical challenges remained: inference speed lagged far behind AR models, and alignment training such as RLHF was missing.
In November 2025, Ant Group's InclusionAI began closing this gap with LLaDA 2.0. Then in February 2026, LLaDA 2.1 redefined the speed-quality tradeoff with an innovation called Token Editing.
This post covers the scaling journey from 8B to 100B, the adoption of MoE architecture, and how Token Editing works under the hood.
LLaDA 2.0: The Leap to 100B
LLaDA 2.0 shipped two models:
| Model | Total Params | Active Params | Layers | Heads | Context | Vocab |
|---|---|---|---|---|---|---|
| LLaDA 2.0-mini | 16B | 1.4B | 20 | 16 | 32,768 | 157,184 |
| LLaDA 2.0-flash | 100B | 6.1B | 32 | 32 | 32,768 | 157,184 |
The key change: introducing MoE (Mixture of Experts).
The original LLaDA 8B was a dense model: every parameter is used for every input. LLaDA 2.0 adopts MoE, dramatically increasing total parameters while a router activates only a small subset of experts per token during inference.
LLaDA 2.0-flash activates just 6.1B of its 100B parameters. This is the same strategy used by AR MoE models like Mixtral and DeepSeek: "Keep the model's total knowledge broad, but keep inference costs low."
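To make the active-vs-total distinction concrete, here is a minimal sketch of top-k expert routing, the core mechanism behind MoE layers. This is an illustrative toy (the function name, shapes, and expert MLPs are assumptions for the example, not LLaDA's actual implementation): a router scores all experts, but only the top k run per token, so compute scales with active parameters rather than total parameters.

```python
import numpy as np

def topk_moe_layer(x, expert_weights, gate_weights, k=2):
    """Toy top-k MoE routing sketch (hypothetical shapes, not LLaDA's
    actual code). Only k of num_experts experts run for this token."""
    # Router: one logit per expert for this token
    logits = x @ gate_weights                    # shape: (num_experts,)
    active = np.argsort(logits)[-k:]             # indices of the k best experts
    # Softmax over the selected experts only
    probs = np.exp(logits[active] - logits[active].max())
    probs /= probs.sum()
    # Weighted sum of the selected experts' outputs (tanh as a stand-in MLP)
    out = sum(p * np.tanh(x @ expert_weights[e]) for p, e in zip(probs, active))
    return out, active

rng = np.random.default_rng(0)
d, num_experts, k = 8, 16, 2
x = rng.standard_normal(d)                       # one token's hidden state
experts = rng.standard_normal((num_experts, d, d))
gate = rng.standard_normal((d, num_experts))
out, active = topk_moe_layer(x, experts, gate, k=k)
print(f"active experts: {sorted(active.tolist())} ({k}/{num_experts})")
```

With 2 of 16 experts firing, only ~1/8 of the expert parameters touch each token; the same logic, at scale, is how flash keeps 6.1B of 100B parameters active.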
Related Posts

InternVL-U: Understanding + Generation + Editing in One 4B Model -- A New Standard for Unified Multimodal AI
Shanghai AI Lab's InternVL-U. A single 4B parameter model handles image understanding, generation, editing, and reasoning-based generation. Decoupled visual representations outperform 14B BAGEL on GenEval and DPG-Bench.

Hybrid Mamba-Transformer MoE: Three Teams, One Architecture -- The 2026 LLM Convergence
NVIDIA Nemotron 3 Nano, Qwen 3.5, and Mamba-3 independently converge on 75% linear layers + 25% attention + MoE. 88% KV-cache reduction, O(n) complexity for long-context processing.

Spectrum: 3-5x Diffusion Speedup Without Any Training -- The Power of Chebyshev Polynomials
CVPR 2026 paper from Stanford/ByteDance. Chebyshev polynomial feature forecasting achieves 4.79x speedup on FLUX.1, 4.56x on HunyuanVideo. Training-free, instantly applicable to any model.