Diffusion LLM Part 4: LLaDA 2.0 -> 2.1 -- Breaking 100B with MoE + Token Editing
MoE scaling, Token Editing (T2T+M2T), S-Mode/Q-Mode, RL Framework -- how LLaDA 2.X makes diffusion LLMs practical.

In Part 3, LLaDA showed that diffusion LLMs are viable by scaling Masked Diffusion to the 8B-parameter range. But practical challenges remained: inference speed lagged far behind AR models, and alignment training such as RLHF was missing.
In November 2025, Ant Group's InclusionAI began closing this gap with LLaDA 2.0. Then in February 2026, LLaDA 2.1 redefined the speed-quality tradeoff with an innovation called Token Editing.
This post covers the scaling journey from 8B to 100B, the adoption of MoE architecture, and how Token Editing works under the hood.
LLaDA 2.0: The Leap to 100B
LLaDA 2.0 shipped two models:
| Model | Total Params | Active Params | Layers | Heads | Context | Vocab |
|---|---|---|---|---|---|---|
| LLaDA 2.0-mini | 16B | 1.4B | 20 | 16 | 32,768 | 157,184 |
| LLaDA 2.0-flash | 100B | 6.1B | 32 | 32 | 32,768 | 157,184 |
The key change: introducing MoE (Mixture of Experts).
The original LLaDA 8B was a dense model: every parameter is used for every input. LLaDA 2.0 adopts MoE, dramatically increasing total parameters while a router activates only a small subset of experts per token during inference.
LLaDA 2.0-flash activates just 6.1B of its 100B parameters. This is the same strategy used by AR MoE models like Mixtral and DeepSeek: "Keep the model's total knowledge broad, but keep inference costs low."
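To make the active-vs-total distinction concrete, here is a minimal sketch of top-k expert routing, the core mechanism behind MoE layers. This is an illustrative toy (the function name, shapes, and expert MLPs are assumptions for the example, not LLaDA's actual implementation): a router scores all experts, but only the top k run per token, so compute scales with active parameters rather than total parameters.

```python
import numpy as np

def topk_moe_layer(x, expert_weights, gate_weights, k=2):
    """Toy top-k MoE routing sketch (hypothetical shapes, not LLaDA's
    actual code). Only k of num_experts experts run for this token."""
    # Router: one logit per expert for this token
    logits = x @ gate_weights                    # shape: (num_experts,)
    active = np.argsort(logits)[-k:]             # indices of the k best experts
    # Softmax over the selected experts only
    probs = np.exp(logits[active] - logits[active].max())
    probs /= probs.sum()
    # Weighted sum of the selected experts' outputs (tanh as a stand-in MLP)
    out = sum(p * np.tanh(x @ expert_weights[e]) for p, e in zip(probs, active))
    return out, active

rng = np.random.default_rng(0)
d, num_experts, k = 8, 16, 2
x = rng.standard_normal(d)                       # one token's hidden state
experts = rng.standard_normal((num_experts, d, d))
gate = rng.standard_normal((d, num_experts))
out, active = topk_moe_layer(x, experts, gate, k=k)
print(f"active experts: {sorted(active.tolist())} ({k}/{num_experts})")
```

With 2 of 16 experts firing, only ~1/8 of the expert parameters touch each token; the same logic, at scale, is how flash keeps 6.1B of 100B parameters active.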
Related Posts

InternVL-U: Understanding + Generation + Editing in One 4B Model -- A New Standard for Unified Multimodal AI
Shanghai AI Lab's InternVL-U. A single 4B parameter model handles image understanding, generation, editing, and reasoning-based generation. Decoupled visual representations outperform 14B BAGEL on GenEval and DPG-Bench.

Hybrid Mamba-Transformer MoE: Three Teams, One Architecture -- The 2026 LLM Convergence
NVIDIA Nemotron 3 Nano, Qwen 3.5, and Mamba-3 independently converge on 75% linear layers + 25% attention + MoE. 88% KV-cache reduction, O(n) complexity for long-context processing.

Spectrum: 3-5x Diffusion Speedup Without Any Training -- The Power of Chebyshev Polynomials
CVPR 2026 paper from Stanford/ByteDance. Chebyshev polynomial feature forecasting achieves 4.79x speedup on FLUX.1, 4.56x on HunyuanVideo. Training-free, instantly applicable to any model.