Diffusion LLM Part 4: LLaDA 2.0 -> 2.1 -- Breaking 100B with MoE + Token Editing

In Part 3, LLaDA proved that "Diffusion LLMs are viable" by scaling Masked Diffusion to the 8B parameter range. But practical challenges remained: inference speed was far behind AR models, and alignment training like RLHF was absent.
In November 2025, Ant Group's InclusionAI began closing this gap with LLaDA 2.0. Then in February 2026, LLaDA 2.1 redefined the speed-quality tradeoff with an innovation called Token Editing.
This post covers the scaling journey from 8B to 100B, the adoption of MoE architecture, and how Token Editing works under the hood.
LLaDA 2.0: The Leap to 100B
LLaDA 2.0 shipped two models: LLaDA 2.0-flash (100B total parameters) and LLaDA 2.0-mini (16B total parameters).
The key change: introducing MoE (Mixture of Experts).
The original LLaDA 8B was a dense model -- every parameter activates for every input. LLaDA 2.0 adopts MoE, dramatically increasing total parameters while only activating a small subset of experts during inference.
LLaDA 2.0-flash activates just 6.1B of its 100B parameters. This is the same strategy used by AR MoE models like Mixtral and DeepSeek: "Keep the model's total knowledge broad, but keep inference costs low."
MoE + Diffusion: Why It Works So Well
There is a reason MoE is a particularly good fit for Diffusion models.
MoE in AR models: A router selects appropriate experts for each token. Since tokens are generated sequentially, the router picks experts for one token at each step.
MoE in Diffusion models: All tokens in the entire sequence are processed simultaneously. In a single denoising step, thousands of tokens are distributed across multiple experts at once, so expert utilization is naturally high.
AR models need large batch sizes to improve expert utilization, but Diffusion models are efficient even on a single input -- the diverse tokens within a sequence naturally activate different experts.
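To make the utilization argument concrete, here is a toy top-2 routing sketch in PyTorch. Every size and name here is illustrative (this is not LLaDA's actual router); the point is that a single denoising step routes the entire sequence at once:

import torch

n_experts, d_model, top_k = 8, 64, 2
seq_len = 4096                      # a diffusion step routes every position at once
router = torch.nn.Linear(d_model, n_experts)

x = torch.randn(seq_len, d_model)   # hidden states for ALL positions
logits = router(x)                  # (seq_len, n_experts)
weights, chosen = torch.topk(logits.softmax(dim=-1), top_k, dim=-1)

# Expert utilization: fraction of experts that received at least one token.
# An AR decode step routes only one token per sequence, so covering all
# experts requires a large batch; here one sequence already does it.
used = chosen.unique().numel() / n_experts
print(f"{used:.0%} of experts receive tokens in a single step")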
Training Scale
LLaDA 2.0's training scale:
- Training data: ~20 trillion tokens
- Training framework: dFactory (built on FSDP2)
- Continual training: Built on top of the Ling2.0 series
Compared to the original LLaDA's 2.3T tokens, this represents roughly 8.7x more data.
LLaDA 2.0 Benchmarks
LLaDA 2.0-flash (100B) key results:
For comparison: Qwen3-30B-A3B-Instruct (an AR MoE model) posts an overall average of 79.47, nearly identical to LLaDA 2.0-flash's.
A 100B Diffusion model matching an AR MoE model of similar size is strong evidence for the practicality of Diffusion LLMs.
LLaDA 2.0-mini (16B) key results:
LLaDA 2.1: The Token Editing Breakthrough
LLaDA 2.0 matched AR in quality, but the speed problem remained. The conventional Diffusion approach:
- Start from a sequence where every position is [MASK]
- Gradually restore tokens over multiple denoising steps
- Reprocess the entire sequence at each step
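For reference, here is a minimal sketch of that conventional M2T-only loop, assuming a hypothetical model that maps a token sequence to per-position logits and a MASK_ID placeholder token:

import torch

def masked_diffusion_decode(model, prompt_ids, gen_length=64, steps=16, MASK_ID=0):
    # Start with the prompt followed by an all-[MASK] generation region
    x = torch.cat([prompt_ids, torch.full((gen_length,), MASK_ID)])
    per_step = gen_length // steps
    for _ in range(steps):
        logits = model(x)                 # re-process the ENTIRE sequence
        conf, pred = logits.softmax(-1).max(-1)
        conf[x != MASK_ID] = -1.0         # only masked slots are candidates
        idx = conf.topk(per_step).indices # unmask the most confident positions
        x[idx] = pred[idx]                # committed tokens are never revisited
    return x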
LLaDA 2.1 fundamentally improves this process with Token Editing.
Core idea: Combine Token-to-Token (T2T) editing with the existing Mask-to-Token (M2T) approach.
M2T (existing): [MASK] -> actual token. Start from blanks and fill in tokens.
T2T (new): Already-generated token -> better token. Detect and fix tokens from previous steps that may be wrong.
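Below is a hedged sketch of what a single joint M2T + T2T step could look like. This is one plausible reading of the two thresholds that appear in the S-Mode/Q-Mode settings below (threshold, editing_threshold), not the paper's exact update rule:

import torch

# Hypothetical single denoising step combining M2T and T2T.
# logits: model output (seq_len, vocab); x: current sequence (long tensor)
def joint_step(logits, x, mask_id, threshold=0.7, editing_threshold=0.5):
    conf, pred = logits.softmax(-1).max(-1)   # confidence + argmax per position
    masked = x == mask_id
    # M2T: commit masked positions whose prediction clears the threshold
    commit = masked & (conf >= threshold)
    # T2T: rewrite already-committed tokens the model now disagrees with,
    # provided the replacement clears the editing threshold
    edit = ~masked & (pred != x) & (conf >= editing_threshold)
    x = x.clone()
    x[commit | edit] = pred[commit | edit]
    return x

These two thresholds are exactly the knobs that S-Mode and Q-Mode tune.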
This is the decisive difference from AR models. An AR model cannot take back a token once generated: if it emits "The capital of France is Berlin," the wrong "Berlin" is already committed, and the mistake propagates through all subsequent generation -- the error propagation problem.
LLaDA 2.1's Token Editing solves this fundamentally. Even for tokens that have already been generated, if the model judges in the next denoising step that "this token does not fit the context," it replaces it with a different token. Generation is editing, and editing is generation. To use a human writing analogy: AR is writing left-to-right with a pen that can never be erased, while LLaDA 2.1 is writing with a pencil and eraser simultaneously.
S-Mode and Q-Mode
Two modes for controlling the intensity of Token Editing:
S-Mode (Speed Mode):
- threshold: 0.5 (aggressive M2T progression)
- editing_threshold: 0.0 (T2T editing compensates for quality)
- Speed-first. Generates quickly with fewer denoising steps
Q-Mode (Quality Mode):
- threshold: 0.7 (conservative M2T progression)
- editing_threshold: 0.5 (only keeps tokens with high confidence)
- Quality-first. Goes through more refinement steps
S-Mode more than doubles TPF (tokens per forward pass) compared to LLaDA 2.0 with only a slight quality drop, while Q-Mode maintains quality and still decodes over 20% faster.
Inference Speed: Can Diffusion Beat AR?
LLaDA 2.1-flash (100B) coding benchmark speeds:
892 TPS (Tokens Per Second) is a remarkable number. Hitting this speed on a 100B model demonstrates that the Token Editing + MoE synergy is working.
For context: AR MoE models of similar size struggle to reach this throughput without optimizations like speculative decoding. The combination of Diffusion's parallel generation and Token Editing's selective refinement is creating scenarios where Diffusion can outpace AR's sequential generation in raw speed.
An RL Framework for Diffusion LLMs
Another major contribution of LLaDA 2.1 is the first large-scale RL (Reinforcement Learning) framework designed for Diffusion LLMs.
RLHF in AR models:
- The model generates a response
- A reward model evaluates quality
- The model is updated via policy gradient
This is more complex in Diffusion models:
- Generation happens through iterative denoising, not sequentially
- How do you estimate gradients for intermediate steps?
- The discrete nature of masking/unmasking makes gradient computation difficult
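To see the shape of the problem, here is a schematic REINFORCE-style loss over a denoising trajectory -- emphatically not LLaDA 2.1's actual algorithm, just an illustration of where per-step credit assignment enters:

import torch

def diffusion_policy_gradient_loss(step_logits, step_commits, reward):
    """step_logits: list of (seq_len, vocab) logits, one per denoising step.
    step_commits: list of (positions, tokens) committed at each step.
    reward: scalar score for the final decoded sequence."""
    log_prob = 0.0
    for logits, (pos, tok) in zip(step_logits, step_commits):
        logp = logits.log_softmax(-1)        # per-position distributions
        log_prob = log_prob + logp[pos, tok].sum()
    return -reward * log_prob                # gradient ascent on reward * log pi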
LLaDA 2.1 introduces dedicated techniques to address these challenges:
- Specialized methods for stable gradient estimation
- Extended context window
- Improved reasoning precision
- Better instruction-following fidelity
Thanks to this RL framework, LLaDA 2.1 achieves more sophisticated reasoning and instruction-following than its predecessors.
LLaDA 2.1 Benchmark Deep Dive
LLaDA 2.1-mini (S-Mode vs Q-Mode vs 2.0):
Notable results:
GPQA: Q-Mode improves +5.52 over 2.0. This benchmark most clearly shows the effect of the RL framework.
ZebraLogic: Q-Mode improves +12.9 over 2.0 (64.20 -> 77.10). The RL impact on logical reasoning is dramatic.
AIME 2025: Q-Mode goes from 36.67 to 43.33, showing improvement even on competition-level math problems.
Running the Code
Python code for using LLaDA 2.1:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_path = "inclusionAI/LLaDA2.1-mini"

# Load directly in bfloat16 instead of casting after load
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

prompt = "Explain the concept of diffusion in language models."
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
).to(model.device)

# S-Mode (Speed): aggressive unmasking, T2T editing compensates for quality
generated = model.generate(
    inputs=input_ids,
    gen_length=512,
    block_length=32,
    threshold=0.5,
    editing_threshold=0.0,
    temperature=0.0,
)
print(tokenizer.decode(generated[0]))

# Q-Mode (Quality): conservative unmasking, only confident tokens are kept
generated = model.generate(
    inputs=input_ids,
    gen_length=512,
    block_length=32,
    threshold=0.7,
    editing_threshold=0.5,
    temperature=0.0,
)
print(tokenizer.decode(generated[0]))

Deploying a server with SGLang:
python3 -m sglang.launch_server \
--model-path inclusionAI/LLaDA2.1-flash \
--dllm-algorithm JointThreshold \
--tp-size 4 \
--trust-remote-code \
--mem-fraction-static 0.8
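Once the server is running, it exposes SGLang's OpenAI-compatible API. A minimal client sketch (the default port 30000 and the model field are assumptions; adjust them to your deployment):

from openai import OpenAI

# Point the standard OpenAI client at the local SGLang server
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="inclusionAI/LLaDA2.1-flash",
    messages=[{"role": "user", "content": "Explain token editing in one paragraph."}],
    temperature=0.0,
)
print(resp.choices[0].message.content)

Full Timeline
- February 2025: LLaDA 8B -- Masked Diffusion scaled to a dense 8B model (covered in Part 3)
- November 2025: LLaDA 2.0 -- MoE scaling to 100B (flash) and 16B (mini)
- February 2026: LLaDA 2.1 -- Token Editing, S-Mode/Q-Mode, and the first large-scale RL framework for Diffusion LLMs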
The Future of Diffusion LLMs
What the LLaDA series has demonstrated is that Diffusion LLMs are not just an academic curiosity -- they can be a practical alternative.
Current advantages of Diffusion LLMs:
- Bidirectional context (mitigates the Reversal Curse)
- Parallel generation (speed advantage when combined with Token Editing)
- Iterative refinement (tokens can be corrected even after generation)
- Natural synergy with MoE
Remaining challenges:
- Smaller ecosystem compared to AR (fine-tuning tools, quantization techniques, etc.)
- Multimodal extension (LLaDA-V has started but is still early-stage)
- Longer context windows (currently 32K)
- More real-world deployment case studies
But the direction is clear. In just one year, we went from an 8B dense model to 100B MoE + Token Editing + RL. Diffusion LLMs are putting the first cracks in AR's monopoly.
Key Takeaways
- MoE is a natural fit for diffusion: LLaDA 2.0 scales to 100B total parameters while activating only 6.1B, and parallel denoising keeps expert utilization high without large batches.
- Token Editing (M2T + T2T) lets LLaDA 2.1 revise already-generated tokens, removing the error propagation problem that AR decoding cannot escape.
- S-Mode and Q-Mode expose an explicit speed-quality dial through threshold and editing_threshold.
- A dedicated RL framework brings alignment training to Diffusion LLMs, with large gains on GPQA, ZebraLogic, and AIME 2025.
- At 892 TPS on a 100B model, Diffusion LLMs can now compete with AR models on raw speed, not just quality.
References
- InclusionAI. "LLaDA 2.0 Technical Report." arXiv:2512.15745, 2025.
- InclusionAI. "LLaDA 2.1: Speeding Up Text Diffusion via Token Editing." arXiv:2602.08676, 2026.
- Nie et al. "Large Language Diffusion Models." arXiv:2502.09992, 2025.
- Fedus et al. "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." JMLR 2022.