Diffusion LLM Part 3: LLaDA -- Building an 8B LLM with Masked Diffusion
Variable Masking, Fisher Consistency, In-Context Learning, Reversal Curse -- how LLaDA built a real LLM with diffusion.

In Part 2, we explored how D3PM and MDLM define Diffusion in discrete spaces. We also confirmed that Absorbing State Diffusion using [MASK] tokens is the most effective approach for text.
However, prior work remained at relatively small scales. The question "Can we actually build a real LLM with Diffusion?" was answered by LLaDA (Large Language Diffusion with mAsking).
Nie et al. (2025) scaled Masked Diffusion to 8B parameters, directly compared it against LLaMA3 8B, and demonstrated that Diffusion LLMs can possess the core capabilities of AR models -- In-Context Learning and Instruction Following.
Core Idea: Variable Masking Ratio
The most important design decision in LLaDA is the variable masking ratio.
BERT masks a fixed 15% of the input during training. Once set, this ratio never changes.
LLaDA instead samples the masking ratio uniformly from 0% to 100% for each training sequence: one sequence may have only 5% of its tokens masked, another 95%.
Here is why this is critically important:
In-Context Learning: When the masking ratio is very low (e.g., 5%), the model predicts the remaining tokens while most tokens are already visible. This is essentially a "read the given context and fill in the blanks" task, which naturally connects to In-Context Learning.
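The sampling scheme above can be sketched in a few lines. This is a minimal illustration, not LLaDA's actual training code: `MASK_ID` is a hypothetical placeholder for the model's [MASK] token id, and the function simply draws a ratio t ~ U(0, 1) and masks each token independently with probability t.

```python
import random

MASK_ID = -1  # hypothetical stand-in for the [MASK] token id


def mask_tokens(tokens, t=None, rng=random):
    """LLaDA-style variable masking (sketch).

    Sample a masking ratio t uniformly from [0, 1), then mask each
    token independently with probability t. With a low t the model
    sees most of the context (the "fill in the blanks" regime that
    connects to In-Context Learning); with a high t it must generate
    almost everything from scratch.
    """
    if t is None:
        t = rng.random()  # ratio varies per training example: ~0% .. ~100%
    masked = [MASK_ID if rng.random() < t else tok for tok in tokens]
    positions = [i for i, tok in enumerate(masked) if tok == MASK_ID]
    return masked, positions, t
```

During training, the cross-entropy loss is computed only on the masked positions; in LLaDA it is additionally weighted by 1/t so that the objective is a proper bound on the data log-likelihood across all masking ratios.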