Diffusion LLM Part 3: LLaDA -- Building an 8B LLM with Masked Diffusion
Variable Masking, Fisher Consistency, In-Context Learning, Reversal Curse -- how LLaDA built a real LLM with diffusion.

In Part 2, we explored how D3PM and MDLM define Diffusion in discrete spaces. We also confirmed that Absorbing State Diffusion using [MASK] tokens is the most effective approach for text.
However, prior work remained at relatively small scales. The question "Can we actually build a real LLM with Diffusion?" was answered by LLaDA (Large Language Diffusion with mAsking).
Nie et al. (2025) scaled Masked Diffusion to 8B parameters, directly compared it against LLaMA3 8B, and demonstrated that Diffusion LLMs can possess the core capabilities of AR models -- In-Context Learning and Instruction Following.
Core Idea: Variable Masking Ratio
The most important design decision in LLaDA is the variable masking ratio.
BERT masks a fixed 15% of the input during training. Once set, this ratio never changes.
LLaDA instead samples the masking ratio uniformly from 0% to 100% for each training sequence: one sequence may have only 5% of its tokens masked, another 95%.
Here is why this is critically important:
In-Context Learning: When the masking ratio is very low (e.g., 5%), the model predicts the remaining tokens while most tokens are already visible. This is essentially a "read the given context and fill in the blanks" task, which naturally connects to In-Context Learning.
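The sampling scheme above can be sketched in a few lines. This is a minimal illustration, not LLaDA's actual training code: `MASK_ID` is a hypothetical placeholder for the model's [MASK] token id, and the function simply draws a ratio t ~ U(0, 1) and masks each token independently with probability t.

```python
import random

MASK_ID = -1  # hypothetical stand-in for the [MASK] token id


def mask_tokens(tokens, t=None, rng=random):
    """LLaDA-style variable masking (sketch).

    Sample a masking ratio t uniformly from [0, 1), then mask each
    token independently with probability t. With a low t the model
    sees most of the context (the "fill in the blanks" regime that
    connects to In-Context Learning); with a high t it must generate
    almost everything from scratch.
    """
    if t is None:
        t = rng.random()  # ratio varies per training example: ~0% .. ~100%
    masked = [MASK_ID if rng.random() < t else tok for tok in tokens]
    positions = [i for i, tok in enumerate(masked) if tok == MASK_ID]
    return masked, positions, t
```

During training, the cross-entropy loss is computed only on the masked positions; in LLaDA it is additionally weighted by 1/t so that the objective is a proper bound on the data log-likelihood across all masking ratios.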