Diffusion LLM Part 3: LLaDA -- Building an 8B LLM with Masked Diffusion
Variable Masking, Fisher Consistency, In-Context Learning, Reversal Curse -- how LLaDA built a real LLM with diffusion.

In Part 2, we explored how D3PM and MDLM define Diffusion in discrete spaces. We also confirmed that Absorbing State Diffusion using [MASK] tokens is the most effective approach for text.
However, prior work remained at relatively small scales. The question "Can we actually build a real LLM with Diffusion?" was answered by LLaDA (Large Language Diffusion with mAsking).
Nie et al. (2025) scaled Masked Diffusion to 8B parameters, directly compared it against LLaMA3 8B, and demonstrated that Diffusion LLMs can possess the core capabilities of AR models -- In-Context Learning and Instruction Following.
Core Idea: Variable Masking Ratio
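Unlike BERT-style models that mask a fixed fraction of tokens (e.g. 15%), LLaDA samples a masking ratio t uniformly from (0, 1] for every training sequence and masks each token independently with probability t, weighting the loss on masked positions by 1/t. Below is a minimal sketch of that forward (masking) step and loss weight; `MASK_ID` and the function names are hypothetical, and the actual model forward pass and cross-entropy are omitted:

```python
import random

MASK_ID = -1  # hypothetical id for the [MASK] absorbing-state token


def mask_sequence(tokens, t, rng):
    """Absorbing-state corruption: mask each token independently with prob t."""
    return [MASK_ID if rng.random() < t else tok for tok in tokens]


def llada_training_step(tokens, rng):
    # Sample the masking ratio t ~ Uniform(0, 1] -- the "variable" part:
    # the model sees every corruption level, from nearly clean to fully masked.
    t = rng.uniform(1e-3, 1.0)  # small floor avoids t = 0 (nothing masked)
    noisy = mask_sequence(tokens, t, rng)
    masked_positions = [i for i, tok in enumerate(noisy) if tok == MASK_ID]
    # In training, the model predicts the original token at each masked
    # position; the cross-entropy over those positions is scaled by 1/t,
    # which makes the objective an upper bound on negative log-likelihood.
    loss_weight = 1.0 / t
    return noisy, masked_positions, loss_weight
```

Averaging this 1/t-weighted masked cross-entropy over many sampled t values is what lets a single bidirectional Transformer act as the denoiser at every noise level.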