Can Diffusion Replace Autoregressive LLMs? The Complete LLaDA 2.X Guide
From DDPM to LLaDA 2.1 -- everything about diffusion-based LLMs. Masked Diffusion, Token Editing, and MoE scaling dissected across 4 parts.

Can Diffusion Replace the LLM? A Complete Anatomy of LLaDA 2.X
ChatGPT, Claude, Gemini — every large language model (LLM) we use today is built on a single principle. Autoregressive (AR) generation: left to right, one token at a time, predicting the next word.
This approach works remarkably well. But it has structural limitations.
- Tokens must be produced one at a time in sequence, making parallel generation impossible
- Even if the model knows "A is B," it cannot infer "B is A" — the Reversal Curse
Related Posts

From Evaluation to Deployment — The Complete Fine-tuning Guide
Evaluate with Perplexity, KoBEST, ROUGE-L. Merge adapters with merge_and_unload(), convert to GGUF, deploy via vLLM/Ollama. Overfitting prevention, data quality, hyperparameter guide.

QLoRA + Custom Dataset — Fine-tune 7B on a Single T4 GPU
Fine-tune Qwen 2.5 7B on a T4 16GB using QLoRA (4-bit NormalFloat + LoRA). Korean dataset preparation guide, NF4/Double Quantization/Paged Optimizer explained, Wandb monitoring.

Mastering LoRA — Fine-tune a 7B Model on a Single Notebook
From LoRA theory to hands-on Qwen 2.5 7B fine-tuning. Train only 0.18% of parameters while achieving 98% of full fine-tuning performance. VRAM reduced from 130GB to 18GB.