Are LLMs Really Smart? Dissecting AI's Reasoning Failures
Stanford researchers analyzed 500+ papers to systematically map LLM reasoning failures. From cognitive biases to the reversal curse, discover where and why AI reasoning breaks down.

Large Language Models like ChatGPT and Claude write complex code, compose poetry, and hold philosophical conversations. Yet they occasionally produce baffling answers to remarkably simple questions.
"Why does such a smart AI make such basic mistakes?"
A survey paper from Stanford -- "Large Language Model Reasoning Failures" by Song, Han, and Goodman (TMLR 2026) -- offers the first comprehensive taxonomy of where and why LLM reasoning breaks down. Drawing on over 500 research papers, it organizes dozens of failure categories along two axes: the type of reasoning involved and the mode of failure.
This post walks through the paper's framework and key findings. Inspired by their taxonomy, we also designed 10 hands-on experiments and ran them across 7 current models. Detailed results are in Parts 1-3; this post is the overview.
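To make the flavor of these experiments concrete, here is a minimal sketch of one probe type mentioned above: the reversal curse, where a model that knows "A is B" fails to answer "B is A". The `ask` callable and `stub_model` below are assumptions for illustration, not the paper's harness; in a real run, `ask` would wrap an actual model API call.

```python
# Minimal reversal-curse probe (sketch, not the paper's actual harness).
# `ask` is any callable mapping a prompt string to a model's answer string.

def reversal_probe(ask, forward_q, backward_q, answer_fwd, answer_bwd):
    """Ask the same fact in both directions; report which direction succeeds."""
    fwd_ok = answer_fwd.lower() in ask(forward_q).lower()
    bwd_ok = answer_bwd.lower() in ask(backward_q).lower()
    return {"forward": fwd_ok, "backward": bwd_ok}

# Hypothetical stub standing in for a real LLM: it "knows" the fact only
# in one direction -- the classic reversal-curse signature.
def stub_model(prompt):
    if "Tom Cruise" in prompt:
        return "Mary Lee Pfeiffer"
    return "I don't know."

result = reversal_probe(
    stub_model,
    forward_q="Who is Tom Cruise's mother?",
    backward_q="Who is Mary Lee Pfeiffer's son?",
    answer_fwd="Mary Lee Pfeiffer",
    answer_bwd="Tom Cruise",
)
print(result)  # {'forward': True, 'backward': False}
```

Running a harness like this across several models is essentially a loop over `ask` implementations, which is how per-model comparisons in Parts 1-3 are structured.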
Related Posts

From Evaluation to Deployment — The Complete Fine-tuning Guide
Evaluate with Perplexity, KoBEST, ROUGE-L. Merge adapters with merge_and_unload(), convert to GGUF, deploy via vLLM/Ollama. Overfitting prevention, data quality, hyperparameter guide.

QLoRA + Custom Dataset — Fine-tune 7B on a Single T4 GPU
Fine-tune Qwen 2.5 7B on a T4 16GB using QLoRA (4-bit NormalFloat + LoRA). Korean dataset preparation guide, NF4/Double Quantization/Paged Optimizer explained, Wandb monitoring.

Mastering LoRA — Fine-tune a 7B Model on a Single Notebook
From LoRA theory to hands-on Qwen 2.5 7B fine-tuning. Train only 0.18% of parameters while achieving 98% of full fine-tuning performance. VRAM reduced from 130GB to 18GB.