Are LLMs Really Smart? Dissecting AI's Reasoning Failures
Stanford researchers analyzed 500+ papers to systematically map LLM reasoning failures. From cognitive biases to the reversal curse, discover where and why AI reasoning breaks down.

Are LLMs Really Smart? A Complete Guide to AI Reasoning Failures
Large Language Models like ChatGPT and Claude write complex code, compose poetry, and hold philosophical conversations. Yet they occasionally produce baffling answers to remarkably simple questions.
"Why does such a smart AI make such basic mistakes?"
A survey paper from Stanford -- "Large Language Model Reasoning Failures" by Song, Han, and Goodman (TMLR 2026) -- presents the first comprehensive taxonomy of where and why LLM reasoning breaks down. Drawing on more than 500 research papers, it maps dozens of failure categories across reasoning types and failure modes.
This post walks through the paper's framework and key findings. Inspired by the taxonomy, we also designed 10 hands-on experiments and ran them across 7 current models; detailed results appear in Parts 1-3, while this post gives the overview.
The Paper's Framework

The paper classifies LLM reasoning failures along two axes.
Axis 1 -- Reasoning Type:
- Informal (Intuitive) Reasoning: cognitive skills (working memory, inhibitory control, cognitive flexibility), cognitive biases, Theory of Mind, social norms, moral reasoning, emotional intelligence
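The first axis above can be thought of as a simple lookup structure. The sketch below is our own illustration (not code from the paper), encoding the informal-reasoning categories listed above; the nesting of sub-areas under "cognitive skills" follows the parenthetical in the bullet, and the helper name `subareas` is hypothetical.

```python
# Illustrative sketch of Axis 1 (Informal/Intuitive Reasoning) as a lookup table.
# Category names come from the survey's taxonomy; the encoding is our assumption.
INFORMAL_REASONING: dict[str, list[str]] = {
    "cognitive skills": ["working memory", "inhibitory control", "cognitive flexibility"],
    "cognitive biases": [],
    "theory of mind": [],
    "social norms": [],
    "moral reasoning": [],
    "emotional intelligence": [],
}

def subareas(category: str) -> list[str]:
    """Return the sub-areas listed for a category, or [] if none are broken out."""
    return INFORMAL_REASONING.get(category, [])
```

A failure observed in an experiment could then be tagged with a `(category, subarea)` pair, e.g. `("cognitive skills", "working memory")`, which is how we organize results in Parts 1-3.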