Can AI Read Minds? LLM Failures in Common Sense and Cognition
Theory of Mind, Physical Common Sense, Working Memory — testing where text-only LLMs fail in common sense and cognition.
Humans know that dropped objects fall. We know that if someone leaves a room and the furniture gets rearranged, they will look where they left things, not where things actually are. We know that when a fact gets updated, we should remember the new version.
All of this comes from living in a physical body and navigating the world. LLMs learn from text alone. They have read "objects fall due to gravity" thousands of times, but they have never dropped anything.
This is Part 3 of the LLM Reasoning Failures series, covering three tests in common sense and cognition:
- Theory of Mind (ToM): Can models track what others believe?
- Physical Common Sense: Can models handle counter-intuitive physics?
- Working Memory: Can models track fact updates without reverting?
We tested 7 models: GPT-4o, GPT-4o-mini, o3-mini, Claude Sonnet 4.5, Claude Haiku 4.5, Gemini 2.5 Flash, and Gemini 2.5 Flash-Lite.
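The Theory of Mind test is rooted in the classic Sally-Anne false-belief task. As a minimal sketch of how such a probe might be scored (the prompt wording and the `score_false_belief` helper are illustrative, not the post's actual harness):

```python
# Hypothetical Sally-Anne false-belief probe and scorer.
# A correct model predicts the agent's (outdated) belief, not reality.

SALLY_ANNE_PROMPT = (
    "Sally puts her marble in the basket and leaves the room. "
    "While she is away, Anne moves the marble to the box. "
    "When Sally returns, where will she first look for her marble? "
    "Answer with one word: basket or box."
)

def score_false_belief(model_answer: str) -> bool:
    """Pass if the model answers with Sally's false belief ('basket')."""
    return model_answer.strip().lower().rstrip(".") == "basket"

# "box" tracks the true world state but fails first-order theory of mind.
print(score_false_belief("Basket"))  # True: passes the probe
print(score_false_belief("box"))     # False: belief-tracking failure
```

Higher-order variants nest the beliefs ("Where does Anne think Sally will look?"), which is where the 3rd-order tests below come in.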
Theory of Mind: From Sally-Anne to 3rd-Order Beliefs
What Is Theory of Mind?