Can AI Read Minds? LLM Failures in Common Sense and Cognition

Humans know that dropped objects fall. We know that if someone leaves a room and the furniture gets rearranged, they will look where they left things, not where things actually are. We know that when a fact gets updated, we should remember the new version.
All of this comes from living in a physical body and navigating the world. LLMs learn from text alone. They have read "objects fall due to gravity" thousands of times, but they have never dropped anything.
This is Part 3 of the LLM Reasoning Failures series, covering three tests in common sense and cognition:
- Theory of Mind (ToM): Can models track what others believe?
- Physical Common Sense: Can models handle counter-intuitive physics?
- Working Memory: Can models track fact updates without reverting?
We tested 7 models: GPT-4o, GPT-4o-mini, o3-mini, Claude Sonnet 4.5, Claude Haiku 4.5, Gemini 2.5 Flash, and Gemini 2.5 Flash-Lite.
Theory of Mind: From Sally-Anne to 3rd-Order Beliefs
What Is Theory of Mind?
Theory of Mind (ToM) is the ability to understand that others can hold beliefs different from your own. The classic test is the Sally-Anne paradigm from developmental psychology.
Sally places a marble in a basket and leaves the room. Anne moves the marble to a box. When Sally returns, where will she look first?
The answer is "basket." Sally does not know the marble was moved. Most children over age 4 get this right.
Basic Tests: Nearly Perfect
We ran four standard ToM tests across all models.
All 7 models pass the classic Sally-Anne test. The "transparent bag" variant is also no problem: when a transparent bag is labeled "chocolate" but clearly contains popcorn, every model correctly reports that Sam believes the bag contains popcorn.
Only GPT-4o failed the second-order belief question, in which an object is moved from Drawer A to Drawer B after Alice has left. Asked "Where does Bob think Alice will look?", it answered Drawer B instead of Drawer A. Bob knows Alice left before anything was moved, so he should predict that she will look in Drawer A.
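For reference, the harness for these tests can stay very simple. The sketch below is a minimal, hypothetical version: `query_model` is a placeholder for whatever provider-specific client you use, and the model identifiers are illustrative labels rather than exact API names.

```python
# Minimal sketch of the ToM test harness, assuming a hypothetical
# query_model(model_name, prompt) helper that wraps each provider's API.
# Prompts and expected keywords mirror the tests described above.

TOM_TESTS = [
    {
        "name": "sally_anne",
        "prompt": (
            "Sally places a marble in a basket and leaves the room. "
            "Anne moves the marble to a box. When Sally returns, "
            "where will she look first? Answer with one word."
        ),
        "expected": "basket",
    },
    {
        "name": "transparent_bag",
        "prompt": (
            "A transparent bag is labeled 'chocolate' but clearly contains popcorn. "
            "Sam looks at the bag. What does Sam believe is inside? Answer with one word."
        ),
        "expected": "popcorn",
    },
]

MODELS = ["gpt-4o", "gpt-4o-mini", "o3-mini", "claude-sonnet-4.5",
          "claude-haiku-4.5", "gemini-2.5-flash", "gemini-2.5-flash-lite"]

def run_tom_suite(query_model):
    results = {}
    for model in MODELS:
        for test in TOM_TESTS:
            answer = query_model(model, test["prompt"])
            # Simple keyword check: pass if the expected word appears in the reply.
            results[(model, test["name"])] = test["expected"] in answer.lower()
    return results

# results = run_tom_suite(query_model)  # query_model supplied by the caller
```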
Raising the Difficulty: 3rd-Order Beliefs
Since the basic tests were nearly clean sweeps, we pushed harder.
> Alice puts a ring in Box 1 and leaves. Bob moves it to Box 2. Carol, who watched Bob, moves it to Box 3. Alice and Bob did not see Carol's action.
>
> Where does Carol think Bob thinks Alice will look for the ring?
This is a third-order belief. You have to model Carol's model of Bob's model of Alice's knowledge. Three nested levels of mental simulation. The correct answer is Box 1: Carol knows that Bob watched Alice leave before his move, so she expects Bob to predict that Alice will look where she left the ring.
Three models failed the 3rd-order belief test: GPT-4o-mini, o3-mini, and Gemini 2.5 Flash. GPT-4o-mini answered "Box 2," which is Bob's perspective, not Carol's model of Bob's model of Alice. It skipped a level of recursion.
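What the question demands is explicit bookkeeping of who witnessed which event. The toy sketch below is not part of the original experiments; it tracks each move together with its witnesses and resolves a nested belief by keeping only the events that every agent in the chain saw, a simplification that is sufficient for this puzzle.

```python
# Explicit nested belief tracking: each move event records the new location
# and the set of agents who witnessed it. To resolve "what A thinks B thinks
# C believes", keep only the events every agent in the chain witnessed and
# take the most recent one.

events = [
    ("Box 1", {"Alice", "Bob", "Carol"}),  # Alice puts the ring in Box 1
    ("Box 2", {"Bob", "Carol"}),           # Bob moves it; Alice has left
    ("Box 3", {"Carol"}),                  # Carol moves it; Alice and Bob do not see
]

def nested_belief(chain, events):
    visible = [loc for loc, witnesses in events
               if all(agent in witnesses for agent in chain)]
    return visible[-1]

print(nested_belief(["Carol", "Bob", "Alice"], events))  # Box 1 (the correct answer)
print(nested_belief(["Bob", "Alice"], events))           # Box 1 (Bob's model of Alice)
print(nested_belief(["Carol"], events))                  # Box 3 (Carol's own belief)
```

GPT-4o-mini's "Box 2" corresponds to dropping the innermost agent from the chain: it resolved Carol's model of Bob instead of Carol's model of Bob's model of Alice.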
The temporal scramble test is interesting. It starts with "When Jenny returned to the kitchen, she looked in the cupboard for her keys," then backtracks to explain the morning context. Every model got this right, but arguably because the answer appears directly in the first sentence.
Why Models Break at Depth 3
Transformer attention handles explicit relationships well. "A knows X" is a straightforward pattern. But "A thinks B thinks C knows X" requires maintaining separate belief states for each nested agent. The attention mechanism does not naturally preserve these distinct layers.
There is also a data contamination angle. The Sally-Anne test is everywhere on the internet. Models may be pattern-matching to a well-known answer format rather than genuinely simulating beliefs. The 3rd-order variant is rare in training data, which is precisely why it exposes the gap between memorization and understanding.
Physical Common Sense: Which Way Does the Balloon Move?
Can You Learn Physics from Text?
LLMs encounter physics facts constantly in their training data. "In a vacuum, all objects fall at the same rate" is practically a meme at this point. But what about counter-intuitive scenarios that require applying physics principles to unfamiliar situations?
We tested five physical common sense questions.
Test 1: Galileo's Experiment
Drop a 10kg bowling ball and a 1kg tennis ball from the same height in a vacuum. Which hits the ground first?
Answer: same time. Every model got this right. This is a textbook fact that appears in training data countless times.
Test 2: Inverted Glass of Water
Place a full glass of water upside down on a table, then remove your hand. What happens?
This one is trickier than it looks. The water spills. When you remove your hand from an inverted glass sitting on a table, the seal breaks and air enters. But some models confused this with the classic "paper on a glass" demonstration where atmospheric pressure holds the water in place.
Test 3: Ice in a Microwave
Microwave an ice cube for 10 minutes, then immediately put the resulting water in a freezer. Is the water temperature higher, lower, or the same as the original ice cube?
Answer: higher. After 10 minutes in a microwave, the water is likely near boiling. Putting it in a freezer does not instantly cool it.
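A rough energy budget makes the "near boiling" claim concrete. The numbers below are assumptions for illustration (a 30 g cube, a 700 W microwave), not values from the experiment.

```python
# Rough energy budget with illustrative numbers: melting a small ice cube and
# heating the water to boiling takes far less energy than a microwave delivers
# in 10 minutes (losses and evaporation are ignored).

mass = 0.03                 # kg, assumed ice cube mass
L_fusion = 334e3            # J/kg, latent heat of fusion
c_water = 4186              # J/(kg*K), specific heat of water
power, minutes = 700, 10    # assumed microwave power and duration

needed = mass * (L_fusion + c_water * 100)   # melt, then heat 0 -> 100 C
delivered = power * minutes * 60             # total energy input

print(f"energy to reach boiling: {needed/1e3:.1f} kJ")    # ~22.6 kJ
print(f"energy delivered:        {delivered/1e3:.0f} kJ") # 420 kJ
```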
Test 4: Helium Balloon in an Accelerating Car
A helium balloon is tied to the floor of a closed car. The car accelerates forward suddenly. Which way does the balloon move?
This is the most counter-intuitive problem. The answer is "forward."
When the car accelerates, the air inside the car is pushed backward by inertia. This creates a pressure gradient: lower pressure at the front, higher at the back. The helium balloon, being less dense than the surrounding air, moves toward the lower-pressure region. It is essentially buoyancy in a pseudo-gravitational field created by the acceleration.
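A back-of-the-envelope calculation shows the direction and rough size of the effect. All numbers below are illustrative assumptions, and the balloon skin and string are ignored.

```python
# In the car's frame, the acceleration a acts like a horizontal pseudo-gravity
# pointing backward; the buoyant force from the pressure gradient pushes the
# balloon the other way. Net horizontal force = (rho_air - rho_he) * V * a,
# positive meaning "toward the front of the car".

rho_air = 1.2      # kg/m^3, air density (assumed)
rho_he  = 0.17     # kg/m^3, helium density (assumed)
volume  = 0.01     # m^3, balloon volume (assumed)
a       = 3.0      # m/s^2, car's forward acceleration (assumed)

net_forward_force = (rho_air - rho_he) * volume * a
print(f"net forward force: {net_forward_force*1000:.0f} mN")  # ~31 mN, forward
```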
Test 5: Shadow and Weight
Does standing in the shade make you weigh less than standing in sunlight?
Answer: no. Weight is determined by gravity. Radiation pressure from sunlight is negligible.
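For scale, here is a rough comparison of the radiation force from sunlight with a person's weight, using assumed round numbers.

```python
# Order-of-magnitude comparison: radiation pressure of sunlight vs. body weight.

intensity = 1000.0   # W/m^2, sunlight at the surface (assumed)
c = 3.0e8            # m/s, speed of light
area = 0.7           # m^2, sunlit cross-section of a person (assumed)
mass = 70.0          # kg, assumed body mass
g = 9.81             # m/s^2

radiation_force = (intensity / c) * area   # fully absorbing surface
weight = mass * g

print(f"radiation force: {radiation_force*1e6:.1f} uN")   # ~2.3 uN
print(f"weight:          {weight:.0f} N")                 # ~687 N
print(f"ratio:           {radiation_force / weight:.1e}") # ~3e-9
```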
Results by Model
Overall accuracy is high, but the failures are scattered across different questions rather than concentrated on one. Only GPT-4o and Claude Sonnet 4.5 achieved a perfect 5/5.
The Pattern: Famous Facts vs. Rare Scenarios
There is a clear pattern. Well-known physics facts (Galileo's experiment, shadow weight) are universally correct. But less common counter-intuitive scenarios split the field: GPT-4o-mini and Flash-Lite got the helium balloon wrong (answering "backward"), while Haiku 4.5 and Flash got the inverted glass wrong (claiming the water stays put).
This is memorization, not understanding. When a physics scenario appears frequently in training data, the model retrieves the correct answer. When it does not, the model falls back on surface-level intuitions that happen to be wrong.
Working Memory: Can Models Track Fact Updates?
Proactive Interference
In cognitive psychology, proactive interference is when previously learned information disrupts the recall of newer information. You learn A on Monday, it gets updated to B on Tuesday, and when asked on Wednesday, A interferes with your recall of B.
Do LLMs exhibit the same phenomenon?
Experiment Design
We gave models 5 team-to-project assignments, then updated 2 of them.
Original assignments:
- Monday: Team Alpha on Project Mercury
- Tuesday: Team Beta on Project Venus
- Wednesday: Team Gamma on Project Earth
- Thursday: Team Delta on Project Mars
- Friday: Team Epsilon on Project Jupiter
Updates:
- Monday: Team Alpha now on Project Saturn (changed from Mercury)
- Wednesday: Team Gamma now on Project Neptune (changed from Earth)
Then we asked 5 questions:
- What project does Team Alpha work on Monday? (changed, answer: Saturn)
- What project does Team Beta work on Tuesday? (unchanged, answer: Venus)
- What project does Team Gamma work on Wednesday? (changed, answer: Neptune)
- What project does Team Epsilon work on Friday? (unchanged, answer: Jupiter)
- Which team works on Project Mercury? (trap, answer: none)
Question 5 is the critical test. Mercury was replaced by Saturn, so no team currently works on it. A model suffering from proactive interference would answer "Team Alpha."
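A minimal sketch of how this experiment can be assembled is below. It only builds the context and the question set; sending the prompt to each model is left to the same hypothetical `query_model` helper as before.

```python
# Build the working-memory prompt: five assignments, two updates, five questions
# (the last one is the interference trap).

assignments = {
    "Monday": ("Team Alpha", "Project Mercury"),
    "Tuesday": ("Team Beta", "Project Venus"),
    "Wednesday": ("Team Gamma", "Project Earth"),
    "Thursday": ("Team Delta", "Project Mars"),
    "Friday": ("Team Epsilon", "Project Jupiter"),
}
updates = {
    "Monday": ("Team Alpha", "Project Saturn"),
    "Wednesday": ("Team Gamma", "Project Neptune"),
}

context = "\n".join(f"{day}: {team} works on {project}."
                    for day, (team, project) in assignments.items())
context += "\nUpdates:\n" + "\n".join(
    f"{day}: {team} now works on {project}."
    for day, (team, project) in updates.items())

questions = [
    ("What project does Team Alpha work on Monday?", "Saturn"),
    ("What project does Team Beta work on Tuesday?", "Venus"),
    ("What project does Team Gamma work on Wednesday?", "Neptune"),
    ("What project does Team Epsilon work on Friday?", "Jupiter"),
    ("Which team works on Project Mercury?", "none"),  # the interference trap
]
```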
Results
Nearly universal success. Only o3-mini stumbled on the interference trap. It answered "No team currently works on Project Mercury," which is semantically correct but did not contain the keyword "none" and was marked as a failure.
Paradoxically, this reveals a characteristic failure mode of reasoning models. Models like o3-mini run longer internal chains of thought and tend to produce verbose, sentence-level responses. A question that calls for a single word -- "none" -- gets a complete sentence instead. The model's reasoning ability works against it under strict evaluation criteria. It is a "thinking too much" failure: the more capable the reasoning, the harder it becomes to produce a terse answer.
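The sketch below illustrates the grading gap, assuming a simple substring-based grader like the one implied above: a strict keyword check rejects the sentence-level answer, while a slightly more lenient check accepts it.

```python
# Strict keyword grading rejects "No team currently works on Project Mercury"
# because the literal word "none" never appears; a lenient grader that accepts
# common paraphrases marks the same answer as correct.

def strict_grade(answer: str, expected: str) -> bool:
    return expected.lower() in answer.lower()

def lenient_grade(answer: str, expected: str) -> bool:
    if expected.lower() == "none":
        return any(p in answer.lower() for p in ("none", "no team", "no one", "nobody"))
    return expected.lower() in answer.lower()

reply = "No team currently works on Project Mercury."
print(strict_grade(reply, "none"))   # False -- marked as a failure
print(lenient_grade(reply, "none"))  # True
```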
Why This Was Too Easy
Most models passed because the context is short. Five assignments and two updates occupy a tiny fraction of any model's context window. The information is right there, close together, easy to attend to.
The Song et al. paper points out that real-world working memory failures emerge when:
- The context is thousands of tokens long
- Updates are far from the original facts
- Multiple conflicting updates accumulate over time
- The task requires integrating information across distant parts of the context
A large context window does not equal good working memory. A 128K token window means the model can "see" 128K tokens, not that it can effectively track all the information within them. The question is where attention places its weights when original facts and their updates compete.
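A harder variant in the spirit of these conditions is easy to construct: reuse the `assignments` and `updates` dictionaries from the earlier sketch, but separate them with thousands of tokens of filler so each update sits far from the fact it overrides. This generator is an illustration, not the paper's benchmark.

```python
def build_long_context(assignments, updates, filler_sentences=2000):
    """Place the original facts near the start, the updates near the end,
    and a long run of filler in between."""
    facts = [f"{day}: {team} works on {project}."
             for day, (team, project) in assignments.items()]
    fixes = [f"Correction for {day}: {team} now works on {project}."
             for day, (team, project) in updates.items()]
    filler = [f"Note {i}: the weekly status meeting was rescheduled."
              for i in range(filler_sentences)]
    return "\n".join(facts + filler + fixes)

# long_prompt = build_long_context(assignments, updates)
```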
The Paper's Diagnosis: Text Alone Is Not Enough
Song et al. connect these three failure categories to a single root cause: the absence of embodied experience.
For Theory of Mind, humans run mental simulations. We put ourselves in another person's shoes and ask "what would I think if I were them?" LLMs instead rely on pattern matching. They learn that "Sally left the room, something moved, Sally does not know" leads to "Sally looks in the original place." This pattern works for the standard Sally-Anne test. It breaks when the nesting goes deeper than the patterns in training data.
For physical common sense, the gap is similar. Knowing that "objects fall in a vacuum at the same rate" as a textual fact is different from having an intuitive model of how gravity, buoyancy, and inertia interact. Text-based learning excels at memorizing famous physics results but fails to generalize to rare counter-intuitive scenarios.
For working memory, the issue connects to Transformer architecture. Self-attention computes relationships between all tokens in parallel, but it does not explicitly encode temporal priority. There is no built-in mechanism that says "this newer fact overrides that older one."
Mitigation Strategies
The paper outlines three broad approaches.
Multimodal training: Incorporating video, physics simulations, and robotic experience data alongside text gives models something closer to embodied knowledge. Omni-modal models like MiniCPM-o and GPT-4o are steps in this direction.
Neuro-symbolic reasoning: For problems like deep recursive belief tracking or physical simulation, pure neural approaches have limits. Hybrid systems that combine neural pattern recognition with symbolic logic modules can decompose "what Carol thinks Bob thinks Alice knows" into a formal structure and solve it step by step.
Explicit memory mechanisms: For working memory failures, research is exploring memory modules that explicitly manage temporal priority of in-context information. RAG (Retrieval Augmented Generation) serves as a form of external memory, but long-term solutions likely require internal state management within the model.
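As a toy illustration of what "explicit temporal priority" means, here is a minimal external fact store (an assumption for illustration, not a proposal from the paper) in which an update simply replaces the stale fact, so the Mercury trap cannot arise.

```python
class FactStore:
    """Toy external memory where the newest write for a key always wins."""

    def __init__(self):
        self._facts = {}   # team -> current project

    def write(self, team, project):
        # Overwrite: the update replaces the stale fact instead of coexisting with it.
        self._facts[team] = project

    def current_project(self, team):
        return self._facts.get(team)

    def who_works_on(self, project):
        return [team for team, proj in self._facts.items() if proj == project]

store = FactStore()
store.write("Team Alpha", "Project Mercury")
store.write("Team Gamma", "Project Earth")
store.write("Team Alpha", "Project Saturn")    # update overrides the old fact
store.write("Team Gamma", "Project Neptune")

print(store.current_project("Team Alpha"))     # Project Saturn
print(store.who_works_on("Project Mercury"))   # [] -- no interference
```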
The Gap Between Pattern Matching and Understanding
The takeaway from these experiments is clear.
LLMs handle problems well when those problems exist as patterns in training data, but become fragile when genuine understanding is required. They pass Sally-Anne but break at 3rd-order beliefs. They ace Galileo's experiment but split on the helium balloon. They track short-context fact updates but face interference at scale.
This is not a verdict that models are bad. It is a precise map of where current LLMs excel and where they hit walls. Knowing these limits is the first step to using LLMs effectively and to building the next generation that might overcome them.
Series Index
- Overview: Are LLMs Really Smart? A Complete Guide to AI Reasoning Failures
- Part 1: Structural Limitations -- Reversal Curse, Counting, Compositional Reasoning
- Part 2: Cognitive Biases -- Anchoring, Order Bias, Sycophancy, Confirmation Bias
- Part 3: Common Sense and Cognition -- Theory of Mind, Physical Common Sense, Working Memory (this post)
- Notebook: Full Experiment Code (Jupyter Notebook)
Reference: Song, P., Han, P., & Goodman, N. (2025). Large Language Model Reasoning Failures. Transactions on Machine Learning Research (TMLR).