LLM Reasoning Failures Part 1: Structural Limitations -- Scaling Won't Fix These

This is the first installment in our series dissecting LLM reasoning failures. In this post, we cover three fundamental limitations that persist no matter how much you scale the model or expand the training data.
- The Reversal Curse
- Counting Failures
- The Compositional Reasoning Wall
These failures stem from the Transformer architecture itself; prompt engineering and scaling cannot fundamentally resolve them. Drawing on the survey by Song, Han, and Goodman (2025), we present hands-on experiments across seven models alongside the survey's theoretical analysis.
1. The Reversal Curse
What the Paper Says
If a model has learned "A is B," can it infer "B is A"? Song et al. (2025) call this failure the **Reversal Curse**. The Transformer's next-token prediction objective (unidirectional training) strengthens weights only in the "A to B" direction. "B to A" cannot be inferred unless it was separately learned.
Critically, this problem resists scaling due to Zipf's law. The sentence "Tom Cruise's mother is Mary Lee Pfeiffer" may appear in training data, but "Mary Lee Pfeiffer's son is Tom Cruise" is far rarer. When a celebrity's name is the subject, data is abundant; when an obscure person's name is the subject, data is scarce. This distributional asymmetry is structural.
Our Experiment Results
**Test 1: Real Knowledge (Parametric Memory)**
Forward question: "Who is Tom Cruise's mother?"
Reverse question: "Who is Mary Lee Pfeiffer's son?"
All models answered the forward question correctly. The reverse direction reveals dramatic differences.
Claude Haiku 4.5 declined, saying "I don't have reliable information." Gemini Flash answered "Joaquin Phoenix" -- likely associating the surname Pfeiffer with Michelle Pfeiffer, then jumping to Joaquin Phoenix (her former partner). Gemini Flash-Lite answered "Michelle Pfeiffer" even more directly, revealing a surname-matching heuristic at work.
These are not random wrong answers. They are surname-based associative errors. The model encounters "Pfeiffer" in the query, activates the most famous Pfeiffer (Michelle), then follows association chains to related individuals (Joaquin Phoenix). Rather than admitting "I don't know," the model fills the gap with the nearest plausible pattern -- a micro-instance of hallucination. Claude Haiku's "I don't have reliable information" is paradoxically the more desirable failure mode.
What this shows: large models (GPT-4o, Claude Sonnet) likely encountered the Tom Cruise / Mary Lee Pfeiffer relationship in both directions during training. But smaller or lightweight models exhibit the reversal curse clearly.
**Test 2: Synthetic Facts (In-Context / RAG Approach)**
If the reversal curse is truly structural, what happens when we provide information directly in context? We created 5 fictional facts, included them in the prompt, and tested both forward and reverse lookups.
Example facts:
- "Zephyr Kowalski invented the Heliotrope Engine in 1847"
- "The Crimson Lattice theory was proposed by Ondra Havel"
Forward: "Who invented the Heliotrope Engine?"
Reverse: "What did Zephyr Kowalski invent?"
Result: **All 7 models answered both forward and reverse questions perfectly.**
This is a critically important finding. When relevant information is provided in context via RAG (Retrieval-Augmented Generation), the reversal curse disappears completely. The model can reference text within the context window bidirectionally.
However, this does not mean the reversal curse is "solved." In parametric memory (knowledge stored in model weights), the asymmetry persists. RAG is a bandage, not a cure.
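For reference, here is a minimal sketch of how such an in-context test can be run, using the OpenAI Python client as one example; the model name is a placeholder and the prompt wording is ours, so treat this as an illustration rather than the exact harness used in our experiments.
```
# Minimal sketch of the in-context (RAG-style) reversal test.
# Assumes the openai package and an API key; any chat-capable model works.
from openai import OpenAI

client = OpenAI()

FACTS = (
    "Zephyr Kowalski invented the Heliotrope Engine in 1847. "
    "The Crimson Lattice theory was proposed by Ondra Havel."
)

QUESTIONS = [
    "What did Zephyr Kowalski invent?",     # forward: subject -> object
    "Who invented the Heliotrope Engine?",  # reverse: object -> subject
]

for question in QUESTIONS:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; swap in any model under test
        messages=[
            {"role": "system", "content": f"Answer using only these facts: {FACTS}"},
            {"role": "user", "content": question},
        ],
    )
    print(question, "->", response.choices[0].message.content)
```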
Technical Root Cause
Transformers learn via next-token prediction. Weights are updated to predict the next token from preceding tokens in the input sequence.
When the model learns "Tom Cruise's mother is Mary Lee Pfeiffer":
- "Tom Cruise's mother is" -> "Mary" (reinforced)
- "Mary Lee" -> "Pfeiffer" (reinforced)
But the reverse -- "Mary Lee Pfeiffer's son is Tom Cruise" -- is not reflected in the weights unless encountered as a separate training sentence. This is a fundamental architectural property.
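A toy illustration of the asymmetry: next-token prediction turns a single sentence into (context, target) pairs that all run left to right. The word-level split below is a deliberate simplification of real subword tokenization.
```
# Toy illustration: next-token prediction only creates left-to-right training pairs.
# Real models use subword tokens; whole words are used here for readability.
sentence = "Tom Cruise's mother is Mary Lee Pfeiffer"
tokens = sentence.split()

pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in pairs:
    print(" ".join(context), "->", target)

# The last pair is "Tom Cruise's mother is Mary Lee -> Pfeiffer".
# No pair ever asks the model to predict "Tom" from "Mary Lee Pfeiffer's son is",
# so the reverse mapping is never reinforced unless it appears as its own sentence.
```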
Known Mitigations
- **RAG**: Most effective. Providing relevant information in context enables reverse inference
- **Bidirectional fact exposure**: Including both "A is B" and "B is A" in training data
- **Data augmentation**: Generating training data expressing relationships in multiple directions
- **Graph-structured reasoning**: Representing relationships as graphs, enabling direction-agnostic traversal (see the sketch below)
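To make the last mitigation concrete, here is a minimal sketch of storing each fact together with an explicit inverse edge so that either endpoint can be queried; the relation names and data structure are illustrative, not taken from the survey.
```
# Minimal sketch: store each fact with an explicit inverse edge so lookups
# work from either endpoint. Relation names here are illustrative.
from collections import defaultdict

graph = defaultdict(list)

def add_fact(subject, relation, obj, inverse_relation):
    graph[(subject, relation)].append(obj)
    graph[(obj, inverse_relation)].append(subject)

add_fact("Tom Cruise", "mother", "Mary Lee Pfeiffer", inverse_relation="child")

print(graph[("Tom Cruise", "mother")])        # ['Mary Lee Pfeiffer']
print(graph[("Mary Lee Pfeiffer", "child")])  # ['Tom Cruise']
```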
2. Counting Failures
What the Paper Says
Song et al. (2025) classify character counting and arithmetic operations as fundamental LLM limitations, explicitly stating these remain "challenging even for reasoning models."
The root cause is **tokenization**. LLMs process input as subword tokens, not individual characters. The model sees "strawberry" as token chunks like str + aw + berry, not as individual letters. It literally cannot see how many times 'r' appears.
Our Experiment Results
**Test 1: "How many r's in strawberry?"**
Only GPT-4o got this wrong, answering 2; the other six models answered correctly. Reasoning models (o3-mini) and newer models appear to work around the limitation by internally spelling out the word character by character.
But is this really "solved"? Let's try a harder word.
**Test 2: "How many i's in supercalifragilisticexpialidocious?"**
Correct answer: 7 (supercal**i**frag**i**l**i**st**i**cexp**i**al**i**doc**i**ous)
With a longer word, both GPT-4o and GPT-4o-mini failed with an answer of 3. The reasoning model o3-mini got it right by internally spelling out the word letter by letter.
The key insight here: reasoning models succeed not by overcoming the tokenization limitation, but by working around it. They perform additional computation, internally enumerating "s-u-p-e-r-c-a-l-i-..." character by character. This is similar to a human counting on their fingers -- a workaround, not a fundamental capability.
Technical Root Cause
BPE (Byte Pair Encoding) tokenization merges frequently co-occurring character sequences into single tokens.
```
"strawberry" -> ["str", "aw", "berry"] (3 tokens)
```
The model sees these 3 tokens. It cannot directly see the individual characters s, t, r, a, w, b, e, r, r, y. To answer "how many r's?", it must decompose tokens back into characters -- the inverse of tokenization -- which is not a natural operation for the model.
Longer words produce more tokens, and tracking character composition across more tokens becomes increasingly difficult. This is why "supercalifragilisticexpialidocious" causes more failures than "strawberry."
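You can inspect these splits directly. A small sketch using the tiktoken library is shown below; exact splits vary by tokenizer and model family, so the pieces you see may differ from the example above.
```
# Inspect how a BPE tokenizer splits the words from the counting tests.
# Requires: pip install tiktoken. Exact splits vary by tokenizer/model family.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
targets = {"strawberry": "r", "supercalifragilisticexpialidocious": "i"}

for word, letter in targets.items():
    token_ids = enc.encode(word)
    pieces = [enc.decode([tid]) for tid in token_ids]
    print(f"{word!r} -> {pieces} ({len(pieces)} tokens)")
    print(f"  character-level count of {letter!r}: {word.count(letter)}")
```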
Known Mitigations
- **Character-level tokenization**: Splitting tokens into individual characters (but sequence length explodes)
- **Reasoning model spell-out**: Chain-of-Thought enumeration of characters one by one (workaround)
- **External tool calls**: Delegating character counting to Python code execution or other tools (see the sketch below)
- **Byte-level models**: Using character-level models like ByT5
None of these are perfect. Character-level tokenization is impractical due to sequence length. Reasoning model workarounds incur additional computation cost. External tools are not the model's own capability.
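The tool-call route is at least cheap to implement: once the text reaches ordinary code, exact counting is a one-liner. A sketch of what a code-execution tool would run:
```
# What a code-execution tool would actually run for the counting questions:
# exact character counting is trivial once the text is handled as characters.
def count_letter(word: str, letter: str) -> int:
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))                          # 3
print(count_letter("supercalifragilisticexpialidocious", "i"))  # 7
```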
3. Compositional Reasoning
What the Paper Says
Song et al. (2025) observe that LLMs know individual facts but fail when combining them across multiple steps. They call this the limit of compositional reasoning. The core problem: LLMs lack **holistic planning** capability, and their compositionality is superficial.
Our Experiment Results
**Test 1: 1-Hop (Single Step)**
Question: "What is the capital of Japan?"
All models correct. Single-hop is not a problem.
**Test 2: 2-Hop (Two-Step Composition)**
Question: "What is the population of Japan's capital?"
Tokyo Metropolis has a population of roughly 13-14 million, so answers anywhere in that range are accepted as correct. GPT-4o answered with the Greater Tokyo Area population (37 million) -- a scope error, answering about the metropolitan area rather than the city proper. o3-mini returned an empty response entirely. The remaining 5 models got it right, but this question combines two well-known facts, so the difficulty is low.
**Test 3: 3-Hop (Three-Step Composition)**
Question: "What is the tallest building in the capital city of the country that hosted the 2021 Summer Olympics?"
Reasoning path: 2021 Olympics host -> Japan -> capital -> Tokyo -> tallest building -> Azabudai Hills Mori JP Tower (325m) or Tokyo Skytree (634m, if including broadcast towers)
Most answered correctly, though answers varied based on whether "building" includes broadcast towers. Gemini Flash-Lite answered "Tokyo Tower," a clear error. Degradation begins at 3 hops.
**Test 4: 2-Hop with Distractors**
Prompt:
"France is famous for its wine production. Germany is known for its engineering excellence. Japan is the country whose currency is the Yen. What is the capital of the country whose currency is the Yen?"
The key: France and Germany information are distractors. The actual question requires identifying "the country whose currency is the Yen" (Japan) and then finding its capital.
All models passed this specific case. However, when the volume and semantic similarity of distractors increase (e.g., listing currency information for multiple countries before asking about a specific country's capital), error rates rise sharply. The more the distractors resemble the target information, the more attention gets diffused.
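One way to probe that claim is to generate prompts with a growing number of same-shaped currency distractors. A sketch of how such prompts can be constructed (the country/currency pairs are illustrative):
```
# Sketch: build 2-hop prompts with an increasing number of same-shaped
# currency distractors. The country/currency pairs are illustrative.
DISTRACTORS = [
    ("Switzerland", "Swiss Franc"),
    ("India", "Rupee"),
    ("Brazil", "Real"),
    ("South Korea", "Won"),
    ("Norway", "Krone"),
]

def build_prompt(n_distractors: int) -> str:
    lines = [f"{country} is the country whose currency is the {currency}."
             for country, currency in DISTRACTORS[:n_distractors]]
    lines.append("Japan is the country whose currency is the Yen.")
    lines.append("What is the capital of the country whose currency is the Yen?")
    return " ".join(lines)

for n in (0, 2, 5):
    print(f"--- {n} distractors ---")
    print(build_prompt(n))
```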
Technical Root Cause
Compositional reasoning failure has three root causes.
**1) Absence of planning**: LLMs operate by predicting one token at a time. They cannot plan ahead -- "Let me prepare the information I'll need three steps from now." At each token generation step, they simply select the most plausible next token.
**2) Attention diffusion**: The Transformer's attention mechanism distributes attention across all tokens in the context. As distractors increase, attention allocated to critical information decreases. This is the direct mechanism by which distractors degrade performance.
**3) Superficial compositionality**: When a model correctly answers "the population of Japan's capital," it may be because "Japan-capital-Tokyo" and "Tokyo-population-14 million" frequently co-occur in training data. This is pattern matching of commonly co-occurring facts, not genuine compositional reasoning. The less common the combination in training data, the more dramatically failure rates increase.
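A toy numerical illustration of the second cause, under the simplifying assumption of a single attention head with fixed, hand-picked scores: the softmax weight on the one relevant token shrinks as near-equally-scored distractor tokens are added.
```
# Toy illustration of attention diffusion: with softmax attention, adding
# distractor tokens whose scores are close to the relevant token's score
# shrinks the weight the relevant token receives. Scores are made up.
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

relevant_score = 2.0
distractor_score = 1.5  # "semantically similar" distractors score almost as high

for n_distractors in (0, 3, 10, 30):
    scores = [relevant_score] + [distractor_score] * n_distractors
    weight_on_relevant = softmax(scores)[0]
    print(f"{n_distractors:2d} distractors -> weight on relevant token: {weight_on_relevant:.2f}")
```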
Known Mitigations
- **Chain-of-Thought (CoT)**: Having the model explicitly write intermediate steps improves performance
- **Step-by-step decomposition**: Breaking complex questions into sequential sub-questions (see the sketch after this list)
- **RAG**: Retrieving facts needed for each step and providing them in context
- **Graph-based reasoning**: Using knowledge graphs to explicitly traverse relationship paths
- **Multi-agent debate**: Multiple LLM instances cross-verifying each other's reasoning
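As a concrete example of the decomposition mitigation, the sketch below splits the 3-hop Olympics question into sequential sub-questions and feeds each answer into the next; `ask` stands in for any single-turn chat-model call, and the canned answers are only there to make the example self-contained.
```
# Sketch of step-by-step decomposition for the 3-hop question.
# `ask` is any function that sends one question to a chat model and
# returns its answer as a string (e.g. the client call shown earlier).
from typing import Callable

def answer_three_hop(ask: Callable[[str], str]) -> str:
    host = ask("Which country hosted the 2021 Summer Olympics? Reply with the country name only.")
    capital = ask(f"What is the capital of {host}? Reply with the city name only.")
    return ask(f"What is the tallest building in {capital}?")

# Example run with a stub in place of a real model call:
canned = {
    "Which country hosted the 2021 Summer Olympics? Reply with the country name only.": "Japan",
    "What is the capital of Japan? Reply with the city name only.": "Tokyo",
    "What is the tallest building in Tokyo?": "Azabudai Hills Mori JP Tower",
}
print(answer_three_hop(lambda q: canned[q]))
```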
Synthesis
A common thread runs through all three failures.
**All originate from the next-token prediction architecture.** The reversal curse comes from unidirectional training. Counting failures come from tokenization. Compositional reasoning failures come from plan-free sequential generation. All are rooted in the Transformer's fundamental design: predicting one token at a time.
**Scaling is not the answer.** Even GPT-4o, one of the largest models available, fails at counting and exhibits the reversal curse. Scaling increases training data coverage, making some specific cases appear solved, but the structural limitations remain.
**Workarounds exist, but fundamental solutions do not.** RAG bypasses the reversal curse. CoT bypasses counting limitations. Step-by-step decomposition bypasses compositional reasoning failures. But all of these incur additional costs and only work under specific conditions.
Coming Next
Part 2 covers Cognitive Biases: anchoring bias, order bias, sycophancy, and confirmation bias -- failures rooted in RLHF and biased training data that are improvable but currently observed across all models.
Series Index
- Overview: Are LLMs Really Smart? A Complete Guide to AI Reasoning Failures
- Part 1: Structural Limitations -- Reversal Curse, Counting, Compositional Reasoning (this post)
- Part 2: Cognitive Biases -- Anchoring, Order Bias, Sycophancy, Confirmation Bias
- Part 3: Common Sense and Cognition -- Theory of Mind, Physical Common Sense, Working Memory
- Notebook: Full Experiment Code (Jupyter Notebook)
Reference: Song, P., Han, P., & Goodman, N. (2025). Large Language Model Reasoning Failures. Transactions on Machine Learning Research (TMLR), 2026.