AI ResearchKR

MIRAGE — Do Multimodal AIs Actually "See" Images?

GPT-5.1, Gemini 3 Pro, and Claude Opus 4.5 retain 70-80% of benchmark scores without any image input. A 3B text-only model outperforms all multimodal models and radiologists on chest X-ray benchmarks. Stanford MIRAGE paper review.

MIRAGE — Do Multimodal AIs Actually "See" Images?

MIRAGE — Do Multimodal AIs Actually "See" Images?

Show GPT-5.1 a chest X-ray and it says "consolidation is visible in the right lower lobe." Now remove the image and ask the same question. It gives nearly the same answer — with zero uncertainty, zero hedging.

A Stanford research team published *MIRAGE: The Illusion of Visual Understanding* on arXiv in March 2026, revealing one of the most uncomfortable findings in current AI research. When frontier multimodal models — GPT-5, Gemini 3 Pro, Claude Opus 4.5 — were given image-related questions without any image, they confidently described images they never received.

Without image access, models retained 70-80% of their benchmark scores. A 3-billion parameter text-only model outperformed every multimodal model — and radiologists — on a chest X-ray benchmark.

This paper reveals an uncomfortable truth: much of what we call "multimodal understanding" may actually be text pattern matching.

The Mirage Effect — Normal processing vs mirage behavior. Without images, models generate responses indistinguishable from real analysis.

1. What Is Mirage Reasoning?

The research team defines Mirage Reasoning as:

"An AI model generating an answer that describes non-existent visual inputs without expressing uncertainty, lack of confidence, or acknowledging an assumption or hypothetical scenario."

This is distinct from typical hallucination.

HallucinationMirage Reasoning
DefinitionProcessing real input but filling in wrong detailsFabricating the input itself and reasoning on top of it
Example"This X-ray shows a nodule in the left lung" (actually right)"This X-ray shows consolidation in the right lung" (no image provided at all)
Core problemWrong factsConstructing a false epistemic frame

Mirage isn't just getting the wrong answer. The model creates a false premise that it received an image, then builds elaborate reasoning on top of it — without expressing uncertainty or acknowledging the absence of visual input.

Concrete Examples

Mirage cases reported in the paper:

  • When asked to identify a tissue pathology slide, GPT-5 and Gemini 3 Pro confidently answered "This is cardiac muscle tissue" — with no image provided
  • Asked whether a bird is flying or sitting: "It is flying with wings spread"
  • Asked to identify a celestial body: "It is Mars"
  • Models even "read" license plates, expiration dates, and locations from non-existent images

Every response lacked any expression of uncertainty. They were indistinguishable from responses based on actual image analysis.

2. Experimental Setup

Models Tested

The paper conducted two distinct experiments with different model sets.

Phantom-0 (Mirage Rate Measurement) — how confidently models answer without images:

VendorModelMirage Rate
OpenAIGPT-4.143%
OpenAIGPT-570%
OpenAIGPT-5.193.5%
OpenAIGPT-5.1 Thinking90.5%
OpenAIGPT-5.270%
OpenAIGPT-5.2 Thinking73.5%
GoogleGemini 2.5 Pro58%
GoogleGemini 2.5 Pro Think47%
GoogleGemini 3 Pro81%
GoogleGemini 3 Pro Think80%
AnthropicClaude Opus 4.586%
AnthropicClaude Sonnet 4.575%

Benchmark Mirage Score Calculation — how much of benchmark accuracy survives without images:

  • GPT-5.1, Gemini 3 Pro, Gemini 2.5 Pro, Claude Opus 4.5 (4 models)
Note: GPT-5 and GPT-5.2 appeared in the Phantom-0 experiment but were not included in benchmark mirage score calculations.
Mirage rates across frontier models on Phantom-0. Every model scored above 43%, with GPT-5.1 reaching 93.5%.

Benchmarks

Medical:

  • VQA-RAD — Radiology visual question answering
  • MedXpertQA-MM — Medical multiple-choice questions
  • MicroVQA — Microscopy visual question answering
  • ReXVQA — Chest X-ray question answering

General-purpose:

  • MMMU-Pro — Academic questions across 30 subjects
  • Video-MMMU — Video understanding
  • Video-MME — Video multi-evaluation

Custom:

  • Phantom-0 — 200 questions across 20 categories, designed for image-absent evaluation

Method

Mirage-mode: Send the benchmark's visual questions as-is, but remove the image. Don't tell the model the image is missing. The model implicitly assumes images are present.

Guess-mode: Same conditions, but explicitly state: "The image has been removed. Take your best guess based on your knowledge."

3. Key Results

3-1. No Frontier Model Scored Below 43%

On the Phantom-0 benchmark, every frontier model showed at least a 43% mirage rate. GPT-5.1 reached 93.5%, Claude Opus 4.5 hit 86%.

An even more striking finding: adding standard system prompt instructions commonly used in evaluation workflows pushed mirage rates to 90-100%. Prompt tuning for evaluation actually amplifies this vulnerability.

3-2. 70-80% Benchmark Accuracy Without Images

Scores achieved without image access consistently exceeded the additional score gained from providing images.

Medical Benchmarks (Highest Susceptibility):

BenchmarkWithout ImageAdditional w/ ImageMirage Score
VQA-RAD85%+10%89%
MicroVQA75%+15%83%
MedXpertQA-MM80%+12%87%

General-Purpose Benchmarks:

BenchmarkWithout ImageAdditional w/ ImageMirage Score
MMMU-Pro65%+23%72%
Video-MMMU62%+29%68%
Video-MME63%+27%70%

Mirage Score = accuracy without images / accuracy with images. An ideal benchmark would have this near 0%. Reality: medical benchmarks 80-99%, general benchmarks 60-80%.

Benchmark mirage scores. Medical benchmarks: 80-99%. General benchmarks: 60-80%. Non-visual performance consistently exceeds the visual component.

3-3. A 3B Text Model Beat GPT-5.1 and Radiologists

The paper's most striking result:

A 3-billion parameter text-only model (Qwen2.5-3B-Instruct), fine-tuned on just 10,000 image-free samples from ReXVQA, outperformed every frontier multimodal model and the average radiologist by over 10% on the held-out chest X-ray benchmark, ranking #1. (Figure 3d)

This model never saw a single X-ray. It learned to predict answers purely from 10,000 question-answer text pairs. The "AI surpasses radiologists" headlines may have been measuring text pattern memorization, not visual understanding.

3-4. Pathology Bias: AI's Imagined Images Are Full of Disease

What makes mirage particularly dangerous is the content itself. When given medical questions without images, models diagnosed severe pathologies rather than normal findings:

  • Brain MRI → glioblastoma, meningioma, multiple sclerosis
  • Chest X-ray → pneumonia, pleural effusion
  • ECG → ST-elevation myocardial infarction (STEMI) — an emergency diagnosis requiring immediate intervention
  • Dermatology → malignant melanoma
  • Pathology → carcinoma

While "normal" appeared among responses, pathological diagnoses cumulatively dominated. In healthcare, this pattern directly translates to false positive diagnoses.

Image upload failures happen routinely in practice. If a patient sends a message without attaching a photo and the AI starts with "analyzing your image..." — someone will mistake this for a real analysis. The paper notes that over 230 million people use health-related AI services daily.

3-5. Mirage vs Guess: Two Operating Modes

The team's third key finding: when explicitly told "there's no image, just guess" (Guess-mode), GPT-5.1 scored significantly lower than in Mirage-mode. Mirage-mode outperformed in 23 of 28 MMMU-Pro categories.

This suggests a mechanism beyond simple probabilistic guessing:

  • Mirage-mode: Implicitly assumes images are present. Leverages "hidden patterns" from text cues, generating more confident and specific responses
  • Guess-mode: Recognizing it's "just guessing," the model switches to a conservative response mode. It doesn't access the same patterns

Strong evidence that two distinct operating modes exist within these models.

4. Thinking Mode Actually Amplifies Mirages

One more finding deserves special attention. When all models were tested in Extended Thinking mode, both original accuracy and mirage accuracy increased.

As shown in the infographic:

  • GPT-5.1 Thinking: 90.5% mirage rate (comparable to standard mode's 93.5%)
  • GPT-5.2 Thinking: 73.5% (actually higher than standard mode's 70%)
  • Gemini 3 Pro Think: 80% (nearly identical to standard mode's 81%)

The reasoning traces that appear transparent and trustworthy may actually be fabricated reasoning about non-existent images. The assumption that "showing the thinking process makes it more trustworthy" becomes dangerous in the mirage context. Long, detailed reasoning chains may be nothing more than elaborate fiction about images that were never provided.

5. Why Does This Happen?

Statistical Patterns in Training Data

Multimodal models learn statistical correlations between question patterns and answers during pretraining. If "pneumonia" is the correct answer for a high percentage of "What do you see in this chest X-ray?" questions, the model can exploit that pattern without looking at the image.

The researchers found that simply adding dataset names to benchmark questions significantly boosted model accuracy — strong evidence that models memorize problem patterns and structures during pretraining.

The Shortest Path Problem

Models choose the easiest path to the correct answer. If exploiting text cues is easier than actually analyzing the image, the model ignores the image. As the paper puts it: models "ignore the visual information and rely only on their vast prior knowledge, taking the shortest route to the correct answer."

Statistics-Dominated Medical Data

Medical data has particularly strong statistical patterns. Specific symptom combinations correlate highly with specific diagnoses, enabling high accuracy from text alone. This explains why medical benchmark Mirage Scores (80-99%) far exceed general benchmarks (60-80%).

Mirage Is Not a Bug — It's a Training Artifact

The paper's core insight: the mirage effect is a structural consequence of joint image-text training. During the process of learning from images and text together, models naturally acquire the pattern of responding confidently to visual questions even without images. This isn't a fixable bug — it's a structural limitation of current training approaches.

6. The Fix: B-Clean Benchmarks

The paper proposes B-Clean, a framework for removing questions solvable by text alone.

The B-Clean Solution — Remove non-visual cues, keep test sets private, always report Mirage Score. Three real-world risks and how to address them.

Core Principles

First, remove all text cues from benchmark questions that enable answers without images. This includes dataset names, domain hints, and structural patterns in questions.

Second, keep test sets private to prevent data contamination during pretraining. Currently all major benchmarks are public, allowing models to memorize question patterns during training.

Third, run every benchmark twice — with images and without — and report both scores transparently. A high Mirage Score means the benchmark is measuring text reasoning, not visual understanding.

Step-by-Step Process

Step 1 — Mirage-mode evaluation: Evaluate all candidate models on the benchmark without images. Identify questions answered correctly without visual input.

Step 2 — Remove compromised questions: Remove the union of all questions any model answered correctly without images.

Step 3 — Pattern removal (optional): Fine-tune a text-only model on the compromised data; remove any additional questions it gets right.

Step 4 — Vision-grounded evaluation: Re-evaluate with images on the cleaned B-Clean benchmark.

Post-B-Clean Scores (Figure 5)

ModelMMMU-Pro (orig → clean)MedXpertQA-MM (orig → clean)MicroVQA (orig → clean)
GPT-5.176.0% → 67.1%65.5% → 41.1%61.5% → 15.4%
Gemini 3 Pro81.0% → 72.8%77.8% → 52.3%68.8% → 23.2%
Gemini 2.5 Pro68.0% → 68.2%65.9% → 39.1%63.5% → 22.1%

GPT-5.1 dropping from 61.5% to 15.4% on MicroVQA means over 75% of its original score came from text patterns, not actual visual understanding. Most of its "microscopy analysis" was text pattern matching.

7. Why This Matters Now

The problem this paper raises isn't a technical bug. It's a structural trust issue.

"AI Surpasses Human Experts" Claims

Headlines claiming "AI exceeds human expert level" may have been inflated by the mirage effect. AI performance in sensitive domains like medical imaging and pathology diagnosis may have been overestimated. The 3B text-only model outperforming radiologists demonstrates that the "surpassing" was statistical pattern memorization, not visual analysis.

Image Upload Failures Are Common

If a patient sends a message without attaching a photo and the AI responds with "analyzing your image..." — someone will mistake this for real analysis. File attachment errors, network failures preventing image transmission — these are everyday occurrences in practice.

Benchmark Leaderboard Reliability

"81% on MMMU-Pro" may no longer mean "this model understands images well." In every single model-benchmark pair tested, non-visual performance exceeded the visual component. Leaderboard rankings likely reflect text reasoning ability, not visual understanding.

Modality Ablation as Standard Practice

The paper recommends making modality ablation — testing by removing input modalities one at a time — a standard practice in all multimodal evaluation. Simply reporting Mirage Scores alongside standard scores would provide tremendous insight into a benchmark's actual value.

Wrap-up

The core question MIRAGE poses is simple:

Do models score high because they understand images, or because they understand questions?

Most current benchmarks cannot distinguish between the two. B-Clean is the first step toward making that distinction.

As trust in AI systems grows, distinguishing between genuine visual understanding and sophisticated text reasoning isn't just an academic question. Especially for healthcare organizations actively deploying medical AI, this paper is essential reading.

If you're deploying multimodal AI in production — especially in high-stakes domains like healthcare, autonomous driving, or security — benchmark scores alone aren't enough. You need to verify what the model is actually "seeing."

Paper: MIRAGE: The Illusion of Visual Understanding — Asadi et al. (Stanford, 2026)
🔒

Sign in to continue reading

Create a free account to access the full content.

Related Posts