AI ResearchKR

Claude Sonnet 4.6: Opus-Level Performance, 40% Cheaper — Benchmark Deep Dive

Claude Sonnet 4.6 scores 79.6% on SWE-bench, 72.5% on OSWorld, and 1633 Elo on GDPval-AA — matching or beating Opus 4.6 on production tasks. $3/$15 vs $5/$25 per M tokens. Analysis of Adaptive Thinking, Context Compaction, and OSWorld growth trajectory.

Did Sonnet Just Beat Opus? — Claude Sonnet 4.6 Benchmark Deep Dive

Anthropic released Claude Sonnet 4.6 on February 17, and it outperforms the flagship Opus 4.6 on several key benchmarks, at roughly 40% lower cost. The secret is not a "cheaper knock-off" of Opus; it is a set of architecture-level structural changes.

Opus vs Sonnet: What Changed?

The old Opus-Sonnet dynamic was straightforward. Opus was the full-spec brain; Sonnet was the compressed version. Same architecture, smaller size, naturally lower performance.

In the 4.6 generation, that formula breaks.

Where Sonnet Wins or Ties

| Benchmark | Sonnet 4.6 | Opus 4.6 | Gap |
| --- | --- | --- | --- |
| SWE-bench Verified (Coding) | 79.6% | 80.8% | 1.2 %p (effectively tied) |
| OSWorld Verified (Computer Use) | 72.5% | 72.7% | Effectively tied |
| GDPval-AA (Knowledge Work, Elo) | 1633 | 1606 | Sonnet wins |
| Finance Agent (Agentic Finance, Vals AI) | 63.30% | 60.05% | Sonnet wins |

In coding and agentic tasks, Sonnet matches or beats Opus. At $3/$15 per M tokens.
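The 40% figure follows directly from the published per-token prices ($3/$15 for Sonnet 4.6 vs $5/$25 for Opus 4.6, input/output per million tokens). A minimal sketch of the arithmetic, with a hypothetical workload size chosen for illustration:

```python
# Prices per million tokens, as quoted in the article.
SONNET = {"input": 3.00, "output": 15.00}   # Claude Sonnet 4.6, $/M tokens
OPUS   = {"input": 5.00, "output": 25.00}   # Claude Opus 4.6, $/M tokens

def job_cost(prices, input_mtok, output_mtok):
    """Cost in USD for a job measured in millions of tokens."""
    return prices["input"] * input_mtok + prices["output"] * output_mtok

# Hypothetical agentic workload: 10M input tokens, 2M output tokens.
sonnet = job_cost(SONNET, 10, 2)   # 3*10 + 15*2 = 60.0
opus   = job_cost(OPUS, 10, 2)     # 5*10 + 25*2 = 100.0
savings = 1 - sonnet / opus        # 0.40
```

Because both Sonnet prices are exactly 60% of the corresponding Opus prices, the 40% saving holds at any input/output mix, not just this example.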

Where Opus Clearly Wins

| Benchmark | Sonnet 4.6 | Opus 4.6 | Gap |
| --- | --- | --- | --- |
| ARC-AGI-2 (Abstract Reasoning) | 58.3% | 68.8% | Opus leads significantly |
| HLE without Tools (Hard Problems) | 33.2% | 40.0% | Opus wins |
| HLE with Tools | 49.0% | 53.0% | Opus wins |
| MRCR v2 1M (Long Context) | 76% | | Ref: Sonnet 4.5 = 18.5% |