Claude Sonnet 4.6: Opus-Level Performance, 40% Cheaper — Benchmark Deep Dive
Claude Sonnet 4.6 scores 79.6% on SWE-bench, 72.5% on OSWorld, and 1633 Elo on GDPval-AA — matching or beating Opus 4.6 on production tasks. $3/$15 vs $5/$25 per M tokens. Analysis of Adaptive Thinking, Context Compaction, and OSWorld growth trajectory.

Did Sonnet Just Beat Opus? — Claude Sonnet 4.6 Benchmark Deep Dive
Anthropic released Claude Sonnet 4.6 on February 17, and it outperforms the flagship Opus 4.6 on several key benchmarks at roughly 40% lower cost. The secret isn't a "cheaper knock-off"; it's structural changes at the architecture level.
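The "roughly 40%" figure follows directly from the published per-million-token prices ($3 input / $15 output for Sonnet 4.6, $5 / $25 for Opus 4.6). A minimal sketch, with an illustrative (not benchmarked) workload:

```python
# Published per-million-token prices (input, output), in USD.
SONNET_46 = (3.00, 15.00)
OPUS_46 = (5.00, 25.00)

def workload_cost(prices, input_mtok, output_mtok):
    """Cost in USD for a workload measured in millions of tokens."""
    in_price, out_price = prices
    return in_price * input_mtok + out_price * output_mtok

# Illustrative workload: 10M input tokens, 2M output tokens.
sonnet = workload_cost(SONNET_46, 10, 2)   # $60
opus = workload_cost(OPUS_46, 10, 2)       # $100
savings = 1 - sonnet / opus
print(f"Sonnet ${sonnet:.0f} vs Opus ${opus:.0f} -> {savings:.0%} cheaper")
```

Because both prices scale by the same factor (3/5 = 15/25 = 0.6), the 40% saving holds for any input/output token mix, not just this example.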
Opus vs Sonnet: What Changed?
The old Opus-Sonnet dynamic was straightforward. Opus was the full-spec brain; Sonnet was the compressed version. Same architecture, smaller size, naturally lower performance.
In the 4.6 generation, that formula breaks.
Where Sonnet Wins or Ties
| Benchmark | Sonnet 4.6 | Opus 4.6 | Gap |
|---|---|---|---|
| SWE-bench Verified (Coding) | 79.6% | 80.8% | Opus +1.2pp (near parity) |
| OSWorld Verified (Computer Use) | 72.5% | 72.7% | Effectively tied |
| GDPval-AA (Knowledge Work, Elo) | 1633 | 1606 | Sonnet +27 Elo |
| Finance Agent (Agentic Finance, Vals AI) | 63.30% | 60.05% | Sonnet +3.25pp |
In coding and agentic tasks, Sonnet matches or beats Opus, at $3/$15 per M tokens to Opus's $5/$25.
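The 27-point Elo gap on GDPval-AA (1633 vs 1606) is easier to read as a head-to-head win probability. Under the standard Elo expected-score formula (an assumption for illustration; Anthropic hasn't published pairwise win rates):

```python
def elo_win_probability(rating_a, rating_b):
    """Expected probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

p = elo_win_probability(1633, 1606)
print(f"Sonnet 4.6 expected win rate vs Opus 4.6: {p:.1%}")  # ~53.9%
```

In other words, a 27-Elo lead is real but modest: roughly a 54/46 split on matched knowledge-work tasks.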
Where Opus Clearly Wins
| Benchmark | Sonnet 4.6 | Opus 4.6 | Gap |
|---|---|---|---|
| ARC-AGI-2 (Abstract Reasoning) | 58.3% | 68.8% | Opus +10.5pp |
| HLE without Tools (Hard Problems) | 33.2% | 40.0% | Opus +6.8pp |
| HLE with Tools | 49.0% | 53.0% | Opus +4.0pp |
| MRCR v2 1M (Long Context) | — | 76% | Sonnet 4.6 not reported; Sonnet 4.5 scored 18.5% |