MiniMax M2.5: Opus-Level Performance at $1 per Hour

On February 12, 2026, Shanghai-based AI startup MiniMax released M2.5: 80.2% on SWE-bench Verified, 76.3% on BrowseComp, and 51.3% on Multi-SWE-Bench. All within 0.6%p of Claude Opus 4.6, at 1/20th the price.
The model is available as open weights on Hugging Face under a modified MIT license. It runs on a 230B parameter MoE architecture, activating only 10B at inference time. Running the 100 TPS (tokens per second) Lightning variant continuously for one hour costs about $1.
This post analyzes M2.5's architecture, training methodology, benchmark performance, and pricing structure, and examines what it means for the AI industry.
Architecture: 230B Total, 10B Active
MiniMax M2.5 uses a Mixture of Experts (MoE) architecture.
The core idea behind MoE: for each input token, only a subset of "expert" parameters are activated. This preserves the knowledge capacity of a 230B model while keeping actual compute at the level of a 10B model. That is the secret behind the price and speed.
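To make the routing idea concrete, here is a minimal top-k MoE layer sketch in PyTorch. The expert count, hidden sizes, and top-k value are illustrative placeholders, not M2.5's actual configuration, which this post does not detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Illustrative top-k MoE layer: each token is routed to k experts,
    so per-token compute scales with k, not with the total expert count."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # scores every expert for each token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                    # x: (tokens, d_model)
        scores = self.router(x)                              # (tokens, n_experts)
        topk_scores, topk_idx = scores.topk(self.top_k, -1)  # pick k experts per token
        weights = F.softmax(topk_scores, dim=-1)             # normalize over chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx, w = topk_idx[:, slot], weights[:, slot].unsqueeze(-1)
            for e in idx.unique().tolist():                  # run each selected expert on its tokens
                mask = idx == e
                out[mask] += w[mask] * self.experts[e](x[mask])
        return out
```

Only the selected experts' weights touch the compute path for a given token, which is why a 230B-parameter model can serve traffic at roughly the cost of a 10B dense model.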
It ships in two variants: Standard and Lightning. Lightning costs twice as much and runs at twice the speed; accuracy is identical.
Forge: A Reinforcement Learning Framework for Agents
The key to M2.5's performance is Forge, an in-house reinforcement learning (RL) framework.
Traditional LLM training works by "reading text and predicting the next token." Forge takes a different approach. It places the model in real environments and rewards it based on task completion.
Training environments:
- Over 200,000 real code repositories
- Web browsers (search, navigation, information gathering)
- Office applications (Word, Excel, PowerPoint)
- API endpoints and tool calls
Technical highlights of Forge:
- CISPO (Clipping Importance Sampling Policy Optimization): An algorithm that keeps large-scale RL training of MoE models stable by addressing gradient imbalance across experts (a minimal sketch follows this list).
- Process Reward: When the agent performs long tasks (tens of thousands of tokens), it evaluates not just the final result but also the quality of intermediate steps. This solves the credit assignment problem in long-context scenarios.
- Asynchronous scheduling + tree-structured sample merging: Achieved roughly 40x training speedup.
- Trajectory-based speed optimization: Trains the model to achieve the same performance with fewer tokens. 20% reduction in token usage compared to M2.1.
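The post does not spell out CISPO's exact formulation. Based on its public description, the distinguishing move is to clip the importance-sampling weight itself and treat it as a constant, rather than clipping the update in a way that zeroes out gradients for off-policy tokens. The sketch below is a hedged illustration of that idea; the clipping threshold and the absence of a lower bound are assumptions.

```python
import torch

def cispo_style_loss(logp_new, logp_old, advantages, eps_high=2.0):
    """Hedged sketch of a clipped-importance-sampling objective:
    cap and detach the ratio pi_new/pi_old so extreme off-policy tokens are
    down-weighted instead of having their gradient contribution dropped."""
    ratio = torch.exp(logp_new - logp_old)       # per-token importance ratio
    weight = torch.clamp(ratio, max=eps_high).detach()  # clipped, treated as a constant
    # REINFORCE-style term: gradient flows through logp_new, scaled by the weight
    per_token = -weight * advantages * logp_new
    return per_token.mean()
```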
As a result, M2.5 was not trained as "a model that can write code" but as "an agent that designs and executes projects." It analyzes architecture, decomposes features, and designs UI before writing a single line of code.
Internally at MiniMax, M2.5-generated code reportedly accounts for 80% of newly committed code.
Benchmarks: What the Numbers Say
Coding Performance
Only 0.6%p separates M2.5 from Opus 4.6 on SWE-bench Verified. On Droid and OpenCode harnesses, M2.5 actually comes out ahead. Task completion time is on par with Opus and 37% faster than M2.1.
On Multi-SWE-Bench (multilingual coding), it ranks first in the industry. The effect of Forge training across 13 programming languages is clear.
Search and Tool Use
BrowseComp evaluates a model's ability to navigate the web and answer complex questions. M2.5 surpassed both GPT-5.2 and Gemini 3 Pro. It uses a strategy of discarding history when context exceeds 30% of the maximum length.
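The history-discarding behavior can be approximated with a short sketch. The 30% threshold comes from the post; the message structure, token-counting callback, and drop-oldest-first policy are assumptions, since the exact mechanism (drop, summarize, or compress) is not documented here.

```python
def trim_history(messages, count_tokens, max_context, threshold=0.30):
    """Sketch: once the running context exceeds `threshold` of the maximum
    context length, drop the oldest non-system turns until it fits again.
    `count_tokens` is an assumed tokenizer callback."""
    budget = int(max_context * threshold)
    total = sum(count_tokens(m["content"]) for m in messages)
    trimmed = list(messages)
    while total > budget and len(trimmed) > 1:
        # keep the system prompt (assumed to be first); drop the oldest turn after it
        drop_at = 1 if trimmed[0].get("role") == "system" else 0
        removed = trimmed.pop(drop_at)
        total -= count_tokens(removed["content"])
    return trimmed
```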
General Knowledge and Reasoning
In general reasoning, M2.5 falls behind Opus 4.6. A 9.3-point gap on AIME25, 4.8 points on GPQA-D. This is where M2.5 hits its ceiling. It matches Opus in coding and agentic tasks, but there is a clear gap in pure reasoning ability.
Office and Productivity
M2.5 scored a 59.0% win rate against mainstream models on GDPval-MM, a benchmark evaluating Word, PowerPoint, and Excel tasks. Through MiniMax Agent, it also offers automatic loading of Office Skills based on file type.
Pricing Comparison: The Real Story
More striking than the benchmark numbers is the pricing.
For M2.5 Standard:
- vs. Opus 4.6: 1/33 input cost, 1/21 output cost. SWE-bench gap is just 0.6%p.
- vs. GPT-5.2: 1/12 input cost, 1/12 output cost. SWE-bench is actually 0.2%p higher.
- vs. Sonnet 4.5: 1/20 input cost, 1/13 output cost. SWE-bench is 3%p higher.
With the same budget ($100), M2.5 lets you process 20-30x more tokens than Opus. In agent workflows, this difference shifts the boundary between "possible" and "not possible."
For continuous operation, the gap compounds: running 4 instances year-round costs roughly $10,000 with M2.5 versus roughly $200,000 with Opus.
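To see how the ratios translate into budgets, here is a back-of-the-envelope calculator using the 3:1 input:output mix discussed later. The per-million-token prices are placeholders chosen to match the ~1/33 input and ~1/21 output ratios quoted above; only M2.5's $0.15/M input price appears in this post, so treat the rest as assumptions and check current price sheets.

```python
def blended_price(price_in, price_out, in_ratio=3, out_ratio=1):
    """Blended $/M tokens at a 3:1 input:output mix (typical agent pattern)."""
    return (price_in * in_ratio + price_out * out_ratio) / (in_ratio + out_ratio)

def tokens_per_budget(budget_usd, price_in, price_out):
    """Millions of blended tokens a budget buys at the 3:1 mix."""
    return budget_usd / blended_price(price_in, price_out)

# Placeholder prices in $/M tokens -- illustrative, not official price sheets.
m25  = {"in": 0.15, "out": 1.20}   # $0.15/M input is from the post; output is assumed
opus = {"in": 5.00, "out": 25.00}  # assumed, roughly matching the ~1/33 and ~1/21 ratios

if __name__ == "__main__":
    for name, p in [("M2.5", m25), ("Opus-class", opus)]:
        print(f"{name}: {tokens_per_budget(100, p['in'], p['out']):.0f}M tokens per $100")
```

With these placeholder numbers the same $100 buys roughly 240M blended tokens on M2.5 versus about 10M on an Opus-class model, which is where the 20-30x figure comes from.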
Benchmark Caveats
There is important context when interpreting SWE-bench Verified/Pro scores.
These benchmarks measure the combined performance of "model + agent harness + tools + prompt + number of runs," not the model in isolation. The same model can vary by 5-10%p depending on which scaffold (agent framework) is used.
For example:
- OpenAI reported GPT-5.2's SWE-bench Verified at 80%, marking it "not plotted" and describing the evaluation setup separately.
- M2.5's 80.2% is based on MiniMax's own agent scaffold.
- Scores may differ when measured through OpenHands (a third-party framework).
So rather than concluding "M2.5 = Opus," the real takeaway is that this level of performance is now achievable at this price point. It is important to distinguish between third-party measurements (Artificial Analysis, OpenHands Index, etc.) and vendor-reported numbers.
Budget Model Showdown: M2.5 vs Gemini 2.5 Flash vs Flash-Lite
M2.5's pricing is striking, but Google's Gemini lineup is no pushover either. Gemini 2.5 Flash targets the balanced middle ground, while Flash-Lite goes for ultra-low cost. The three models occupy entirely different positions.
Total cost is calculated at a 3:1 input:output ratio (a typical agent usage pattern). M2.5 costs half as much as Flash while scoring 2x on Intelligence Index and 1.5x on SWE-bench.
Artificial Analysis's Intelligence Index v4.0 aggregates 10 benchmarks including GDPval-AA, Terminal-Bench Hard, SciCode, GPQA Diamond, and Humanity's Last Exam. M2.5 (42) is 2x Flash (21) and 3x Flash-Lite (13).
Selection criteria for the three models:
- Complex coding, agent workflows, hard reasoning -> MiniMax M2.5 (performance first)
- Decent performance + fast responses + wide context -> Gemini 2.5 Flash (balanced)
- High-volume simple classification, translation, summarization -> Gemini 2.5 Flash-Lite (cost first)
M2.5 is "affordable Opus," Flash is "affordable Sonnet," and Flash-Lite is "affordable Haiku."
What Open Weights Means
M2.5 is released on Hugging Face under a modified MIT license. There is one condition: commercial use requires displaying "MiniMax M2.5" in the UI.
Local deployment options:
Although it is a 230B MoE, the active parameter count is only 10B, making it feasible to run on consumer GPUs with appropriate quantization. Unsloth provides GGUF quantized versions.
Why this matters: you can run Opus 4.6-class coding performance without API calls, on your own infrastructure, without sending data externally. This becomes a meaningful option for environments with enterprise security requirements.
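As a sketch of what local use might look like with the Unsloth GGUF builds and llama-cpp-python: the repository name, GGUF filename, and quantization level below are assumptions, so verify the actual listings on Hugging Face first, and expect a very large download even at low-bit quantization.

```python
# pip install llama-cpp-python huggingface_hub
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Repo and filename are illustrative placeholders -- check the real
# Unsloth/MiniMax GGUF pages on Hugging Face before downloading.
model_path = hf_hub_download(
    repo_id="unsloth/MiniMax-M2.5-GGUF",   # assumed repo name
    filename="MiniMax-M2.5-Q4_K_M.gguf",   # assumed quant/filename
)

llm = Llama(
    model_path=model_path,
    n_ctx=32768,       # context window to allocate locally
    n_gpu_layers=-1,   # offload as many layers as the GPU allows
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function that reverses a linked list."}]
)
print(out["choices"][0]["message"]["content"])
```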
Limitations and Caveats
M2.5 is not a silver bullet. It has clear weaknesses.
Gap in pure reasoning: AIME25 86.3 vs Opus 95.6, GPQA-D 85.2 vs 90.0. In mathematical reasoning and science problems, it clearly trails the Western flagship models.
Real-world issues (per OpenHands reports):
- Occasionally targets the wrong git branch
- Misses instructions (ignores directives to use specific markup tags)
- Inconsistent instruction following
Scaffold dependency: Benchmark performance depends heavily on the scaffold. MiniMax's own scaffold yields 80.2%, but other frameworks may produce different results.
China-based company risk: There are non-technical considerations around data sovereignty, regulatory environment changes, and service reliability. Being open weights, local deployment mitigates some of these concerns.
What Has Changed
M2.5 carries the slogan "Intelligence too cheap to meter" — a riff on the 1954 prediction that nuclear power would become "too cheap to meter."
It is an exaggerated slogan, but the direction is right:
- Opus-level coding performance is now available at 1/20th the price.
- An open-weights model has surpassed Claude Sonnet-level performance for the first time (per OpenHands).
- Achieving frontier performance with 4% parameter activation (10B out of 230B) validates the efficiency of MoE architecture.
- Forge's "learn in the environment" paradigm presents a training methodology suited for the agent era.
Six months ago, an 80% score on SWE-bench was only possible with Opus. Now it is possible at $0.15/M input tokens.
The price-performance curve for AI models is dropping faster than Moore's Law. M2.5 is the latest data point on that curve.
Summary
MiniMax M2.5 delivers Opus-class coding and agentic performance at roughly 1/20th the price, as open weights under a modified MIT license and a 230B-total, 10B-active MoE design. Its clear gaps are pure reasoning (AIME25, GPQA-D) and the usual caveats around vendor-reported, scaffold-dependent benchmarks. The economics, more than the scores alone, are the story.
References
- MiniMax, "MiniMax M2.5: Built for Real-World Productivity." MiniMax News, 2026.
- OpenHands, "MiniMax M2.5: Open Weights Models Catch Up to Claude Sonnet." OpenHands Blog, 2026.
- Artificial Analysis, "MiniMax-M2.5 - Intelligence, Performance & Price Analysis." 2026.
- MiniMaxAI, "MiniMax-M2.5." Hugging Face Model Card, 2026.
- VentureBeat, "MiniMax's new open M2.5 and M2.5 Lightning near state-of-the-art while costing 1/20th of Claude Opus 4.6." 2026.