Anthropic released Opus 4.6 three days ago. MiniMax dropped M2.5 yesterday. The tech community lost its mind.
Not because Opus 4.6 is bad. It's excellent. But because M2.5 costs one-twentieth the price and matches it on coding benchmarks. On SWE-Bench Verified, M2.5 scored 80.2%, landing in roughly the same range as Opus 4.6. That's not supposed to happen. Frontier models from Anthropic are supposed to dominate. They usually do.
You've probably been watching your API bills climb. I know I have. Every time you spin up an agent loop, every time you run a coding task, the dollars add up. And you think, "Well, I need the best model." But what if you don't?
What if the open-source model that costs $0.72 per task delivers the same result as the $5.63 proprietary one?
Why everyone's talking about this
M2.5 isn't just cheap. It's fast too.
Runs 37% faster than M2.1 on complex agentic tasks. Matches Opus 4.6's speed while costing dramatically less. The throughput is wild. 100 tokens per second on the Lightning version. That's double what most frontier models deliver.
And the context window. 204.8K tokens. Not the million-token monster that Opus 4.6 has, but enough for most real work.
Here's what actually matters though. On Multi-SWE-Bench, M2.5 scored 51.3%. That's a benchmark where models have to fix bugs across multiple files and repos. It's hard. Most models fail badly. M2.5 doesn't.
One developer ran both models on the same coding test. M2.5 cost $0.72. Opus 4.6 cost $5.63. Both got 100% completion.
I used to think price reflected quality. Pay more, get better results. Sometimes that's true. Not here.
The benchmarks that matter
Let me tell you about SWE-Bench Verified. It's the test everyone watches.
Real GitHub issues. Real codebases. Models have to understand the problem, write the fix, and make it work. No hand-holding. Opus 4.6 does great on this. So does M2.5. They're basically tied.
But then you look at BrowseComp. This tests how well models navigate web interfaces and gather information. M2.5 hit 76.3% with context management. That's strong. Not the best, but strong.
The first time I tried running agentic tasks on M2.1, it burned through rounds inefficiently. Kept searching and re-searching. Made the same API calls twice. Annoying.
M2.5 learned. Uses 20% fewer rounds than M2.1 to solve the same problems. Thinks before it acts. Plans better. That's the difference between a model that works and one you'd actually use in production.
Here's a question people always ask: does it hallucinate?
Yes. All models do. But one Reddit user noted Opus 4.6 tends to hallucinate a bit less. The user experience feels cleaner. M2.5 will confidently give you wrong answers sometimes. So will Opus. Just slightly less often.
Where M2.5 actually wins
Speed and cost. That's the real story.
$1 to run the model continuously for an hour at 100 tokens per second.
At 50 tokens per second, it costs $0.30 per hour. Compare that to Opus 4.6. You're paying $25 per million output tokens on Opus. M2.5 charges $2.40 per million output tokens on Lightning. That's roughly one-tenth the cost.
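The hourly figures are just throughput times price. Here's the back-of-the-envelope math as a tiny helper (the function name is mine; the prices are the article's figures, so verify them against current pricing pages before relying on them):

```python
# Rough continuous-run cost: tokens/sec * seconds/hour, scaled by $/million tokens.
def hourly_cost(tokens_per_sec: float, usd_per_million_tokens: float) -> float:
    return tokens_per_sec * 3600 / 1_000_000 * usd_per_million_tokens

m25 = hourly_cost(100, 2.40)   # M2.5 Lightning at 100 tok/s: about $0.86/hour
opus = hourly_cost(100, 25.0)  # Opus 4.6 output pricing at the same speed: $9.00/hour
print(f"M2.5: ${m25:.2f}/hr, Opus: ${opus:.2f}/hr")
```

At the same throughput, that's roughly a 10x gap per hour of continuous output, which is where the one-tenth figure comes from.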
What actually happens is this. You build an agent. It needs to loop through tasks. Call APIs. Write code. Review it. Fix bugs. That's a lot of tokens. On Opus, you're nervous about the bill. On M2.5, you just let it run.
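That loop-and-spend dynamic is easy to see in miniature. This is a sketch, not anyone's real API: `run_agent`, `stub_model`, and the 4-characters-per-token estimate are all illustrative assumptions standing in for a real OpenAI-compatible model call.

```python
# Minimal agent loop: feed the growing transcript back to the model until it
# declares the task done, tracking a rough running total of output-token cost.
def run_agent(task, call_model, usd_per_million_tokens, max_rounds=20):
    history = [f"TASK: {task}"]
    spent = 0.0
    for _ in range(max_rounds):
        reply = call_model("\n".join(history))
        # Crude estimate: ~4 characters per output token.
        spent += (len(reply) / 4) / 1_000_000 * usd_per_million_tokens
        history.append(reply)
        if reply.startswith("DONE"):
            break
    return history, spent

def stub_model(prompt):
    # Pretend model: takes two steps, then declares the task finished.
    steps = prompt.count("step")
    return "DONE" if steps >= 2 else f"step {steps + 1}: edit code, run tests"

history, cost = run_agent("fix the failing test", stub_model, 2.40)
print(len(history), f"${cost:.6f}")
```

Every round appends to the transcript, so token usage compounds. Swap the per-million price and the same loop that costs cents on M2.5 costs dollars on Opus.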
Most tutorials tell you to optimize for quality first. Then worry about cost. That made sense when cheap models sucked. Now the cheap model performs like the expensive one. The math changed.
My coworker migrated an entire agentic pipeline from Opus 4.5 to M2.5 last week. Took him four hours. Works fine. Costs dropped 85%. He's happy. Finance is happy.
Coding and agentic tasks
M2.5 does something interesting before writing code.
It plans like an architect. Writes specs first. Breaks down features. Thinks about structure and UI design. Then codes. That's not how most models work. They just start typing.
The problem isn't what you think. It's not about syntax errors or missing semicolons. It's about whether the model understands what you're building. Whether it can hold context across multiple files. Whether it remembers what it did three steps ago.
M2.5 handles full-stack projects. Web, Android, iOS, Windows. Server-side APIs. Databases. Business logic. Not just frontend demos. The VIBE benchmark tests this. M2.5 performs on par with Opus 4.5.
Look, one developer tested it on novel writing. Failed hard. Couldn't handle a simple prompt asking for one plot file split into five chapters. So it's not magic. It's optimized for coding and agents, not creative writing.
The open-source angle
Here's where it gets weird. M2.5 has open weights.
You can download it. Run it locally. Modify it. That's not true for Opus 4.6. Anthropic keeps that locked down. For some teams, that matters. For others, it doesn't.
If you're running sensitive workloads, you might want the model on your infrastructure. No API calls. No data leaving your network. M2.5 lets you do that. Opus doesn't.
But running models locally is annoying. You need GPUs. You need memory. You need to deal with quantization and inference optimization. Most developers don't want that headache. They just want an API that works.
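If you do go local, the common path is an OpenAI-compatible inference server like vLLM. A hedged config sketch: the Hugging Face repo id below is a guess at where the weights would live, and the GPU count depends entirely on your hardware, so check both before running anything.

```shell
# Assumption: repo id and tensor-parallel size are illustrative, not confirmed.
pip install vllm

# Serve the open weights behind an OpenAI-compatible endpoint.
vllm serve MiniMaxAI/MiniMax-M2.5 \
  --tensor-parallel-size 8 \
  --max-model-len 204800

# Any OpenAI-style client can then point at http://localhost:8000/v1
```

That last line is the payoff: existing agent code written against an OpenAI-style API can switch between hosted and local with a base-URL change.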
The "open-source" label matters less than whether the model solves your problem at a price you can afford.
The tangent about naming
Why do Chinese AI labs pick names like this?
MiniMax. DeepSeek. GLM. They're trying to sound international. Accessible. But "MiniMax" sounds like a budget airline or a storage unit company. Not a frontier AI model.
Anthropic named their models after literary and musical forms. Opus. Sonnet. Haiku. It's pretentious but memorable. OpenAI went with GPT and a number. Boring, but clear.
MiniMax went with M2.5. At least it's versioned. You know M2.5 is better than M2.1. But "MiniMax" itself? I still can't take it seriously. Every time I type it, I think about discount furniture stores.
The honest truth
Most people don't need Opus 4.6.
If you're building agents, M2.5 will probably work fine. If you're running coding tasks, M2.5 will probably work fine. If you need the absolute best quality with slightly fewer hallucinations, and money isn't a concern, pay for Opus.
But money is always a concern. Even when people say it isn't.
Either model is overkill for simple chatbots. For basic Q&A. For anything that doesn't require deep reasoning or multi-step task chains.
The biggest misconception is that open-source means worse. It used to. Not anymore. M2.5 proved that. The gap closed. Fast.
If you're locked into Anthropic's ecosystem, switching is hard. If you're just starting, M2.5 makes sense. If you're cost-sensitive and need production-ready agents, M2.5 makes a lot of sense.
So what now
I still think Opus 4.6 is the better model. Slightly.
The user experience is smoother. It catches its own mistakes better. It handles edge cases more reliably. But "better" doesn't mean "worth 10x more." Not for most use cases.
You'll choose based on what you're building. If you need reliability above all else, go Opus. If you need speed and cost-efficiency, go M2.5. If you're experimenting, definitely go M2.5.
The next few months will be interesting. Anthropic will probably drop their prices. MiniMax will probably release M2.6. The benchmarks will shift. But right now, today, M2.5 is the better deal for most developers.
And that's annoying if you just bought a bunch of Opus credits.
