When Gemini 3.1 Pro dropped on February 19, the developer community split within hours.
One side called it a massive leap. The other side said it's overpriced for what you actually get in production. Both had data. That's the problem.
Two weeks before Gemini showed up, Anthropic released Claude Opus 4.6. It jumped from 18.5% to 76% on long-context retrieval. It added a 1M token context window and four adaptive thinking modes. Developers were already shipping with it when Google complicated the comparison.
And the math got complicated immediately. Gemini 3.1 Pro doubled its ARC-AGI-2 reasoning score in roughly three months. It set the highest score ever on GPQA Diamond, a graduate-level science benchmark. It fixed the output truncation bug that made the previous version painful in production. On paper, clear winner.
But then people actually used both. And the benchmark story stopped being enough.
This isn't about which model scored 0.3% higher on an eval you'll never run. It's about what happens when you push something into production and need it to hold. You've probably picked a model based on a benchmark table, built half a pipeline, then switched. I know I have.
Reasoning: great on paper
I used to think doubling an abstract reasoning score meant doubling real-world usefulness.
Gemini 3.1 Pro scored 77.1% on ARC-AGI-2, double what Gemini 3 Pro managed. It also set the highest GPQA Diamond score ever recorded at 94.3%. Claude Opus 4.6 scored 91.3% on the same test.
Those numbers look decisive. Gemini wins. Close the tab.
Except a developer running a real multi-phase planning task got a different story. Opus 4.6 produced an eight-phase plan with tasks, subtasks, and detailed explanations. They used it directly. Gemini 3.1 Pro gave something shorter, skipped edge cases, and barely touched the code.
Other engineers noticed the same thing. Gemini's code reviews are brief. They skip edge cases. You can prompt it to be verbose and it helps, but you shouldn't have to.
The benchmark said Gemini won. The codebase had other opinions.
What actually got fixed
The first time I tried Gemini 3 Pro on a real pipeline, it stopped mid-response. Just cut off. No error. No warning. I had to manually chunk the output, stitch it back together, and waste an hour on something the model should have handled.
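The workaround looked roughly like this. A minimal sketch, assuming a hypothetical `generate(prompt)` callable that returns the text plus a stop reason; this is illustrative glue code, not a real SDK API:

```python
# Chunk-and-stitch workaround for mid-response truncation.
# `generate` is a placeholder for whatever client call you use;
# it must return (text, stop_reason).
def generate_complete(generate, prompt, max_rounds=5):
    """Keep re-prompting until the model reports a clean stop."""
    text, stop = generate(prompt)
    parts = [text]
    for _ in range(max_rounds):
        if stop == "stop":
            break
        # Feed the tail of what we have back so the model can resume mid-thought.
        tail = parts[-1][-2000:]
        text, stop = generate(
            "Continue exactly where this output left off, "
            "without repeating anything:\n..." + tail
        )
        parts.append(text)
    return "".join(parts)
```

It works, but every continuation round is another API call you're paying for, which is why the 3.1 fix matters.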
Gemini 3.1 Pro fixed that. After launch, users reported completing massive outputs in a single run. JetBrains confirmed it: fewer output tokens needed, more reliable results overall.
That fix matters more than the reasoning jump for most production workflows.
Here's what changed in 3.1:
ARC-AGI-2 jumped to 77.1%, doubled from Gemini 3 Pro
GPQA Diamond hit 94.3%, highest ever recorded
Output truncation is gone
Context window is 1M tokens native
But on agentic benchmarks, 3.1 Pro dropped to rank 19, down from 3 Pro's rank 7. It also costs more than double what the previous version did. Better at one-shot tasks. Worse at multi-step agent workflows.
That's not a footnote. For agent builders, that's the whole story.
Where Claude keeps winning
Here's a question people ask every time a new benchmark drops: if Gemini is cheaper and scores higher, why use Claude at all?
Because Claude wins where the charts stop.
BrowseComp tests autonomous web search. Gemini 3.1 Pro scored 85.9%. Opus 4.6 scored 84.0%. Basically tied.
But GDPval-AA tests expert tasks: data analysis, report writing, complex decision-making. Claude hit an Elo of 1606. Gemini landed at 1317. That's not a close race.
Companies running Opus 4.6 in production noticed the gap. Warp called it "the new frontier on long-running tasks." Shortcut AI said work that used to be hard for Opus 4.5 "suddenly became easy." Devin reported higher bug-catching rates with 4.6 in code review.
And Claude ships with four thinking modes now. Low for simple tasks. Max for hard problems. It allocates compute on its own without configuration.
For anything agent-heavy, Claude is still the model teams trust when it counts.
The cost math
Most tutorials tell you to ignore pricing until you're scaling. That's bad advice when you're picking a foundation.
Gemini 3.1 Pro: $2 per million input tokens, $12 output. Claude Opus 4.6: $5 per million input tokens, $25 output. Gemini is 2.5x cheaper on input, 2.1x cheaper on output. At 100 million tokens per month, Gemini saves around $550. At 1 billion tokens, that's $5,500 per month.
That's not abstract. That's a server. That's a seat.
But Claude supports 128K output tokens. Gemini caps at 64K. For tasks that produce full reports or large code files, Claude requires fewer API calls per task. The price gap narrows.
The cheapest model is the one that does the job in the fewest calls. Actually run that math before you decide.
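Here's a rough sketch of that math in code. The per-token prices and the 64K/128K output caps come from above; the monthly volumes and the input/output mix are illustrative assumptions, not billing advice:

```python
import math

# Prices in $ per 1M tokens, plus max output tokens per call (from the text above).
GEMINI = {"in": 2.00, "out": 12.00, "out_cap": 64_000}
CLAUDE = {"in": 5.00, "out": 25.00, "out_cap": 128_000}

def monthly_cost(model, input_tokens, output_tokens):
    return (input_tokens / 1e6) * model["in"] + (output_tokens / 1e6) * model["out"]

def calls_per_task(model, output_tokens_needed):
    # A task whose output exceeds the cap needs multiple stitched calls.
    return math.ceil(output_tokens_needed / model["out_cap"])

# Example mix: 100M input + 25M output tokens per month.
print(monthly_cost(GEMINI, 100e6, 25e6))   # 500.0
print(monthly_cost(CLAUDE, 100e6, 25e6))   # 1125.0
# A single 100K-token report: Gemini needs two calls, Claude one.
print(calls_per_task(GEMINI, 100_000))     # 2
print(calls_per_task(CLAUDE, 100_000))     # 1
```

Swap in your own token volumes; the gap shifts fast once long outputs force extra calls on the smaller cap.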
A naming problem
I once spent an entire afternoon naming a new testing project. Not building it. Just naming it.
"llm-router" sounded too corporate. "brain-switch" sounded like 2018. I landed on "gemini-compare-v7-final-REAL-final" and my teammate refused to open that folder for three weeks.
There's something genuinely strange about naming AI tooling. The name feels like a promise. "agent-core" sounds intentional and serious. "gemini-compare-v7-final-REAL-final" sounds like something that survived.
These two models have clean names by comparison. "Gemini 3.1 Pro" sounds like a version number. "Claude Opus 4.6" sounds like a classical composition. Neither name tells you if the model handles your specific weird edge case at 2am when something breaks.
Which is true of every model name ever. And yet every comparison post leads with the names like they mean something. This post included.
Who shouldn't bother
Most people don't need Opus 4.6.
If you're building a RAG pipeline over standard-size documents, Gemini 3.1 Pro handles it fine and saves you real money. If you're running a customer support chatbot, you don't need the heavy reasoning Opus 4.6 does well. You're paying for capability you won't touch.
Opus 4.6 is for long-running agents. Complex codebases. Reports where missing an edge case is a real problem. Workflows that run for minutes, not seconds.
Gemini 3.1 Pro is a solid starting point for most builds. The reasoning jump is real. The truncation fix solves a genuine production headache. And at $2 input versus $5, the savings at scale aren't abstract.
But if your agents start degrading at step 7 of a 10-step workflow, you'll know. Switch then. Don't start with the expensive option hoping it justifies itself.
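One way to act on that: run everything through the cheap model first and escalate only when a check fails. A hypothetical sketch; `run_workflow` and `validate` stand in for your own agent loop and acceptance test, and the model names are just labels:

```python
# Escalation router: start every workflow on the cheaper model and fall back
# to the pricier one only when the result fails validation.
def route(task, run_workflow, validate,
          cheap="gemini-3.1-pro", expensive="claude-opus-4.6"):
    result = run_workflow(cheap, task)
    if validate(result):
        return cheap, result
    # Cheap model degraded partway through; retry on the heavier model.
    return expensive, run_workflow(expensive, task)
```

The validator is the hard part, but even a crude one (did every step produce output? did the diff apply?) tells you which tasks genuinely need the expensive tier.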
A developer on Reddit described Opus 4.6 as "like a junior colleague who has done all the work and presents a clear plan." That comparison has stuck with me.
Because the best model isn't the one with the highest score on a benchmark PDF. It's the one you trust to finish the job without you checking in every five minutes.
Figure out which tasks you'd actually hand off without babysitting. Start there. Everything else is just a number on a chart.
