The Coding Harness Nobody Asked For But Everybody's Trying
A dev on dev.to tested ForgeCode against Claude Code on the same tasks. Same model, same repo. ForgeCode finished in under 30 seconds. Claude Code took 90 seconds. He ran it again. Same result. He ran it a third time. Still faster. His verdict? "i could feel it in my workflow."
That kind of real-world speed difference is rare in AI tools. Usually you get marginal gains dressed up as breakthroughs. ForgeCode felt like something actually different.
What ForgeCode actually is
ForgeCode is not an AI model. It's a harness. Think of it as a terminal app that lets you plug in whatever model you want. Claude, GPT, DeepSeek, Gemini, local models. Three hundred plus models through OpenRouter or direct API keys.
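To make the model-plugging idea concrete, here's what swapping backends looks like at the API level. This is a minimal sketch, not ForgeCode's code: OpenRouter exposes an OpenAI-compatible chat endpoint, so switching models is mostly a matter of changing a slug. The environment variable name and the two example slugs below are my own placeholders.

```python
import os
import requests

# OpenRouter's OpenAI-compatible chat endpoint. The harness stays the same;
# only the "model" slug changes between providers.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def ask(model: str, prompt: str) -> str:
    """Send one prompt to the given model slug and return the reply text."""
    resp = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Same prompt, different backends -- the rest of the tooling doesn't change.
for slug in ("anthropic/claude-3.5-sonnet", "deepseek/deepseek-chat"):
    print(slug, "->", ask(slug, "Summarize this repo's build steps.")[:80])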
It's open source, Apache 2.0, written in Rust. Launched in January 2025. Hit v2.8.0 by April 2026 with over 6,000 stars on GitHub. The org behind it is tailcallhq (formerly antinomyhq).
Three agents, three jobs:
- forge writes and edits code
- sage does read-only research, can't modify files
- muse generates plans and writes them to a plans/ directory
The Zsh integration is the thing that gets people. You type a colon followed by your instruction, like ": add error handling to the auth middleware", and it runs inline. No context switching, no new window, no VS Code. Your aliases and Oh My Zsh plugins keep working.
Why it's fast
Part of it is the Rust binary. Claude Code is a TypeScript app running on Node, so startup time and memory footprint are heavier. Rust wins on raw speed for that kind of thing. But that's not the whole story.
The context engine is what actually matters. Instead of dumping raw files into the context window, ForgeCode indexes function signatures and module boundaries. The agent pulls only what it needs. Their own estimates say this cuts context size by about 90%. Smaller context means faster responses. Same model, different harness, different speed.
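I haven't read ForgeCode's indexing code, so treat this as a sketch of the principle rather than their implementation: walk the source tree, keep only function and class signatures, and hand the model that skeleton instead of whole files. (ForgeCode works across languages and tracks module boundaries too; this Python-only version just shows why the context shrinks so much.)

```python
import ast
from pathlib import Path

def signature_index(root: str) -> str:
    """Build a compact 'skeleton' of a Python project: signatures only, no bodies.

    Illustration of the signatures-not-files idea, not ForgeCode's real
    context engine. The point is the size difference, not the implementation.
    """
    lines = []
    for path in sorted(Path(root).rglob("*.py")):
        tree = ast.parse(path.read_text(encoding="utf-8"))
        lines.append(f"# {path}")
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                args = ", ".join(a.arg for a in node.args.args)
                lines.append(f"def {node.name}({args}): ...")
            elif isinstance(node, ast.ClassDef):
                lines.append(f"class {node.name}: ...")
    return "\n".join(lines)

if __name__ == "__main__":
    skeleton = signature_index("src")
    raw = sum(p.stat().st_size for p in Path("src").rglob("*.py"))
    print(f"skeleton: {len(skeleton)} bytes vs raw source: {raw} bytes")
```

On most projects that skeleton is a small fraction of the raw source, which is where a roughly 90% reduction starts to sound plausible.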
I tried to think of a good analogy. It's like the difference between reading every file in a project before fixing a bug versus jumping straight to the relevant function. One approach is thorough. The other is fast. For coding tasks where you already know roughly where the problem is, fast wins.
The benchmark problem
Here's where things get messy. ForgeCode claims to be the world's top-ranked coding harness. Their site says "#1 on TermBench 2.0" with a score of 81.8%. Claude Code sits at 58%, ranked 39th. A 24-point gap. Sounds devastating.
But TermBench 2.0 is ForgeCode's own benchmark, hosted at tbench.ai, submitted by ForgeCode itself. Not neutral. Not independent.
On SWE-bench Verified, the human-validated cut of the independent SWE-bench benchmark built by Princeton and UChicago researchers, the gap shrinks to 2.4 points. ForgeCode plus Claude 4 scored 72.7%. Claude 3.7 Sonnet with extended thinking scored 70.3%. A real gap, sure. But not a 24-point gap.
The r/ClaudeCode community called it "benchmaxxed." Which is funny and kind of fair. ForgeCode's blog documents four harness changes that specifically target benchmark performance. Reordered JSON schema fields to reduce GPT 5.4 tool-call errors. Flattened nested schemas. Added truncation reminders for partial file reads. Added a mandatory verification pass where a reviewer skill checks completion before the agent stops.
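The schema flattening is the easiest of those four to picture. Their blog describes the change, but I haven't seen the actual transform, so here's a hedged sketch of the general idea: hoist nested tool-parameter objects up to dotted top-level keys, so the model has fewer levels of nesting to get wrong when it emits a tool call.

```python
def flatten_schema(schema: dict, prefix: str = "") -> dict:
    """Flatten a nested JSON-Schema-style 'properties' tree into dotted keys.

    Illustration only -- a guess at the general shape of the change, not
    ForgeCode's actual transform. A nested "edit" object becomes flat
    "edit.start_line" / "edit.end_line" parameters.
    """
    flat = {}
    for name, spec in schema.get("properties", {}).items():
        key = f"{prefix}{name}"
        if spec.get("type") == "object" and "properties" in spec:
            flat.update(flatten_schema(spec, prefix=f"{key}."))
        else:
            flat[key] = spec
    return flat

nested = {
    "properties": {
        "path": {"type": "string"},
        "edit": {
            "type": "object",
            "properties": {
                "start_line": {"type": "integer"},
                "end_line": {"type": "integer"},
            },
        },
    }
}
print(flatten_schema(nested))
# {'path': {...}, 'edit.start_line': {...}, 'edit.end_line': {...}}
```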
These are real engineering improvements. They're also benchmark-specific optimizations. Both things can be true.
GPT 5.4 was a disaster
The dev.to reviewer tried GPT 5.4 through ForgeCode. Asked it to research the architecture of a small repo. Fifteen minutes. Tool calls failing. Agent retrying and spinning. He killed it.
So the speed story needs a qualifier. ForgeCode with Claude Opus 4.6 is fast. ForgeCode with GPT 5.4 was borderline unusable. The harness doesn't magically make every model good. It makes good models faster. Bad models stay bad.
The multi-agent setup actually makes sense
Most coding agents are single-shot. You ask, it does. ForgeCode splits work across three agents. forge handles implementation. sage handles research. muse handles planning.
The separation matters. When sage researches something, it can't accidentally modify your files. It's read-only by design. When muse writes a plan, it goes to a plans/ directory. It doesn't touch your code. Only forge can edit files, and only when you tell it to.
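None of this is ForgeCode's source, but the capability split is simple to express. A toy sketch of the idea: every agent gets a read tool, and the write tool is either missing or fenced to one directory, so the guarantees hold even if the model gets confused.

```python
from pathlib import Path

class Agent:
    """Toy capability gating in the spirit of forge/sage/muse -- not ForgeCode code."""

    def __init__(self, name: str, can_write: bool, write_root: Path | None = None):
        self.name = name
        self.can_write = can_write
        self.write_root = write_root  # None means "anywhere in the repo"

    def read(self, path: str) -> str:
        # Every agent may read files.
        return Path(path).read_text(encoding="utf-8")

    def write(self, path: str, content: str) -> None:
        if not self.can_write:
            raise PermissionError(f"{self.name} is read-only by design")
        target = Path(path).resolve()
        if self.write_root and self.write_root.resolve() not in target.parents:
            raise PermissionError(f"{self.name} may only write under {self.write_root}")
        target.write_text(content, encoding="utf-8")

forge = Agent("forge", can_write=True)                          # edits code
sage = Agent("sage", can_write=False)                           # research only
muse = Agent("muse", can_write=True, write_root=Path("plans"))  # plans only
```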
One reviewer on dev.to noted that muse's plans felt "more detailed and verbose" than Claude Code's plan mode. Whether that's good depends on what you want. If you like thorough breakdowns before touching code, it's a win. If you want to move fast and skip the paperwork, it might feel slow.
I think the separation is smart for teams. Junior devs can use sage to understand a codebase without breaking anything. Senior devs can use forge for quick edits. The plan stays documented. Nobody overwrites someone else's work because the agent got confused about what it was supposed to do.
The sandbox trick
One feature that doesn't get enough attention is the --sandbox flag. It creates an isolated git worktree and branch. You can try something risky, let the agent go wild, and only merge back what works. Your main tree stays clean.
That's genuinely useful. I've had Claude Code modify files I didn't want touched. A sandbox mode would have saved me a git checkout or two. Claude Code doesn't have this built in the same way. You can replicate it manually with git branches, but having it as a flag is the kind of thing that removes just enough friction to change how you experiment.
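If you want the same safety net without the flag, the manual version is a couple of git commands wrapped in whatever you like. Here's a rough sketch of the git worktree equivalent, not ForgeCode's --sandbox internals: spin up a throwaway worktree on a fresh branch, let the agent loose there, and merge only if you like what came back.

```python
import subprocess
import time

def make_sandbox(repo: str = ".") -> str:
    """Create a disposable git worktree + branch to experiment in.

    A rough stand-in for a sandbox flag, not ForgeCode's implementation.
    The main working tree stays untouched; remove the worktree when done.
    """
    branch = f"sandbox-{int(time.time())}"
    path = f"../{branch}"  # sibling directory next to the repo
    subprocess.run(
        ["git", "-C", repo, "worktree", "add", "-b", branch, path],
        check=True,
    )
    return path  # point the agent (or yourself) at this directory

# If the experiment worked: merge the branch back.
# If it didn't: `git worktree remove <path>` and `git branch -D <branch>`.
```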
One less thing to think about. One more reason to try something bold instead of playing it safe.
The honesty thing
ForgeCode explicitly says they've hired zero paid influencers. Their social media presence is intentionally low. In an industry where every "honest review" has affiliate links and sponsored segments, that's suspiciously refreshing.
But it also means most people find out about ForgeCode through benchmarks or word of mouth. The benchmark problem I mentioned earlier? It matters more when you don't have marketing to back up your claims. The numbers have to speak for themselves. And on independent benchmarks, they whisper instead of shout.
Who should actually use this
If you're happy with Claude Code, there's no urgent reason to switch. The speed difference is real but not life-changing for most workflows. If you're spending less than an hour a day in a coding agent, you won't notice.
If you're in your terminal all day and latency is driving you insane, try it. The install is one line. The Zsh plugin works exactly as advertised. And model flexibility means you're not locked into Anthropic's pricing.
If you want to run local models or mix providers per task, ForgeCode is the clear pick over Claude Code. That's not even close. Claude Code is Claude or nothing. ForgeCode is whatever you want.
The thing I keep thinking about
Benchmarks are weird. They shape perception more than reality. ForgeCode's 81.8% on their own benchmark became the headline everywhere. The 2.4-point lead on SWE-bench barely got mentioned. Both numbers are true. One got shared. The other didn't.
I'm not saying ForgeCode is overrated. The speed gains are real. The architecture is solid. The multi-provider support is genuinely useful. But the way it's marketed, starting with that big bold "#1" on their homepage, tells you something about how this space works. The best number wins. Even when it's your own number.
I think about this every time I see a new tool launch with a benchmark headline. The number goes in the README. The blog post leads with it. The Twitter thread starts with it. By the time anyone checks the methodology, the impression is already set.
ForgeCode is a good tool. Maybe even a great tool for the right person. But the marketing and the product are two different things. The product stands on its own. The marketing leans on a number that only tells part of the story.
