When Zhipu's global head Zixuan Li tweeted "Don't panic. GLM-5.1 will be open source," developers started paying attention. The tweet came March 20th. A week later, on March 27th, Z.ai quietly dropped GLM-5.1 with just two lines in their release notes. No blog post. No launch event. Just a model key: GLM-5.1.
That's it. That's the whole announcement.
And r/LocalLLM immediately started asking: any good?
I've been building with open-source models for a while now. Every few months something drops that looks impressive. You integrate it. Then you find the weird edge case where it falls apart. You document it. You move on. But this one felt different, and not for the reasons you'd expect.
Let me explain what actually happened here, and why you should care if you build AI agents or write a lot of code.
What GLM-5.1 actually is
I used to think "point releases" were just marketing. A 5.0 to 5.1 update usually means minor bug fixes. Maybe a context tweak. Nothing worth talking about.
GLM-5.1 broke that assumption.
The base architecture is the same as GLM-5. Same 744 billion total parameters. Same Mixture-of-Experts setup with 40 billion active per token. Same 200K context window. Z.ai didn't rebuild the model. They retargeted the reinforcement learning pipeline specifically at coding task distributions. That sounds like a small change.
It moved the coding benchmark score from 35.4 to 45.3. That's a 28% jump. In one point release.
For reference, Claude Opus 4.6 scores 47.9 on the same benchmark. GLM-5.1 sits at 94.6% of that. The base model, GLM-5, already had 77.8% on SWE-bench Verified, the highest score among open-source models. So GLM-5.1 builds on a base that was already legitimately good.
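The percentages above follow directly from the three reported scores. A quick sanity check on the arithmetic, using only the numbers quoted in this article:

```python
# Scores as reported in the article (Z.ai's own evaluation for the GLM numbers).
glm_5 = 35.4    # GLM-5 coding benchmark score
glm_5_1 = 45.3  # GLM-5.1 coding benchmark score
opus = 47.9     # Claude Opus 4.6 on the same benchmark

jump = (glm_5_1 / glm_5 - 1) * 100   # improvement over the base model
ratio = glm_5_1 / opus * 100         # share of Opus performance

print(f"point-release jump: {jump:.0f}%")   # prints "point-release jump: 28%"
print(f"share of Opus:      {ratio:.1f}%")  # prints "share of Opus:      94.6%"
```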
The Reddit test
Here's a question people always ask: does it actually work, or is this just benchmark theater?
The r/LocalLLM thread had a pretty telling comment. Someone wrote: "I had an issue with my mac and used glm 5.1 in claude code to fix it. Solved it properly and felt like claude sonnet. I like it. And i hate glm usually."
That's the honest version of a five-star review.
Other users noted it can remember context from about ten steps back without getting lost. It manages multi-step workflows. It debugs its own errors without you having to hold its hand. These are real agentic behaviors, not demo tricks.
But there's also a Chinese tech article from the same day that noted something else. The GLM Coding Plan sold out instantly. That's not a marketing stat. That's demand. Real developers with real money clicked "subscribe" the moment it went live.
How to actually use it
Most tutorials tell you to just swap the model name. That's almost correct.
If you're on the GLM Coding Plan, you set the model to glm-5.1 in your config file. On Claude Code, that means editing ~/.claude/settings.json. That's the whole setup. Z.ai confirmed the model key.
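As a rough sketch, assuming the common Anthropic-compatible proxy pattern that Claude Code supports via environment variables, the settings.json edit might look like this. The base URL and the placeholder key are illustrative assumptions, not confirmed Z.ai documentation; check the Coding Plan docs for the real endpoint:

```json
{
  "env": {
    "ANTHROPIC_BASE_URL": "https://api.z.ai/api/anthropic",
    "ANTHROPIC_AUTH_TOKEN": "your-coding-plan-api-key",
    "ANTHROPIC_MODEL": "glm-5.1"
  }
}
```

Restart Claude Code after saving so it picks up the new environment.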
Here's what the plan includes:
GLM-5.1 (latest, coding score 45.3)
GLM-5 (score 35.4)
GLM-5-Turbo (faster, lighter)
The plan starts at $10/month. For context, running Claude Opus through the API for heavy coding work costs significantly more than that.
The math makes sense. You're getting 94.6% of Opus performance at a fraction of the price.
But that brings us to the part nobody is talking about.
The 100K context problem
The problem isn't what you think when you first read the benchmarks.
A user on r/ZaiGLM posted this three days ago: "GLM 5.1 suffer from the same useless insanity as GLM 5.0, once reaches 100k context use." They weren't being dramatic. When you push the model past roughly 100K tokens of context, behavior changes. It gets weird. Unstable.
I've seen this pattern before in other MoE models. Long context is hard. Advertising 200K and reliably using 200K are two different things.
The workaround some users found: trigger auto-compaction at half the advertised max. Set your context limit to around 105K in your config. That keeps it stable. Annoying? Yes. Dealbreaker? Probably not for most tasks.
Here's what that config looks like:
"glm-5.1": {
"limit": {
"context": 105000,
"output": 8192
}
}
Small change. Saves you hours of debugging weird model behavior in long sessions.
A quick tangent about model naming
This is not about GLM-5.1, but I keep thinking about it.
Zhipu AI renamed itself Z.ai. The models are now under Z.ai branding. GLM still stands for "General Language Model." But the company wanted a sleeker global name. So now you have Z.ai releasing models called GLM. Which is a bit like if Apple changed its name to "Device" but kept calling things iPhones.
I don't know who makes these naming decisions. Somewhere there's a product manager who spent three weeks on this. The model is still good. The naming is still confusing. These things can coexist.
The open-source weights for GLM-5 are already on Hugging Face under MIT license. GLM-5.1 weights are confirmed coming under the same license. Just no date yet.
Who this is actually for
Let's be honest here: most people don't need this right now.
If you're building small apps, doing light automation, or not running agents at scale, GLM-5.1 on the coding plan adds friction. You have to set up a new API endpoint. You have to manage the 100K context limit. You have to wait for the open weights to actually drop before you can self-host.
That's not nothing.
But if you're running Claude Code or OpenCode heavily, and cost is a real concern, this is worth testing this week.
The use case where it shines is clear:
Longer coding tasks with multiple files
Agent workflows with tool use
Debugging sessions that span many turns
It also matters for teams outside the US. Z.ai explicitly targets markets that US export controls have pushed away from Anthropic and OpenAI. The MIT license means you can run it on your own infra, in your own country, with zero dependency on American API providers.
That's not a small thing.
The benchmark skepticism is fair
One reviewer said it plainly: Chinese labs have a track record of impressive self-reported numbers that look less exciting once independent testing happens.
That's fair. We don't have fully independent benchmarks for GLM-5.1 yet. The 45.3 coding score is Z.ai's own evaluation. But the base model GLM-5 did hold up externally. It really did score 77.8% on SWE-bench Verified. That gives GLM-5.1 some credibility even before external tests arrive.
Give it a few weeks. The community will stress-test it properly.
One month from now, either the open weights drop and developers start integrating this everywhere, or the independent benchmarks come back underwhelming and the thread goes quiet. Both outcomes are useful data.
Someone on Reddit said they hated GLM before and this changed their mind. That's not a common thing to admit. Worth noting.
The plan is $10/month. The context window works if you cap it at half. The coding improvements are real, even if the final verdict isn't in yet.
Try it on a task you'd normally throw at Sonnet. See what happens.
