Step 3.7 Flash: The 198B MoE Model Everyone Is Actually Running

May 30, 2026 · Dishant Sharma · 7 min read

someone on X posted a photo of a DGX Spark sitting on a regular desk with a terminal window running step 3.7 flash. a 198 billion parameter vision model. on a box that fits next to a monitor. and it was not a flex post. it was a "here is the config that saved me three hours" post.

that is the kind of energy around this release.

stepfun dropped step 3.7 flash on may 29. one day ago as i write this. and the AI community has been processing it ever since. not because it is the biggest model ever made. but because of how much it does for how little it costs.

a 198B model that acts like an 11B model at inference time.

here is what that means in practice.

The architecture that makes the cost work

step 3.7 flash is a sparse mixture of experts model. 198 billion total parameters but only about 11 billion activate per token. the other 187 billion just sit there waiting for the right input. this is not new. MoE is well understood by now. but the execution matters.

the model activates 8 out of 288 experts per token. the routing decides which experts fire based on what the input actually needs. so the compute cost per token looks much closer to a dense 11B model than a 198B one. the result is up to 400 tokens per second throughput. that is genuinely fast for a model with this capability profile.
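
to make the routing idea concrete, here is a minimal sketch of generic top-k gating in plain python. only the 8, the 288, and the 11B/198B figures come from the release. the random logits stand in for a learned router, and `route` is a hypothetical name. stepfun's actual router (how it normalizes weights, whether it uses shared experts or load balancing) is not described here, so treat this as an illustration of the technique, not their implementation.

```python
import math
import random

NUM_EXPERTS = 288  # from the article
TOP_K = 8          # from the article


def route(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    Generic top-k gating; an assumption, not StepFun's documented router.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


random.seed(0)
# Stand-in for the scores a learned router would produce for one token.
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
chosen = route(logits)

print(len(chosen), "of", NUM_EXPERTS, "experts fire for this token")
print(f"active parameter share: {11 / 198:.1%}")
```

only the chosen experts run their feed-forward pass for that token, which is why per-token compute tracks the active parameter count rather than the total.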

on the vision side, stepfun added a 1.8B ViT encoder on top of the language backbone. that means native image understanding without the full multimodal tax that some larger models pay. it handles UI wireframes, charts, dense documents, and natural scenes. then it writes code or calls tools based on what it sees.

the vision results back this up. SimpleVQA with search tools hits 79.2, first place in that category, ahead of GPT 5.5 at 79.1. V* with Python tools reaches 95.3, competitive with Kimi K2.6 at 96.9 and Gemini 3 Flash at 96.3. these are flash tier results matching pro tier models on visual tasks.

Why people are excited

the X feed has been full of people actually running this thing. not theorizing. not reviewing benchmarks from a blog post. loading it on hardware and seeing what happens.

sudo su posted his exact llama.cpp config for running it on a DGX Spark. someone else posted benchmark comparisons with claude opus 4.6. multiple people noted that step 3.5 flash was already their most used model for agentic coding. now 3.7 is out and it is a meaningful jump.

the numbers that keep getting shared:

  • ClawEval-1.1: 67.1 (leads the benchmark, next best is 59.8)
  • SWE-Bench Verified with Advisor Mode: 76.3% at $0.19 per task
  • Claude Opus 4.6 on the same benchmark, for comparison: 78.7% at $1.76 per task
  • SimpleVQA with Search: 79.2 (beat GPT 5.5 at 79.1)
  • BrowseComp: 75.82% (close to Claude Opus 4.7 at 79.3%)

that is 97% of claude opus 4.6's coding performance at one ninth the cost.
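
the arithmetic behind that line, as a quick check. the four numbers are the SWE-Bench Verified figures quoted above, and the variable names are just for illustration.

```python
# SWE-Bench Verified figures quoted in the list above.
step_score, step_cost = 76.3, 0.19  # step 3.7 flash with advisor mode
opus_score, opus_cost = 78.7, 1.76  # claude opus 4.6

relative_performance = step_score / opus_score
cost_ratio = opus_cost / step_cost

print(f"relative performance: {relative_performance:.0%}")
print(f"opus costs {cost_ratio:.1f}x as much per task")
```

the cost ratio comes out a little above 9, so "one ninth" is a fair rounding.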

people are not excited because it beats everything. they are excited because it changes the math on what you can put in production.

Advisor Mode is the actual trick

here is what makes the cost number possible.

the basic problem with long agentic runs is that most of the work is routine. tool calls, reading results, iterating on straightforward steps. you do not need a frontier model for that. but occasionally the agent hits a genuinely hard decision point - a planning step that could send the whole trajectory wrong, or a recovery from repeated failures.

advisor mode keeps step 3.7 flash in control of the full run. it calls tools, reads results, and iterates end to end. but at specific inflection points where its own judgment is not sufficient, it consults a larger advisor model and then continues executing on that guidance.

the expensive model gets used only where it actually matters. the small model stays cheap through most of the run. the advisor cost gets amortized across many steps rather than paid on every token.
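
a rough sketch of that control flow, assuming a simple escalation rule. everything here is hypothetical: the `needs_advisor` trigger, the failure threshold, and the stub executor and advisor functions are stand-ins for real model calls, and stepfun has not published its escalation logic in this form.

```python
from dataclasses import dataclass, field

FAILURE_THRESHOLD = 2  # hypothetical: escalate after this many failures in a row


@dataclass
class RunState:
    steps_done: int = 0
    consecutive_failures: int = 0
    advisor_calls: int = 0
    guidance: list = field(default_factory=list)


def needs_advisor(step, state):
    """Hypothetical trigger: planning steps, or recovery from repeated failures."""
    return step["kind"] == "plan" or state.consecutive_failures >= FAILURE_THRESHOLD


def run(steps, executor, advisor):
    """The executor handles every step; the advisor is consulted only on escalation."""
    state = RunState()
    for step in steps:
        if needs_advisor(step, state):
            state.guidance.append(advisor(step))
            state.advisor_calls += 1
            state.consecutive_failures = 0
        ok = executor(step, state.guidance)
        state.steps_done += 1
        state.consecutive_failures = 0 if ok else state.consecutive_failures + 1
    return state


def cheap_executor(step, guidance):
    # Stub for the small model: succeeds unless the step is marked as failing.
    return not step.get("fails", False)


def expensive_advisor(step):
    # Stub for the large model: returns a piece of guidance text.
    return f"advice for {step['name']}"


steps = [
    {"name": "plan", "kind": "plan"},
    {"name": "read file", "kind": "tool"},
    {"name": "run tests", "kind": "tool", "fails": True},
    {"name": "run tests again", "kind": "tool", "fails": True},
    {"name": "patch", "kind": "tool"},
    {"name": "verify", "kind": "tool"},
]
state = run(steps, cheap_executor, expensive_advisor)
print(state.advisor_calls, "advisor calls across", state.steps_done, "steps")
```

in this toy run the advisor is consulted twice in six steps: once for the plan, once after two failures in a row. the longer the run, the more routine steps share that advisor bill.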

stepfun describes this as their implementation of the executor-advisor strategy that anthropic wrote about. it is not a new idea. but the execution and the pricing make it work in practice.

Where it actually falls short

terminal-bench 2.1 is the clearest gap. step 3.7 flash scores 59.5 against GPT 5.5 at 82.7 and gemini 3.5 flash at 76.2. that is not a close race. for workflows that depend heavily on terminal interaction and complex command execution, this model is not the right choice right now.

GDPval across 44 occupations shows 45.8%, against 63% for GPT 5.5 and claude opus 4.7. that is a gap of about 17 points on general professional task coverage.

HLE with tools at 47.2% trails Claude Opus 4.7 at 54.7% and GPT 5.5 at 52.2%. for the hardest reasoning tasks with tool use, frontier models still have a clear edge.

and all benchmark numbers here are from stepfun's own evaluation unless otherwise noted. self-reported results should be read with that caveat.

That time I spent 2 hours reading the wrong docs

speaking of caveats, i once spent 2 hours reading the wrong version of an API doc because the url had "v2" but the page said "v3". the entire time i was confused about why the endpoints did not match. turns out v2 was the current version and v3 was the upcoming one. the docs were published early by accident.

this happens to me more than i want to admit. which is why when i see a model release with clear deployment instructions and multiple quantization formats on day one, i pay attention. stepfun shipped BF16, FP8, NVFP4, and GGUF weights on huggingface simultaneously. that is not nothing. that is a team that knows people will actually try to run this thing.

and they did. people are running it on DGX Spark, on mac studios with 128GB unified memory, on cloud instances through nim and openrouter. the ecosystem support at launch matters.
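
a back-of-envelope sketch of why the quantization formats matter for those 128GB machines. the bytes-per-parameter values are nominal assumptions (NVFP4 is treated as a flat 4 bits, ignoring its scale factors), and this counts weights only, with no KV cache, activations, or runtime overhead. GGUF is left out because its size depends on which quantization level you download.

```python
TOTAL_PARAMS = 198e9     # from the article
MEMORY_BUDGET_GB = 128   # the unified memory figure mentioned above

# Nominal storage per parameter; assumptions, not measured file sizes.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}


def weight_footprint_gb(params, bytes_per_param):
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


footprints = {
    name: weight_footprint_gb(TOTAL_PARAMS, size)
    for name, size in BYTES_PER_PARAM.items()
}

for name, gb in footprints.items():
    verdict = "fits" if gb < MEMORY_BUDGET_GB else "does not fit"
    print(f"{name}: ~{gb:.0f} GB of weights, {verdict} in {MEMORY_BUDGET_GB} GB")
```

under these assumptions only the 4-bit weights leave headroom on a 128GB box, which lines up with the desktop setups people are posting.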

The honest take

most people do not need step 3.7 flash.

if you are building a chatbot that answers questions from a knowledge base, a smaller dense model will serve you better. if you need the absolute best terminal agent performance, frontier models are still ahead. if you have fewer than 10k requests a day, the pricing difference does not matter enough to justify the infrastructure complexity.

but if you are building production agentic workflows where tool orchestration reliability and cost per task matter more than raw benchmark ceiling, this model is worth serious evaluation. the ClawEval lead and the advisor mode cost profile are both genuinely differentiated.

the cross-harness consistency is also underrated. step 3.7 flash narrowed the per-harness spread from 43-73% (step 3.5) down to 64.5-71.5%. that is a 30 point range shrinking to a 7 point one, which means more predictable agent behavior across different scaffolds. in production, that matters more than a 2% benchmark gain.

What I keep thinking about

i keep coming back to that X post. someone running a 198B vision model on a desktop machine. sharing their exact config so others do not have to spend 3 hours debugging.

that is the real story here. not the benchmark scores. not the pricing. the fact that this level of capability is becoming something you can just run.

the gap between frontier and open models keeps shrinking. but the gap between "technically available" and "actually usable" is closing faster. step 3.7 flash is a good model. but the ecosystem support, the quantization options, the clear deployment docs - that is what makes it useful.

i still wonder about the terminal-bench gap though. if they close that in 3.9 flash, the conversation gets a lot more interesting.
