
Flash-MoE: How to Run a 397B Model on a Laptop

Mar 22, 2026 · Dishant Sharma · 7 min read

Reddit user Several-Tax31 had one response when Flash-MoE dropped this week. He compared running a 397B model on a laptop to discussing perpetual motion machines. "The second principle of local inference," he wrote, "states that a model needs to fit in RAM and VRAM to run at decent speed."

He's been right for years. A 397B parameter model takes 209GB on disk. Consumer hardware gives you 48GB on a good MacBook Pro. The math has never worked.

And then Dan Woods did it anyway.

On March 18, 2026, Woods, VP of AI Platforms at CVS Health, ran Qwen3.5-397B-A17B on a 48GB MacBook Pro M3 Max. No API. No server. Just a laptop and an NVMe SSD. The model ran at 4.4 tokens per second.

The project is called Flash-MoE. It hit GitHub Trending and Hacker News. The r/LocalLLaMA subreddit, 266,000 members deep, started arguing immediately.

You've probably seen the reactions. Half say "impressive." The other half say "4.4 tokens per second is useless." Both are right. But neither gets to the actually interesting part.

The interesting part is what Flash-MoE does to the memory rule.


What MoE actually means

I used to think "397 billion parameters" meant 397 billion things getting computed per token. That's how dense models work. Every parameter fires on every input.

MoE is different. Mixture-of-Experts splits a model into specialized sub-networks called experts. Each token only routes to a few of them.

Qwen3.5-397B has 512 experts per layer. But per token, it only activates 10. At inference time, most of the model is just sitting there. Not running. Not needed in memory.

That's the whole trick. 397 billion parameters. 17 billion active.

The rest can stay on disk.
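
If you want to see the trick in miniature, here's a rough sketch of top-k routing in Python. It's illustrative only: the expert counts match the numbers above, and everything else, the hidden size, the gate itself, is a generic placeholder, not Flash-MoE's code.

    import numpy as np

    NUM_EXPERTS = 512   # experts per MoE layer in Qwen3.5-397B
    TOP_K = 10          # experts activated per token by default

    def route_token(hidden_state, router_weights):
        """Generic top-k MoE gate: score every expert, keep only the top k."""
        logits = hidden_state @ router_weights              # one score per expert
        top_k_ids = np.argsort(logits)[-TOP_K:]             # indices of the selected experts
        gates = np.exp(logits[top_k_ids] - logits[top_k_ids].max())
        gates /= gates.sum()                                 # softmax over the selected experts
        return top_k_ids, gates

    # Toy usage with a made-up hidden size: only the 10 selected experts' weights
    # are needed for this token; the other 502 can stay on disk.
    hidden = np.random.randn(4096).astype(np.float32)
    router = np.random.randn(4096, NUM_EXPERTS).astype(np.float32)
    expert_ids, expert_gates = route_token(hidden, router)
    print(expert_ids, expert_gates.sum())   # 10 expert indices, gates summing to 1.0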


The paper Apple sat on

Here's a question people always ask: why hasn't anyone done this before?

Apple did. In December 2023, they published "LLM in a Flash: Efficient Large Language Model Inference with Limited Memory." It showed 4-5x speedups on CPU and 20-25x on GPU by streaming weights from SSD instead of loading everything into RAM.

The paper mostly covered dense models. Apple published it. Then they built hardware that made the technique possible. Then they mostly didn't implement it in their own products.

Dan Woods read the paper. He figured the technique mapped better onto MoE architectures. Because with MoE, you know in advance which experts won't be needed. You don't stream everything. You stream only what activates.

Here's what that looks like in practice:

  • 5.5GB of RAM for non-expert components, like attention layers and routing matrices

  • Expert weights streaming from SSD at up to 17.5 GB/s on demand

  • OS page cache holding about 35GB of warm experts at a 71% hit rate

The resident memory footprint is tiny. The 209GB model just lives on disk.
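
For a sense of how the streaming side can work, here's a rough sketch in Python rather than the project's C and Metal: memory-map the expert file and read only the slices the router asked for, and the OS page cache keeps the warm experts around for free. The file layout and expert size here are hypothetical, not Flash-MoE's actual format.

    import mmap
    import numpy as np

    EXPERT_BYTES = 8 * 1024 * 1024   # hypothetical size of one 4-bit-quantized expert

    def open_expert_file(path):
        """Memory-map the expert file; nothing is read from disk until a slice is touched."""
        f = open(path, "rb")
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    def load_expert(mapped, expert_id):
        """Pull one expert's bytes. A cold expert hits the SSD; a warm one comes
        straight from the OS page cache (the ~35GB / 71% hit rate above)."""
        start = expert_id * EXPERT_BYTES
        raw = mapped[start:start + EXPERT_BYTES]
        return np.frombuffer(raw, dtype=np.uint8)   # dequantization omitted

    # Per token, only the router-selected experts are ever touched:
    # experts = [load_expert(mapped, i) for i in selected_expert_ids]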


What actually happens per token

Most of the time spent on each token goes to waiting for the SSD.

Each layer takes about 4.28ms to process at 4-bit. GPU attention and computation: 1.22ms. GPU projections and routing: 0.55ms. SSD expert loading: 2.41ms.

The SSD is the bottleneck. Not the GPU. Not the model size.

Flash-MoE also prunes Qwen3.5's default of 10 active experts down to 4 per token. That's a 60% reduction in expert data loaded per token. And there's no measurable quality loss according to benchmarks.

That's what gets you to 4.4 tokens per second on hardware that "can't" run the model.
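
The arithmetic checks out on the back of an envelope, using only the numbers above (the layer count at the end is implied by the math, not a figure from the repo):

    # Per-layer breakdown quoted above (ms)
    gpu_attention = 1.22
    gpu_routing   = 0.55
    ssd_experts   = 2.41
    per_layer_ms  = gpu_attention + gpu_routing + ssd_experts   # ~4.18 ms, in line with the ~4.28 ms quoted

    tokens_per_sec = 4.4
    per_token_ms   = 1000 / tokens_per_sec                      # ~227 ms per token

    implied_layers = per_token_ms / per_layer_ms                # ~54 layers of decode, end to end
    print(f"{per_layer_ms:.2f} ms/layer -> {per_token_ms:.0f} ms/token -> ~{implied_layers:.0f} layers")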

One more thing worth noting. On Apple Silicon, SSD DMA and GPU compute share the same memory controller bandwidth. Running them at the same time causes GPU latency spikes. Flash-MoE uses a serial pipeline: GPU first, then SSD load, then GPU again. Counterintuitive. But hardware-optimal.
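
Here's a toy version of that serialized loop, with time.sleep standing in for the real Metal dispatches and NVMe reads. How the two GPU figures map onto the two GPU phases is my guess, not the repo's:

    import time

    def gpu_compute(ms):
        # Stand-in for a Metal dispatch: attention, routing, expert matmuls.
        time.sleep(ms / 1000)

    def ssd_read(ms):
        # Stand-in for streaming the selected experts' weights off the NVMe SSD.
        time.sleep(ms / 1000)

    def decode_one_layer():
        """Fully serialized: GPU, then SSD, then GPU again, so SSD DMA and GPU
        compute never contend for the shared memory-controller bandwidth."""
        gpu_compute(1.22)   # attention plus the routing decision (phase split is illustrative)
        ssd_read(2.41)      # stream only the selected experts
        gpu_compute(0.55)   # apply the loaded experts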


The part that's easy to miss

Most tutorials tell you to evaluate a project by what it can do right now.

This one is better evaluated by what it proves.

Cloud inference of Qwen3.5-397B runs at 84.5 tokens per second. Flash-MoE runs at 4.4. That's about 20 times slower. If you want speed, use an API.

But there are real cases where you can't or won't:

  • Sending proprietary code to an API is a compliance problem at most companies

  • Zero per-token cost matters on long document analysis jobs

  • Offline access matters more than most people admit until they actually need it

Running a frontier model locally isn't faster. But it is private. And it's free after the hardware.

And none of the code was written by a human. Woods fed Claude Code the Apple paper plus Karpathy's autoresearch pattern. Claude ran 90 experiments over roughly 8 hours, documented 58 of them, and wrote the full PDF paper. About 7,000 lines of C, Objective-C, and Metal shaders. Zero lines by hand.

Some experiments failed. LZ4 compression added 13% overhead instead of saving time. Temporal expert prediction was 18% slower. Those dead ends are documented openly in the repo.


A small tangent about naming things

This has nothing to do with Flash-MoE. But I keep thinking about it.

Dan Woods named the project "flash-moe." It's named after the paper that inspired it and the architecture it targets. Flash storage. MoE model. Done.

Half the open-source tools I use every week are named like filesystem errors. "MXPQ." "bitsandbytes." Random acronyms that stand for nothing obvious. I've spent more time than I'd like to admit staring at tool names trying to figure out what they even do before reading the README.

There's something satisfying about a name that just describes the project. No metaphor. No brand play. Just: here's what this is.

Most projects don't do this. Flash-MoE does.

Anyway.


The honest part

Most people don't need this.

4.4 tokens per second is usable if you're doing long analysis and can walk away. A 500-token response takes about two minutes. That's not interactive. That's not a coding assistant you use mid-flow.

If you're building a product, use an API. If you're doing quick experiments, use a 7B model. If you need local inference at scale, use something that fits in memory.

Flash-MoE also only works on Apple Silicon. MacBook Pro M3 Max or newer, 48GB RAM, fast NVMe. No Windows support. No Linux support. And right now, it only runs one model: Qwen3.5-397B-A17B.

This isn't software you use in a workflow. It's software that proves something.

But what it proves matters. The memory rule isn't physics. It's a hardware constraint. And MoE architectures create an exploit: most of the model never needs to touch RAM at all.


Several-Tax31 is probably still right for most cases.

The second principle of local inference holds. Fit the model in memory if you want speed. That's not changing.

But Woods found the exception. A model so sparse that most of it never needs to leave the disk. And the dead ends are sitting right there in the repo. LZ4 tried and failed. Speculative decoding broke even at best.

I keep thinking about that more than the result itself. A 397B model running on a laptop, built entirely by an AI reading a research paper, with every failure logged.

Not because it's practical yet. But because it wasn't supposed to be possible at all.
