12M Context Window and a sprinkle of lies?

Spent some time on the SubQ launch today. Some things don't line up.
The tweet sells a 12M context window. The blog says 12M is a research result and that the production model is SubQ 1M-Preview. Those are two different things; my senses are already tingling!

RULER is reported at 128K, which is well below where sparse attention actually has to prove itself. Standard long-context evals should run at the lengths the marketing is advertising.

MRCR v2 at 1M: research model 83, production 65.9. That 17-point drop alone says something about how well the architecture survives serving. And 65.9 is below both Opus 4.6 (78.3) and GPT-5.5 (74) on the same benchmark. The homepage says "without quality loss", but those numbers don't fit that claim.

There's also a comparator-selection issue. The blog prose cites Opus 4.7 at 32.2, while the homepage table lists Opus 4.6 at 78.3. Only the favorable comparison ends up in the narrative.
Pricing doesn't agree with itself either. Homepage: 1/5 the cost. Launch thread: less than 5% of Opus. That's a factor of 4 between the two claims, as the quick check below shows.
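
A quick sanity check on those two numbers (illustrative fractions only, no real per-token prices assumed):

```python
# The two marketing claims, expressed as fractions of Opus's price.
homepage_fraction = 1 / 5   # "1/5 the cost" -> 20% of Opus
thread_fraction = 0.05      # "less than 5%" -> at most 5% of Opus

# Ratio between the two claims.
print(homepage_fraction / thread_fraction)  # 4.0 -- off by a factor of 4
```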

The "52x faster than FlashAttention" number is a kernel-level comparison, not end-to-end inference. It's a fair architecture result on its own, but most people will read it as wall-clock speed unless it's labeled, and the back-of-envelope math below shows how big that gap can be.
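
Rough Amdahl's-law arithmetic (the 40% attention share of inference time is a made-up assumption for illustration, not a measured number):

```python
def end_to_end_speedup(p: float, s: float) -> float:
    """Amdahl's law: overall speedup when a fraction p of runtime
    gets s-times faster and everything else stays the same."""
    return 1.0 / ((1.0 - p) + p / s)

# Even if attention were 40% of inference time (assumed!), a 52x
# attention kernel only buys about 1.65x wall-clock.
print(end_to_end_speedup(p=0.4, s=52))  # ~1.65
```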
Sparse attention has a known failure mode: it's fast until the task depends on a connection the router pruned, and the research-to-production MRCR drop looks like exactly that pattern (sketched below).
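
A minimal sketch of that failure mode, using generic top-k routing (not SubQ's actual architecture, which isn't public):

```python
import numpy as np

def topk_sparse_attention(q, K, V, k):
    """Each query attends only to the k keys the router scores highest.
    Anything outside the top-k contributes exactly zero to the output."""
    scores = K @ q                      # router score for every key
    keep = np.argsort(scores)[-k:]      # indices of the k surviving keys
    w = np.exp(scores[keep] - scores[keep].max())
    w /= w.sum()                        # softmax over survivors only
    return w @ V[keep]

rng = np.random.default_rng(0)
n, d = 10_000, 64
K, V, q = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=d)

# If the one key the task depends on scores just below the cutoff, its
# value vector is invisible to the query: fast on average, silently
# wrong whenever the answer lives on a pruned connection.
out = topk_sparse_attention(q, K, V, k=128)
```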
The work could be real, and the team is credentialed. The numbers just don't match the marketing yet, and my senses are hyper tingling. I'm taking this with more than a grain of salt until it can be peer-reviewed.

When does the technical report drop? I have no idea. But my bullshit radar is running high on this one.

submitted by /u/prokajevo