Quick context: both vendors publish SWE-bench scores on their own scaffolds, and the same model can swing 22 points depending on harness design. That's more than the gap between any two frontier models.

Since January 2026, Claude plans have been restricted to Anthropic's native apps. Kilocode, Aider, any third-party tool: none of them can consume a Max subscription. Codex went the opposite direction and opened up to external integration. The result is that every "Claude crushes GPT on code" comparison is really Claude on Claude Code vs. GPT on whatever harness the tester happened to use. That's not a model comparison; it's a product comparison.

Opus 4.7 does lead GPT-5.4 on SWE-bench Pro, 64.3% vs. 57.7%, and third-party reproductions broadly confirm it. The model is probably good. But the subscription doesn't just fund the model: it locks you into the harness that makes the benchmark number possible.

I ended up going with ChatGPT Pro myself. Not because GPT-5.4 is objectively better, but because it's the only $100/month plan that actually works with my open-source toolchain 🤷