GPT vs Claude in a bomberman-style 1v1 game

A few weeks ago, ARC-AGI 3 was released. For those unfamiliar, it’s a benchmark designed to study agentic intelligence through interactive environments.

I'm a big fan of these kinds of benchmarks because, IMO, they reveal far more about the capabilities and limits of agentic AI than static Q&A benchmarks do. They are also more intuitive to understand, since you can actually watch how the model behaves in the environment.

I wanted to build something in that spirit, but with an environment that pits two LLMs against each other. My criteria were:

Strategic & Real-time. The game had to create genuine tradeoffs between speed and quality of reasoning. Smaller models respond faster, so they can make more moves, but less strategic ones; larger models move less often but more deliberately.
Good harness. I deliberately avoided visual inputs, because models are still too slow and not accurate enough with them (see: Claude playing Pokémon). Instead, a harness translates the game state into structured text for the models, and the game engine renders the agents' responses as fluid animations.
Fun to watch. Because benchmarks don't need to be dry bread :)
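
To make the harness idea concrete, here is a minimal sketch of what "game state to structured text" could look like. It is not the code from the repo: the `GameState` fields, the map symbols, the action names, and the `state_to_text` / `parse_action` helpers are all made up for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical action set; the real game may use different names.
ACTIONS = ("UP", "DOWN", "LEFT", "RIGHT", "BOMB", "WAIT")


@dataclass
class GameState:
    # Assumed layout: '#' = wall, '+' = brick, '.' = empty floor.
    grid: list[str]
    # Player name -> (row, col).
    players: dict[str, tuple[int, int]]
    # Each bomb is (row, col, ticks until it explodes).
    bombs: list[tuple[int, int, int]] = field(default_factory=list)


def state_to_text(state: GameState, me: str) -> str:
    """Render the state as plain text from one player's point of view."""
    rows = [list(r) for r in state.grid]
    for r, c, _ in state.bombs:
        rows[r][c] = "B"
    # Players are drawn last, so a player standing on a bomb hides it on the map;
    # the bomb still shows up in the BOMBS section below.
    for name, (r, c) in state.players.items():
        rows[r][c] = "Y" if name == me else "E"

    lines = ["LEGEND: #=wall +=brick .=empty B=bomb Y=you E=enemy", "MAP:"]
    lines += ["".join(row) for row in rows]
    lines.append("BOMBS:")
    for r, c, ticks in state.bombs:
        lines.append(f"- row={r} col={c} explodes_in={ticks}")
    lines.append("ACTIONS: " + ", ".join(ACTIONS))
    return "\n".join(lines)


def parse_action(reply: str) -> str:
    """Return the first valid action word in the model's reply, else WAIT."""
    for token in reply.upper().replace(",", " ").split():
        token = token.strip(".!:;\"'")
        if token in ACTIONS:
            return token
    return "WAIT"


state = GameState(
    grid=["######", "#..+.#", "######"],
    players={"gpt": (1, 1), "claude": (1, 4)},
    bombs=[(1, 2, 3)],
)
prompt = state_to_text(state, me="gpt")
action = parse_action("I will move left.")
```

Falling back to `WAIT` on an unparseable reply is just one possible choice; in a real-time game it means a rambling model simply loses its turn instead of stalling the match.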

The end result is a Bomberman-style 1v1 game where two agents compete by destroying bricks and trying to bomb each other.

It’s open-source here: github
Would love to hear what you think!

submitted by /u/Significant-Pair-275