You can play them here: https://fatheredpuma81.github.io/LLM_Racing_Games/
This started out as a simple test of Qwen3 Coder Next vs Qwen3.5 4B because they have similar benchmark numbers. Then I just kept trying other models and decided I might as well share it, even if I'm not that happy with how I did it.
Read the "How this works" in the top right if you want to know how it was done, but the TLDR is: disabled vision, sent the same initial prompt in Plan mode, enabled Playwright MCP and sent the same start prompt, and then spent 3 turns testing the games and pointing out the issues I saw to the LLMs.
There's a ton of things I'd do differently if I ever got around to redoing this. Keeping and showing all 4 versions of the HTML, for one, and not disabling vision, which hindered Qwen 27B a ton (it was only disabled for an apples-to-apples comparison between 4B and Coder). And idk, I had a bunch more thoughts on it but I'm too tired to remember them.
Some interesting notes:
- Qwen3 Coder Next's game does appear to have a track but it's made up of invisible walls.
- Gemma 4 31B and Qwen3.5 27B both output the full code on every turn while the rest all primarily edited the code.
- Gemma 4 31B's game actually had a road at one point.
- Qwen3.5 27B accidentally disabling Playwright MCP on the final turn is what gave us a car that actually moves and steers at a decent speed. The only thing that really changed between the 1st HTML and the last was that it added trees.
- Gemma 4 26B was the only one to add sound.
- Gemma 4 26B added a Team Rocket car blasting off again when you touched a wall, but then OpenCode more or less crashed in the middle of it, so I had to roll back, which resulted in the less interesting sound version.
- GLM 4.7 Flash and Gemma 4 26B were the only ones to spawn a subagent. GLM used it for research during Planning and Gemma used it to implement sound on the final turn.
- Found out GLM 4.7 Flash can't do Q8_0 K Cache Quantization without breaking.
- Qwen3.5 4B installed its own version of Playwright using NPX and then it started using both on bugfix turn 2/3.
- GLM 4.7 Flash's final output failed to a white screen, so I jumped back a turn and asked it to output the full code again. So it only got 2 turns I guess?
- Qwen3.6 35B's game actually regressed in a lot of ways from the start. In the first version there was no screen jitter, the track was a lot narrower, and the hit boxes were spot on with the walls. The minimap was a lot more broken though; I think it got confused between the minimap track and the physical track.