Spent the last 3 days really pushing GPT, Claude, and Gemini on some gnarly logic puzzles. I wanted to see how they'd handle increasing complexity under simulated pressure, and some of the results were definitely not what I expected. I ran each prompt through a prompt optimizer first to normalize the structure, so I was comparing model performance, not prompt quality.
Basically I fed them a series of logic grid puzzles, starting simple and getting progressively harder. The twist was I added a time limit for each query: not to the models themselves (obviously), but to me responding to their output, to simulate a real-time decision-making scenario. I recorded how many puzzles each model solved correctly within a reasonable response window and where they tripped up.
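For anyone curious what the scoring looked like, here's a minimal sketch of the bookkeeping. The actual process was manual, and all names here (`PuzzleResult`, `accuracy`, the tier labels) are just illustrative, not a real harness:

```python
# Hypothetical bookkeeping for one model's run: each puzzle gets a tier,
# a correctness flag, and whether I responded within the simulated window.
from dataclasses import dataclass

@dataclass
class PuzzleResult:
    puzzle_id: str
    tier: str             # e.g. "simple", "mid", "complex"
    correct: bool
    within_window: bool   # did the exchange stay inside the time limit?

def accuracy(results, tier):
    """Fraction of puzzles in a tier solved correctly within the window."""
    tiered = [r for r in results if r.tier == tier]
    if not tiered:
        return 0.0
    hits = [r for r in tiered if r.correct and r.within_window]
    return len(hits) / len(tiered)

# Example: one simple puzzle solved, one of two complex ones solved.
results = [
    PuzzleResult("p1", "simple", True, True),
    PuzzleResult("p2", "complex", False, True),
    PuzzleResult("p3", "complex", True, True),
]
print(accuracy(results, "complex"))  # 0.5
```

A miss on the time window counts the same as a wrong answer here, since the whole point was simulating real-time decision-making.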
Claude was the most consistent performer on the initial and mid-tier puzzles. It was fast and accurate, often spitting out the correct grid configuration without much fuss. I'd say it solved maybe 80% of the puzzles correctly up to a certain complexity, even when the constraints started getting layered pretty thick.
GPT was a strong contender as expected. It handled most of the puzzles well, but its reasoning noticeably slowed down as the puzzles got more convoluted. By the really complex ones (think 10+ people, 10+ attributes) it began to make minor errors in the deductions; it got about 70% correct.
Gemini… well, it buckled pretty hard on the harder puzzles. It wasn't just slower; it started outputting outright incorrect answers, or refusing to answer, confidently claiming ambiguity where there wasn't any. It got maybe 50% of the complex ones right. It seemed like the time pressure, even simulated, really threw its reasoning style off. I'd get these super long explanations that eventually led to the wrong conclusion.
My biggest surprise? GPT holding its own so well against Claude on this specific task. It's possible my test set was a bit narrow and favored structured deduction. Has anyone else found GPT to be surprisingly resilient on logic-heavy tasks, or did your testing show similar results with Opus taking the lead?