Benchmark scores vs arena.ai performance [D]

By /u/we_are_mammals / May 12, 2026

Consider two open-source models, Gemma4-31B and Qwen3.5-27B. Qwen beats Gemma on almost all benchmark sets. On arena.ai, however, the situation is reversed: Gemma beats Qwen quite decisively, by about 50 Elo points across different categories. Is there a good explanation for this apparent discrepancy?
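For a sense of scale, here is a rough sketch of what a 50-point gap would mean as a head-to-head preference rate. It assumes the arena ratings follow the standard Elo logistic convention (base 10, 400-point scale). That is an assumption on my part, not something confirmed for arena.ai, and the function name is just illustrative.

```python
def expected_win_rate(rating_gap: float, scale: float = 400.0) -> float:
    """Expected win probability for the higher-rated model.

    Assumes the standard Elo logistic curve (base 10, 400-point scale);
    the actual arena.ai rating model may differ.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# A 50-point gap under the assumed convention, ignoring ties.
print(f"{expected_win_rate(50):.3f}")  # prints 0.571
```

Under that assumption, a 50-point lead means Gemma would be preferred in roughly 57% of direct comparisons, ties aside.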

