Benchmark scores vs arena.ai performance [D]

If you compare two open-source models, Gemma4-31B and Qwen3.5-27B, you might notice that Qwen beats Gemma on almost all benchmark sets. On arena.ai, however, the situation is reversed: Gemma beats Qwen quite decisively, by about 50 Elo points across different categories. Is there a good explanation for this apparent discrepancy?
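
For a sense of scale, here is a rough sketch of what a gap of about 50 points would mean in head-to-head votes. It assumes the arena scores follow the standard Elo logistic convention (base 10, 400-point scale); that is an assumption on my part, not something confirmed about how arena.ai computes its ratings. The function name `expected_win_rate` is made up for illustration.

```python
def expected_win_rate(rating_gap: float, scale: float = 400.0) -> float:
    """Expected win probability of the higher-rated model.

    Assumes the standard Elo logistic model (base 10, 400-point scale).
    Whether arena.ai uses exactly this convention is an assumption.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


if __name__ == "__main__":
    gap = 50  # approximate gap quoted in the post
    print(f"Expected win rate at a {gap}-point gap: {expected_win_rate(gap):.3f}")
```

Under that assumption, a 50-point gap corresponds to the higher-rated model being preferred in roughly 57% of decisive pairwise comparisons.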

submitted by /u/we_are_mammals