Take two open-source models, Gemma4-31B and Qwen3.5-27B: Qwen beats Gemma on almost all benchmark sets. On arena.ai, however, the situation is reversed: Gemma beats Qwen quite decisively, by about 50 Elo points across categories. Is there a good explanation for this apparent discrepancy?
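For a sense of scale, here is a rough sketch of what a 50-point gap means in head-to-head terms. It assumes the arena ratings behave like standard Elo ratings (logistic curve, 400-point scale), which may not exactly match how arena.ai computes its scores; the function name `expected_win_rate` is just illustrative.

```python
def expected_win_rate(rating_gap: float, scale: float = 400.0) -> float:
    """Expected win probability for the higher-rated model.

    Uses the standard Elo logistic formula. The 400-point scale is the
    conventional Elo value and is an assumption about the leaderboard.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


if __name__ == "__main__":
    # Gap of about 50 points, as quoted in the post.
    print(f"{expected_win_rate(50):.3f}")
```

Under that assumption, a 50-point lead works out to roughly a 57% expected win rate in pairwise votes: a consistent preference, but far from a sweep.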