I built a small website called LLM Win.
It turns LLM benchmark results into a directed graph:
```text
If model A beats model B on benchmark X, add an edge A -> B.
```
Then it searches for the shortest transitive chain between two models.
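To make the construction concrete, here is a minimal Python sketch of the idea. This is not the site's actual code; the data shape is an assumption, with `results` mapping each benchmark to a `{model: score}` dict.

```python
# Minimal sketch, assuming results: {benchmark: {model: score}}.
from collections import defaultdict, deque

def build_win_graph(results):
    """Add an edge A -> B whenever A outscores B on some benchmark."""
    graph = defaultdict(set)
    for benchmark, scores in results.items():
        models = list(scores)
        for a in models:
            for b in models:
                if a != b and scores[a] > scores[b]:
                    graph[a].add(b)
    return graph

def shortest_chain(graph, source, target):
    """BFS for the shortest transitive win chain from source to target."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # target unreachable by any win chain
```

Construction is O(n²) per benchmark, and BFS guarantees the returned chain is a shortest one.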
The meme version is:
```text
Can LLaMA 2 7B beat Claude Opus 4.7?
```
In an absurd transitive benchmark sense, sometimes yes. But I added a Report tab because the structure itself seems useful for model evaluation. Some experimental findings from the current Artificial Analysis data snapshot:
- **Weak-to-strong reachability is high.** I checked 126,937 pairs where the source model has a lower Intelligence Index than the target model. 119,514 of them are reachable through benchmark win chains, a reachability rate of 94.2%.
- **Most paths are short.** Among reachable weak-to-strong pairs, 2-3 hop paths account for 91.4%, so this is not mostly long-chain cherry-picking.
- **Direct reversal triples are abundant.** After treating non-positive benchmark values as missing, there are still about 119k direct weak-over-strong triples of the form (source model, target model, benchmark), where the source has a lower Intelligence Index but a higher score on that benchmark.
- **Some benchmarks create more reversals than others.** Current high-reversal / useful-signal candidates include Humanity's Last Exam, IFBench, AIME 2025, TAU2, and SciCode.
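As a rough illustration of how these counts could be computed, here is a hedged sketch. `scores`, `iq` (model to Intelligence Index), and the win graph from the earlier sketch are all assumed inputs, not the site's implementation.

```python
# Hedged sketch of the reachability and reversal counts above.
from collections import deque

def reversal_triples(scores, iq):
    """Yield (source, target, benchmark) triples where the source has a lower
    Intelligence Index than the target but a higher benchmark score."""
    for bench, vals in scores.items():
        # Treat non-positive benchmark values as missing, as in the post.
        clean = {m: v for m, v in vals.items() if v > 0 and m in iq}
        for src in clean:
            for tgt in clean:
                if iq[src] < iq[tgt] and clean[src] > clean[tgt]:
                    yield (src, tgt, bench)

def reachability_rate(graph, iq):
    """Fraction of weak -> strong pairs joined by some benchmark win chain."""
    reachable = total = 0
    for src in iq:
        seen = {src}                      # BFS flood-fill from each source
        queue = deque([src])
        while queue:
            for nxt in graph.get(queue.popleft(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        for tgt in iq:
            if tgt != src and iq[src] < iq[tgt]:
                total += 1
                if tgt in seen:
                    reachable += 1
    return reachable / total if total else 0.0
```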
Different benchmarks call for different interpretations. For example, IFBench shows roughly a reversal rate of ~17.5%, coverage of ~80.0%, and correlation with the Intelligence Index of r ≈ 0.82. This suggests it may provide an independent skill signal rather than simply duplicating the overall ranking.
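The post doesn't pin down these metric definitions, so here is one plausible reading: reversal rate as the share of comparable model pairs where the lower-Index model outscores the higher one, and coverage as the share of indexed models with a score. The function name and inputs are assumptions.

```python
# One plausible way to compute per-benchmark diagnostics like the IFBench
# numbers above; definitions here are my guesses, not the site's.
from statistics import correlation  # Python 3.10+

def benchmark_diagnostics(bench_scores, iq):
    """Return (reversal_rate, coverage, r) for one benchmark's {model: score}."""
    clean = {m: v for m, v in bench_scores.items() if v > 0 and m in iq}
    coverage = len(clean) / len(iq)       # share of indexed models with a score
    models = list(clean)
    pairs = reversals = 0
    for i, a in enumerate(models):
        for b in models[i + 1:]:
            if iq[a] == iq[b]:
                continue                  # no weak/strong ordering to reverse
            pairs += 1
            weak, strong = (a, b) if iq[a] < iq[b] else (b, a)
            if clean[weak] > clean[strong]:
                reversals += 1
    # Pearson r between benchmark scores and the Intelligence Index
    # (raises StatisticsError if fewer than two models are covered).
    r = correlation([clean[m] for m in models], [iq[m] for m in models])
    return (reversals / pairs if pairs else 0.0), coverage, r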
My current interpretation:
LLM rankings are better represented as a benchmark-specific capability graph than as a single ladder. Some reversals probably reflect real specialization; some reflect benchmark coverage limits, volatility, or measurement noise.
The next question is whether reversal structure can help build better evaluation metrics:
- identify specialist models;
- identify volatile benchmarks;
- build robust generalist scores;
- select complementary benchmark sets;
- decompose models into capability fingerprints (see the sketch after this list).
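To make the last item concrete, here is a hypothetical sketch of a capability fingerprint: z-score each benchmark column so that a model's profile of strengths and weaknesses, including reversals, becomes a comparable vector. This is one possible definition, not anything the site implements.

```python
# Hypothetical capability fingerprints: z-score per benchmark column.
from statistics import mean, stdev

def capability_fingerprints(scores):
    """Map each model to {benchmark: z-score}, skipping missing values."""
    fingerprints = {}
    for bench, vals in scores.items():
        clean = {m: v for m, v in vals.items() if v > 0}
        if len(clean) < 2:
            continue                      # not enough data to normalize
        mu, sigma = mean(clean.values()), stdev(clean.values())
        if sigma == 0:
            continue                      # constant column carries no signal
        for m, v in clean.items():
            fingerprints.setdefault(m, {})[bench] = (v - mu) / sigma
    return fingerprints
```

Two models with similar Intelligence Index but dissimilar fingerprints would be candidates for real specialization rather than noise.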
Curious what people think: Is benchmark reversal structure a useful evaluation signal, or mostly an artifact of noisy benchmarks?