I built a small website called LLM Win.
It turns LLM benchmark results into a directed graph:
```text
If model A beats model B on benchmark X, add an edge A -> B.
```
Then it searches for the shortest transitive chain between two models.
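To make the construction concrete, here is a minimal Python sketch of the idea. This is not the site's actual code; the data shape is an assumption, with `results` mapping each benchmark to a `{model: score}` dict.

```python
# Minimal sketch, assuming results: {benchmark: {model: score}}.
from collections import defaultdict, deque

def build_win_graph(results):
    """Add an edge A -> B whenever A outscores B on some benchmark."""
    graph = defaultdict(set)
    for benchmark, scores in results.items():
        models = list(scores)
        for a in models:
            for b in models:
                if a != b and scores[a] > scores[b]:
                    graph[a].add(b)
    return graph

def shortest_chain(graph, source, target):
    """BFS for the shortest transitive win chain from source to target."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # target unreachable by any win chain
```

Construction is O(n²) per benchmark, and BFS guarantees the returned chain is a shortest one.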
The meme version is:
```text
Can LLaMA 2 7B beat Claude Opus 4.7?
```
In an absurd transitive benchmark sense, sometimes yes. But I added a Report tab because the structure itself seems useful for model evaluation. Some experimental findings from the current Artificial Analysis data snapshot:
- **Weak-to-strong reachability is high.** I checked 126,937 pairs where the source model has a lower Intelligence Index than the target model. 119,514 of them are reachable through benchmark win chains, a reachability rate of 94.2%.
- **Most paths are short.** Among reachable weak-to-strong pairs, 2-3 hop paths account for 91.4%, so this is not mostly long-chain cherry-picking.
- **Direct reversal triples are abundant.** After treating non-positive benchmark values as missing, there are still about 119k direct weak-over-strong triples of the form (source model, target model, benchmark), where the source has a lower Intelligence Index but a higher score on that benchmark.
- **Some benchmarks create more reversals than others.** Current high-reversal / useful-signal candidates include Humanity's Last Exam, IFBench, AIME 2025, TAU2, and SciCode.
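As a rough illustration of how these counts could be computed, here is a hedged sketch. `scores`, `iq` (model to Intelligence Index), and the win graph from the earlier sketch are all assumed inputs, not the site's implementation.

```python
# Hedged sketch of the reachability and reversal counts above.
from collections import deque

def reversal_triples(scores, iq):
    """Yield (source, target, benchmark) triples where the source has a lower
    Intelligence Index than the target but a higher benchmark score."""
    for bench, vals in scores.items():
        # Treat non-positive benchmark values as missing, as in the post.
        clean = {m: v for m, v in vals.items() if v > 0 and m in iq}
        for src in clean:
            for tgt in clean:
                if iq[src] < iq[tgt] and clean[src] > clean[tgt]:
                    yield (src, tgt, bench)

def reachability_rate(graph, iq):
    """Fraction of weak -> strong pairs joined by some benchmark win chain."""
    reachable = total = 0
    for src in iq:
        seen = {src}                      # BFS flood-fill from each source
        queue = deque([src])
        while queue:
            for nxt in graph.get(queue.popleft(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        for tgt in iq:
            if tgt != src and iq[src] < iq[tgt]:
                total += 1
                if tgt in seen:
                    reachable += 1
    return reachable / total if total else 0.0
```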
Different benchmarks call for different interpretations. For example, IFBench shows roughly a reversal rate of ~17.5%, coverage of ~80.0%, and correlation with the Intelligence Index of r ≈ 0.82. This suggests it may provide an independent skill signal rather than simply duplicating the overall ranking.
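The post doesn't pin down these metric definitions, so here is one plausible reading: reversal rate as the share of comparable model pairs where the lower-Index model outscores the higher one, and coverage as the share of indexed models with a score. The function name and inputs are assumptions.

```python
# One plausible way to compute per-benchmark diagnostics like the IFBench
# numbers above; definitions here are my guesses, not the site's.
from statistics import correlation  # Python 3.10+

def benchmark_diagnostics(bench_scores, iq):
    """Return (reversal_rate, coverage, r) for one benchmark's {model: score}."""
    clean = {m: v for m, v in bench_scores.items() if v > 0 and m in iq}
    coverage = len(clean) / len(iq)       # share of indexed models with a score
    models = list(clean)
    pairs = reversals = 0
    for i, a in enumerate(models):
        for b in models[i + 1:]:
            if iq[a] == iq[b]:
                continue                  # no weak/strong ordering to reverse
            pairs += 1
            weak, strong = (a, b) if iq[a] < iq[b] else (b, a)
            if clean[weak] > clean[strong]:
                reversals += 1
    # Pearson r between benchmark scores and the Intelligence Index
    # (raises StatisticsError if fewer than two models are covered).
    r = correlation([clean[m] for m in models], [iq[m] for m in models])
    return (reversals / pairs if pairs else 0.0), coverage, r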
My current interpretation:
LLM rankings are better represented as a benchmark-specific capability graph than as a single ladder. Some reversals probably reflect real specialization; some reflect benchmark coverage limits, volatility, or measurement noise.
The next question is whether reversal structure can help build better evaluation metrics:
- identify specialist models;
- identify volatile benchmarks;
- build robust generalist scores;
- select complementary benchmark sets;
- decompose models into capability fingerprints (see the sketch after this list).
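To make the last item concrete, here is a hypothetical sketch of a capability fingerprint: z-score each benchmark column so that a model's profile of strengths and weaknesses, including reversals, becomes a comparable vector. This is one possible definition, not anything the site implements.

```python
# Hypothetical capability fingerprints: z-score per benchmark column.
from statistics import mean, stdev

def capability_fingerprints(scores):
    """Map each model to {benchmark: z-score}, skipping missing values."""
    fingerprints = {}
    for bench, vals in scores.items():
        clean = {m: v for m, v in vals.items() if v > 0}
        if len(clean) < 2:
            continue                      # not enough data to normalize
        mu, sigma = mean(clean.values()), stdev(clean.values())
        if sigma == 0:
            continue                      # constant column carries no signal
        for m, v in clean.items():
            fingerprints.setdefault(m, {})[bench] = (v - mu) / sigma
    return fingerprints
```

Two models with similar Intelligence Index but dissimilar fingerprints would be candidates for real specialization rather than noise.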
Curious what people think: Is benchmark reversal structure a useful evaluation signal, or mostly an artifact of noisy benchmarks?