LLM Benchmarks Are Junk Science

By Kaushik Rajan / April 1, 2026

An Oxford review of 445 benchmarks found that 84% lack basic statistical testing. Models score 90% on standard tests but 2% on unseen problems…