Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs
arXiv:2605.00674v1
Abstract: Large language models (LLMs) are becoming increasingly capable mathematical collaborators, but static benchmarks are no longer sufficient for evaluating progress: they are often narrow in scope, quickly …