MathDuels: Evaluating LLMs as Problem Posers and Solvers
arXiv:2604.21916v1
Abstract: As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast …