cs.CL, cs.SE

MathDuels: Evaluating LLMs as Problem Posers and Solvers

arXiv:2604.21916v1 Announce Type: new
Abstract: As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast …