cs.AI, cs.CL, cs.LG

Robust Reasoning Benchmark

arXiv:2604.08571v1 Announce Type: new
Abstract: While Large Language Models (LLMs) achieve high performance on standard mathematical benchmarks, their underlying reasoning processes remain highly overfit to standard textual formatting. We propose a pe…