Pushing the Boundaries of Multiple Choice Evaluation to One Hundred Options
arXiv:2604.14634v1 Announce Type: new
Abstract: Multiple choice evaluation is widely used for benchmarking large language models, yet near-ceiling accuracy in low-option settings can be sustained by shortcut strategies that obscure true competence. Th…