NSMQ Riddles: A Benchmark of Scientific and Mathematical Riddles for Quizzing Large Language Models
arXiv:2605.07051v1 Announce Type: new
Abstract: Large Language Models (LLMs) have performed well on a range of science education benchmarks, demonstrating their potential for use in science and mathematics education. Yet, LLMs tend to be evalua…