CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms
arXiv:2604.10825v1 Announce Type: new
Abstract: We introduce CheeseBench, a benchmark that evaluates large language models (LLMs) on nine classical behavioral neuroscience paradigms (Morris water maze, Barnes maze, T-maze, radial arm maze, star maze, …