Language Models Might Not Understand You: Evaluating Theory of Mind via Story Prompting

arXiv:2506.19089v5

Abstract: We introduce StorySim, a programmable framework for synthetically generating stories to evaluate the theory of mind (ToM) and world modeling (WM) capabilities of large language models (LLMs). Unlike prior benchmarks, which may suffer from pretraining-data contamination or rely on an LLM for generation, StorySim produces novel, compositional story prompts anchored by a highly controllable Storyboard, enabling precise manipulation of character perspectives and events. We use this framework to design first- and second-order ToM tasks alongside WM tasks that control for the ability to track and model mental states. Our experiments across a suite of LLMs show that most models achieve higher accuracy on WM tasks than on ToM tasks, and that models reason more accurately when the subject of reasoning is a person rather than an inanimate object. The framework also reveals evidence of heuristic behavior and an over-reliance on events that occur earlier in the story. All code for generating data and running evaluations is freely available.
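To make the Storyboard idea concrete, here is a minimal sketch of how a programmable event list could pair one generated story with both a WM question (where an object actually is) and a first-order ToM question (where a character believes it is). The `Event`, `Storyboard`, `add_move`, and `believed_location` names are hypothetical illustrations under a simple move-and-witness event schema, not the paper's actual API.

```python
# Minimal sketch of a StorySim-style storyboard (hypothetical API, not the
# paper's implementation): events record who moved what, where, and who saw it.
from dataclasses import dataclass, field

@dataclass
class Event:
    actor: str           # who performs the action
    obj: str             # what is moved
    dest: str            # where it ends up
    witnesses: set[str]  # characters who observe this event

@dataclass
class Storyboard:
    characters: list[str]
    events: list[Event] = field(default_factory=list)

    def add_move(self, actor: str, obj: str, dest: str, witnesses: list[str]) -> None:
        self.events.append(Event(actor, obj, dest, set(witnesses)))

    def render_story(self) -> str:
        """Turn the ordered event list into natural-language prose."""
        return " ".join(f"{e.actor} moves the {e.obj} to the {e.dest}." for e in self.events)

    def true_location(self, obj: str) -> str:
        """WM ground truth: the last place the object was actually moved."""
        return next(e.dest for e in reversed(self.events) if e.obj == obj)

    def believed_location(self, character: str, obj: str) -> str:
        """First-order ToM ground truth: the last move the character witnessed."""
        for e in reversed(self.events):
            if e.obj == obj and character in e.witnesses:
                return e.dest
        raise ValueError(f"{character} never saw the {obj}")

# Usage: a classic false-belief setup.
board = Storyboard(characters=["Sally", "Anne"])
board.add_move("Sally", "marble", "basket", witnesses=["Sally", "Anne"])
board.add_move("Anne", "marble", "box", witnesses=["Anne"])  # Sally is absent

print(board.render_story())
print("WM answer:", board.true_location("marble"))                 # box
print("ToM answer:", board.believed_location("Sally", "marble"))   # basket
```

Because the storyboard, not an LLM, is the source of ground truth, the same event list can deterministically label both question types, which is what lets WM tasks control for mental-state tracking in the ToM tasks.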
