ReFACT: A Benchmark for Scientific Confabulation Detection with Positional Error Annotations
arXiv:2509.25868v3 Announce Type: replace
Abstract: The mechanisms underlying scientific confabulation in Large Language Models (LLMs) remain poorly understood. We introduce ReFACT (Reddit False And Correct Texts), a benchmark of 1,001 expert-annotated question-answer pairs with span-level error annotations derived from Reddit's r/AskScience. Evaluating 9 state-of-the-art LLMs reveals two critical limitations. First, models exhibit a dominant "salient distractor" failure mode: 61% of incorrect span predictions are semantically unrelated to actual errors. Crucially, this pattern persists across all model scales (1B to 70B), indicating a fundamental semantic grounding deficit that scaling alone fails to resolve. Second, we find that comparative judgment is paradoxically harder than independent detection, even GPT-4o's F1 score drops from 0.67 to 0.53 when comparing answers side-by-side. These findings directly challenge the reliability of LLM-as-Judge paradigms for scientific factuality. Code and data are released at https://github.com/ddz5431/ReFACT.