SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation
arXiv:2605.12022v1 Announce Type: new
Abstract: Large Language Models (LLMs) achieve strong performance on standard knowledge evaluation benchmarks, yet recent work shows that their knowledge capabilities remain brittle under question variants that te…