Reward Auditor: Inference on Reward Modeling Suitability in Real-World Perturbed Scenarios
arXiv:2512.00920v5 Announce Type: replace
Abstract: Reliable reward models (RMs) are critical for ensuring the safe alignment of large language models (LLMs). However, current RM evaluation methods focus solely on preference perception accuracy in g…