Robust Reward Modeling for Large Language Models via Causal Decomposition
arXiv:2604.13833v2 Announce Type: replace
Abstract: Reward models are central to aligning large language models, yet they often overfit to spurious cues such as response length and overly agreeable tone. Most prior work weakens these cues directly by …