Misaligned by Reward: Socially Undesirable Preferences in LLMs
arXiv:2605.05003v1 Announce Type: new
Abstract: Reward models are a key component of large language model alignment, serving as proxies for human preferences during training. However, existing evaluations focus primarily on broad instruction-following…
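The abstract's framing of reward models as "proxies for human preferences during training" refers to the standard preference-modeling step of RLHF. As a minimal, illustrative sketch (not the paper's code), the snippet below fits a scalar reward head to pairwise preference data with a Bradley-Terry objective; the model dimensions, data, and class names are assumptions for illustration only.

```python
# Minimal sketch (not the paper's implementation): pairwise reward-model
# training with a Bradley-Terry loss, the standard way a reward model is
# fit to human preference data before RLHF. All names and sizes are
# illustrative assumptions.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy scalar reward head over a fixed-size response embedding
    (a stand-in for a full LLM backbone)."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, hidden_size) -> one scalar reward per response
        return self.head(embeddings).squeeze(-1)

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Maximize log sigmoid(r_chosen - r_rejected): the reward model should
    # score the human-preferred response above the rejected one.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    model = RewardModel()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Fake preference pairs: embeddings of chosen vs. rejected responses.
    chosen = torch.randn(16, 768)
    rejected = torch.randn(16, 768)

    for step in range(100):
        loss = bradley_terry_loss(model(chosen), model(rejected))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"final preference loss: {loss.item():.4f}")
```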