cs.AI, cs.CL, cs.CY

Misaligned by Reward: Socially Undesirable Preferences in LLMs

arXiv:2605.05003v1 Announce Type: new
Abstract: Reward models are a key component of large language model alignment, serving as proxies for human preferences during training. However, existing evaluations focus primarily on broad instruction-following…