Misaligned by Reward: Socially Undesirable Preferences in LLMs
arXiv:2605.05003v1 Announce Type: new
Abstract: Reward models are a key component of large language model alignment, serving as proxies for human preferences during training. However, existing evaluations focus primarily on broad instruction-following…
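The abstract's framing of reward models as "proxies for human preferences during training" refers to the standard preference-modeling step of RLHF. As a minimal, illustrative sketch (not the paper's code), the snippet below fits a scalar reward head to pairwise preference data with a Bradley-Terry objective; the model dimensions, data, and class names are assumptions for illustration only.

```python
# Minimal sketch (not the paper's implementation): pairwise reward-model
# training with a Bradley-Terry loss, the standard way a reward model is
# fit to human preference data before RLHF. All names and sizes are
# illustrative assumptions.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy scalar reward head over a fixed-size response embedding
    (a stand-in for a full LLM backbone)."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, hidden_size) -> one scalar reward per response
        return self.head(embeddings).squeeze(-1)

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Maximize log sigmoid(r_chosen - r_rejected): the reward model should
    # score the human-preferred response above the rejected one.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    model = RewardModel()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Fake preference pairs: embeddings of chosen vs. rejected responses.
    chosen = torch.randn(16, 768)
    rejected = torch.randn(16, 768)

    for step in range(100):
        loss = bradley_terry_loss(model(chosen), model(rejected))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"final preference loss: {loss.item():.4f}")
```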