Preference learning in shades of gray: Interpretable and bias-aware reward modeling for human preferences
arXiv:2604.01312v1 Announce Type: new
Abstract: Learning human preferences in language models remains fundamentally challenging, as reward modeling relies on subtle, subjective comparisons (shades of gray) rather than clear-cut labels. This study inv…