Rajiv Movva, Smitha Milli, Sewon Min, Emma Pierson

What’s In My Human Feedback? Learning Interpretable Descriptions of Preference Data

Rajiv Movva, Smitha Milli, Sewon Min, Emma Pierson / April 14, 2026

arXiv:2510.26202v2
Abstract: Human feedback can alter language models in unpredictable and undesirable ways, as practitioners lack a clear understanding of what feedback data encodes. While prior work studies preferences over ce…
