Three Models of RLHF Annotation: Extension, Evidence, and Authority
arXiv:2604.25895v1 Announce Type: cross
Abstract: Preference-based alignment methods, most prominently Reinforcement Learning from Human Feedback (RLHF), use the judgments of human annotators to shape large language model behaviour. However, the norma…
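
As a rough illustration of the mechanism the abstract refers to (not taken from the paper), here is a minimal sketch of how annotator preference judgments are commonly turned into a reward-model training signal, assuming a Bradley-Terry style pairwise loss; the function name, tensor names, and toy values are hypothetical.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(reward_chosen: torch.Tensor,
                             reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style objective: push the reward model to score the
    annotator-preferred response above the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy example: scalar rewards a reward model assigned to two candidate
# responses per prompt, where annotators preferred the "chosen" one.
chosen = torch.tensor([1.2, 0.4])
rejected = torch.tensor([0.3, 0.9])
print(pairwise_preference_loss(chosen, rejected))
```

The resulting reward model is then typically used to fine-tune the language model with a policy-gradient method, which is how annotators' judgments end up shaping model behaviour.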