cs.AI, cs.CL, cs.CY

Three Models of RLHF Annotation: Extension, Evidence, and Authority

arXiv:2604.25895v1 Announce Type: cross
Abstract: Preference-based alignment methods, most prominently Reinforcement Learning from Human Feedback (RLHF), use the judgments of human annotators to shape large language model behaviour. However, the norma…