Mitigating Cognitive Bias in RLHF by Altering Rationality
arXiv:2605.06895v1 Announce Type: new
Abstract: How can we make models robust even to imperfect human feedback? In reinforcement learning from human feedback (RLHF), human preferences over model outputs are used to train a reward model that assigns sc…
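The reward-model training the abstract describes is typically driven by a pairwise preference objective (Bradley-Terry style). A minimal sketch of that standard loss, not the paper's specific method; the function name and scalar inputs are illustrative:

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected)).

    Minimized when the reward model scores the human-preferred
    response higher than the rejected one, so noisy or biased
    preference labels translate directly into a noisy reward signal.
    """
    margin = r_chosen - r_rejected
    # Numerically stable form of -log(sigmoid(margin)):
    # log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# A correctly ordered pair incurs lower loss than a reversed pair.
print(bradley_terry_loss(2.0, 0.0) < bradley_terry_loss(0.0, 2.0))
```

Because the loss depends only on the score difference, systematic biases in human comparisons propagate into the learned reward, which is the failure mode the paper targets.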