Explaining and Preventing Alignment Collapse in Iterative RLHF
arXiv:2605.04266v1 Announce Type: cross
Abstract: Reinforcement learning from human feedback (RLHF) typically assumes a static or non-strategic reward model (RM). In iterative deployment, however, the policy generates the data on which the RM is retrained…
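
To make the feedback loop described above concrete, here is a minimal, self-contained sketch of a generic iterative RLHF cycle: the policy generates samples, preference labels are collected on those samples, the RM is retrained on that policy-generated data, and the policy is then optimized against the refreshed RM. This is an assumption-laden toy with scalar stand-ins; the names (sample_from_policy, collect_preferences, retrain_reward_model, update_policy) and dynamics are illustrative only and are not the paper's model or results.

```python
# Hypothetical toy sketch of an iterative RLHF loop (not the paper's method).
# Each iteration: policy -> data -> preference labels -> RM retraining -> policy update.

import random

def sample_from_policy(policy_temperature: float, n: int) -> list[float]:
    """Stand-in for generation: draw scalar 'responses' whose spread tracks the policy."""
    return [random.gauss(0.0, policy_temperature) for _ in range(n)]

def collect_preferences(samples: list[float]) -> list[tuple[float, float]]:
    """Stand-in for human labeling: prefer the larger value in each random pair."""
    pairs = []
    for _ in range(len(samples) // 2):
        a, b = random.sample(samples, 2)
        pairs.append((max(a, b), min(a, b)))  # (chosen, rejected)
    return pairs

def retrain_reward_model(rm_weight: float, prefs: list[tuple[float, float]], lr: float = 0.1) -> float:
    """Toy RM update: nudge a single weight so chosen responses score above rejected ones."""
    for chosen, rejected in prefs:
        margin = rm_weight * (chosen - rejected)
        if margin < 1.0:  # hinge-style update when the RM under-separates the pair
            rm_weight += lr * (chosen - rejected)
    return rm_weight

def update_policy(policy_temperature: float, rm_weight: float, lr: float = 0.05) -> float:
    """Toy policy update: widen or narrow the sampling spread to chase RM reward."""
    return max(0.1, policy_temperature + lr * rm_weight)

policy_temperature, rm_weight = 1.0, 0.0
for iteration in range(5):
    samples = sample_from_policy(policy_temperature, n=64)            # policy generates the data
    prefs = collect_preferences(samples)                               # labels on the policy's own outputs
    rm_weight = retrain_reward_model(rm_weight, prefs)                 # RM retrained on that data
    policy_temperature = update_policy(policy_temperature, rm_weight)  # policy optimized against new RM
    print(f"iter {iteration}: rm_weight={rm_weight:.2f}, policy_temperature={policy_temperature:.2f}")
```

The point of the sketch is only the data dependence: because the RM is retrained on samples the current policy produced, each policy update changes the distribution the next RM sees, which is the coupling the abstract identifies.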