cs.LG

Regret Bounds for Reinforcement Learning from Multi-Source Imperfect Preferences

arXiv:2603.20453v2 Announce Type: replace
Abstract: Reinforcement learning from human feedback (RLHF) replaces hard-to-specify rewards with pairwise trajectory preferences, yet regret-oriented theory often assumes that preference labels are generated …
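The abstract is truncated, but the pairwise-preference setup it describes is typically formalized with a Bradley-Terry style model, in which the probability of preferring one trajectory over another is a logistic function of their reward difference. The sketch below illustrates that standard model; it is an assumption for illustration, not a claim about this paper's specific noise model.

```python
import math

def preference_prob(r1: float, r2: float) -> float:
    """Bradley-Terry probability that trajectory 1 is preferred over
    trajectory 2, given their (latent) cumulative rewards r1 and r2."""
    return 1.0 / (1.0 + math.exp(-(r1 - r2)))

# Equal returns: either trajectory is equally likely to be preferred.
print(preference_prob(1.0, 1.0))  # 0.5

# A large reward gap makes the preference nearly deterministic.
print(preference_prob(5.0, 0.0))
```

Imperfect or multi-source feedback, as in the title, would perturb this link function, e.g. by flipping labels with some probability or mixing annotators with different noise levels.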