Efficient Preference Poisoning Attack on Offline RLHF
arXiv:2605.02495v1 Announce Type: cross
Abstract: Offline Reinforcement Learning from Human Feedback (RLHF) pipelines such as Direct Preference Optimization (DPO) train on a pre-collected preference dataset, which makes them vulnerable to preference p…
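For context (not part of the truncated abstract above), the standard DPO objective over a pre-collected preference dataset $\mathcal{D} = \{(x, y_w, y_l)\}$ is usually written as:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

where $y_w$ and $y_l$ are the preferred and dispreferred responses, $\pi_{\mathrm{ref}}$ is a frozen reference policy, and $\beta$ controls the strength of the KL-style regularization. Because the loss depends only on which response in each pair carries the "preferred" label, an attacker who can flip or inject a small number of preference labels in $\mathcal{D}$ can directly shift the optimum of the trained policy, which is the vulnerability this abstract refers to.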