cs.LG

Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback

arXiv:2604.19024v1 Announce Type: new
Abstract: Safe Reinforcement Learning from Human Feedback (Safe RLHF) has recently achieved empirical success in developing helpful and harmless large language models by decoupling human preferences regarding help…
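The truncated abstract names a policy gradient primal-dual method but does not show the algorithm itself. As a rough illustration of the general technique that family of methods builds on (not the paper's specific algorithm), the sketch below implements a textbook Lagrangian primal-dual update for a constrained objective: ascend the reward-minus-penalized-cost Lagrangian in the policy parameters, and ascend the dual multiplier on the constraint violation. The function name, the toy 1-D reward/cost functions, and the step sizes are all hypothetical choices for illustration.

```python
def primal_dual_step(theta, lam, reward_grad, cost_grad, cost_value,
                     budget, lr_theta=0.1, lr_lam=0.05):
    """One generic primal-dual update for a constrained objective.

    Primal: gradient ascent on the Lagrangian
            L(theta, lam) = J_reward(theta) - lam * (J_cost(theta) - budget)
    Dual:   gradient ascent on the constraint violation,
            projected back onto lam >= 0.
    (Generic sketch; not the algorithm from the paper above.)
    """
    new_theta = theta + lr_theta * (reward_grad - lam * cost_grad)
    new_lam = max(0.0, lam + lr_lam * (cost_value - budget))
    return new_theta, new_lam


# Hypothetical 1-D toy problem, standing in for "helpfulness" reward
# under a "harmlessness" cost budget:
#   maximize  J_r(theta) = -(theta - 2)^2
#   subject to J_c(theta) = theta^2 <= 1
# The constrained optimum is theta = 1, with multiplier lam = 1.
theta, lam = 0.0, 0.0
for _ in range(1000):
    theta, lam = primal_dual_step(
        theta, lam,
        reward_grad=-2.0 * (theta - 2.0),
        cost_grad=2.0 * theta,
        cost_value=theta ** 2,
        budget=1.0,
    )
```

In this toy run the iterates settle near the constrained optimum (theta ≈ 1, lam ≈ 1): the multiplier grows only while the cost exceeds its budget, which is the mechanism that lets such methods trade helpfulness against a hard harmlessness constraint rather than a fixed penalty weight.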