DDO-RM: Distribution-Level Policy Improvement after Reward Learning
arXiv:2604.11119v2
Abstract: Recent theory suggests that reward-model-first methods can be more sample-efficient than direct policy fitting when the reward function is statistically simpler than the induced policy. We prop…