Tiantian Zhang, Jierui Zuo, Michael Chen, Wenping Wang

DDO-RM: Distribution-Level Policy Improvement after Reward Learning

Tiantian Zhang, Jierui Zuo, Michael Chen, Wenping Wang / May 1, 2026

arXiv:2604.11119v2 Announce Type: replace-cross
Abstract: Recent theory suggests that reward-model-first methods can be more sample-efficient than direct policy fitting when the reward function is statistically simpler than the induced policy. We prop…
