Nirmal Patel, Fei Wang, Inderjit S. Dhillon

ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization

May 18, 2026

arXiv:2605.12667v2
Abstract: The alignment of Large Language Models (LLMs) utilizes Reinforcement Learning from AI Feedback (RLAIF) for non-verifiable domains such as long-form question answering and open-ended instruction…
