cs.AI, cs.LG, stat.ML

When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

arXiv:2604.25872v1 Announce Type: new
Abstract: Training language models via reinforcement learning often relies on imperfect proxy rewards, since ground truth rewards that precisely define the intended behavior are rarely available. Standard metrics …
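The abstract's premise, that errors in a proxy reward need not harm training, can be illustrated with a minimal REINFORCE sketch. Everything here is a hypothetical construction, not the paper's method: a 3-armed bandit whose proxy reward misstates every per-arm value but preserves the ranking of the optimal arm, so a softmax policy trained only on the proxy still concentrates on the truly best action.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: ground-truth rewards (usually unavailable) and an
# imperfect proxy whose per-arm errors still preserve the optimal argmax.
true_reward = np.array([0.2, 0.5, 1.0])
proxy_reward = np.array([0.3, 0.4, 0.9])

logits = np.zeros(3)
lr = 0.5

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    p = softmax(logits)
    a = rng.choice(3, p=p)
    r = proxy_reward[a]            # the learner only ever sees the proxy
    baseline = p @ proxy_reward    # expected proxy reward, reduces variance
    # REINFORCE gradient for a softmax policy: (r - b) * (one_hot(a) - p)
    grad = -p
    grad[a] += 1.0
    logits += lr * (r - baseline) * grad

p = softmax(logits)
print(p.argmax())  # index of the arm the learned policy prefers
```

Because the proxy's errors leave the argmax intact, optimizing the "wrong" reward here still yields the policy that is optimal under the true reward, which is the kind of benign imperfection the abstract's categorization is about.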