Prakhar Gupta, Vaibhav Gupta

Bootstrapped Mixed Rewards for RL Post-Training: Injecting Canonical Action Order

Prakhar Gupta, Vaibhav Gupta / May 6, 2026

arXiv:2512.04277v3 Announce Type: replace-cross
Abstract: Post-training with reinforcement learning (RL) typically optimizes a single scalar objective and ignores structure in how solutions are produced. We ask whether a scalar hint toward a canonical…
