cs.AI, cs.LG

Bootstrapped Mixed Rewards for RL Post-Training: Injecting Canonical Action Order

arXiv:2512.04277v3 Announce Type: replace-cross
Abstract: Post-training with reinforcement learning (RL) typically optimizes a single scalar objective and ignores structure in how solutions are produced. We ask whether a scalar hint toward a canonical…