Self-Aligned Reward: Towards Effective and Efficient Reasoners
arXiv:2509.05489v2 Announce Type: replace
Abstract: Reinforcement learning with verifiable rewards has significantly advanced reasoning in large language models (LLMs), but such signals remain coarse, offering only binary correctness feedback. This li…