Process Supervision of Confidence Margin for Calibrated LLM Reasoning
arXiv:2604.23333v1 Announce Type: cross
Abstract: Scaling test-time computation with reinforcement learning (RL) has emerged as a reliable path to improving the reasoning ability of large language models (LLMs). Yet, outcome-based rewards often incentivize mode…