cs.CL, cs.LG

Process Supervision of Confidence Margin for Calibrated LLM Reasoning

arXiv:2604.23333v1 Announce Type: cross
Abstract: Scaling test-time computation with reinforcement learning (RL) has emerged as a reliable path to improving the reasoning ability of large language models (LLMs). Yet, outcome-based reward often incentivizes mode…