cs.AI, cs.LG

Learning to Correct: Calibrated Reinforcement Learning for Multi-Attempt Chain-of-Thought

arXiv:2604.17912v1 Announce Type: new
Abstract: State-of-the-art reasoning models use long chain-of-thought (CoT) reasoning to solve increasingly complex problems by spending more test-time computation. In this work, we explore a long CoT setting where the model …