ORBIT: On-policy Exploration-Exploitation for Controllable Multi-Budget Reasoning
arXiv:2601.08310v2 Announce Type: replace
Abstract: Recent Large Reasoning Models (LRMs) achieve strong performance by leveraging long-form Chain-of-Thought (CoT) reasoning, but uniformly applying overlong reasoning at inference time incurs substantia…