When Can LLMs Learn to Reason with Weak Supervision?
arXiv:2604.18574v1 Announce Type: new
Abstract: Large language models have achieved significant reasoning improvements through reinforcement learning with verifiable rewards (RLVR). Yet as model capabilities grow, constructing high-quality reward sign…