Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration
arXiv:2605.05566v1 Announce Type: cross
Abstract: Reinforcement learning with verifiable rewards, particularly via Group Relative Policy Optimization (GRPO), has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, i…