SSPO: Subsentence-level Policy Optimization
arXiv:2511.04256v2
Abstract: As a key component of large language model (LLM) post-training, Reinforcement Learning from Verifiable Rewards (RLVR) has substantially improved reasoning performance. However, existing RLVR algorith…