Bridging SFT and RL: Dynamic Policy Optimization for Robust Reasoning
arXiv:2604.08926v1 Announce Type: new
Abstract: Post-training paradigms for Large Language Models (LLMs), primarily Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), face a fundamental dilemma: SFT provides stability (low variance) but suf…