DFPO: Scaling Value Modeling via Distributional Flow towards Robust and Generalizable LLM Post-Training
arXiv:2602.05890v2 Announce Type: replace
Abstract: Training reinforcement learning (RL) systems in real-world environments remains challenging due to noisy supervision and poor out-of-domain (OOD) generalization, especially in LLM post-training. Rece…