Diagnosing Training Inference Mismatch in LLM Reinforcement Learning
arXiv:2605.14220v1
Abstract: Modern LLM RL systems separate rollout generation from policy optimization. These two stages are expected to produce token probabilities that match exactly. However, implementation differences can make…
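
The exact-match expectation stated in the abstract can be checked directly once both stages report a log-probability for every sampled token. The following is a minimal sketch of such a check, not the paper's diagnostic procedure: the function name `mismatch_stats`, its two arguments, and the choice of summary statistics are all hypothetical, and it assumes the rollout engine and the trainer each expose per-token log-probabilities for the same token sequence.

```python
import math


def mismatch_stats(rollout_logprobs, trainer_logprobs):
    """Summarize per-token disagreement between two log-probability lists.

    Hypothetical helper: both inputs are assumed to hold the log-probability
    that each stage assigned to the same sampled tokens, in the same order.
    """
    if len(rollout_logprobs) != len(trainer_logprobs):
        raise ValueError("sequences must have the same number of tokens")
    if not rollout_logprobs:
        raise ValueError("need at least one token")

    # Signed gap per token: log p_trainer(token) - log p_rollout(token).
    diffs = [t - r for r, t in zip(rollout_logprobs, trainer_logprobs)]
    n = len(diffs)

    return {
        "max_abs_diff": max(abs(d) for d in diffs),
        "mean_abs_diff": sum(abs(d) for d in diffs) / n,
        # Probability ratio p_trainer / p_rollout; exactly 1.0 when the
        # two stages agree on a token.
        "max_ratio": max(math.exp(d) for d in diffs),
        "min_ratio": min(math.exp(d) for d in diffs),
    }


# Usage: identical inputs give zero gaps and ratios of 1.0, while any
# implementation difference shows up as a nonzero gap.
stats = mismatch_stats([-1.0, -2.0], [-1.5, -2.0])
```

Reporting the probability ratio alongside the absolute gap is a design choice made for this sketch, since the ratio is the quantity that importance-weighted policy-gradient objectives typically depend on.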