cs.AI, cs.CL, cs.LG

Diagnosing Training Inference Mismatch in LLM Reinforcement Learning

arXiv:2605.14220v1 Announce Type: cross
Abstract: Modern LLM RL systems separate rollout generation from policy optimization. These two stages are expected to produce token probabilities that match exactly. However, implementation differences can make…
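
As a rough illustration of the mismatch the abstract describes, the sketch below compares the log-probabilities that a rollout engine and a trainer assign to the same sampled tokens. It is not the paper's method. The function name `mismatch_stats`, its inputs, and the choice of statistics are assumptions made for this example.

```python
import math


def mismatch_stats(rollout_logprobs, trainer_logprobs):
    """Summarize per-token disagreement between two log-prob sequences.

    Both arguments are hypothetical inputs: the log-probability of each
    sampled token as reported by the rollout engine and as recomputed by
    the trainer for the same tokens.
    """
    if len(rollout_logprobs) != len(trainer_logprobs):
        raise ValueError("sequences must cover the same tokens")
    if not rollout_logprobs:
        raise ValueError("need at least one token")

    n = len(rollout_logprobs)
    # Trainer minus rollout log-prob for each token (log of the probability ratio).
    diffs = [t - r for r, t in zip(rollout_logprobs, trainer_logprobs)]
    abs_diffs = [abs(d) for d in diffs]

    return {
        # Largest single-token disagreement in log space.
        "max_abs_diff": max(abs_diffs),
        # Average disagreement in log space.
        "mean_abs_diff": sum(abs_diffs) / n,
        # Average probability ratio trainer/rollout; 1.0 when the two agree exactly.
        "mean_ratio": sum(math.exp(d) for d in diffs) / n,
        # Sample estimate of KL(rollout || trainer), valid only under the
        # assumption that the tokens were sampled from the rollout policy.
        "kl_estimate": -sum(diffs) / n,
    }


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    stats = mismatch_stats([-1.0, -2.0], [-1.0, -2.5])
    print(stats)
```

If the two stages agreed exactly, every difference would be zero and the mean ratio would be 1.0. Persistent nonzero values on identical tokens would point to an implementation difference between the stages rather than to sampling noise.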