On the “Causality” Step in Policy Gradient Derivations: A Pedagogical Reconciliation of Full Return and Reward-to-Go
arXiv:2604.04686v1 Announce Type: new
Abstract: In introductory presentations of policy gradients, one often derives the REINFORCE estimator using the full trajectory return and then states, by “causality,” that the full return may be replaced by th…
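As an illustration of the replacement the abstract refers to, the following minimal sketch checks numerically that the full-return and reward-to-go forms of the policy gradient have the same expectation. It is not taken from the paper: the toy MDP (two states, two actions, next state equal to the action taken), the tabular Bernoulli policy, and the names `policy_gradients`, `theta`, and `reward` are all assumptions made for this example. Both expectations are computed exactly by enumerating every action sequence, so no sampling noise is involved.

```python
import itertools
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def policy_gradients(theta, reward, horizon=3, s0=0):
    """Exact expected gradients under the full-return and reward-to-go forms.

    Hypothetical toy setup: pi(a=1 | s) = sigmoid(theta[s]), and the next
    state is simply the action just taken (deterministic transitions).
    """
    full = [0.0] * len(theta)   # uses the whole trajectory return
    togo = [0.0] * len(theta)   # uses only rewards from step t onward
    for actions in itertools.product((0, 1), repeat=horizon):
        s, prob = s0, 1.0
        states, rewards = [], []
        for a in actions:
            p1 = sigmoid(theta[s])
            prob *= p1 if a == 1 else 1.0 - p1
            states.append(s)
            rewards.append(reward[s][a])
            s = a  # assumed toy transition: next state = action
        total = sum(rewards)
        for t, (s_t, a_t) in enumerate(zip(states, actions)):
            # d/d theta[s_t] of log pi(a_t | s_t) for a Bernoulli-logit policy
            score = a_t - sigmoid(theta[s_t])
            full[s_t] += prob * score * total
            togo[s_t] += prob * score * sum(rewards[t:])
    return full, togo


theta = [0.3, -0.7]                    # arbitrary illustrative logits
reward = [[1.0, -2.0], [0.5, 3.0]]     # reward[s][a], arbitrary values
full, togo = policy_gradients(theta, reward)
print(full, togo)
```

The two vectors agree up to floating-point rounding because each dropped term pairs the score at step t with a reward earned before step t, and the score has zero conditional mean given the history up to that step. The sketch only confirms equality in expectation; the two estimators generally differ on any single sampled trajectory.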