Functional Natural Policy Gradients
arXiv:2603.28681v2
Abstract: We propose a cross-fitted debiasing device for policy learning from offline data. A key consequence of the resulting learning principle is $\sqrt N$ regret even for policy classes with complexi…