Beyond Pessimism: Offline Learning in KL-regularized Games

arXiv:2604.06738v2 (replace-cross)

Abstract: We study offline learning in KL-regularized two-player zero-sum games, where policies are optimized with respect to a fixed reference policy through KL regularization. Prior work relies on pessimistic value estimation to handle distribution shift, yielding only $\widetilde{\mathcal{O}}(1/\sqrt{n})$ statistical rates. We develop a new pessimism-free algorithm and analytical framework for KL-regularized games, built on the smoothness of KL-regularized best responses and a stability property of the Nash equilibrium induced by skew symmetry. This yields, to our knowledge, the first pessimism-free offline learning guarantee for KL-regularized games, with a fast $\widetilde{\mathcal{O}}(1/n)$ sample complexity bound. We further propose an efficient self-play policy optimization algorithm that replaces exact equilibrium computation with iterative KL-regularized policy updates, and prove that its last iterate preserves the same pessimism-free statistical guarantee up to a controlled optimization error.
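To make the setup concrete, the sketch below illustrates (not the paper's algorithm, just the standard construction it builds on) a self-play loop in a two-player zero-sum matrix game where each player repeatedly applies a KL-regularized best response against the other's current strategy. The payoff matrix, reference policies, regularization strength `beta`, and iteration count are all illustrative assumptions.

```python
import numpy as np

# Minimal sketch of self-play with KL-regularized best-response updates
# in a two-player zero-sum matrix game. All quantities below are
# illustrative assumptions, not the paper's exact algorithm or constants.

rng = np.random.default_rng(0)
n_actions = 4
A = rng.standard_normal((n_actions, n_actions))  # row player maximizes x^T A y

beta = 1.0                                        # KL-regularization strength
pi_ref = np.full(n_actions, 1.0 / n_actions)      # fixed reference policy (row)
nu_ref = np.full(n_actions, 1.0 / n_actions)      # fixed reference policy (column)

def kl_regularized_best_response(q_values, ref, beta):
    """Closed-form maximizer of <pi, q> - beta * KL(pi || ref)."""
    logits = np.log(ref) + q_values / beta
    logits -= logits.max()                        # numerical stability
    pi = np.exp(logits)
    return pi / pi.sum()

x, y = pi_ref.copy(), nu_ref.copy()
for t in range(200):
    # Each player responds to the other's current strategy with a
    # KL-regularized best response; iterating this is a simple proxy
    # for computing the regularized equilibrium exactly.
    x_new = kl_regularized_best_response(A @ y, pi_ref, beta)
    y_new = kl_regularized_best_response(-A.T @ x, nu_ref, beta)
    x, y = x_new, y_new

print("row strategy:", np.round(x, 3))
print("col strategy:", np.round(y, 3))
```

The closed form used here, $\pi^\*(a) \propto \pi_{\mathrm{ref}}(a)\exp(q(a)/\beta)$, is the standard KL-regularized best response; it is this map's smoothness in $q$ that the abstract refers to when motivating the pessimism-free analysis.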
