High entropy leads to symmetry equivariant policies in Dec-POMDPs
arXiv:2511.22581v3 Announce Type: replace
Abstract: We prove that in any Dec-POMDP, sufficiently high entropy regularization ensures that the policy gradient flow with tabular softmax parametrization converges, from any initialization, to the same joint policy, and that this joint policy is equivariant with respect to all symmetries of the Dec-POMDP. In particular, policies obtained from different initializations are fully compatible, in that their cross-play returns equal their self-play returns. Through empirical evaluation of independent PPO, arguably the most widely used deep multi-agent policy gradient algorithm, in the Hanabi, Overcooked, and Yokai environments, we find that the entropy coefficient has a strong influence on the cross-play returns between independently trained policies, and that the drop in self-play returns caused by increased entropy regularization can often be counteracted by greedifying the learned policies after training. In Hanabi, this approach achieves near-perfect scores in inter-seed cross-play. Despite its clear limitations, which we point out, both our theoretical and empirical results indicate that hyperparameter sweeps in Dec-POMDPs should consider far higher entropy coefficients than are typically used.
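
A plausible formalization of the regularized objective behind the theorem (our notation; the paper's exact definitions may differ): for joint tabular softmax logits $\theta$, the flow ascends

$J_\tau(\theta) = J(\pi_\theta) + \tau \sum_i \mathbb{E}\big[ H\big(\pi^i_\theta(\cdot \mid o^i)\big) \big], \qquad \dot{\theta}_t = \nabla J_\tau(\theta_t),$

where $J(\pi_\theta)$ is the expected joint return, $H$ is Shannon entropy, and the sum runs over agents $i$ with observations $o^i$. In this reading, the theorem states that for $\tau$ above some environment-dependent threshold, the flow has a unique global attractor, and that attractor is equivariant under every symmetry of the Dec-POMDP.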
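The symmetry-breaking mechanism can be seen in a minimal sketch (our toy example, not an experiment from the paper): a one-step, two-agent common-payoff matrix game in which relabeling the two actions is a symmetry, trained by exact entropy-regularized gradient ascent on tabular softmax logits from several random initializations.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy instance: both agents are rewarded 1 iff their actions
# match, so relabeling the actions (0 <-> 1) is a symmetry of the game,
# and the symmetry-equivariant policy is uniform over both actions.
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def entropy(p):
    return -np.sum(p * np.log(p))  # p stays strictly positive under softmax

def train(tau, steps=5000, lr=0.5):
    """Gradient ascent on J = E[r] + tau * (H(pi1) + H(pi2)) with
    tabular softmax logits and a random initialization."""
    th1, th2 = rng.normal(size=2), rng.normal(size=2)
    for _ in range(steps):
        p1, p2 = softmax(th1), softmax(th2)
        q1, q2 = R @ p2, R.T @ p1  # per-action expected rewards
        # Exact gradients w.r.t. the logits:
        # d/dtheta_a E[r] = p_a (q_a - p.q); d/dtheta_a H = -p_a (log p_a + H).
        th1 += lr * (p1 * (q1 - p1 @ q1) - tau * p1 * (np.log(p1) + entropy(p1)))
        th2 += lr * (p2 * (q2 - p2 @ q2) - tau * p2 * (np.log(p2) + entropy(p2)))
    return softmax(th1), softmax(th2)

for tau in (0.01, 2.0):
    runs = [train(tau) for _ in range(4)]
    self_play = np.mean([p1 @ R @ p2 for p1, p2 in runs])
    cross_play = np.mean([runs[i][0] @ R @ runs[j][1]
                          for i in range(4) for j in range(4) if i != j])
    print(f"tau={tau}: self-play={self_play:.3f}, cross-play={cross_play:.3f}")

With a small entropy coefficient, the runs break the symmetry and commit to incompatible conventions, so cross-play falls well below self-play; with a large coefficient, every run converges to the same uniform, symmetry-equivariant policy and cross-play equals self-play. In this toy game the equivariant policy happens to be uniform, so greedification is degenerate here; the greedification step from the abstract matters in richer environments where the high-entropy solution remains informative.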