Neural Policy Composition from Free Energy Minimization
arXiv:2512.04745v3 Announce Type: replace-cross
Abstract: The ability to flexibly compose previously acquired skills to execute intelligent behaviors is a hallmark of natural intelligence. Such compositional flexibility is often attributed to context-dependent gating mechanisms that determine how multiple policies or behavioral primitives are combined. Yet, despite remarkable efforts, the normative objective from which such gating rules should arise, and the neural computations capable of implementing them, remain unclear. Existing approaches typically rely on prespecified design choices for the gating rules, and remain tied to specific architectures, learning paradigms, or datasets. Here, we introduce a normative framework in which policy composition emerges from the minimization of a variational free energy, providing a principled and broadly applicable objective for gating. Based on this framework, we derive a continuous-time gradient flow whose trajectories are guaranteed to converge, with explicit rate, to the optimal composition of primitives. We further show that this dynamics admits a mechanistic neural implementation as a soft-competitive recurrent circuit with context-sensitive local interactions. We evaluate the model on emerging flocking behaviors in multi-agent systems, human decision-making in bandit tasks, and control benchmarks in layered architectures. Across these settings, the model provides interpretable mechanistic accounts of policy composition, reproduces key behavioral signatures, yields insights into data, and matches or outperforms established models.