cs.AI, cs.LG, math.OC

Finite-Time Analysis of Gradient Descent for Shallow Transformers

arXiv:2601.16514v2 Announce Type: replace
Abstract: Understanding why Transformers perform so well remains challenging due to their non-convex optimization landscape. In this work, we analyze a shallow Transformer with $m$ independent heads trained by…