cs.LG, stat.ML

The Effect of Attention Head Count on Transformer Approximation

arXiv:2510.06662v2 Announce Type: replace
Abstract: The Transformer has become the dominant architecture for sequence modeling, yet a detailed understanding of how its structural parameters influence expressive power remains limited. In this work, we stud…
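To make the structural parameter in question concrete, the following is a minimal NumPy sketch of standard multi-head self-attention in which the head count `num_heads` is the knob the abstract refers to; the model width is split evenly across heads, so varying `num_heads` changes the per-head dimension while keeping the total parameter count fixed. The function names and shapes here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Multi-head self-attention; num_heads is the structural parameter.
    Requires d_model to be divisible by num_heads."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Split each projection into heads: shape (num_heads, seq_len, d_head).
    def split(M):
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    # Scaled dot-product attention, computed per head in parallel.
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ Vh            # (num_heads, seq_len, d_head)
    # Concatenate heads back to (seq_len, d_model) and project out.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d_model, seq_len = 16, 8
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                  for _ in range(4))
X = rng.standard_normal((seq_len, d_model))
# Same input and weight shapes; only the head count varies.
outputs = {h: multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads=h)
           for h in (1, 2, 4)}
for Y in outputs.values():
    assert Y.shape == (seq_len, d_model)
```

Note that because the projections are divided evenly among heads, head count trades off the number of attention patterns against per-head dimension at constant width, which is what makes it a natural axis along which to study approximation power.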