cs.LG

Transformers Learn Latent Mixture Models In-Context via Mirror Descent

arXiv:2604.10848v1 Announce Type: new
Abstract: Sequence modelling requires determining which past tokens in the context are causally relevant and how much each matters: a process inherent to the attention layers of transformers, yet whose underlying lea…
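
To make the title's objects concrete, below is a minimal sketch of entropic mirror descent on the weights of a latent mixture model. With the negative-entropy mirror map, the mirror-descent update on the probability simplex becomes an exponentiated-gradient step, whose softmax form is reminiscent of attention weighting. This is only an illustration under stated assumptions — the Gaussian mixture setup, step size, and all function names are hypothetical, not the paper's construction.

```python
# Illustrative sketch (not the paper's method): entropic mirror descent on
# mixture weights w over the simplex, minimising the negative log-likelihood
# of observations under a fixed-component Gaussian mixture. The multiplicative
# update w <- w * exp(-lr * grad), renormalised, is exponentiated gradient,
# i.e. mirror descent with the negative-entropy mirror map.
import numpy as np

def gaussian_log_lik(x, mus, sigma=1.0):
    """Per-component log-likelihoods of scalar observations x under
    unit-variance Gaussians centred at mus. Shapes: x (n,), mus (k,)."""
    return -0.5 * ((x[:, None] - mus[None, :]) / sigma) ** 2  # (n, k)

def mirror_descent_mixture_weights(x, mus, steps=50, lr=0.5):
    """Entropic mirror descent on mixture weights, keeping components fixed."""
    k = len(mus)
    w = np.full(k, 1.0 / k)                        # uniform initialisation
    comp_lik = np.exp(gaussian_log_lik(x, mus))    # (n, k) component likelihoods
    for _ in range(steps):
        mix = comp_lik @ w                         # (n,) mixture likelihoods
        grad = -(comp_lik / mix[:, None]).mean(0)  # gradient of the NLL wrt w
        w = w * np.exp(-lr * grad)                 # exponentiated-gradient step
        w /= w.sum()                               # renormalise onto the simplex
    return w

rng = np.random.default_rng(0)
mus = np.array([-2.0, 0.0, 2.0])
# Sample mostly near the third component, so its weight should dominate.
x = np.concatenate([rng.normal(2.0, 1.0, 80), rng.normal(-2.0, 1.0, 20)])
print(mirror_descent_mixture_weights(x, mus))
```

Because the exponentiated-gradient iterate stays a softmax of accumulated (negative) gradients, each update re-weights mixture components much as attention re-weights context tokens; that structural analogy, not this specific toy, is what the abstract alludes to.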