Uniform Scaling Limits in AdamW-Trained Transformers
arXiv:2605.11059v1 Announce Type: new
Abstract: We study the large-depth limit of transformers trained with AdamW, by modelling the hidden-state dynamics as an interacting particle system (IPS) coupled through the attention mechanism. Under appropriat…
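To make the interacting-particle reading of the hidden states concrete, the following is a minimal sketch, not the paper's construction. It treats each token's hidden state as a particle and each layer as one explicit Euler step of size 1/depth, with particles coupled only through softmax attention weights. The assumptions here are all illustrative: a single attention head, weights `Q`, `K`, `V` shared across layers and held fixed at random values (so AdamW training is not modelled), a 1/depth residual scaling, and no MLP blocks or layer normalisation. The function names are hypothetical.

```python
import numpy as np


def softmax_rows(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=1, keepdims=True)


def attention_drift(x, Q, K, V):
    """Drift felt by each particle: an attention-weighted average of V x_j.

    x has shape (n_tokens, dim); row i is the hidden state of token i.
    Entry (i, j) of `scores` is <Q x_i, K x_j> / sqrt(dim).
    """
    scores = (x @ Q.T) @ (x @ K.T).T / np.sqrt(x.shape[1])
    return softmax_rows(scores) @ (x @ V.T)


def forward(x0, Q, K, V, depth):
    """Residual stack read as `depth` Euler steps of size 1/depth (assumed scaling)."""
    x = x0.copy()
    for _ in range(depth):
        x = x + attention_drift(x, Q, K, V) / depth
    return x


rng = np.random.default_rng(0)
n_tokens, dim = 8, 4
x0 = rng.normal(size=(n_tokens, dim))
# Illustrative fixed weights; in the paper's setting these would be trained.
Q, K, V = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

out_64 = forward(x0, Q, K, V, depth=64)
out_128 = forward(x0, Q, K, V, depth=128)

# Gap between two depths: a rough indicator of how close the stack is to a limit.
depth_gap = np.max(np.abs(out_64 - out_128))

# Relabelling the tokens relabels the outputs in the same way, which is the
# exchangeability property that makes a particle-system description natural.
perm = rng.permutation(n_tokens)
out_perm = forward(x0[perm], Q, K, V, depth=64)
```

Printing `depth_gap` for increasing depths gives a quick empirical check of whether the forward pass stabilises under this particular scaling. It says nothing about the training dynamics, which are the subject of the abstract.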