cs.LG, math.PR, stat.ML

Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models

arXiv:2604.26898v1 Announce Type: cross
Abstract: We prove pathwise convergence of the layerwise evolution of tokens in a finite-depth, finite-width transformer model with multilayer perceptron (MLP) blocks to a continuous-time stochastic interacting …