Homogenized Transformers
arXiv:2604.01978v1 Announce Type: cross
Abstract: We study a random model of deep multi-head self-attention in which the weights are resampled independently across layers and heads, as at initialization of training. Viewing depth as a time variable, t…
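The random model described — deep multi-head self-attention whose weights are drawn fresh for every layer and head, as at initialization — can be sketched minimally in numpy. This is an illustrative reconstruction, not the paper's code: the Gaussian, Xavier-style weight scale, the absence of residual connections and layer norm, and all function names are assumptions made for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def random_mhsa_layer(X, n_heads, rng):
    """One multi-head self-attention layer with freshly sampled weights.

    X: (n_tokens, d) token matrix. Weights are resampled independently
    for every layer and every head, mimicking the state at initialization.
    """
    n, d = X.shape
    d_h = d // n_heads
    out = np.zeros_like(X)
    for h in range(n_heads):
        # Xavier-style Gaussian scale is an assumption of this sketch.
        Wq = rng.normal(0.0, 1.0 / np.sqrt(d), (d, d_h))
        Wk = rng.normal(0.0, 1.0 / np.sqrt(d), (d, d_h))
        Wv = rng.normal(0.0, 1.0 / np.sqrt(d), (d, d_h))
        Wo = rng.normal(0.0, 1.0 / np.sqrt(d_h), (d_h, d))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(d_h))  # attention over tokens
        out += A @ V @ Wo                    # sum head outputs
    return out

def run_depth(X, depth, n_heads=4, seed=0):
    """Iterate random layers; depth plays the role of a time variable."""
    rng = np.random.default_rng(seed)
    for _ in range(depth):
        X = random_mhsa_layer(X, n_heads, rng)
    return X

tokens = np.random.default_rng(1).normal(size=(8, 16))
Y = run_depth(tokens, depth=12)
print(Y.shape)  # -> (8, 16)
```

Iterating `run_depth` for growing `depth` is one way to probe the depth-as-time viewpoint the abstract alludes to, e.g. by tracking statistics of `Y` as the number of layers increases.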