Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining
arXiv:2605.10504v1 Announce Type: new
Abstract: A causal decoder stack is hierarchical: lower layers build the residual basis that upper layers attend over. We identify a failure mode in GPT pretraining: upper layers commit to sharp attention patterns…
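The abstract describes the failure mode as upper layers committing to sharp attention patterns early in pretraining. A minimal sketch of one way to probe this, assuming a Hugging Face GPT-2 checkpoint and using per-layer attention entropy as an illustrative "sharpness" diagnostic (the model name and the entropy probe are assumptions for illustration, not the paper's own measurement):

```python
# Illustrative probe: low mean attention entropy in upper layers would
# indicate the kind of sharp, specialized attention the abstract describes.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_attentions=True)
model.eval()

text = "Lower layers build the residual basis that upper layers attend over."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple of L tensors, each (batch, heads, seq, seq).
for layer_idx, attn in enumerate(outputs.attentions):
    probs = attn.clamp_min(1e-12)                 # avoid log(0) on masked positions
    entropy = -(probs * probs.log()).sum(dim=-1)  # entropy of each query's attention row
    print(f"layer {layer_idx:2d}: mean attention entropy = {entropy.mean().item():.3f}")
```

Running such a probe over pretraining checkpoints, rather than a single finished model, would be the natural way to see whether upper-layer attention sharpens prematurely relative to lower layers.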