Grokking as Structural Inference: Transformers Need Bayesian Lottery Tickets
arXiv:2605.15787v1 Announce Type: new
Abstract: Why does a Transformer that has memorized its training set wait thousands of steps before it generalizes? Existing accounts locate this delay in norm minimization, feature emergence, or the late discover…