Optimal Attention Temperature Improves the Robustness of In-Context Learning under Distribution Shift in High Dimensions
arXiv:2511.01292v2 Announce Type: replace
Abstract: Pretrained Transformers can perform in-context learning (ICL) from a few demonstrations, but this ability can fail sharply when the test distribution differs from pretraining, a common deployment set…
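The "attention temperature" the abstract refers to is a scalar that rescales the logits inside the attention softmax. A minimal sketch of temperature-scaled dot-product attention is below; the function name, the `temperature` parameter, and the NumPy setup are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def attention(q, k, v, temperature=1.0):
    """Scaled dot-product attention with an extra temperature knob.

    temperature > 1 flattens the softmax (more uniform weighting of
    in-context demonstrations); temperature < 1 sharpens it.
    (Illustrative sketch, not the paper's implementation.)
    """
    d = q.shape[-1]
    scores = q @ k.T / (temperature * np.sqrt(d))
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))   # one query token
k = rng.normal(size=(4, 8))   # four demonstration keys
v = rng.normal(size=(4, 8))

_, w_sharp = attention(q, k, v, temperature=0.5)
_, w_flat = attention(q, k, v, temperature=4.0)

# A higher temperature yields a flatter (higher-entropy) attention
# distribution over the demonstrations.
entropy = lambda p: float(-(p * np.log(p)).sum())
print(entropy(w_sharp[0]) < entropy(w_flat[0]))
```

Tuning this scalar at test time changes how strongly the model concentrates on a few demonstrations versus averaging over all of them, which is the lever the abstract connects to robustness under distribution shift.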