Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models
arXiv:2510.17196v3 Announce Type: replace-cross
Abstract: Effectively processing long contexts is a critical challenge for language models. While standard Transformers are limited by quadratic complexity and poor length extrapolation, alternative arch…
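To make the abstract's complexity claim concrete, below is a minimal, illustrative NumPy sketch contrasting dense self-attention (whose (n, n) score matrix causes the quadratic cost the abstract mentions) with a generic block-local sparse variant. This is not the paper's hierarchical architecture; the function names and the block_size parameter are hypothetical choices for exposition only.

```python
# Illustrative sketch only -- NOT the paper's method.
# Contrasts dense O(n^2) attention with a generic block-local sparse
# attention whose cost is O(n * block_size).
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dense_attention(q, k, v):
    # Full self-attention: materializes an (n, n) score matrix,
    # the source of the quadratic time/memory cost.
    scores = q @ k.T / np.sqrt(q.shape[-1])  # shape (n, n)
    return softmax(scores) @ v

def block_local_attention(q, k, v, block_size=64):
    # Sparse variant: each block attends only within itself.
    # Hierarchical models stack coarser levels on top of such local
    # blocks; that hierarchy is omitted in this sketch.
    n, d = q.shape
    out = np.empty_like(v)
    for s in range(0, n, block_size):
        e = min(s + block_size, n)
        scores = q[s:e] @ k[s:e].T / np.sqrt(d)  # (block, block)
        out[s:e] = softmax(scores) @ v[s:e]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 256, 32
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
    print(dense_attention(q, k, v).shape)        # (256, 32)
    print(block_local_attention(q, k, v).shape)  # (256, 32)
```

The sketch only shows where the asymptotic savings come from; how a hierarchy over such blocks affects length generalization is the subject of the paper itself, whose abstract is truncated above.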