Long Context Pre-Training with Lighthouse Attention
arXiv:2605.06554v1 Announce Type: new
Abstract: Training causal transformers at extreme sequence lengths is bottlenecked by the quadratic time and memory of scaled dot-product attention (SDPA). In this work, we propose Lighthouse Attention, a training…
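To make the bottleneck concrete, here is a minimal NumPy sketch of standard causal scaled dot-product attention, not the paper's proposed method. The function name `sdpa` and the toy shapes are illustrative; the point is that the score matrix is explicitly (n, n), so time and memory grow quadratically with sequence length n.

```python
import numpy as np

def sdpa(Q, K, V):
    """Standard causal scaled dot-product attention.

    The (n, n) score matrix below is the source of the quadratic
    time and memory cost in sequence length n.
    """
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                  # shape (n, n): quadratic
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[mask] = -np.inf                         # causal: hide future tokens
    # numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                             # shape (n, d)

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
out = sdpa(Q, K, V)
print(out.shape)  # (8, 4)
```

Because of the causal mask, position 0 attends only to itself, so the first output row equals `V[0]`; every doubling of n quadruples the size of `scores`, which is exactly the scaling the abstract identifies.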