LinearARD: Linear-Memory Attention Distillation for RoPE Restoration
arXiv:2604.00004v1 Announce Type: new
Abstract: The extension of context windows in Large Language Models is typically achieved by scaling positional encodings followed by lightweight Continual Pre-Training (CPT). While effective for processing lon…
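As background to the context-extension approach the abstract describes, the sketch below illustrates the standard technique of scaling RoPE positions (position interpolation) so that a longer sequence is mapped back into the trained position range. This is not the paper's LinearARD method; the function names and the `scale` parameter are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's method): RoPE with a position
# interpolation scale factor, the common way context windows are extended
# before lightweight continual pre-training.
import torch


def rope_angles(seq_len: int, head_dim: int, base: float = 10000.0,
                scale: float = 1.0) -> tuple[torch.Tensor, torch.Tensor]:
    """Return cos/sin tables for RoPE; scale > 1 interpolates positions
    so an extended sequence reuses the original trained position range."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # Position interpolation: divide positions by the context-extension factor.
    positions = torch.arange(seq_len).float() / scale
    angles = torch.outer(positions, inv_freq)   # (seq_len, head_dim / 2)
    return angles.cos(), angles.sin()


def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Rotate query/key vectors of shape (..., seq_len, head_dim)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


if __name__ == "__main__":
    q = torch.randn(1, 8, 4096, 64)  # (batch, heads, seq_len, head_dim)
    # A hypothetical 4x context extension: positions are compressed by 4.
    cos, sin = rope_angles(seq_len=4096, head_dim=64, scale=4.0)
    print(apply_rope(q, cos, sin).shape)  # torch.Size([1, 8, 4096, 64])
```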