BOSCH: Black-Box Binary Optimization for Short-Context Attention-Head Selection in LLMs
arXiv:2604.05942v1 Announce Type: new
Abstract: Post-training hybridization of large language models (LLMs) often replaces quadratic self-attention with sliding-window attention (SWA) to reduce KV cache usage and improve latency. Existing hybridizatio…
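The abstract contrasts quadratic self-attention with sliding-window attention (SWA) as a way to cap KV cache usage. As a rough, hypothetical illustration (not the paper's method; the function names, window size, and single-head setup are assumptions for clarity), the sketch below compares a full causal mask with a sliding-window causal mask for one attention head, which is why a head converted to SWA only needs to cache the most recent `window` keys and values.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Full causal mask: each query attends to all earlier positions (quadratic attention, full KV cache)."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Sliding-window causal mask: each query attends only to the last `window` positions,
    so this head's KV cache can be truncated to `window` entries."""
    idx = torch.arange(seq_len)
    dist = idx[:, None] - idx[None, :]  # query index minus key index
    return (dist >= 0) & (dist < window)

def attention(q, k, v, mask):
    """Scaled dot-product attention with a boolean mask (True = attend)."""
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

if __name__ == "__main__":
    torch.manual_seed(0)
    seq_len, d = 16, 8
    q, k, v = (torch.randn(seq_len, d) for _ in range(3))

    full_out = attention(q, k, v, causal_mask(seq_len))                    # head kept as full attention
    swa_out = attention(q, k, v, sliding_window_mask(seq_len, window=4))   # head converted to SWA
    print((full_out - swa_out).abs().max(dim=-1).values)                   # per-position deviation from full attention
```

Selecting which heads to convert then amounts to a binary choice per head (full attention vs. SWA), trading off the deviation illustrated above against KV cache and latency savings; how that selection is optimized is the subject of the paper.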