cs.AI, cs.CL

WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models

arXiv:2604.08558v1 Announce Type: new
Abstract: Recent decoder-only autoregressive text-to-speech (AR-TTS) models produce high-fidelity speech, but their memory and compute costs scale quadratically with sequence length due to full self-attention. In …
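The truncated abstract does not specify WAND's exact windowing scheme, but the contrast it draws is between full causal self-attention (quadratic in sequence length) and attention restricted to a local window of recent tokens. The sketch below illustrates that generic idea, not the paper's method: each query attends only to itself and the previous `window - 1` positions. The function name `windowed_causal_attention` and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def windowed_causal_attention(q, k, v, window: int):
    """Causal self-attention restricted to a sliding window of past tokens.

    q, k, v: (batch, heads, seq_len, head_dim)
    window:  how many positions each query may attend to, counting itself.
    NOTE: illustrative sketch, not the WAND implementation.
    """
    seq_len = q.size(-2)
    scale = q.size(-1) ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale  # (B, H, T, T)

    # Position j is visible from position i iff 0 <= i - j < window,
    # i.e. causal AND within the sliding window.
    i = torch.arange(seq_len).unsqueeze(1)  # (T, 1)
    j = torch.arange(seq_len).unsqueeze(0)  # (1, T)
    allowed = (i - j >= 0) & (i - j < window)

    scores = scores.masked_fill(~allowed, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v)

# Example: 2 heads, 16 tokens, window of 4 positions.
q = k = v = torch.randn(1, 2, 16, 8)
out = windowed_causal_attention(q, k, v, window=4)
print(out.shape)  # torch.Size([1, 2, 16, 8])
```

Note that this masked formulation still materializes the full T x T score matrix, so it only demonstrates the attention pattern; the memory savings the abstract alludes to come from kernels that compute attention block-locally and never form scores outside the window.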