Introducing FlashQLA: high-performance linear attention kernels built on TileLang. 2–3× forward speedup. 2× backward speedup. 💻 Purpose-built for agentic AI on your personal devices. Key insights:
- FlashQLA boosts SM utilization via automatic intra-device CP. The gains are especially pronounced for TP setups, small models, and long-context workloads (a minimal chunk-wise sketch of the idea follows at the end of this post).
- Instead of fusing the entire GDN flow into a single kernel, we split it into two kernels optimized for CP and backward efficiency. At large batch sizes this incurs extra memory I/O versus a fully fused approach, but it delivers better real-world performance on edge devices and for long-context workloads.
- The backward pass was the hardest part: we built a 16-stage warp-specialized pipeline under extremely tight on-chip memory constraints, ultimately achieving 2×+ kernel-level speedups.

We hope this is useful to the community! Learn more:
📖 Blog: https://qwen.ai/blog?id=flashqla
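For readers unfamiliar with chunk-wise linear attention, here is a rough, illustrative PyTorch sketch. This is not FlashQLA's code, not TileLang, and it omits the gating in GDN; the function and parameter names are ours for illustration. The point it shows is why chunking exposes parallelism: the intra-chunk work is independent for every chunk and can be spread across SMs, while only a tiny per-chunk state update stays sequential.

```python
import torch

def chunked_linear_attention(q, k, v, chunk=64):
    """Illustrative chunk-wise (non-gated) linear attention:
    o[t] = sum_{s <= t} (q[t] . k[s]) * v[s].

    The intra-chunk term is independent per chunk (parallel across SMs);
    only the small D x D state carried between chunks is sequential.
    """
    B, H, T, D = q.shape
    assert T % chunk == 0, "illustrative sketch assumes T divisible by chunk"
    nc = T // chunk
    q = q.view(B, H, nc, chunk, D)
    k = k.view(B, H, nc, chunk, D)
    v = v.view(B, H, nc, chunk, D)

    # Intra-chunk causal part: fully parallel across chunks.
    mask = torch.tril(torch.ones(chunk, chunk, dtype=torch.bool, device=q.device))
    scores = (q @ k.transpose(-1, -2)).masked_fill(~mask, 0.0)
    o = scores @ v

    # Inter-chunk part: carry a D x D running state across chunks (sequential, but tiny).
    state = torch.zeros(B, H, D, D, dtype=q.dtype, device=q.device)
    for c in range(nc):
        o[:, :, c] += q[:, :, c] @ state
        state = state + k[:, :, c].transpose(-1, -2) @ v[:, :, c]

    return o.reshape(B, H, T, D)
```

With chunk = 64, each of the T/64 intra-chunk blocks is an independent tile of work, which is the kind of parallelism an intra-device CP schedule can map onto otherwise idle SMs; the sequential part touches only a D × D state per chunk.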