cs.AI, cs.LG

Diagonal-Tiled Mixed-Precision Attention for Efficient Low-Bit MXFP Inference

arXiv:2604.03950v1 Announce Type: cross
Abstract: Transformer-based large language models (LLMs) have demonstrated remarkable performance across a wide range of real-world tasks, but their inference cost remains prohibitively high due to the quadratic…