cs.LG

Projection-Free Transformers via Gaussian Kernel Attention

arXiv:2605.02144v1 Announce Type: new
Abstract: Self-attention in Transformers is typically implemented as $\mathrm{softmax}(QK^\top/\sqrt{d})V$, where $Q=XW_Q$, $K=XW_K$, and $V=XW_V$ are learned linear projections of the input $X$. We ask whether th…
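To make the baseline concrete, below is a minimal NumPy sketch of the standard self-attention defined in the abstract, alongside one plausible projection-free reading of the title. Because the abstract is truncated before the method is described, `gaussian_kernel_attention`, its Gaussian-kernel weighting of the raw inputs, and the bandwidth parameter `sigma` are illustrative assumptions, not the paper's actual construction.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(X, W_Q, W_K, W_V):
    """Standard self-attention from the abstract: softmax(QK^T / sqrt(d)) V,
    with Q = X W_Q, K = X W_K, V = X W_V learned linear projections of X."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # pairwise dot-product similarities
    return softmax(scores, axis=-1) @ V

def gaussian_kernel_attention(X, sigma=1.0):
    """Hypothetical projection-free variant (an assumption based on the title):
    attention weights come from a Gaussian kernel on the raw inputs,
    softmax(-||x_i - x_j||^2 / (2 sigma^2)), applied directly to X with no
    W_Q, W_K, or W_V. `sigma` is an assumed bandwidth hyperparameter."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    weights = softmax(-sq_dists / (2.0 * sigma**2), axis=-1)
    return weights @ X

# Example usage on random data: 5 tokens, model dimension 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W = [rng.normal(size=(8, 8)) for _ in range(3)]
out_std = standard_attention(X, *W)          # shape (5, 8)
out_gk = gaussian_kernel_attention(X)        # shape (5, 8)
```

Note that a row-wise softmax over negative squared distances is exactly a row-normalized Gaussian kernel, so the variant above contains no learned projection matrices at all; whether the paper keeps a value projection or other learned components cannot be determined from the truncated abstract.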