cs.CL, cs.LG

Rethinking Attention Output Projection: Structured Hadamard Transforms for Efficient Transformers

arXiv:2603.08343v2 Announce Type: replace
Abstract: The dense output projection in multi-head attention scales quadratically with model dimension, contributing significantly to parameter count, memory footprint, and inference cost. We propose replacin…
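
The truncated abstract, together with the title, points at replacing the O(d^2) dense output projection W_O with an O(d log d) structured Hadamard transform. Below is a minimal PyTorch sketch of that general idea, assuming a diagonal-Hadamard-diagonal parameterization, a common structured-matrix family; the paper's exact factorization is not visible here, and the names fwht and HadamardProjection are illustrative, not the authors' API.

    import torch
    import torch.nn as nn

    def fwht(x: torch.Tensor) -> torch.Tensor:
        # Fast Walsh-Hadamard transform over the last dimension.
        # Requires a power-of-two size; costs O(d log d) vs O(d^2)
        # for a dense matrix multiply.
        d = x.shape[-1]
        assert d & (d - 1) == 0, "last dim must be a power of two"
        y = x.contiguous().clone()
        h = 1
        while h < d:
            # Split each block of 2h elements into two halves and
            # apply the butterfly (a + b, a - b).
            y = y.reshape(*x.shape[:-1], d // (2 * h), 2, h)
            a, b = y[..., 0, :], y[..., 1, :]
            y = torch.stack((a + b, a - b), dim=-2)
            h *= 2
        # Orthonormal scaling so the transform preserves norms.
        return y.reshape(x.shape) / d ** 0.5

    class HadamardProjection(nn.Module):
        # Hypothetical drop-in for the dense W_O: parameterizes the
        # projection as D2 . H . D1 (learned diagonal, fixed Hadamard,
        # learned diagonal). The paper's actual parameterization may
        # differ.
        def __init__(self, d_model: int):
            super().__init__()
            self.d1 = nn.Parameter(torch.ones(d_model))
            self.d2 = nn.Parameter(torch.ones(d_model))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.d2 * fwht(self.d1 * x)

    # Usage: applied per token after concatenating the heads.
    x = torch.randn(2, 16, 512)          # (batch, seq, d_model)
    out = HadamardProjection(512)(x)     # same shape as x

Applied this way, the output projection drops from d_model^2 learned parameters to 2 * d_model, at the cost of a fixed (non-learned) mixing pattern supplied by the Hadamard transform.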