cs.AI, cs.LG, stat.ML

Revisiting Transformer Layer Parameterization Through Causal Energy Minimization

arXiv:2605.07588v1 Announce Type: cross
Abstract: Transformer blocks typically combine multi-head attention (MHA) for token mixing with gated MLPs for token-wise feature transformation, yet many choices in their parameterization remain largely empirical…
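
For context, below is a minimal sketch of the block structure the abstract describes: MHA mixes information across tokens, and a gated MLP transforms each token independently. The pre-norm residual layout, SwiGLU-style SiLU gating, and all dimensions are assumptions for illustration only; the paper's causal-energy-minimization parameterization is not reproduced here, since the abstract is truncated.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMLP(nn.Module):
    """Token-wise gated feed-forward (SwiGLU-style gating is an assumption)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.up_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.down_proj = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Elementwise gate applied to the "up" branch, then projected back down.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

class TransformerBlock(nn.Module):
    """Pre-norm block: MHA for token mixing, gated MLP for per-token features."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_hidden: int = 2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = GatedMLP(d_model, d_hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                 # token mixing (residual branch)
        x = x + self.mlp(self.norm2(x))  # token-wise transformation (residual branch)
        return x

# Usage: a batch of 2 sequences, 16 tokens each, model width 512.
block = TransformerBlock()
y = block(torch.randn(2, 16, 512))
print(y.shape)  # torch.Size([2, 16, 512])
```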