Lower bounds for one-layer transformers that compute parity
arXiv:2605.12171v1 Announce Type: new
Abstract: This note shows that no self-attention layer post-processed by a rational function can sign-represent the parity function unless the product of the number of heads and the degree of the post-processing f…
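For reference, a minimal sketch of the standard definitions behind the claim (not taken from the truncated abstract; the symbols H for the number of heads and d for the degree of the rational post-processing are assumed names for the quantities the abstract mentions):

% Parity on n bits, in the +/-1 convention.
\[
  \mathrm{PAR}_n(x) \;=\; \prod_{i=1}^{n} x_i, \qquad x \in \{-1,+1\}^n .
\]
% A real-valued model f sign-represents parity if its sign agrees with
% PAR_n on every input:
\[
  \operatorname{sign}\bigl(f(x)\bigr) \;=\; \mathrm{PAR}_n(x)
  \quad \text{for all } x \in \{-1,+1\}^n .
\]
% The abstract's claim has the shape: if f is an H-head self-attention
% layer followed by a degree-d rational function, then f cannot
% sign-represent PAR_n unless the product H * d is large; the exact
% threshold is cut off in the truncated abstract above.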