Dynamic sparsity in tree-structured feed-forward layers at scale
arXiv:2604.08565v1 Announce Type: cross
Abstract: At typical context lengths, the feed-forward MLP block accounts for a large share of a transformer’s compute budget, motivating sparse alternatives to dense MLP blocks. We study sparse, tree-structured…
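As a rough illustration of the compute-share claim in the abstract, the sketch below counts per-token, per-layer forward FLOPs for a standard transformer layer. It is not taken from the paper. It assumes a non-gated MLP with a 4x hidden expansion, full (unmasked) multi-head attention, and 2 FLOPs per multiply-add, and it ignores softmax, normalization, and embedding costs. The function name `mlp_flop_share` and the sizes used (`d_model = 4096`, the listed context lengths) are hypothetical choices made for this example.

```python
def mlp_flop_share(d_model: int, n_ctx: int, expansion: int = 4) -> float:
    """Approximate fraction of per-token, per-layer forward FLOPs spent in the MLP.

    Assumptions (illustrative, not from the paper): non-gated MLP, full
    attention over n_ctx positions, 2 FLOPs per multiply-add.
    """
    # Q, K, V and output projections: four d_model x d_model matmuls.
    attn_proj = 4 * 2 * d_model * d_model
    # Attention scores (QK^T) plus the weighted sum over values.
    attn_mix = 2 * 2 * n_ctx * d_model
    # Up- and down-projection with hidden width expansion * d_model.
    mlp = 2 * 2 * d_model * expansion * d_model
    return mlp / (attn_proj + attn_mix + mlp)


if __name__ == "__main__":
    for n_ctx in (2048, 8192, 32768):
        share = mlp_flop_share(d_model=4096, n_ctx=n_ctx)
        print(f"n_ctx={n_ctx:>6}  MLP share = {share:.3f}")
```

Under these assumptions the share simplifies to 16d / (24d + 4n) for model width d and context length n, so the MLP dominates while n is small relative to d and its share shrinks as the context grows.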