cs.AI, cs.CL, cs.LG

Dynamic sparsity in tree-structured feed-forward layers at scale

arXiv:2604.08565v1 Announce Type: cross
Abstract: At typical context lengths, the feed-forward MLP block accounts for a large share of a transformer’s compute budget, motivating sparse alternatives to dense MLP blocks. We study sparse, tree-structured…
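The abstract's opening claim about the MLP block's share of compute can be sanity-checked with a back-of-the-envelope FLOP count. The sketch below is not taken from the paper: it assumes a standard transformer layer with a 4x MLP expansion and full (non-causal) attention over the context, counts 2 FLOPs per multiply-add, and ignores softmax, normalization, and embedding costs. The function name `mlp_flop_share` and the model sizes are illustrative.

```python
def mlp_flop_share(d_model: int, context_len: int, expansion: int = 4) -> float:
    """Approximate per-token, per-layer fraction of forward FLOPs spent in the MLP.

    Assumptions (not from the paper): dense two-matrix MLP, full attention
    over `context_len` positions, 2 FLOPs per multiply-add.
    """
    # Q, K, V and output projections: four d_model x d_model matrices.
    attn_proj = 2 * 4 * d_model * d_model
    # Attention scores (QK^T) plus the weighted sum over values:
    # each costs about 2 * context_len * d_model FLOPs per token.
    attn_mix = 2 * 2 * context_len * d_model
    # Two MLP matrices of shape d_model x (expansion * d_model).
    mlp = 2 * 2 * expansion * d_model * d_model
    return mlp / (attn_proj + attn_mix + mlp)


if __name__ == "__main__":
    # Hypothetical model width; context lengths chosen for illustration only.
    for n in (2048, 8192, 32768):
        print(n, round(mlp_flop_share(4096, n), 3))
```

Under these assumptions the MLP share depends only on the ratio of context length to model width: it is exactly one half when the context is twice the width (for example 8192 tokens at width 4096), larger for shorter contexts, and smaller once the context-dependent attention term dominates.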