Revisiting Auxiliary Losses for Conditional Depth Routing: An Empirical Study
arXiv:2604.17228v1 Announce Type: new
Abstract: At each controlled layer, conditional depth execution routes a subset of tokens through a lightweight cheap FFN, while the remaining tokens execute the standard full FFN. The central difficulty is gate training: …
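To make the routing described in the abstract concrete, the following is a minimal sketch of a single controlled layer, not the paper's implementation. The abstract does not specify the gate, so everything here is an assumption made for illustration: the hard threshold on a sigmoid gate score, the ReLU activation, the residual connection, the layer sizes, and the hypothetical names `route_layer`, `ffn`, `linear`, and `rand_matrix`. Gate training and the auxiliary losses are deliberately left out.

```python
import math
import random


def linear(x, w):
    """Multiply vector x (length d_in) by matrix w (d_in rows, d_out columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def ffn(x, w_in, w_out):
    """Two-layer feed-forward block with ReLU and a residual connection (assumed form)."""
    hidden = [max(0.0, h) for h in linear(x, w_in)]
    return [xi + yi for xi, yi in zip(x, linear(hidden, w_out))]


def rand_matrix(rng, rows, cols, scale=0.1):
    """Small random weight matrix, used only to make the sketch runnable."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def route_layer(tokens, gate_w, full, cheap, threshold=0.5):
    """Send each token to the full or the cheap FFN based on a sigmoid gate score.

    The hard threshold is an assumption; the abstract does not say how the
    gate decision is made or trained.
    """
    outputs, used_full = [], []
    for x in tokens:
        logit = sum(xi * wi for xi, wi in zip(x, gate_w))
        score = 1.0 / (1.0 + math.exp(-logit))
        take_full = score >= threshold
        w_in, w_out = full if take_full else cheap
        outputs.append(ffn(x, w_in, w_out))
        used_full.append(take_full)
    return outputs, used_full


rng = random.Random(0)
d_model, d_full, d_cheap = 8, 32, 4  # illustrative sizes: the cheap FFN has a narrower hidden layer
full = (rand_matrix(rng, d_model, d_full), rand_matrix(rng, d_full, d_model))
cheap = (rand_matrix(rng, d_model, d_cheap), rand_matrix(rng, d_cheap, d_model))
gate_w = [rng.uniform(-1, 1) for _ in range(d_model)]
tokens = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(6)]

outputs, used_full = route_layer(tokens, gate_w, full, cheap)
print("tokens sent to the full FFN:", sum(used_full), "of", len(tokens))
```

In this sketch, both paths return a vector of the same width, so the layer output has the same shape whichever path a token takes. Only the hidden width, and therefore the per-token compute, differs between the two paths.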