MachineLearning

Why SSMs struggle in parameter-constrained training: empirical findings at 25M parameters [R]

After ~3 weeks of experimentation in OpenAI's Parameter Golf competition, I wrote up why SSMs (state space models) are structurally disadvantaged relative to transformers in a time- and size-constrained regime: 10 minutes of training on 8×H100s, a 16 MB artifact, and 25M parameters…