A 15B-parameter token-mixer supernet with 8 optimized deployment presets spanning 1.0× to 10.7× decode throughput at 32K sequence length, all from a single checkpoint. Derived from Apriel-1.6 through stochastic distillation and targeted supervised fine-tuning.
- Model Size: 15B parameters
- Layers: 48 decoder layers, each with 4 mixer variants (see the placement sketch below)
- Context Length: 262K positions (runtime dependent)
- Languages: English (best performance)
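
To make the supernet structure concrete, here is a minimal sketch of a placement as one mixer choice per decoder layer. The helper and example presets are hypothetical illustrations, not the released API; the mixer abbreviations are the four listed under Highlights below.

```python
# Minimal sketch of a per-layer mixer placement. The mixer names come from the
# model card; the helper and example presets are illustrative assumptions.
MIXERS = ("FA", "SWA", "GDN", "KDA")  # Full Attention, Sliding Window Attention,
                                      # Gated DeltaNet, Kimi Delta Attention
NUM_LAYERS = 48

def validate_placement(placement: list[str]) -> None:
    """Check that a preset assigns exactly one supported mixer to each layer."""
    if len(placement) != NUM_LAYERS:
        raise ValueError(f"expected {NUM_LAYERS} entries, got {len(placement)}")
    unknown = sorted({m for m in placement if m not in MIXERS})
    if unknown:
        raise ValueError(f"unknown mixer types: {unknown}")

# Hypothetical placements: an all-attention quality anchor vs. a faster mix
# that keeps full attention in every eighth layer.
all_attention = ["FA"] * NUM_LAYERS
mixed_example = ["FA" if i % 8 == 7 else "GDN" for i in range(NUM_LAYERS)]

validate_placement(all_attention)
validate_placement(mixed_example)
print(f"design space: {len(MIXERS) ** NUM_LAYERS:.2e} possible placements")  # 4^48
```

Each shipped preset is one such placement over the shared supernet weights, which is what lets a single checkpoint serve multiple operating points.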
Highlights
- Flexible deployment from a single checkpoint: multiple presets trading off throughput against quality
- Four mixer types per layer: Full Attention (FA), Sliding Window Attention (SWA), Gated DeltaNet (GDN), Kimi Delta Attention (KDA)
- Instruction-tuned: targeted SFT applied across multiple Pareto-optimal mixer placements
- Speculative decoding support: use the all-attention placement as the target and efficient placements as drafts, all from the same checkpoint (sketched below)
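
A minimal sketch of that speculative-decoding pattern, assuming greedy verification and toy stand-ins for the two models: in the setup described above, `draft_next` and `target_next` would be the same checkpoint loaded with an efficient placement and the all-attention placement, respectively. The function names and the simplified accept/reject rule are illustrative, not the released implementation.

```python
# Greedy speculative decoding sketch. `draft_next`/`target_next` are toy
# stand-ins for the "efficient placement" and "all-attention" models; both
# would come from the same checkpoint in the setup described above.
from typing import Callable, List

def speculative_decode(
    prompt: List[int],
    draft_next: Callable[[List[int]], int],   # fast draft: greedy next token
    target_next: Callable[[List[int]], int],  # strong target: greedy next token
    k: int = 4,
    max_new_tokens: int = 16,
) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) The draft proposes k tokens autoregressively (cheap).
        ctx, proposal = list(tokens), []
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) The target verifies the proposal. Shown as k sequential calls for
        #    clarity; a real implementation scores all k positions in a single
        #    forward pass, which is where the speedup comes from.
        for t in proposal:
            expected = target_next(tokens)
            if expected == t:
                tokens.append(t)         # draft token accepted for free
            else:
                tokens.append(expected)  # first mismatch: take target's token
                break
    return tokens[: len(prompt) + max_new_tokens]

# Toy demo: the draft repeats the last token while the target counts upward,
# so each round accepts exactly one target token.
print(speculative_decode([1, 2, 3],
                         draft_next=lambda ctx: ctx[-1],
                         target_next=lambda ctx: ctx[-1] + 1))
```

The appeal of drafting from the same checkpoint is that the efficient placements should often agree with the all-attention target, which is the property that makes the accepted-prefix trick pay off.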