REAP-pruned Nemotron-3-Super (512 -> 256 experts) + GRPO fine-tune + FP8/AWQ. AIME 2026 90%+. Benchmark inside.

Hey r/LocalLLaMA,

Dropping a release I've been working on during AIMO3 (the Kaggle competition). I took NVIDIA's Nemotron-3-Super-120B-A12B (latent MoE + Mamba2 hybrid), REAP-pruned it from 512 -> 256 experts (and removed the MTP layer), LoRA-RL fine-tuned it with GRPO on ~270 AIMO3 + AstralMath problems, then quantized it to AWQ and FP8 for inference.
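For anyone new to expert pruning, here's a toy sketch of the general idea: score each expert on calibration data and keep only the top half. The saliency criterion below (mean routed probability) is a deliberately simple stand-in, not REAP's actual criterion:

```python
import numpy as np

def prune_experts(router_logits, num_keep):
    """Rank experts by a toy saliency proxy (mean routed probability over a
    calibration set) and keep the top `num_keep`. Illustrates the shape of
    512 -> 256 expert pruning only; REAP's real criterion is more involved."""
    # router_logits: (num_tokens, num_experts) collected on calibration data
    logits = router_logits - router_logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    saliency = probs.mean(axis=0)              # average gate mass per expert
    keep = np.argsort(saliency)[-num_keep:]    # indices of retained experts
    return np.sort(keep)

# toy example: 8 experts -> keep 4
rng = np.random.default_rng(0)
kept = prune_experts(rng.normal(size=(1000, 8)), num_keep=4)
print(len(kept))  # 4
```

After pruning, the router's weight rows for dropped experts are deleted and the retained expert indices are remapped, which is where the model-size reduction comes from.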

Result: 120B -> 64B, runs on a single H100/RTX PRO 6000 Blackwell at 90%+ on AIME 2026.

Models

AIME 2026 (30 problems, avg of 4 attempts, system-role prompt)

| Variant | avg@4 | pass@4 | Tool use |
|---|---|---|---|
| 120B base model (MathArena leaderboard) | 0.9000 | n/a | no |
| Our AWQ | 0.9083 | 0.9333 | no |
| Our FP8 | 0.9167 | 0.9667 | no |
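For reference, a minimal sketch of how the two metrics in the table are computed from per-problem attempt results:

```python
def avg_at_k(correct):
    """avg@k: mean accuracy across all k attempts, averaged over problems."""
    return sum(sum(c) / len(c) for c in correct) / len(correct)

def pass_at_k(correct):
    """pass@k: fraction of problems solved in at least one of k attempts."""
    return sum(any(c) for c in correct) / len(correct)

# toy run: 3 problems x 4 attempts (True = correct final answer)
runs = [[True, True, True, True],
        [True, False, True, True],
        [False, False, False, True]]
print(avg_at_k(runs))   # ~0.6667
print(pass_at_k(runs))  # 1.0
```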

Although the benchmark was run without tools, the model is also good at Python tool-integrated reasoning!

AWQ vs FP8 trade-off

FP8 has ~40% lower tokens/s throughput than AWQ but wins on quality: one more problem solved at pass@4 and better numerics on the hardest problem. FP8 also converges to answers faster, which partially offsets the throughput hit.
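As a toy illustration with made-up numbers (not our measured figures), effective time-to-answer is tokens-needed divided by throughput, so needing fewer tokens claws back part of the throughput gap:

```python
# Hypothetical numbers for illustration only, not benchmark results.
awq_tps, fp8_tps = 100.0, 60.0        # tokens/s (~40% lower for FP8)
awq_tokens, fp8_tokens = 12000, 9000  # tokens generated before a final answer

print(awq_tokens / awq_tps)  # 120.0 seconds to answer
print(fp8_tokens / fp8_tps)  # 150.0 seconds to answer
```

With equal token counts FP8 would take 200 s here; faster convergence narrows that to 150 s, i.e. a partial (not full) offset.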

vLLM patch needed

vLLM's fused `grouped_topk` CUDA kernel crashes with an illegal memory access when experts_per_group > 128 (our pruned model has 256 experts with n_group=1). The repo includes a small patch that skips the fused kernel in that case.
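For context, with n_group=1 grouped top-k degenerates to a plain top-k over all experts, so an unfused fallback is straightforward. A NumPy sketch of the equivalent routing math (function name hypothetical; the actual patch is in the repo and uses torch ops):

```python
import numpy as np

def grouped_topk_fallback(gating_logits, top_k, n_group=1):
    """Unfused grouped top-k routing. With n_group=1 (our pruned model) this
    is a plain top-k over all experts, avoiding the fused CUDA kernel that
    crashes when experts_per_group > 128. Illustrative helper, not vLLM API."""
    num_tokens, num_experts = gating_logits.shape
    assert num_experts % n_group == 0
    # softmax over experts, then select top_k per token
    logits = gating_logits - gating_logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    idx = np.argsort(-probs, axis=-1)[:, :top_k]       # chosen expert indices
    weights = np.take_along_axis(probs, idx, axis=-1)
    weights /= weights.sum(axis=-1, keepdims=True)     # renormalize over top_k
    return weights, idx

w, idx = grouped_topk_fallback(
    np.random.default_rng(1).normal(size=(4, 256)), top_k=8)
print(idx.shape)  # (4, 8)
```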

Links

Hardware: 1× RTX PRO 6000 Blackwell, vLLM 0.19.1.

Happy to answer questions on the pipeline (REAP -> GRPO -> AWQ/FP8).

submitted by /u/max6296
