Hey r/LocalLLaMA,
Dropping a release I've been working on during AIMO3 (the Kaggle competition). I took NVIDIA's Nemotron-3-Super-120B-A12B (a latent-MoE + Mamba2 hybrid), REAP-pruned it from 512 -> 256 experts (and removed the MTP layer), fine-tuned it with GRPO (LoRA-based RL) on ~270 AIMO3 + AstralMath problems, then quantized it to AWQ and FP8 for inference.
Result: 120B -> 64B total parameters, runs on a single H100 / RTX PRO 6000 Blackwell, and scores 90%+ on AIME 2026.
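For context on the GRPO step: the core idea is group-relative advantages — sample several completions per problem, score them, and normalize each reward against its own group's mean/std, so no learned value model is needed. A minimal sketch in plain Python (function and variable names are mine, not from the actual training code):

```python
import math

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each sampled completion's
    reward against the mean/std of its own group (one group = one problem)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# 4 sampled solutions to one math problem, reward 1.0 if the final answer is correct
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

With a binary correctness reward like this, correct samples get positive advantage and incorrect ones negative, which is all the policy-gradient update needs.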
Models
- BF16 (full weights, ~129GB VRAM): Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16
- FP8 dynamic (W8A8, ~72GB VRAM): Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-FP8
- AWQ (W4A16, ~43GB VRAM): Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-AWQ
AIME 2026 (30 problems, avg of 4 attempts, system-role prompt)
| Variant | avg@4 | pass@4 | tool use |
|---|---|---|---|
| 120B Base model (MathArena leaderboard) | 0.9000 | n/a | no |
| Our AWQ | 0.9083 | 0.9333 | no |
| Our FP8 | 0.9167 | 0.9667 | no |
Although the benchmark was run without tools, the model is also good at Python tool-integrated reasoning!
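For clarity on the metrics: avg@4 is the mean per-attempt accuracy over 4 runs, while pass@4 counts a problem as solved if any of the 4 attempts gets it. A quick sketch (my own helpers, not from the benchmark repo):

```python
def avg_at_k(results):
    """results: one list of booleans per problem, one entry per attempt."""
    return sum(sum(attempts) / len(attempts) for attempts in results) / len(results)

def pass_at_k(results):
    """Fraction of problems where at least one attempt succeeded."""
    return sum(any(attempts) for attempts in results) / len(results)

# toy example: 3 problems, 4 attempts each
runs = [
    [True, True, True, True],     # always solved
    [True, False, True, False],   # solved half the time
    [False, False, False, False], # never solved
]
score_avg = avg_at_k(runs)    # (1.0 + 0.5 + 0.0) / 3 = 0.5
score_pass = pass_at_k(runs)  # 2 of 3 problems solved at least once
```

This is why pass@4 is always >= avg@4 in the table above: any problem solved even once counts fully toward pass@4.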
AWQ vs FP8 trade-off
FP8 has ~40% lower tokens/s throughput than AWQ, but wins on quality (+1 problem cracked at pass@4, and better numerics on the hardest problem). FP8 also converges to answers in fewer tokens, which partially offsets the throughput hit.
vLLM patch needed
vLLM's fused `grouped_topk` CUDA kernel crashes with an illegal memory access when experts_per_group > 128 (after pruning, our model has 256 experts with n_group=1). The repo includes a small patch that skips the fused kernel and falls back to the unfused path in that case.
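The fallback is conceptually just grouped top-k routing in plain, unfused ops. A pure-Python sketch of what the routing computes (names are illustrative, not vLLM's actual code): rank groups by their best expert score, mask out experts in losing groups, then take a global top-k. With n_group=1, as in the pruned model, it degenerates to an ordinary top-k over all 256 experts:

```python
def grouped_topk(scores, n_group, topk_group, topk):
    """Grouped expert routing: keep the best `topk_group` groups
    (ranked by their max expert score), mask the rest, then take
    the global top-k expert indices."""
    n_experts = len(scores)
    group_size = n_experts // n_group
    groups = [scores[g * group_size:(g + 1) * group_size] for g in range(n_group)]
    best = sorted(range(n_group), key=lambda g: max(groups[g]), reverse=True)[:topk_group]
    masked = [s if (i // group_size) in best else float("-inf")
              for i, s in enumerate(scores)]
    return sorted(range(n_experts), key=lambda i: masked[i], reverse=True)[:topk]

# n_group=1: plain top-k over all experts
picked = grouped_topk([0.1, 0.9, 0.5, 0.3], n_group=1, topk_group=1, topk=2)
```

The crash only affects the fused CUDA version of this computation, so routing results are unchanged by the patch, just slower on that one op.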
Links
- Benchmark repo: https://github.com/madmax0404/nemotron-3-super-reap-pruned-awq-and-fp8-aime-2026-benchmarks
- HF: https://huggingface.co/Max-and-Omnis
Hardware: 1× RTX PRO 6000 Blackwell, vLLM 0.19.1.
Happy to answer questions on the pipeline (REAP -> GRPO -> AWQ/FP8).