Hey r/LocalLLaMA,
Dropping a release I've been working on during AIMO3 (the Kaggle competition). I took NVIDIA's Nemotron-3-Super-120B-A12B (a latent-MoE + Mamba2 hybrid), REAP-pruned it from 512 -> 256 experts (and removed the MTP layer), fine-tuned it with GRPO (LoRA-based RL) on ~270 AIMO3 + AstralMath problems, then quantized it to AWQ and FP8 for inference.
Result: 120B -> 64B total parameters, runs on a single H100 / RTX PRO 6000 Blackwell, and scores 90%+ on AIME 2026.
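For context on the GRPO step: the core idea is group-relative advantages — sample several completions per problem, score them, and normalize each reward against its own group's mean/std, so no learned value model is needed. A minimal sketch in plain Python (function and variable names are mine, not from the actual training code):

```python
import math

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each sampled completion's
    reward against the mean/std of its own group (one group = one problem)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# 4 sampled solutions to one math problem, reward 1.0 if the final answer is correct
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

With a binary correctness reward like this, correct samples get positive advantage and incorrect ones negative, which is all the policy-gradient update needs.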
Models
- BF16 (full weights, ~129GB VRAM): Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16
- FP8 dynamic (W8A8, ~72GB VRAM): Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-FP8
- AWQ (W4A16, ~43GB VRAM): Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-AWQ
AIME 2026 (30 problems, avg of 4 attempts, system-role prompt)
| Variant | avg@4 | pass@4 | tool use |
|---|---|---|---|
| 120B Base model (MathArena leaderboard) | 0.9000 | n/a | no |
| Our AWQ | 0.9083 | 0.9333 | no |
| Our FP8 | 0.9167 | 0.9667 | no |
Although the benchmark was run without tools, the model is also good at Python tool-integrated reasoning!
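For clarity on the metrics: avg@4 is the mean per-attempt accuracy over 4 runs, while pass@4 counts a problem as solved if any of the 4 attempts gets it. A quick sketch (my own helpers, not from the benchmark repo):

```python
def avg_at_k(results):
    """results: one list of booleans per problem, one entry per attempt."""
    return sum(sum(attempts) / len(attempts) for attempts in results) / len(results)

def pass_at_k(results):
    """Fraction of problems where at least one attempt succeeded."""
    return sum(any(attempts) for attempts in results) / len(results)

# toy example: 3 problems, 4 attempts each
runs = [
    [True, True, True, True],     # always solved
    [True, False, True, False],   # solved half the time
    [False, False, False, False], # never solved
]
score_avg = avg_at_k(runs)    # (1.0 + 0.5 + 0.0) / 3 = 0.5
score_pass = pass_at_k(runs)  # 2 of 3 problems solved at least once
```

This is why pass@4 is always >= avg@4 in the table above: any problem solved even once counts fully toward pass@4.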
AWQ vs FP8 trade-off
FP8 has ~40% lower tokens/s throughput than AWQ, but wins on quality (+1 problem cracked at pass@4, and better numerics on the hardest problem). FP8 also converges to answers in fewer tokens, which partially offsets the throughput hit.
vLLM patch needed
vLLM's fused `grouped_topk` CUDA kernel crashes with an illegal memory access when experts_per_group > 128 (after pruning, our model has 256 experts with n_group=1). The repo includes a small patch that skips the fused kernel and falls back to the unfused path in that case.
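The fallback is conceptually just grouped top-k routing in plain, unfused ops. A pure-Python sketch of what the routing computes (names are illustrative, not vLLM's actual code): rank groups by their best expert score, mask out experts in losing groups, then take a global top-k. With n_group=1, as in the pruned model, it degenerates to an ordinary top-k over all 256 experts:

```python
def grouped_topk(scores, n_group, topk_group, topk):
    """Grouped expert routing: keep the best `topk_group` groups
    (ranked by their max expert score), mask the rest, then take
    the global top-k expert indices."""
    n_experts = len(scores)
    group_size = n_experts // n_group
    groups = [scores[g * group_size:(g + 1) * group_size] for g in range(n_group)]
    best = sorted(range(n_group), key=lambda g: max(groups[g]), reverse=True)[:topk_group]
    masked = [s if (i // group_size) in best else float("-inf")
              for i, s in enumerate(scores)]
    return sorted(range(n_experts), key=lambda i: masked[i], reverse=True)[:topk]

# n_group=1: plain top-k over all experts
picked = grouped_topk([0.1, 0.9, 0.5, 0.3], n_group=1, topk_group=1, topk=2)
```

The crash only affects the fused CUDA version of this computation, so routing results are unchanged by the patch, just slower on that one op.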
Links
- Benchmark repo: https://github.com/madmax0404/nemotron-3-super-reap-pruned-awq-and-fp8-aime-2026-benchmarks
- HF: https://huggingface.co/Max-and-Omnis
Hardware: 1× RTX PRO 6000 Blackwell, vLLM 0.19.1.
Happy to answer questions on the pipeline (REAP -> GRPO -> AWQ/FP8).