I saw this on another sub and didn't see it posted here. It looks awesome and can definitely be run locally. I guess it was released 11 days ago, but it never hit the top of my feed (which I check way too often), so I'm posting it again.
NVIDIA just released Star Elastic — and the inference strategy alone is worth understanding.
Here's what's actually interesting from the technical side:
- One checkpoint. Three models.
Star Elastic applies a post-training method to Nemotron Nano v3 that nests 23B and 12B submodels inside the 30B parent; both can be extracted zero-shot from the parent checkpoint. All three live in a single checkpoint, available in BF16, FP8, and NVFP4.
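To make "nested" concrete, here's a minimal sketch of what zero-shot slicing could look like, assuming submodels reuse the leading rows/columns of each parent tensor. The `slice_linear` helper and the `keep` fractions are illustrative, not the released code:

```python
import torch
import torch.nn as nn

# Hypothetical nested slicing: a submodel reuses the leading
# rows/columns of the parent's tensors, so extraction is pure
# indexing -- no retraining, no weight transformation.
def slice_linear(layer: nn.Linear, keep_out: float, keep_in: float) -> nn.Linear:
    out_f = int(layer.out_features * keep_out)
    in_f = int(layer.in_features * keep_in)
    sliced = nn.Linear(in_f, out_f, bias=layer.bias is not None)
    with torch.no_grad():
        sliced.weight.copy_(layer.weight[:out_f, :in_f])
        if layer.bias is not None:
            sliced.bias.copy_(layer.bias[:out_f])
    return sliced

parent = nn.Linear(4096, 4096)           # stand-in for one parent projection
child = slice_linear(parent, 0.75, 1.0)  # illustrative budget, not a real ratio
print(child.weight.shape)                # torch.Size([3072, 4096])
```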
- The router learns the architecture, not just the weights.
A learnable router trained via Gumbel-Softmax maps any target parameter budget to the optimal nested configuration across all elastic axes — attention heads, Mamba SSM heads, MoE experts, FFN channels, embedding dimensions. The importance-based ranking that orders these components is computed before training begins.
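Here's a toy version of that idea, assuming the router is a small MLP that maps a normalized parameter budget to a one-hot choice over discrete width options for a single elastic axis. `BudgetRouter` and the four-option axis are my simplification, not NVIDIA's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BudgetRouter(nn.Module):
    """Toy router: maps a target parameter budget (0..1) to a choice
    among discrete width options for one elastic axis (e.g. FFN channels)."""
    def __init__(self, num_options: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, 64), nn.ReLU(),
                                 nn.Linear(64, num_options))

    def forward(self, budget: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        logits = self.mlp(budget.unsqueeze(-1))
        # Gumbel-Softmax: gradients flow through the soft distribution,
        # while hard=True makes the forward choice discrete (one-hot).
        return F.gumbel_softmax(logits, tau=tau, hard=True)

router = BudgetRouter(num_options=4)  # e.g. 4 nested FFN widths
choice = router(torch.tensor([0.4]))  # 40% of the full parameter budget
print(choice)                         # one-hot over the 4 widths
```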
- Use a smaller model for thinking. Use the full model for the answer.
This is the finding I found most interesting. Elastic budget control assigns the 23B submodel to the thinking phase and the 30B model to the final answer. Reasoning traces are high-volume but tolerant of lower capacity; the final answer is low-volume but requires precision. Matching model size to phase complexity (sketched in code after the numbers below) gives:
→ +16% accuracy vs. standard budget control
→ 1.9× lower latency
Measured on AIME-2025, GPQA, LiveCodeBench v5, and MMLU-Pro.
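A rough sketch of that two-phase flow. The 23B-for-thinking / 30B-for-answer split is from the post; `generate`, the token budgets, and the phase boundary are all placeholders:

```python
from dataclasses import dataclass

@dataclass
class Phase:
    model: str       # which nested submodel serves this phase
    max_tokens: int  # rough token budget for the phase (made up)

# Hypothetical phase schedule for elastic budget control.
SCHEDULE = {
    "thinking": Phase(model="23B", max_tokens=8192),
    "answer":   Phase(model="30B", max_tokens=1024),
}

def generate(model: str, prompt: str, max_tokens: int) -> str:
    """Stand-in for your serving stack's generate() call
    (vLLM, TensorRT-LLM, transformers, ...)."""
    return f"<{model} output for {len(prompt)}-char prompt>"

def answer(question: str) -> str:
    # High-volume reasoning runs on the cheaper nested submodel...
    trace = generate(SCHEDULE["thinking"].model, question,
                     SCHEDULE["thinking"].max_tokens)
    # ...then the short, precision-sensitive answer on the full model.
    return generate(SCHEDULE["answer"].model, question + trace,
                    SCHEDULE["answer"].max_tokens)

print(answer("What is 2^10?"))
```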
- The cost reduction is significant.
→ 360× fewer tokens vs. pretraining each variant from scratch
→ 7× fewer tokens vs. state-of-the-art sequential compression
→ The 23B and 12B nested models match or outperform independently trained baselines of comparable size
- Hardware accessibility.
The 12B NVFP4 variant runs on an RTX 5080 where every BF16 configuration runs out of memory. On an RTX Pro 6000 it reaches 7,426 tokens/s — 3.4× the throughput of the 30B BF16 baseline.
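Weights-only arithmetic that makes the 16 GB RTX 5080 claim plausible (my rough assumption: ~0.56 bytes/param for NVFP4 once block scales are included; KV cache and activations ignored):

```python
# Weights-only footprint estimate, ignoring KV cache and activations.
# Bytes/param: BF16 = 2.0; NVFP4 ~ 0.56 (4-bit values + block scales,
# a rough assumption on my part).
GB = 1e9
for name, params, bpp in [("30B BF16",  30e9, 2.0),
                          ("12B BF16",  12e9, 2.0),
                          ("12B NVFP4", 12e9, 0.56)]:
    print(f"{name}: ~{params * bpp / GB:.0f} GB")
# 30B BF16 ~60 GB and 12B BF16 ~24 GB both exceed a 16 GB RTX 5080;
# 12B NVFP4 ~7 GB fits with headroom for KV cache.
```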
Read the full analysis, which also includes an interactive step-by-step code guide, here: https://www.marktechpost.com/2026/05/09/nvidia-ai-releases-star-elastic-one-checkpoint-that-contains-30b-23b-and-12b-reasoning-models-with-zero-shot-slicing/
3-in-1 model in BF16: https://huggingface.co/nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16
3-in-1 model in FP8: https://huggingface.co/nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-FP8
3-in-1 model in NVFP4: https://huggingface.co/nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-NVFP4
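If these follow standard Hugging Face packaging, loading should look roughly like the snippet below. I haven't run these exact repos, so treat the `trust_remote_code` requirement and the generation call as assumptions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype="auto",      # keep the checkpoint's native precision
    device_map="auto",
    trust_remote_code=True,  # hybrid Mamba/MoE models usually need custom code
)
inputs = tok("Explain nested elastic models in one sentence.",
             return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```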
Related paper: https://arxiv.org/abs/2511.16664. There's also a newer one titled "Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control," but I can't find it.