NVIDIA releases Star Elastic, a post-training method that lets a single checkpoint function as 30B, 23B, and 12B models via zero-shot slicing. This approach enables dynamic budget control for reasoning tasks, significantly reducing latency and compute costs while maintaining accuracy.
I saw this on another sub and didn't see it posted here. It looks awesome and can definitely be run locally. It was released 11 days ago, but it never hit the top of my feed (which I look at way too often), so I'm posting it again.

# This is my take on it:

Think of it like scalable video coding: you have a UHD stream, but strip some layers and you get an HD or SD stream, all from a single file, not multiple ones. These are nested models rather than three separate sets, and they can share their KV cache, so the model can adjust speed like a sliding scale. You get an idea with the 30B model, then scale down and churn through all the thinking at ~7,000 t/s on the 12B model, generating a book's worth of reasoning in seconds, then slide back up to 30B to evaluate what's good. You could have the 30B guide the smaller ones back and forth. Maybe it's somewhat of a hybrid between dense and MoE: like MoE, but with three dense models nested like Russian dolls.

# Original Post:

NVIDIA just released Star Elastic, and the inference strategy alone is worth understanding. Here's what's actually interesting from the technical side:

1. One checkpoint. Three models. Star Elastic applies a post-training method to Nemotron Nano v3 that nests 23B and 12B submodels inside the 30B parent; both can be extracted zero-shot from the parent checkpoint. All three live in a single checkpoint, available in BF16, FP8, and NVFP4.

2. The router learns the architecture, not just the weights. A learnable router trained via Gumbel-Softmax maps any target parameter budget to the optimal nested configuration across all elastic axes: attention heads, Mamba SSM heads, MoE experts, FFN channels, and embedding dimensions. The importance-based ranking that orders these components is computed before training begins.

3. Use a smaller model for thinking. Use the full model for the answer. This is the most interesting finding.
Elastic budget control assigns the 23B submodel to the thinking phase and the 30B parent to the final answer. Reasoning traces are high-volume but tolerant of lower capacity; the final answer is low-volume but requires precision. Matching model size to phase complexity gives:

→ +16% accuracy vs. standard budget control
→ 1.9× lower latency

Measured on AIME-2025, GPQA, LiveCodeBench v5, and MMLU-Pro.

4. The cost reduction is significant.

→ 360× fewer tokens vs. pretraining each variant from scratch
→ 7× fewer tokens vs. state-of-the-art sequential compression
→ The 23B and 12B nested models match or outperform independently trained baselines of comparable size

5. Hardware accessibility. The 12B NVFP4 variant runs on an RTX 5080, where every BF16 configuration runs out of memory. On an RTX Pro 6000 it reaches 7,426 tokens/s, 3.4× the throughput of the 30B BF16 baseline.

Read the full analysis, which also has an interactive step-by-step code guide: https://www.marktechpost.com/2026/05/09/nvidia-ai-releases-star-elastic-one-checkpoint-that-contains-30b-23b-and-12b-reasoning-models-with-zero-shot-slicing/

3-in-1 model in BF16: https://huggingface.co/nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16
3-in-1 model in FP8: https://huggingface.co/nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-FP8
3-in-1 model in NVFP4: https://huggingface.co/nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-NVFP4

Related paper: https://arxiv.org/abs/2511.16664

There's also a new one called "Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control" but I can't find it.
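To make the nesting idea in point 1 concrete, here's a toy sketch (not NVIDIA's actual extraction code, just my illustration of the principle): components get a precomputed importance ranking, and a smaller submodel is simply the top-ranked subset of the parent's weights, so no retraining is needed at slice time. All names, shapes, and numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend the parent model has 8 FFN "channels"; real elastic axes would also
# include attention heads, Mamba SSM heads, MoE experts, etc.
parent_weights = rng.standard_normal((8, 4))  # 8 channels x 4 dims

# Importance ranking computed once, up front (most important channel first).
# Here we fake importance as total weight magnitude per channel.
importance_rank = np.argsort(-np.abs(parent_weights).sum(axis=1))

def slice_submodel(weights, rank, keep):
    """Zero-shot slice: keep only the `keep` most important channels."""
    kept = np.sort(rank[:keep])  # preserve the channels' original order
    return weights[kept]

full  = slice_submodel(parent_weights, importance_rank, 8)  # "30B" analogue
mid   = slice_submodel(parent_weights, importance_rank, 6)  # "23B" analogue
small = slice_submodel(parent_weights, importance_rank, 3)  # "12B" analogue

print(full.shape, mid.shape, small.shape)
```

Note the Russian-doll property: the small submodel's channels are by construction a subset of the mid submodel's, which is a subset of the parent's, so all three can live in one checkpoint.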
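And a back-of-the-envelope sketch of the budget-control idea in point 3: route the high-volume thinking phase to a cheaper nested submodel and only the short final answer to the full parent. The per-token costs and token counts below are invented for illustration, not measured figures from the paper.

```python
# Arbitrary relative per-token costs for each nested variant (made-up units).
COST_PER_TOKEN = {"12B": 1.0, "23B": 2.0, "30B": 3.0}

def phase_cost(model: str, n_tokens: int) -> float:
    """Cost of generating n_tokens with a given nested variant."""
    return COST_PER_TOKEN[model] * n_tokens

# Thinking is high-volume, the final answer is short.
thinking_tokens, answer_tokens = 4000, 200

# Baseline: the 30B parent handles both phases.
baseline = phase_cost("30B", thinking_tokens) + phase_cost("30B", answer_tokens)

# Elastic: the 23B submodel thinks, the 30B parent writes the answer.
elastic = phase_cost("23B", thinking_tokens) + phase_cost("30B", answer_tokens)

print(f"baseline={baseline}, elastic={elastic}, speedup={baseline / elastic:.2f}x")
```

The point is that because thinking dominates the token count, the savings are roughly the ratio of the two models' per-token costs, which is how a phase-aware assignment can cut latency while the precision-sensitive answer still gets the full model.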