NVIDIA releases Star Elastic, a post-training method that lets a single checkpoint function as 30B, 23B, and 12B models via zero-shot slicing. This approach enables dynamic budget control for reasoning tasks, significantly reducing latency and compute costs while maintaining accuracy.
I saw this on another sub and didn't see it posted here. It looks awesome and can definitely be run locally. It was released 11 days ago, but it never hit the top of my feed (which I look at way too often), so I'm posting it again.

# This is my take on it:

Think of it like scalable video coding: you have a UHD stream, but strip some layers and you get an HD or SD stream, all from a single file, not multiple ones. These are nested models rather than three separate sets, and they can share their KV cache, so the model can adjust speed like a sliding scale. You get an idea with the 30B model, then scale down and churn through all the thinking at ~7,000 t/s on the 12B model, generating a book's worth of reasoning in seconds, then slide back up to 30B to evaluate what's good. You could have the 30B guide the smaller ones back and forth. Maybe it's somewhat of a hybrid between dense and MoE: like MoE, but with three dense models nested like Russian dolls.

# Original Post:

NVIDIA just released Star Elastic, and the inference strategy alone is worth understanding. Here's what's actually interesting from the technical side:

1. One checkpoint. Three models. Star Elastic applies a post-training method to Nemotron Nano v3 that nests 23B and 12B submodels inside the 30B parent; both can be extracted zero-shot from the parent checkpoint. All three live in a single checkpoint, available in BF16, FP8, and NVFP4.

2. The router learns the architecture, not just the weights. A learnable router trained via Gumbel-Softmax maps any target parameter budget to the optimal nested configuration across all elastic axes: attention heads, Mamba SSM heads, MoE experts, FFN channels, and embedding dimensions. The importance-based ranking that orders these components is computed before training begins.

3. Use a smaller model for thinking. Use the full model for the answer. This is the most interesting finding.
Elastic budget control assigns the 23B submodel to the thinking phase and the 30B parent to the final answer. Reasoning traces are high-volume but tolerant of lower capacity; the final answer is low-volume but requires precision. Matching model size to phase complexity gives:

→ +16% accuracy vs. standard budget control
→ 1.9× lower latency

Measured on AIME-2025, GPQA, LiveCodeBench v5, and MMLU-Pro.

4. The cost reduction is significant.

→ 360× fewer tokens vs. pretraining each variant from scratch
→ 7× fewer tokens vs. state-of-the-art sequential compression
→ The 23B and 12B nested models match or outperform independently trained baselines of comparable size

5. Hardware accessibility. The 12B NVFP4 variant runs on an RTX 5080, where every BF16 configuration runs out of memory. On an RTX Pro 6000 it reaches 7,426 tokens/s, 3.4× the throughput of the 30B BF16 baseline.

Read the full analysis, which also has an interactive step-by-step code guide: https://www.marktechpost.com/2026/05/09/nvidia-ai-releases-star-elastic-one-checkpoint-that-contains-30b-23b-and-12b-reasoning-models-with-zero-shot-slicing/

3-in-1 model in BF16: https://huggingface.co/nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16
3-in-1 model in FP8: https://huggingface.co/nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-FP8
3-in-1 model in NVFP4: https://huggingface.co/nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-NVFP4

Related paper: https://arxiv.org/abs/2511.16664

There's also a new one called "Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control" but I can't find it.
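To make the nesting idea in point 1 concrete, here's a toy sketch (not NVIDIA's actual extraction code, just my illustration of the principle): components get a precomputed importance ranking, and a smaller submodel is simply the top-ranked subset of the parent's weights, so no retraining is needed at slice time. All names, shapes, and numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend the parent model has 8 FFN "channels"; real elastic axes would also
# include attention heads, Mamba SSM heads, MoE experts, etc.
parent_weights = rng.standard_normal((8, 4))  # 8 channels x 4 dims

# Importance ranking computed once, up front (most important channel first).
# Here we fake importance as total weight magnitude per channel.
importance_rank = np.argsort(-np.abs(parent_weights).sum(axis=1))

def slice_submodel(weights, rank, keep):
    """Zero-shot slice: keep only the `keep` most important channels."""
    kept = np.sort(rank[:keep])  # preserve the channels' original order
    return weights[kept]

full  = slice_submodel(parent_weights, importance_rank, 8)  # "30B" analogue
mid   = slice_submodel(parent_weights, importance_rank, 6)  # "23B" analogue
small = slice_submodel(parent_weights, importance_rank, 3)  # "12B" analogue

print(full.shape, mid.shape, small.shape)
```

Note the Russian-doll property: the small submodel's channels are by construction a subset of the mid submodel's, which is a subset of the parent's, so all three can live in one checkpoint.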
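And a back-of-the-envelope sketch of the budget-control idea in point 3: route the high-volume thinking phase to a cheaper nested submodel and only the short final answer to the full parent. The per-token costs and token counts below are invented for illustration, not measured figures from the paper.

```python
# Arbitrary relative per-token costs for each nested variant (made-up units).
COST_PER_TOKEN = {"12B": 1.0, "23B": 2.0, "30B": 3.0}

def phase_cost(model: str, n_tokens: int) -> float:
    """Cost of generating n_tokens with a given nested variant."""
    return COST_PER_TOKEN[model] * n_tokens

# Thinking is high-volume, the final answer is short.
thinking_tokens, answer_tokens = 4000, 200

# Baseline: the 30B parent handles both phases.
baseline = phase_cost("30B", thinking_tokens) + phase_cost("30B", answer_tokens)

# Elastic: the 23B submodel thinks, the 30B parent writes the answer.
elastic = phase_cost("23B", thinking_tokens) + phase_cost("30B", answer_tokens)

print(f"baseline={baseline}, elastic={elastic}, speedup={baseline / elastic:.2f}x")
```

The point is that because thinking dominates the token count, the savings are roughly the ratio of the two models' per-token costs, which is how a phase-aware assignment can cut latency while the precision-sensitive answer still gets the full model.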