"How NVIDIA Built Nemotron 3 Open Model" by "Caleb Writes Code" x "Joey Conway"

Reddit r/LocalLLaMA Models

Summary

NVIDIA released the Nemotron 3 open model, offering three sizes: Nano, Super, and Ultra. It optimizes hardware efficiency through architectural innovations such as hybrid Mamba Transformer, latent MoE, and multi-token prediction, and adopts the Open MDW 1.1 open license.


Cached at: 06/11/26, 02:02 PM

TL;DR: NVIDIA released Nemotron 3 in three sizes (Nano, Super, Ultra) chosen to match common user compute configurations, optimized hardware utilization through a hybrid Mamba Transformer, Latent MoE, and Multi-Token Prediction (MTP), and adopted an open license.

## Why Three Variants? Hardware Determines Model Scale

NVIDIA's Joey Conway explained in an interview that the three Nemotron 3 variants (Nano, Super, Ultra) were sized to the most common compute configurations in NVIDIA's installed base, with the aim of offering several trade-offs between accuracy and cost, latency, and throughput.

- **Nano** (about 30B total parameters, about 3B activated) targets consumer hardware. At release it ships in NVFP4 format, which reduces the required VRAM to about 15 GB (half of FP8).
- **Super** (about 120B total parameters, about 10B activated) targets server-grade GPUs (e.g., H100, A100). It can run in FP8 on an 80 GB H100 and gets compute acceleration from NVFP4 on the Blackwell architecture (B200, B300); it also runs on DGX Spark.
- **Ultra** (about 550B total parameters, about 50B activated) is comparable in scale to DeepSeek V3 and requires multiple GPUs (AI-factory level). In base-model accuracy evaluations (next-token prediction) it achieved "the highest accuracy score ever recorded for any published base model".

## Model Architecture: Hybrid Mamba, Latent MoE, and Multi-Token Prediction

NVIDIA made three key architectural decisions at the model layer to fully exploit the underlying hardware.

### Hybrid Mamba Transformer

The computational cost of traditional attention grows quadratically with the size of the context window. In enterprise scenarios, which require processing large amounts of context, this significantly increases the memory footprint and reduces the number of queries that can be served concurrently. NVIDIA adopted a design that interleaves Mamba 2 (a state-space model, SSM) layers with full attention layers.
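To make the memory pressure of full attention concrete, here is a minimal sketch of how a KV cache grows with context length. The layer count, head count, and head dimension are hypothetical placeholders chosen for illustration; they are not Nemotron 3's actual configuration.

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size in GB (1 GB = 1e9 bytes) for one sequence."""
    # Each attention layer stores one key and one value vector per token.
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * tokens / 1e9


def attention_score_entries(tokens: int) -> int:
    """Entries in the token-by-token score matrix for one head (quadratic)."""
    return tokens * tokens


# Hypothetical all-attention model: 32 layers, 8 KV heads of size 128, 16-bit values.
CONFIG = dict(layers=32, kv_heads=8, head_dim=128)

for n in (8_192, 131_072, 1_000_000):
    print(f"{n:>9} tokens: KV cache ~ {kv_cache_gb(n, **CONFIG):7.2f} GB")
```

Under these assumed dimensions the cache grows linearly with the number of tokens and reaches roughly 131 GB for a single one-million-token sequence, while the score matrix grows quadratically (4x the context means 16x the entries). This is the kind of cost that layers with a fixed-size state are meant to avoid.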
- Mamba 2 does not rely on a KV cache. Instead, it compresses the past context into a fixed-size state matrix that is updated as each token is processed, similar to an RNN hidden state. The difference is that Mamba 2 uses hardware-friendly algorithms that let the work run as efficiently parallelized matrix multiplications.
- The state therefore needs constant memory regardless of context length, and compute grows linearly rather than quadratically, which enables a 1-million-token context window. Interleaving with full attention layers preserves the ability to capture long-range dependencies.

### Latent MoE

Mixture-of-Experts (MoE) reduces the data moved from HBM to SRAM through sparse activation (only about 10% of the weights are activated per token). Expert parallelism goes further by distributing the experts across multiple GPUs (e.g., 8 "funnels", each with 3–8 TB/s of memory bandwidth). For a large model (e.g., 550B parameters, ~275 GB at 4-bit precision), streaming all weights through a single GPU would take roughly 91 ms at 3 TB/s down to 34 ms at 8 TB/s; split across 8 GPUs, each GPU transfers only about 35 GB.

Nemotron's Latent MoE takes this further: it first projects each token down to a lower-dimensional latent representation, then routes to and runs the experts on that latent representation, reducing the memory bandwidth and compute required for routing and expert computation. NVIDIA spends this "surplus" on packing in more experts, so each token can weigh more experts and select the best combination.

### Multi-Token Prediction (MTP)

Traditional autoregressive generation predicts one token at a time. MTP lets the model predict several future tokens at once (e.g., 5), improving expressiveness and forward-looking ability during training. During inference, MTP can be combined with speculative decoding: the model drafts several tokens, then verifies them in one pass and keeps the ones that match, advancing multiple tokens per step and speeding up generation. Nemotron 3 supports this usage.

## Open License: Open MDW 1.1

How open an AI model really is often gets confused by vague licenses.
Apache 2.0 was originally designed for software and does not clearly cover the full set of artifacts that make up a model release: weights, code, documentation, RL environments, training recipes, and so on. The Linux Foundation revised the Open MDW license (1.0 → 1.1) to clarify its terms. NVIDIA adopted this license for the Nemotron models as well as for projects such as Cosmos and Isaac GR00T.

---

Source: How NVIDIA Built Nemotron 3 Open Model | Caleb Writes Code x Joey Conway (https://www.youtube.com/watch?v=wzHXUtkoY-c)
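As a rough check on the bandwidth arithmetic in the Latent MoE section, the sketch below recomputes the figures under stated assumptions: the ~275 GB footprint is taken to mean roughly 550B parameters at 4 bits per weight, and the 8 "funnels" are treated as 8 GPUs with 3–8 TB/s of memory bandwidth each. The function names are illustrative, and the calculation ignores quantization scale factors, activations, and interconnect traffic.

```python
def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def read_time_ms(gigabytes: float, bandwidth_tb_per_s: float) -> float:
    """Milliseconds to stream `gigabytes` once at the given bandwidth."""
    # GB / (TB/s) = 1e9 B / (1e12 B/s) = 1e-3 s = 1 ms
    return gigabytes / bandwidth_tb_per_s


TOTAL_GB = weight_gb(550, 4)   # assumed ~550B parameters at 4 bits per weight
NUM_GPUS = 8                   # the 8 "funnels" from the text
PER_GPU_GB = TOTAL_GB / NUM_GPUS

for bw in (3, 8):  # memory bandwidth in TB/s
    print(f"{bw} TB/s: one GPU reads all {TOTAL_GB:.0f} GB in "
          f"{read_time_ms(TOTAL_GB, bw):.1f} ms; each of {NUM_GPUS} GPUs reads "
          f"{PER_GPU_GB:.1f} GB in {read_time_ms(PER_GPU_GB, bw):.1f} ms")
```

The same helper reproduces the Nano figure: `weight_gb(30, 4)` gives 15 GB, which matches the stated VRAM requirement if Nano has about 30B total parameters.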
