nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 · Hugging Face

Reddit r/LocalLLaMA 06/04/26, 11:48 AM Models

large-language-model mixture-of-experts open-weights reasoning long-context agentic-ai nvidia

Summary

NVIDIA releases Nemotron-3-Ultra-550B-A55B, a 550B parameter (55B active) frontier LLM featuring a hybrid LatentMoE architecture combining Mamba-2, MoE, and Attention layers, with up to 1M token context length and configurable reasoning mode. It supports 11 languages and is optimized for complex agentic workflows, long-context analysis, and high-accuracy reasoning.

# Model Summary |**Total Parameters**|550B (55B active)| |:-|:-| |**Architecture**|LatentMoE - Mamba-2 + MoE + Attention hybrid with Multi-Token Prediction (MTP)| |**Context Length**|Up to 1M tokens| |**Minimum GPU Requirement**|8x GB200/B200/GB300/B300, 16x H100, 8x H200| |**Supported Languages**|English, French, Spanish, Italian, German, Japanese, Korean, Hindi, Korean, Brazilian Portuguese, and Chinese| |**Best For**|Frontier reasoning, complex agentic workflows, long-context analysis, tool use, multilingual reasoning, high-stakes RAG| |**Reasoning Mode**|Configurable on/off via chat template (`enable_thinking=True/False`)| |**License**|[OpenMDW License Agreement, version 1.1](https://raw.githubusercontent.com/OpenMDW/OpenMDW/refs/heads/main/1.1/LICENSE.OpenMDW-1.1)| |**Release Date**|June 4, 2026| # What is Nemotron? NVIDIA Nemotron™ is a family of open models with open weights, training data, and recipes, delivering leading efficiency and accuracy for building specialized AI agents. # Description **Nemotron-3-Ultra-550B-A55B-BF16** is a frontier-scale large language model (LLM) trained by NVIDIA, designed to deliver strong agentic, reasoning, and conversational capabilities. It is optimized for the most demanding workloads, including complex multi-step agents, long-context analysis, and high-accuracy reasoning over code, math, and science. Like other models in the family, it responds to user queries and tasks by first generating a reasoning trace and then concluding with a final response. The model's reasoning capabilities can be configured through a flag in the chat template. The model employs a hybrid **Latent Mixture-of-Experts (LatentMoE)** architecture, utilizing interleaved Mamba-2 and MoE layers, along with select Attention layers. Like the Super model, the Ultra model incorporates **Multi-Token Prediction (MTP)** layers for faster text generation and improved quality, and it is trained using an **NVFP4** pre-training recipe to maximize compute efficiency. The model has **55B active parameters** and **550B parameters in total**. The supported languages include: English, French, Spanish, Italian, German, Japanese, Korean, Hindi, Korean, Brazilian Portuguese, and Chinese. This model is ready for commercial and non-commercial use. **Too big to run locally on my setup, 8xH200 anyone?**

Original Article

nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 · Hugging Face

Similar Articles

nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4

@mervenoyann: NVIDIA Nemotron Ultra is here > 55B/550B a hybrid MoE with 1M context window > supports MTP speculative decoding > da…

nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

Nemotron 3 Ultra. 550 billion parameters, 55B active. 1 million context

Nemotron 3 Ultra by NVIDIA

Submit Feedback

Similar Articles

nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4

@mervenoyann: NVIDIA Nemotron Ultra is here > 55B/550B a hybrid MoE with 1M context window > supports MTP speculative decoding > da…

nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

Nemotron 3 Ultra. 550 billion parameters, 55B active. 1 million context