Introducing NVIDIA Nemotron 3 Nano Omni: Long-Context Multimodal Intelligence for Documents, Audio and Video Agents


Summary

NVIDIA releases Nemotron 3 Nano Omni, a new long-context multimodal AI model capable of processing documents, audio, video, and text with high accuracy and efficiency.



Source: https://huggingface.co/blog/nvidia/nemotron-3-nano-omni-multimodal-intelligence

Benchmark highlights

Building on Nemotron Nano V2 VL, Nemotron 3 Nano Omni delivers substantial visual gains and adds entirely new audio and video+audio capabilities, while also leading another open-weights omni model, Qwen3-Omni, in many domains.

| Task | Benchmark | Nemotron 3 Nano Omni | Nemotron Nano V2 VL | Qwen3-Omni 30B-A3B |
|---|---|---|---|---|
| Document understanding | OCRBenchV2-En | 65.8 | 61.2 | - |
| Document understanding | MMLongBench-Doc | 57.5 | 38.0 | 49.5 |
| Document understanding | CharXiv reasoning | 63.6 | 41.3 | 61.1 |
| GUI | ScreenSpot-Pro | 57.8 | 5.5 | 59.7 |
| GUI | OSWorld | 47.4 | 11.0 | 29.0 |
| Video understanding | Video-MME | 72.2 | 63.0 | 70.5 |
| Video + Audio understanding | WorldSense | 55.4 | - | 54.0 |
| Video + Audio understanding | DailyOmni | 74.1 | - | 73.6 |
| Voice interaction | VoiceBench | 89.4 | - | 88.8 |
| ASR | HF Open ASR (lower is better) | 5.95 | - | 6.55 |

Efficiency highlights

Compared to other open omni models with the same interactivity, Nemotron 3 Nano Omni delivers 7.4x higher system efficiency for multi-document use cases and 9.2x higher system efficiency for video use cases.

Figure 1. Total system throughput for multi-document and video use cases sustained by each model at a fixed per-user interactivity threshold (tokens/sec/user)

What Nemotron 3 Nano Omni is designed for

At a high level, Nemotron 3 Nano Omni is aimed at five classes of workloads:

1. Real-world document analysis

This is not only about OCR. The model is positioned for long, messy, high-value documents where understanding depends on layout, tables, figures, formulas, section structure, and cross-page references. Think contracts, technical papers, reports, manuals, multi-page forms, or compliance packets. The model can handle 100+ page documents.

2. Automatic Speech Recognition

Nemotron 3 Nano Omni includes strong speech understanding capabilities that enable high-quality transcription across diverse audio conditions. It handles long-form audio with varying speakers, accents, and background noise. These capabilities can be integrated into broader workflows, allowing spoken content to be transcribed, analyzed, and combined with other modalities for tasks like summarization, question answering, and cross-modal reasoning.

3. Long audio-video understanding

Many enterprise and developer workflows depend on mixed audio and visual evidence: screen recordings with narration, training videos, meetings with slides, tutorials, product demos, customer support captures, and long-form video archives. Nemotron 3 Nano Omni is built to reason over those inputs jointly.

4. Agentic computer use

The Nemotron 3 Nano Omni model is specifically trained for agentic computer use, enabling it to assist with tasks in graphical user interface (GUI) environments. Its capabilities include interpreting screenshots, monitoring the state of the user interface, grounding its reasoning in on-screen visuals, and helping with action selection or workflow automation.

5. General multimodal reasoning

The model is designed for more than perception. It excels at reasoning-intensive tasks that require synthesizing information across long context windows, multiple modalities, and structured or semi-structured evidence. It can carry out multi-step reasoning, perform calculations, and connect signals from text, images, tables, and other inputs to arrive at coherent, well-supported answers.

Model architecture and key innovations

Nemotron 3 Nano Omni uses a unified encoder-projector-decoder design. The language backbone is Nemotron 3 Nano 30B-A3B, paired with the C-RADIOv4-H vision encoder and the Parakeet-TDT-0.6B-v2 audio encoder. The modality-specific encoders connect into the LLM backbone through lightweight projectors.

Figure 2. Model architecture of NVIDIA Nemotron 3 Nano Omni 30B-A3B

A hybrid Mamba-Transformer-MoE backbone for long multimodal context

The model backbone interleaves three key components: 23 Mamba selective state-space layers for efficient long-context processing; 23 MoE layers with 128 experts, top-6 routing, and a shared expert for conditional capacity; and 6 grouped-query attention layers to preserve strong global interaction and expressivity.
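As a rough illustration of the MoE component, the toy sketch below routes a hidden state to the top 6 of 128 experts and renormalizes their gate weights. The hidden dimension, router initialization, and softmax gating are illustrative assumptions, not the model's actual implementation.

```python
import numpy as np

def moe_route(hidden, router_w, top_k=6):
    """Toy top-k MoE router: softmax over expert logits, keep the top_k
    experts, renormalize their gate weights. (A shared expert, as in the
    real model, would additionally process every token unconditionally.)"""
    logits = hidden @ router_w                    # (n_experts,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    top = np.argsort(probs)[-top_k:][::-1]        # indices of the top-6 experts
    gate = probs[top] / probs[top].sum()          # renormalized gate weights
    return top, gate

rng = np.random.default_rng(0)
d_model, n_experts = 64, 128                      # d_model is a made-up toy size
experts, gates = moe_route(rng.standard_normal(d_model),
                           rng.standard_normal((d_model, n_experts)))
print(len(experts), round(float(gates.sum()), 6))  # 6 experts, gates sum to 1.0
```

Only the 6 selected experts (plus the shared expert) run per token, which is how the model keeps 30B total parameters but far fewer active ones.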

Nemotron 3 Nano Omni combines state-space models, attention, and MoE in a unified design that maintains strong reasoning performance while remaining practical for long, multimodal contexts.

Dynamic resolution for dense documents, charts, and screens

On the vision side, Nemotron 3 Nano Omni replaces the tiling strategy used in the v2 model with dynamic resolution processing at native aspect ratio. Each image can be represented using a variable number of 16 x 16 patches, with a minimum of 1,024 and a maximum of 13,312 visual patches per image. For square images, this is equivalent to 512 x 512 and 1840 x 1840 pixels, respectively.
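To make the budget concrete, here is a small helper that picks a 16 x 16 patch grid for an image at its native aspect ratio, scaling until the patch count lands inside the stated 1,024-13,312 range. The iterative scaling policy is an assumption for illustration; the model's actual resizing rules may differ.

```python
import math

PATCH = 16
MIN_PATCHES, MAX_PATCHES = 1024, 13312            # per-image budget from the post

def patch_grid(width, height):
    """Pick a patch grid at native aspect ratio whose 16x16 patch count
    falls inside [MIN_PATCHES, MAX_PATCHES]. Scaling policy is illustrative."""
    scale = 1.0
    grid = lambda s: (math.ceil(width * s / PATCH), math.ceil(height * s / PATCH))
    gw, gh = grid(scale)
    while gw * gh > MAX_PATCHES:                  # shrink oversized images
        scale *= 0.99
        gw, gh = grid(scale)
    while gw * gh < MIN_PATCHES:                  # grow undersized images
        scale *= 1.01
        gw, gh = grid(scale)
    return gw, gh

print(patch_grid(512, 512))      # (32, 32)   -> exactly 1,024 patches
print(patch_grid(1840, 1840))    # (115, 115) -> 13,225 patches, within budget
print(patch_grid(4000, 2000))    # downscaled to stay under 13,312 patches
```

The square endpoints recover the 512 x 512 and 1840 x 1840 equivalents; non-square inputs keep their aspect ratio rather than being tiled.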

That flexibility is critical for handling high-resolution, complex visual inputs such as OCR-heavy documents, financial tables, slides, research figures, screenshots, and GUI layouts—especially when both fine details and overall structure need to be understood together.

Conv3D temporal compression for video

For video, Nemotron 3 Nano Omni uses a dedicated Conv3D tubelet embedding path. Instead of embedding each frame independently, every pair of consecutive frames is fused into a single "tubelet" before the ViT, halving the number of vision tokens the language model has to attend to. This allows us to either double the number of frames within the same token budget, or halve the number of tokens for the same number of frames.
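The token arithmetic is simple enough to sketch. Assuming (hypothetically) a fixed number of patches per frame, pairwise tubelet fusion halves the video token count:

```python
import math

def video_tokens(n_frames, patches_per_frame, tubelet=2):
    """Vision tokens reaching the LLM: consecutive frames are fused in
    groups of `tubelet` (pairs for the Conv3D path) before the ViT."""
    return math.ceil(n_frames / tubelet) * patches_per_frame

frames, patches = 128, 1024                   # illustrative values, not the model's
print(video_tokens(frames, patches, tubelet=1))   # 131072: per-frame embedding
print(video_tokens(frames, patches))              # 65536: halved by pairwise fusion
```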

EVS: Efficient Video Sampling

EVS is an important feature, enabled during inference time, that drops redundant video tokens after the vision encoder. This reduces latency and improves throughput while maintaining accuracy. The first frame of the video is kept entirely, then for each subsequent frame, EVS keeps the “dynamic” tokens where the video is changing and drops the “static” ones where nothing has changed from the previous frame. We combine this with Conv3D to enable superior compression: Conv3D fuses tokens from pairs of frames into one, and then EVS prunes redundant static information.
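A minimal sketch of the idea, operating on toy per-frame token embeddings. The post does not specify the pruning criterion or threshold, so the L2 change test below is an assumption:

```python
import numpy as np

def evs_prune(frames, threshold=0.1):
    """Toy Efficient Video Sampling over token embeddings of shape (T, N, D):
    keep every token of the first frame, then for each later frame keep only
    tokens whose embedding moved more than `threshold` since the previous
    frame. The L2 criterion and threshold are illustrative assumptions."""
    kept = [(0, i) for i in range(frames.shape[1])]       # first frame kept whole
    for t in range(1, frames.shape[0]):
        delta = np.linalg.norm(frames[t] - frames[t - 1], axis=-1)
        kept.extend((t, i) for i in np.nonzero(delta > threshold)[0])
    return kept

rng = np.random.default_rng(0)
video = np.repeat(rng.standard_normal((1, 8, 4)), 4, axis=0)  # 4 static frames, 8 tokens each
video[2:, 3] += 1.0               # one token changes starting at frame 2
kept = evs_prune(video)
print(len(kept))                  # 8 first-frame tokens + 1 newly dynamic token = 9
```

On a mostly static recording like a narrated slide deck, this kind of pruning discards the bulk of the tokens while the changed regions survive.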

Native audio input, not just text transcripts

The audio side is powered by Parakeet-TDT-0.6B-v2, connected to the backbone through its own 2-layer MLP projector. Audio is sampled at 16 kHz, and the model is trained with inputs up to 1,200 seconds (20 minutes), while the LLM max context length supports 5+ hours.

This represents a shift from traditional VLM pipelines by enabling native audio processing within a shared multimodal sequence, allowing audio, visual, and text tokens to be jointly modeled. This is crucial for scenarios like narrated screen recordings, video Q&A where speech alters visual meaning, long-form instructional or meeting content, and tasks requiring temporally grounded multimodal reasoning.

Lightweight modality projectors and unified token interleaving

Each encoder is connected to the LLM with a lightweight 2-layer MLP projector that maps encoder features into the shared embedding space. Once projected, vision, audio, and text tokens are interleaved and processed jointly.
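A toy version of this wiring, with made-up dimensions (the real encoder and LLM hidden sizes are not given in the post) and a single shared projector standing in for the per-modality ones:

```python
import numpy as np

def make_projector(d_enc, d_llm, d_hidden, rng):
    """Toy 2-layer MLP projector: encoder features -> LLM embedding space.
    Dimensions and the GELU nonlinearity are illustrative assumptions."""
    w1 = rng.standard_normal((d_enc, d_hidden)) * 0.02
    w2 = rng.standard_normal((d_hidden, d_llm)) * 0.02
    gelu = lambda x: 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))
    return lambda feats: gelu(feats @ w1) @ w2

rng = np.random.default_rng(0)
project = make_projector(d_enc=1280, d_llm=2048, d_hidden=2048, rng=rng)

vision_tokens = project(rng.standard_normal((196, 1280)))  # one image's patch features
audio_tokens = project(rng.standard_normal((50, 1280)))    # sharing dims here for brevity
text_tokens = rng.standard_normal((32, 2048))              # already in LLM space
sequence = np.concatenate([vision_tokens, audio_tokens, text_tokens])
print(sequence.shape)   # (278, 2048): one interleaved multimodal sequence
```

After projection the backbone sees a single token sequence, which is what enables cross-modal attention without any modality-specific decoder machinery.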

This design keeps the overall system modular while still enabling genuine cross-modal reasoning inside the backbone itself.

Training data, infrastructure and systems story

The SFT stages are trained on NVIDIA H100, scaling from 32 to 128 nodes depending on the stage. The stack uses Megatron-LM, Transformer Engine, and Megatron Energon, with tensor parallelism, expert parallelism, sequence parallelism, context parallelism for the long-context stages, online sequence packing, and selective activation recomputation.

Post-SFT reinforcement learning uses NeMo-RL and NeMo Gym with a Megatron backend. The RL infrastructure used a Ray-based distributed setup across B200 and H100 clusters, plus multimodal deduplication, so repeated rollouts do not multiply image, video, and audio memory.

We open-source substantial parts of our training code.

Using RL to shape reliable multimodal behavior

We introduce multi-environment text and omni training in Nemotron 3 Nano Omni. Our text RL training stage happens across diverse environments in NeMo Gym, which evaluates the model's ability to perform sequences of actions, such as tool calling, writing code, and multi-part planning, that satisfy verifiable criteria.

Omni RL trains the model to reason across images, video, audio, and text within a unified framework, covering tasks from single-modality to fully multimodal scenarios. A diverse verifier suite evaluates outputs across formats like multiple-choice, math, GUI grounding, and ASR, while intentionally including unanswerable cases to teach the model to abstain when evidence is insufficient rather than hallucinate.
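A miniature verifier in this spirit, assuming hypothetical task formats and reusing the "Not answerable" abstention string that appears elsewhere in the post:

```python
def verify(task_type, prediction, ground_truth):
    """Toy RL verifier: exact checks for verifiable formats, plus reward
    for abstaining on unanswerable cases (ground_truth=None). Task names
    and the reward scheme are illustrative assumptions."""
    pred = prediction.strip().lower()
    if ground_truth is None:                          # unanswerable case
        return 1.0 if pred == "not answerable" else 0.0
    if task_type == "multiple_choice":
        return 1.0 if pred == ground_truth.lower() else 0.0
    if task_type == "math":
        try:
            return 1.0 if abs(float(pred) - float(ground_truth)) < 1e-6 else 0.0
        except ValueError:
            return 0.0
    raise ValueError(f"unknown task type: {task_type}")

print(verify("multiple_choice", "B", "B"))                 # 1.0
print(verify("math", "42.0", "42"))                        # 1.0
print(verify("multiple_choice", "C", None))                # 0.0: should have abstained
print(verify("multiple_choice", "Not answerable", None))   # 1.0
```

Because the unanswerable cases are rewarded only for abstention, the policy has no incentive to guess when the evidence is missing.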

Data and data pipelines

Nemotron 3 Nano Omni is trained on an enhanced dataset that emphasizes high-quality reasoning across multiple modalities. We significantly expand task coverage and introduce synthetic data for complex reasoning scenarios where public datasets are limited. To support this, we build task-specific, multi-stage pipelines for scalable synthetic data generation.

As one example, we generated approximately 11.4M synthetic QA pairs (~45B tokens) from a large corpus of real-world PDFs using NeMo Data Designer. This dataset is used to strengthen long-context document reasoning during post-training and delivers a 2.19x improvement in overall accuracy on MMLongBench-Doc.

We detail the full pipeline evolution, including failure analysis and key lessons learned, in our Data Designer developer note. The note also includes nine runnable pipeline recipes that can serve as starting points for building your own document-understanding datasets.

Example workflows

Example 1: Long multi-page document analysis

Nemotron 3 Nano Omni can analyze and reason over long documents such as financial reports, academic papers, and product manuals. The following example retrieves financial metrics across a 100+ page document to calculate a derived metric:

Figure 3. Stylized example from the MMLongBench-Doc benchmark

The model was prompted to answer the question as follows:

Extract information from all the given images, then answer the question using a single word or phrase. Return 'Not answerable' if the answer cannot be derived from the given images.

The model is capable of long-context retrieval, structured extraction, table/chart reading and multi-page reasoning all in one go.

Example 2: Video + audio understanding

Nemotron 3 Nano Omni performs joint audio-visual analysis, both locally for specific scenes and globally across the entire video. This enables it to answer complex questions that require cross-modal reasoning—such as identifying specific visuals shown exactly when a certain topic is mentioned in the audio.

Video

Question

Watch the video and listen to the narration.
1. What structure is on fire as shown in the video, and how much money was being spent on its renovation project?
2. Describe in short what visuals are shown when the eye-witness narrates her experience.

Answer

1. The structure on fire is the Notre Dame Cathedral, which was undergoing an almost $7 million renovation project.

2. When the eye-witness narrates her experience, the visuals show the cathedral engulfed in flames and thick smoke, scaffolding surrounding the burning roof, firefighters using ladders to spray water, a crowd of people watching from the riverbank, and views of the burning structure against the Paris skyline.

Example 3: Agentic computer use / screenshot reasoning

Nemotron 3 Nano Omni can be integrated into agentic computer-use systems to reason over user intents, analyze GUI elements, and execute actions to accomplish tasks.

The model was given access to a web environment with the following system prompt:

You are a GUI agent. You are given an instruction, a screenshot of the screen and your previous interactions with the computer. You need to perform a series of actions to complete the task. The password of the computer is {password}.

For each step, provide your response in this format:
{thought}
## Action:
{action}
## Code:
{code}

In the code section, the code should be either pyautogui code or one of the following functions wrapped in the code block:
- {"name": "computer.wait", "description": "Make the computer wait for 20 seconds for installation, running code, etc.", "parameters": {"type": "object", "properties": {}, "required": []}}
- {"name": "computer.terminate", "description": "Terminate the current task and report its completion status", "parameters": {"type": "object", "properties": {"status": {"type": "string", "enum": ["success", "failure"], "description": "The status of the task"}, "answer": {"type": "string", "description": "The answer of the task"}}, "required": ["status"]}}
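Responses in this format are straightforward to parse back into structured steps. A minimal parser, assuming exactly one Action and one Code section per step (the field names are hypothetical):

```python
import re

def parse_agent_step(response):
    """Split a response following the '## Action:' / '## Code:' layout above
    into thought, action, and code. Assumes each marker appears exactly once."""
    thought, rest = response.split("## Action:", 1)
    action, code = rest.split("## Code:", 1)
    m = re.search(r"`{3}(?:\w+)?\n(.*?)`{3}", code, re.DOTALL)  # strip a code fence if present
    return {"thought": thought.strip(),
            "action": action.strip(),
            "code": (m.group(1) if m else code).strip()}

fence = "`" * 3   # avoid writing a literal fence inside this example
step = parse_agent_step(
    "I need to open the licenses section first.\n"
    "## Action:\nClick the 'License & IDs' tab.\n"
    f"## Code:\n{fence}python\npyautogui.click(412, 180)\n{fence}"
)
print(step["action"])   # Click the 'License & IDs' tab.
print(step["code"])     # pyautogui.click(412, 180)
```

An agent loop would execute the extracted code (or dispatch `computer.wait` / `computer.terminate` calls) and feed the next screenshot back in.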

User Instruction

Find the Driver License Eligibility Requirements

Model Response: The GIF above shows the actions that the agent takes to fulfill the user’s request

  1. It clicks on the “License & IDs” tab
  2. Then it clicks on “Driver’s License” from the dropdown menu
  3. Then it clicks on “Driver’s License Eligibility” on the left
  4. It scrolls down on this page and back up
  5. Finally it produces the following response for the user:
Successfully found the Driver License Eligibility Requirements page on the Virginia DMV website. The page contains comprehensive information about driver license eligibility including:

**General Requirements:**
- Must be a resident of the Commonwealth of Virginia
- Must be at least 16 years and 3 months of age

**First Time Drivers:**
- Must complete a state-approved driver education program
- Must obtain a learner's permit at least 60 days before taking the road skills test
- Must be at least 15 years and 6 months of age for a learner's permit
- No exceptions to age requirements

**New Residents:**
- Temporary residents: must obtain license within 6 months
- Permanent residents: must obtain license within 1 year

**Military Personnel:**
- Active-duty members of the Armed Forces stationed in Virginia
- Spouses and dependents 16 years and 3 months or older may drive with valid out-of-state license
- If vehicle is co-owned and out-of-state plates are used, all co-owners must be active duty members

**Out-of-State Students:**
- Full-time students enrolled in accredited Virginia schools (not employed) may drive with valid out-of-state license
- Full-time students who are employed are considered Virginia residents for vehicle laws

**Non-Residents:**
- Non-residents temporarily living in Virginia may drive with home state license for no more than 6 months
- Note: This does not apply to commercial vehicle drivers
- If becoming gainfully employed, must hold Virginia driver's license

**Identification and Residency Requirements:**
- Proof of identity (acceptable documents listed in DMV 141)
- Proof of legal presence in the U.S.
- Proof of Virginia residency
- Proof of Social Security number
- Proof of name change (if applicable)

Additional resources available: interactive document guide, "Acceptable Documents for Obtaining a Driver's License or Photo ID Card" (DMV 141), and Spanish language version.

Example 4: Mixed document + chart + narration reasoning

Nemotron 3 Nano Omni can analyze charts and figures shown in images together with audio files, surfacing commonalities and discrepancies across the media.

Image

Audio

Question

Use both the slide content and the spoken commentary.
List out any extra points mentioned by the speaker that are not shown in the slide.

Answer

Based on the speaker's commentary and the visual content of the slide, here are the extra points mentioned that are not explicitly shown or detailed in the slide:

*   **The Topic:** The speaker explicitly states she will be covering **multitask learning** with neural networks, while the slide title is simply "Some notation" (though it contains the mathematical definitions for tasks).
*   **Nature of the Output:** She describes the output ($y$) as a **probability distribution** over possible labels (e.g., tiger, tiger cat, lynx, cat), whereas the slide only shows a bar chart with class labels without explicitly labeling the vector as a distribution.
*   **Specific Task Example (Paper Titles):** She introduces a concrete example where the input is the **title of a paper** and the neural network is tasked with estimating the **length of the paper**. The slide shows "length of paper" near the bar chart, but does not visually distinguish how a paper title maps to these bars versus representing animal classes.
*   **Optimization Techniques:** She mentions that the loss function is typically optimized using **stochastic gradient descent** or a **"favorite optimizer,"** which are procedural steps not written on the slide (which only defines the loss function $\mathcal{L}$ to minimize).
*   **Input Diversity:** She specifies that inputs can be a **"wide range of unstructured inputs,"** specifically mentioning **images** and **text**. The slide visually displays a single image (a structured input) and uses a generic variable $x$ for the input class.

Example 5: Soundscape

Nemotron 3 Nano Omni can interpret and reason about general audio, including soundscapes and environmental sounds.

Audio

Question

What is producing the main sound in this audio, and what does the acoustic environment suggest about where it was recorded?

Answer

The main sound is a bird chirping, and the presence of multiple bird calls and a natural, open acoustic environment suggests the recording was made outdoors in a natural setting like a forest or park.

Example 6: Music

Nemotron 3 Nano Omni can also reason about music, describing a track's mood and instrumentation and suggesting scenarios it would suit.

Audio

Question

What's the overall vibe of this song, and what kind of scenario would it complement well?

Answer

The song has a calm, reflective, and slightly melancholic vibe, with a gentle piano melody and soft string accompaniment. It would complement a quiet, introspective scene such as a rainy evening, a peaceful walk, or a moment of personal reflection.

Getting started with Nemotron 3 Nano Omni

