nvidia/nemotron-3.5-asr-streaming-0.6b

Hugging Face Models Trending 05/15/26, 09:52 PM Models

automatic-speech-recognition multilingual streaming asr nvidia nemotron speech-recognition

Summary

NVIDIA releases Nemotron 3.5 ASR, a 600M parameter multilingual streaming speech recognition model supporting 40 language-locales with a Cache-Aware FastConformer-RNNT architecture for low-latency transcription. The model supports configurable chunk sizes and is ready for commercial use under the OpenMDW-1.1 license.

Task: automatic-speech-recognition Tags: nemo, speech-recognition, cache-aware ASR, automatic-speech-recognition, streaming-asr, multilingual, speech, audio, FastConformer, RNNT, Parakeet, ASR, pytorch, NeMo, en, es, de, fr, it, ar, ja, ko, pt, ru, hi, zh, vi, he, nl, cs, da, pl, no, sv, th, tr, bg, el, et, fi, hr, hu, lt, lv, ro, sk, uk, mt, sl, dataset:nvidia/Granary, dataset:multilingual_librispeech, dataset:fleurs, dataset:mozilla-foundation/common_voice_8_0, dataset:voxpopuli, dataset:europarl, arxiv:2312.17279, arxiv:2305.05084, license:other, model-index, region:us

Cached at: 06/05/26, 02:21 AM

nvidia/nemotron-3.5-asr-streaming-0.6b · Hugging Face

Source: https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b

Nemotron 3.5 ASR overview: multilingual audio across 40 language-locales is transcribed by a cache-aware FastConformer-RNNT model with language-ID prompting into punctuated text with an automatic language tag

This model is the multilingual extension of nvidia/nemotron-speech-streaming-en-0.6b, adding language-ID prompt conditioning to support transcription across 40 language-locales from a single model.

Nemotron 3.5 ASR is a multilingual, streaming Automatic Speech Recognition (ASR) model engineered to deliver high-quality multilingual transcription across both low-latency streaming and high-throughput batch workloads. Developed by NVIDIA, this 600M parameter model transcribes speech into text with native support for punctuation and capitalization, and offers runtime flexibility with configurable chunk sizes, including 80ms, 160ms, 320ms, 560ms, and 1120ms.

By leveraging a state-of-the-art Cache-Aware FastConformer-RNNT architecture, the model eliminates redundant overlapping computations common in traditional “buffered” streaming. This allows it to process only new audio chunks while reusing cached encoder context, significantly improving computational efficiency and minimizing end-to-end delay without sacrificing accuracy.

It was trained on a massive ASR dataset and is engineered to perform across diverse and challenging acoustic conditions.

This model is ready for commercial use.

License/Terms of Use

Governing Terms: Use of the model is governed by the OpenMDW-1.1 license.

Deployment Geography

Global

Use Case

This model is for transcription of multilingual audio.

Release Date

Hugging Face [06/04/2026] via https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b

References

[1] Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition

[2] Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition

[3] NVIDIA Granary

[4] NVIDIA NeMo Framework

Why Choose Nemotron 3.5 ASR?

🌍 **Single Multilingual Model:** Transcribes 40 language-locales from one model through language-ID prompt conditioning, with optional automatic language detection.
⚡ **Native Streaming Architecture:** Cache-aware design enables efficient processing of continuous audio streams, designed and optimized for low-latency voice agent applications.
💰 **Improved Operational Efficiency:** Delivers superior throughput compared to traditional buffered streaming approaches. This allows for a higher number of parallel streams within the same GPU memory constraints, directly reducing operational costs for production environments.
🎛️ **Dynamic Runtime Flexibility:** Choose the optimal operating point on the latency-accuracy Pareto curve at inference time. No re-training is required to adjust for different use-case requirements.
📝 **Punctuation & Capitalization:** Built-in support for punctuation and capitalization in output text.

Supported Languages

The model supports 40 language-locales in total, across three tiers:

**Transcription-ready (19 locales):** highest-accuracy ASR, ready out of the box.
**Broad-coverage (13 locales):** production ASR across an additional 13 locales.
**Adaptation-ready (8 locales):** recognized by the tokenizer; fine-tune on in-domain data to unlock full transcription.

| Tier | Languages (locales) |
| --- | --- |
| **Transcription-ready (19 locales)** | English (en-US, en-GB), Spanish (es-US, es-ES), French (fr-FR, fr-CA), Italian (it-IT), Portuguese (pt-BR, pt-PT), Dutch (nl-NL), German (de-DE), Turkish (tr-TR), Russian (ru-RU), Arabic (ar-AR), Hindi (hi-IN), Japanese (ja-JP), Korean (ko-KR), Vietnamese (vi-VN), Ukrainian (uk-UA) |
| **Broad-coverage (13 locales)** | Polish (pl-PL), Swedish (sv-SE), Czech (cs-CZ), Norwegian Bokmål (nb-NO), Danish (da-DK), Bulgarian (bg-BG), Finnish (fi-FI), Croatian (hr-HR), Slovak (sk-SK), Mandarin (zh-CN), Hungarian (hu-HU), Romanian (ro-RO), Estonian (et-EE) |
| **Adaptation-ready (8 locales)** | Greek (el-GR), Lithuanian (lt-LT), Latvian (lv-LV), Maltese (mt-MT), Slovenian (sl-SI), Hebrew (he-IL), Thai (th-TH), Norwegian Nynorsk (nn-NO) |

**Note:** Transcription-ready and broad-coverage locales (32 total) produce ASR transcription out of the box; adaptation-ready locales require fine-tuning on in-domain data to enable full transcription. The model supports uppercase and lowercase letters, punctuation, spaces, and apostrophes.

**Note:** We recommend the Nemotron ASR Streaming (English) model for English-only transcription use cases. For all other transcription-ready locales, we recommend Nemotron 3.5 ASR to leverage its expanded multilingual capabilities.

**Automatic language detection / language tagging:** When run with target_lang=auto, the model detects the spoken language and emits the corresponding language code/tag in the output following the terminal punctuation. This lets a single deployment transcribe mixed-language traffic and automatically label each utterance with its detected language, with no separate language-ID component required.

Model Architecture

**Architecture Type:** FastConformer-CacheAware-RNNT with Prompt

This model consists of a cache-aware streaming Parakeet (FastConformer) encoder with an RNN-T decoder and language-ID prompt conditioning. It is based on the Cache-Aware [1] FastConformer [2] architecture with 24 encoder layers and an RNNT (Recurrent Neural Network Transducer) decoder. The cache-aware streaming design enables efficient processing of audio in chunks while maintaining context from previous frames. Unlike buffered inference, this model maintains caches for all encoder self-attention and convolution layers. This enables reuse of hidden states at every streaming step, where cached activations eliminate redundant computations. As a result, there are no overlapping computations; each processed frame is strictly non-overlapping. This model leverages prompts to guide the transcription process, enabling language-specific transcription from a single ASR model through language ID conditioning.

Nemotron 3.5 ASR architecture: FastConformer encoder and language-ID encoding are concatenated, projected, and fed to the RNNT decoder

The language-ID prompt is fused with the acoustic representation as follows:

FastConformer encoder processes audio into an acoustic embedding of shape (D=512, T).
Language Encoding expands a 128-dim one-hot language vector across the time axis → (K=128, T), broadcasting the language identity to every frame.
Concatenation along the feature axis → fused tensor (D + K, T).
Projection layer maps the fused features to the RNNT decoder.
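
The fusion steps above can be sketched with NumPy. This is only an illustration of the tensor shapes, not NeMo's implementation: the frame count T, the language slot LANG_INDEX, the random weights, and the projection width OUT_DIM are all assumptions made for the example, since the model card does not state them.

```python
import numpy as np

D, K, T = 512, 128, 50   # D and K come from the text; T is an arbitrary frame count
OUT_DIM = 640            # hypothetical projection width (not stated in the model card)
LANG_INDEX = 3           # hypothetical one-hot slot for the target language


def fuse_language_prompt(acoustic, lang_index, weight):
    """Broadcast a one-hot language vector over time, concatenate, and project."""
    _, t = acoustic.shape
    one_hot = np.zeros(K, dtype=acoustic.dtype)
    one_hot[lang_index] = 1.0
    lang = np.repeat(one_hot[:, None], t, axis=1)      # (K, T)
    fused = np.concatenate([acoustic, lang], axis=0)   # (D + K, T)
    return weight @ fused                              # (OUT_DIM, T)


rng = np.random.default_rng(0)
acoustic = rng.standard_normal((D, T)).astype(np.float32)          # stand-in encoder output
weight = rng.standard_normal((OUT_DIM, D + K)).astype(np.float32)  # stand-in projection
projected = fuse_language_prompt(acoustic, LANG_INDEX, weight)
print(projected.shape)
```

Because the language rows are identical at every time step, each frame carries the same language signal into the projection regardless of where it falls in the stream.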

Network Architecture:

Encoder: Cache-Aware FastConformer with 24 layers
Decoder: RNNT (Recurrent Neural Network Transducer)
Parameters: 600M

This model was developed based on nvidia/nemotron-speech-streaming-en-0.6b.

Results at a Glance

ASR performance is measured using Word Error Rate (WER) on the FLEURS test sets. Accuracy stays strong across both modes and improves as the chunk size grows, while remaining competitive even at the lowest-latency 80ms setting. Full tables are in Performance.

FLEURS average WER vs streaming chunk size (LangID vs Auto-detect)

FLEURS WER by language: LangID vs Auto-detect at 320ms chunk

**Note:** Japanese and Korean are measured using Character Error Rate (CER) rather than WER, as is standard for these languages.

Throughput & Efficiency

Despite being roughly half the size (0.6B vs. 1.1B), Nemotron 3.5 ASR serves far more concurrent streams at far lower latency than the Parakeet RNNT 1.1B multilingual model, which runs on buffered streaming. The cache-aware streaming design avoids the redundant recomputation of buffered inference, so a single H100 can sustain dramatically higher concurrency at every chunk size, directly lowering the cost per stream in production. At the lowest-latency 80ms setting, Nemotron sustains **~17× more concurrent streams** (240 vs. 14); at the 1120ms setting it sustains 6× more (2,400 vs. 400). The latency-vs-concurrency curves tell the same story: Nemotron (solid green) holds low final-token latency well past 1,000 parallel requests, while Parakeet RNNT 1.1B (dashed blue) saturates after only a few hundred.

Concurrent streams supported on a single H100: Nemotron ASR streaming vs Parakeet RNNT, across chunk sizes

Median final-token latency vs number of parallel requests on a single H100, Nemotron vs Parakeet RNNT across chunk sizes

Measured on a single NVIDIA H100. Throughput is the number of real-time streams sustainable in parallel; latency is the median final-token latency at a given level of concurrency.
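
The stream-count ratios quoted above follow directly from the reported figures. A quick check using only the numbers in the text (the dictionary layout is just for illustration):

```python
# Concurrent real-time streams on one H100 as reported above:
# chunk size -> (Nemotron 3.5 ASR, Parakeet RNNT 1.1B)
reported_streams = {"80ms": (240, 14), "1120ms": (2400, 400)}

for chunk, (nemotron, parakeet) in reported_streams.items():
    ratio = nemotron / parakeet
    print(f"{chunk}: {ratio:.1f}x more concurrent streams")
# 80ms: 17.1x more concurrent streams
# 1120ms: 6.0x more concurrent streams
```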

Explore more from NVIDIA

For documentation, deployment guides, enterprise-ready APIs, and the latest open models (including Nemotron and other cutting-edge speech, translation, and generative AI), visit the NVIDIA Developer Portal at developer.nvidia.com. Join the community to access tools, support, and resources to accelerate your development with NVIDIA’s NeMo, Speech NIM, and foundation models.

Also, check out the following NVIDIA speech models:

Nemotron ASR Streaming (English) (Nemotron 3 ASR) - https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b
Multitalker Parakeet Streaming - https://huggingface.co/nvidia/multitalker-parakeet-streaming-0.6b-v1
Parakeet Realtime EOU - https://huggingface.co/nvidia/parakeet_realtime_eou_120m-v1

NVIDIA NeMo

To train, fine-tune, or perform inference with this model, you will need to install NVIDIA NeMo [4]. We recommend installing it after Cython and the latest PyTorch version.

apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython packaging
pip install "git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]"

How to Use this Model

The model is available for use in the NeMo Framework, and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Loading the Model

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/nemotron-3.5-asr-streaming-0.6b")

Streaming Inference

You can use the cache-aware streaming inference script from NeMo - NeMo/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py

This is a prompt-conditioned multilingual model: pass the target language with target_lang (e.g. en-US, es-ES, de-DE), or use target_lang=auto for automatic language detection.

cd NeMo
# target_lang: language key (e.g. en-US) or "auto" for automatic language detection
# att_context_size: set the second value to the desired right context from {0,1,3,6,13}
# strip_lang_tags: true removes the detected language tag from the text; false keeps it in the output
python examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py \
    model_path=<model_path> \
    dataset_manifest=<dataset_manifest> \
    batch_size=<batch_size> \
    target_lang=<lang_id> \
    att_context_size="[56,13]" \
    strip_lang_tags=true \
    output_path=<output_folder>
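
The dataset_manifest argument points to a NeMo-style manifest: a JSON-lines file with one utterance per line. The sketch below writes one using the standard audio_filepath, duration, and text fields. The file names and the silent demo clip are placeholders, the 16 kHz rate is assumed for the demo only, and whether this model's script expects any additional per-utterance fields is not covered by the model card.

```python
import json
import tempfile
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def write_manifest(wav_paths, manifest_path):
    """Write one JSON object per line, as NeMo manifests expect."""
    with open(manifest_path, "w", encoding="utf-8") as out:
        for path in wav_paths:
            entry = {
                "audio_filepath": str(path),
                "duration": round(wav_duration_seconds(path), 3),
                "text": "",  # reference transcript if available; left empty here
            }
            out.write(json.dumps(entry, ensure_ascii=False) + "\n")


# Demo: a one-second silent mono clip so the sketch runs end to end.
workdir = Path(tempfile.mkdtemp())
sample = workdir / "sample.wav"
with wave.open(str(sample), "wb") as wav:
    wav.setnchannels(1)       # mono, as the model requires
    wav.setsampwidth(2)       # 16-bit PCM
    wav.setframerate(16000)   # assumed sample rate for the demo only
    wav.writeframes(b"\x00\x00" * 16000)

manifest = workdir / "manifest.json"
write_manifest([sample], manifest)
print(manifest.read_text(encoding="utf-8"))
```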

**strip_lang_tags** controls how the detected language tag is handled in the output. The model appends a language tag (e.g. <en-US>) after the transcript’s terminal punctuation:

strip_lang_tags=false (keep): the tag is left in the output, so you can read the detected language directly from each utterance; useful for mixed-language traffic and language labeling.
strip_lang_tags=true (remove): the tag is stripped, leaving only the clean transcript text; useful when you only need the spoken words.
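
When the tag is kept, downstream code has to separate it from the transcript. A minimal sketch of that step, assuming the tag always has the shape shown in the <en-US> example; the function name and the exact tag grammar are illustrative, not part of NeMo:

```python
import re

# Trailing tag such as "<en-US>" at the end of a transcript.
# The pattern is an assumption based on the single example in the text.
_LANG_TAG = re.compile(r"\s*<([A-Za-z]{2,3}-[A-Za-z]{2})>\s*$")


def split_language_tag(text):
    """Return (transcript, language_tag or None) for one utterance."""
    match = _LANG_TAG.search(text)
    if match is None:
        return text.strip(), None
    return text[: match.start()].strip(), match.group(1)


print(split_language_tag("Hello, how are you? <en-US>"))
print(split_language_tag("Hola, ¿cómo estás?<es-ES>"))
print(split_language_tag("No tag here."))
```

Utterances without a recognizable tag pass through unchanged with None as the language, so the same helper works on output produced with strip_lang_tags=true.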

Setting up Streaming Configuration

Latency is defined by the att_context_size parameter, where att_context_size = {num_frames_left_context, num_frames_right_context}, all measured in 80ms frames:

[56, 0]: Chunk size = 1 (1 × 80ms = 0.08s)
[56, 1]: Chunk size = 2 (2 × 80ms = 0.16s)
[56, 3]: Chunk size = 4 (4 × 80ms = 0.32s)
[56, 6]: Chunk size = 7 (7 × 80ms = 0.56s)
[56, 13]: Chunk size = 14 (14 × 80ms = 1.12s)

Here, chunk size = current frame + right context; each chunk is processed in non-overlapping fashion.
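
The mapping above is simple enough to compute directly. The helpers below are an illustrative sketch (their names are not NeMo APIs): one converts an att_context_size pair to a chunk duration, the other picks the largest listed right context that fits a latency budget.

```python
FRAME_MS = 80  # frame duration stated above
RIGHT_CONTEXT_CHOICES = (0, 1, 3, 6, 13)  # right-context values listed for this model


def chunk_latency_ms(att_context_size):
    """Chunk duration in ms for att_context_size = [left_context, right_context]."""
    _left, right = att_context_size
    chunk_frames = 1 + right  # current frame + right-context frames
    return chunk_frames * FRAME_MS


def right_context_for(budget_ms):
    """Largest listed right context whose chunk fits within budget_ms, else None."""
    fitting = [r for r in RIGHT_CONTEXT_CHOICES if (1 + r) * FRAME_MS <= budget_ms]
    return max(fitting) if fitting else None


for right in RIGHT_CONTEXT_CHOICES:
    print([56, right], chunk_latency_ms([56, right]), "ms")

print(right_context_for(500))  # 3, i.e. att_context_size=[56,3] (320 ms chunks)
```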

Input(s):

**Input Type(s):** Audio, Lang ID

**Input Format(s):** wav, string

**Input Parameters:** One-Dimensional (1D) for audio and One-Dimensional (1D) for Lang ID

**Other Properties Related to Input:** Maximum length in seconds depends on GPU memory; no pre-processing needed; mono channel is required.

By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Output

**Output Type(s):** Text String in Input Language

**Output Format(s):** String

**Output Parameters:** One-Dimensional (1D)

**Other Properties Related to Output:** No maximum character length; output includes punctuation and capitalization.

Software Integration

**Runtime Engine:** NeMo 26.06

Supported Hardware Microarchitecture Compatibility:

NVIDIA Ampere
NVIDIA Blackwell
NVIDIA Hopper
NVIDIA Jetson
NVIDIA Lovelace
NVIDIA Turing
NVIDIA Volta

Supported Operating System(s):

Linux
Linux 4 Tegra

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s):

nemotron-3.5-asr-streaming-0.6b-v1

Training and Evaluation Datasets:

Training Datasets

It was trained on speech data across 40 language-locales. The training data is a dynamic blend of public and proprietary internal datasets normalized to have spoken forms in text with punctuation and capitalization, including:

NVIDIA Riva multilingual ASR training set (Proprietary)
NVIDIA Granary[3]
Multilingual LibriSpeech (MLS)
Mozilla Common Voice
FLEURS
VoxPopuli / Europarl-ASR

**Data Modality:** Audio

**Audio Training Data Size:** 10,000 to 1 Million Hours

**Data Collection Method by dataset:**

Human

**Labeling Method by dataset:**

Human
Synthetic: Synthetic labels were generated from an ensemble of ASR models (NVIDIA Canary, Parakeet Multilingual 1.1B RNNT, Parakeet CTC 1.1B, OpenAI Whisper, and FunASR), with punctuation and capitalization (PnC) generated from Qwen3-32B.

Evaluation Datasets

The model was evaluated on multilingual ASR benchmarks:

FLEURS
Mozilla Common Voice (MCV)
Multilingual LibriSpeech (MLS)
NVIDIA internal multilingual evaluation sets

**Data Collection Method by dataset:**

Human

**Labeling Method by dataset:**

Human

Performance

ASR performance is measured using the Word Error Rate (WER). The tables below report WER (%) on the FLEURS test sets across configurable streaming chunk sizes, in two modes:

**Language Input (LangID):** the target language is provided to the model.
**Auto-detect:** the model automatically detects the spoken language.

**Note:** Japanese, Korean, and Mandarin are evaluated using Character Error Rate (CER) rather than WER, as is standard for these languages.

**Note on text normalization:** WER/CER are computed after text normalization that aligns the reference and hypothesis (e.g., casing, punctuation, numerals, and formatting conventions). Normalization is not perfect across all 40 language-locales, and residual mismatches between normalized text can inflate the reported error rates, so actual transcription quality may be somewhat better than the numbers suggest.
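
For readers unfamiliar with the metrics, WER and CER are both edit distance divided by reference length, counted over words or characters respectively. The sketch below is a plain-Python illustration, not the evaluation code behind the tables: it applies no text normalization, and dropping whitespace before counting characters for CER is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance as a percentage of the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    # Assumption: whitespace is ignored when counting characters.
    return error_rate(list(reference.replace(" ", "")), list(hypothesis.replace(" ", "")))


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(f"{wer('the cat sat on the mat', 'the cat sit on mat'):.2f}")  # 33.33
# Two character substitutions over 5 reference characters.
print(f"{cer('こんにちは', 'こんばんは'):.2f}")  # 40.00
```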

Transcription-ready (19 locales)

Languages are ordered by accuracy (lowest WER first).

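| Language | LangID 80ms | LangID 160ms | LangID 320ms | LangID 560ms | LangID 1.12s | Auto-detect 80ms | Auto-detect 160ms | Auto-detect 320ms | Auto-detect 560ms | Auto-detect 1.12s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Spanish (es-US, es-ES) | 4.87 | 4.64 | 4.39 | 4.26 | 4.11 | 5.04 | 4.82 | 4.48 | 4.34 | 4.13 |
| Italian (it-IT) | 5.23 | 4.85 | 4.83 | 4.41 | 4.25 | 5.28 | 4.89 | 4.84 | 4.47 | 4.32 |
| Portuguese (pt-BR, pt-PT) | 6.29 | 6.10 | 5.81 | 5.65 | 5.48 | 6.41 | 6.19 | 5.82 | 5.57 | 5.47 |
| Hindi (hi-IN) | 8.13 | 7.97 | 7.41 | 7.05 | 6.81 | 11.47 | 10.83 | 9.88 | 9.26 | 8.23 |
| Korean (ko-KR) | 7.59 | 7.70 | 7.27 | 7.18 | 7.12 | 8.31 | 8.18 | 7.81 | 7.49 | 7.30 |
| English (en-US, en-GB) | 9.43 | 8.88 | 8.27 | 7.99 | 7.91 | 9.72 | 9.34 | 8.84 | 8.80 | 8.84 |
| German (de-DE) | 9.81 | 9.21 | 8.83 | 8.42 | 8.31 | 9.90 | 9.37 | 8.87 | 8.58 | 8.22 |
| French (fr-FR, fr-CA) | 10.97 | 10.60 | 9.79 | 9.45 | 9.03 | 11.03 | 10.60 | 9.84 | 9.46 | 9.02 |
| Russian (ru-RU) | 10.84 | 10.73 | 9.87 | 9.60 | 9.17 | 12.47 | 12.09 | 11.01 | 10.57 | 10.03 |
| Turkish (tr-TR) | 12.34 | 12.33 | 12.05 | 11.34 | 11.17 | 12.61 | 12.28 | 11.93 | 11.51 | 11.32 |
| Vietnamese (vi-VN) | 13.41 | 12.87 | 12.29 | 11.78 | 11.18 | 13.59 | 13.02 | 12.40 | 12.02 | 11.22 |
| Dutch (nl-NL) | 14.03 | 13.43 | 12.17 | 11.97 | 11.46 | 14.09 | 13.80 | 12.62 | 12.24 | 11.70 |
| Japanese (ja-JP) | 13.87 | 12.90 | 12.22 | 11.91 | 11.48 | 14.97 | 13.85 | 13.00 | 12.38 | 11.66 |
| Arabic (ar-AR) | 13.17 | 12.65 | 12.55 | 12.13 | 12.03 | 13.47 | 12.85 | 12.67 | 12.18 | 12.06 |
| Ukrainian (uk-UA) | 15.70 | 15.21 | 14.55 | 13.67 | 13.07 | 18.81 | 17.96 | 16.79 | 15.60 | 14.59 |
| Average | 10.38 | 10.00 | 9.49 | 9.12 | 8.84 | 11.14 | 10.67 | 10.05 | 9.63 | 9.21 |

Broad-coverage (13 locales)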

Languages are ordered by accuracy (lowest WER first).

| Language | LangID 80ms | LangID 160ms | LangID 320ms | LangID 560ms | LangID 1.12s | Auto-detect 80ms | Auto-detect 160ms | Auto-detect 320ms | Auto-detect 560ms | Auto-detect 1.12s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Polish (pl-PL) | 19.88 | 18.92 | 17.48 | 16.61 | 15.15 | 22.65 | 21.63 | 20.05 | 18.52 | 16.55 |
| Norwegian Bokmål (nb-NO) | 20.43 | 20.07 | 18.90 | 18.44 | 18.10 | 20.91 | 20.19 | 19.29 | 18.76 | 18.01 |
| Finnish (fi-FI) | 21.19 | 20.57 | 20.05 | 18.94 | 18.34 | 21.61 | 20.88 | 20.40 | 19.36 | 18.72 |
| Mandarin (zh-CN) | 20.56 | 20.22 | 20.03 | 19.51 | 19.28 | 22.45 | 21.07 | 20.59 | 20.40 | 19.87 |
| Czech (cs-CZ) | 24.18 | 23.20 | 22.41 | 21.04 | 20.41 | 25.81 | 25.12 | 23.68 | 22.55 | 21.45 |
| Bulgarian (bg-BG) | 24.50 | 23.58 | 22.80 | 21.70 | 20.53 | 28.28 | 27.22 | 25.54 | 24.05 | 21.84 |
| Slovak (sk-SK) | 25.08 | 24.14 | 23.73 | 22.51 | 21.28 | 27.59 | 26.06 | 25.61 | 24.15 | 22.68 |
| Swedish (sv-SE) | 25.61 | 24.85 | 23.63 | 22.72 | 22.17 | 26.28 | 25.56 | 24.18 | 23.57 | 22.53 |
| Croatian (hr-HR) | 27.92 | 27.09 | 25.79 | 24.92 | 23.97 | 32.13 | 31.20 | 29.65 | 28.95 | 27.46 |
| Romanian (ro-RO) | 31.52 | 30.93 | 29.04 | 27.77 | 25.90 | 34.22 | 33.26 | 30.97 | 29.84 | 26.88 |
| Estonian (et-EE) | 29.95 | 29.66 | 28.59 | 27.37 | 26.35 | 30.58 | 30.09 | 28.72 | 28.03 | 27.19 |
| Danish (da-DK) | 32.62 | 31.51 | 30.00 | 28.92 | 27.49 | 33.15 | 31.77 | 30.22 | 29.33 | 27.81 |
| Hungarian (hu-HU) | 32.70 | 32.03 | 30.92 | 29.72 | 28.68 | 33.40 | 32.39 | 31.49 | 30.20 | 29.18 |
| Average | 25.86 | 25.14 | 24.11 | 23.09 | 22.13 | 27.62 | 26.65 | 25.41 | 24.44 | 23.09 |

Adaptation-ready languages (fine-tune to enable)

These 8 language-locales are recognized by the tokenizer but are not tuned for production transcription out of the box: Greek (el-GR), Hebrew (he-IL), Lithuanian (lt-LT), Slovenian (sl-SI), Latvian (lv-LV), Maltese (mt-MT), Thai (th-TH), and Norwegian Nynorsk (nn-NO). Fine-tuning on in-domain data is recommended to bring them to production quality.

See our blog post on how to fine-tune Nemotron 3.5 ASR to improve these languages, including before/after results.

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
