Comparison of four large language models (up to roughly 120B parameters) on deep-context performance on Strix Halo hardware. Nemotron Super holds its prompt processing speed at deep context better than the GPT-OSS and Qwen models.
The comparison was done on a Strix Halo with 128 GB shared memory, Ubuntu 26.04, Lemonade Server, Vulkan backend.

I often run larger models like gpt-oss 120B or Qwen, but their performance seems to degrade quickly once in deep waters... ah.. deep context. The most important quality to me is prompt processing (PP): we are talking about existing code, and the context fills up quickly when analyzing it for a change request or bugfix. For work on existing code, I think 95-99% of the total time is PP and 1-5% is token generation (TG).

I tried Nemotron Super (120B) recently and liked the quality. Speed was decent, but to my surprise it felt like it handled deeper context (\~100k) way better than what I am used to from similar models. To put that subjective impression to the test, I ran llama-bench with the three competitors in the 120B class (GPT-OSS, Qwen 3.5, and Nemotron) and, mostly for comparison, the popular smaller/weaker/faster Qwen 3.6 35B model. As a subjective baseline I set 100 TPS PP as "usable" and stopped the benchmark once a model fell below it.

I should also mention that the max context varies by model: GPT-OSS can handle \~128K, Qwen 3.5/3.6 can handle \~256K, and Nemotron up to 400k tokens of context depth.

My main conclusions are:

- My feeling was right: Nemotron Super handles deep context exceptionally well compared to the others.
- The "speed king" GPT-OSS 120B loses speed so fast that Nemotron Super surpasses it in PP at 32K depth.
- Qwen 3.5 122B A10B is surpassed almost immediately, at 16K depth.
- Surprisingly, even Qwen 3.6 35B A3B's PP is only on par with Nemotron's at that model's max context of \~256k.
- For token generation speed (IMO not as important), Nemotron Super starts out usable (IMO >\~10 TG TPS) but not yet really "fun" to use (IMO >\~20 TG TPS). It degrades slowly to "barely usable" by that definition at \~400k context depth, which is still impressive if you ask me. The most direct competitor, Qwen 3.5 122B A10B, is about as slow at 128k context. Note that I didn't enable MTP, though.
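As a rough sanity check on the 95-99% PP estimate above, here is a minimal Python sketch of the arithmetic. The token counts (100k prompt tokens, 500 generated tokens) are assumptions picked for illustration, not measurements. The speeds are simply the "usable" floors of 100 TPS PP and 10 TPS TG, and the sketch assumes a cold prompt with no cache reuse. `time_split` is a hypothetical helper name.

```python
def time_split(prompt_tokens, gen_tokens, pp_tps, tg_tps):
    """Return (pp_seconds, tg_seconds, pp_share) for a single request.

    Assumes the whole prompt is processed from scratch (no prompt cache)
    and that both speeds stay constant over the request, which is a
    simplification: real speeds drop as context depth grows.
    """
    pp_seconds = prompt_tokens / pp_tps
    tg_seconds = gen_tokens / tg_tps
    total = pp_seconds + tg_seconds
    return pp_seconds, tg_seconds, pp_seconds / total


# Illustrative inputs only: 100k tokens of existing code as prompt,
# a 500-token answer, at the "usable" floor speeds (100 PP, 10 TG).
pp_s, tg_s, share = time_split(100_000, 500, 100.0, 10.0)
print(f"PP {pp_s:.0f} s, TG {tg_s:.0f} s, PP share {share:.1%}")
# prints: PP 1000 s, TG 50 s, PP share 95.2%
```

Under these assumptions PP already takes about 95% of the wall time, and the share only grows with a deeper prompt or a shorter answer, which is why PP at depth matters more to me than TG.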
If you need high TG, Nemotron is not the best model for context below 128k; if you mainly need PP and a larger model, Nemotron seems a reasonable choice. The fallback if you don't need that large a model is obviously the smaller Qwen 3.6 variants like 35B. Has anyone seen different results? Maybe with ROCm? Any tweaks I didn't consider?
NVIDIA releases Nemotron 3 Ultra, a massive 550 billion parameter mixture-of-experts model with 55B active parameters and a 1 million token context window.
NVIDIA announces Nemotron 3 Super (120B) and Nemotron 3 Ultra (~500B) models, pretrained on 25T tokens using NVFP4 precision, emphasizing accelerated computing and efficiency improvements.
Community release of REAP-pruned Nemotron-3-Super-120B to 64B, GRPO fine-tuned on math, quantized to AWQ/FP8, hitting 90%+ on AIME 2026 and runnable on a single H100/RTX PRO 6000.
Nemotron 3 Ultra is a 550B parameter hybrid Mamba-Attention mixture-of-experts language model, pre-trained on 20T tokens, extended to 1M context, and post-trained with SFT, RL, and MOPD. It achieves up to 6x higher inference throughput than state-of-the-art LLMs with comparable accuracy, and is open-sourced.