A developer toolkit providing configurations, wheels, and benchmarks for running large language models with NVFP4 precision on Nvidia Blackwell GPUs using TensorRT-LLM.
I was trying to get a good set of NVFP4 models running to take advantage of the RTX Pro 6000, got past a few hurdles along the way, and ended up with configs + wheels set up (and ran benchmarks while I was at it). Hopefully this helps some folks out. This should work on all the NVIDIA Blackwell cards (5090, 5080, 5070 Ti, etc.) as long as the models fit, like maybe stacking 2x 5070 Tis. Anyhow, here's the repo: [https://github.com/elsung/blackwell-llm-toolkit](https://github.com/elsung/blackwell-llm-toolkit)

# Gotchas & solutions

* **TRT-LLM launch flags**
  * Some obscure settings had to be enabled to make TensorRT-LLM run the newer Mamba-hybrid models. The YAML file is in the repo at `configs/trtllm/nemotron-omni-v3-sm120.yaml`.
* **LMCache**
  * Offloads context to SSD to free up VRAM for the model. The PyPI wheel was crashing on Blackwell (missing sm_120 cubins), so I rebuilt it from source. Works great on my Optane drive. Both the prebuilt wheel and the build script are in the repo.
* **Research docs**
  * AI-generated deep-dives on what's actually different about the latest model families (Nemotron Omni V3, Qwen 3.5/3.6, Gemma 4). Helpful reference. The Qwen 3.5/3.6 one in particular saved me from a nasty trap: they look like renamed Qwen3-VL but are a completely different architecture under the hood.

# Benchmark highlights

Single RTX Pro 6000 96GB, no tensor parallelism. Speed numbers are sustained decode tok/s (median of 3 runs, 500-token completions); a minimal sketch of how these can be measured follows the list.

**Nemotron-3-Nano-Omni V3 (multimodal: image/video/audio + text)**

* NVFP4 quant, tested at 8k context → **270 tok/s**
* Fastest, and handles all modalities. Needs TRT-LLM v1.3.0rc13.
* [https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4](https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4)

**Nemotron-3-Nano (text only)**

* NVFP4 quant, tested at 8k context → **249 tok/s**
* Best for tool-calling agents (10/10 on tools).
* [https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4)

**DeepSeek-V4-Flash**

* IQ2_XXS-XL GGUF, tested at 65k context → **31 tok/s**
* Best for complex reasoning (9/10 intel + 10/10 tools + 13/13 calibration).
* [https://huggingface.co/teamblobfish/DeepSeek-V4-Flash-GGUF](https://huggingface.co/teamblobfish/DeepSeek-V4-Flash-GGUF) (IQ2_XXS-XL)

**MiniMax-M2.7-REAP-172B**

* Q3_K_S GGUF, tested at 196k context → **117 tok/s**
* Best for long conversations.
* [https://huggingface.co/exdysa/MiniMax-M2.7-REAP-172B-A10B-GGUF](https://huggingface.co/exdysa/MiniMax-M2.7-REAP-172B-A10B-GGUF) (Q3_K_S)

**MiniMax-M2.7 W4A16 (with LMCache → Optane SSD)**

* W4A16 AutoRound, tested at 154k context → **20-22 tok/s**
* Long-context runs with W4A16-quality answers, KV cache offloaded to SSD.
* [https://huggingface.co/MJPansa/MiniMax-M2.7-REAP-172B-A10B-AutoRound-W4A16](https://huggingface.co/MJPansa/MiniMax-M2.7-REAP-172B-A10B-AutoRound-W4A16)

**MiniMax-M2.7 W4A16 (short ctx, no LMCache)**

* Same model as above, tested at 64k context → **22-25 tok/s**
* Highest-quality short answers (10/10 intel).
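For anyone wanting to sanity-check the decode/TTFT numbers on their own card, here's a minimal sketch of the measurement idea, not the actual `bench_harness.py` logic. It assumes the model is served behind an OpenAI-compatible endpoint on `localhost:8000` (both trtllm-serve and llama.cpp's server expose one), and it approximates token counts by counting streamed chunks.

```python
# Minimal sketch: sustained decode tok/s + TTFT against an OpenAI-compatible server.
# Endpoint URL, model name, and chunk-as-token counting are assumptions for illustration.
import time
import statistics
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def one_run(prompt: str, max_tokens: int = 500) -> tuple[float, float]:
    """Return (ttft_seconds, decode_tok_per_s) for a single streamed completion."""
    start = time.perf_counter()
    first = None
    chunks = 0
    stream = client.chat.completions.create(
        model="local",  # placeholder; use whatever model name your server reports
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content if chunk.choices else None
        if delta:
            if first is None:
                first = time.perf_counter()  # time to first token
            chunks += 1  # one streamed chunk ~= one token (approximation)
    end = time.perf_counter()
    return first - start, (chunks - 1) / (end - first)

runs = [one_run("Explain KV cache offloading in three paragraphs.") for _ in range(3)]
print("median TTFT  :", round(statistics.median(r[0] for r in runs), 3), "s")
print("median decode:", round(statistics.median(r[1] for r in runs), 1), "tok/s")
```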
*Full table with TTFT, prefill speeds, concurrency numbers, and all the quality eval scores is in bench/results.md in the repo:* [https://github.com/elsung/blackwell-llm-toolkit/blob/main/bench/results.md](https://github.com/elsung/blackwell-llm-toolkit/blob/main/bench/results.md)

# Bench tools used to validate

* `rapid_bench.py`: 41-prompt quality eval (10 intelligence + 10 tool-use + 13 calibration + 3 orchestration + 5 creative writing)
* `bench_harness.py`: sustained decode + TTFT + prefill + concurrency, plus a `--prompt-tokens N` mode used for the 154k long-context MJPansa runs (a rough sketch of that idea is at the end of the post)

*Apache 2.0, PRs welcome, especially benchmark contributions from other Blackwell GPU folks (RTX 5090/5080/5070 Ti) so the comparison fills out across different hardware.*
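Footnote on the `--prompt-tokens N` mode mentioned above: the only non-obvious part of a long-context run is building a prompt of a given token length. Here's a rough sketch of one way to do it with the `transformers` tokenizer API and an arbitrary tokenizer as a stand-in; the actual script may construct prompts differently.

```python
# Rough sketch: build a prompt of ~N tokens by repeating filler text and trimming
# with a tokenizer. Tokenizer choice and helper are illustrative, not bench_harness.py's code.
from transformers import AutoTokenizer

def build_prompt(tokenizer, target_tokens: int) -> str:
    filler = "The quick brown fox jumps over the lazy dog. "
    per_rep = max(len(tokenizer.encode(filler, add_special_tokens=False)), 1)
    text = filler * (target_tokens // per_rep + 1)
    ids = tokenizer.encode(text, add_special_tokens=False)[:target_tokens]
    return tokenizer.decode(ids)

# Example: pad out to a 154k-token prompt (any HF tokenizer works for this illustration).
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
prompt = build_prompt(tok, 154_000)
print(len(tok.encode(prompt, add_special_tokens=False)), "tokens")
```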