A developer toolkit providing configurations, wheels, and benchmarks for running large language models with NVFP4 precision on Nvidia Blackwell GPUs using TensorRT-LLM.
I was trying to get a good set of NVFP4 models running to take advantage of the RTX Pro 6000, got past a few hurdles along the way, and ended up with configs + wheels set up (and ran benchmarks while I was at it). Hopefully this helps some folks out. This should work on all the NVIDIA Blackwell cards (5090, 5080, 5070 Ti, etc.) as long as the models fit, like maybe stacking 2x 5070 Tis. Anyhow, here's the repo: [https://github.com/elsung/blackwell-llm-toolkit](https://github.com/elsung/blackwell-llm-toolkit)

# Gotchas & solutions

* **TRT-LLM launch flags**
  * Some obscure settings had to be enabled to make TensorRT-LLM run the newer Mamba-hybrid models. The YAML file is in the repo at `configs/trtllm/nemotron-omni-v3-sm120.yaml`.
* **LMCache**
  * Offloads context to SSD to free up VRAM for the model. The PyPI wheel was crashing on Blackwell (missing sm_120 cubins), so I rebuilt it from source. Works great on my Optane drive. Both the prebuilt wheel and the build script are in the repo.
* **Research docs**
  * AI-generated deep-dives on what's actually different about the latest model families (Nemotron Omni V3, Qwen 3.5/3.6, Gemma 4). Helpful reference. The Qwen 3.5/3.6 one in particular saved me from a nasty trap: they look like renamed Qwen3-VL but are a completely different architecture under the hood.

# Benchmark highlights

Single RTX Pro 6000 96GB, no tensor parallelism. Speed numbers are sustained decode tok/s (median of 3 runs, 500-token completions); a minimal sketch of how these can be measured follows the list.

**Nemotron-3-Nano-Omni V3 (multimodal: image/video/audio + text)**

* NVFP4 quant, tested at 8k context → **270 tok/s**
* Fastest, and handles all modalities. Needs TRT-LLM v1.3.0rc13.
* [https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4](https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4)

**Nemotron-3-Nano (text only)**

* NVFP4 quant, tested at 8k context → **249 tok/s**
* Best for tool-calling agents (10/10 on tools).
* [https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4)

**DeepSeek-V4-Flash**

* IQ2_XXS-XL GGUF, tested at 65k context → **31 tok/s**
* Best for complex reasoning (9/10 intel + 10/10 tools + 13/13 calibration).
* [https://huggingface.co/teamblobfish/DeepSeek-V4-Flash-GGUF](https://huggingface.co/teamblobfish/DeepSeek-V4-Flash-GGUF) (IQ2_XXS-XL)

**MiniMax-M2.7-REAP-172B**

* Q3_K_S GGUF, tested at 196k context → **117 tok/s**
* Best for long conversations.
* [https://huggingface.co/exdysa/MiniMax-M2.7-REAP-172B-A10B-GGUF](https://huggingface.co/exdysa/MiniMax-M2.7-REAP-172B-A10B-GGUF) (Q3_K_S)

**MiniMax-M2.7 W4A16 (with LMCache → Optane SSD)**

* W4A16 AutoRound, tested at 154k context → **20-22 tok/s**
* Long-context runs with W4A16-quality answers, KV cache offloaded to SSD.
* [https://huggingface.co/MJPansa/MiniMax-M2.7-REAP-172B-A10B-AutoRound-W4A16](https://huggingface.co/MJPansa/MiniMax-M2.7-REAP-172B-A10B-AutoRound-W4A16)

**MiniMax-M2.7 W4A16 (short ctx, no LMCache)**

* Same model as above, tested at 64k context → **22-25 tok/s**
* Highest-quality short answers (10/10 intel).
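For anyone wanting to sanity-check the decode/TTFT numbers on their own card, here's a minimal sketch of the measurement idea, not the actual `bench_harness.py` logic. It assumes the model is served behind an OpenAI-compatible endpoint on `localhost:8000` (both trtllm-serve and llama.cpp's server expose one), and it approximates token counts by counting streamed chunks.

```python
# Minimal sketch: sustained decode tok/s + TTFT against an OpenAI-compatible server.
# Endpoint URL, model name, and chunk-as-token counting are assumptions for illustration.
import time
import statistics
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def one_run(prompt: str, max_tokens: int = 500) -> tuple[float, float]:
    """Return (ttft_seconds, decode_tok_per_s) for a single streamed completion."""
    start = time.perf_counter()
    first = None
    chunks = 0
    stream = client.chat.completions.create(
        model="local",  # placeholder; use whatever model name your server reports
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content if chunk.choices else None
        if delta:
            if first is None:
                first = time.perf_counter()  # time to first token
            chunks += 1  # one streamed chunk ~= one token (approximation)
    end = time.perf_counter()
    return first - start, (chunks - 1) / (end - first)

runs = [one_run("Explain KV cache offloading in three paragraphs.") for _ in range(3)]
print("median TTFT  :", round(statistics.median(r[0] for r in runs), 3), "s")
print("median decode:", round(statistics.median(r[1] for r in runs), 1), "tok/s")
```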
*Full table with TTFT, prefill speeds, concurrency numbers, and all the quality eval scores is in bench/results.md in the repo:* [https://github.com/elsung/blackwell-llm-toolkit/blob/main/bench/results.md](https://github.com/elsung/blackwell-llm-toolkit/blob/main/bench/results.md)

# Bench tools used to validate

* `rapid_bench.py`: 41-prompt quality eval (10 intelligence + 10 tool-use + 13 calibration + 3 orchestration + 5 creative writing)
* `bench_harness.py`: sustained decode + TTFT + prefill + concurrency, plus a `--prompt-tokens N` mode used for the 154k long-context MJPansa runs (a rough sketch of that idea is at the end of the post)

*Apache 2.0, PRs welcome, especially benchmark contributions from other Blackwell GPU folks (RTX 5090/5080/5070 Ti) so the comparison fills out across different hardware.*
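Footnote on the `--prompt-tokens N` mode mentioned above: the only non-obvious part of a long-context run is building a prompt of a given token length. Here's a rough sketch of one way to do it with the `transformers` tokenizer API and an arbitrary tokenizer as a stand-in; the actual script may construct prompts differently.

```python
# Rough sketch: build a prompt of ~N tokens by repeating filler text and trimming
# with a tokenizer. Tokenizer choice and helper are illustrative, not bench_harness.py's code.
from transformers import AutoTokenizer

def build_prompt(tokenizer, target_tokens: int) -> str:
    filler = "The quick brown fox jumps over the lazy dog. "
    per_rep = max(len(tokenizer.encode(filler, add_special_tokens=False)), 1)
    text = filler * (target_tokens // per_rep + 1)
    ids = tokenizer.encode(text, add_special_tokens=False)[:target_tokens]
    return tokenizer.decode(ids)

# Example: pad out to a 154k-token prompt (any HF tokenizer works for this illustration).
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
prompt = build_prompt(tok, 154_000)
print(len(tok.encode(prompt, add_special_tokens=False)), "tokens")
```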