@hotschmoe: After reading this post, I decided to get nvfp4 running on my Intel arc b70s just to see, after 12 hours it's running a…
Summary
A user got NVFP4 quantization running on Intel Arc B70 GPUs and reports that it runs almost twice as fast as, and more accurately than, their best int4 AutoRound configuration, challenging the assumption that the format only works on NVIDIA hardware.
Cached at: 07/04/26, 06:54 PM
After reading this post, I decided to get nvfp4 running on my Intel arc b70s just to see. After 12 hours it’s running almost twice as fast, while being more accurate, than my current best int4 autoround config
Eric Hartford (@QuixiAI): “none of this was supposed to work”
Pfuh! Power to the people!
No more “gguf/mlx is for mac, bf16 is for ampere, nvfp4 is for nvidia” nonsense.
Similar Articles
@RayFernando1337: “The selected runtime uses NVFP4 weights for maximum performance. From the original FP8 weights, we performed an in-hou…
Discusses using NVFP4 4-bit floating point weights for maximum performance, achieved via in-house quantization from FP8 using NVIDIA ModelOpt, highlighting the data format's dual scale factors for high dynamic range.
@TeksEdge: Solved! Qwen3.6-27B-FP8 is now running on Intel Arc Pro B70! LocalMaxxing shows a working 4× Arc Pro B70 32GB run at ~5…
Qwen3.6-27B-FP8 model is now running on Intel Arc Pro B70 GPUs at ~50 tok/s with a vLLM bug fix, marking a significant milestone for Intel GPU local AI inference.
@witcheer: everyone says NVFP4 makes blackwell cards "faster." I benchmarked Qwen3.6-27B three ways on my 5090: >NVFP4 >plain Q4_K…
A benchmark of NVFP4 on an RTX 5090 with Qwen3.6-27B shows prefill speed gains of 32-42% over equal-bit Q4_K_M and 52-68% over Q6_K, but decode gains are modest (+9% vs Q4) as decode is memory-bandwidth bound. The quality loss compared to Q6 is minimal (-0.8 average), making NVFP4 a good choice for local inference.
@AaronWeiHuang: Our new blog looks at how FP4 is moving beyond compression into a practical primitive for training and inference across…
NVIDIA's blog details how FP4, with the NVFP4 format and Blackwell hardware, has evolved from a compression trick to a practical primitive for training and inference across LLMs and diffusion models, achieving near 16-bit accuracy.
NVFP4 kv cache quantization on sm120 will make 32GB VRAM systems very capable
NVFP4 KV cache quantization on sm120 significantly improves memory efficiency for large language models, enabling 32GB VRAM systems to achieve ~60 tok/sec inference at 196k context size with Qwen3.6-27B.