@zcbenz: nvfp4 vs mxfp4 is not just different choices of block size and scale format, nvfp4 uses an additional tensor-wise scale…
Summary
A technical comparison between nvfp4 and mxfp4 formats, highlighting that nvfp4 uses an additional tensor-wise scale factor to overcome fp4's range limit, allowing more precision in block-wise scale factors.
Cached Full Text
Cached at: 06/17/26, 04:02 PM
nvfp4 vs mxfp4 is not just a different choice of block size and scale format: nvfp4 uses an additional tensor-wise scale factor to overcome the range limit of fp4, and thus can use more precision for its block-wise scale factors. https://t.co/9d1hvNBWhO
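The two-level scaling described in the tweet can be sketched in NumPy. This is a rough simulation, not a reference implementation. It assumes the publicly documented layouts: mxfp4 with 32-element blocks and a power-of-two (E8M0) block scale, and nvfp4 with 16-element blocks, an FP8 E4M3 block scale and one FP32 tensor-wise scale. The function names, the rule used to pick each scale, and the software E4M3 rounding are all illustrative assumptions; real kernels may choose scales differently.

```python
import numpy as np

# Magnitudes representable by an FP4 E2M1 element (sign handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_MAX = 6.0
E4M3_MAX = 448.0  # largest finite FP8 E4M3 value


def round_e2m1(x):
    """Round each value to the nearest E2M1 magnitude, keeping the sign."""
    mag = np.minimum(np.abs(x), E2M1_MAX)
    idx = np.abs(mag[..., None] - E2M1_GRID).argmin(axis=-1)
    return np.sign(x) * E2M1_GRID[idx]


def round_e4m3(x):
    """Simulate rounding non-negative values to FP8 E4M3 (3 mantissa bits)."""
    x = np.clip(np.asarray(x, dtype=np.float64), 0.0, E4M3_MAX)
    # Exponent floored at -6 so that tiny values use the subnormal step 2**-9.
    e = np.floor(np.log2(np.maximum(x, 2.0 ** -6)))
    step = 2.0 ** (e - 3)
    return np.round(x / step) * step


def fake_quant_mxfp4(x, block=32):
    """Quantize and dequantize with one power-of-two scale per block."""
    blocks = x.reshape(-1, block)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # Assumption: smallest power of two that maps the block max into [0, 6].
    scale = 2.0 ** np.ceil(np.log2(np.maximum(amax, 1e-30) / E2M1_MAX))
    return (round_e2m1(blocks / scale) * scale).reshape(x.shape)


def fake_quant_nvfp4(x, block=16):
    """Quantize and dequantize with an E4M3 block scale and a tensor scale."""
    blocks = x.reshape(-1, block)
    tensor_amax = np.abs(x).max()
    # Assumption: the tensor-wise scale maps the largest block scale to the
    # E4M3 maximum, so the narrow FP8 range is spent on precision.
    tensor_scale = tensor_amax / (E2M1_MAX * E4M3_MAX) if tensor_amax > 0 else 1.0
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    block_scale = round_e4m3(amax / E2M1_MAX / tensor_scale)
    block_scale = np.where(block_scale > 0, block_scale, 1.0)  # avoid 0-division
    scale = block_scale * tensor_scale
    return (round_e2m1(blocks / scale) * scale).reshape(x.shape)


rng = np.random.default_rng(0)
w = rng.standard_normal(4096) * 0.02  # hypothetical weight tensor
for name, fn in [("mxfp4", fake_quant_mxfp4), ("nvfp4", fake_quant_nvfp4)]:
    rmse = np.sqrt(np.mean((fn(w) - w) ** 2))
    print(f"{name}: rmse = {rmse:.6f}")
```

In this sketch the mxfp4 block scale can only move in factors of two. The nvfp4 block scale has three mantissa bits, but E4M3 covers far less range than E8M0, so the tensor-wise scale is what places the block scales inside the representable window.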
Similar Articles
NVFP4 + MTP - voilà on llama.cpp
NVFP4 quantization and Multi-Token Prediction support have been added to llama.cpp in release b9297.
NVFP4 kv cache quantization on sm120 will make 32GB VRAM systems very capable
NVFP4 KV cache quantization on sm120 significantly improves memory efficiency for large language models, enabling 32GB VRAM systems to achieve ~60 tok/sec inference at 196k context size with Qwen3.6-27B.
@charles_irl: Low-precision floats are weird. I have been building up my intuition by playing with them outside of inference/training…
A tweet thread introduces a visualizer for micro-scaling/block quant formats like NVFP4 and MXFP4, explaining how these low-precision floats work and their use in LLM inference to reduce memory bandwidth demands.
@SpaceTimeViking: I have one version that maintain BF16 Attention layers, and another mixed precision quant with NVFP4 weights and FP8 At…
A mixed-precision quantization of Google's Gemma-4-12B-it model using NVFP4 for MLP weights and FP8 for attention layers, achieving 25% smaller footprint and faster throughput while maintaining quality.
@witcheer: everyone says NVFP4 makes blackwell cards "faster." I benchmarked Qwen3.6-27B three ways on my 5090: >NVFP4 >plain Q4_K…
A benchmark of NVFP4 on an RTX 5090 with Qwen3.6-27B shows prefill speed gains of 32-42% over equal-bit Q4_K_M and 52-68% over Q6_K, but decode gains are modest (+9% vs Q4) as decode is memory-bandwidth bound. The quality loss compared to Q6 is minimal (-0.8 average), making NVFP4 a good choice for local inference.