@zcbenz: nvfp4 vs mxfp4 is not just different choices of block size and scale format, nvfp4 uses an additional tensor-wise scale…

X AI KOLs Timeline 06/17/26, 07:17 AM News

Summary

A technical comparison between nvfp4 and mxfp4 formats, highlighting that nvfp4 uses an additional tensor-wise scale factor to overcome fp4's range limit, allowing more precision in block-wise scale factors.

nvfp4 vs mxfp4 is not just different choices of block size and scale format: nvfp4 uses an additional tensor-wise scale factor to overcome the range limit of fp4, and thus can use more precision for block-wise scale factors. https://t.co/9d1hvNBWhO
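
The two-level scaling described above can be sketched in plain Python. This is an illustrative simulation, not code from the post or from any library. The format parameters are not stated in the post and are taken from the published format descriptions: MXFP4 uses blocks of 32 with a power-of-two (E8M0) scale, and NVFP4 uses blocks of 16 with an FP8 E4M3 block scale plus one FP32 tensor-wise scale. The scale-selection rules and all function names (`mxfp4_roundtrip`, `nvfp4_roundtrip`, and so on) are assumptions made for illustration.

```python
import math

# Magnitudes representable by an FP4 (E2M1) element; the sign is separate.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = 6.0
E4M3_MAX = 448.0  # largest finite FP8 E4M3 value


def round_to_fp4(x):
    """Round to the nearest E2M1 value, saturating at +/-6.
    Ties go to the smaller magnitude here (a simplification)."""
    mag = min(abs(x), FP4_MAX)
    nearest = min(FP4_GRID, key=lambda g: abs(g - mag))
    return math.copysign(nearest, x)


def round_to_e4m3(x):
    """Simplified rounding of a non-negative scale to FP8 E4M3
    (3 mantissa bits, minimum normal exponent -6, saturating at 448)."""
    if x <= 0.0:
        return 0.0
    x = min(x, E4M3_MAX)
    exponent = max(math.frexp(x)[1] - 1, -6)  # floor(log2(x)), clamped
    step = 2.0 ** (exponent - 3)              # spacing of E4M3 values here
    return round(x / step) * step


def quantize_blocks(values, block_size, block_scale_fn):
    """Quantize each block to FP4 with its own scale, then dequantize."""
    out = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        scale = block_scale_fn(max(abs(v) for v in block))
        if scale == 0.0:
            out.extend(0.0 for _ in block)  # block underflows to zero
        else:
            out.extend(round_to_fp4(v / scale) * scale for v in block)
    return out


def mxfp4_roundtrip(values):
    """MXFP4-style: block of 32, scale restricted to a power of two."""
    def scale_fn(amax):
        if amax == 0.0:
            return 0.0
        # Assumed rule: floor(log2(amax)) minus 2, the exponent of FP4's max.
        exponent = math.frexp(amax)[1] - 1 - 2
        return 2.0 ** max(-127, min(127, exponent))
    return quantize_blocks(values, 32, scale_fn)


def nvfp4_roundtrip(values):
    """NVFP4-style: block of 16, E4M3 block scale, FP32 tensor-wise scale."""
    tensor_amax = max(abs(v) for v in values)
    if tensor_amax == 0.0:
        return [0.0] * len(values)
    # Assumed rule: the tensor scale maps the largest block scale to 448,
    # so every block scale fits in E4M3's limited range.
    tensor_scale = tensor_amax / (FP4_MAX * E4M3_MAX)

    def scale_fn(amax):
        stored = round_to_e4m3(amax / FP4_MAX / tensor_scale)
        return stored * tensor_scale
    return quantize_blocks(values, 16, scale_fn)
```

In this sketch the MXFP4 block scale can only be a power of two, while the NVFP4 block scale keeps three mantissa bits because the tensor-wise factor absorbs the overall magnitude. The storage cost differs accordingly: one 8-bit scale per 32 elements (4.25 bits per element) versus one 8-bit scale per 16 elements (4.5 bits per element, plus a single FP32 value per tensor).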

Cached at: 06/17/26, 04:02 PM

Similar Articles

NVFP4 + MTP - voilà on llama.cpp

Reddit r/LocalLLaMA

NVFP4 quantization and Multi-Token Prediction support have been added to llama.cpp in release b9297.

NVFP4 kv cache quantization on sm120 will make 32GB VRAM systems very capable

Reddit r/LocalLLaMA

NVFP4 KV cache quantization on sm120 significantly improves memory efficiency for large language models, enabling 32GB VRAM systems to achieve ~60 tok/sec inference at 196k context size with Qwen3.6-27B.

@charles_irl: Low-precision floats are weird. I have been building up my intuition by playing with them outside of inference/training…

X AI KOLs Following

A tweet thread introduces a visualizer for micro-scaling/block quant formats like NVFP4 and MXFP4, explaining how these low-precision floats work and their use in LLM inference to reduce memory bandwidth demands.

@SpaceTimeViking: I have one version that maintain BF16 Attention layers, and another mixed precision quant with NVFP4 weights and FP8 At…

X AI KOLs Following

A mixed-precision quantization of Google's Gemma-4-12B-it model using NVFP4 for MLP weights and FP8 for attention layers, achieving a 25% smaller footprint and faster throughput while maintaining quality.

@witcheer: everyone says NVFP4 makes blackwell cards "faster." I benchmarked Qwen3.6-27B three ways on my 5090: >NVFP4 >plain Q4_K…

X AI KOLs Timeline

A benchmark of NVFP4 on an RTX 5090 with Qwen3.6-27B shows prefill speed gains of 32-42% over equal-bit Q4_K_M and 52-68% over Q6_K, but decode gains are modest (+9% vs Q4) as decode is memory-bandwidth bound. The quality loss compared to Q6 is minimal (-0.8 average), making NVFP4 a good choice for local inference.