@charles_irl: Low-precision floats are weird. I have been building up my intuition by playing with them outside of inference/training…

Summary

A tweet thread introduces a visualizer for micro-scaling/block quant formats like NVFP4 and MXFP4, explaining how these low-precision floats work and their use in LLM inference to reduce memory bandwidth demands.

Low-precision floats are weird. I have been building up my intuition by playing with them outside of inference/training. @AAAzzam and I cooked up this visualizer for micro-scaling/block quant formats like NVFP4, MXFP4, and friends. Try it: https://modal.com/llm-almanac/block-quants/nvidia-fp4…

Cached at: 06/22/26, 09:33 AM

What is NVFP4? | LLM Engineer’s Almanac

Source: https://modal.com/llm-almanac/block-quants/nvidia-fp4

What are blockwise quantization/micro-scaling float formats?

Standard formats like FP16 encode each element independently, with one exponent and significand per value. Micro-scaling formats like NVFP4 trade that independence for compression: every 16 consecutive elements share a single scale factor (stored as an E4M3 value), and each element stores only its relative magnitude within the block in a low-precision E2M1 value.
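The shared-scale idea can be sketched in a few lines of Python. This is a minimal illustration, not the visualizer's or any library's implementation: the names `quantize_block`, `dequantize_block`, and `quantize_tensor` are hypothetical, the scale is kept as an ordinary Python float instead of being rounded to E4M3, and ties are broken toward the smaller level rather than by round-to-nearest-even.

```python
# Non-negative magnitudes representable in E2M1 (FP4), before scaling.
E2M1_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # number of consecutive elements sharing one scale factor


def quantize_block(block):
    """Return (scale, codes), where each code is a signed E2M1 level."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Map the largest magnitude in the block onto the largest FP4 level (6).
    # Simplification (assumption): a real format would also round this
    # scale to an E4M3 value before storing it.
    scale = amax / E2M1_LEVELS[-1]
    codes = []
    for x in block:
        target = abs(x) / scale
        # Nearest representable level; ties go to the smaller level here.
        level = min(E2M1_LEVELS, key=lambda v: abs(v - target))
        codes.append(level if x >= 0 else -level)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from the shared scale and codes."""
    return [scale * c for c in codes]


def quantize_tensor(values):
    """Quantize then dequantize a flat list, one block at a time."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        scale, codes = quantize_block(values[start:start + BLOCK_SIZE])
        out.extend(dequantize_block(scale, codes))
    return out
```

For a block holding one large value and fifteen small ones, `quantize_tensor([12.0] + [0.1] * 15)` sets the scale to 2.0, keeps the 12.0 exactly, and rounds every 0.1 down to 0.0, because the nearest scaled level above zero is 1.0.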

The banding in the image above is the block structure made visible. A block containing both a very bright and a very dark pixel must scale to fit the bright one, collapsing the darker values into only a handful of distinct levels. FP4 has just 8 non-negative representable values (0, 0.5, 1, 1.5, 2, 3, 4, 6 × scale), so FP4 blocks “posterize” to at most 8 colors. FP6 has up to 32 non-negative values and FP8 up to 127, so degradation at those precisions is subtler.
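The eight FP4 magnitudes listed above fall out of the E2M1 bit layout: one sign bit, two exponent bits, and one mantissa bit, with an exponent bias of 1. The decoder below is a small sketch for checking that list by hand; `decode_e2m1` is a hypothetical helper name, not part of any library.

```python
def decode_e2m1(code):
    """Decode a 4-bit E2M1 code (0..15) into its unscaled float value."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 1
    if exponent == 0:
        # Subnormal range: no implicit leading 1, so the value is 0 or 0.5.
        magnitude = 0.5 * mantissa
    else:
        # Normal range: implicit leading 1, exponent bias of 1.
        magnitude = (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)
    return sign * magnitude


# Codes 0..7 have the sign bit clear, so they cover the non-negative values.
non_negative = [decode_e2m1(code) for code in range(8)]
print(non_negative)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

The spacing between neighbors doubles at each exponent step (0.5, then 1, then 2), which is why values near the top of a block's range are reproduced far more coarsely in absolute terms than values near zero.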

These quantized formats are used in LLM inference to reduce demand on memory bandwidth, especially during decode, and to take advantage of higher arithmetic bandwidth, especially during prefill. They are generally destined for use in the Tensor Cores, where the vast majority of that bandwidth lies in contemporary GPUs.
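To put a rough number on the memory saving, the storage cost per element can be worked out from the block layout. The sketch below assumes the layout described earlier (4-bit elements, one 8-bit scale per 16 elements) and ignores any additional per-tensor metadata; `bits_per_element` is a hypothetical helper.

```python
def bits_per_element(element_bits, scale_bits, block_size):
    """Average storage per element when one scale is shared per block."""
    return element_bits + scale_bits / block_size


# 4-bit elements with one 8-bit scale per 16-element block.
block_fp4 = bits_per_element(element_bits=4, scale_bits=8, block_size=16)
fp16 = 16.0  # FP16 stores every element in 16 bits, with no shared scale

print(block_fp4)         # 4.5
print(fp16 / block_fp4)  # roughly 3.56x fewer bits to move per element
```

Under these assumptions the shared scale adds only half a bit per element, so weights stored this way need a little over a quarter of the bytes that FP16 weights do.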

Explore how individual floats in these formats are encoded on the Quant Formats page. The image-as-tensor visualization technique is inspired by quant-jaunt.
