@charles_irl: Low-precision floats are weird. I have been building up my intuition by playing with them outside of inference/training…

X AI KOLs Following 06/22/26, 02:08 AM Tools

low-precision-floats block-quantization nvfp4 mxfp4 quantization visualizer llm-inference

Summary

A tweet thread introduces a visualizer for micro-scaling/block quant formats like NVFP4 and MXFP4, explaining how these low-precision floats work and their use in LLM inference to reduce memory bandwidth demands.

Low-precision floats are weird. I have been building up my intuition by playing with them outside of inference/training. @AAAzzam and I cooked up this visualizer for micro-scaling/block quant formats like NVFP4, MXFP4, and friends. Try it: https://modal.com/llm-almanac/block-quants/nvidia-fp4…

Original Article

Cached at: 06/22/26, 09:33 AM

What is NVFP4? | LLM Engineer’s Almanac

Source: https://modal.com/llm-almanac/block-quants/nvidia-fp4

What are blockwise quantization/micro-scaling float formats?

Standard formats like FP16 encode each element independently, with one exponent and significand per value. Micro-scaling formats like NVFP4 trade that independence for compression: every 16 consecutive elements share a single scale factor (stored as an E4M3 value), and each element stores only its relative magnitude within the block in a low-precision E2M1 value.
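A minimal sketch of the block layout described above, in plain Python. It assumes a few simplifications that are not from the article: the shared scale is kept as an ordinary Python float rather than rounded to E4M3, ties are broken toward the smaller level rather than with hardware rounding rules, and the helper names (`quantize_block`, `dequantize_block`, `quantize_tensor`) are hypothetical.

```python
# Non-negative magnitudes representable in E2M1 (sign is handled separately).
E2M1_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # elements that share one scale factor


def quantize_block(block):
    """Return (scale, codes); each code is a signed E2M1 value."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest E2M1 level (6.0).
    # Assumption: a real format would round this scale to E4M3.
    scale = amax / E2M1_LEVELS[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        # Nearest representable magnitude; ties go to the smaller level here.
        level = min(E2M1_LEVELS, key=lambda v: abs(v - target))
        codes.append(level if x >= 0 else -level)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from the shared scale and the codes."""
    return [scale * c for c in codes]


def quantize_tensor(values):
    """Split a flat list into blocks and quantize each one independently."""
    blocks = [values[i:i + BLOCK_SIZE] for i in range(0, len(values), BLOCK_SIZE)]
    return [quantize_block(b) for b in blocks]
```

Because the scale is chosen per block, the largest value in each block is reconstructed almost exactly, while every other value is rounded onto the grid defined by that one scale.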

The banding in the image above is the block structure made visible. A block containing both a very bright and a very dark pixel must scale to fit the bright one, collapsing the darker values into only a handful of distinct levels. FP4 has just 8 non-negative representable values (0, 0.5, 1, 1.5, 2, 3, 4, 6 × scale), so FP4 blocks “posterize” to at most 8 colors. FP6 has up to 32 non-negative values and FP8 up to 127, so degradation at those precisions is subtler.
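The collapse of dark values can be reproduced with a small numeric experiment. This is an illustrative sketch, not the visualizer's code: the pixel values are made up, the scale is an unrounded Python float, and `distinct_levels` is a hypothetical helper.

```python
# Non-negative magnitudes representable in E2M1.
E2M1_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def distinct_levels(block, count_first=15):
    """Quantize one block and count the distinct codes among its first values."""
    scale = max(block) / E2M1_LEVELS[-1]  # largest value maps to level 6.0
    codes = [min(E2M1_LEVELS, key=lambda v: abs(v - x / scale)) for x in block]
    return len(set(codes[:count_first]))


# Fifteen dark pixel intensities: 0.01, 0.02, ..., 0.15 (made-up values).
dark = [0.01 * k for k in range(1, 16)]

# Block A: dark pixels only (padded with one more dark pixel to 16 elements).
dark_only = distinct_levels(dark + [0.15])

# Block B: the same dark pixels sharing a block with one bright pixel.
with_bright = distinct_levels(dark + [1.0])

print(dark_only, with_bright)
```

On their own, the fifteen dark values land on 7 distinct levels. Once a single bright value sets the block's scale, the same fifteen values land on only 3.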

These quantized formats are used in LLM inference to reduce demand on memory bandwidth, especially during decode, and to take advantage of higher arithmetic bandwidth, especially during prefill. They are generally destined for use in the Tensor Cores, where the vast majority of that bandwidth lies in contemporary GPUs.
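A rough back-of-the-envelope calculation shows where the memory savings come from. It counts only the 4-bit elements and the one 8-bit shared scale per 16-element block described earlier; any additional per-tensor metadata is ignored as an assumption, and `bits_per_element` is a hypothetical helper.

```python
def bits_per_element(element_bits, scale_bits, block_size):
    """Average storage per element once the shared scale is amortized."""
    return element_bits + scale_bits / block_size


fp16_bits = 16.0
# 4-bit E2M1 elements plus one 8-bit E4M3 scale per 16-element block.
block_fp4_bits = bits_per_element(element_bits=4, scale_bits=8, block_size=16)

# How many times fewer bits must be moved per element, compared with FP16.
ratio = fp16_bits / block_fp4_bits

print(block_fp4_bits, round(ratio, 2))
```

Under these assumptions each element costs 4.5 bits on average, so roughly 3.6 times fewer bits are read per weight than with FP16.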

Explore how individual floats in these formats are encoded on the Quant Formats page. The image-as-tensor visualization technique is inspired by quant-jaunt.

Similar Articles

@charles_irl: another page for the @modal LLMEng Almanac: an explorer for low-precision floats, from bf16 to fp4 https://modal.com/ll…

@charles_irl: This block quant visualizer is another page in our LLM Engineer's Almanac -- a one-stop shop for engineers looking to o…

@jino_rohit: before you start learning quantization for llms, you need to understand how different number formats are represented in…

@charles_irl: my gut says that to solve float numerics problems from nondeterminism x nonassociativity, we need to think bigger than …

@SpaceTimeViking: I have one version that maintain BF16 Attention layers, and another mixed precision quant with NVFP4 weights and FP8 At…
