@charles_irl: Low-precision floats are weird. I have been building up my intuition by playing with them outside of inference/training…
Summary
A tweet thread introduces a visualizer for micro-scaling/block quant formats like NVFP4 and MXFP4, explaining how these low-precision floats work and their use in LLM inference to reduce memory bandwidth demands.
Cached at: 06/22/26, 09:33 AM
Low-precision floats are weird. I have been building up my intuition by playing with them outside of inference/training. @AAAzzam and I cooked up this visualizer for micro-scaling/block quant formats like NVFP4, MXFP4, and friends. Try it: https://modal.com/llm-almanac/block-quants/nvidia-fp4…
What is NVFP4? | LLM Engineer’s Almanac
Source: https://modal.com/llm-almanac/block-quants/nvidia-fp4
What are blockwise quantization/micro-scaling float formats?
Standard formats like FP16 encode each element independently, with one exponent and significand per value. Micro-scaling formats like NVIDIA's NVFP4 trade that independence for compression: every 16 consecutive elements share a single scale factor (stored as an E4M3 value), and each element stores only its relative magnitude within the block in a low-precision E2M1 value. (OCP MXFP4 follows the same idea but uses 32-element blocks with a power-of-two E8M0 scale.)
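To make the shared-scale idea concrete, here is a minimal sketch of quantizing one 16-element block onto the E2M1 grid. The function names are illustrative and not taken from any library. As a simplifying assumption, the scale is kept as an ordinary Python float rather than rounded to E4M3, and ties are resolved toward the smaller magnitude rather than by round-to-nearest-even.

```python
# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # number of elements that share one scale factor


def quantize_block(block):
    """Return (scale, codes), where codes are signed E2M1 grid values."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest grid value (6.0).
    # Assumption: the scale is NOT rounded to E4M3 here, to keep the sketch short.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        # Nearest grid point; min() returns the first (smaller) value on a tie.
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if x >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from the shared scale and per-element codes."""
    return [scale * c for c in codes]


def quantize_tensor(values):
    """Quantize and dequantize a flat list block by block."""
    out = []
    for i in range(0, len(values), BLOCK_SIZE):
        scale, codes = quantize_block(values[i:i + BLOCK_SIZE])
        out.extend(dequantize_block(scale, codes))
    return out
```

Because the scale is chosen from the largest magnitude in the block, a single outlier stretches the grid and pushes the small values in that block toward zero.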
The banding in the image above is the block structure made visible. A block containing both a very bright and a very dark pixel must scale to fit the bright one, collapsing the darker values into only a handful of distinct levels. FP4 has just 8 non-negative representable values (0, 0.5, 1, 1.5, 2, 3, 4, and 6, each multiplied by the block scale), so FP4 blocks “posterize” to at most 8 colors. FP6 has up to 32 non-negative values and FP8 up to 127, so degradation at those precisions is subtler.
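The list of FP4 values can be reproduced by decoding every non-negative 4-bit pattern. The sketch below assumes the usual minifloat layout (sign bit, then exponent, then mantissa), an exponent bias of 1 for E2M1, and no Inf/NaN encodings. `decode_minifloat` is a hypothetical helper name, not a library function.

```python
def decode_minifloat(code, exp_bits=2, man_bits=1, bias=1):
    """Decode a bit pattern (sign bit on top) for a format with no Inf/NaN."""
    sign = -1.0 if (code >> (exp_bits + man_bits)) & 1 else 1.0
    exp = (code >> man_bits) & ((1 << exp_bits) - 1)
    man = code & ((1 << man_bits) - 1)
    frac = man / (1 << man_bits)
    if exp == 0:
        # Subnormal: no implicit leading 1, exponent fixed at (1 - bias).
        return sign * frac * 2.0 ** (1 - bias)
    # Normal: implicit leading 1 plus the mantissa fraction.
    return sign * (1.0 + frac) * 2.0 ** (exp - bias)


# Codes 0..7 have the sign bit clear, so these are the non-negative E2M1 values.
fp4_non_negative = [decode_minifloat(c) for c in range(8)]
print(fp4_non_negative)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

The spacing between neighbors doubles with each exponent step, which is why the gaps near 6 are much coarser than the gaps near 0.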
These quantized formats are used in LLM inference to reduce demand on memory bandwidth, especially during decode, and to take advantage of higher arithmetic bandwidth, especially during prefill. They are generally destined for use in the Tensor Cores, where the vast majority of that bandwidth lies in contemporary GPUs.
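As a rough illustration of the memory saving, the storage cost per element can be estimated from the block layout described above: 4-bit elements plus one 8-bit scale per 16 elements. This is a back-of-the-envelope sketch; it ignores any per-tensor scale or padding, and `bits_per_element` is an illustrative name.

```python
def bits_per_element(element_bits, scale_bits, block_size):
    """Average storage per element when one scale is amortized over a block."""
    return element_bits + scale_bits / block_size


# 4-bit E2M1 elements with an 8-bit E4M3 scale shared by 16 elements.
fp4_block = bits_per_element(element_bits=4, scale_bits=8, block_size=16)
fp16_ratio = 16 / fp4_block  # how many times fewer bits than FP16

print(fp4_block)             # 4.5
print(round(fp16_ratio, 2))  # about 3.56
```

Under these assumptions, each weight read from memory costs about 4.5 bits instead of 16, which is where the reduced pressure on memory bandwidth during decode comes from.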
Explore how individual floats in these formats are encoded on the Quant Formats page. The image-as-tensor visualization technique is inspired by quant-jaunt.