@jino_rohit: before you start learning quantization for llms, you need to understand how different number formats are represented in…
Summary
A thread explaining why understanding number formats in memory is crucial for learning LLM quantization, covering gradient NaN debugging, numerical stability, and quantization distortion.
Cached at: 05/23/26, 10:16 PM
before you start learning quantization for llms, you need to understand how different number formats are represented in memory. why?
- to debug why gradients go NaN
- to understand why training becomes numerically unstable
- to see how quantization distorts the number line
- to understand why certain quantization schemes work better than others
this is my article that helps you build a mental model around it with visuals!
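As a rough illustration of the points above (this is not taken from the linked article), the sketch below uses only the Python standard library to show three things: how a float32 splits into sign, exponent, and mantissa bits; how an overflow in float16 turns into inf and then NaN; and how a single outlier stretches an int8 grid until small values collapse to zero. The helper names `float32_fields`, `to_float16`, `quantize_int8`, and `dequantize` are hypothetical, and the symmetric absmax int8 scheme is an assumption chosen for simplicity, not a scheme named in the thread.

```python
import math
import struct


def float32_fields(x):
    """Round x to IEEE 754 binary32 and return (sign, exponent, mantissa) bit fields."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF        # 23 bits, implicit leading 1 not stored
    return sign, exponent, mantissa


def to_float16(x):
    """Round x to the nearest binary16 value.

    struct raises OverflowError for finite values too large for binary16;
    here that case is mapped to +/-inf, which is what float16 arithmetic
    produces on overflow.
    """
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def quantize_int8(values):
    """Symmetric absmax quantization to the integer range [-127, 127]."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # guard against all-zero input
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map int8 codes back to real values on the quantization grid."""
    return [qi * scale for qi in q]


# 1.0 is stored as sign 0, biased exponent 127 (2**0), mantissa 0.
print(float32_fields(1.0))            # (0, 127, 0)

# float16 tops out at 65504, so a larger value overflows to inf,
# and inf - inf is NaN: one common way gradients "go NaN".
big = to_float16(70000.0)
print(big, big - big)

# One outlier (100.0) sets the scale, so 0.01 and 0.02 both land on code 0.
q, scale = quantize_int8([0.01, 0.02, 1.0, 100.0])
print(q)                              # [0, 0, 1, 127]
print(dequantize(q, scale))
```

The last case hints at why some schemes work better than others: a per-tensor scale lets one outlier decide the spacing of the whole grid, while smaller blocks with their own scales limit how far that damage spreads.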
Similar Articles
@charles_irl: Low-precision floats are weird. I have been building up my intuition by playing with them outside of inference/training…
A tweet thread introduces a visualizer for micro-scaling/block quant formats like NVFP4 and MXFP4, explaining how these low-precision floats work and their use in LLM inference to reduce memory bandwidth demands.
From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization
Researchers identify two distinct failure modes in aggressive LLM quantization—Signal Degradation and Computation Collapse—and show that training-free fixes only remedy the former, indicating structural reconstruction is needed for ultra-low-bit models.
Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs
This paper investigates smoothness degradation in extremely quantized Large Language Models, arguing that preserving smoothness is crucial for maintaining performance beyond numerical accuracy.
Quantization Undoes Alignment: Bias Emergence in Compressed LLMs Across Models and Precision Levels
This paper studies how post-training quantization introduces new biases in instruction-tuned LLMs, finding that 3-bit precision causes 6–21% of previously unbiased items to develop stereotypes, while standard metrics like perplexity fail to detect this degradation.
Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs
Mix-Quant proposes a phase-aware quantization framework for agentic LLMs, using NVFP4 quantization for the prefilling stage to accelerate computation while preserving BF16 precision for decoding to maintain accuracy. The method achieves up to 3x speedup in prefilling with minimal performance degradation on agentic benchmarks.