Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe
Summary
This paper identifies a fundamental limitation (shrinkage bias) in non-uniform FP4 quantization formats for LLM pretraining and proposes UFP4, a uniform 4-bit training recipe that outperforms existing E2M1-based methods.
Cached at: 06/20/26, 02:27 PM
Source: https://huggingface.co/papers/2606.20381
Abstract
Uniform 4-bit training with RHT-based quantization outperforms E2M1-based methods by eliminating shrinkage bias and improving training stability across large language model architectures.
FP4 training promises substantial reductions in memory and computation cost for LLM pretraining, yet current FP4 hardware paths and recipes, including NVIDIA Blackwell/Rubin-class systems and AMD MI350-series GPUs, remain centered on E2M1 data elements. In this study, we identify a fundamental limitation of that choice: non-uniform formats such as E2M1 inherently suffer from Shrinkage Bias, a systematic negative rounding error caused by the geometric asymmetry of their representable bins. We show that this bias accumulates multiplicatively across layers and is amplified by the Random Hadamard Transform (RHT), providing a unified explanation for the training instability observed in existing E2M1-based FP4 recipes. In contrast, uniform grids (E1M2/INT4) bypass this grid-geometry error and better convert the improved bucket utilization from RHT into higher quantization quality. Based on this finding, we propose UFP4, a uniform 4-bit training recipe that applies RHT to all three training GEMMs while restricting stochastic rounding to dY alone. On Dense 1.5B, MoE 7.9B, and MoE 124B long-run pretraining, UFP4 consistently achieves lower BF16-relative loss degradation than strong E2M1-based baselines, supported by scaling-law analysis and ablation studies. Our results suggest that future accelerators should support E1M2/INT4-style uniform 4-bit grids as first-class training primitives alongside E2M1.
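The abstract does not give a formal definition of shrinkage bias or of the RHT variant used, so the following is only a rough sketch for probing the idea with NumPy, not the paper's method. It assumes the standard E2M1 magnitude set, a hypothetical 8-level uniform grid with the same maximum standing in for E1M2/INT4-style formats, a fixed scale rather than per-block scaling, and the mean of |q(x)| - |x| as the bias metric. All function and constant names are illustrative.

```python
import numpy as np

# Non-negative magnitudes representable by E2M1 (the sign bit is handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
# Assumption: an evenly spaced 8-level grid with the same maximum, used here
# only as a stand-in for E1M2/INT4-style uniform formats.
UNIFORM_GRID = np.linspace(0.0, 6.0, 8)


def quantize_rtn(x, grid):
    """Round |x| to the nearest grid magnitude (saturating at the grid max), keep the sign.

    Exact ties go to the lower magnitude; they have probability zero for continuous inputs.
    """
    mag = np.abs(x)
    idx = np.abs(mag[..., None] - grid).argmin(axis=-1)
    return np.sign(x) * grid[idx]


def mean_magnitude_error(x, grid):
    """Mean of |q(x)| - |x|; a negative value means magnitudes shrink on average."""
    return float(np.mean(np.abs(quantize_rtn(x, grid)) - np.abs(x)))


def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h


def random_hadamard_transform(x, rng):
    """Random sign flips followed by an orthonormal Hadamard rotation on the last axis."""
    n = x.shape[-1]
    signs = rng.choice([-1.0, 1.0], size=n)
    return (x * signs) @ hadamard(n) / np.sqrt(n)


# Small demo: Gaussian rows of width 16, measured before and after the rotation.
rng = np.random.default_rng(0)
x = rng.normal(scale=2.0, size=(4096, 16))
x_rot = random_hadamard_transform(x, rng)
for name, grid in (("E2M1", E2M1_GRID), ("uniform", UNIFORM_GRID)):
    print(name, mean_magnitude_error(x, grid), mean_magnitude_error(x_rot, grid))
```

The printed numbers depend on the input distribution, the scale, and the block size, so they should be read as a way to explore the effect rather than as a reproduction of the paper's measurements. The rotation is orthonormal, so it preserves each row's norm; any change in the measured error after the rotation comes from how values are redistributed across the grid's bins.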
Similar Articles
Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor
This paper decomposes MXFP4 quantization error into three additive components—scale bias, deadzone truncation, and grid noise—and proposes targeted corrections that recover BF16 accuracy to within 0.7 pp on Qwen2.5-3B and 3.0 pp on Qwen3-30B-A3B-Base for LLM reinforcement learning post-training.
Quantization Undoes Alignment: Bias Emergence in Compressed LLMs Across Models and Precision Levels
This paper studies how post-training quantization introduces new biases in instruction-tuned LLMs, finding that 3-bit precision causes 6–21% of previously unbiased items to develop stereotypes, while standard metrics like perplexity fail to detect this degradation.
Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs
This paper proposes a dense-to-sparse continual training method for LLMs, using a predictor-gated bank-wise sparsity to achieve 4x FFN sparsity, and demonstrates it on Qwen2.5-8B with long-context training.
InfoQuant: Shaping Activation Distributions for Low-Bit LLM Quantization
InfoQuant introduces a train-free method, Peak Suppression Orthogonal Transformation (PSOT), to reshape activation distributions for low-bit LLM quantization, preserving 97% floating-point accuracy under W4A4KV4 and outperforming prior PTQ methods.
DeepSeek V4 paper full version is out, FP4 QAT details and stability tricks [D]
DeepSeek released the full V4 paper detailing FP4 quantization-aware training, MoE training stability tricks (anticipatory routing and SwiGLU clamping), and a generative reward model for RLHF, achieving dramatic efficiency gains—V4-Flash uses only 10% of V3.2's FLOPs and 7% of its KV cache at 1M context length.