@ArkadiiBessonov: Three main ways to do FP8 in LLM pretraining — and they differ in mainly one thing: how the scale is attached. per-tens…
Summary
Explains three main approaches to FP8 scaling in LLM pretraining—per-tensor, blockwise, and MXFP8—focusing on how the scale is attached, and derives tile geometries from the constraint that scale must remain constant along the matmul's contracted dimension.
Cached at: 06/28/26, 06:13 PM
Three main ways to do FP8 in LLM pretraining, and they differ mainly in one thing: how the scale is attached.
per-tensor vs blockwise vs MXFP8.
Why pretraining has so much structure here: for each linear layer, forward + backward is 3 matmuls (Fprop, Dgrad, Wgrad) across 3 tensor roles (weights, activations, gradients). Each role wants its own scale layout, and that’s where all the complexity lives.
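To make the "3 matmuls × 3 roles" picture concrete, here is a minimal sketch of one linear layer. The dimension names (M = tokens, K = input features, N = output features), the `W[N,K]` weight layout, and both function names are assumptions for illustration, not something the thread specifies.

```python
# One linear layer, assuming a PyTorch-style weight layout W[N, K]:
#   M = tokens, K = input features, N = output features (illustrative names).

def training_matmuls():
    """List the three matmuls of a linear layer and the dimension each contracts."""
    return [
        # (name, output, left operand, right operand, contracted dim)
        ("Fprop", "Y[M,N]",  "X[M,K]",  "W[N,K]", "K"),  # Y  = X  @ W^T
        ("Dgrad", "dX[M,K]", "dY[M,N]", "W[N,K]", "N"),  # dX = dY @ W
        ("Wgrad", "dW[N,K]", "dY[M,N]", "X[M,K]", "M"),  # dW = dY^T @ X
    ]

def contracted_dims_per_tensor():
    """For each input tensor, collect the dimensions it is contracted along."""
    dims = {}
    for _, _, left, right, contracted in training_matmuls():
        for operand in (left, right):
            name = operand.split("[")[0]
            dims.setdefault(name, []).append(contracted)
    return dims

if __name__ == "__main__":
    for tensor, dims in contracted_dims_per_tensor().items():
        print(tensor, "is contracted along", dims)
```

Each of the three tensors feeds two matmuls and is contracted along a different dimension in each, which is one way to see why a single scale layout per tensor is not enough.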
The three recipes differ in how the scale is attached (granularity, dtype, layout):
- Per-tensor: one scale for the whole tensor. Simplest, least robust to outliers.
- Blockwise: 1×128 / 128×128 tiles, FP32 scales. The DeepSeek-V3 style.
- MXFP8: 1×32 blocks + E8M0 scale. Native on Blackwell.
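A rough sketch of how the three granularities could be computed from the same amax-based rule. It only computes scales and does not emulate FP8 rounding. The function name `block_scales`, the amax / 448 rule, and the "round up to a power of two" choice for E8M0 are assumptions for illustration; real recipes may round differently.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def block_scales(x, tile_rows, tile_cols, power_of_two=False):
    """Return one scale per (tile_rows x tile_cols) tile of the 2-D list x."""
    rows, cols = len(x), len(x[0])
    scales = []
    for r0 in range(0, rows, tile_rows):
        row_of_scales = []
        for c0 in range(0, cols, tile_cols):
            amax = max(
                abs(x[r][c])
                for r in range(r0, min(r0 + tile_rows, rows))
                for c in range(c0, min(c0 + tile_cols, cols))
            )
            # Map the tile's largest magnitude onto the FP8 maximum.
            scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
            if power_of_two:
                # E8M0 stores only an exponent; rounding up avoids clipping
                # (an assumption, not necessarily what hardware recipes do).
                scale = 2.0 ** math.ceil(math.log2(scale))
            row_of_scales.append(scale)
        scales.append(row_of_scales)
    return scales

if __name__ == "__main__":
    # A 4 x 256 tensor of ones with a single outlier in the top-left corner.
    x = [[1.0] * 256 for _ in range(4)]
    x[0][0] = 448.0
    per_tensor = block_scales(x, 4, 256)                    # 1 x 1 grid of scales
    blockwise = block_scales(x, 1, 128)                     # 4 x 2 grid
    mxfp8 = block_scales(x, 1, 32, power_of_two=True)       # 4 x 8 grid
    print(per_tensor[0][0], blockwise[0][:2], mxfp8[0][:2])
```

With per-tensor scaling the single outlier sets the scale for every element, while the blockwise and 1×32 layouts confine its effect to the one tile that contains it.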
One rule ties it all together: the scale must stay constant along the matmul’s contracted dimension (within each block). Every tile geometry above follows from that single constraint; nothing here is arbitrary.
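One way to see why the rule matters: if both operands share a single scale over a block of the contracted dimension, the scales factor out of the inner sum, so the low-precision products can be accumulated first and rescaled once per block. A minimal sketch with plain Python lists; the function name `scaled_block_dot` and the toy values are hypothetical.

```python
def scaled_block_dot(a_q, a_scales, b_q, b_scales, block):
    """Dot product of two block-scaled vectors along the contracted dimension.

    a_q, b_q: quantized values; a_scales, b_scales: one scale per block.
    """
    total = 0.0
    for i, start in enumerate(range(0, len(a_q), block)):
        stop = min(start + block, len(a_q))
        # Inner accumulation touches only the quantized values ...
        partial = sum(a_q[k] * b_q[k] for k in range(start, stop))
        # ... and the two scales are applied once per block.
        total += a_scales[i] * b_scales[i] * partial
    return total

if __name__ == "__main__":
    a_q, a_scales = [1, 2, 3, 4], [0.5, 2.0]
    b_q, b_scales = [4, 3, 2, 1], [0.25, 1.0]
    block = 2

    # Reference: dequantize every element first, then take the dot product.
    a = [a_scales[k // block] * a_q[k] for k in range(len(a_q))]
    b = [b_scales[k // block] * b_q[k] for k in range(len(b_q))]
    reference = sum(x * y for x, y in zip(a, b))

    print(scaled_block_dot(a_q, a_scales, b_q, b_scales, block), reference)
```

If the scale changed from element to element inside a block of the contracted dimension, it could not be pulled out of the sum this way and every product would need its own rescale.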
I drew every layout out, per recipe and per matmul, so the geometry is concrete instead of hand-wavy.
Full walkthrough in my blogpost (link in comments)!
Full write-up — every recipe, every matmul, drawn out: