@ArkadiiBessonov: Three main ways to do FP8 in LLM pretraining — and they differ in mainly one thing: how the scale is attached. per-tens…

X AI KOLs Timeline 06/27/26, 12:05 PM News

fp8 llm pretraining precision scaling deep-learning training

Summary

Explains three main approaches to FP8 scaling in LLM pretraining—per-tensor, blockwise, and MXFP8—focusing on how the scale is attached, and derives tile geometries from the constraint that scale must remain constant along the matmul's contracted dimension.


Original Article


Cached at: 06/28/26, 06:13 PM

Three main ways to do FP8 in LLM pretraining, and they differ mainly in one thing: how the scale is attached.

per-tensor vs blockwise vs MXFP8.

Why pretraining has so much structure here: forward + backward is 3 matmuls (Fprop, Dgrad, Wgrad) across 3 tensor roles (weights, activations, gradients). Each role wants its own scale layout — and that’s where all the complexity lives.
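To make the three matmuls concrete, here is a minimal pure-Python sketch for a single linear layer. The shape names (M tokens, K input features, N output features), the `[K, N]` weight layout, and the helper functions are illustrative assumptions, not taken from the post. The point to notice is that each matmul contracts over a different dimension, so the same tensor is consumed along different axes in different passes.

```python
# Shapes for one linear layer (names are illustrative assumptions):
#   X  : [M, K]  activations
#   W  : [K, N]  weights
#   dY : [M, N]  gradient of the loss w.r.t. the layer output

def matmul(a, b):
    """Plain nested-list matmul: [m, k] @ [k, n] -> [m, n]."""
    k = len(b)
    assert all(len(row) == k for row in a), "inner dimensions must match"
    return [[sum(row[i] * b[i][j] for i in range(k)) for j in range(len(b[0]))]
            for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def linear_fwd_bwd(X, W, dY):
    Y = matmul(X, W)                # Fprop: contracts over K
    dX = matmul(dY, transpose(W))   # Dgrad: contracts over N
    dW = matmul(transpose(X), dY)   # Wgrad: contracts over M
    return Y, dX, dW

X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]                    # M=2, K=3
W = [[1.0, 0.0, 2.0, 0.0],
     [0.0, 1.0, 0.0, 2.0],
     [1.0, 1.0, 1.0, 1.0]]               # K=3, N=4
dY = [[1.0] * 4, [1.0] * 4]              # M=2, N=4

Y, dX, dW = linear_fwd_bwd(X, W, dY)
```

In this sketch the activations `X` are contracted over K in Fprop but over M in Wgrad, and `dY` is contracted over N in Dgrad but over M in Wgrad.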

The three recipes differ in how the scale is attached — granularity, dtype, layout:

- Per-tensor: one scale for the whole tensor. Simplest, least robust to outliers.
- Blockwise: 1×128 / 128×128 tiles, FP32 scales. The DeepSeek-V3 style.
- MXFP8: 1×32 blocks + E8M0 scale. Native on Blackwell.
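As a rough illustration of how the three granularities behave, the sketch below computes only the scales (no actual FP8 cast) for a row containing one outlier. The function names are hypothetical. It assumes the E4M3 element format, whose largest finite value is 448, and it picks the power-of-two scale by rounding the exponent up, which is just one possible choice; real implementations may select the E8M0 exponent differently.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3

def amax(values):
    return max(abs(v) for v in values)

def scale_for(a):
    """Scale that maps magnitude `a` onto E4M3_MAX (1.0 for an all-zero block)."""
    return a / E4M3_MAX if a > 0 else 1.0

def per_tensor_scale(tensor):
    """One scale for the whole 2-D tensor."""
    return scale_for(amax(v for row in tensor for v in row))

def blockwise_scales(row, block=128):
    """One full-precision scale per 1 x `block` tile of a row."""
    return [scale_for(amax(row[i:i + block])) for i in range(0, len(row), block)]

def mx_scales(row, block=32):
    """One power-of-two scale per 1 x `block` tile, as an E8M0 scale would hold.
    Assumption: round the exponent up so scaled values never exceed E4M3_MAX."""
    scales = []
    for i in range(0, len(row), block):
        s = scale_for(amax(row[i:i + block]))
        scales.append(2.0 ** math.ceil(math.log2(s)))
    return scales

row = [1.0] * 256
row[200] = 1792.0          # a single outlier, 4x the E4M3 maximum
tensor = [row]

pt = per_tensor_scale(tensor)   # the outlier sets the scale for every element
bw = blockwise_scales(row)      # only the second 128-wide tile is affected
mx = mx_scales(row)             # only one of eight 32-wide blocks is affected
```

With per-tensor scaling, the outlier inflates the scale for all 256 values. With the finer layouts, the inflated scale is confined to the tile or block that contains it.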

One rule ties it all together: the scale must stay constant along the matmul’s contracted dimension. That single constraint derives every tile geometry above — nothing here is arbitrary.
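One way to see why the rule matters: if a scale is constant over a block of the contracted index, it factors out of that block's partial sum, so the inner accumulation can run on the scaled values alone and the rescale happens once per block. The sketch below checks this identity in plain Python. The helper names and the block size of 4 are illustrative assumptions, and rounding to FP8 is deliberately left out so that the comparison is exact.

```python
def scale_blocks(vec, block):
    """Split `vec` along the contracted dimension and give each block one
    scale (its absolute maximum). No rounding to FP8 is performed here."""
    out = []
    for i in range(0, len(vec), block):
        chunk = vec[i:i + block]
        scale = max(abs(v) for v in chunk) or 1.0
        out.append((scale, [v / scale for v in chunk]))
    return out

def scaled_dot(a_blocks, b_blocks):
    """Dot product over K with the scales applied once per block."""
    total = 0.0
    for (sa, qa), (sb, qb) in zip(a_blocks, b_blocks):
        partial = sum(x * y for x, y in zip(qa, qb))  # scaled values only
        total += sa * sb * partial                    # one rescale per block
    return total

# Values are powers of two so every division and product is exact in floats.
a = [0.5, -2.0, 4.0, 1.0, 8.0, -1.0, 0.25, 2.0]
b = [1.0, 0.5, -0.25, 2.0, 0.5, 4.0, -8.0, 1.0]

reference = sum(x * y for x, y in zip(a, b))
result = scaled_dot(scale_blocks(a, 4), scale_blocks(b, 4))
```

If the scale instead changed from element to element within a block, it could not be pulled out of `partial`, and every product would need its own rescale.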

I drew every layout out, per recipe and per matmul, so the geometry is concrete instead of hand-wavy.

Full walkthrough in my blogpost (link in comments)!

Full write-up — every recipe, every matmul, drawn out:

@ArkadiiBessonov: Three main ways to do FP8 in LLM pretraining — and they differ in mainly one thing: how the scale is attached. per-tens…

Similar Articles

@eisokant: Great blog post from @ArkadiiBessonov on our pretraining team!

Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe

@zcbenz: nvfp4 vs mxfp4 is not just different choices of block size and scale format, nvfp4 uses an additional tensor-wise scale…

Predictable Scaling Laws of Optimal Hyperparameters for LLM Continued Pre-training

ScaleSweep: Accurate NVFP4 Post-Training Quantization of LLMs via Block Scale Initialization
