@ArkadiiBessonov: Three main ways to do FP8 in LLM pretraining — and they differ in mainly one thing: how the scale is attached. per-tens…


Summary

Explains three main approaches to FP8 scaling in LLM pretraining—per-tensor, blockwise, and MXFP8—focusing on how the scale is attached, and derives tile geometries from the constraint that scale must remain constant along the matmul's contracted dimension.


Cached at: 06/28/26, 06:13 PM

Three main ways to do FP8 in LLM pretraining, and they differ mainly in one thing: how the scale is attached.

per-tensor vs blockwise vs MXFP8.

Why pretraining has so much structure here: forward + backward is 3 matmuls (Fprop, Dgrad, Wgrad) across 3 tensor roles (weights, activations, gradients). Each role wants its own scale layout — and that’s where all the complexity lives.
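To make the three matmuls concrete, here is a minimal NumPy sketch for a single linear layer. The (N, K) weight layout and the names `x`, `w`, `dy` are assumptions for illustration, not something the post specifies. The point to notice is that each matmul contracts over a different dimension.

```python
import numpy as np

def linear_matmuls(x, w, dy):
    """x: (M, K) activations, w: (N, K) weights, dy: (M, N) output gradients.

    The (N, K) weight layout is an assumed convention for this sketch.
    """
    y = x @ w.T    # Fprop: contracts over K (input features)
    dx = dy @ w    # Dgrad: contracts over N (output features)
    dw = dy.T @ x  # Wgrad: contracts over M (tokens)
    return y, dx, dw

rng = np.random.default_rng(0)
M, K, N = 4, 6, 5
x = rng.standard_normal((M, K))
w = rng.standard_normal((N, K))
dy = rng.standard_normal((M, N))
y, dx, dw = linear_matmuls(x, w, dy)
print(y.shape, dx.shape, dw.shape)
```

Every tensor appears in two of the three matmuls, each time with a different contracted dimension, which is why a single scale layout per tensor is generally not enough.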

The three recipes differ in how the scale is attached — granularity, dtype, layout:

- Per-tensor: one scale for the whole tensor. Simplest, least robust to outliers.
- Blockwise: 1×128 / 128×128 tiles, FP32 scales. The DeepSeek-V3 style.
- MXFP8: 1×32 blocks + E8M0 scale. Native on Blackwell.
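A rough sketch of how the scale tensors differ in shape across the three recipes. It only computes scales and does not round anything to FP8. It assumes the E4M3 element format (largest finite magnitude 448), assumes 1×128 tiles for activations and 128×128 tiles for weights, and rounds the MXFP8 scale up to the next power of two, which is one possible choice rather than the definitive rule. The helper names are hypothetical.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude in E4M3

def per_tensor_scale(t):
    """One scalar scale for the whole tensor."""
    return np.abs(t).max() / FP8_E4M3_MAX

def block_scales(t, bh, bw):
    """One FP32 scale per (bh x bw) tile; assumes both dims divide evenly."""
    r, c = t.shape
    tiles = np.abs(t).reshape(r // bh, bh, c // bw, bw)
    return (tiles.max(axis=(1, 3)) / FP8_E4M3_MAX).astype(np.float32)

def mx_scales(t, block=32):
    """Power-of-two scale per 1 x block group (E8M0 stores only an exponent).

    Rounding the exponent up is an assumption made for this sketch.
    """
    s = block_scales(t, 1, block).astype(np.float64)
    s = np.maximum(s, np.finfo(np.float32).tiny)  # guard against all-zero blocks
    exponent = np.ceil(np.log2(s)).astype(np.int32)
    return np.ldexp(np.float32(1.0), exponent)

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 512))  # activations (M, K)
w = rng.standard_normal((384, 512))  # weights (N, K)

s_tensor = per_tensor_scale(x)       # scalar
s_act = block_scales(x, 1, 128)      # shape (256, 4)
s_wgt = block_scales(w, 128, 128)    # shape (3, 4)
s_mx = mx_scales(x)                  # shape (256, 16)
print(np.shape(s_tensor), s_act.shape, s_wgt.shape, s_mx.shape)
```

Finer granularity means more scales to store and apply, but an outlier then only affects the precision of its own block instead of the whole tensor.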

One rule ties it all together: the scale must stay constant along the matmul’s contracted dimension. That single constraint derives every tile geometry above — nothing here is arbitrary.
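Why the rule matters can be sketched for Fprop with blockwise scales. If the scale is constant over one block of the contracted dimension K, it factors out of that block's partial sum, so dequantization is a single multiply per partial result. The sketch below skips FP8 rounding entirely to isolate the algebra; the tile sizes, the (N, K) weight layout, and the function name are assumptions for illustration.

```python
import numpy as np

FP8_E4M3_MAX = 448.0

def scaled_fprop(xq, sx, wq, sw, kb=128, nb=128):
    """Compute Y = X @ W.T from scaled operands.

    xq: (M, K) scaled activations, sx: (M, K // kb)       one scale per 1 x kb tile
    wq: (N, K) scaled weights,     sw: (N // nb, K // kb)  one scale per nb x kb tile
    """
    M, K = xq.shape
    N = wq.shape[0]
    y = np.zeros((M, N))
    sw_rows = np.repeat(sw, nb, axis=0)  # (N, K // kb): scale seen by each output column
    for j in range(K // kb):
        ks = slice(j * kb, (j + 1) * kb)
        partial = xq[:, ks] @ wq[:, ks].T  # contraction over one K block
        # Scales are constant inside the block, so they factor out of the sum.
        y += partial * sx[:, j:j + 1] * sw_rows[:, j][None, :]
    return y

rng = np.random.default_rng(0)
M, K, N, kb, nb = 8, 256, 256, 128, 128
x = rng.standard_normal((M, K))
w = rng.standard_normal((N, K))

sx = np.abs(x).reshape(M, K // kb, kb).max(axis=2) / FP8_E4M3_MAX
sw = np.abs(w).reshape(N // nb, nb, K // kb, kb).max(axis=(1, 3)) / FP8_E4M3_MAX
xq = x / np.repeat(sx, kb, axis=1)
wq = w / np.repeat(np.repeat(sw, nb, axis=0), kb, axis=1)

y = scaled_fprop(xq, sx, wq, sw, kb, nb)
print(np.allclose(y, x @ w.T))
```

If a scale changed inside a K block, it could no longer be pulled out of the partial sum. The same reasoning applied to Dgrad and Wgrad, which contract over N and M in this layout, suggests why those matmuls need their scale tiles oriented along a different axis.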

I drew every layout out, per recipe and per matmul, so the geometry is concrete instead of hand-wavy.

Full walkthrough in my blogpost (link in comments)!

Full write-up — every recipe, every matmul, drawn out:
