@eisokant: Great blog post from @ArkadiiBessonov on our pretraining team!
Summary
A tweet shares a blog post on three ways to do FP8 in LLM pretraining (per-tensor, blockwise, and MXFP8), which differ mainly in how the scale is attached.
Cached at: 06/28/26, 04:00 AM
Arkadii (@ArkadiiBessonov): Three main ways to do FP8 in LLM pretraining — and they differ in mainly one thing: how the scale is attached.
per-tensor vs blockwise vs MXFP8.
Why pretraining has so much structure here: forward + backward is 3 matmuls (Fprop, Dgrad, Wgrad) across 3 tensor roles (weights, …
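To make "how the scale is attached" concrete, here is a minimal NumPy sketch that computes only the scale factors for the three schemes, without doing any actual FP8 casting. It is an illustration under stated assumptions, not the blog post's code: the function names are hypothetical, the 128x128 tile is just one possible blockwise choice, and the MX-style scale is rounded up to a power of two for simplicity (the 32-element block and power-of-two scale follow the general MX convention, but the exact rounding rule here is an assumption).

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in E4M3


def per_tensor_scale(x):
    """One scale for the whole tensor (hypothetical helper)."""
    return np.asarray(np.abs(x).max() / FP8_E4M3_MAX)


def block_scales(x, block):
    """One scale per (rows x cols) tile; x's shape must divide evenly."""
    br, bc = block
    r, c = x.shape
    tiles = x.reshape(r // br, br, c // bc, bc)
    return np.abs(tiles).max(axis=(1, 3)) / FP8_E4M3_MAX


def mx_scales(x, k=32):
    """MX-style: one power-of-two scale per k consecutive elements of a row.

    Assumption: the scale is rounded UP to a power of two so the scaled
    values stay inside the E4M3 range.
    """
    s = np.maximum(block_scales(x, (1, k)), 2.0 ** -127)  # avoid log2(0)
    exponent = np.ceil(np.log2(s)).astype(np.int32)
    return np.ldexp(1.0, exponent)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((256, 256))
    print("per-tensor:", per_tensor_scale(x).shape)
    print("blockwise :", block_scales(x, (128, 128)).shape)
    print("MX-style  :", mx_scales(x).shape)
```

The only thing that changes between the three functions is the granularity at which the absolute maximum is taken, which is the sense in which the schemes differ mainly in where the scale lives.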
Similar Articles
@ArkadiiBessonov: Three main ways to do FP8 in LLM pretraining — and they differ in mainly one thing: how the scale is attached. per-tens…
Explains three main approaches to FP8 scaling in LLM pretraining—per-tensor, blockwise, and MXFP8—focusing on how the scale is attached, and derives tile geometries from the constraint that scale must remain constant along the matmul's contracted dimension.
Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe
This paper identifies a fundamental limitation (shrinkage bias) in non-uniform FP4 quantization formats for LLM pretraining and proposes UFP4, a uniform 4-bit training recipe that outperforms existing E2M1-based methods.
@nrehiew_: For the visual learners
A thread reviewing the paper 'Pretraining Large Language Models with NVFP4' and discussing NVFP4 pre-training, especially for NVIDIA Blackwell.
@harshbhatt7585: https://x.com/harshbhatt7585/status/2063593933314113587
The author shares learnings from training a 160M parameter LLM from scratch, experimenting with architectures like multi-token prediction and hierarchical reasoning models. They emphasize the importance of fast iteration, simplifying ideas, and understanding why architectures work.
@yukangchen_: Excited to share our new blog: Scaling Video Training with Parallelism https://research.nvidia.com/labs/eai/blogs/scali…
This blog from NVIDIA Research discusses how sequence parallelism can scale long-video training systems for both understanding and generation, addressing the challenge of fitting very long video sequences across multiple GPUs.