@eisokant: Great blog post from @ArkadiiBessonov on our pretraining team!

X AI KOLs Timeline 06/27/26, 12:10 PM News

Summary

A tweet shares a blog post discussing three methods for FP8 in LLM pretraining: per-tensor, blockwise, and MXFP8, focusing on how the scale is attached.

Original Article

Cached at: 06/28/26, 04:00 AM

Great blog post from @ArkadiiBessonov on our pretraining team!

Arkadii (@ArkadiiBessonov): Three main ways to do FP8 in LLM pretraining — and they differ in mainly one thing: how the scale is attached.

per-tensor vs blockwise vs MXFP8.

Why pretraining has so much structure here: forward + backward is 3 matmuls (Fprop, Dgrad, Wgrad) across 3 tensor roles (weights, …
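To make "how the scale is attached" concrete, here is a minimal pure-Python sketch. It is not code from the blog post. It simulates FP8 E4M3 rounding and compares one scale for the whole tensor, one scale per block, and an MXFP8-style power-of-two scale per 32-element block. The helper names (`round_e4m3`, `quantize`, `max_rel_error`), the toy data, the block size used for the plain blockwise case, and the choice to round the power-of-two scale up are all assumptions made for illustration.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


def round_e4m3(x):
    """Simulate rounding x to the nearest E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    _, ex = math.frexp(a)          # a = m * 2**ex with 0.5 <= m < 1
    e = max(ex - 1, -6)            # exponents below -6 fall in the subnormal range
    step = 2.0 ** (e - 3)          # 3 mantissa bits give 8 steps per binade
    return sign * min(round(a / step) * step, E4M3_MAX)


def quantize(values, block, pow2_scale=False):
    """Fake-quantize: one scale per `block` contiguous elements, then dequantize."""
    out = []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        amax = max(abs(v) for v in chunk)
        scale = amax / E4M3_MAX if amax > 0 else 1.0
        if pow2_scale:
            # Assumption: round the scale up to a power of two so nothing saturates.
            scale = 2.0 ** math.ceil(math.log2(scale))
        out.extend(round_e4m3(v / scale) * scale for v in chunk)
    return out


def max_rel_error(ref, approx):
    """Largest relative error over the nonzero reference elements."""
    return max(abs(r - a) / abs(r) for r, a in zip(ref, approx) if r != 0)


# Toy tensor: 32 small values followed by 32 large ones (hypothetical data).
x = [0.001 * (i + 1) for i in range(32)] + [1000.0] + [500.0 + i for i in range(31)]

per_tensor = quantize(x, block=len(x))             # a single scale for everything
blockwise = quantize(x, block=32)                  # one real-valued scale per block
mxfp8 = quantize(x, block=32, pow2_scale=True)     # power-of-two scale per 32 elements

for name, q in [("per-tensor", per_tensor), ("blockwise", blockwise), ("mxfp8-style", mxfp8)]:
    print(name, max_rel_error(x, q))
```

With a single per-tensor scale, the large values set the scale and the smallest entries are flushed to zero. Attaching a scale to each block keeps every element within normal E4M3 precision in this toy case. Real recipes apply such scales to 2D tiles of the matmul operands rather than to a flat list.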

Similar Articles

@ArkadiiBessonov: Three main ways to do FP8 in LLM pretraining — and they differ in mainly one thing: how the scale is attached. per-tens…

X AI KOLs Timeline

Explains three main approaches to FP8 scaling in LLM pretraining—per-tensor, blockwise, and MXFP8—focusing on how the scale is attached, and derives tile geometries from the constraint that scale must remain constant along the matmul's contracted dimension.

Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe

Hugging Face Daily Papers

This paper identifies a fundamental limitation (shrinkage bias) in non-uniform FP4 quantization formats for LLM pretraining and proposes UFP4, a uniform 4-bit training recipe that outperforms existing E2M1-based methods.

@nrehiew_: For the visual learners

X AI KOLs Timeline

A thread reviewing the paper 'Pretraining Large Language Models with NVFP4' and discussing NVFP4 pre-training, especially for NVIDIA Blackwell.

@harshbhatt7585: https://x.com/harshbhatt7585/status/2063593933314113587

X AI KOLs Timeline

The author shares learnings from training a 160M parameter LLM from scratch, experimenting with architectures like multi-token prediction and hierarchical reasoning models. They emphasize the importance of fast iteration, simplifying ideas, and understanding why architectures work.

@yukangchen_: Excited to share our new blog: Scaling Video Training with Parallelism https://research.nvidia.com/labs/eai/blogs/scali…

X AI KOLs Following

This blog from NVIDIA Research discusses how sequence parallelism can scale long-video training systems for both understanding and generation, addressing the challenge of fitting very long video sequences across multiple GPUs.
