@eisokant: Great blog post from @ArkadiiBessonov on our pretraining team!

Summary

A tweet shares a blog post discussing three methods for FP8 in LLM pretraining: per-tensor, blockwise, and MXFP8, focusing on how the scale is attached.


Cached at: 06/28/26, 04:00 AM

Arkadii (@ArkadiiBessonov): Three main ways to do FP8 in LLM pretraining, and they differ mainly in one thing: how the scale is attached.

per-tensor vs blockwise vs MXFP8.
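
To make "how the scale is attached" concrete, here is a minimal NumPy sketch of the three granularities. It only computes the scale factors and does not cast to FP8. The function names are hypothetical, the 128x128 tile for the blockwise case is an assumed (though common) choice, and the MXFP8 case assumes 32-element blocks with a power-of-two scale as in the OCP Microscaling format. "Scale" here means the multiplier applied before the cast; real libraries may store its reciprocal instead.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def per_tensor_scale(x):
    """One FP32 scale shared by the whole tensor."""
    return np.array(E4M3_MAX / np.abs(x).max(), dtype=np.float32)


def blockwise_scale(x, block=128):
    """One FP32 scale per (block x block) tile. The tile size is an assumption."""
    rows, cols = x.shape
    tiles = np.abs(x).reshape(rows // block, block, cols // block, block)
    amax = tiles.max(axis=(1, 3))  # per-tile absolute maximum
    return (E4M3_MAX / amax).astype(np.float32)


def mxfp8_scale(x, block=32):
    """One power-of-two scale per `block` contiguous elements of each row."""
    rows, cols = x.shape
    amax = np.abs(x).reshape(rows, cols // block, block).max(axis=2)
    # Round the exponent down so the scaled block stays within E4M3 range.
    exponent = np.floor(np.log2(E4M3_MAX / amax))
    return np.exp2(exponent).astype(np.float32)


# Illustrative input; assumes no block is entirely zero.
rng = np.random.default_rng(0)
x = rng.standard_normal((256, 512)).astype(np.float32)

print(per_tensor_scale(x).shape, blockwise_scale(x).shape, mxfp8_scale(x).shape)
# → () (2, 4) (256, 16)
```

The shapes show the trade-off: a single scalar for the tensor, a small grid of tile scales, or a dense grid of coarse power-of-two scales that follow local magnitude more closely.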

Why pretraining has so much structure here: forward + backward is 3 matmuls (Fprop, Dgrad, Wgrad) across 3 tensor roles (weights,
