@plugyawn: Introducing: Megaprop: a library for efficient preconditioned optimization across GPUs! Megaprop is a fork of Megatron …

X AI KOLs Following 06/15/26, 04:27 PM Tools

preconditioned-optimization gpu megatron transformerengine fsdp muon library

Summary

Megaprop is a new library for efficient preconditioned optimization across GPUs, forked from Megatron and TransformerEngine, with FSDP support for Muon, FOOF, KFAC, and Newton-Muon, and MuP support for width and depth.

Introducing: Megaprop: a library for efficient preconditioned optimization across GPUs! Megaprop is a fork of Megatron and TransformerEngine, with FSDP support for Muon, FOOF, KFAC, and Newton-Muon, with MuP support, for both width and depth... (1/n) https://t.co/6OUds8kNj8

Cached at: 06/15/26, 07:05 PM

Introducing: Megaprop: a library for efficient preconditioned optimization across GPUs!

Megaprop is a fork of Megatron and TransformerEngine, with FSDP support for Muon, FOOF, KFAC, and Newton-Muon, with MuP support, for both width and depth…

(1/n)

Megaprop’s PSGD implementation computes the preconditioning matrices alongside the gradient: it collects and communicates X.T @ X and dY.T @ dY at the same time as it computes the weight gradient, dY.T @ X. It also has first-class support for diagonal and block-diagonal approximations.

(2/n)
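As a rough illustration of the statistics named above, here is a minimal NumPy sketch for a single linear layer. It is not Megaprop's code: the function names, the token-wise sharding, and the plain sum used to stand in for cross-GPU communication are all assumptions made for the example.

```python
import numpy as np

def linear_backward_with_grams(X, dY):
    """Hypothetical helper: weight gradient plus both Gram matrices.

    X:  (tokens, d_in)  layer inputs (features)
    dY: (tokens, d_out) gradients w.r.t. the layer outputs
    """
    grad_W = dY.T @ X        # (d_out, d_in), the usual weight gradient
    feature_gram = X.T @ X   # (d_in, d_in), input-side statistic
    grad_gram = dY.T @ dY    # (d_out, d_out), output-side statistic
    return grad_W, feature_gram, grad_gram

def sum_over_shards(per_shard_stats):
    """Stand-in for a sum all-reduce (assumption): add each statistic across shards."""
    return tuple(sum(parts) for parts in zip(*per_shard_stats))

def diag_approx(gram):
    """Keep only the diagonal of a Gram matrix, returned as a vector."""
    return np.diag(gram).copy()

def block_diag_approx(gram, block_size):
    """Keep only square blocks of the given size along the diagonal."""
    n = gram.shape[0]
    out = np.zeros_like(gram)
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        out[start:end, start:end] = gram[start:end, start:end]
    return out
```

All three products are sums over tokens, so per-shard results can be added to get the global ones. That is why, in this sketch, the Gram matrices can travel with the gradient in the same reduction step.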

PSGD in Megaprop is configured by specifying a right-preconditioner and a left-preconditioner: an optimizer “recipe”.

For example, right-preconditioning with the feature gram recovers FOOF. Composing that with Muon recovers Newton-Muon.

For Newton-Muon in Megaprop, we pass: --matrix-optimizer muon --matrix-input-preconditioner feature_gram

If you want a diagonal approximation of the feature_gram, we also pass: --matrix-input-preconditioner-approximation diag

And more!

(3/n)
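To make the recipe idea concrete, here is a small NumPy sketch of right-preconditioning a weight gradient with the feature gram, with an optional diagonal approximation, followed by an orthogonalization step. It is an illustration under assumptions, not Megaprop's implementation: the function names, the damping term, and the order of composition (precondition first, then orthogonalize) are guesses. The exact SVD polar factor stands in for the Newton-Schulz iteration that Muon normally uses, and momentum is left out.

```python
import numpy as np

def right_precondition(grad_W, feature_gram, damping=1e-3, approximation=None):
    """Compute grad_W @ inv(feature_gram + damping * I).

    grad_W:        (d_out, d_in) weight gradient
    feature_gram:  (d_in, d_in) symmetric matrix X.T @ X
    damping:       hypothetical regularizer so the inverse exists
    approximation: None for the full matrix, "diag" for its diagonal only
    """
    if approximation == "diag":
        # Dividing column j by (gram[j, j] + damping) equals multiplying
        # on the right by the inverse of the damped diagonal matrix.
        return grad_W / (np.diag(feature_gram) + damping)
    damped = feature_gram + damping * np.eye(feature_gram.shape[0])
    # Solve Z @ damped = grad_W; damped is symmetric, so transpose and solve.
    return np.linalg.solve(damped, grad_W.T).T

def orthogonalize(M):
    """Exact polar factor U @ Vt, used here in place of Newton-Schulz."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def newton_muon_like_update(grad_W, feature_gram, damping=1e-3, approximation=None):
    """Assumed composition: right-precondition, then orthogonalize."""
    preconditioned = right_precondition(grad_W, feature_gram, damping, approximation)
    return orthogonalize(preconditioned)
```

Calling `right_precondition` on its own corresponds to the FOOF-style update described above, and passing `approximation="diag"` mirrors the diagonal option.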

https://github.com/plugyawn/Megaprop…

To summarize: Megaprop is a library for efficient preconditioned stochastic gradient descent. It’s still a work in progress, but with width/depth MuP support for Muon and FSDP support up to Muon + diagonal KFAC already, it’s quite powerful!

And, of course, it inherits all of the Megatron-FSDP goodness of its parents.
