@plugyawn: Introducing: Megaprop: a library for efficient preconditioned optimization across GPUs! Megaprop is a fork of Megatron …
Summary
Megaprop is a new library for efficient preconditioned optimization across GPUs, forked from Megatron and TransformerEngine, with FSDP support for Muon, FOOF, KFAC, and Newton-Muon, and MuP support for width and depth.
Cached at: 06/15/26, 07:05 PM
Introducing: Megaprop: a library for efficient preconditioned optimization across GPUs!
Megaprop is a fork of Megatron and TransformerEngine, with FSDP support for Muon, FOOF, KFAC, and Newton-Muon, with MuP support, for both width and depth…
(1/n)
Megaprop’s PSGD implementation computes the preconditioning matrices alongside the gradient: it collects and communicates X.T @ X and dY.T @ dY at the same time as it computes the gradient on the weights, dY.T @ X. Diagonal and block-diagonal approximations have first-class support.
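To make the three products concrete, here is a minimal single-device sketch in NumPy. It assumes a linear layer Y = X @ W.T with X of shape (batch, d_in) and dY of shape (batch, d_out). The function name and the `approximation` argument are hypothetical and are not Megaprop's API. No cross-GPU communication is shown, and the block-diagonal case is omitted.

```python
import numpy as np

def linear_backward_with_grams(X, dY, approximation="full"):
    """Return the weight gradient plus both Gram statistics in one pass.

    Hypothetical helper for illustration only (not Megaprop's API).
    X:  layer inputs,            shape (batch, d_in)
    dY: gradient w.r.t. outputs, shape (batch, d_out)
    """
    # Weight gradient for Y = X @ W.T, shape (d_out, d_in).
    grad_W = dY.T @ X

    if approximation == "full":
        feature_gram = X.T @ X    # (d_in, d_in)
        grad_gram = dY.T @ dY     # (d_out, d_out)
    elif approximation == "diag":
        # Only the diagonals of the two Grams, without forming the full matrices.
        feature_gram = np.einsum("bi,bi->i", X, X)   # (d_in,)
        grad_gram = np.einsum("bo,bo->o", dY, dY)    # (d_out,)
    else:
        raise ValueError(f"unknown approximation: {approximation!r}")

    return grad_W, feature_gram, grad_gram
```

All three quantities are reductions over the same batch dimension of the same two tensors, which is why they can be gathered in the same backward step. The diagonal variant stores d_in + d_out numbers instead of d_in² + d_out².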
(2/n)
PSGD in Megaprop is configured by specifying a right-preconditioner and a left-preconditioner: an optimizer “recipe”.
For example, right-preconditioning with the feature gram recovers FOOF. Composing that with Muon recovers Newton-Muon.
For Newton-Muon in Megaprop, we pass: --matrix-optimizer muon --matrix-input-preconditioner feature_gram
If you want a diagonal approximation of the feature_gram, we also pass: --matrix-input-preconditioner-approximation diag
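The recipe idea can be sketched in a few lines of NumPy. This is one plausible reading of the thread, not Megaprop's implementation. The damping term, the function names, and the `matrix_optimizer` argument are assumptions. An exact SVD-based orthogonalization stands in for the Newton-Schulz iteration that Muon normally uses.

```python
import numpy as np

def right_precondition(grad_W, feature_gram, damping=1e-3):
    """FOOF-style step: multiply the gradient on the right by (A + damping*I)^-1.

    feature_gram may be a full (d_in, d_in) matrix or its (d_in,) diagonal.
    The damping value is an assumption made for numerical stability.
    """
    if feature_gram.ndim == 1:
        # Diagonal approximation: a per-input-feature rescaling.
        return grad_W / (feature_gram + damping)
    A = feature_gram + damping * np.eye(grad_W.shape[1])
    # A is symmetric, so G @ inv(A) equals solve(A, G.T).T
    return np.linalg.solve(A, grad_W.T).T

def orthogonalize(M):
    """Replace M with U @ Vt from its SVD (stand-in for Newton-Schulz)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def recipe_update(grad_W, feature_gram, matrix_optimizer="sgd", damping=1e-3):
    """Hypothetical composition: precondition first, then apply the matrix optimizer."""
    update = right_precondition(grad_W, feature_gram, damping)
    if matrix_optimizer == "muon":
        update = orthogonalize(update)
    return update
```

Under these assumptions, `matrix_optimizer="sgd"` with a feature gram gives a FOOF-like update, and `matrix_optimizer="muon"` gives the composed, Newton-Muon-like update. Passing the Gram's diagonal instead of the full matrix mirrors the `diag` approximation option.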
And more!
(3/n)
https://github.com/plugyawn/Megaprop…
To summarize: Megaprop is a library for efficient preconditioned stochastic gradient descent. It’s still a work in progress, but with width/depth MuP support for Muon and FSDP support up to Muon + diagonal KFAC already, it’s quite powerful!
And, of course, it inherits all of the Megatron-FSDP goodness of its parents.