model-width

#model-width

@maximelabonne: Turns out you never really needed µP, you just needed to scale the embedding learning rate by model width I'm no nanoGP…

X AI KOLs Following ↗ · 2026-05-21 Cached

A tweet suggests that scaling the embedding learning rate by model width can replace the need for µP (micro-parameterization), referencing Muon optimizer for hidden layers and Adam for the rest.

0 favorites 0 likes

model-width

@maximelabonne: Turns out you never really needed µP, you just needed to scale the embedding learning rate by model width I'm no nanoGP…

Submit Feedback