model-width


@maximelabonne: Turns out you never really needed µP, you just needed to scale the embedding learning rate by model width I'm no nanoGP…

X AI KOLs Following · 2026-05-21 Cached

A tweet suggests that scaling the embedding learning rate by model width can replace µP (maximal update parametrization). It refers to a setup that uses the Muon optimizer for the hidden layers and Adam for the remaining parameters.
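As a rough illustration of the setup the summary describes, the sketch below splits parameters into three optimizer groups and scales only the embedding learning rate with width. Everything in it is an assumption made for illustration: the helper name `build_param_groups`, the name-based routing rules, the default learning rates, the reference width `base_width`, and above all the direction of the scaling (here the rate is multiplied by `width / base_width`, which the truncated tweet does not confirm).

```python
def build_param_groups(param_shapes, width, base_width=256,
                       base_embed_lr=0.01, muon_lr=0.02, adam_lr=3e-4):
    """Hypothetical helper: assign parameter names to optimizer groups.

    param_shapes maps a parameter name to its shape tuple.
    """
    # ASSUMPTION: "scale by model width" is read as a linear multiplier
    # relative to a reference width. The source does not state the rule.
    embed_lr = base_embed_lr * (width / base_width)

    groups = {
        "embedding_adam": {"lr": embed_lr, "params": []},
        "hidden_muon": {"lr": muon_lr, "params": []},
        "other_adam": {"lr": adam_lr, "params": []},
    }
    for name, shape in param_shapes.items():
        if "embed" in name:
            # Embedding table: Adam with the width-scaled learning rate.
            groups["embedding_adam"]["params"].append(name)
        elif len(shape) == 2 and "head" not in name:
            # 2-D hidden weight matrices: handled by Muon.
            groups["hidden_muon"]["params"].append(name)
        else:
            # Biases, norms, output head: plain Adam.
            groups["other_adam"]["params"].append(name)
    return groups


# Toy parameter shapes for a width-512 model (illustrative names only).
shapes = {
    "embed.weight": (1000, 512),
    "blocks.0.attn.weight": (512, 512),
    "blocks.0.norm.bias": (512,),
    "lm_head.weight": (1000, 512),
}
groups = build_param_groups(shapes, width=512)
for group_name, group in groups.items():
    print(group_name, group["lr"], group["params"])
```

The groups are plain dictionaries, so they can be handed to whichever Adam and Muon implementations are in use; only the embedding group's learning rate changes as `width` changes.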
