Tag
The paper introduces Ember, a lightweight optimizer for embedding and LM-head matrices. Ember exploits gradient geometry to improve efficiency and performance across supervised fine-tuning, RL, and pretraining while using far less optimizer state than Adam.