Tag
This paper proposes MGUP, a momentum-gradient alignment update policy that selectively updates parameters within each layer during stochastic optimization. MGUP integrates with optimizers such as AdamW, Lion, and Muon, comes with theoretical convergence guarantees, and is reported to improve performance on large-scale model training tasks.
This paper proposes GRASP, a multi-source transfer learning method that sequentially merges source models into a single target model with O(1) memory usage, using gradient-based parameter alignment to avoid negative transfer. Experiments show that it outperforms ensemble methods while using far less memory.
This paper introduces a training-free diagnostic framework for analyzing per-token distillation signals in reasoning models. The analysis reveals that guidance is more beneficial on incorrect rollouts and that its benefit depends on student capacity and task context.