gradient-alignment

MGUP: A Momentum-Gradient Alignment Update Policy for Stochastic Optimization

arXiv cs.LG · 2026-06-17

Proposes MGUP, a momentum-gradient alignment update policy that selectively updates parameters within each layer during stochastic optimization. MGUP integrates with optimizers such as AdamW, Lion, and Muon, comes with theoretical convergence guarantees, and reports improved performance on large-scale model training tasks.

GRASP: Gradient-Aligned Sequential Parameter Transfer for Memory-Efficient Multi-Source Learning

arXiv cs.LG · 2026-06-16

GRASP is a multi-source transfer learning method that sequentially merges source models into a single target model with constant, O(1), memory usage, using gradient-based parameter alignment to avoid negative transfer. Experiments show it outperforms ensemble methods while using much less memory.

Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why

Hugging Face Daily Papers · 2026-05-11

This paper introduces a training-free diagnostic framework for analyzing per-token distillation signals in reasoning models. It finds that guidance helps more on incorrect rollouts and that its benefit depends on student capacity and task context.
