This paper introduces OP-Mix, a data mixing algorithm that uses low-rank adapters trained on the current model to cheaply simulate candidate data mixtures. This makes data mixing efficient and unified across pretraining, continual midtraining, and continual instruction tuning. OP-Mix consistently finds near-optimal mixtures at a fraction of the compute of baselines, improving pretraining perplexity by 6.3% and reducing compute by 66-95% in continual learning settings.
The paper proposes Slice, a gradient-surgery-based initialization for LoRA adapters in continual learning that reconciles conflicting gradients from current and past tasks to reduce catastrophic forgetting, achieving better stability-plasticity trade-offs.
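To make "reconciling conflicting gradients" concrete, here is a minimal sketch of one common form of gradient surgery: a PCGrad-style projection. When the current-task gradient points against the past-task gradient (negative dot product), the conflicting component is removed. This is an illustration only. The summary does not say which surgery rule Slice uses or how the reconciled gradient is turned into a LoRA initialization, and the names `dot` and `project_conflict` are hypothetical.

```python
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_conflict(g_new: List[float], g_old: List[float]) -> List[float]:
    """PCGrad-style projection (an assumed stand-in, not Slice's exact rule).

    If g_new conflicts with g_old (negative dot product), subtract the
    component of g_new that lies along g_old. Otherwise return g_new unchanged.
    """
    d = dot(g_new, g_old)
    if d >= 0:
        # No conflict: the two gradients do not oppose each other.
        return list(g_new)
    # d < 0 implies g_old is nonzero, so the division is safe.
    scale = d / dot(g_old, g_old)
    return [x - scale * y for x, y in zip(g_new, g_old)]


# Toy gradients: the current-task gradient opposes the past-task gradient.
g_current = [1.0, 0.0]
g_past = [-1.0, 1.0]
g_reconciled = project_conflict(g_current, g_past)
```

After projection, the reconciled gradient is orthogonal to the past-task gradient, so to first order a step along it no longer increases the past-task loss. That is the intuition behind using such a step to reduce forgetting.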