Tag
Introduces Self-Distillation Fine-Tuning (SDFT), a method that enables on-policy learning from demonstrations to achieve continual learning without catastrophic forgetting, outperforming supervised fine-tuning.
The author proposes two architectures, Internal KV-Sphere Architecture (IKSA) and Background Micro Fine-Tuning (BMFT), that let LLMs learn continually from personal interactions without requiring a GPU and without catastrophic forgetting.
MixSD proposes a self-distillation method for knowledge injection in language models that aligns supervision with the model's native distribution, reducing catastrophic forgetting during fine-tuning. It achieves near-perfect memorization while retaining up to 100% of base capabilities, vastly outperforming standard SFT.
This paper proposes using reinforcement learning with semantic rewards (via GRPO) to extend LLMs to low-resource languages without the alignment tax that typically shows up as catastrophic forgetting, showing improved semantic quality and transferability over supervised fine-tuning.
The paper proposes Slice, a gradient-surgery-based initialization for LoRA adapters in continual learning that reconciles conflicting gradients from current and past tasks to reduce catastrophic forgetting, achieving better stability-plasticity trade-offs.
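As a rough illustration of what "reconciling conflicting gradients" can mean, the sketch below uses a PCGrad-style projection, a well-known gradient-surgery technique. It is not Slice's procedure, which the summary above does not spell out. The function names and the toy gradients are hypothetical.

```python
# PCGrad-style projection (an assumed stand-in, not the Slice algorithm):
# if the new-task gradient points against the past-task gradient,
# remove the component of the new-task gradient along the past-task one.

def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))

def project_conflict(g_new, g_old):
    """Return g_new with any component that conflicts with g_old removed."""
    d = dot(g_new, g_old)
    if d >= 0:
        # No conflict: the two gradients do not oppose each other.
        return list(g_new)
    # d < 0 implies g_old is non-zero, so the division is safe.
    scale = d / dot(g_old, g_old)
    return [a - scale * b for a, b in zip(g_new, g_old)]

# Toy example: the new-task gradient pushes the second coordinate down,
# while the past-task gradient wants it to go up.
g_new = [1.0, -1.0]
g_old = [0.0, 1.0]
g_fixed = project_conflict(g_new, g_old)
```

After projection the adjusted gradient is orthogonal to the past-task gradient, so a small step along it leaves the past-task loss unchanged to first order.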
This paper shows that mixing post-training data into pretraining (early exposure) improves how robustly a model retains capabilities after subsequent fine-tuning, challenging the notion that immediate post-training performance predicts retention. Controlled experiments on 135M and 1B models demonstrate that early exposure consistently improves the trade-off between upstream retention and downstream performance.
Google researchers introduce Nested Learning, a new architecture that replaces the Transformer by treating models as nested optimization problems, solving catastrophic forgetting and achieving 100% long-context memory stability.
This paper introduces a Fast-Slow Training framework for LLMs that combines parameter updates with optimized context to improve sample efficiency and reduce catastrophic forgetting during continual learning.
A fast-slow learning framework for LLMs combines fixed slow weights with optimized fast context weights, achieving up to 3x better sample efficiency and reduced catastrophic forgetting in continual learning scenarios.
ORBIT proposes a method to mitigate catastrophic forgetting in large language models fine-tuned for generative retrieval by tracking parameter distances and using weight averaging, outperforming common continual learning baselines.
This paper introduces Retention-aware Policy Optimization (RaPO) to mitigate catastrophic forgetting in visual continual learning using reinforcement fine-tuning. RaPO uses trajectory-level reward shaping and cross-task advantage normalization to close the gap between reinforcement and supervised fine-tuning in class- and domain-incremental learning.
This research investigates how task geometry influences continual post-training in LLMs, identifying 'geometry conflict' as a cause of forgetting and a mechanism for controlling update integration. The authors propose Geometry-Conflict Wasserstein Merging (GCWM), a data-free method that improves retention and performance across various model sizes.
The paper addresses catastrophic forgetting in sequentially trained early-exiting neural networks and proposes two methods based on Elastic Weight Consolidation and Learning without Forgetting to preserve earlier exit performance while adding new ones.
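For readers unfamiliar with Elastic Weight Consolidation, the standard penalty adds a quadratic term that anchors each parameter to its earlier value, weighted by a diagonal Fisher information estimate. The sketch below shows only that generic penalty. How the paper adapts it to early-exit networks is not stated in the summary, and all names and numbers here are hypothetical.

```python
# Generic EWC penalty: (strength / 2) * sum_i F_i * (theta_i - theta_star_i)^2
# Parameters are plain lists of floats to keep the sketch dependency-free.

def ewc_penalty(params, anchor_params, fisher_diag, strength):
    """Quadratic penalty for drifting away from the previously learned weights."""
    total = 0.0
    for p, p_star, f in zip(params, anchor_params, fisher_diag):
        total += f * (p - p_star) ** 2
    return 0.5 * strength * total

def ewc_penalty_grad(params, anchor_params, fisher_diag, strength):
    """Gradient of the penalty with respect to the current parameters."""
    return [strength * f * (p - p_star)
            for p, p_star, f in zip(params, anchor_params, fisher_diag)]

# Toy values: the first parameter is "important" (large Fisher value),
# so moving it costs far more than moving the second one.
theta_star = [1.0, -2.0, 0.5]   # weights after the earlier training stage
fisher = [10.0, 0.1, 1.0]       # diagonal Fisher estimate (assumed given)
theta = [1.5, 0.0, 0.5]         # weights during the new training stage

penalty = ewc_penalty(theta, theta_star, fisher, strength=2.0)
grad = ewc_penalty_grad(theta, theta_star, fisher, strength=2.0)
```

In training, the penalty would be added to the loss of the newly added component, so parameters that mattered for earlier behavior are pulled back harder than unimportant ones.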
This paper proposes an attribution-guided continual fine-tuning framework for large language models that estimates task-specific parameter importance in Transformer layers and modulates gradients accordingly, mitigating catastrophic forgetting while maintaining performance on new tasks.
GeoStack introduces a geometric framework to compose independently trained domain experts in Vision-Language Models without catastrophic forgetting, achieving constant-time inference and a 10x reduction in geometric error.
JumpLoRA introduces a sparse adapter framework for continual learning in LLMs that uses JumpReLU gating to dynamically isolate task parameters and prevent catastrophic forgetting. The method enhances LoRA-based approaches and outperforms state-of-the-art continual learning methods such as ELLA.
This paper introduces Self-Distillation Fine-Tuning (SDFT) as a recovery mechanism for LLMs suffering from performance degradation due to catastrophic forgetting, quantization, and pruning. The authors provide theoretical justification using Centered Kernel Alignment (CKA) to demonstrate that self-distillation aligns the student model's high-dimensional manifold with the teacher's optimal structure, effectively recovering lost capabilities.
This paper introduces MMOT, an online mixture model learning framework based on optimal transport theory that addresses incremental learning with distributional shifts through dynamic centroid updates and improved class similarity estimation. The approach includes a Dynamic Preservation strategy to mitigate catastrophic forgetting and maintain class separability in latent space.