Tag
The paper proposes a gravitational interpretation of fine-tuning reversion, in which early training creates dominant behavioral manifolds that later alignment only shallowly displaces, leaving a persistent reversion direction. Experiments show that blocking this direction reduces harmfulness at minimal cost to task performance.
This paper introduces a diagnostic framework that uses Sparse Autoencoders to analyze concept-level forgetting in continual learning, finding that much forgetting stems from representational inaccessibility rather than erasure.