Tag
CBD introduces an API-only black-box unlearning framework for LLMs that uses two auxiliary models to create controlled behavioral divergence between retained and target data, achieving a better unlearning-utility trade-off compared to existing methods.
RepSelect introduces a method for robust LLM unlearning that isolates forget-set-specific representations by collapsing top principal components of weight gradients, achieving 4-50× better robustness against relearning attacks compared to existing baselines across multiple model families.
The paper proposes the Unlearning Depth Score (UDS), a metric that uses activation patching to quantify how thoroughly target knowledge is erased from LLMs, achieving state-of-the-art faithfulness and robustness across multiple unlearning methods.
This paper introduces Minor Component Unlearning (MCU), a novel approach to LLM unlearning that targets minor components in representations to resist relearning attacks. It addresses the vulnerability of existing methods by focusing on robust directions within the model's spectral structure.