Tag
Compares Difference-in-Means (DiM) and Iterative Nullspace Projection (INLP) methods for steering refusal in safety fine-tuned chat models, finding that INLP counterfactual flipping matches DiM directional ablation for refusal suppression while offering more tunable interventions.