diff-in-means

Refusal Beyond a Single Direction: A Preliminary Comparison of Diff-in-Means and INLP

arXiv cs.AI · 6d ago

Compares Diff-in-Means (DiM) and Iterative Nullspace Projection (INLP) as methods for steering refusal in safety fine-tuned chat models. The paper finds that INLP counterfactual flipping matches DiM directional ablation at suppressing refusal while offering more tunable interventions.
