diff-in-means


Refusal Beyond a Single Direction: A Preliminary Comparison of Diff-in-Means and INLP

arXiv cs.AI · 6d ago

Compares Diff-in-Means (DiM) and Iterative Nullspace Projection (INLP) as methods for steering refusal in safety fine-tuned chat models. It finds that INLP counterfactual flipping matches DiM directional ablation at suppressing refusal while offering more tunable interventions.
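
For readers new to the terminology, here is a minimal sketch of the Diff-in-Means (DiM) side only: estimate a direction as the normalized difference between the mean activations of two prompt sets, then project that direction out of an activation (directional ablation). It is an illustration under stated assumptions, not the paper's implementation. The function names, the toy 3-dimensional vectors, and the "harmful"/"harmless" labels are hypothetical, and the summary does not say which layers or token positions the paper uses. INLP is not shown.

```python
import math

def diff_in_means(pos, neg):
    """Unit vector pointing from the mean of `neg` to the mean of `pos`."""
    dim = len(pos[0])
    mean_pos = [sum(v[i] for v in pos) / len(pos) for i in range(dim)]
    mean_neg = [sum(v[i] for v in neg) / len(neg) for i in range(dim)]
    diff = [p - n for p, n in zip(mean_pos, mean_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def ablate(x, direction):
    """Remove the component of `x` along the unit vector `direction`."""
    coeff = sum(a * b for a, b in zip(x, direction))
    return [a - coeff * b for a, b in zip(x, direction)]

# Toy "activations", made up for illustration (not data from the paper).
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

r = diff_in_means(harmful, harmless)   # candidate "refusal" direction
ablated = ablate([3.0, 1.0, 5.0], r)   # activation with that direction removed
```

After ablation the vector has zero projection onto `r`, while its components orthogonal to `r` are unchanged.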
