Tag
Compares Diff-in-Means (DiM) and Iterative Nullspace Projection (INLP) methods for steering refusal in safety fine-tuned chat models, finding that INLP counterfactual flipping matches DiM directional ablation for refusal suppression while offering more tunable interventions.
A user reports that Fable 5 accepts prompts and consumes tokens but then refuses to answer, highlighting a low threshold for acceptance and inefficient token usage.
This paper introduces Cartograph, a verification layer for AI scientists that couples subspace experiment steering, ambiguity resolution, and library inadequacy detection. The framework outperforms baselines in autonomous discovery testbeds and retrospectively flags inconclusive claims in the A-Lab materials system.
This paper investigates how chain-of-thought reasoning in large reasoning models complicates activation-based steering of refusal behavior. Experiments on DeepSeek-R1-Distill-LLaMA-8B show that refusal is jointly encoded in residual stream activations and the CoT trace, making models more robust to activation-level interventions but exposing the CoT as an alternative attack surface.
Preliminary results from fine-tuning three 8B Llama 3 variants (Hermes, Dolphin, Llama-Instruct) with a 271-example curriculum show significant changes in refusal and uncertainty expression, suggesting that teaching authentic refusal values is more effective than compliance training.
OpenAI introduced 'safe completions,' a new safety-training approach in GPT-5 that replaces binary refusal-based training with output-centric rewards, improving both safety and helpfulness, especially for dual-use prompts. The method penalizes unsafe outputs and rewards helpful responses, resulting in fewer and less severe safety violations than refusal-trained models like o3.
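For readers unfamiliar with the DiM directional ablation mentioned in the first summary above, the following is a minimal sketch of the general technique, not the implementation from any of the listed papers. It assumes activations are plain lists of floats; the function names and the toy two-dimensional data are hypothetical and chosen only for illustration. INLP and counterfactual flipping are not shown.

```python
from math import sqrt


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def diff_in_means_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean.

    Hypothetical helper: in the diff-in-means setup, this direction is
    treated as a candidate "refusal direction" in activation space.
    """
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("class means coincide; no direction to extract")
    return [d / norm for d in diff]


def ablate_direction(activation, direction):
    """Remove the component of `activation` along the unit `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]


# Toy activations (made up): the two classes differ only along the first axis.
harmful = [[2.0, 0.0], [4.0, 0.0]]
harmless = [[0.0, 0.0], [0.0, 0.0]]

refusal_dir = diff_in_means_direction(harmful, harmless)
ablated = ablate_direction([5.0, 7.0], refusal_dir)
```

In this toy case the extracted direction is the first axis, so ablation zeroes the first coordinate and leaves the second untouched. In practice the same projection would be applied to residual-stream activations at one or more layers.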