refusal

#refusal

Refusal Beyond a Single Direction: A Preliminary Comparison of Diff-in-Means and INLP

arXiv cs.AI · 6d ago Cached

Compares Diff-in-Means (DiM) and Iterative Nullspace Projection (INLP) for steering refusal in safety fine-tuned chat models, finding that INLP counterfactual flipping matches DiM directional ablation at suppressing refusal while offering more tunable interventions.

Why does Fable 5 have such low threshold of accepting prompts as it keeps using tokens but refuse to answer eventually

Reddit r/ArtificialInteligence · 2026-06-11

A user reports that Fable 5 accepts prompts and consumes tokens while working on them, only to refuse to answer in the end, and asks why the bar for accepting a prompt is so low when the tokens spent are then wasted.

When Should an AI Scientist Stop? Verifiable Experiment Steering and Refusal for Autonomous Discovery

arXiv cs.LG · 2026-06-09 Cached

This paper introduces Cartograph, a verification layer for AI scientists that couples subspace experiment steering, ambiguity resolution, and library inadequacy detection. The framework outperforms baselines in autonomous discovery testbeds and retrospectively flags inconclusive claims in the A-Lab materials system.

Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal

arXiv cs.AI · 2026-05-27 Cached

This paper investigates how chain-of-thought reasoning in large reasoning models complicates activation-based steering of refusal behavior. Experiments on DeepSeek-R1-Distill-LLaMA-8B show that refusal is jointly encoded in residual stream activations and the CoT trace, making models more robust to activation-level interventions but exposing the CoT as an alternative attack surface.

@m_shalia: Preliminary results from Three Babies are in and I need to talk about this. We fine-tuned three 8B models that share th…

X AI KOLs Following · 2026-05-15 Cached

Preliminary results from fine-tuning three 8B Llama 3 variants (Hermes, Dolphin, Llama-Instruct) with a 271-example curriculum show significant changes in refusal and uncertainty expression, suggesting that teaching authentic refusal values is more effective than compliance training.

From hard refusals to safe-completions: toward output-centric safety training

OpenAI Blog · 2025-08-07 Cached

OpenAI introduced 'safe completions', a safety-training approach in GPT-5 that replaces binary refusal-based training with output-centric rewards, improving both safety and helpfulness, especially for dual-use prompts. The method penalizes unsafe outputs and rewards helpful responses, resulting in fewer and less severe safety violations than refusal-trained models such as o3.
