refusal-steering

Expert-Aware Refusal Steering

arXiv cs.CL · 3d ago

This paper extends refusal steering, an activation-based jailbreaking technique, to Mixture-of-Experts (MoE) LLMs. It finds that MoE routing patterns do not inhibit steering, and it proposes expert-aware methods that can suppress refusal behavior based on a single expert's output.
