Tag
This paper extends refusal steering (activation-based jailbreaking) to Mixture-of-Experts LLMs, finding that MoE routing patterns do not inhibit steering, and proposes expert-aware methods that can suppress refusal behavior based on a single expert's output.
An informal research note describing a behavior in transformers in which the model's inherent 'clarity-seeking' vectors can bypass constraints when discussing higher-order topics, potentially relevant to alignment and safety research.
This paper introduces TRACE, a framework for turn-aware credit assignment in multi-turn LLM jailbreaking attacks using reinforcement learning, claiming significant improvements in attack success rates and defense alignment.
OpenGuardrails is an open-source platform for AI safety, offering context-aware content-safety and manipulation detection (e.g., prompt injection, jailbreaking) via a unified model, plus a separate NER pipeline for data-leakage identification. It reports state-of-the-art performance on safety benchmarks and supports private, enterprise-grade deployment.