jailbreak-detection

What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics

arXiv cs.CL · 2026-06-25

This paper investigates how jailbreak attempts are encoded in the internal representations of large language models by analyzing token-level predictive entropy trajectories across layers using the logit lens. It finds that entropy dynamics at intermediate layers are more discriminative than aggregate statistics, providing a training-free detection method consistent across multiple models.
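As a rough illustration of what a per-layer entropy trajectory looks like, the sketch below computes the Shannon entropy of the softmax distribution at each layer. It assumes the logit-lens projection has already produced one vocabulary-sized logit vector per layer; the names `softmax_entropy`, `entropy_trajectory`, and `toy_logits` are hypothetical, and this is not the paper's implementation.

```python
import math

def softmax_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits), computed stably."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_trajectory(per_layer_logits):
    """One entropy value per layer, for a single token position."""
    return [softmax_entropy(layer) for layer in per_layer_logits]

# Toy data (assumed, not from the paper): the prediction sharpens with depth.
toy_logits = [
    [0.0, 0.0, 0.0, 0.0],   # early layer: uniform over 4 tokens
    [2.0, 0.0, 0.0, 0.0],   # middle layer: mild preference
    [10.0, 0.0, 0.0, 0.0],  # late layer: near-certain prediction
]
trajectory = entropy_trajectory(toy_logits)
```

A detector in this spirit would compare the shape of `trajectory` across layers rather than a single aggregate such as its mean.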

One Jailbreak, Many Tongues: Learning Language-Insensitive Intention Representations for Multilingual Jailbreak Detection

arXiv cs.CL · 2026-06-11

This paper proposes MLJailDe, a multilingual jailbreak detection framework that uses back-translation data augmentation and relative-distance constraints to improve cross-lingual generalization and robustness, achieving a 98.5% F1 score across 11 languages.

Gate AI: LLM Security Benchmark Evaluation Methodology and Results

arXiv cs.LG · 2026-06-03

This paper presents an evaluation methodology for LLM security detectors that addresses systematic weaknesses like per-dataset threshold tuning and undisclosed operating points. The framework uses cross-validation across 16 benchmarks, selects a single global operating point, and includes multiple diagnostics for generalization.
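To make the contrast with per-dataset threshold tuning concrete, here is a minimal sketch that picks one threshold for all benchmarks. The selection criterion (highest mean F1 across datasets), the function names, and the toy scores are all assumptions for illustration; the summary does not say how the paper chooses its operating point.

```python
def f1_at(scores, labels, threshold):
    """F1 of the rule 'flag if score >= threshold' against binary labels."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def pick_global_threshold(datasets, candidates):
    """Return the single candidate with the best mean F1 over all datasets."""
    def mean_f1(t):
        return sum(f1_at(s, y, t) for s, y in datasets) / len(datasets)
    return max(candidates, key=mean_f1)

# Toy benchmarks (assumed): each is (detector scores, ground-truth labels).
datasets = [
    ([0.2, 0.4, 0.6, 0.8], [0, 0, 1, 1]),
    ([0.1, 0.3, 0.35, 0.5, 0.9], [0, 1, 1, 1, 1]),
]
candidates = [0.25, 0.45, 0.7]
global_t = pick_global_threshold(datasets, candidates)
```

On the first toy benchmark alone, 0.45 would score higher than the chosen global value, which is the kind of per-dataset optimism a single operating point rules out.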

Before the Last Token: Diagnosing Final-Token Safety Probe Failures

arXiv cs.LG · 2026-05-14

This paper investigates failures of final-token safety probes on jailbreak prompts, finding that harmful content can be distributed across earlier tokens and missed by the final readout. It proposes a PCA-HMM trajectory model as a diagnostic tool that recovers many misses without the false positives of naive token pooling.
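The failure mode and the pooling trade-off can be seen in a toy example. The sketch below compares a final-token readout with naive max pooling over hypothetical per-token probe scores; the scores, threshold, and function names are assumptions, and the paper's PCA-HMM trajectory model is not implemented here.

```python
def final_token_flag(token_scores, threshold):
    """Flag the prompt using only the probe score at the last token."""
    return token_scores[-1] >= threshold

def max_pool_flag(token_scores, threshold):
    """Flag the prompt if any token's probe score crosses the threshold."""
    return max(token_scores) >= threshold

threshold = 0.6
# Hypothetical jailbreak: the harmful signal peaks mid-prompt, then fades.
attack_scores = [0.1, 0.2, 0.9, 0.3, 0.1]
# Hypothetical benign prompt with one noisy token.
benign_scores = [0.1, 0.7, 0.1, 0.1]

missed_by_final = not final_token_flag(attack_scores, threshold)
caught_by_pool = max_pool_flag(attack_scores, threshold)
pool_false_positive = max_pool_flag(benign_scores, threshold)
```

Here the final-token readout misses the attack, while max pooling catches it but also flags the benign prompt, which is the gap a trajectory-level model aims to close.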

An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models

arXiv cs.CL · 2026-04-22

This empirical study shows that sampling multiple generations per prompt significantly improves jailbreak detection in LLMs, surfacing harmful outputs that single-generation audits miss.
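A back-of-the-envelope calculation shows why more samples help. The sketch below assumes each generation is independently harmful with the same probability, which is a simplifying assumption of this example rather than a claim from the study; `prob_any_harmful` is a hypothetical helper.

```python
def prob_any_harmful(p_single, k):
    """Chance that at least one of k independent generations is harmful."""
    if not 0.0 <= p_single <= 1.0 or k < 0:
        raise ValueError("p_single must be in [0, 1] and k must be >= 0")
    return 1.0 - (1.0 - p_single) ** k

# With a 10% per-generation harmful rate, compare 1 sample against 10.
single = prob_any_harmful(0.1, 1)
ten = prob_any_harmful(0.1, 10)
```

Under these assumptions a single-generation audit sees a harmful output 10% of the time, while ten generations surface at least one roughly 65% of the time.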

TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis

arXiv cs.CL · 2026-04-20

TRIDENT is a framework and dataset synthesis pipeline for enhancing LLM safety through tri-dimensional red-teaming data that covers lexical diversity, malicious intent, and jailbreak tactics. Fine-tuning Llama-3.1-8B on TRIDENT-Edge achieves a 14.29% reduction in Harm Score and a 20% decrease in Attack Success Rate compared to baseline models.
