jailbreak-detection

Gate AI: LLM Security Benchmark Evaluation Methodology and Results

arXiv cs.LG · yesterday

This paper presents an evaluation methodology for LLM security detectors that addresses systematic weaknesses such as per-dataset threshold tuning and undisclosed operating points. The framework uses cross-validation across 16 benchmarks, selects a single global operating point, and includes multiple diagnostics for generalization.
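To make the idea of a single global operating point concrete, here is a minimal sketch that pools scored examples from several benchmarks and picks one threshold for all of them, instead of tuning a threshold per dataset. It is not the paper's procedure: the false-positive budget as the selection criterion, the function names, and the toy benchmark data are all assumptions made for illustration, and the cross-validation step is left out.

```python
from typing import Dict, List, Tuple

# Each example is (detector score, label); label 1 = attack, 0 = benign.
Scored = List[Tuple[float, int]]


def false_positive_rate(data: Scored, threshold: float) -> float:
    """Fraction of benign examples flagged at this threshold."""
    benign = [score for score, label in data if label == 0]
    if not benign:
        return 0.0
    return sum(score >= threshold for score in benign) / len(benign)


def recall(data: Scored, threshold: float) -> float:
    """Fraction of attack examples flagged at this threshold."""
    attacks = [score for score, label in data if label == 1]
    if not attacks:
        return 0.0
    return sum(score >= threshold for score in attacks) / len(attacks)


def pick_global_threshold(benchmarks: Dict[str, Scored], max_fpr: float) -> float:
    """Pick ONE threshold on the pooled data (assumed criterion: FPR budget).

    The false-positive rate never increases as the threshold rises, so the
    lowest candidate that meets the budget also gives the highest recall.
    """
    pooled = [pair for data in benchmarks.values() for pair in data]
    for threshold in sorted({score for score, _ in pooled}):
        if false_positive_rate(pooled, threshold) <= max_fpr:
            return threshold
    return float("inf")  # no observed score meets the budget: flag nothing


# Hypothetical toy data, for illustration only.
benchmarks = {
    "bench_a": [(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1)],
    "bench_b": [(0.3, 0), (0.6, 0), (0.7, 1), (0.4, 1)],
}
threshold = pick_global_threshold(benchmarks, max_fpr=0.25)
for name, data in benchmarks.items():
    print(name, false_positive_rate(data, threshold), recall(data, threshold))
```

Reporting per-benchmark rates at the one shared threshold, as the final loop does, exposes how unevenly a fixed operating point can behave across datasets, which per-dataset tuning would hide.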

Before the Last Token: Diagnosing Final-Token Safety Probe Failures

arXiv cs.LG · 2026-05-14

This paper investigates failures of final-token safety probes on jailbreak prompts, finding that harmful content can be distributed across earlier tokens and missed by the final readout. It proposes a PCA-HMM trajectory model as a diagnostic tool that recovers many misses without the false positives of naive token pooling.

An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models

arXiv cs.CL · 2026-04-22

This empirical study shows that multi-generation sampling significantly improves jailbreak detection in LLMs, revealing harmful outputs that single-generation audits miss.
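A rough sketch of why sampling several generations can surface outputs that a single-generation audit misses: the prompt is flagged if any of k sampled responses is judged harmful. The `generate` and `is_harmful` callables are hypothetical placeholders for a model call and a harm classifier, and the closed-form estimate assumes samples are independent, which may not hold in practice. This illustrates the general idea rather than the study's protocol.

```python
from typing import Callable


def flagged_by_any(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical: samples one response
    is_harmful: Callable[[str], bool],   # hypothetical: judges one response
    k: int,
) -> bool:
    """Return True if any of up to k sampled generations is judged harmful."""
    return any(is_harmful(generate(prompt)) for _ in range(k))


def detection_probability(p: float, k: int) -> float:
    """Chance that at least one of k samples is harmful.

    Assumes each sample is independently harmful with probability p.
    """
    return 1.0 - (1.0 - p) ** k
```

Under the independence assumption, a prompt that yields a harmful response only 10% of the time is caught by a single sample with probability 0.1, but `detection_probability(0.1, 10)` is about 0.65.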

TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis

arXiv cs.CL · 2026-04-20

TRIDENT is a novel framework and dataset synthesis pipeline for enhancing LLM safety through tri-dimensional red-teaming data that covers lexical diversity, malicious intent, and jailbreak tactics. Fine-tuning Llama-3.1-8B on TRIDENT-Edge achieves a 14.29% reduction in Harm Score and a 20% decrease in Attack Success Rate compared to baseline models.
