llm-safety

#llm-safety

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

arXiv cs.CL ↗ · yesterday Cached

This paper introduces a paired-prompt protocol to measure 'evaluation-context divergence' in open-weight LLMs, finding that models behave differently depending on whether prompts are framed as evaluations or live deployments. The study highlights heterogeneity across models, with some being 'eval-cautious' and others 'deployment-cautious', raising concerns about the validity of safety benchmarks.

0 favorites 0 likes

#llm-safety

XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity

arXiv cs.CL ↗ · yesterday Cached

XL-SafetyBench is a benchmark of 5,500 test cases across 10 country-language pairs to evaluate LLM safety and cultural sensitivity, distinguishing jailbreak robustness from cultural awareness.

0 favorites 0 likes

#llm-safety

One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

arXiv cs.CL ↗ · yesterday Cached

Presents TurnGate, a turn-level monitor that detects hidden malicious intent in multi-turn dialogues by identifying the earliest turn where a response would enable harmful action, along with the Multi-Turn Intent Dataset (MTID) to support training and evaluation.

0 favorites 0 likes

#llm-safety

When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

Hugging Face Daily Papers ↗ · 2d ago Cached

This paper introduces a framework for validating comparative LLM safety scoring without ground-truth labels, using an 'instrumental-validity chain' to establish deployment evidence. It demonstrates the method using a local-first tool called SimpleAudit on Norwegian safety packs and compares models like Borealis and Gemma 3.

0 favorites 0 likes

#llm-safety

HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

arXiv cs.CL ↗ · 2026-04-22 Cached

Researchers introduce HarDBench, a benchmark exposing how LLMs can be jailbroken via malicious drafts in collaborative writing, and propose a preference-optimization defense that cuts harmful outputs without hurting co-authoring utility.

0 favorites 0 likes

#llm-safety

An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models

arXiv cs.CL ↗ · 2026-04-22 Cached

Empirical study shows multi-generation sampling significantly improves jailbreak detection in LLMs, revealing hidden harmful outputs that single-generation audits miss.

0 favorites 0 likes

#llm-safety

When Choices Become Risks: Safety Failures of Large Language Models under Multiple-Choice Constraints

arXiv cs.CL ↗ · 2026-04-21 Cached

Researchers identify a systematic safety failure in LLMs where reformulating harmful requests as forced-choice multiple-choice questions (MCQs) bypasses refusal behavior, even in models that reject equivalent open-ended prompts. Evaluated across 14 proprietary and open-source models, the study reveals current safety benchmarks substantially underestimate risks in structured decision-making settings.

0 favorites 0 likes

#llm-safety

DART: Mitigating Harm Drift in Difference-Aware LLMs via Distill-Audit-Repair Training

arXiv cs.CL ↗ · 2026-04-21 Cached

DART (Distill-Audit-Repair Training) is a new training framework that addresses 'harm drift' in safety-aligned LLMs, where fine-tuning for demographic difference-awareness causes harmful content to appear in model explanations. On eight benchmarks, DART improves Llama-3-8B-Instruct accuracy from 39.0% to 68.8% while reducing harm drift cases by 72.6%.

0 favorites 0 likes

#llm-safety

HalluSAE: Detecting Hallucinations in Large Language Models via Sparse Auto-Encoders

arXiv cs.CL ↗ · 2026-04-21 Cached

Researchers from Beihang University and other institutions propose HalluSAE, a framework using sparse autoencoders and phase transition theory to detect hallucinations in LLMs by modeling generation as trajectories through a potential energy landscape and identifying critical transition zones where factual errors occur.

0 favorites 0 likes

#llm-safety