llm-safety

Tag

Cards List
#llm-safety

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

arXiv cs.CL · yesterday Cached

This paper introduces a paired-prompt protocol to measure 'evaluation-context divergence' in open-weight LLMs, finding that models behave differently depending on whether prompts are framed as evaluations or live deployments. The study highlights heterogeneity across models, with some being 'eval-cautious' and others 'deployment-cautious', raising concerns about the validity of safety benchmarks.

0 favorites 0 likes
#llm-safety

XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity

arXiv cs.CL · yesterday Cached

XL-SafetyBench is a benchmark of 5,500 test cases across 10 country-language pairs to evaluate LLM safety and cultural sensitivity, distinguishing jailbreak robustness from cultural awareness.

0 favorites 0 likes
#llm-safety

One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

arXiv cs.CL · yesterday Cached

Presents TurnGate, a turn-level monitor that detects hidden malicious intent in multi-turn dialogues by identifying the earliest turn where a response would enable harmful action, along with the Multi-Turn Intent Dataset (MTID) to support training and evaluation.

0 favorites 0 likes
#llm-safety

When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

Hugging Face Daily Papers · 2d ago Cached

This paper introduces a framework for validating comparative LLM safety scoring without ground-truth labels, using an 'instrumental-validity chain' to establish deployment evidence. It demonstrates the method using a local-first tool called SimpleAudit on Norwegian safety packs and compares models like Borealis and Gemma 3.

0 favorites 0 likes
#llm-safety

HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

arXiv cs.CL · 2026-04-22 Cached

Researchers introduce HarDBench, a benchmark exposing how LLMs can be jailbroken via malicious drafts in collaborative writing, and propose a preference-optimization defense that cuts harmful outputs without hurting co-authoring utility.

0 favorites 0 likes
#llm-safety

An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models

arXiv cs.CL · 2026-04-22 Cached

Empirical study shows multi-generation sampling significantly improves jailbreak detection in LLMs, revealing hidden harmful outputs that single-generation audits miss.

0 favorites 0 likes
#llm-safety

When Choices Become Risks: Safety Failures of Large Language Models under Multiple-Choice Constraints

arXiv cs.CL · 2026-04-21 Cached

Researchers identify a systematic safety failure in LLMs where reformulating harmful requests as forced-choice multiple-choice questions (MCQs) bypasses refusal behavior, even in models that reject equivalent open-ended prompts. Evaluated across 14 proprietary and open-source models, the study reveals current safety benchmarks substantially underestimate risks in structured decision-making settings.

0 favorites 0 likes
#llm-safety

DART: Mitigating Harm Drift in Difference-Aware LLMs via Distill-Audit-Repair Training

arXiv cs.CL · 2026-04-21 Cached

DART (Distill-Audit-Repair Training) is a new training framework that addresses 'harm drift' in safety-aligned LLMs, where fine-tuning for demographic difference-awareness causes harmful content to appear in model explanations. On eight benchmarks, DART improves Llama-3-8B-Instruct accuracy from 39.0% to 68.8% while reducing harm drift cases by 72.6%.

0 favorites 0 likes
#llm-safety

HalluSAE: Detecting Hallucinations in Large Language Models via Sparse Auto-Encoders

arXiv cs.CL · 2026-04-21 Cached

Researchers from Beihang University and other institutions propose HalluSAE, a framework using sparse autoencoders and phase transition theory to detect hallucinations in LLMs by modeling generation as trajectories through a potential energy landscape and identifying critical transition zones where factual errors occur.

0 favorites 0 likes
#llm-safety

RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models

arXiv cs.CL · 2026-04-20 Cached

RedBench introduces a universal dataset aggregating 37 benchmark datasets with 29,362 samples across 22 risk categories and 19 domains to enable standardized and comprehensive red teaming evaluation of large language models. The work addresses inconsistencies in existing red teaming datasets and provides baselines, evaluation code, and open-source resources for assessing LLM robustness against adversarial prompts.

0 favorites 0 likes
#llm-safety

TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis

arXiv cs.CL · 2026-04-20 Cached

TRIDENT is a novel framework and dataset synthesis pipeline for enhancing LLM safety through tri-dimensional red-teaming data that covers lexical diversity, malicious intent, and jailbreak tactics. Fine-tuning Llama-3.1-8B on TRIDENT-Edge achieves 14.29% reduction in Harm Score and 20% decrease in Attack Success Rate compared to baseline models.

0 favorites 0 likes
#llm-safety

A Case Study on the Impact of Anonymization Along the RAG Pipeline

arXiv cs.CL · 2026-04-20 Cached

This case study empirically investigates where anonymization should be applied in Retrieval-Augmented Generation (RAG) pipelines to balance privacy and utility, examining the impact of anonymization at different stages (dataset vs. generated answer) to inform privacy risk mitigation strategies.

0 favorites 0 likes
#llm-safety

Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs

arXiv cs.CL · 2026-04-20 Cached

This paper introduces a resource-efficient pruning framework that identifies and removes parameters associated with unsafe behaviors in large language models while preserving utility. Using gradient-free attribution and the Lottery Ticket Hypothesis perspective, the method achieves significant reductions in unsafe generations and improved robustness against jailbreak attacks with minimal performance loss.

0 favorites 0 likes
#llm-safety

FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models

arXiv cs.CL · 2026-04-20 Cached

FineSteer is a novel inference-time steering framework that decomposes steering into conditional steering and fine-grained vector synthesis stages, using Subspace-guided Conditional Steering (SCS) and Mixture-of-Steering-Experts (MoSE) mechanisms to improve safety and truthfulness while preserving model utility. Experiments show 7.6% improvement over state-of-the-art methods on TruthfulQA with minimal utility loss.

0 favorites 0 likes
#llm-safety

A Systematic Study of Training-Free Methods for Trustworthy Large Language Models

arXiv cs.CL · 2026-04-20 Cached

A systematic study evaluating training-free methods for improving trustworthiness in large language models, categorizing approaches into input, internal, and output-level interventions while analyzing trade-offs between trustworthiness, utility, and robustness.

0 favorites 0 likes
#llm-safety

ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

Hugging Face Daily Papers · 2026-04-14 Cached

ASGuard is a mechanistically-informed defense framework that mitigates jailbreaking attacks on LLMs by identifying vulnerable attention heads through circuit analysis and applying targeted activation scaling and fine-tuning to improve refusal behavior robustness while preserving model capabilities.

0 favorites 0 likes
#llm-safety

Improving instruction hierarchy in frontier LLMs

OpenAI Blog · 2026-03-10 Cached

OpenAI presents a training approach using instruction-hierarchy tasks to improve LLM safety and reliability by teaching models to properly prioritize instructions based on trust levels (system > developer > user > tool). The method addresses prompt-injection attacks and safety steerability through reinforcement learning with a new dataset called IH-Challenge.

0 favorites 0 likes
#llm-safety

Estimating worst case frontier risks of open weight LLMs

OpenAI Blog · 2025-08-05 Cached

OpenAI researchers study worst-case frontier risks of releasing open-weight LLMs through malicious fine-tuning (MFT) in biology and cybersecurity domains, finding that open-weight models underperform frontier closed-weight models and don't substantially advance harmful capabilities.

0 favorites 0 likes
#llm-safety

Google DeepMind at NeurIPS 2024

Google DeepMind Blog · 2024-12-05 Cached

Google DeepMind announces their presence at NeurIPS 2024 with over 100 papers covering adaptive AI agents, 3D scene creation, and LLM training safety, including Test of Time awards for influential foundational work and live demonstrations of Gemma Scope and other applications.

0 favorites 0 likes
#llm-safety

A hazard analysis framework for code synthesis large language models

OpenAI Blog · 2022-07-25 Cached

OpenAI presents a hazard analysis framework for evaluating safety risks associated with code synthesis LLMs like Codex, examining technical, social, political, and economic impacts through a novel evaluation methodology for code generation capabilities.

0 favorites 0 likes
← Back to home

Submit Feedback