When Behavioral Safety Evaluation Fails: A Representation-Level Perspective
Summary
This paper introduces the concept of the audit gap between behavioral safety and representation-level robustness in LLMs, proposing an intervention-based evaluation framework and the Latent Vulnerability Score (LVS) to measure hidden vulnerabilities.
Cached at: 06/10/26, 09:46 PM
Source: https://huggingface.co/papers/2606.08044
Abstract
Behavioral safety evaluations of large language models provide incomplete insights into internal robustness, as demonstrated by the audit gap between observable outputs and latent space vulnerabilities revealed through intervention-based testing.
Large Language Model (LLM) safety has often been evaluated at the behavior level, which provides limited evidence of internal robustness, as these evaluations target outputs rather than representation-level vulnerability under intervention. We formalize this discrepancy as the audit gap: the difference between behavioral safety and robustness under intervention. To study this gap, we construct dissociated models that preserve safe outward behavior while remaining vulnerable in the latent space. We introduce an intervention-based evaluation framework to test model robustness through soft interventions in parameter and latent spaces, including harmful fine-tuning and layer-wise latent perturbations. To formalize the evaluation, we propose the Latent Vulnerability Score (LVS) to measure how easily harmful behavior can be elicited by bounded latent perturbations. Using this evaluation framework, we show that behavioral safety metrics are insufficient measures of representation-level robustness across multiple safely and unsafely aligned state-of-the-art models. Notably, dissociated models show substantially elevated LVSs despite comparable refusal behavior under harmful intervention, with intermediate representations being the most sensitive to intervention. Our results suggest that behavioral safety evaluation alone provides an incomplete picture of model robustness, motivating representation-aware audits of latent vulnerability and observable behavior.
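The abstract does not state the LVS formula, so the sketch below is not the paper's definition. It is a toy illustration of the general idea of probing a hidden state with bounded perturbations, under loud assumptions: the hidden state is a plain vector, "refusal" is a positive projection onto a hypothetical `refusal_direction`, and the names `sample_in_ball` and `elicitation_rate` are made up for this example. It shows how two toy states can both refuse when unperturbed yet differ sharply once perturbations within an L2 ball are applied.

```python
import numpy as np


def sample_in_ball(rng, n, dim, radius):
    """Draw n vectors uniformly from an L2 ball of the given radius."""
    directions = rng.standard_normal((n, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Scaling by U^(1/dim) makes the samples uniform in volume.
    radii = radius * rng.random(n) ** (1.0 / dim)
    return directions * radii[:, None]


def elicitation_rate(hidden, refusal_direction, radius, n_samples=1000, seed=0):
    """Fraction of bounded perturbations that flip a toy refusal score.

    Assumption (not from the paper): the model "refuses" when the projection
    of the hidden state onto refusal_direction is positive.
    """
    rng = np.random.default_rng(seed)
    deltas = sample_in_ball(rng, n_samples, hidden.shape[0], radius)
    scores = (hidden + deltas) @ refusal_direction
    return float(np.mean(scores <= 0.0))


dim = 16
refusal_direction = np.zeros(dim)
refusal_direction[0] = 1.0  # hypothetical unit-norm refusal direction

# Both toy states refuse when unperturbed, but with very different margins.
hidden_robust = 5.0 * refusal_direction
hidden_dissociated = 0.2 * refusal_direction

refuses_robust = bool(hidden_robust @ refusal_direction > 0.0)
refuses_dissociated = bool(hidden_dissociated @ refusal_direction > 0.0)

rate_robust = elicitation_rate(hidden_robust, refusal_direction, radius=1.0)
rate_dissociated = elicitation_rate(hidden_dissociated, refusal_direction, radius=1.0)

print("unperturbed refusal:", refuses_robust, refuses_dissociated)
print("flip rate under bounded perturbation:", rate_robust, rate_dissociated)
```

In this toy setup a behavior-level check sees two refusals and cannot tell the states apart, while the perturbation probe separates them, which mirrors the audit gap the abstract describes. A real audit would perturb actual layer activations of a model rather than a hand-built vector.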
Similar Articles
When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels
This paper introduces a framework for validating comparative LLM safety scoring without ground-truth labels, using an 'instrumental-validity chain' to establish deployment evidence. It demonstrates the method using a local-first tool called SimpleAudit on Norwegian safety packs and compares models like Borealis and Gemma 3.
Towards Security-Auditable LLM Agents: A Unified Graph Representation
This paper introduces Agent-BOM, a unified graph representation for security auditing in LLM-based agentic systems. It addresses the semantic gap in post-hoc auditing by modeling static capabilities and dynamic runtime states to detect complex attack chains like memory poisoning and tool misuse.
Stress-testing medical large language models reveals latent safety pathology beyond benchmark accuracy
This paper introduces AI-MASLD, a stress-audit framework for medical LLMs that reveals how benchmark accuracy can hide serious safety failures, and demonstrates that open-weight models can match or exceed proprietary ones on safety dimensions.
Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents
This paper investigates how incorporating web retrieval into LLM agents can degrade safety alignment, revealing the 'Safe Source Paradox' where even safety-oriented documents increase harmful compliance. It introduces the AgentREVEAL diagnostic framework and HarmURLBench benchmark to analyze and evaluate retrieval-induced safety vulnerabilities.
PromptAudit: Auditing Prompt Sensitivity in LLM-Based Vulnerability Detection
PromptAudit is a controlled evaluation framework that isolates the effects of prompt formulations on LLM-based vulnerability detection, finding that chain-of-thought prompting achieves the best overall performance while prompt sensitivity must be treated as a first-class system property.