When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

Hugging Face Daily Papers 05/07/26, 12:00 AM Papers

Summary

This paper introduces a framework for validating comparative LLM safety scoring without ground-truth labels, using an 'instrumental-validity chain' to establish deployment evidence. It demonstrates the method using a local-first tool called SimpleAudit on Norwegian safety packs and compares models like Borealis and Gemma 3.

Many deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory regime. We formalize this setting as benchmarkless comparative safety scoring and specify the contract under which a scenario-based audit can be interpreted as deployment evidence. Scores are valid only under a fixed scenario pack, rubric, auditor, judge, sampling configuration, and rerun budget. Because no labels are available, we replace ground-truth agreement with an instrumental-validity chain: responsiveness to a controlled safe-versus-abliterated contrast, dominance of target-driven variance over auditor and judge artifacts, and stability across reruns. We instantiate the chain in SimpleAudit, a local-first scoring instrument, and validate it on a Norwegian safety pack. Safe and abliterated targets separate with AUROC values between 0.89 and 1.00, target identity is the dominant variance component (η² ≈ 0.52), and severity profiles stabilize by ten reruns. Applying the same chain to Petri shows that it admits both tools. The substantial differences arise upstream of the chain, in claim-contract enforcement and deployment fit. A Norwegian public-sector procurement case comparing Borealis and Gemma 3 demonstrates the resulting evidence in practice: the safer model depends on scenario category and risk measure. Consequently, scores, matched deltas, critical rates, uncertainty, and the auditor and judge used must be reported together rather than collapsed into a single ranking.
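The responsiveness link of the chain can be illustrated with a small sketch. It computes AUROC as the probability that a randomly chosen abliterated-target score exceeds a randomly chosen safe-target score, with ties counted as one half. This is not SimpleAudit's implementation: the function name `auroc`, the score lists, and the convention that a higher score means a more harmful response are all assumptions made for illustration.

```python
from itertools import product


def auroc(positive_scores, negative_scores):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    if not positive_scores or not negative_scores:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for pos, neg in product(positive_scores, negative_scores):
        if pos > neg:
            wins += 1.0
        elif pos == neg:
            wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical per-scenario severity scores, invented for this example
# (assumed convention: higher = more harmful response).
abliterated_scores = [3, 4, 2, 4, 3, 1]
safe_scores = [0, 1, 0, 2, 1, 0]

separation = auroc(abliterated_scores, safe_scores)
print(f"AUROC = {separation:.2f}")
```

An AUROC near 0.5 would mean the instrument cannot tell the two targets apart, while values near 1.0 indicate that it responds to the controlled contrast.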

Original Article


Cached at: 05/08/26, 10:54 AM


Source: https://huggingface.co/papers/2605.06652

Abstract

Comparative safety scoring without labeled benchmarks relies on scenario-based audits with validity chains measuring responsiveness, variance dominance, and stability to establish deployment evidence.

Many deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory regime. We formalize this setting as benchmarkless comparative safety scoring and specify the contract under which a scenario-based audit can be interpreted as deployment evidence. Scores are valid only under a fixed scenario pack, rubric, auditor, judge, sampling configuration, and rerun budget. Because no labels are available, we replace ground-truth agreement with an instrumental-validity chain: responsiveness to a controlled safe-versus-abliterated contrast, dominance of target-driven variance over auditor and judge artifacts, and stability across reruns. We instantiate the chain in SimpleAudit, a local-first scoring instrument, and validate it on a Norwegian safety pack. Safe and abliterated targets separate with AUROC values between 0.89 and 1.00, target identity is the dominant variance component (η² ≈ 0.52), and severity profiles stabilize by ten reruns. Applying the same chain to Petri shows that it admits both tools. The substantial differences arise upstream of the chain, in claim-contract enforcement and deployment fit. A Norwegian public-sector procurement case comparing Borealis and Gemma 3 demonstrates the resulting evidence in practice: the safer model depends on scenario category and risk measure. Consequently, scores, matched deltas, critical rates, uncertainty, and the auditor and judge used must be reported together rather than collapsed into a single ranking.
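The variance-dominance link can be sketched in the same spirit. The snippet below computes η² for a single factor as the ratio of that factor's between-group sum of squares to the total sum of squares, which is the standard definition. The paper's exact decomposition is not given on this page, so the record layout, the factor names `target` and `judge`, and all score values are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def eta_squared(records, factor):
    """Share of total score variance explained by one factor: SS_factor / SS_total."""
    scores = [r["score"] for r in records]
    grand_mean = fmean(scores)
    ss_total = sum((s - grand_mean) ** 2 for s in scores)
    if ss_total == 0:
        raise ValueError("scores have no variance to decompose")
    # Group scores by the level of the chosen factor.
    groups = defaultdict(list)
    for r in records:
        groups[r[factor]].append(r["score"])
    # Between-group sum of squares for this factor.
    ss_factor = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values())
    return ss_factor / ss_total


# Hypothetical balanced design: two targets scored by two judges.
records = [
    {"target": "model_a", "judge": "judge_x", "score": 1.0},
    {"target": "model_a", "judge": "judge_y", "score": 2.0},
    {"target": "model_b", "judge": "judge_x", "score": 4.0},
    {"target": "model_b", "judge": "judge_y", "score": 5.0},
]

for factor in ("target", "judge"):
    print(f"eta^2({factor}) = {eta_squared(records, factor):.2f}")
```

In this toy design the target factor accounts for most of the variance and the judge factor for little, which is the pattern the chain requires before score differences are attributed to the models rather than to the scoring setup.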




Similar Articles

Gate AI: LLM Security Benchmark Evaluation Methodology and Results

Stress-testing medical large language models reveals latent safety pathology beyond benchmark accuracy

Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents

The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs
