When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels
Summary
This paper introduces a framework for validating comparative LLM safety scoring without ground-truth labels, using an 'instrumental-validity chain' to establish deployment evidence. It demonstrates the method using a local-first tool called SimpleAudit on Norwegian safety packs and compares models like Borealis and Gemma 3.
View Cached Full Text
Cached at: 05/08/26, 10:54 AM
Paper page - When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels
Source: https://huggingface.co/papers/2605.06652
Abstract
Comparative safety scoring without labeled benchmarks relies on scenario-based audits with validity chains measuring responsiveness, variance dominance, and stability to establish deployment evidence.
Many deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory regime. We formalize this setting asbenchmarkless comparative safety scoringand specify the contract under which ascenario-based auditcan be interpreted as deployment evidence. Scores are valid only under a fixed scenario pack, rubric, auditor, judge, sampling configuration, andrerun budget. Because no labels are available, we replace ground-truth agreement with aninstrumental-validity chain: responsiveness to a controlled safe-versus-abliterated contrast, dominance oftarget-driven varianceover auditor andjudge artifacts, and stability across reruns. We instantiate the chain in SimpleAudit, alocal-first scoring instrument, and validate it on a Norwegiansafety pack. Safe and abliterated targets separate withAUROCvalues between 0.89 and 1.00, target identity is the dominant variance component (η^2 approx 0.52), and severity profiles stabilize by ten reruns. Applying the same chain to Petri shows that it admits both tools. The substantial differences arise upstream of the chain, in claim-contract enforcement and deployment fit. A Norwegian public-sector procurement case comparing Borealis and Gemma 3 demonstrates the resulting evidence in practice: the safer model depends on scenario category and risk measure. Consequently, scores, matched deltas, critical rates, uncertainty, and the auditor and judge used must be reported together rather than collapsed into a single ranking.
View arXiv pageView PDFGitHub14Add to collection
Get this paper in your agent:
hf papers read 2605\.06652
Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash
Models citing this paper0
No model linking this paper
Cite arxiv.org/abs/2605.06652 in a model README.md to link it from this page.
Datasets citing this paper0
No dataset linking this paper
Cite arxiv.org/abs/2605.06652 in a dataset README.md to link it from this page.
Spaces citing this paper0
No Space linking this paper
Cite arxiv.org/abs/2605.06652 in a Space README.md to link it from this page.
Collections including this paper0
No Collection including this paper
Add this paper to acollectionto link it from this page.
Similar Articles
Gate AI: LLM Security Benchmark Evaluation Methodology and Results
This paper presents an evaluation methodology for LLM security detectors that addresses systematic weaknesses like per-dataset threshold tuning and undisclosed operating points. The framework uses cross-validation across 16 benchmarks, selects a single global operating point, and includes multiple diagnostics for generalization.
Stress-testing medical large language models reveals latent safety pathology beyond benchmark accuracy
This paper introduces AI-MASLD, a stress-audit framework for medical LLMs that reveals how benchmark accuracy can hide serious safety failures, and demonstrates that open-weight models can match or exceed proprietary ones on safety dimensions.
Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents
This paper argues that aggregate-score leaderboards for LLM agent benchmarks fail to capture deployment-relevant dimensions and show rank instability. It proposes ranking configurations by predictive validity—the correlation between in-sample and out-of-sample rank—and introduces a twelve-tier measurement apparatus along with falsifiable out-of-distribution criteria.
The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection
This paper identifies distribution shift and scale constraints as critical failure modes for statistical contamination detection methods in LLM benchmark auditing. Evaluating three paradigms across 27 models reveals only 199 correct outcomes out of 335 evaluations, indicating a systematic reliability gap that prevents these methods from replacing transparent data provenance.
IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs
The paper introduces IndustryBench, a benchmark evaluating LLMs on industrial procurement QA in Chinese against national standards, highlighting safety compliance gaps. It reveals that extended reasoning often lowers safety-adjusted scores and reshuffles model rankings when safety violations are considered.