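I-SAFE: Wasserstein Coherence Metrics for Structural Auditing of Scientific AI Models

arXiv cs.LG paper

Summary

This paper introduces I-SAFE, a post-hoc distributional auditing framework for scientific AI models based on the Wasserstein Coherence Metric. It reveals structural differences in model outputs that accuracy-based evaluation cannot capture. Demonstrated on drug–target interaction prediction, the framework is model-agnostic and applicable to any domain with structured inputs and external prior knowledge.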

arXiv:2605.21731v1 Announce Type: new

# I-SAFE: Wasserstein Coherence Metrics for Structural Auditing of Scientific AI Models
Source: https://arxiv.org/html/2605.21731
Barbara Tarantino, Department of Economics, University of Pavia, Via S. Felice Al Monastero, 5, Pavia, 27100, Italy
Gennaro Auricchio, Department of Mathematics, University of Padua, Via Trieste, 63, Padua, 35131, Italy
Paolo Giudici, Department of Economics, University of Pavia, Via S. Felice Al Monastero, 5, Pavia, 27100, Italy

###### Abstract

Deep learning models are increasingly used in scientific prediction tasks where strong benchmark performance is often interpreted as evidence of scientifically meaningful behavior. This interpretation is fragile, as models may exploit shortcut features, dataset-specific regularities, or distributional biases that are predictive on held-out data but not aligned with domain-relevant structure. To address this limitation, we introduce the I-SAFE (Interventional Secure, Accurate, Fair and Explainable) framework, a post-hoc distributional auditing framework for scientific AI models centered on the Wasserstein Coherence Metric (WCM). Given a trained black-box predictor and an external structural prior encoding domain knowledge about task-relevant input structure, I-SAFE evaluates raw model outputs under structurally guided perturbations of the input. The proposed audit measures output-distribution coherence through three complementary metrics: a Quantile-Based Metric (QBM) for location-level coherence, the WCM for ordinal coherence, and a translation-invariant WCM variant for shape coherence. We instantiate I-SAFE on drug–target interaction (DTI) prediction using the Davis kinase benchmark, KLIFS (Kinase–Ligand Interaction Fingerprints and Structures) binding-pocket annotations, and three sequence-based DTI models: DeepConvDTI, DeepDTA, and TAPB. Although the models operate in a comparable predictive regime, I-SAFE reveals substantially different distributional response profiles, a distinction invisible to accuracy-based evaluation. The framework is model-agnostic and applicable to any domain where inputs admit a structured decomposition and an external prior is available.

## 1 Introduction

Held-out predictive performance remains the dominant criterion for evaluating deep learning models in scientific machine learning, including molecular property prediction, drug–target interaction (DTI), and gene perturbation response [37 (https://arxiv.org/html/2605.21731#bib.bib37), 14 (https://arxiv.org/html/2605.21731#bib.bib14), 21 (https://arxiv.org/html/2605.21731#bib.bib21)]. In these settings, strong benchmark performance is often interpreted as evidence of scientifically meaningful behavior. This interpretation is fragile, as predictive accuracy measures depend on statistical regularities present in the benchmark distribution, and not only on whether the model correctly identified the relevant structures of the input [25 (https://arxiv.org/html/2605.21731#bib.bib25)]. This gap is central to modern machine learning, where high-capacity models can achieve strong performance by exploiting shortcuts that do not necessarily align with the scientific structure of interest [12 (https://arxiv.org/html/2605.21731#bib.bib12), 15 (https://arxiv.org/html/2605.21731#bib.bib15)].

In scientific applications, this limitation is empirical as well as conceptual. For example, high-performing protein–ligand affinity models have been shown to rely substantially on ligand memorisation rather than interaction-specific information [23 (https://arxiv.org/html/2605.21731#bib.bib23)]. Specifically, in DTI, apparent progress on standard benchmarks can be influenced by target bias, scaffold effects, leakage, and other forms of distributional bias [20 (https://arxiv.org/html/2605.21731#bib.bib20), 35 (https://arxiv.org/html/2605.21731#bib.bib35), 13 (https://arxiv.org/html/2605.21731#bib.bib13)], which leads models to fail on structurally similar compounds with markedly different functional behaviour [33 (https://arxiv.org/html/2605.21731#bib.bib33)].

These findings expose a mismatch between the predictive success of a model and its ability to capture the structural properties of the problem. Standard benchmark evaluation can establish that a model predicts well on a given distribution, but cannot characterize how model outputs reorganize when scientifically relevant input structure is perturbed.

A principled response to this problem is to move from observational to interventional evaluation, where the question is not only whether a model predicts correctly, but how its predictions change under controlled perturbations of the input. In causal terms, mechanistic claims concern responses to interventions rather than associations observed under a fixed distribution [25 (https://arxiv.org/html/2605.21731#bib.bib25), 26 (https://arxiv.org/html/2605.21731#bib.bib26)]. This perspective has informed several post-hoc lines of work, including causal abstraction of internal representations [10 (https://arxiv.org/html/2605.21731#bib.bib10), 11 (https://arxiv.org/html/2605.21731#bib.bib11)], model-randomisation tests for explanation faithfulness [1 (https://arxiv.org/html/2605.21731#bib.bib1)], and perturbation probes of reasoning behaviour in language models [38 (https://arxiv.org/html/2605.21731#bib.bib38), 40 (https://arxiv.org/html/2605.21731#bib.bib40)]. In scientific prediction, domain knowledge provides a structural prior over inputs, enabling comparison of model responses to perturbations on prior-selected versus non-prior-selected components. This yields a post-hoc, prior-relative audit of whether the model response is organized with respect to meaningful input structure, as formalized in [32 (https://arxiv.org/html/2605.21731#bib.bib32)].

#### Our Contribution.

In this paper, we introduce I-SAFE (Interventional Secure, Accurate, Fair and Explainable), a post-hoc distributional auditing framework for trained black-box scientific predictors. While existing intervention-based audits rely on scalar summaries, the I-SAFE framework evaluates the full output distribution induced by the model, capturing how output ranks reorder under prior-guided perturbations of selected input components. Our main theoretical contribution is a set of metrics that capture different levels of ranking coherence: the Quantile-Based Metric for location-level coherence, the Wasserstein Coherence Metric for ordinal coherence, and a translation-invariant Wasserstein Metric for distributional shape coherence. For each metric, the prior-relative contrast compares the ranking coherence under outside-prior controls with that under prior-selected perturbations. Positive contrasts indicate that perturbations of prior-selected components induce more coherent output responses than their controls outside the prior.
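To make the three levels of comparison more concrete, the following is a minimal sketch of standard one-dimensional building blocks: a gap between empirical quantiles, the Wasserstein distance between two equal-size samples, and a translation-invariant variant obtained by centering both samples before computing the 2-Wasserstein distance. This excerpt does not give the formal definitions of QBM or WCM, so the function names (`quantile_gap`, `wasserstein_1d`, `centered_w2`) and the toy arrays are illustrative assumptions, not the paper's metrics.

```python
import numpy as np

def quantile_gap(a, b, levels=(0.25, 0.5, 0.75)):
    """Difference between empirical quantiles of b and a at the given levels."""
    return np.quantile(b, levels) - np.quantile(a, levels)

def wasserstein_1d(a, b, p=1):
    """p-Wasserstein distance between two equal-size 1-D samples.

    In one dimension the optimal coupling matches sorted values,
    so the distance reduces to a norm of the sorted differences.
    """
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("this sketch assumes samples of equal size")
    return float(np.mean(np.abs(a - b) ** p) ** (1.0 / p))

def centered_w2(a, b):
    """2-Wasserstein distance after removing each sample's mean.

    For p = 2 the best translation aligns the means, so this value
    is unchanged when either sample is shifted by a constant.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return wasserstein_1d(a - a.mean(), b - b.mean(), p=2)

# Toy output profiles (hypothetical values, not from the paper).
base = np.array([0.0, 1.0, 2.0, 3.0])
shifted = base + 5.0      # pure location change
stretched = base * 2.0    # change of spread, i.e. of shape

w1_shift = wasserstein_1d(base, shifted)      # 5.0
shape_shift = centered_w2(base, shifted)      # 0.0
w1_stretch = wasserstein_1d(base, stretched)  # 1.5
shape_stretch = centered_w2(base, stretched)  # sqrt(1.25)
```

In this toy setting the pure shift is fully visible to the quantile gap and the plain distance but invisible to the centered variant, while the stretch registers in both. This is the kind of separation between location and shape that the metrics above are described as targeting.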

We then instantiate I-SAFE on the Davis kinase benchmark [8 (https://arxiv.org/html/2605.21731#bib.bib8)], auditing three sequence-based DTI models, DeepDTA [24 (https://arxiv.org/html/2605.21731#bib.bib24)], DeepConvDTI [18 (https://arxiv.org/html/2605.21731#bib.bib18)], and TAPB [20 (https://arxiv.org/html/2605.21731#bib.bib20)], using Kinase–Ligand Interaction Fingerprints and Structures (KLIFS) binding-pocket annotations as an external structural prior [17 (https://arxiv.org/html/2605.21731#bib.bib17), 16 (https://arxiv.org/html/2605.21731#bib.bib16)]. While the audited models operate in a comparable predictive regime, we show that their interventional behaviour differs substantially. In particular, TAPB is the only model for which KLIFS-aligned pocket perturbations induce significantly more coherent quantile-level and ordinal responses than non-pocket controls. Our results show that predictive performance and distributional coherence capture distinct aspects of scientific model behaviour, indicating that I-SAFE provides a statistically grounded, prior-relative test of how trained scientific predictors operate under structurally meaningful perturbations. We stress that the audit does not establish causal validity of the model, nor does it require access to the model's internals. Rather, it provides reusable, model-agnostic metrics for evaluating black-box model responses under structurally guided perturbations, applicable to any setting where inputs admit a structured decomposition and a domain prior is available.

## 2 Related Work

#### From explanation to interventional model analysis.

Attribution and explanation methods provide tools for inspecting trained models. Saliency maps, local surrogate models, Shapley-based explanations, and integrated gradients [28 (https://arxiv.org/html/2605.21731#bib.bib28), 22 (https://arxiv.org/html/2605.21731#bib.bib22), 31 (https://arxiv.org/html/2605.21731#bib.bib31), 29 (https://arxiv.org/html/2605.21731#bib.bib29)] identify input features that influence predictions, supporting transparency and debugging [9 (https://arxiv.org/html/2605.21731#bib.bib9)]. However, such explanations are mostly associational: they indicate the influential features under the observed input distribution, without establishing whether predictions depend on scientifically relevant structure. Their fragility has been documented empirically: Adebayo et al. [1 (https://arxiv.org/html/2605.21731#bib.bib1)] show that saliency methods can be invariant to model parameter randomisation, calling into question their mechanistic faithfulness. Intervention-based approaches address this limitation by asking how the model changes under controlled modifications. Causal abstraction and related methods [10 (https://arxiv.org/html/2605.21731#bib.bib10), 11 (https://arxiv.org/html/2605.21731#bib.bib11)] compare neural representations with interpretable causal variables through interchange interventions on internal activations. These methods provide a formal route to mechanistic comparison, but usually require access to internal representations and a target causal structure. Input-level perturbation methods avoid this requirement by probing the input–output mapping induced by the model and have been applied to language models [38 (https://arxiv.org/html/2605.21731#bib.bib38), 40 (https://arxiv.org/html/2605.21731#bib.bib40)]. In scientific prediction, post-hoc structural auditing under an external prior has been proposed to contrast mechanistic and spurious perturbations through scalar summaries [32 (https://arxiv.org/html/2605.21731#bib.bib32)]. I-SAFE retains the post-hoc black-box setting, but shifts the object of comparison from scalar aggregated values to the distributional changes in the model output.

#### Shortcut learning and structural robustness.

The need for auditing frameworks is reinforced by extensive evidence that predictive success can be driven by shortcuts or unstable correlations rather than task-relevant structures [12 (https://arxiv.org/html/2605.21731#bib.bib12), 15 (https://arxiv.org/html/2605.21731#bib.bib15)]. In molecular prediction and DTI, this issue appears as ligand memorisation [23 (https://arxiv.org/html/2605.21731#bib.bib23)], target prior bias [20 (https://arxiv.org/html/2605.21731#bib.bib20)], benchmark artifacts related to leakage and split design [35 (https://arxiv.org/html/2605.21731#bib.bib35), 13 (https://arxiv.org/html/2605.21731#bib.bib13)], and failures on activity cliffs, where structurally similar compounds have different functional effects [33 (https://arxiv.org/html/2605.21731#bib.bib33)]. These findings motivate the search for predictors relying on stable relations. Invariant causal prediction [26 (https://arxiv.org/html/2605.21731#bib.bib26)] and Invariant Risk Minimization [3 (https://arxiv.org/html/2605.21731#bib.bib3)] formalize stability across environments as a route toward more robust prediction. The objective of I-SAFE is complementary. Rather than modifying the learning procedure or requiring environment annotations, it asks whether a trained model responds coherently when input components identified by external scientific knowledge are perturbed. This distinction is important in scientific AI, where models are used as black-box predictors and retraining is impractical, unavailable, or insufficient to diagnose the predictor. A parallel limitation affects intervention-based evaluation itself, where the richness of the auditing framework is bounded by the statistical resolution at which model responses are characterized.

#### Beyond aggregate performance.

Single aggregate metrics are often too coarse to characterize behaviour relevant for downstream use. In language model evaluation, DecodingTrust [36 (https://arxiv.org/html/2605.21731#bib.bib36)], HELM [19 (https://arxiv.org/html/2605.21731#bib.bib19)], and BIG-bench [30 (https://arxiv.org/html/2605.21731#bib.bib30)] address this issue by organizing assessment across multiple capabilities, risks, and scenarios. Likewise, in scientific model evaluation, held-out performance establishes predictive adequacy on a benchmark distribution, but not how predictions reorganize under perturbations of scientifically meaningful input structure. A similar limitation arises within intervention-based evaluation itself. Model responses are often reduced to average effects, scalar sensitivity scores, or a small set of moments [10 (https://arxiv.org/html/2605.21731#bib.bib10), 40 (https://arxiv.org/html/2605.21731#bib.bib40), 32 (https://arxiv.org/html/2605.21731#bib.bib32)]. These summaries detect location-level contrasts, but they might miss whether output distributions shift coherently, preserve ordinal structure, or change shape. A standard way to move beyond single aggregate metrics is to use two-sample distributional comparison [7 (https://arxiv.org/html/2605.21731#bib.bib7), 2 (https://arxiv.org/html/2605.21731#bib.bib2)] and optimal transport theory [34 (https://arxiv.org/html/2605.21731#bib.bib34), 27 (https://arxiv.org/html/2605.21731#bib.bib27)], as the resulting metrics capture the geometry of the underlying space. I-SAFE brings this perspective to structural auditing by decomposing the distributional response to perturbations into three complementary axes: location, ordinal structure, and shape. Each axis captures aspects of output reorganization invisible to scalar summaries.
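As a toy illustration of why an average effect can hide ordinal reorganization, the sketch below builds two hypothetical perturbed output profiles with the same mean shift, one preserving the ranking of the original outputs and one reversing it. Spearman rank correlation is used here only as a familiar stand-in for an ordinal summary; it is not the WCM, and the arrays and helper names are assumptions made for the example.

```python
import numpy as np

def ranks(values):
    """Ranks 0..n-1 of the entries of `values` (assumes no ties)."""
    return np.argsort(np.argsort(values))

def spearman(a, b):
    """Spearman rank correlation, computed as Pearson correlation of ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

# Hypothetical paired outputs on five audit inputs.
original = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
uniform_shift = original + 1.0    # every output moves up by the same amount
reordered = original[::-1] + 1.0  # same values, assigned in reverse order

# Both perturbed profiles have the same average effect...
mean_shift_uniform = float(uniform_shift.mean() - original.mean())
mean_shift_reordered = float(reordered.mean() - original.mean())

# ...but opposite ordinal behaviour relative to the original outputs.
rho_uniform = spearman(original, uniform_shift)
rho_reordered = spearman(original, reordered)
```

A scalar summary such as the mean shift reports the same value for both profiles, whereas the rank-based summary separates a coherent shift from a complete reordering.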

## 3 The I-SAFE Framework

In this section, we formalize the I-SAFE framework as a post-hoc auditing procedure for fixed black-box predictors. I-SAFE leverages prior-guided perturbations to induce paired profiles of raw model outputs and evaluates the coherence of their distributional reorganization after the intervention. We consider a predictor f_BB: X → R accessed only through its input–output map, where X = ∏_{m=1}^M X_m admits a decomposition into M identifiable components. For an input x = (x^{(1)},...,x^{(M)}), the components of x define the units on which interventions act. Throughout the audit, f_BB(x) denotes the raw model output on input x ∈ X, before thresholding, calibration, or downstream decision rules.

### 3.1 Problem formulation

Given a black-box predictor f_BB, the audit is performed on an audit set A = {x_i}_{i=1}^N ⊂ X that is disjoint from the data used to train f_BB. The set A contains the elements of X on which all perturbations are applied and all output distributions are compared. To determine how to perform a perturbation, we rely on a structural prior over X, derived from domain knowledge that does not depend on f_BB.
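The setup above can be sketched as a small query loop. This is a minimal illustration under stated assumptions, not the paper's procedure: inputs are taken to be real vectors with M = 10 scalar components, the structural prior is a boolean mask over components, a perturbation replaces the chosen components with fresh Gaussian draws, and `toy_predictor` is a hypothetical black box that depends only on the prior-selected components. All function and variable names are invented for the example.

```python
import numpy as np

def perturb(x, component_idx, rng):
    """Copy x and replace the chosen components with random draws
    (an assumed intervention; the paper's scheme may differ)."""
    x_new = x.copy()
    x_new[component_idx] = rng.normal(size=len(component_idx))
    return x_new

def paired_profiles(f_bb, audit_set, prior_mask, rng):
    """Query f_bb on each audit input, on a prior-selected perturbation,
    and on a size-matched control perturbation outside the prior."""
    prior_idx = np.flatnonzero(prior_mask)
    control_pool = np.flatnonzero(~prior_mask)
    base, on_prior, off_prior = [], [], []
    for x in audit_set:
        control_idx = rng.choice(control_pool, size=len(prior_idx), replace=False)
        base.append(f_bb(x))
        on_prior.append(f_bb(perturb(x, prior_idx, rng)))
        off_prior.append(f_bb(perturb(x, control_idx, rng)))
    return np.array(base), np.array(on_prior), np.array(off_prior)

def toy_predictor(x):
    """Hypothetical black box: raw output depends only on components 0..2."""
    return float(x[:3].sum())

rng = np.random.default_rng(0)
audit_set = rng.normal(size=(50, 10))  # N = 50 audit inputs, M = 10 components
prior_mask = np.zeros(10, dtype=bool)
prior_mask[:3] = True                  # prior marks components 0..2 as relevant

base, on_prior, off_prior = paired_profiles(toy_predictor, audit_set, prior_mask, rng)
```

Because the toy predictor ignores every component outside the prior, the control profile `off_prior` coincides with `base`, while `on_prior` does not. The three arrays are the paired raw-output profiles on which distributional comparisons would then be computed; only calls to `f_bb` are needed, matching the black-box access assumed above.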
