# Adaptive auditing of AI systems with anytime-valid guarantees
Source: [https://arxiv.org/html/2605.07002](https://arxiv.org/html/2605.07002)
Siyu Zhou¹\*, Patrick Vossler¹\*, Venkatesh Sivaraman¹, Yifan Mai², Jean Feng¹ (¹University of California, San Francisco; ²Stanford University)
###### Abstract
A major bottleneck in characterizing the failure modes of generative AI systems is the cost and time of annotation and evaluation. Consequently, adaptive testing paradigms have gained popularity, where one opportunistically decides which cases to annotate, and how many, based on past results. While this framework is highly practical, its extreme flexibility makes it difficult to draw statistically rigorous conclusions, as it violates classical assumptions: the number of observations is typically limited (often 10 to 50 cases), and decisions regarding sampling and stopping are made in the midst of data collection rather than based on a pre-specified rule. To characterize what statistical inferences can be drawn from highly adaptive audits, we introduce a hypothesis testing framework with two “dueling” perspectives: (i) the model’s null, which asserts there is no failure mode with performance below a target threshold, versus (ii) the auditor’s null, which asserts they have a sampling strategy that will uncover a failure mode. Leveraging Safe Anytime-Valid Inference (SAVI), we formalize the auditor as conducting “testing by betting,” which translates into simultaneous e-processes for testing the dueling null hypotheses. Furthermore, if the auditor is sufficiently powerful, we prove that these two hypotheses are asymptotically inverses of each other, in that passage of a stringent audit does in fact certify the AI system as globally robust. Empirically, we demonstrate that our proposed testing procedures maintain anytime-valid Type-I error control, outperform pre-specified testing methods, and can reach statistically rigorous conclusions sometimes with as few as 20 observations.
## 1 Introduction
AI systems are known to exhibit various generalization failures across subgroups and input types, often described as the “jagged frontier.” Because gold-standard evaluations for AI (agent) outputs can be time-consuming and/or costly (Bandel et al., [2026](https://arxiv.org/html/2605.07002#bib.bib80)), a common practice is to construct small, adaptively-selected test suites to conduct more targeted probing of potential failure modes (Husain and Shankar, [2026](https://arxiv.org/html/2605.07002#bib.bib46); Yan and Zhang, [2022](https://arxiv.org/html/2605.07002#bib.bib54)). The practical appeal of adaptive testing is that one can decide in real time which samples to annotate, and how many, based on results from past annotations. Approaches include checklist-style behavioral tests that probe specific linguistic capabilities (Ribeiro et al., [2020](https://arxiv.org/html/2605.07002#bib.bib66)); human-in-the-loop systems where the practitioner iteratively refines test cases alongside model-generated candidates (Ribeiro and Lundberg, [2022](https://arxiv.org/html/2605.07002#bib.bib41); Gao et al., [2023](https://arxiv.org/html/2605.07002#bib.bib64)); visualization interfaces to audit algorithmic fairness (Cabrera et al., [2019](https://arxiv.org/html/2605.07002#bib.bib18)); and adversarial generation, where practitioners try to elicit worst-case behaviors via red-teaming (Perez et al., [2022](https://arxiv.org/html/2605.07002#bib.bib67); Ganguli et al., [2022](https://arxiv.org/html/2605.07002#bib.bib68); Lee et al., [2023](https://arxiv.org/html/2605.07002#bib.bib34); Li et al., [2024](https://arxiv.org/html/2605.07002#bib.bib65)). This practice even extends beyond AI, such as stress-testing devices across worst-case scenarios (U.S. Food and Drug Administration, [2020](https://arxiv.org/html/2605.07002#bib.bib69)).
Despite its widespread use, the flexibility of adaptive testing workflows raises a fundamental statistical question: can a small number of adaptively chosen observations support rigorous inference about system robustness? Standard statistical inference typically assumes both a pre-specified sampling scheme and sample sizes large enough for asymptotic theory to apply. Adaptive testing violates both: AI practitioners often curate as few as 10 to 50 test cases (Husain and Shankar, [2026](https://arxiv.org/html/2605.07002#bib.bib46); Yan and Zhang, [2022](https://arxiv.org/html/2605.07002#bib.bib54)), with the sampling and stopping strategy decided in the midst of data collection rather than based on a pre-specified rule. Such testing regimes are traditionally disallowed because they can substantially inflate Type I error; for instance, if one continuously peeks at the p-value, the test is guaranteed to reject with probability one under the null (Ramdas and Wang, [2025](https://arxiv.org/html/2605.07002#bib.bib35)). Even more fundamentally, it is difficult to formulate what exactly is being tested; for instance, arbitrary sampling practices cannot be used to test a model’s average performance. Thus, the goal of this work is to (i) understand what hypothesis tests can be assessed under highly flexible testing regimes and (ii) construct corresponding testing procedures that are universally valid with finite-sample guarantees.
Our proposal begins with formalizing AI robustness auditing as conducting two “dueling” hypothesis tests, one from the perspective of the “model” and one from the perspective of the “auditor” (Figure [1](https://arxiv.org/html/2605.07002#S1.F1)(a)). The model’s null hypothesis makes the omnibus assertion that no failure modes exist, in that every sufficiently large subgroup meets a minimum performance threshold; rejection of the model’s null means that the AI model is not robust. The auditor’s null hypothesis asserts that they have a strategy that will eventually identify a failure mode; rejection of the auditor’s null implies passage of this specific audit. While these two null hypotheses were specifically constructed to be testable in practice, they are generally not exact complements. Thus, they cannot constitute the ideal audit that yields a binary conclusion on whether an AI system is or is not robust. But if one has a powerful auditor that is guaranteed to find a failure mode if one truly exists, we prove that the two null hypotheses *are* complements and thus correspond to the ideal audit. Through this formalization, we clarify what can be tested in practice, what assumptions are additionally needed to conduct the ideal audit in theory, and how one should interpret and conduct audits.
We then leverage Safe Anytime-Valid Inference (SAVI) methods to frame the auditor as conducting a “Testing by Betting” procedure, in which the auditor may place adaptive bets against any sequence of subgroups they suspect could be failure modes, and where, under the null, there is no winning betting strategy (Figure [1](https://arxiv.org/html/2605.07002#S1.F1)(b, c)) (Grünwald et al., [2024](https://arxiv.org/html/2605.07002#bib.bib38); Ramdas and Wang, [2025](https://arxiv.org/html/2605.07002#bib.bib35)). We investigate various mathematical translations of these bets into e-processes (the SAVI analog of the p-value in the sequential setting), which can be interpreted as the accumulation of wealth from the auditor’s bets and, crucially, provide valid finite-sample Type-I error control under arbitrary adaptive sampling and optional stopping. We begin with the simplest option, the likelihood ratio; then define more adaptive formulations based on universal inference; and finally introduce changepoint-based formulations.
We make the following contributions:
1. We formalize AI auditing as conducting dueling hypothesis tests between the model’s null and the auditor’s null. We prove these are asymptotically complements under sufficiently powerful auditing strategies.
2. We introduce dual e-processes that provide stopping bounds for simultaneous failure mode detection and audit certification, with SAVI guarantees.
3. We provide empirical results demonstrating how adaptive testing paired with the e-process procedures can indeed provide Type-I error control and make rigorous conclusions much faster than pre-specified auditing strategies, sometimes with as few as 20 observations.
Figure 1: We (a) formalize flexible audits of failure modes in AI systems as dueling hypothesis tests, (b) introduce a “testing-by-betting” framework that is suitable for settings where sampling and stopping strategies are decided in the midst of data collection and not pre-defined, and (c) represent adaptive bets as dual e-processes with safe anytime-valid guarantees.
## 2 Related Work
**LLM Evaluation and Adaptive Testing.** Many free-form LLM outputs are expensive and time-consuming to evaluate, e.g., they involve multiple subagent calls or require manual annotation by experts. Consequently, beyond large-scale benchmarking efforts for LLMs (Liang et al., [2023](https://arxiv.org/html/2605.07002#bib.bib31)), a growing body of research focuses on uncovering failure modes of LLMs through highly targeted adaptive or active testing, in which results obtained in previous rounds inform which observations to label next (Ribeiro and Lundberg, [2022](https://arxiv.org/html/2605.07002#bib.bib41); Gao et al., [2023](https://arxiv.org/html/2605.07002#bib.bib64); Li et al., [2024](https://arxiv.org/html/2605.07002#bib.bib65); Vivek et al., [2024](https://arxiv.org/html/2605.07002#bib.bib56)). These adaptive testing methods often involve a human in the loop, sometimes assisted by automatic systems; in the most extreme scenario, active testing is akin to red-teaming, where one constructs worst-case test scenarios (Ganguli et al., [2022](https://arxiv.org/html/2605.07002#bib.bib68)). Due to their resource- and cost-efficient nature, adaptive testing methods are also commonly used by AI practitioners to evaluate AI pipelines in target areas, where common recommendations are to assemble diverse test suites with 10–50 cases covering edge cases and failure modes (Yan, [2025](https://arxiv.org/html/2605.07002#bib.bib72); Husain and Shankar, [2026](https://arxiv.org/html/2605.07002#bib.bib46); Anthropic, [2026](https://arxiv.org/html/2605.07002#bib.bib73)).
However, there is currently no statistical framework that formalizes what exactly is, and can be, the target of (statistical) inference when one conducts highly adaptive and targeted testing with unknown sampling schemes, as this violates the assumptions of classical active testing settings, which require a pre-specified sampling scheme (Berrada et al., [2025](https://arxiv.org/html/2605.07002#bib.bib27); Kuang et al., [2025](https://arxiv.org/html/2605.07002#bib.bib26)). This work provides such a framework by allowing for arbitrary sampling and stopping schemes.
**E-values and E-processes.** Because classical p-value and hypothesis testing procedures place rigid constraints and assumptions on the data-generating mechanism, recent works have proposed an alternative approach based on e-values. E-values, whose properties are derived from (conditional) expectations rather than probability distributions, have the advantage of providing anytime-valid testing with finite-sample Type I error control under any unknown sampling scheme (Grünwald et al., [2024](https://arxiv.org/html/2605.07002#bib.bib38); Ramdas and Wang, [2025](https://arxiv.org/html/2605.07002#bib.bib35)). Furthermore, e-values have been extended to the online setting under optional stopping, with e-processes for sequential testing and e-detectors for sequential changepoint detection (Shin et al., [2024](https://arxiv.org/html/2605.07002#bib.bib36)). These frameworks can be viewed as testing by betting, in which the tester places a bet at each iteration and, under the null, the expected return is (no greater than) zero (Shafer, [2021](https://arxiv.org/html/2605.07002#bib.bib37)). While there is growing interest in using e-values to improve generated outputs from AI systems (Sadhuka et al., [2025](https://arxiv.org/html/2605.07002#bib.bib70)), there are currently no e-value frameworks for highly targeted and adaptive audits of AI systems, as the exact hypothesis tests in these settings have yet to be formalized.
**Subgroup and Fairness Testing.** The problem of auditing failure modes of AI systems falls under the broader research topic of testing and identifying subgroups with large performance disparities (Chung et al., [2019](https://arxiv.org/html/2605.07002#bib.bib20); Eyuboglu et al., [2023](https://arxiv.org/html/2605.07002#bib.bib22); Feng et al., [2023](https://arxiv.org/html/2605.07002#bib.bib28); Singh et al., [2023](https://arxiv.org/html/2605.07002#bib.bib32); Subbaswamy et al., [2024](https://arxiv.org/html/2605.07002#bib.bib39); Zeng et al., [2026](https://arxiv.org/html/2605.07002#bib.bib40)) and algorithmic fairness (Chouldechova and Roth, [2020](https://arxiv.org/html/2605.07002#bib.bib17); Mitchell et al., [2021](https://arxiv.org/html/2605.07002#bib.bib11)). Nevertheless, the existing literature either studies the offline (batch) setting or active labelling with pre-specified sampling schemes or subgroups (Yan and Zhang, [2022](https://arxiv.org/html/2605.07002#bib.bib54); Hartmann et al., [2026](https://arxiv.org/html/2605.07002#bib.bib55)). Our framework differs in its generality by allowing for statistically rigorous inference under active testing where neither the subgroups nor the sampling strategy are pre-specified.
## 3 Methods
To formalize adaptive audits of AI system robustness, this section addresses the two main technical challenges: (i) which null hypotheses can feasibly be tested, and (ii) which testing procedures provide safe anytime-valid inference? Due to space constraints, we refer the reader to the Appendix for implementation details and detailed proofs.
**Notation.** Let $(\mathcal{X},\mathcal{F},\mu)$ denote a probability space with input space $\mathcal{X}$, $\sigma$-algebra $\mathcal{F}$, and probability measure $\mu$. For query $X\in\mathcal{X}$, let random variable $Y$ denote the score for the AI system’s response (e.g., correctness), which we assume to be IID conditional on $X$. For a measurable subset $S\in\mathcal{F}$ with prevalence $\mu(S)>0$, the AI system’s average score for subgroup $S$ is denoted $V(S)\coloneqq\mathbb{E}[Y\mid X\in S]$ (we presume higher scores are better). For fixed $\epsilon\in(0,1]$, let $\mathcal{S}_{\epsilon}\coloneqq\{S\in\mathcal{F}:\mu(S)\geq\epsilon\}$ denote all subgroups with mass at least $\epsilon$. A *failure mode* is a subgroup $S\in\mathcal{S}_{\epsilon}$ on which $V(S)$ falls below a target threshold $q$. For any positive integer $r$, $[r]$ denotes the sequence $\{1,2,\dots,r\}$.
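To make the notation concrete, here is a small synthetic sketch of our own (the population, attributes, and planted failure mode are illustrative assumptions, not from the paper): it estimates $V(S)=\mathbb{E}[Y\mid X\in S]$ by Monte Carlo for a hypothetical subgroup and checks whether it qualifies as a failure mode.

```python
import random

rng = random.Random(1)

# Hypothetical audit population: each query X has two observed attributes.
population = [{"topic": rng.choice(["math", "law"]),
               "length": rng.choice(["short", "long"])} for _ in range(5000)]

def score(x):
    # Hypothetical AI system with a planted failure mode on long law queries.
    p = 0.55 if (x["topic"] == "law" and x["length"] == "long") else 0.92
    return 1 if rng.random() < p else 0

def V_hat(in_subgroup, n=20000):
    """Monte Carlo estimate of V(S) = E[Y | X in S]: sample X from mu
    restricted to S, then draw the binary score Y."""
    members = [x for x in population if in_subgroup(x)]
    return sum(score(rng.choice(members)) for _ in range(n)) / n

q, eps = 0.8, 0.05  # target threshold q and minimum subgroup mass epsilon
S = lambda x: x["topic"] == "law" and x["length"] == "long"
mass = sum(map(S, population)) / len(population)  # estimate of mu(S)
v = V_hat(S)
print(f"mu(S) ~ {mass:.2f}, V(S) ~ {v:.2f}, failure mode: {mass >= eps and v < q}")
```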
### 3.1 Dueling hypothesis tests for auditing AI robustness
Defining a suitable hypothesis testing framework requires overcoming a fundamental asymmetry in adaptive model auditing: while identifying a failure mode proves a lack of robustness, passing a targeted test suite does not certify robustness. This is true both from the software-engineering perspective of unit tests and from the Fisherian perspective of statistical hypothesis testing (Christensen, [2005](https://arxiv.org/html/2605.07002#bib.bib71)). Furthermore, because adaptive sampling opportunistically seeks out difficult cases, it cannot support conclusions about typical estimands such as average or subgroup-specific performance.
We resolve this by framing AI robustness audits as conducting two “dueling” hypothesis tests: the *model’s null* makes the omnibus assertion that no failure modes exist, while the *auditor’s null* asserts that their specific auditing strategy will eventually find a failure mode. We then investigate conditions under which testing these two null hypotheses corresponds to the ideal audit, where passage of the audit occurs if and only if the AI system is globally robust.
#### 3.1.1 The model’s hypothesis
The model’s null hypothesis states that it performs well across all subgroups with prevalence at least $\epsilon$, while the alternative states that there exist one or more failure modes. This can be mathematically formalized as

$$H_{0}^{\texttt{mod}}:\ \forall S\in\mathcal{S}_{\epsilon},\ V(S)\geq q,\qquad H_{1}^{\texttt{mod}}:\ \exists S\in\mathcal{S}_{\epsilon}\ \text{such that}\ V(S)<q.\tag{1}$$

We emphasize that this is a subgroup-uniform testing problem rather than an estimation problem: the procedure does not need to recover the worst-case subgroup nor its accuracy.
The closest analog to this test appears in the *multicalibration* literature (Hebert-Johnson et al., [2018](https://arxiv.org/html/2605.07002#bib.bib19); Feng et al., [2023](https://arxiv.org/html/2605.07002#bib.bib28)), which studies whether a model is well-calibrated over a collection of subgroups. Similar questions are also considered in the Distributionally Robust Optimization (DRO) literature, where one may be interested in worst-case subgroup performance (Subbaswamy et al., [2021](https://arxiv.org/html/2605.07002#bib.bib14)). The key difference in this work is that we study adaptive anytime testing, whereas prior work focused on the batch setting with pre-specified sampling schemes.
#### 3.1.2 The auditor’s hypothesis
The auditor’s null hypothesis asserts that they have an unspecified, adaptive, and potentially adversarial auditing strategy that will eventually uncover a failure mode. Formally, an auditing strategy $\pi$ is a function from the natural filtration $\mathcal{F}_{t}$ to the space of subgroups $\mathcal{S}_{\epsilon}$, where $S_{t}^{\pi}$ denotes the adaptively selected subgroup at time $t$. At each time $t$, the auditor bets against a subgroup $S_{t}^{\pi}$, samples an observation $X_{t}$ from it (with respect to $\mu$ restricted to $S_{t}^{\pi}$), and then obtains the corresponding score $Y_{t}$. Formally, we represent the auditor’s hypothesis test as

$$H_{0}^{\texttt{aud},[m]}:\ \exists\,\tau\in[m]\ \text{s.t.}\ \forall t\geq\tau,\ V(S_{t}^{\pi})<q,\qquad H_{1}^{\texttt{aud},[m]}:\ \forall\,\tau\in[m],\ \exists\,t\geq\tau\ \text{s.t.}\ V(S_{t}^{\pi})\geq q.\tag{2}$$

Note that ([2](https://arxiv.org/html/2605.07002#S3.E2)) is defined with respect to a hyperparameter $m$, which states that the auditor has a budget of size $m$ to find the failure mode. This budget is necessary for the null hypothesis to be testable in practice (otherwise one would have to wait forever). Nevertheless, neither data collection nor the auditor’s adaptive sampling strategy needs to be frozen at time $m$, nor should it be. Instead, as discussed later, the budget $m$ is simply the time at which we may begin seriously testing the auditor’s null hypothesis.
#### 3.1.3 Duality of the model’s and auditor’s null
In practice, one can only test the model’s and auditor’s null hypotheses. Nevertheless, because the auditor’s null only makes a statement about a specific audit strategy, it is natural to ask what conditions are needed to certify that an AI system is *globally* robust. We show that this is possible if the auditor’s strategy is *asymptotically consistent*, as defined below.
###### Definition 1 (Asymptotic consistency).
The auditor’s strategy $\pi$ is asymptotically consistent if, whenever there exists a subgroup $S^{*}\in\mathcal{S}_{\epsilon}$ with performance below $q$, the performance of the selected subgroups converges to below $q$, i.e., $\lim_{t\rightarrow\infty}V(S_{t}^{\pi})<q$.
Under the above assumption, we have the following duality result, which states that the model’s and auditor’s null hypotheses are asymptotic complements.
###### Theorem 1.
Under an asymptotically consistent auditing strategy $\pi$, the model’s and auditor’s hypotheses are asymptotically complementary, i.e.,

$$\lim_{m\rightarrow\infty}H_{1}^{\texttt{aud},[m]}\ \text{true}\iff H_{0}^{\texttt{mod}}\ \text{true}\quad\text{and}\quad\lim_{m\rightarrow\infty}H_{0}^{\texttt{aud},[m]}\ \text{true}\iff H_{1}^{\texttt{mod}}\ \text{true}.$$
Most standard nonparametric AI/ML models are known to be asymptotically consistent, e.g., neural networks, splines, and tree-based models (Anthony and Bartlett, [1999](https://arxiv.org/html/2605.07002#bib.bib74); Cox, [1984](https://arxiv.org/html/2605.07002#bib.bib76); Scornet et al., [2015](https://arxiv.org/html/2605.07002#bib.bib75)). Thus, Theorem [1](https://arxiv.org/html/2605.07002#Thmtheorem1) implies that if the auditor is strong enough, testing the two “dueling” hypotheses can in fact be interpreted as providing *dual* stopping bounds for certifying or rejecting an AI system as globally robust. Such dual stopping bounds can also be viewed as a generalization of those used in clinical trials to declare efficacy versus futility of a treatment (Pocock, [1977](https://arxiv.org/html/2605.07002#bib.bib7); O’Brien and Fleming, [1979](https://arxiv.org/html/2605.07002#bib.bib2)). The key difference is that our focus is an omnibus test with respect to all subgroups $\mathcal{S}_{\epsilon}$, whereas classical dual stopping bounds are solely concerned with the average treatment effect in the full population.
Practically speaking, this result suggests leveraging both expert-driven and data-driven methods when auditing AI systems, as the former can provide useful prior knowledge in settings with limited data while the latter can ensure asymptotic consistency. At the same time, this result warns against making robustness claims when the auditor is weak, echoing recent concerns that poorly designed red-teaming exercises may amount to “security theater” (Feffer et al., [2024](https://arxiv.org/html/2605.07002#bib.bib33)).
### 3.2 Dual safe anytime-valid testing procedures
We now investigate procedures that translate fully adaptive sampling schemes and optional stopping into statistics for testing the model’s and auditor’s hypotheses while providing finite-sample guarantees. Motivated by the “Testing by Betting” framework (Shafer, [2021](https://arxiv.org/html/2605.07002#bib.bib37)), we formulate testing procedures as e-processes (Ramdas and Wang, [2025](https://arxiv.org/html/2605.07002#bib.bib35)), a key mathematical object from the SAVI literature:
###### Definition 2 (E-process).
A sequence $(E_{t})_{t\geq 0}$ adapted to a filtration $(\mathcal{F}_{t})$ is an *e-process* with respect to $H_{0}$ if $\mathbb{E}_{P}[E_{\tau}]\leq 1$ for any stopping time $\tau$ and any $P\in H_{0}$. Equivalently, $(E_{t})_{t\geq 0}$ is an e-process if there exists a family of test supermartingales $(M_{t}^{P})_{P\in H_{0}}$ such that, for all $P\in H_{0}$: (i) $M_{t}^{P}\geq 0$ $P$-almost surely; (ii) $M_{t}^{P}$ is a supermartingale under $P$; (iii) $\mathbb{E}_{P}[M_{0}^{P}]\leq 1$; and $E_{t}\leq M_{t}^{P}$ $P$-almost surely.
From a testing perspective, the key property of e-processes is that they are guaranteed to control the Type I error rate under adaptive sampling and optional stopping, by Ville’s inequality. That is, for any $P\in H_{0}$ and $\mathcal{F}$-stopping rule $\tau$, an e-process stopped at threshold $1/\alpha$ has a Type I error rate of no more than $\alpha$, i.e.,

$$P\left(E_{\tau}\geq\frac{1}{\alpha}\right)\leq\alpha.\tag{3}$$

As such, we next present various e-processes for testing the model’s and the auditor’s hypotheses, beginning with the simplest option of the likelihood ratio and building up to more complex constructions. For ease of exposition, this section focuses on binary scores $Y$ (e.g., accuracy), but the results extend to continuous or multiclass scoring.
#### 3.2.1 Testing the model’s null
To test the model’s null hypothesis, we consider the likelihood ratio process, an adaptive version based on Universal Inference (UI) (Wasserman et al., [2020](https://arxiv.org/html/2605.07002#bib.bib30)), and extensions of both procedures that additionally integrate changepoint detection (Shiryaev, [1963](https://arxiv.org/html/2605.07002#bib.bib16); Roberts, [1966](https://arxiv.org/html/2605.07002#bib.bib15); Shin et al., [2024](https://arxiv.org/html/2605.07002#bib.bib36)).
**Option 1: Likelihood ratio process (LR).** One of the most well-known approaches to constructing an e-process is to use likelihood ratios, based on the classical Sequential Probability Ratio Test (SPRT) (Wald, [1945](https://arxiv.org/html/2605.07002#bib.bib48)). In particular, let $p(Y;\gamma)$ be the likelihood of a Bernoulli distribution with probability $\gamma$, and define the LR test statistic as $E_{t}^{\text{LR},\texttt{mod}}:=E_{1:t}^{\text{LR},\texttt{mod}}$, where

$$E_{j:t}^{\text{LR},\texttt{mod}}=\prod_{k=j}^{t}\frac{p(Y_{k};q-\delta)}{p(Y_{k};q)}.\tag{4}$$

Here, $\delta>0$ is a fixed user-specified separation parameter. When the auditor samples subgroups with $V(S_{t}^{\pi})\leq q-\delta$, the process has positive expected log-growth against the boundary value $q$. Subgroups with $V(S_{t}^{\pi})\in(q-\delta,q)$ remain failures under $H_{1}^{\texttt{mod}}$ but may produce slow growth, if any. Under the null $H_{0}^{\texttt{mod}}$, the test statistic $E_{t}^{\text{LR},\texttt{mod}}$ is an e-process with no expected growth. Therefore, rejecting $H_{0}^{\texttt{mod}}$ at time $t$ if $E_{t}^{\text{LR},\texttt{mod}}\geq\frac{1}{\alpha}$ is an anytime-valid level-$\alpha$ sequential test.
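The two behaviors above are easy to check by simulation. The following sketch (our own illustration; the parameter values $q=0.8$, $\delta=0.2$, $\alpha=0.05$ are assumptions) shows fast rejection when the audited subgroup truly fails, and a low false alarm rate at the null boundary even though the threshold is peeked at after every observation.

```python
import random

def lr_update(e, y, q, delta):
    # One step of Eq. (4): multiply by p(y; q - delta) / p(y; q), Bernoulli case.
    num = (q - delta) if y == 1 else 1 - (q - delta)
    den = q if y == 1 else 1 - q
    return e * num / den

def run_audit(p_true, rng, q=0.8, delta=0.2, alpha=0.05, horizon=500):
    """Return the first t at which E_t >= 1/alpha, or None if never.
    The threshold is checked at every step (optional stopping is allowed)."""
    e = 1.0
    for t in range(1, horizon + 1):
        y = 1 if rng.random() < p_true else 0
        e = lr_update(e, y, q, delta)
        if e >= 1 / alpha:
            return t
    return None

rng = random.Random(0)
# Alternative: a genuine failure mode with V(S) = 0.55 < q.
stops = [run_audit(0.55, rng) for _ in range(200)]
detected = sorted(s for s in stops if s is not None)
median_stop = detected[len(detected) // 2]
# Null boundary: V(S) = q = 0.8; any crossing is a false alarm.
false_alarm_rate = sum(run_audit(0.80, rng) is not None for _ in range(1000)) / 1000
print(f"median stopping time under a true failure mode: {median_stop}")
print(f"empirical Type-I error with continuous peeking: {false_alarm_rate:.3f}")
```

In line with the paper’s empirical findings, rejections under a genuine failure mode typically arrive within a few dozen observations, while the false alarm rate stays below $\alpha$.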
**Option 2: Likelihood ratios with UI (LR-UI).** While simple to construct, the disadvantage of the LR procedure above is that the fixed probability $q-\delta$ in ([4](https://arxiv.org/html/2605.07002#S3.E4)) can be suboptimal for detecting a particular alternative in $H_{1}^{\texttt{mod}}$. To improve its power, we introduce an adaptive version, denoted LR-UI, that draws on UI to define the test statistic $E_{t}^{\text{LR-UI},\texttt{mod}}:=E_{1:t}^{\text{LR-UI},\texttt{mod}}$, where

$$E_{j:t}^{\text{LR-UI},\texttt{mod}}=\prod_{k=j}^{t}\frac{p\left(Y_{k};\hat{\gamma}_{k-1}^{\texttt{mod}}\right)}{p\left(Y_{k};q\right)}.\tag{5}$$

Here $\hat{\gamma}_{t-1}^{\texttt{mod}}$ is any estimator of $V(X_{t})$ that is measurable with respect to the natural filtration $\mathcal{F}_{t-1}$ and outputs a probability consistent with $H_{1}^{\texttt{mod}}$. By using an estimator, one can view the auditor as placing an adaptive bet not only on which subgroup is a potential failure mode, but also on its likely failure rate, in the hope of further optimizing their gain in the testing-by-betting framework. Because $Y_{t}$ and $\hat{\gamma}_{t-1}^{\texttt{mod}}$ are conditionally independent given $\mathcal{F}_{t-1}$, $E_{t}^{\text{LR-UI},\texttt{mod}}$ is a test supermartingale and thus an e-process for $H_{0}^{\texttt{mod}}$.
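A minimal sketch of the LR-UI bet in (5). The plug-in estimator here (a Laplace-smoothed running mean of past scores, clipped just below $q$ so it always represents an alternative consistent with $H_1^{\texttt{mod}}$) is our own assumption; the paper allows any $\mathcal{F}_{t-1}$-measurable estimator.

```python
import random

def lr_ui_audit(p_true, rng, q=0.8, alpha=0.05, horizon=500):
    """LR-UI e-process for the model's null (Eq. (5) sketch). gamma_hat is
    computed from past scores only (measurable w.r.t. F_{t-1}), so the
    running product remains a test supermartingale under H0."""
    e, ones, n = 1.0, 0, 0
    for t in range(1, horizon + 1):
        # Plug-in bet: Laplace-smoothed mean of Y_1..Y_{t-1}, clipped below q.
        gamma_hat = min((ones + 1) / (n + 2), q - 1e-3)
        y = 1 if rng.random() < p_true else 0
        e *= (gamma_hat if y == 1 else 1 - gamma_hat) / (q if y == 1 else 1 - q)
        ones, n = ones + y, n + 1
        if e >= 1 / alpha:
            return t
    return None

rng = random.Random(0)
stops = [lr_ui_audit(0.5, rng) for _ in range(200)]  # a severe failure mode
detected = sorted(s for s in stops if s is not None)
fa = sum(lr_ui_audit(0.8, rng) is not None for _ in range(500)) / 500
print("median stop under a true failure mode:", detected[len(detected) // 2])
print("empirical Type-I at the null boundary:", fa)
```

Unlike the plain LR process, no separation parameter $\delta$ must be specified in advance; the bet adapts toward the observed failure rate.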
**Options 3 and 4: SR likelihood ratio processes (SR-LR, SR-LR-UI).** LR and LR-UI are optimal when the auditor identifies failure modes from the very beginning, but in practice an auditor may require multiple attempts (e.g., a learning period). Consequently, we consider integrating changepoint detection techniques into the aforementioned e-processes, so that the testing procedures are powered to detect the emergence of failure modes in the adaptive testing sequence.
In particular, we consider an extension based on the Shiryaev-Roberts (SR) procedure (Shiryaev, [1963](https://arxiv.org/html/2605.07002#bib.bib16); Roberts, [1966](https://arxiv.org/html/2605.07002#bib.bib15)), which is classically defined as the sum of LR processes started from all possible candidate changepoints and has more recently been studied under the umbrella of e-detectors (Shin et al., [2024](https://arxiv.org/html/2605.07002#bib.bib36)). Because a weighted sum of e-processes is also an e-process, we define the test statistics SR-LR and SR-LR-UI as the e-processes

$$E_{t}^{\text{SR-LR},\texttt{mod}}=\sum_{j=1}^{t}w_{j}E_{j:t}^{\text{LR},\texttt{mod}},\qquad E_{t}^{\text{SR-LR-UI},\texttt{mod}}=\sum_{j=1}^{t}w_{j}E_{j:t}^{\text{LR-UI},\texttt{mod}},$$

for pre-defined weights $\{w_{j}:j=1,2,\cdots\}$ that sum to one. In the betting framework, the SR extension amounts to the auditor distributing their wagers across all candidate changepoints to hedge against the unknown delay before a failure mode emerges.
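A sketch of the SR-LR mixture, using the polynomial weights $w_j=\frac{1}{j(j+1)}$ (which sum to one over $j\geq 1$; this specific weight choice is our assumption, as is the simulated scenario):

```python
import random

def sr_lr_audit(ys, q=0.8, delta=0.2, alpha=0.05):
    """SR-LR e-process: E_t = sum_{j <= t} w_j * E_{j:t}^{LR}, where
    partials[i] holds the running product E_{j:t} for changepoint j = i + 1."""
    partials = []
    for t, y in enumerate(ys, start=1):
        lr = ((q - delta) if y == 1 else 1 - (q - delta)) / (q if y == 1 else 1 - q)
        partials = [e * lr for e in partials]  # extend all open LR processes
        partials.append(lr)                    # open a new one at changepoint j = t
        e_sr = sum(e / (j * (j + 1)) for j, e in enumerate(partials, start=1))
        if e_sr >= 1 / alpha:
            return t
    return None

# A failure mode that only emerges after a learning period of 60 rounds.
rng = random.Random(0)
ys = ([1 if rng.random() < 0.80 else 0 for _ in range(60)]
      + [1 if rng.random() < 0.55 else 0 for _ in range(240)])
stop = sr_lr_audit(ys)
print("SR-LR rejects the model's null at t =", stop)
```

A plain LR process started at $t=1$ must first recover the wealth lost during the initial healthy rounds, whereas the mixture keeps a freshly started process near every candidate changepoint.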
#### 3.2.2 Testing the auditor’s null
To test the auditor’s null hypothesis, we cannot directly apply the aforementioned solutions, because the auditor’s null has a complex composite structure whereas the model’s null is much simpler. The challenge is most apparent when viewing the auditor’s null as a changepoint statement. In this formulation, the null asserts the presence of a drop in the subgroup scores (a changepoint) *within* the allotted budget $m$, whereas the alternative asserts the absence of such a drop. Rejecting the null prior to observing *all* $m$ observations is therefore risky, since doing so entails an unwarranted extrapolation about the future of an auditor who adapts to the *full* past trajectory.
As such, one may ask whether it is even possible to meaningfully test the auditor’s null prior to time $m$. We prove that it is indeed impossible: no anytime-valid testing procedure can achieve power greater than its Type I error prior to the $m$-th observation.
###### Theorem 2.
Let $\phi(t)$ be any anytime-valid testing procedure for the auditor’s null hypotheses such that the Type I error rate is controlled at $\alpha$, i.e.,

$$\sup_{m^{\prime}\in[m],\,t<m^{\prime}}\mathbb{E}\left(\phi(t)\,\middle|\,H_{0}^{\texttt{aud},m^{\prime}}\text{ is true}\right)\leq\alpha,$$

where $H_{0}^{\texttt{aud},m^{\prime}}$ is the singleton analog of the auditor’s null hypothesis. Then the power prior to time $m$ is also no more than $\alpha$, i.e.,

$$\sup_{m^{\prime}\in[m],\,t<m^{\prime}}\mathbb{E}\left(\phi(t)\,\middle|\,H_{1}^{\texttt{aud},m^{\prime}}\text{ is true}\right)\leq\alpha.$$
Given this result, we can simplify the testing procedure to wait until the end of the budget window $m$. Then, starting at time $m$, we can test the simplified null hypothesis
$$H_{0}^{\texttt{aud},m}:\ \forall t\geq m,\ V(S_{t}^{\pi})<q,\qquad H_{1}^{\texttt{aud},m}:\ \exists t\geq m\ \text{s.t.}\ V(S_{t}^{\pi})\geq q,\tag{6}$$
for which the testing procedures from the previous section are now applicable.
Option 1. Likelihood Ratio Process (LR). The first option is to define a likelihood ratio process similar to ([4](https://arxiv.org/html/2605.07002#S3.E4)), but starting from time $m$ and with the larger probability parameter in the numerator. The LR test statistic for the auditor's null is thus
$$E_{t}^{\text{LR},\texttt{aud}}=\begin{cases}\prod_{k=m}^{t}\dfrac{p(Y_{k};q+\delta')}{p(Y_{k};q)},&\text{if }t\geq m,\\ 1,&\text{otherwise,}\end{cases}\tag{7}$$
for some fixed tolerance $\delta'>0$.
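A minimal sketch of eq. (7) for binary agreement scores, assuming the Bernoulli likelihood $p(y;\gamma)=\gamma^{y}(1-\gamma)^{1-y}$; the function name and 1-based time convention are our own:

```python
def auditor_lr_eprocess(y, q, delta_prime, m):
    """E_t^{LR,aud} from eq. (7): equal to 1 before the budget m is
    reached, then a running product of likelihood ratios betting that
    performance is at least q + delta' (i.e., that the audit passes)."""
    assert 0 < q < q + delta_prime < 1
    vals, e = [], 1.0
    for t, yk in enumerate(y, start=1):  # t is 1-based time
        if t >= m:
            e *= (q + delta_prime) / q if yk == 1 else (1 - q - delta_prime) / (1 - q)
        vals.append(e)
    return vals  # vals[t-1] = E_t
```

Under a truly robust system (success rate above $q+\delta'$) the product drifts upward and eventually crosses $1/\alpha$, certifying audit passage.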
Option 2. Likelihood Ratios with UI (LR-UI). Alternatively, we can again use an adaptive version of the likelihood ratio process to avoid specifying a fixed alternative. That is, the LR-UI test statistic for the auditor's null is
$$E_{t}^{\text{LR-UI},\texttt{aud}}=\begin{cases}\prod_{k=m}^{t}\dfrac{p\left(Y_{k};\hat{\gamma}_{k-1}^{\texttt{aud}}\right)}{p\left(Y_{k};q\right)},&\text{if }t\geq m,\\ 1,&\text{otherwise,}\end{cases}\tag{8}$$
where $\hat{\gamma}_{t-1}^{\texttt{aud}}$ is an estimator of $V(X_{t})$ with respect to the natural filtration $\mathcal{F}_{t-1}$ under $H_{1}^{\texttt{aud},m}$.
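The plug-in alternative in eq. (8) can be sketched as follows. The running-mean estimator clipped above $q$ is one simple choice of $\hat{\gamma}_{k-1}^{\texttt{aud}}$ (an assumption on our part; the paper leaves the estimator unspecified here); what matters for validity is that the estimate at time $k$ only uses data up to time $k-1$:

```python
def auditor_lr_ui_eprocess(y, q, m, eps=0.01):
    """E_t^{LR-UI,aud} from eq. (8) with a predictable plug-in estimate:
    gamma_hat at time t is computed from y[0..t-2] only (so it is
    F_{t-1}-measurable), then clipped into (q, 1) so the bet stays on
    the alternative's side of the threshold."""
    vals, e = [], 1.0
    for t, yk in enumerate(y, start=1):
        if t >= m:
            past = y[:t - 1]
            mean = sum(past) / len(past) if past else q
            gamma = min(1 - eps, max(q + eps, mean))  # clipped running mean
            e *= gamma / q if yk == 1 else (1 - gamma) / (1 - q)
        vals.append(e)
    return vals
```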
#### 3.2.3 Dual testing
Combining the e-processes for the model's and auditor's nulls yields dual safe anytime-valid procedures for any adaptive auditor (Figure [1](https://arxiv.org/html/2605.07002#S1.F1); Algorithm [1](https://arxiv.org/html/2605.07002#alg1)). At each round $t$, the model's test statistic is updated; once $t\geq m$, the auditor's test statistic is monitored alongside it. The procedure stops as soon as either statistic exceeds $1/\alpha$. Because at most one of the two nulls is true under any data distribution, Bonferroni correction is unnecessary and the family-wise error rate (FWER) is equal to the Type I error of the individual tests, as formalized below.
###### Theorem 3\.
For any auditor strategy $\pi$, the dual testing procedure with stopping threshold $1/\alpha$ provides safe anytime-valid control of the FWER at level $\alpha$.
The main hyperparameter of the dual testing procedure is the choice of the auditor's budget $m$, where a higher value means a more stringent audit is conducted. This should be selected based on prior knowledge and real-world constraints. One should generally choose a higher $m$ to mitigate failure modes in high-risk settings or to find failure modes with lower prevalence. If the goal is to ensure a typical user is unlikely to encounter a failure mode, $m$ can also be chosen to approximate how many test cases a typical user may conduct to assess the robustness of an AI system. That said, our ablation study (in the Appendix) shows that the procedure is relatively robust to the choice of $m$.
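Putting the pieces together, a stripped-down version of the dual procedure (our sketch, not the paper's Algorithm 1: Bernoulli scores, fixed-alternative LR bets on both sides, and a shared tolerance $\delta$) looks like:

```python
def dual_test(y, q, delta, m, alpha=0.05):
    """Monitor both e-processes and stop when either crosses 1/alpha.
    The model-null statistic bets on performance below q from t=1;
    the auditor-null statistic bets on performance above q from t=m.
    No Bonferroni correction is needed since at most one null is true."""
    assert 0 < q - delta and q + delta < 1
    e_mod = e_aud = 1.0
    for t, yk in enumerate(y, start=1):
        # bet against the model's null (a failure mode exists)
        e_mod *= (q - delta) / q if yk == 1 else (1 - q + delta) / (1 - q)
        if e_mod >= 1 / alpha:
            return "failure mode detected", t
        if t >= m:
            # bet against the auditor's null (the audit passes)
            e_aud *= (q + delta) / q if yk == 1 else (1 - q - delta) / (1 - q)
            if e_aud >= 1 / alpha:
                return "audit passed", t
    return "inconclusive", len(y)
```

Note how the model-null statistic can fire well before the budget $m$ is exhausted, whereas the auditor-null statistic only starts accumulating evidence at $t=m$, mirroring Theorem 2.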
## 4 Experiments
We evaluate the dual anytime-valid testing framework in two experimental settings. Experiment 1 uses semi-synthetic data to comprehensively test a variety of settings that vary the size and magnitude of the failure mode. Experiment 2 evaluates the practical utility of the methods on a real-world LLM pipeline for clinical note analysis. Failure modes are defined as poor-performing subgroups with prevalence $\epsilon$ at least $0.05$. All hypothesis tests were conducted at level $\alpha=0.05$.
We test several auditing strategies and testing procedures, with the primary goal of comparing the power of the testing procedures given the same auditing strategy. First, we apply the simplest LR e-process formulation with three different fixed auditing strategies: Stratified (uniform sampling across known subgroups), Pre-learned (pre-learn a model for sampling, based on the batch approach in (Feng et al., [2023](https://arxiv.org/html/2605.07002#bib.bib28))), and Oracle (perfect subgroup knowledge, which upper bounds performance). Next, we test the four e-processes from Section [3.2](https://arxiv.org/html/2605.07002#S3.SS2) (LR, LR-UI, SR-LR, and SR-LR-UI) with an adaptive auditing strategy based on online learning. This adaptive auditor converts LLM queries ($X_t$) into embeddings and retrains a neural network ensemble with each additional annotation. Based on the predicted scores, the auditor samples from the $\epsilon$-sized subgroup with the lowest predicted scores below the threshold $q$, if such a subgroup exists; otherwise, based on the Upper Confidence Bound (UCB) algorithm (Auer et al., [2002](https://arxiv.org/html/2605.07002#bib.bib78)), it samples the bottom half of the population with the lowest confidence bounds for the estimated performance. The adaptive alternative probabilities in LR-UI and SR-LR-UI are constructed using the Exponentially Weighted Average Forecaster (EWAF) algorithm (Cesa-Bianchi and Lugosi, [2006](https://arxiv.org/html/2605.07002#bib.bib10)). In the Appendix, we provide full experimental details, demonstrate that the test procedures are robust to the choice of budget $m$ in ablation studies, and show results with similar trends on the CUB-200-2011 image dataset (Wah et al., [2011](https://arxiv.org/html/2605.07002#bib.bib61)).
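For intuition on how the adaptive alternative probability can be formed, here is a generic EWAF sketch over a finite grid of candidate Bernoulli parameters; the grid, learning rate, and log-loss are our illustrative assumptions, and the paper's exact expert construction is in its appendix:

```python
import math

def ewaf_forecasts(y, grid=(0.1, 0.3, 0.5, 0.7, 0.9), eta=0.5):
    """Exponentially weighted average forecaster: each expert is a fixed
    probability g in the grid; after each observation its weight is
    multiplied by exp(-eta * logloss), which equals likelihood**eta for
    the log-loss, and the forecast is the weight-averaged prediction."""
    w = [1.0] * len(grid)
    preds = []
    for yk in y:
        total = sum(w)
        preds.append(sum(wi * g for wi, g in zip(w, grid)) / total)
        # weight update: w_i *= p(y_k; g_i)^eta
        w = [wi * ((g if yk == 1 else 1 - g) ** eta) for wi, g in zip(w, grid)]
    return preds
```

On a stream of mostly-correct outputs the forecast drifts toward the highest grid value; this is the behavior LR-UI and SR-LR-UI exploit when tuning their bets against the observed failure rate.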
### 4.1 Experiment 1: Detecting failure modes with semi-synthetic data
To comprehensively evaluate testing methods across a variety of conditions, we simulated an AI pipeline with varying degrees of degradation on the GPQA dataset (Rein et al., [2023](https://arxiv.org/html/2605.07002#bib.bib42)). Following the available strata of science domains (Biology, Chemistry, Physics) and education levels (Undergraduate, Graduate, Post-graduate), we simulated a globally robust model with no failure modes (*None*) and models with a failure mode in the Chemistry domain, particularly for higher education levels (*Small*, *Medium*, and *Large degradation*). We conduct experiments across two settings commonly found in practice: (a) the auditor has access to an unlabelled dataset from which it can actively choose observations to label, and (b) the auditor has no pre-existing data and must actively generate test cases. We simulate the latter by generating data from an LLM and then using the auditor's accuracy model to prioritize which observations to annotate. We evaluate each method over 100 independent trials, measuring rejection rates of $H_0^{\texttt{mod}}$ (power to detect failure modes) and $H_0^{\texttt{aud}}$ (audit passage), as well as the number of observations required to reach a decision.
Among the adaptive testing methods, SR-LR-UI achieved the highest power for detecting failure modes (Fig. [2](https://arxiv.org/html/2605.07002#S4.F2)). UI variants (SR-LR-UI, LR-UI) generally outperformed their fixed-probability counterparts (SR-LR, LR), as using a probability estimator can further optimize the auditor's bets and thus the power for detecting failure modes. Integrating changepoint detection also improves power for UI procedures (i.e., SR-LR-UI > LR-UI), though it had a minimal effect for fixed-probability methods. The reason is that in the setting where the auditor can adaptively place bets against the failure rate, SR-LR-UI can tune the alternative probability to match the failure mode identified by the auditor after some initial learning period, while LR-UI is constrained to finding an alternative probability that works well from the very beginning. In contrast, in the fixed-probability setting, the two methods are more similar because the alternative probabilities are fixed to be the same, leaving little room for optimizing power.
As expected, adaptive testing strategies substantially outperformed the pre-specified auditing approaches of Stratified and Pre-learned sampling. Furthermore, because the pre-specified auditing strategies have low power to uncover failure modes, their tests frequently reject the auditor's null. While this is mathematically correct, since the AI system has passed these specific audits, it risks being misinterpreted as indicating robustness of the AI system. Finally, the Oracle auditor outperforms all the existing methods since it knows the true failure mode, showing that rejection can occur in as few as 15–25 observations. This shows that with an even smarter auditing strategy, such as one informed by prior knowledge, one may be able to complete the statistical test with even smaller test suites.
When the model's null is true, every method correctly rejects $H_0^{\texttt{aud}}$ at high rates, certifying that the audit has passed. This confirms that Type I error for $H_0^{\texttt{mod}}$ is controlled: no method falsely reports a failure mode when none exists. We highlight that substantially more observations are needed by the testing procedures that use more adaptive auditors, as they are better at finding poor-performing subgroups. This is desirable, as it means the auditing strategy was indeed quite good and much more data was needed to confidently reject the auditor's null.
Figure 2: Detecting failure modes with semi-synthetic data (Experiment 1). Panel (a): the auditor selects which unlabeled cases to annotate; panel (b): the auditor generates test cases to annotate. The rows correspond to settings where the model has a failure mode with decreasing magnitude/prevalence of the degradation (*Large*, *Medium*, *Small*); the bottom row corresponds to *No failure mode*. Left column: rate of rejecting the dueling null hypotheses, where orange indicates the rate of detecting a failure mode, blue indicates passing the audit, and grey corresponds to neither (inconclusive). Right column: boxplots of the number of cases annotated to reach the correct conclusion or the maximum sample size.
### 4.2 Experiment 2: Auditing an LLM pipeline for clinical notes
We next applied our framework to a real-world task: auditing an LLM-based pipeline originally developed by Kothari et al. ([2026](https://arxiv.org/html/2605.07002#bib.bib44)) that extracts social determinants of health (SDoH) from clinical notes across 28 categories, including housing stability, substance use, mental health, safety concerns, and patient support networks. The original dataset includes human annotations across 81 clinical notes for all categories, yielding 2268 total comparisons. The overall LLM-human agreement rate was 88%, but category-level accuracy varied from 58% to 100%. To evaluate the proposed auditing procedures on this dataset, we set the auditing threshold to $q=0.83$.
Figure[3](https://arxiv.org/html/2605.07002#S4.F3)reveals similar qualitative patterns in real data, whereSR\-LR\-UIachieves the highest power and requires the fewest samples for rejecting the model’s null \(except forOracle\)\. Visualizing which SDoH categories are being selected by the adaptive auditor, we also find that the adaptive auditor indeed learns to concentrate on the low\-accuracy categories \(Figure[3](https://arxiv.org/html/2605.07002#S4.F3)right\), with sampling frequency strongly negatively correlated with the category’s accuracy \(Spearmanρ=−0\.84\\rho=\-0\.84\)\.




Figure 3: Auditing an LLM pipeline for extraction of Social Determinants of Health (SDoH) from clinical notes (Experiment 2). Leftmost: sampling frequency vs. accuracy of different categories of SDoH clinical notes for SR-LR-UI, showing that the auditor learns to upsample low-accuracy categories. Left-center: example dual e-process, with orange and blue curves representing the tests for the model's and auditor's null, respectively. The test detects that the AI system is not robust, as the orange curve exceeds the threshold of $1/\alpha$ (red dashed line). Right-center: rates at which the AI system fails the robustness audit (orange) versus passes the audit (blue); gray is inconclusive. Rightmost: boxplots of the number of clinical notes annotated until robustness is rejected or the maximum sample size is reached.
## 5 Discussion
To understand the statistical underpinnings of highly flexible and targeted evaluation of AI workflows, this work introduces a formal hypothesis testing framework with procedures that provide anytime-valid guarantees under arbitrary adaptive sampling and optional stopping. The key insight is to represent robustness auditing of AI systems from two dueling perspectives: the model's null, which asserts there are no failure modes, and the auditor's null, which asserts they will find one. Then, using the testing-by-betting framework, we introduce e-process-based procedures that maintain finite-sample Type I error control regardless of how adaptively the test cases are selected. Code for reproducing all results is available at [https://www.github.com/jjfenglab/safe-active-testing](https://www.github.com/jjfenglab/safe-active-testing).
Future directions include optimizing the auditing strategy, as this work aims to provide a statistical foundation for arbitrarily adaptive auditing strategies and treats the auditing strategy as a black box. In addition, our empirical evaluations focused on automated (but unpredictable) auditors so as to comprehensively test the statistical procedures across a variety of settings, but future work should study the suitability of this framework for human-in-the-loop systems. Ultimately, by providing a general statistical framework for common auditing practices of modern AI systems, we hope to empower practitioners to confidently assess AI robustness without sacrificing statistical validity.
## References
- M. Anthony and P. L. Bartlett (1999). Neural network learning: theoretical foundations. Cambridge University Press, Cambridge.
- Anthropic (2026). Demystifying evals for AI agents. [https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents). Accessed: 2026-04-28.
- P. Auer, N. Cesa-Bianchi, and P. Fischer (2002). Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47(2-3), pp. 235–256.
- E. Bandel, A. Yehudai, L. Eden, Y. Sagron, Y. Perlitz, E. Venezian, N. Razinkov, N. Ergas, S. S. Ifergan, S. Shlomov, M. Jacovi, L. Choshen, L. Ein-Dor, Y. Katz, and M. Shmueli-Scheuer (2026). General agent evaluation. arXiv [cs.AI].
- G. Berrada, J. Kossen, F. B. Smith, M. Razzak, Y. Gal, and T. Rainforth (2025). Scaling up active testing to large language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
- Á. A. Cabrera, W. Epperson, F. Hohman, M. Kahng, J. Morgenstern, and D. H. Chau (2019). FAIRVIS: visual analytics for discovering intersectional bias in machine learning. In 2019 IEEE Conference on Visual Analytics Science and Technology (VAST), pp. 46–56.
- N. Cesa-Bianchi and G. Lugosi (2006). Prediction, learning, and games. Cambridge University Press.
- A. Chouldechova and A. Roth (2020). A snapshot of the frontiers of fairness in machine learning. Commun. ACM 63(5), pp. 82–89.
- R. Christensen (2005). Testing Fisher, Neyman, Pearson, and Bayes. Am. Stat. 59(2), pp. 121–126.
- Y. Chung, T. Kraska, N. Polyzotis, K. H. Tae, and S. E. Whang (2019). Slice finder: automated data slicing for model validation. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp. 1550–1553.
- A. Cohan, S. Feldman, I. Beltagy, D. Downey, and D. Weld (2020). SPECTER: document-level representation learning using citation-informed transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2270–2282.
- D. D. Cox (1984). Multivariate smoothing spline functions. SIAM J. Numer. Anal. 21(4), pp. 789–813.
- S. Eyuboglu, M. Varma, K. Saab, J. Delbrouck, C. Lee-Messer, J. Dunnmon, J. Zou, and C. Ré (2023). Domino: discovering systematic errors with cross-modal embeddings. International Conference on Learning Representations.
- M. Feffer, A. Sinha, Z. C. Lipton, and H. Heidari (2024). Red-teaming for generative AI: silver bullet or security theater? arXiv [cs.CY].
- J. Feng, A. Gossmann, R. Pirracchio, N. Petrick, G. Pennello, and B. Sahiner (2023). Is this model reliable for everyone? Testing for strong calibration. AISTATS.
- D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, A. Jones, S. Bowman, A. Chen, T. Conerly, N. DasSarma, D. Drain, N. Elhage, S. El-Showk, S. Fort, Z. Hatfield-Dodds, T. Henighan, D. Hernandez, T. Hume, J. Jacobson, S. Johnston, S. Kravec, C. Olsson, S. Ringer, E. Tran-Johnson, D. Amodei, T. Brown, N. Joseph, S. McCandlish, C. Olah, J. Kaplan, and J. Clark (2022). Red teaming language models to reduce harms: methods, scaling behaviors, and lessons learned. arXiv [cs.CL].
- I. Gao, G. Ilharco, S. Lundberg, and M. T. Ribeiro (2023). Adaptive testing of computer vision models. IEEE/CVF International Conference on Computer Vision.
- P. Grünwald, R. de Heide, and W. Koolen (2024). Safe testing. J. R. Stat. Soc. Series B Stat. Methodol. 86(5), pp. 1091–1128.
- D. Hartmann, L. Pohlmann, L. Hanslik, N. Gießing, B. Berendt, and P. Delobelle (2026). Audit me if you can: query-efficient active fairness auditing of black-box LLMs. arXiv [cs.LG].
- K. He, X. Zhang, S. Ren, and J. Sun (2015). Deep residual learning for image recognition. arXiv [cs.CV].
- U. Hebert-Johnson, M. Kim, O. Reingold, and G. Rothblum (2018). Multicalibration: calibration for the (computationally-identifiable) masses. International Conference on Machine Learning 80, pp. 1939–1948.
- H. Husain and S. Shankar (2026). LLM evals: everything you need to know. [https://hamel.dev/blog/posts/evals-faq/](https://hamel.dev/blog/posts/evals-faq/). Accessed: 2026-01-17.
- A. Kothari, P. Vossler, J. Digitale, M. Forouzannia, E. Rosenberg, M. Lee, J. Bryant, M. Molina, J. Marks, L. Zier, and J. Feng (2026). When the domain expert has no time and the LLM developer has no clinical expertise: real-world lessons from LLM co-design in a safety-net hospital. Proc. Conf. AAAI Artif. Intell.
- Q. Kuang, B. Gang, and Y. Xia (2025). Active hypothesis testing under computational budgets with applications to GWAS and LLM. arXiv [stat.ME].
- D. Lee, J. Lee, J. Ha, J. Kim, S. Lee, H. Lee, and H. O. Song (2023). Query-efficient black-box red teaming via Bayesian optimization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 11551–11574.
- X. L. Li, F. Kaiyom, E. Z. Liu, Y. Mai, P. Liang, and T. Hashimoto (2024). AutoBencher: towards declarative benchmark construction. In The Thirteenth International Conference on Learning Representations.
- P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, B. Newman, B. Yuan, B. Yan, C. Zhang, C. A. Cosgrove, C. D. Manning, C. Re, D. Acosta-Navas, D. A. Hudson, E. Zelikman, E. Durmus, F. Ladhak, F. Rong, H. Ren, H. Yao, J. Wang, K. Santhanam, L. Orr, L. Zheng, M. Yuksekgonul, M. Suzgun, N. Kim, N. Guha, N. S. Chatterji, O. Khattab, P. Henderson, Q. Huang, R. A. Chi, S. M. Xie, S. Santurkar, S. Ganguli, T. Hashimoto, T. Icard, T. Zhang, V. Chaudhary, W. Wang, X. Li, Y. Mai, Y. Zhang, and Y. Koreeda (2023). Holistic evaluation of language models. Transactions on Machine Learning Research.
- S. Mitchell, E. Potash, S. Barocas, A. D'Amour, and K. Lum (2021). Algorithmic fairness: choices, assumptions, and definitions. Annu. Rev. Stat. Appl. 8(1), pp. 141–163.
- P. C. O'Brien and T. R. Fleming (1979). A multiple testing procedure for clinical trials. Biometrics 35(3), pp. 549–556.
- E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving (2022). Red teaming language models with language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 3419–3448.
- S. J. Pocock (1977). Group sequential methods in the design and analysis of clinical trials. Biometrika 64(2), pp. 191–199.
- A. Ramdas and R. Wang (2025). Hypothesis testing with e-values. arXiv [math.ST].
- D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2023). GPQA: a graduate-level Google-proof Q&A benchmark. arXiv [cs.AI].
- M. T. Ribeiro and S. Lundberg (2022). Adaptive testing and debugging of NLP models. In ACL 2022.
- M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh (2020). Beyond accuracy: behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4902–4912.
- S. W. Roberts (1966). A comparison of some control chart procedures. Technometrics 8(3), pp. 411–430.
- S. Sadhuka, D. Prinster, C. Fannjiang, G. Scalia, A. Regev, and H. Wang (2025). E-valuator: reliable agent verifiers with sequential hypothesis testing. arXiv [cs.LG].
- E. Scornet, G. Biau, and J. Vert (2015). Consistency of random forests. Ann. Stat. 43(4), pp. 1716–1741.
- G. Shafer (2021). Testing by betting: a strategy for statistical and scientific communication. J. R. Stat. Soc. Ser. A Stat. Soc. 184(2), pp. 407–431.
- J. Shin, A. Ramdas, and A. Rinaldo (2024). E-detectors: a nonparametric framework for sequential change detection. New England Journal of Statistics in Data Science.
- A. N. Shiryaev (1963). On optimum methods in quickest detection problems. Theory Probab. Appl. 8(1), pp. 22–46.
- H. Singh, F. Xia, M. Kim, R. Pirracchio, R. Chunara, and J. Feng (2023). A brief tutorial on sample size calculations for fairness audits. Workshop on Regulatable Machine Learning at the 37th Conference on Neural Information Processing Systems.
- A. Subbaswamy, R. Adams, and S. Saria (2021). Evaluating model robustness and stability to dataset shift. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, PMLR 130, pp. 2611–2619.
- A. Subbaswamy, B. Sahiner, N. Petrick, V. Pai, R. Adams, M. C. Diamond, and S. Saria (2024). A data-driven framework for identifying patient subgroups on which an AI/machine learning model may underperform. NPJ Digit. Med. 7(1), pp. 334.
- U.S. Food and Drug Administration (2020). Recommended content and format of non-clinical bench performance testing information in premarket submissions. [https://www.fda.gov/regulatory-information/search-fda-guidance-documents/recommended-content-and-format-non-clinical-bench-performance-testing-information-premarket](https://www.fda.gov/regulatory-information/search-fda-guidance-documents/recommended-content-and-format-non-clinical-bench-performance-testing-information-premarket). Accessed: 2026-04-14.
- R. Vivek, K. Ethayarajh, D. Yang, and D. Kiela (2024). Anchor points: benchmarking models with much fewer examples. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1576–1601.
- C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011). The Caltech-UCSD Birds-200-2011 dataset. [https://authors.library.caltech.edu/records/cvm3y-5hh21](https://authors.library.caltech.edu/records/cvm3y-5hh21).
- A. Wald (1945). Sequential tests of statistical hypotheses. Ann. Math. Stat. 16(2), pp. 117–186.
- L. Wasserman, A. Ramdas, and S. Balakrishnan (2020). Universal inference. Proc. Natl. Acad. Sci. U. S. A. 117(29), pp. 16880–16890.
- E. Yan (2025). Product evals in three simple steps. [https://eugeneyan.com/writing/product-evals/](https://eugeneyan.com/writing/product-evals/). Accessed: 2026-04-28.
- T. Yan and C. Zhang (2022). Active fairness auditing. In International Conference on Machine Learning, pp. 24929–24962.
- Z. Zeng, Y. Wang, H. Hajishirzi, and P. W. Koh (2026). EvalTree: profiling language model weaknesses via hierarchical capability trees. Conference on Language Modeling.
## Appendix A Proofs
###### Proof of Theorem[1](https://arxiv.org/html/2605.07002#Thmtheorem1)\.
Recall the model's hypotheses are
$$H_{0}^{\texttt{mod}}:\ \forall S\in\mathcal{S}_{\epsilon},\ V(S)\geq q,\qquad H_{1}^{\texttt{mod}}:\ \exists S\in\mathcal{S}_{\epsilon}\ \text{such that}\ V(S)<q,$$
and the auditor's hypotheses can be formulated as
$$H_{0}^{\texttt{aud},[m]}:\ \exists\tau\in[m]\ \text{such that}\ \forall t\geq\tau,\ V(S_{t}^{\pi})<q,\qquad H_{1}^{\texttt{aud},[m]}:\ \forall\tau\in[m],\ \exists t\geq\tau\ \text{such that}\ V(S_{t}^{\pi})\geq q.$$
As $m\rightarrow\infty$, denote
$$H_{0}^{\texttt{aud},\mathbb{N}}:=\lim_{m\rightarrow\infty}H_{0}^{\texttt{aud},[m]}:\ \exists\tau\in\mathbb{N}\ \text{such that}\ \forall t\geq\tau,\ V(S_{t}^{\pi})<q,\tag{9}$$
$$H_{1}^{\texttt{aud},\mathbb{N}}:=\lim_{m\rightarrow\infty}H_{1}^{\texttt{aud},[m]}:\ \forall\tau\in\mathbb{N},\ \exists t\geq\tau\ \text{such that}\ V(S_{t}^{\pi})\geq q.\tag{10}$$
Since $H_{0}^{\texttt{mod}}$ and $H_{1}^{\texttt{aud},\mathbb{N}}$ are the respective complements of $H_{1}^{\texttt{mod}}$ and $H_{0}^{\texttt{aud},\mathbb{N}}$, we only need to prove $H_{1}^{\texttt{mod}}\Leftrightarrow H_{0}^{\texttt{aud},\mathbb{N}}$.
Notice that $H_{0}^{\texttt{aud},\mathbb{N}}$ is equivalent to $\lim_{t\rightarrow\infty}V(S_{t}^{\pi})<q$. If $H_{0}^{\texttt{aud},\mathbb{N}}$ is true, then there exists a large $T$ such that $V(S_{T}^{\pi})<q$, and $H_{1}^{\texttt{mod}}$ is true by taking $S^{*}=S_{T}^{\pi}$. On the other hand, if $H_{1}^{\texttt{mod}}$ is true, then by asymptotic consistency of the auditor's strategy, $\lim_{t\rightarrow\infty}V(S_{t}^{\pi})<q$ and thus $H_{0}^{\texttt{aud},\mathbb{N}}$ is true. Therefore, $H_{1}^{\texttt{mod}}$ is true iff $H_{0}^{\texttt{aud},\mathbb{N}}$ is true.
∎
###### Proof of Theorem[2](https://arxiv.org/html/2605.07002#Thmtheorem2)\.
For any $m'\in[m]$, let $S^{1,\pi}=\{S_1^{1,\pi},S_2^{1,\pi},\dots\}$ be any sequence of subsets identified by the auditor's sampling strategy $\pi$ that satisfies $H_1^{\texttt{aud},m'}$. Suppose, to the contrary, that $\sup_{t<m'}\mathbb{E}(\phi(t);S_1^{1,\pi},\dots,S_t^{1,\pi})>\alpha$, i.e., the power can be above $\alpha$ at some $t<m'$.
Next, construct $S^{0,\pi}=\{S_1^{0,\pi},S_2^{0,\pi},\dots\}$ such that $S^{0,\pi}$ satisfies $H_0^{\texttt{aud},m'}$ and $S_t^{0,\pi}=S_t^{1,\pi}$ for $t<m'$. However, this implies
$$
\sup_{t<m'}\mathbb{E}(\phi(t);S_1^{0,\pi},\dots,S_t^{0,\pi})=\sup_{t<m'}\mathbb{E}(\phi(t);S_1^{1,\pi},\dots,S_t^{1,\pi})>\alpha,
$$
which contradicts $\sup_{t}\mathbb{E}\left(\phi(t)\mid H_0^{\texttt{aud},m'}\text{ is true}\right)\leq\alpha$. Therefore, by contradiction, $\sup_{t<m'}\mathbb{E}\left(\phi(t)\mid H_1^{\texttt{aud},m'}\text{ is true}\right)\leq\alpha$, which holds for any $m'\in[m]$.
∎
###### Proof of Theorem[3](https://arxiv.org/html/2605.07002#Thmtheorem3)\.
If the model's null is true, then the auditor's null cannot be true, and vice versa. Since at most one of the two null hypotheses can be true, a false rejection can only occur for the single true null; if each individual test is controlled at level $\alpha$, the FWER is therefore controlled at level $\alpha$. ∎
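One way to write this argument out in a single display (assuming each test is level-$\alpha$ against its own null, as established by the earlier theorems):

```latex
\mathrm{FWER}
 = \sup_{P}\, P\bigl(\text{some true null is rejected}\bigr)
 = \max\!\Bigl\{\sup_{P\in H_0^{\texttt{mod}}} P\bigl(\text{reject } H_0^{\texttt{mod}}\bigr),\;
                \sup_{P\in H_0^{\texttt{aud},m}} P\bigl(\text{reject } H_0^{\texttt{aud},m}\bigr)\Bigr\}
 \;\le\; \alpha .
```

Because at most one null holds under any data-generating distribution $P$, the supremum is a maximum over the two cases rather than a union bound, so no Bonferroni correction is needed.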
## Appendix B Testing procedures
### B.1 Pseudocode for dual testing procedure
The pseudocode for dual testing is given in Algorithm[1](https://arxiv.org/html/2605.07002#alg1). The model's test $E_t^{\texttt{mod}}$ can be $E_t^{\text{LR},\texttt{mod}}$, $E_t^{\text{LR-UI},\texttt{mod}}$, $E_t^{\text{SR-LR},\texttt{mod}}$, or $E_t^{\text{SR-LR-UI},\texttt{mod}}$, and the auditor's test $E_t^{\texttt{aud}}$ can be $E_t^{\text{LR-UI},\texttt{aud}}$ or $E_t^{\text{SR-LR},\texttt{aud}}$.
Algorithm 1: Dual testing
1: Initialize $E_0^{\texttt{mod}}=E_0^{\texttt{aud}}=1$ and an auditing strategy $\pi$.
2: for $t=1,2,\ldots$ do
3: &nbsp;&nbsp; Pick subgroup $S_t^{\pi}$ based on the audit strategy $\pi$.
4: &nbsp;&nbsp; Sample test observation $X_t$ from $S_t^{\pi}$.
5: &nbsp;&nbsp; Evaluate the AI system's response on $X_t$ and obtain the score $Y_t$.
6: &nbsp;&nbsp; Update $E_t^{\texttt{mod}}$.
7: &nbsp;&nbsp; if $E_t^{\texttt{mod}}>1/\alpha$ then
8: &nbsp;&nbsp;&nbsp;&nbsp; return Reject $H_0^{\texttt{mod}}$: failure mode is detected.
9: &nbsp;&nbsp; end if
10: &nbsp;&nbsp; if $t\geq m$ then
11: &nbsp;&nbsp;&nbsp;&nbsp; Update $E_t^{\texttt{aud}}$.
12: &nbsp;&nbsp;&nbsp;&nbsp; if $E_t^{\texttt{aud}}>1/\alpha$ then
13: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return Reject $H_0^{\texttt{aud},m}$: audit is passed.
14: &nbsp;&nbsp;&nbsp;&nbsp; end if
15: &nbsp;&nbsp; end if
16: end for
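As an illustration only, the loop above can be sketched in Python with a simple Bernoulli likelihood-ratio bet standing in for the e-value updates $E_t^{\texttt{mod}}$ and $E_t^{\texttt{aud}}$ (the paper's estimators are more elaborate); `sample_score` and the fixed alternative `gamma` are hypothetical placeholders.

```python
def dual_test(sample_score, q=0.85, gamma=0.60, alpha=0.05, m=40, budget=250):
    """Sketch of Algorithm 1 with Bernoulli likelihood-ratio bets as e-values.

    sample_score(t) returns the 0/1 score Y_t of the case audited at step t
    (standing in for steps 3-5: pick S_t, sample X_t, evaluate the system).
    gamma < q is a fixed alternative accuracy; the paper learns it adaptively.
    """
    e_mod = e_aud = 1.0
    for t in range(1, budget + 1):
        y = sample_score(t)
        # Step 6: grow E_t^mod when the evidence favors accuracy gamma over q.
        e_mod *= (gamma / q) if y else ((1 - gamma) / (1 - q))
        if e_mod > 1 / alpha:
            return "reject model null", t    # failure mode detected
        if t >= m:
            # Step 11: grow E_t^aud when the evidence favors accuracy q over gamma.
            e_aud *= (q / gamma) if y else ((1 - q) / (1 - gamma))
            if e_aud > 1 / alpha:
                return "reject auditor null", t  # audit is passed
    return "inconclusive", budget
```

For instance, a system that always answers correctly eventually passes the audit, while one that always fails triggers a rapid rejection of the model's null.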
### B.2 Auditing strategies
While the theoretical results above establish that rigorous finite-sample inference is possible under active testing, the practical value of this framework depends on having a good auditing strategy $\pi$ for discovering failure modes.
**Scenario (a): Unlabeled data.** If one has unlabeled data from the target population, one could learn a poor-performing subgroup with online learning. For instance, one may (re)train an AI/ML model $\hat{\pi}$ at the start of each iteration $t$ on the labeled data collected up to time $t-1$ to estimate the conditional expectation of the score $E[Y|X]$. This estimate can then be used to determine sampling. An appropriate exploitation-exploration tradeoff should be made to ensure asymptotic consistency of the auditor. In our experiments, the auditor greedily defines the subset $S_t^{\pi}$ as the observations $X$ predicted to be in the bottom $\epsilon$ percentile *and* to have predicted scores $\hat{\pi}(X)$ below the target threshold $q$. If such an $\epsilon$-sized subgroup does not exist, then the auditor instead favors exploration at time $t$ by sampling among observations whose lower confidence bound on the predicted score falls within the bottom 50%, following the UCB strategy in the bandit literature [Auer et al., [2002](https://arxiv.org/html/2605.07002#bib.bib78)].
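A minimal sketch of this greedy step, assuming a fitted predictor is available: `pred` holds $\hat{\pi}(X)$ over the unlabeled pool and `lcb` holds lower confidence bounds on those predictions; both are hypothetical inputs rather than the paper's exact implementation.

```python
import numpy as np

def pick_subgroup(pred, lcb, q=0.85, eps=0.05):
    """Greedy subgroup choice from Scenario (a), sketched.

    pred: predicted scores pi_hat(X) over the unlabeled pool.
    lcb:  lower confidence bounds on those predicted scores.
    Returns the indices of the subset S_t to sample from.
    """
    pred, lcb = np.asarray(pred, float), np.asarray(lcb, float)
    k = max(1, int(np.ceil(eps * len(pred))))  # an eps-fraction of the pool
    bottom = np.argsort(pred)[:k]              # bottom-eps percentile by prediction
    if np.all(pred[bottom] < q):
        return bottom                          # exploit: a plausible failure mode
    # Explore: sample among points whose LCB falls in the bottom 50%.
    return np.flatnonzero(lcb <= np.median(lcb))
```

When the bottom-$\epsilon$ group already sits below $q$, the auditor exploits it; otherwise the LCB-based exploration keeps every plausibly poor region in play, which is what drives the asymptotic consistency discussed above.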
**Scenario (b): Data generation.** In certain settings, one may not have a pre-existing unlabeled dataset $\{X_j\}$. This can occur, for instance, when one is building a new AI pipeline and no data has yet been collected from the target application. To audit such an AI pipeline, the auditor must instead generate $X$ values to audit. In the ideal setup, the auditor can (i) determine that its selected subgroup $S_t^{\pi}$ has prevalence at least $\epsilon$ and (ii) generate observations from $S_t^{\pi}$ with respect to the probability measure $\mu$ restricted to the subgroup. While the former is plausible as long as the auditor has adequate prior knowledge, the latter is highly dependent on the auditor's generation capabilities in practice (whether $X$ is constructed manually or automatically using a foundation model). In such cases, a more accurate and honest description of the audit is with respect to the actual data generation mechanism, which is an approximation of the probability measure in the target application.
## Appendix C Experiment 1: Detailed Setup
We provide complete details for the semi\-synthetic experiments described in Section[4](https://arxiv.org/html/2605.07002#S4)\.
##### Data and label simulation\.
We use the GPQA dataset [Rein et al., [2023](https://arxiv.org/html/2605.07002#bib.bib42)], which contains 448 graduate-level multiple-choice science questions spanning three domains: Chemistry (187 questions), Biology (142 questions), and Physics (119 questions). Each question is also categorized by education level: easy undergraduate, hard undergraduate, hard graduate, and post-graduate.
We simulate binary accuracy labels (correct/incorrect) by assigning each domain $\times$ education level combination a ground-truth accuracy probability. As an illustration, Table[1](https://arxiv.org/html/2605.07002#A3.T1) shows the accuracy probabilities in Scenario (a) discussed below for Chemistry (the poor-performing subgroup) across education levels; Biology and Physics maintain accuracy $\geq 0.85$ in all settings. The null threshold is $q=0.85$ throughout.
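For concreteness, the label simulation can be sketched as follows; the accuracy values below are hypothetical placeholders, since Table 1's actual entries are not reproduced in this text.

```python
import numpy as np

# Hypothetical per-(domain, level) accuracies for illustration only;
# see Table 1 for the actual Scenario (a) settings.
ACC = {
    ("Chemistry", "post-graduate"): 0.60,  # planted failure mode (illustrative)
    ("Chemistry", "hard graduate"): 0.75,
}
DEFAULT_ACC = 0.90  # Biology and Physics stay above the q = 0.85 threshold

def simulate_labels(questions, seed=0):
    """Draw a binary correct/incorrect label for each (domain, level) pair."""
    rng = np.random.default_rng(seed)
    return [int(rng.random() < ACC.get(key, DEFAULT_ACC)) for key in questions]
```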
Table 1: Simulated accuracy settings for Chemistry across magnitudes of the degradation in Experiment 1 Scenario (a). The Biology and Physics domains maintain accuracy $\geq 0.85$ in all settings.
##### Testing parameters\.
The maximum sample budget is 250 observations. We initialize with $n_{\text{init}}=2$ randomly sampled observations before active selection begins. The *Pre-learned* sampler uses an additional 30 held-out observations for pre-training.
##### Adaptive sampling\.
The adaptive sampler trains a model of the accuracy to identify the subgroup most likely to be a failure mode. The accuracy model takes as input the text embedding of the GPQA question and its metadata (domain and education level), generated from the sentence-transformers/allenai-specter model [Cohan et al., [2020](https://arxiv.org/html/2605.07002#bib.bib81)]. Accuracy is modeled using an ensemble of neural networks with one hidden layer, tuned over hidden dimension $\in\{5,10\}$ and number of epochs $\in\{2,4,8,16\}$. At each iteration, the best hyperparameter configuration was selected as the one with the smallest exponentially weighted held-out loss, which allows the selection procedure to adapt to shifts in the optimal hyperparameters as the accuracy model improves over time. We use $\epsilon=0.05$ for $\epsilon$-greedy exploration.
For UI-based estimators, we construct adaptive alternative probabilities in LR-UI and SR-LR-UI using the Exponentially Weighted Averaging Forecaster algorithm [Cesa-Bianchi and Lugosi, [2006](https://arxiv.org/html/2605.07002#bib.bib10)]. That is, the predicted probability for testing the model's null, $\hat{\gamma}_{t-1}^{\texttt{mod}}$, is an adaptively weighted average over a prespecified grid of $B$ plausible probabilities $\mathcal{Q}_1^{\texttt{mod}}=\{q_b\in(0,q):b\in[B]\}$ under $H_1^{\texttt{mod}}$. In particular, $\hat{\gamma}_{t-1}^{\texttt{mod}}=\sum_{b=1}^{B}u_{t-1,b}\,q_b$ with the weights $u_{t-1,b}$ defined as
$$
u_{t,b}=\frac{u_{t-1,b}\exp\left(\lambda\log\left(\frac{p(Y_t;q_b)}{p(Y_t;q)}\right)\right)}{\sum_{b'=1}^{B}u_{t-1,b'}\exp\left(\lambda\log\left(\frac{p(Y_t;q_{b'})}{p(Y_t;q)}\right)\right)}.
$$
The predicted probability for testing the auditor's null, $\hat{\gamma}_{t-1}^{\texttt{aud}}$, is computed similarly.
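The weight update above can be sketched for Bernoulli scores, where $p(y;\gamma)=\gamma^{y}(1-\gamma)^{1-y}$; the two-point grid and learning rate $\lambda$ in the test below are illustrative choices, not the paper's.

```python
import numpy as np

def bern(y, g):
    """Bernoulli likelihood p(y; g) for a binary score y."""
    return g if y == 1 else 1.0 - g

def ewa_update(u, y, q_grid, q, lam=0.5):
    """One exponentially-weighted-forecaster step for the weights u_{t,b}.

    u:      current weights over the alternative grid {q_b} (sums to one).
    y:      newly observed binary score Y_t.
    q:      the null threshold.
    Returns the updated weights and the new prediction gamma_hat.
    """
    u = np.asarray(u, dtype=float)
    gains = np.array([lam * np.log(bern(y, qb) / bern(y, q)) for qb in q_grid])
    w = u * np.exp(gains)                      # numerator of the update
    u_new = w / w.sum()                        # normalize over the grid
    gamma_hat = float(u_new @ np.asarray(q_grid, dtype=float))
    return u_new, gamma_hat
```

After observing an error ($Y_t=0$), the weights shift toward the lower alternatives $q_b$, pulling the predicted $\hat{\gamma}_t$ down, as one would want when betting against the null.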
##### Scenario \(a\): Unlabeled data\.
The auditor has access to all 448 GPQA questions without labels. At each iteration, the auditor selects one question to label based on predicted accuracy (greedy selection of the lowest predicted accuracy, with $\epsilon$-greedy exploration). The selected question is then removed from the pool.
##### Scenario \(b\): Data generation\.
The auditor generates a question from each of the $3\times 3$ categories corresponding to the levels of domain and education level. (To control computational costs, we pre-generate a question bank using an LLM for each category, rather than querying an LLM at every iteration.) The auditor then selects among the generated candidates using $\epsilon$-greedy exploration.
##### Evaluation\.
We run 100 simulation replicates for each simulation setting. For each replicate, we record: (1) which null hypothesis was rejected ($H_0^{\texttt{mod}}$ or $H_0^{\texttt{aud}}$); (2) the stopping time (the number of observations until correct rejection or until the maximum sample size is reached). We report rejection rates and boxplots of stopping times across trials.
##### Compute time\.
The procedure is quick to run and requires minimal infrastructure. With only two CPU cores and 5 GB of memory, a single audit completed within 15 minutes on average. Most of the time was spent updating the auditor with each new observation, as the auditor conducted hyperparameter tuning of the NN ensemble at each iteration. Significant speed-ups are possible by parallelizing hyperparameter tuning.
##### Ablation study on the pre-auditor budget $m$\.
We perform an ablation study to examine how the budget $m$ prior to starting the auditor's test affects the performance of the proposed procedures. Specifically, we evaluate $m\in\{10,40,70\}$ while holding all other experimental settings constant. As shown in Figure[C.1](https://arxiv.org/html/2605.07002#A3.F1) for $m=10$ and Figure[C.2](https://arxiv.org/html/2605.07002#A3.F2) for $m=70$, results are qualitatively similar to those with $m=40$ in Figure[2](https://arxiv.org/html/2605.07002#S4.F2) in the main text. Thus, the procedure is relatively insensitive to the choice of budget $m$.
(a) Auditor selecting unlabeled cases to annotate
(b) Auditor generates test cases to annotate

Figure C.1: Detecting failure modes with semi-synthetic data (Experiment 1) with $\mathbf{m=10}$. The rows correspond to settings where the model has a failure mode with decreasing magnitude/prevalence of the degradation (*Large*, *Medium*, *Small*); the bottom row corresponds to *No failure mode*. Left column: rate of rejecting the dueling null hypotheses, where orange indicates the rate of detecting a failure mode, blue indicates passing the audit, and grey corresponds to neither (inconclusive). Right column: boxplots of the number of cases annotated during the audit to reach the correct conclusion or the maximum sample size.

(a) Auditor selecting unlabeled cases to annotate
(b) Auditor generates test cases to annotate

Figure C.2: Detecting failure modes with semi-synthetic data (Experiment 1) with $\mathbf{m=70}$. The rows correspond to settings where the model has a failure mode with decreasing magnitude/prevalence of the degradation (*Large*, *Medium*, *Small*); the bottom row corresponds to *No failure mode*. Left column: rate of rejecting the dueling null hypotheses, where orange indicates the rate of detecting a failure mode, blue indicates passing the audit, and grey corresponds to neither (inconclusive). Right column: boxplots of the number of cases annotated during the audit to reach the correct conclusion or the maximum sample size.
## Appendix D Experiment 2: Detailed Setup
We provide complete details for the clinical NLP experiment described in Section[4](https://arxiv.org/html/2605.07002#S4)\.
##### Data and task\.
We evaluate an LLM-based pipeline for extracting social determinants of health (SDoH) from clinical notes at an urban medical center. The pipeline processes discharge summaries and social work notes to extract structured information across 28 SDoH categories, including housing stability, substance use patterns, mental health needs, safety concerns, patient contacts, and care coordination details. The dataset consists of 81 clinical notes with human expert annotations for each of the 28 categories, yielding 2,268 total note $\times$ category comparisons. Each comparison is scored as correct (1) if the LLM extraction matches the human annotation, and incorrect (0) otherwise. Table[2](https://arxiv.org/html/2605.07002#A4.T2) provides the complete list of SDoH categories with their definitions as specified in the LLM prompts.
##### Category\-level performance\.
The overall LLM-human agreement rate is 88%, but performance varies substantially across categories. The worst-performing categories include: high prioritization (58%), patient contacts (68%), social work consult (69%), minimal contacts (77%), and mental health (79%). The best-performing categories achieve 95–100% agreement, including substance use date of last use (100%), immigration interventions (99%), and several substance use subcategories. We set the null threshold $q=0.85$, so 11 of the 28 categories constitute failure modes.
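The failure-mode count follows directly from thresholding per-category agreement at $q$; a sketch using only the rates quoted above (the remaining 21 categories are omitted here):

```python
# Per-category LLM-human agreement rates quoted in the text;
# the other 21 of the 28 categories are omitted for brevity.
agreement = {
    "high prioritization": 0.58,
    "patient contacts": 0.68,
    "social work consult": 0.69,
    "minimal contacts": 0.77,
    "mental health": 0.79,
    "substance use date of last use": 1.00,
    "immigration interventions": 0.99,
}

q = 0.85  # null threshold
failure_modes = sorted(c for c, a in agreement.items() if a < q)
```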
##### Adaptive learning\.
For each note $\times$ category pair, we construct input features by concatenating (1) the category-specific prompt definition from Table[2](https://arxiv.org/html/2605.07002#A4.T2) and (2) the clinical note text. We compute embeddings using the sentence-transformers/allenai-specter model [Cohan et al., [2020](https://arxiv.org/html/2605.07002#bib.bib81)]. The accuracy model is the same neural network ensemble as in Experiment 1.
##### Testing parameters\.
We set $m=40$ (audit budget), a maximum sample size of 400, and $n_{\text{init}}=20$ initial random samples. We report rejection rates and stopping times from 40 simulation replicates.
Table 2:SDoH categories and their definitions as specified in the LLM extraction prompts\. Definitions are abbreviated for space; full prompts include additional examples and classification guidance\.
## Appendix E More Results: CUB-200-2011 Image Data
##### Data and task\.
We evaluate an image classifier on the CUB-200-2011 dataset [Wah et al., [2011](https://arxiv.org/html/2605.07002#bib.bib61)], which contains 11,788 images of 200 bird species. The model under audit is a ResNet-18 classifier. The auditor's goal is to determine whether there exists a subgroup whose accuracy falls below the null threshold $q=0.85$. Binary accuracy labels are simulated by assigning each species a ground-truth accuracy probability, with white species receiving the highest accuracy, darker species moderate accuracy, and 18 black species the lowest.
##### Adaptive learning\.
We trained an accuracy model using the same NN ensemble architecture, with embeddings extracted from a pretrained ResNet-18 [He et al., [2015](https://arxiv.org/html/2605.07002#bib.bib4)] as input features.
##### Testing parameters\.
We set the budget $m=40$ prior to starting the auditor's test, a maximum sample budget of 250 observations, and $n_{\text{init}}=20$ initial random observations before active selection begins. The *Pre-learned* sampler uses an additional 30 held-out observations for pre-training. We run 40 independent trials per method. Results are shown in Figure[E.3](https://arxiv.org/html/2605.07002#A5.F3).


Figure E.3: Auditing a pre-trained ResNet-18 classifier for CUB-200-2011 bird classification. Left: rates of rejection of robustness (orange) and passage of the audit (blue). Right: boxplots of the number of images annotated when robustness is rejected or the maximum budget is exhausted.