Halt Fast! Early Stopping for Certified Robustness
Summary
This paper introduces a meta-learning framework for anytime-valid certified robustness that uses sequential E-processes to adaptively allocate compute, achieving a 20-fold reduction in sample complexity compared to traditional randomized smoothing while maintaining rigorous statistical guarantees.
# Halt Fast! Early Stopping for Certified Robustness
Source: [https://arxiv.org/html/2606.27694](https://arxiv.org/html/2606.27694)
Andrew C. Cullen (University of Melbourne, andrew.cullen@unimelb.edu.au), Paul Montague (DST Group, Adelaide), Benjamin I. P. Rubinstein (University of Melbourne)
###### Abstract
Randomized Smoothing (RS) provides rigorous robustness guarantees for neural networks without architectural constraints, yet its adoption is limited by extreme computational costs. Standard RS requires tens of thousands of model evaluations per input and forces practitioners to commit to fixed sample sizes *a priori*. In this work, we present a novel meta-learning framework for anytime-valid certified robustness that adaptively deploys computational resources. By using a lightweight meta-learner to predict image-specific priors for a sequential E-process, we achieve a 20-fold reduction in sample complexity compared to traditional methods while maintaining rigorous statistical guarantees. Beyond raw efficiency, we demonstrate how anytime-validity enables adaptively allocating compute based upon application-specific risk thresholds, a form of resource triage impossible under classic certification frameworks. That this is achievable while also providing similar certification performance demonstrates that our approach provides a pathway for real-time, safety-critical certification deployments.
## 1 Introduction
For all of their transformative utility, neural networks remain notoriously sensitive to oftentimes semantically meaningless modifications (Szegedy et al., [2013](https://arxiv.org/html/2606.27694#bib.bib1)). These modifications, now known as adversarial examples (Goodfellow et al., [2014](https://arxiv.org/html/2606.27694#bib.bib2); Madry et al., [2018](https://arxiv.org/html/2606.27694#bib.bib3)), have spurred research that has consistently demonstrated that model decision boundaries often lack the semantic alignment required for safety-critical deployments.
While defenses have been proposed against these manipulations, they are fundamentally constrained within a technological arms race, where each new defense provides something new to attack (Goodfellow et al., [2014](https://arxiv.org/html/2606.27694#bib.bib2); Cullen et al., [2025](https://arxiv.org/html/2606.27694#bib.bib4)). By contrast, Certified Robustness has emerged as a rigorous framework for mathematically guaranteeing that a model's prediction remains invariant within a defined neighborhood of an input (Lecuyer et al., [2019](https://arxiv.org/html/2606.27694#bib.bib6); Cohen et al., [2019](https://arxiv.org/html/2606.27694#bib.bib5); Cullen et al., [2022](https://arxiv.org/html/2606.27694#bib.bib7)). In the case of classification systems, this formulation is typically defined in terms of a radius $r$ such that for a model $F$ we can guarantee that $F(x)=F(x')$ for all $x'$ in the ball $B_p(x,r)=\{x':\|x'-x\|_p\leq r\}$.
Among certification methods, Randomized Smoothing (RS) is unique in that it can be applied to any model without architectural modifications. However, this flexibility comes at a cost: to complete a certification, an input must be passed through the model tens of thousands of times, with each copy offset by a small perturbation drawn from a Normal distribution. This high sample complexity, which is required to control Type-I (false positive) errors, renders real-time certification nigh-on impossible, especially for models at scale.
Recent advances have sought to alleviate this overhead through sequential testing and early stopping. Most notably, the introduction of E-values (Shafer and Vovk, [2019](https://arxiv.org/html/2606.27694#bib.bib15); Ramdas et al., [2023](https://arxiv.org/html/2606.27694#bib.bib16)) and test martingales has allowed for anytime-valid certifications (Voráček, [2024](https://arxiv.org/html/2606.27694#bib.bib8)). By framing certification as a super-martingale wealth process, sampling can be halted the moment sufficient evidence for robustness is accumulated without violating statistical safety. However, current E-value applications in robustness have primarily focused on binary hypothesis testing (e.g., asking whether $r\geq c$ or not), reducing the certifier to a simple threshold-based classifier. This limitation robs the process of the discriminatory detail required to assess relative security across different samples.
In this work, we argue that the primary utility of anytime\-valid certifications is not necessarily their computational efficiency, but rather the ability to adaptively shape termination conditions around application\-based workflows\. In aid of this goal, our contributions are three\-fold:
1. Method of Mixtures for Continuous Hypotheses: We extend E-value certifications to a mixture-based multiple-hypothesis approach to match traditional certification workflows.
2. Sample-Adaptive Meta-Learning: We introduce an optimized E-value formulation, where a meta-learner predicts the prior distribution to enhance efficiency. We employ a lightweight meta-learner to analyze initial model glimpses, utilizing a Bayesian Negative Log-Likelihood (NLL) objective to fit the distribution of the smoothed model's success rate for a given input.
3. Adversarial Exits: We introduce task-adaptive termination conditions that allow for early termination based upon prespecified domain tasks, producing highly efficient certifications and rejections in a manner that optimizes global compute.
Through experimental validation, we demonstrate that our approach can construct certifications significantly faster than fixed-sample methods: viable certifications can be constructed in fewer than $500$ samples, a $20$-fold decrease over prior certification workflows. Perhaps more importantly, our innovations allow for application- and sample-specific exit conditions, which are crucial for helping certifications transition from a numerically costly concept to a viable framework for producing real-world security.
### 1.1 Motivating Cases
To underscore the utility of the methods introduced within this paper, we highlight three scenarios where anytime-valid certifications are well motivated. The first is the most natural: computational efficiency. By halting as soon as a target precision is reached, we remove a primary barrier to real-world deployment. The second is resource triage: in large-scale systems, certificates can be used to route inputs into different verification pathways based on their robustness. In such cases, proving that a sample falls within a specific risk bucket is more critical than its exact radius. Finally, we suggest that streaming contexts (e.g., autonomous driving) could also be an application of this approach, where prior information from preceding frames can be used to initialize E-value certifications, allowing for faster convergence in temporally evolving environments.
## 2 Related Work
RS has evolved from its Differential Privacy foundations (Lecuyer et al., [2019](https://arxiv.org/html/2606.27694#bib.bib6); Dwork et al., [2006](https://arxiv.org/html/2606.27694#bib.bib36)) to the current state-of-the-art based on the Neyman-Pearson lemma (Cohen et al., [2019](https://arxiv.org/html/2606.27694#bib.bib5)). At their core, all RS-based certifications transform a base model $f$ into a smoothed counterpart $g$ with provable $\ell_p$ margin guarantees. As established by Cohen et al. ([2019](https://arxiv.org/html/2606.27694#bib.bib5)), for a noise level $\sigma$, the certified radius $r$ is a function of the success probability $p_A=\mathbb{P}_{\epsilon\sim\mathcal{N}(0,\sigma^2 I)}[f(x+\epsilon)=c_A]$ of the most likely class $c_A$:

$$r=\sigma\Phi^{-1}(p_A), \tag{1}$$

where $\Phi^{-1}$ is the inverse of the standard normal CDF. In practice, certifications are constructed through an independent two-phase approach, where Phase I employs an initial batch to establish the target class, and Phase II then estimates $p_A$ through Monte-Carlo sampling. To control the risk of false certification, standard approaches use the Clopper-Pearson interval (Clopper and Pearson, [1934](https://arxiv.org/html/2606.27694#bib.bib9)) to obtain a high-probability lower bound $\underline{p_A}$.
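To make the fixed-sample baseline concrete, the following is a minimal sketch of a one-sided Clopper-Pearson lower bound followed by Eq. (1). It uses only the Python standard library and finds the bound by bisection on the binomial tail; the function names and the convention of returning a radius of `0.0` to signal abstention are illustrative assumptions, not the reference implementation of Cohen et al.

```python
import math
from statistics import NormalDist


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), each term computed in log space."""
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p)
        )
        total += math.exp(log_pmf)
    return total


def clopper_pearson_lower(k, n, alpha):
    """One-sided (1 - alpha) lower bound on p from k successes in n trials."""
    if k == 0:
        return 0.0
    lo, hi = 0.0, k / n
    for _ in range(200):  # bisection: the tail probability increases with p
        mid = 0.5 * (lo + hi)
        if binom_upper_tail(k, n, mid) > alpha:
            hi = mid
        else:
            lo = mid
    return lo


def certified_radius(p_lower, sigma):
    """Eq. (1) evaluated at the lower bound; 0.0 is used here to mean 'abstain'."""
    if p_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_lower)


# Illustrative numbers only: 990 successes out of a fixed budget of 1000 samples.
p_low = clopper_pearson_lower(990, 1000, alpha=0.001)
r = certified_radius(p_low, sigma=0.5)
```

Note that `n` is an input to the bound: the guarantee holds only if the sample size was chosen before looking at the data, which is the restriction the remainder of the paper removes.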
The need for tight lower bounds is the primary driver of RS's high computational cost. High-variance inputs may require tens of thousands of samples to produce meaningful certificates. Crucially, the peeking problem (Johari et al., [2017](https://arxiv.org/html/2606.27694#bib.bib42)) invalidates the Clopper-Pearson bounds on the Type-I error rate if a practitioner monitors the empirical mean and stops early. Consequently, $N$ must be fixed *a priori*, leading to massive over-sampling for easy inputs and total failure to certify for marginal ones.
In response, sequential testing has emerged as a pathway to efficient robustness. To achieve this, the significance budget $\alpha$ is partitioned across multiple distinct stopping points $\{n_1,n_2,\ldots,n_k\}$ by way of the Bonferroni correction or alpha-spending functions (Horváth et al., [2022](https://arxiv.org/html/2606.27694#bib.bib32)). While these approaches allow for early stopping, they still retain the fundamental drawback of the frequentist framework: after each $n_i$, if the model fails to certify, then sampling continues to $n_{i+1}$. However, the correction for examining multiple stopping points requires $\alpha$ to be scaled by the number of potential comparisons (which also must be set *a priori*), increasing the number of samples required to certify to a given level. As such, these early-stopping frameworks potentially require significantly more net samples to be evaluated in order to certify a sample than a more naive implementation of Cohen et al. ([2019](https://arxiv.org/html/2606.27694#bib.bib5)). While additional efficiency can be found in *a priori* estimation of appropriate sample counts (Chen et al., [2022](https://arxiv.org/html/2606.27694#bib.bib33)), these approaches are still inherently conservative and expensive.
E-values provide a natural way to circumvent these limitations, in that they are anytime-valid and immune to the peeking problem. This allows certifications to be constructed in a manner that permits early stopping based upon arbitrary criteria. The original progenitor of this approach, Voráček ([2024](https://arxiv.org/html/2606.27694#bib.bib8)), primarily considered how this could be applied to binary robustness hypotheses (e.g., $r\geq r_0$).
In this work we leverage the Method of Mixtures (Waudby-Smith and Ramdas, [2024](https://arxiv.org/html/2606.27694#bib.bib17); Grünwald et al., [2020](https://arxiv.org/html/2606.27694#bib.bib18)) to support continuous radius estimation. Notably, prior mixture-based approaches rely on the Krichevsky–Trofimov (KT) estimator (Krichevsky and Trofimov, [1981](https://arxiv.org/html/2606.27694#bib.bib13)), which is optimized for *arbitrary* sequences. However, we contend that RS sequences are not arbitrary; they are tied to specific input manifolds. This discrepancy is the theoretical foundation for our meta-learning framework, which learns to parameterize bespoke priors that sufficiently accelerate anytime-validity.
For narrative clarity, further discussions of certifications can be found in Appendix [A](https://arxiv.org/html/2606.27694#A1).
## 3 Anytime-Valid Radius Certification
To evaluate the robustness of a model prediction of a class $c_A$ at a point $x$, we consider the success probability $p=P(f(x+\epsilon)=c_A)$ of the smoothed classifier (we stress that we presuppose knowledge of $c_A$ for mathematical convenience, and that the class is estimated appropriately in our algorithm). Assume that we observe an infinite sequence of i.i.d. Bernoulli trials $X_1,X_2,\ldots$, where each $X_i=\mathbb{I}[f(x+\epsilon_i)=c_A]$ denotes whether the $i$-th perturbation of $x$ with noise $\epsilon_i\sim\mathcal{N}(0,\sigma^2 I)$ aligns with the target class. To construct a certification, we must be able to test the null hypothesis $H_0:p\leq p_0$ for any threshold $p_0\in[0,1]$, which yields an anytime-valid lower confidence bound on $p$.
##### Test Martingales and E\-values
To achieve this, we leverage the *E-value*, a non-negative random variable $E$ such that $\mathbb{E}_{H_0}[E]\leq 1$ (Vovk and Wang, [2021](https://arxiv.org/html/2606.27694#bib.bib19); Shafer, [2019](https://arxiv.org/html/2606.27694#bib.bib44)). In our Bernoulli setting, for a point null hypothesis $H_0:p=p_0$ and a point alternative hypothesis $H_1:p=q$, the appropriate E-value is the likelihood ratio (Waudby-Smith and Ramdas, [2024](https://arxiv.org/html/2606.27694#bib.bib17))

$$E_i=\frac{q^{X_i}(1-q)^{1-X_i}}{p_0^{X_i}(1-p_0)^{1-X_i}}. \tag{2}$$

To verify the significance of an accumulated E-value, we employ a process inspired by betting games (Shafer, [2019](https://arxiv.org/html/2606.27694#bib.bib44); Vovk and Wang, [2021](https://arxiv.org/html/2606.27694#bib.bib19)). Consider the wealth process $W_t(p_0)=\prod_{i=1}^{t}E_i$, that is, the accumulation of E-values over $t$ samples. Then if $h_t=\sum_{i=1}^{t}X_i$ is the number of successes in the first $t$ trials, the total wealth must be

$$W_t(p_0)=\frac{q^{h_t}(1-q)^{t-h_t}}{p_0^{h_t}(1-p_0)^{t-h_t}}. \tag{3}$$

This expression represents the ratio of the likelihood of observing the sequence under the alternative hypothesis versus the null hypothesis.
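As a small illustration of Eqs. (2) and (3), the sketch below multiplies per-sample E-values over a short, made-up outcome sequence and compares the running product against the closed form. The outcome list and the choices $q=0.8$, $p_0=0.5$, $\alpha=0.05$ are assumptions made purely for the example.

```python
import math


def e_value(x, q, p0):
    """Per-sample likelihood ratio of Eq. (2) for one Bernoulli outcome x."""
    return (q if x == 1 else 1.0 - q) / (p0 if x == 1 else 1.0 - p0)


def wealth_closed_form(h, t, q, p0):
    """Eq. (3): wealth after t samples containing h successes."""
    return (q ** h * (1.0 - q) ** (t - h)) / (p0 ** h * (1.0 - p0) ** (t - h))


outcomes = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]  # illustrative data: 8 successes in 10
q, p0, alpha = 0.8, 0.5, 0.05

wealth = 1.0
for x in outcomes:
    wealth *= e_value(x, q, p0)  # sequential accumulation of E-values

# H0: p = p0 is rejected once the wealth reaches 1 / alpha.
rejected = wealth >= 1.0 / alpha
```

Here the wealth ends at roughly 6.87, below the threshold of 20, so ten samples are not yet enough to reject $p=0.5$ at this level. The product depends on the data only through $h_t$ and $t$, which is why the closed form can replace the loop.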
The process $(W_t(p_0))_{t\geq 1}$ is a non-negative martingale with $\mathbb{E}[W_t]=1$ under $p=p_0$. By Ville's inequality (Ville, [1939](https://arxiv.org/html/2606.27694#bib.bib11); Doob, [1940](https://arxiv.org/html/2606.27694#bib.bib10)), the set of all $p_0$ for which the wealth has not yet crossed the rejection threshold $1/\alpha$ forms the confidence interval $C_t=\left\{p_0\in[0,1]:\max_{\tau\leq t}W_\tau(p_0)<\frac{1}{\alpha}\right\}$, where the Lower Confidence Bound (LCB) is $\underline{p_t}=\inf C_t$.
###### Theorem 1 (Soundness)
For a target class $c_A$, significance level $\alpha\in(0,1)$, and prior mixture $Q$, the lower confidence bound $\underline{p_t}=\inf\{p_0:\max_{\tau\leq t}\bar{W}_\tau(p_0)<1/\alpha\}$ satisfies $P(\exists t:p<\underline{p_t})\leq\alpha$.
This result follows from the test martingale inversion principle (Howard et al., [2021](https://arxiv.org/html/2606.27694#bib.bib34); Ramdas et al., [2023](https://arxiv.org/html/2606.27694#bib.bib16)). For any fixed $p_0\in[0,1]$, the mixture wealth process $\bar{W}_t(p_0)$ is a non-negative martingale with $\bar{W}_0=1$ under the point null hypothesis $H_0:p=p_0$. By Ville's inequality, the probability that the wealth ever exceeds $1/\alpha$ is bounded by $\alpha$. Since the confidence set $C_t$ is constructed by inverting this test, and by utilizing the running maximum of the wealth process, we ensure that the confidence bounds monotonically tighten over time, guaranteeing that the set contains the true success probability $p$ for all $t\geq 1$ with probability at least $1-\alpha$. Finally, because the mixture of E-values is generally log-convex with respect to $p_0$, the set of non-rejected hypotheses forms a contiguous interval, ensuring that the anytime-valid lower bound $\underline{p_t}=\inf C_t$ is well defined.
##### The Method of Mixtures
In the context of certifications, we do not know *a priori* what the alternative $q$ is, and yet knowing this on a sample-by-sample basis is crucially important for providing enough discriminatory information to support downstream applications. As such, we instead propose utilizing the *Method of Mixtures* (Grünwald et al., [2020](https://arxiv.org/html/2606.27694#bib.bib18); Waudby-Smith and Ramdas, [2024](https://arxiv.org/html/2606.27694#bib.bib17)), which tests against the set of hypotheses defined over a prior distribution $Q(q)$ of alternative hypotheses, by way of the mixture E-value

$$\bar{W}_t(p_0)=\int_0^1\frac{q^{h_t}(1-q)^{t-h_t}}{p_0^{h_t}(1-p_0)^{t-h_t}}\,dQ(q)=\frac{m(h_t,t)}{p_0^{h_t}(1-p_0)^{t-h_t}}, \tag{4}$$

where $m(h,t)=\int_0^1 q^h(1-q)^{t-h}\,dQ(q)$ is the integrated likelihood.
###### Lemma 1 (Mixture E-values)
For any probability measure $Q$ on $[0,1]$, the mixture $\bar{W}_t(p_0)$ is a non-negative martingale under $H_0:p=p_0$, and thus an anytime-valid E-process.
In the case where $Q$ is a Beta distribution $\text{Beta}(\beta,\gamma)$, the integrated likelihood admits the closed-form solution (Lai, [1976](https://arxiv.org/html/2606.27694#bib.bib45))

$$m(h,t)=\frac{B(h+\beta,t-h+\gamma)}{B(\beta,\gamma)}, \tag{5}$$

where $B(\cdot,\cdot)$ denotes the Beta function, allowing wealth accumulation without the imprecision of numerical integration.
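A minimal sketch of Eqs. (4) and (5) in log space follows, assuming a single Beta prior and using `math.lgamma` for the Beta function so that large counts do not underflow. The function names are illustrative, and the default $\text{Beta}(0.5,0.5)$ parameters are chosen only as a convenient example prior.

```python
import math


def log_beta(a, b):
    """log B(a, b) via log-gamma, stable for large arguments."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_integrated_likelihood(h, t, beta, gamma):
    """log m(h, t) from Eq. (5) for a Beta(beta, gamma) prior."""
    return log_beta(h + beta, t - h + gamma) - log_beta(beta, gamma)


def log_wealth(h, t, p0, beta=0.5, gamma=0.5):
    """log of Eq. (4): log m(h, t) - log[p0^h (1 - p0)^(t - h)], for 0 < p0 < 1."""
    log_null = h * math.log(p0) + (t - h) * math.log1p(-p0)
    return log_integrated_likelihood(h, t, beta, gamma) - log_null


# Sanity check with a uniform prior Beta(1, 1): m(2, 3) = 2! 1! / 4! = 1/12.
m_uniform = math.exp(log_integrated_likelihood(2, 3, 1.0, 1.0))
```

Working with $\log \bar{W}_t$ and comparing against $\log(1/\alpha)$ avoids forming the ratio of two vanishingly small likelihoods directly.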
While this solution is numerically beneficial, the convenience of using the Beta distribution restricts the prior to unimodal distributions, or very restricted bimodal options. Real-world success probabilities are rarely governed by a single mode; instead they may cluster into distinct groups under some contexts. To support flexible prior distributions in heterogeneously clustered data, we extend this approach to a mixture of $K$ Beta distributions, as any convex combination of E-values is also an E-value (Vovk and Wang, [2021](https://arxiv.org/html/2606.27694#bib.bib19); Grünwald et al., [2020](https://arxiv.org/html/2606.27694#bib.bib18); Waudby-Smith and Ramdas, [2024](https://arxiv.org/html/2606.27694#bib.bib17)).
###### Lemma 2 (Convexity of E-values)
Let $E^{(1)},E^{(2)},\ldots,E^{(K)}$ be a collection of E-values for the same null hypothesis $H_0$. For any set of non-negative weights $w_k$ such that $\sum_{k=1}^{K}w_k=1$, the weighted sum $E=\sum_{k=1}^{K}w_kE^{(k)}$ is also an E-value for $H_0$.
The integrated likelihood $m_k$ under the $k$-th prior yields the total wealth (under $\sum w_k=1$)

$$\bar{W}_t(p_0)=\sum_{k=1}^{K}w_k\cdot\left(\frac{m_k(h_t,t)}{p_0^{h_t}(1-p_0)^{t-h_t}}\right). \tag{6}$$
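Eq. (6) can be evaluated stably by combining the per-component log-wealths with a log-sum-exp. The sketch below assumes untruncated Beta components given as `(weight, beta, gamma)` tuples; that data layout and the example parameters are illustrative assumptions rather than the paper's implementation.

```python
import math


def log_beta(a, b):
    """log B(a, b) via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_mixture_wealth(h, t, p0, components):
    """log of Eq. (6); components is a list of (weight, beta, gamma) with weights summing to 1."""
    log_null = h * math.log(p0) + (t - h) * math.log1p(-p0)
    terms = [
        math.log(w) + log_beta(h + b, t - h + g) - log_beta(b, g) - log_null
        for (w, b, g) in components
        if w > 0.0
    ]
    peak = max(terms)  # log-sum-exp: factor out the largest term before exponentiating
    return peak + math.log(sum(math.exp(x - peak) for x in terms))


# Illustrative two-mode prior: one component near p = 0.9, one near p = 0.1.
prior = [(0.5, 20.0, 2.0), (0.5, 2.0, 20.0)]
log_w = log_mixture_wealth(90, 100, 0.5, prior)
```

Because the mixture is a weighted average of component wealths, a badly placed component costs at most a factor of its weight: with equal weights the mixture's log-wealth is within $\log 2$ of the better component's.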
## 4 Sample-Adaptive Meta-Learning
The efficacy of the Mixture E-value process is fundamentally determined by how the probability mass $Q(q)$ is allocated across the space of alternatives to maximize wealth accumulation. Traditionally, this would be approached with the Krichevsky–Trofimov (KT) estimator (Krichevsky and Trofimov, [1981](https://arxiv.org/html/2606.27694#bib.bib13)), equivalent to a $\text{Beta}(0.5,0.5)$ prior, as it provably achieves minimum regret for worst-case arbitrary sequences (Xie and Barron, [2000](https://arxiv.org/html/2606.27694#bib.bib14)).
However, we ask the question: what happens if our inputs are not considered arbitrary? Not only are they drawn from some distribution $\mathcal{D}$, we also have knowledge of the glimpse of $N_{sel}$ samples taken through Phase I of the certification. If this information could be employed, then it would produce an information advantage relative to the naturally conservative KT prior (Waudby-Smith and Ramdas, [2024](https://arxiv.org/html/2606.27694#bib.bib17); Grünwald et al., [2020](https://arxiv.org/html/2606.27694#bib.bib18)).
Drawing upon this intuition, we introduce a Sample-Adaptive Meta-Learner $\mathcal{M}_\theta$ to perform amortized Bayesian inference (Gershman and Goodman, [2014](https://arxiv.org/html/2606.27694#bib.bib47)), predicting a bespoke mixture prior for each input. Our meta-learner leverages the Phase I information to parameterize a prior that can then inform the betting strategy for Phase II. Because these $N_{sel}$ samples are strictly discarded before wealth accumulation begins, the sequential process $W_t$ remains a predictable super-martingale (Shafer and Vovk, [2019](https://arxiv.org/html/2606.27694#bib.bib15)), and the statistical guarantee of Theorem [1](https://arxiv.org/html/2606.27694#Thmtheorem1) remains untainted by the prior's data-dependency (Ramdas et al., [2023](https://arxiv.org/html/2606.27694#bib.bib16)).
Our meta-learner ingests three distinct signals: the semantic context $\phi(x)$, the penultimate-layer embedding of the base classifier, which provides a high-dimensional representation of the image's difficulty and class; the raw softmax vector of the clean image $\mathbf{p}(x)$, representing the classifier confidence; and the empirical success rate $\hat{p}_{sel}$ observed over the glimpse of $N_{sel}$ samples.
Of these, the latter provides the most direct evidence of the true success probability $p$. Based upon our experiments, we extract scalar proxies from the classifier confidence vector $\mathbf{p}(x)$, namely either the margin $p_{max}-p_{next}$ or the entropy, which act as proxies for the model's epistemic uncertainty. $\mathcal{M}_\theta$ can thus be trained through a Kelly Criterion based loss, which is equivalent to minimizing the Negative Log-Likelihood (NLL) of the binomial sequence $X_{1:N}$ under the predicted mixture of Beta distributions by

$$\mathcal{L}(\theta)=-\mathbb{E}_{x\sim\mathcal{D}}\left[\log\left(\sum_{k=1}^{K}w_k\frac{B(h_N+\beta_k,N-h_N+\gamma_k)}{B(\beta_k,\gamma_k)}\right)\right]+\lambda_t\underbrace{\mathbb{E}_{x\sim\mathcal{D}}\left[\sum_{k=1}^{K}\text{dist}(\hat{p}_{mle},\mathcal{R}_k)\right]}_{\text{Containment Penalty}}, \tag{7}$$

where $(\beta_k,\gamma_k,w_k)=\mathcal{M}_\theta(\text{features})$, and $\lambda_t$ controls a regularizing penalty that minimizes the $\ell_1$ distance between $\hat{p}_{mle}$ and the nearest boundary of $\mathcal{R}_k$, where $\lambda_t$ starts at $10.0$ and linearly decays to $1.0$ over the first 80% of training epochs. This formalism ensures that the expected growth rate of the wealth process is maximized (Kelly, [1956](https://arxiv.org/html/2606.27694#bib.bib46)), while penalizing over-confident predictions, as the log-wealth approaches $-\infty$ when the prior mass is zero at the true $p$. Consequently, the meta-learner learns to output broader, more resilient priors for high-variance inputs while maintaining aggressive, concentrated bets for unambiguous samples so that it safely maximizes wealth accumulation. As the Phase I glimpse is inherently noisy, we augment the training process by sampling $M$ possible realizations of $\hat{p}_{sel}$ from the ground-truth bitstream by way of draws from a binomial distribution. This forces the meta-learner to learn a mapping $(\text{features},\text{glimpse})\to\text{Prior}$ that is robust to the variance of the initial sampling.
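The per-input terms of Eq. (7) can be sketched in plain Python as below. Several details are assumptions made for illustration: the containment distance is taken to be zero when $\hat{p}_{mle}$ lies inside $\mathcal{R}_k$ and the distance to the nearest endpoint otherwise, $\hat{p}_{mle}$ is taken to be $h_N/N$, and $\lambda_t$ is held at $1.0$ once the decay finishes. A training loop would average this quantity over a batch and differentiate it with an autodiff framework, which this sketch does not attempt.

```python
import math


def log_beta(a, b):
    """log B(a, b) via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def mixture_nll(h, n, components):
    """Negative log marginal likelihood of h successes in n trials under a Beta mixture."""
    terms = [
        math.log(w) + log_beta(h + b, n - h + g) - log_beta(b, g)
        for (w, b, g) in components
        if w > 0.0
    ]
    peak = max(terms)
    return -(peak + math.log(sum(math.exp(x - peak) for x in terms)))


def containment_penalty(p_mle, regions):
    """Sum over components of the distance from p_mle to [a_k, b_k] (assumed 0 inside)."""
    return sum(max(a - p_mle, 0.0, p_mle - b) for (a, b) in regions)


def penalty_weight(epoch, total_epochs, start=10.0, end=1.0, decay_fraction=0.8):
    """lambda_t: linear decay from start to end over the first 80% of epochs, then constant."""
    progress = min(epoch / (decay_fraction * total_epochs), 1.0)
    return start + (end - start) * progress


def per_input_loss(h, n, components, regions, epoch, total_epochs):
    """One input's contribution to Eq. (7), under the assumptions stated above."""
    nll = mixture_nll(h, n, components)
    penalty = containment_penalty(h / n, regions)
    return nll + penalty_weight(epoch, total_epochs) * penalty
```

For instance, `per_input_loss(950, 1000, [(1.0, 20.0, 2.0)], [(0.5, 1.0)], epoch=0, total_epochs=100)` reduces to the pure NLL term, because $\hat{p}_{mle}=0.95$ already lies inside the component's region.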
To ensure the anytime-valid responsiveness of the mixture, we constrain the predicted Beta parameters to $\beta,\gamma\in[0.1,500.0]$. The lower bound of $0.1$ allows the meta-learner to predict distributions that are sharper than the non-informative KT prior, while the upper bound of $500$ acts as a survival bias, preventing a single sampling anomaly from triggering wealth bankruptcy.
The meta-learner in our experiments takes the form of a 4-layer Residual MLP that maps these features to a $K$-component mixture prior. For each component $k$, the model predicts a weight $w_k$, Beta parameters $(\beta_k,\gamma_k)$, and (as will be discussed in Section [4.1](https://arxiv.org/html/2606.27694#S4.SS1)) the boundaries of the support region $\mathcal{R}_k$. Our formalism provides significant flexibility in how the meta-learner is constructed. Across our experiments, we consider frameworks where the number of E-values is varied, alongside three support modes:
- Full Support: The Beta distribution is defined over $p\in[0,1]$.
- Hybrid Support: A fixed partition where components are pre-assigned to the robust $[0.5,1.0]$ or non-robust $[0.0,0.5]$ regions.
- Dynamic Support: The model learns to focus on specific probability intervals, concentrating its betting resolution where it predicts the true $p$ will lie.
From this point on, we will employ the naming scheme Meta-{K}-{Support}, where K refers to the number of mixed distributions, and Support refers to one of the three support modes above.
### 4.1 Truncated Betas
To realize the partitioned support modes introduced in the preceding section, we formalize the prior $Q$ as a mixture of truncated Beta distributions. By restricting the probability mass of each mixture component $k$ to the specific (and potentially learned) interval $\mathcal{R}_k=[a_k,b_k]$, the meta-learner allocates probability mass to capture potential multi-modal distributions.
A truncated Beta prior is equivalent to a standard Beta distribution that is rejection-sampled to lie within $[a_k,b_k]$, subject to normalization by the mass of the original distribution contained within the region, $Z_k=I_{b_k}(\beta_k,\gamma_k)-I_{a_k}(\beta_k,\gamma_k)$, where $I_x$ is the regularized incomplete Beta function (the CDF of the Beta distribution). Under this prior, the integrated likelihood $m_k(h_t,t;\mathcal{R}_k)$, the marginal likelihood of the data given the regional prior $Q_k$, admits the closed form

$$m_k(h_t,t;\mathcal{R}_k)=\underbrace{\frac{B(h_t+\beta_k,t-h_t+\gamma_k)}{B(\beta_k,\gamma_k)}}_{\text{Standard Beta Update}}\cdot\underbrace{\frac{I_{b_k}(h_t+\beta_k,t-h_t+\gamma_k)-I_{a_k}(h_t+\beta_k,t-h_t+\gamma_k)}{Z_k}}_{\text{Truncation Correction}}. \tag{8}$$

Note that this formalism requires Equation [7](https://arxiv.org/html/2606.27694#S4.E7) to be updated to the form seen within Algorithm [1](https://arxiv.org/html/2606.27694#alg1). Based upon this, the total wealth $W_t(p_0)$ is then the weighted sum of the localized likelihood ratios

$$E_k(p_0)=\frac{m_k(h_t,t;\mathcal{R}_k)}{p_0^{h_t}(1-p_0)^{t-h_t}}\quad\text{and}\quad W_t^{meta}(p_0)=\sum_{k=1}^{K}w_kE_k(p_0). \tag{9}$$
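As a short check of the closed form in Eq. (8), the integrated likelihood can be written out against the truncated density $q^{\beta_k-1}(1-q)^{\gamma_k-1}/\big(B(\beta_k,\gamma_k)\,Z_k\big)$ on $[a_k,b_k]$. The only identity used is $\int_a^b q^{u-1}(1-q)^{v-1}\,dq=B(u,v)\,[I_b(u,v)-I_a(u,v)]$, which is the definition of the regularized incomplete Beta function:

$$
\begin{aligned}
m_k(h_t,t;\mathcal{R}_k)
&=\int_{a_k}^{b_k} q^{h_t}(1-q)^{t-h_t}\,\frac{q^{\beta_k-1}(1-q)^{\gamma_k-1}}{B(\beta_k,\gamma_k)\,Z_k}\,dq\\
&=\frac{1}{B(\beta_k,\gamma_k)\,Z_k}\int_{a_k}^{b_k} q^{h_t+\beta_k-1}(1-q)^{t-h_t+\gamma_k-1}\,dq\\
&=\frac{B(h_t+\beta_k,\,t-h_t+\gamma_k)}{B(\beta_k,\gamma_k)}\cdot\frac{I_{b_k}(h_t+\beta_k,\,t-h_t+\gamma_k)-I_{a_k}(h_t+\beta_k,\,t-h_t+\gamma_k)}{Z_k}.
\end{aligned}
$$

Setting $a_k=0$ and $b_k=1$ gives $Z_k=1$ and a truncation correction of $1$, recovering the untruncated Beta result of Eq. (5).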
##### Rejection Martingales and Bankruptcy Exits
The truncated Beta framework enables both heuristic and mathematically rigorous exit strategies. Under the Hybrid Support mode, we concentrate a subset of mixture heads on the robust interval $[0.5,1.0]$. If the true success probability $p$ is significantly below $0.5$, the likelihood ratios for these robust components will decay exponentially.
At every check interval $B$, we monitor whether the total wealth $W_t(p_0)$ at $p_0=0.5$ falls below a failure threshold $\epsilon_{fail}=0.1$, which we term a bankruptcy exit. If this occurs, the certifier immediately halts and rejects the sample. While this may result in rare false negatives for samples very near the decision boundary, it functions as a high-speed rejection filter that preserves global compute. This design is predicated upon a core principle of certified robustness: that conservative rejections are vastly preferable to certifications that are either computationally expensive or statistically invalid. In practice we require the process to run for $400$ samples before triggering an early rejection.
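The bankruptcy rule can be expressed as a one-line predicate evaluated at each check. The sketch assumes the wealth is tracked in log space and that $t$ counts Phase II samples; the function name is hypothetical.

```python
import math


def bankruptcy_exit(log_wealth_at_half, t, eps_fail=0.1, min_samples=400):
    """True if the sample should be rejected early under the heuristic described above.

    log_wealth_at_half is log W_t(0.5); t is assumed to count Phase II samples.
    """
    return t >= min_samples and log_wealth_at_half < math.log(eps_fail)
```

This is a heuristic rejection path rather than a certified one: it only ever declines to certify, so it cannot produce a false certificate.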
##### The Safety Anchor: Guaranteeing Survival
To guard against bankruptcy caused by misspecification of the meta-prior, we supplement the meta-learned mixture with a global safety anchor, allocating a small fixed weight $w_{safety}=0.01$ to a KT prior. This ensures the E-process remains robust even if the meta-learned components decay, by defining

$$W_t(p_0)=w_{safety}E_{safety}(p_0)+(1-w_{safety})W_t^{meta}(p_0), \tag{10}$$

where $E_{safety}(p_0)$ is the E-value generated by a $\text{Beta}(0.5,0.5)$ prior over the full support $[0,1]$. Here $E_{safety}(p_0)=m_{KT}(h_t,t)/[p_0^{h_t}(1-p_0)^{t-h_t}]$ and $m_{KT}$ is the integrated likelihood under the KT prior. This hybrid strategy provides a statistical safety net, where convergence can be reached even in cases where the meta-learner's predictions would otherwise compromise it.
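A minimal log-space sketch of Eq. (10) is given below. It assumes the meta-learned wealth is supplied as `log_w_meta` (possibly `-inf` if every learned component has collapsed); the helper names are illustrative.

```python
import math


def log_beta(a, b):
    """log B(a, b) via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_kt_e_value(h, t, p0):
    """log E_safety(p0): Beta(0.5, 0.5) integrated likelihood over the null likelihood."""
    log_m = log_beta(h + 0.5, t - h + 0.5) - log_beta(0.5, 0.5)
    return log_m - (h * math.log(p0) + (t - h) * math.log1p(-p0))


def log_anchored_wealth(log_w_meta, h, t, p0, w_safety=0.01):
    """log of Eq. (10), combining the two terms with a log-sum-exp."""
    a = math.log(w_safety) + log_kt_e_value(h, t, p0)
    b = math.log1p(-w_safety) + log_w_meta
    peak = max(a, b)
    return peak + math.log(math.exp(a - peak) + math.exp(b - peak))
```

The anchored log-wealth can never fall below $\log w_{safety}+\log E_{safety}(p_0)$, so in the worst case the process trails a pure KT bettor by $\log(1/0.01)\approx 4.6$ nats.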
## 5 The Certification Engine: Termination Policies
While the structural design of the E-process outlined in the preceding section ensures safety, efficiency is realized through active termination policies. At each check interval $B$, we solve for the LCB $\underline{p_t}$ using the Brent-Dekker method (Brent, [1971](https://arxiv.org/html/2606.27694#bib.bib12)) to provide a continuous radius estimate $r_t=\sigma\Phi^{-1}(\underline{p_t})$. Based upon this, we consider two early-stopping frameworks.
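The inversion step can be sketched as a one-dimensional root-find on $\log W_t(p_0)=\log(1/\alpha)$ over $(0,h_t/t]$, where the log-wealth is decreasing in $p_0$. For self-containment the sketch substitutes plain bisection for the Brent-Dekker method named above and uses a single KT prior in place of the meta-learned mixture; both are simplifications, and the function names are assumptions.

```python
import math
from statistics import NormalDist


def log_beta(a, b):
    """log B(a, b) via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_wealth(h, t, p0, beta=0.5, gamma=0.5):
    """log W_t(p0) for a Beta(beta, gamma) prior; zero-count terms are skipped."""
    log_m = log_beta(h + beta, t - h + gamma) - log_beta(beta, gamma)
    log_null = 0.0
    if h > 0:
        log_null += h * math.log(p0)
    if t - h > 0:
        log_null += (t - h) * math.log1p(-p0)
    return log_m - log_null


def lower_confidence_bound(h, t, alpha, beta=0.5, gamma=0.5):
    """Smallest p0 not rejected at time t, located by bisection (stand-in for Brent-Dekker)."""
    if h == 0:
        return 0.0
    threshold = math.log(1.0 / alpha)
    lo, hi = 1e-12, h / t
    if log_wealth(h, t, lo, beta, gamma) < threshold:
        return 0.0  # nothing can be rejected yet
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if log_wealth(h, t, mid, beta, gamma) >= threshold:
            lo = mid  # p0 = mid is still rejected: the bound lies to the right
        else:
            hi = mid
    return lo  # return the rejected side so the bound errs conservatively


def radius(p_lower, sigma):
    """r_t = sigma * Phi^{-1}(p_lower); 0.0 is used here when p_lower <= 0.5."""
    return sigma * NormalDist().inv_cdf(p_lower) if p_lower > 0.5 else 0.0


lcb_small = lower_confidence_bound(95, 100, alpha=0.001)
lcb_large = lower_confidence_bound(950, 1000, alpha=0.001)
```

This returns the bound implied by the wealth at time $t$ alone. Taking the running maximum of the returned values across checks gives the monotone bound $\underline{p_t}$ defined through $\max_{\tau\leq t}$ in Theorem 1.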
##### Precision\-Based Stopping
We implement a heuristic that defines a time-varying precision threshold $\epsilon_{t}$. We terminate the sequence if the gap between the Maximum Likelihood radius ($r_{mle}$) and the certified radius ($r_{lcb}$) satisfies $(r_{mle}-r_{lcb})\leq\epsilon_{t}$, with the base threshold

$$\epsilon_{t}^{base}=\Delta\cdot\left[\epsilon_{start}-(\epsilon_{start}-\epsilon_{end})\cdot\frac{t}{N_{max}-N_{sel}}\right], \tag{11}$$

where $\Delta$ is an aggression factor (typically $1.2$), $\epsilon_{start}=0.1$, and $\epsilon_{end}=0.042$. This allows for rapid early exits on unambiguous samples while ensuring tight bounds for marginal cases. To support application-specific requirements, we can further modulate this threshold with a specialization bias $b(r_{mle})$, for instance requiring higher precision ($b<1$) in a target radius zone while allowing relaxed exits ($b>1$) elsewhere (see Section [6](https://arxiv.org/html/2606.27694#S6) for further details).
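The schedule in Equation (11) is a linear interpolation scaled by $\Delta$, which the short sketch below makes concrete. The function and argument names are illustrative assumptions; `n_max` and `n_sel` stand for $N_{max}$ and $N_{sel}$ exactly as they appear in the equation.

```python
def epsilon_base(t: int, n_max: int, n_sel: int, delta: float = 1.2,
                 eps_start: float = 0.1, eps_end: float = 0.042) -> float:
    """Base precision threshold of Equation (11) at sample count t."""
    frac = t / (n_max - n_sel)  # fraction of the available budget consumed
    return delta * (eps_start - (eps_start - eps_end) * frac)

def precision_exit(r_mle: float, r_lcb: float, eps_t: float) -> bool:
    """Terminate when the gap between the MLE and certified radii is within eps_t."""
    return (r_mle - r_lcb) <= eps_t
```

With the default values, the threshold tightens from $1.2\times 0.1=0.12$ at $t=0$ to $1.2\times 0.042=0.0504$ once $t=N_{max}-N_{sel}$.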
##### Adversarial and Plateau Exits
We prioritize rapid rejection for non-robust samples ($p<0.5$). If $W_{t}(0.5)\geq 1/\alpha$ and the empirical mean $\hat{p}_{mle}<0.5$, this implies that the Upper Confidence Bound is less than $0.5$, and the sample is rejected. Furthermore, we monitor the Radius Velocity: if $r_{lcb}$ does not improve by more than $5\%$ over four consecutive intervals, the system triggers a certification based upon the current LCB, effectively rejecting the sample if the accumulated evidence remains below the robustness threshold, and eschewing further calculations to preserve the computational budget.
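Both exits reduce to simple checks on quantities that are already tracked at each interval. The sketch below is one possible reading, with hypothetical function names: the plateau rule is interpreted as "no single step among the last four improved $r_{lcb}$ by more than 5% in relative terms", which is an assumption, since a cumulative criterion over the window would also fit the description.

```python
import math

def adversarial_exit(log_wealth_at_half: float, p_hat: float, alpha: float) -> bool:
    """Reject when W_t(0.5) >= 1/alpha and the empirical mean lies below 0.5."""
    return p_hat < 0.5 and log_wealth_at_half >= math.log(1.0 / alpha)

def plateau_reached(r_lcb_history: list, rel_tol: float = 0.05,
                    patience: int = 4) -> bool:
    """True if none of the last `patience` intervals improved r_lcb by more than rel_tol."""
    if len(r_lcb_history) < patience + 1:
        return False  # not enough check intervals observed yet
    recent = r_lcb_history[-(patience + 1):]
    for prev, curr in zip(recent, recent[1:]):
        if curr > prev * (1.0 + rel_tol):
            return False  # this interval still produced a meaningful gain
    return True
```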
##### Algorithm
The complete operational flow of our sequential radius estimation is synthesized in Algorithms [1](https://arxiv.org/html/2606.27694#alg1) and [2](https://arxiv.org/html/2606.27694#alg2) in Appendix [B](https://arxiv.org/html/2606.27694#A2). This procedure integrates bespoke prior initialization with anytime-valid monitoring, allowing for a dynamic trade-off between certification tightness and computational budget that automatically adapts to the difficulty of the input.
## 6 Results and Discussion
While raw efficiency gains are compelling, it is important to address a fundamental question: why favor an anytime-valid approach over a fixed-horizon Clopper-Pearson bound, which is inherently tighter for a given $N$? After all, there is no free lunch when it comes to anytime validity. However, as we will now argue, the advantages are two-fold: the ability to dynamically triage samples and the capacity to construct application-aware termination conditions.
We first evaluate the performance of our framework using the Meta-1-hybrid on a classic certification analysis, as shown in Figure [1](https://arxiv.org/html/2606.27694#S6.F1) (Left). This approach provides similar certification dynamics to the $N=10{,}000$ Cohen baseline (hereafter ‘Cohen-10k’) while requiring an order of magnitude fewer samples. As the average number of samples required by Meta-1-hybrid was $N=454$, we also test Cohen against the same number of samples. In this case the anytime-valid bounds induce a slight (and expected) decrease in certification performance relative to the tighter bounds of Cohen et al. ([2019](https://arxiv.org/html/2606.27694#bib.bib5)). However, we stress that *a priori* knowledge of this number of samples is impossible. When exploring the comprehensive set of results in Table [5](https://arxiv.org/html/2606.27694#A4.T5), the Meta-RS approach typically reduces the average samples required by $8$ to $15\%$ relative to KT, with consistent gains in the high-noise regimes, while also producing bounds that are up to $4\%$ tighter for ImageNet.
The true strength of anytime-validity is revealed when stopping conditions are tuned to specific downstream regions. In many applications, the importance of an input’s radius may depend upon its scale. Such dynamics are likely in safety-critical deployments, where it may be desirable to devote computational resources to refining small-radius certifications (to maximize the accuracy of these certifications), or instead to larger-radius regions, if the small-radius samples are to be subjected to manual verification irrespective of their final radii.
To model the implications of such dynamics, we implement Small-R (focusing compute upon $R<0.5$) and Large-R Specialists (focusing compute upon $R>1.0$); details of how these are implemented can be found in Appendix [B.1](https://arxiv.org/html/2606.27694#A2.SS1). As seen in Figure [1](https://arxiv.org/html/2606.27694#S6.F1) (Middle), by biasing the stopping epsilon $\epsilon_{t}$ toward targeted regions, we can perfectly recover the Cohen-10k accuracy curve in the region of interest while sacrificing precision elsewhere to save compute. These specialists effectively reduce the sample budget to a minimal rejection floor of $\sim$300 samples for inputs outside their target zones. This represents a 33$\times$ speedup over the baseline for non-robust samples, focusing computational attention exclusively on the samples that matter.
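To make the idea of a radius-biased threshold concrete, the following sketch applies a step-function bias to a base $\epsilon_t$. This is purely illustrative: the step shape and the values `b_in = 0.5` and `b_out = 2.0` are hypothetical placeholders, and the form actually used by the specialists is the one described in Appendix B.1.

```python
def specialization_bias(r_mle: float, zone_lo: float, zone_hi: float,
                        b_in: float = 0.5, b_out: float = 2.0) -> float:
    """Return b < 1 inside the target zone (tighter exits) and b > 1 outside it."""
    return b_in if zone_lo < r_mle < zone_hi else b_out

def biased_epsilon(eps_base: float, r_mle: float,
                   zone_lo: float, zone_hi: float) -> float:
    """Modulate the base stopping threshold by the specialization bias."""
    return eps_base * specialization_bias(r_mle, zone_lo, zone_hi)

# Target zones taken from the text: Small-R focuses on R < 0.5, Large-R on R > 1.0.
SMALL_R_ZONE = (float("-inf"), 0.5)
LARGE_R_ZONE = (1.0, float("inf"))
```

Under these placeholder values, a Small-R specialist would demand half the base gap for an input with $r_{mle}=0.3$ and tolerate twice the base gap for one with $r_{mle}=1.2$.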
### 6.1 Early Rejection
An alternative perspective on early-stopping is that it should only be applied to reject non-robust samples, with the total sample count otherwise being allowed to run to the full $10{,}000$ sample horizon. Under these conditions, we only allow the model to exit if its UCB falls below $0.5$ or its wealth process falls below $0.1$ (bankruptcy). Such an approach is impossible under traditional certification workflows, which must still expend their full computational budget on non-robust samples.
Table [1](https://arxiv.org/html/2606.27694#S6.T1) demonstrates that reducing the computational cost of rejections by orders of magnitude, with speedups of up to $45\times$ for rejected samples, elicits only a small decrease in certification performance. That this occurs with only a minor penalty in terms of the achievable radius, stemming from the anytime-valid tax, demonstrates the utility of this approach for allocating computational load to the samples that matter.
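The rejection-only policy can be summarized in a few lines. In this sketch the function names, the returned labels, and the representation of the run as a list of `(n_samples, p_ucb, wealth)` checkpoints are all assumptions for illustration; the thresholds $0.5$ and $0.1$ and the $10{,}000$ sample horizon are the ones stated above.

```python
def rejection_only_exit(p_ucb: float, wealth: float,
                        bankruptcy_floor: float = 0.1):
    """Return the reason for an early rejection, or None to keep sampling."""
    if p_ucb < 0.5:
        return "reject_ucb"        # the sample is provably non-robust
    if wealth < bankruptcy_floor:
        return "reject_bankrupt"   # the wealth process has collapsed
    return None

def samples_used(checkpoints, horizon: int = 10_000) -> int:
    """Samples consumed when robust inputs run to the full horizon."""
    for n_samples, p_ucb, wealth in checkpoints:
        if rejection_only_exit(p_ucb, wealth) is not None:
            return n_samples
    return horizon
```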
Table 1: Performance of the Efficiency Champion Model (Meta-1-Dynamic-Margin, see Appendix [C](https://arxiv.org/html/2606.27694#A3)) vs KT Prior baseline, when samples are allowed to run to $10{,}000$ if robust. Cert (%): Certified accuracy at the full horizon; $\delta r$: Mean % discrepancy in certified radius relative to the Cohen-10k ground truth; Exit: System-wide average sample latency (including early rejections).
### 6.2 Case-Study: Operational Triage Framework
To further consider the potential for E-values to support dynamic triage, we consider a scenario where the exact values of the certification hold little value, relative to the positioning of the certification within a region defined by its perceived risk. Under fixed-horizon methods, adapting to such a downstream scenario would be impossible, as the full sample budget must be committed to all inputs. In contrast, E-value certifications halt as soon as the confidence interval $[LCB_{t},UCB_{t}]$ is entirely contained within one of a set of risk buckets. To test this, we consider the partitioning:
- Bucket A (Non-Robust): $r=0.0$. Immediate rejection or human intervention.
- Bucket B (Low Robustness): $0.0<r\leq 1.0$. Automated scrutiny.
- Bucket C (Medium Robustness): $1.0<r\leq 1.5$. Monitored deployment.
- Bucket D (High Robustness): $r>1.5$. Autonomous deployment.
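The containment rule over this partition can be sketched as follows. The function names are illustrative, and the sketch assumes that the confidence interval has already been mapped to radius space with non-robust inputs clipped to $r=0$.

```python
def bucket_of(r: float) -> str:
    """Map a single radius to its triage bucket using the edges listed above."""
    if r <= 0.0:
        return "A"  # non-robust
    if r <= 1.0:
        return "B"  # low robustness
    if r <= 1.5:
        return "C"  # medium robustness
    return "D"      # high robustness

def triage(r_lcb: float, r_ucb: float):
    """Return the bucket containing the whole interval, or None if it straddles an edge."""
    lo, hi = bucket_of(r_lcb), bucket_of(r_ucb)
    return lo if lo == hi else None
```

An interval such as $[0.9, 1.2]$ straddles the B/C edge and therefore returns `None`, meaning that sampling would continue.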
To explore the performance under this operational triage framework, we tested Cohen et al. ([2019](https://arxiv.org/html/2606.27694#bib.bib5)) at $10{,}000$ samples, a KT Prior Method-of-Mixtures approach with a constant $\alpha=0.01$, and two application-specific approaches. The Tiered Spec Race assigns each bin a specific budget $\{\alpha_{A},\dots,\alpha_{D}\}=\{0.05,0.05,0.025,0.01\}$, representing the tolerance of a false positive in each category, and runs four independent E-processes, one for each operational bucket. A sample is assigned to bucket $i$ at timestep $T$ if its LCB and UCB are contained within the associated bin with $M_{i}(X_{1:T})\geq 1/\alpha^{\prime}_{i}$, where $\alpha^{\prime}_{i}=\alpha_{i}\cdot(0.2+0.8\cdot w_{meta,i})$.
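The effect of the adjusted budget $\alpha^{\prime}_{i}$ on the rejection threshold can be sketched as below. The function names are hypothetical, and treating $w_{meta,i}$ as a weight in $[0,1]$ is an assumption made for this example.

```python
import math

# Per-bucket error budgets from the Tiered Spec Race.
ALPHAS = {"A": 0.05, "B": 0.05, "C": 0.025, "D": 0.01}

def adjusted_alpha(alpha_i: float, w_meta_i: float) -> float:
    """alpha'_i = alpha_i * (0.2 + 0.8 * w_meta_i)."""
    return alpha_i * (0.2 + 0.8 * w_meta_i)

def bucket_resolved(bucket: str, log_wealth_i: float, w_meta_i: float,
                    interval_in_bucket: bool) -> bool:
    """Assign to `bucket` once the interval is contained and M_i >= 1 / alpha'_i."""
    log_threshold = math.log(1.0 / adjusted_alpha(ALPHAS[bucket], w_meta_i))
    return interval_in_bucket and log_wealth_i >= log_threshold
```

For bucket D, a weight of $w_{meta,D}=1$ leaves the threshold at $1/0.01=100$, while $w_{meta,D}=0$ shrinks the budget to $0.002$ and raises the threshold to $500$.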
The Relaxed Tiered Cascade follows the above process; however, if an input remains unresolved after $400$ samples, the system attempts to prove proximity to any bin boundary $p_{b}\in\{0.5,0.8413,0.9332\}$ (corresponding to $r=\{0,1,1.5\}$), through a dual-rejection process considering $H_{0}^{-}:p=p_{b}-\epsilon$ and $H_{0}^{+}:p=p_{b}+\epsilon$, using $\epsilon=0.05$. If $M(X;p_{b}-\epsilon)\geq 1/\alpha$ and $M(X;p_{b}+\epsilon)\geq 1/\alpha$, the sample is mathematically trapped within the $[p_{b}-\epsilon,p_{b}+\epsilon]$ buffer. If this is achieved, the sample is classified to a bin, but provided a secondary label denoting boundary ambiguity. This approach prevents allocating redundant samples to distinguishing infinitesimally close probabilities, such as $p=0.841$ from $p=0.842$. Table [2](https://arxiv.org/html/2606.27694#S6.T2) demonstrates that the Meta-learner’s ability to partition risk and predict sample-specific density leads to an up to 2.4$\times$ speedup over the KT Prior baseline.

Table 2: Operational Triage Performance (CIFAR-10, $\sigma=1.0$, $N=3000$). Success is defined as proving containment within the correct triage bucket.

Figure 1: Specialist Triage via Radius-Biased Stopping for CIFAR-10 at $\sigma=1.0$. (Left) Generalist accuracy parity with Cohen-10k using 22$\times$ fewer samples. (Middle) Specialist accuracy recovery in targeted zones ($R<0.5$ or $R>1.0$). (Right) Corresponding Sample Count ($N$) across radii, showing the significant compute reduction outside target zones. Metrics represent the average for samples with a radius larger than $r$, following Equation [12](https://arxiv.org/html/2606.27694#A1.E12), except the dashed lines in (Right), which represent the average counts *including rejected samples*.
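A minimal sketch of the dual-rejection check is given below, using a KT E-value in place of the meta-learned process $M$. Two choices here are assumptions of the example rather than statements about our implementation: the function names, and the additional requirement that the empirical mean lies inside the buffer. The latter is included because, with point nulls, both hypotheses would also be rejected by a sample lying far from the boundary on either side.

```python
import math
from statistics import NormalDist

def log_e_kt(p0: float, h: int, t: int) -> float:
    """Log KT E-value against the point null p = p0 after h successes in t draws."""
    log_m = (math.lgamma(h + 0.5) + math.lgamma(t - h + 0.5)
             - math.lgamma(t + 1.0) - math.log(math.pi))
    return log_m - (h * math.log(p0) + (t - h) * math.log1p(-p0))

def boundary_probability(r: float, sigma: float = 1.0) -> float:
    """p_b = Phi(r / sigma), mapping a radius boundary to probability space."""
    return NormalDist().cdf(r / sigma)

def trapped_near_boundary(h: int, t: int, p_b: float,
                          eps: float = 0.05, alpha: float = 0.01) -> bool:
    """True if both p_b - eps and p_b + eps are rejected around an in-buffer estimate."""
    p_hat = h / t
    if not (p_b - eps < p_hat < p_b + eps):
        return False  # the estimate itself must sit inside the buffer
    log_threshold = math.log(1.0 / alpha)
    return (log_e_kt(p_b - eps, h, t) >= log_threshold
            and log_e_kt(p_b + eps, h, t) >= log_threshold)
```

With $\sigma=1.0$, `boundary_probability` reproduces the boundaries $\{0.5, 0.8413, 0.9332\}$ for $r\in\{0,1,1.5\}$ to four decimal places.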
### 6.3 Ablation
##### Component\-Wise Efficiency Attribution
To understand the drivers of the 20x–45x speedups observed in our champions, we perform an attribution analysis, the results of which are further detailed in Figure [11](https://arxiv.org/html/2606.27694#A5.F11). While Precision-Based Stopping provides the bulk of the speedup for certifiable samples, we find that the combination of Bankruptcy and UCB exits is critical for minimizing the costs associated with rejecting non-robust samples, enabling best-in-class performance.
##### Statistical Universality and Zero\-Shot Transfer
A key concern for meta-learned certifiers is dataset dependency. To evaluate generalization, we employ ImageNet to test a Meta-Learner trained exclusively on CIFAR-10. Figure [10](https://arxiv.org/html/2606.27694#A5.F10) demonstrates that there is some correlation in termination latencies. This suggests the meta-learner captures universal properties of the classifiers rather than dataset-specific artifacts, enabling zero-shot deployment on new architectures.
##### The Precision\-Efficiency Pareto Frontier
Anytime-validity transforms certified robustness from a binary outcome into a tunable resource. As detailed in Figure [12](https://arxiv.org/html/2606.27694#A5.F12), sample complexity grows only logarithmically with $\epsilon$ over a limited window. This suggests that there is some ability to reliably trade compute for tighter bounds, or conversely, to relax precision for real-time throughput.
## 7 Conclusion
In this work, we demonstrated that randomized smoothing can be transformed from a static, computationally exhaustive process into a dynamic, application-aware certification pipeline optimized for strategic resource management. While computational latency has always presented significant concerns for certified robustness, our E-value meta-learner produces substantial empirical gains in this space, without compromising robustness. Achieving an $89\%$ reduction in computational load relative to standard certification workflows, or $15\%$ relative to the KT prior, validates the utility of our meta-learning approach.
However, headline computational metrics do not tell the full story of this work\. Our framework demonstrates that it is possible to dynamically balance robustness with compute on a sample\-by\-sample basis, through early rejections or downstream\-focused triage\. While anytime\-validity incurs a marginal cost regarding the realizable tightness of bounds, our results highlight the overwhelming utility of transitioning to meta\-learned E\-values\. The true value of this approach lies not just in its computational efficiency, but in proving that dynamic, application\-specific triage is finally possible, marking a definitive leap toward the real\-world deployment of certified machine learning\.
## Acknowledgments
This work was supported by the Australian Defence Science and Technology \(DST\) Group via the Advanced Strategic Capabilities Accelerator \(ASCA\) program\.
## Impact Statement
This work explores the potential for improving the computational efficiency of mechanisms for achieving Certified Robustness, with the aim of significantly enhancing the range of applicability for these systems\. Improving the robustness of models, and lowering the energy consumption of our validation processes, has clear societal implications, especially as we move to a landscape where AI is increasingly integrated into society\. However, with that said, we feel that there are two key societal concerns with pursuing robustness research, which we must note here\.
The first is that there are some applications where a lack of robustness in a model may be positive. In a world where broad-scale surveillance is increasingly normalized, it may well be the case that adversarial attacks provide privacy, creating a net public good.
The second relates to how works like this position risk and harm. A common precept within the Adversarial Machine Learning community is to assume a particular threat model, with the nature of academic comparisons often incentivizing us to then follow in the footsteps of those who came before us. However, in doing so, we inadvertently create, and crucially present, a myopic view of the risk landscape. In essence, we portray to practitioners that risk is concentrated within the areas in which we act as a community, when our investigations may be more motivated by historic alignment to academic norms and mathematical convenience. This work considers $\ell_{2}$ perturbations, which, while aligned with acoustic threat models, still represent a restriction relative to the overall threat landscape.
We emphasize the above point not just for the risks of erroneous portrayals of risk to practitioners, but also because our focus on these spaces inherently biases real attacker behavior away from these threat models. After all, if an attacker understands that an $\ell_{2}$ threat model is likely defended against, they’re naturally incentivized to consider an alternative pathway for model manipulation.
With these points made, we still believe that research into defenses, and in particular certified defenses, induces a net societal gain. Improving robustness to natural or adversarial perturbations will improve the performance of systems that are already one of the dominant access portals for AI within the community.
## References
- R. P. Brent (1971). An Algorithm with Guaranteed Convergence for Finding a Zero of a Function. The Computer Journal 14(4), pp. 422–425.
- R. Chen, J. Li, J. Yan, P. Li, and B. Sheng (2022). Input-Specific Robustness Certification for Randomized Smoothing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36, pp. 6295–6303.
- P. Chiang, R. Ni, A. Abdelkader, C. Zhu, C. Studer, and T. Goldstein (2020). Certified Defenses for Adversarial Patches. arXiv preprint arXiv:2003.06693.
- C. J. Clopper and E. S. Pearson (1934). The Use of Confidence or Fiducial Limits Illustrated in the case of the Binomial. Biometrika 26(4), pp. 404–413.
- J. Cohen, E. Rosenfeld, and Z. Kolter (2019). Certified Adversarial Robustness via Randomized Smoothing. In International Conference on Machine Learning, pp. 1310–1320.
- A. C. Cullen, P. Montague, S. Liu, S. M. Erfani, and B. I. P. Rubinstein (2022). Double Bubble, Toil and Trouble: Enhancing Certified Robustness through Transitivity. Advances in Neural Information Processing Systems 35, pp. 19099–19112.
- A. C. Cullen, P. Montague, S. M. Erfani, and B. I. P. Rubinstein (2025). Position: Certified Robustness Does Not (Yet) Imply Model Security. In Proceedings of the 42nd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 267, pp. 81185–81198.
- J. L. Doob (1940). Regularity Properties of Certain Families of Chance Variables. Transactions of the American Mathematical Society 47(3), pp. 455–486.
- C. Dwork, F. McSherry, K. Nissim, and A. Smith (2006). Calibrating Noise to Sensitivity in Private Data Analysis. In Theory of Cryptography Conference, TCC, pp. 265–284.
- S. Gershman and N. Goodman (2014). Amortized Inference in Probabilistic Reasoning. In Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 36.
- I. J. Goodfellow, J. Shlens, and C. Szegedy (2014). Explaining and Harnessing Adversarial Examples. arXiv preprint arXiv:1412.6572.
- P. Grünwald, R. de Heide, and W. M. Koolen (2020). Safe Testing. In 2020 Information Theory and Applications workshop (ITA), pp. 1–54.
- M. Hein and M. Andriushchenko (2017). Formal Guarantees on the Robustness of a Classifier Against Adversarial Manipulation. In Advances in Neural Information Processing Systems, NeurIPS, Vol. 30.
- M. Z. Horváth, M. N. Mueller, M. Fischer, and M. Vechev (2022). Boosting randomized smoothing with variance reduced classifiers. In International Conference on Learning Representations.
- S. R. Howard, A. Ramdas, J. McAuliffe, and J. Sekhon (2021). Time-Uniform, Nonparametric, Nonasymptotic Confidence Sequences. The Annals of Statistics 49(2), pp. 1055–1080.
- R. Johari, P. Koomen, L. Pekelis, and D. Walsh (2017). Peeking at A/B Tests: Why it matters, and what to do about it. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1517–1525.
- J. L. Kelly (1956). A New Interpretation of Information Rate. The Bell System Technical Journal 35(4), pp. 917–926.
- R. Krichevsky and V. Trofimov (1981). The Performance of Universal Encoding. IEEE Transactions on Information Theory 27(2), pp. 199–207.
- A. Krizhevsky, G. Hinton, et al. (2009). Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto.
- T. L. Lai (1976). On Confidence Sequences. The Annals of Statistics, pp. 265–280.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998). Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE 86(11), pp. 2278–2324.
- M. Lecuyer, V. Atlidakis, R. Geambasu, D. Hsu, and S. Jana (2019). Certified Robustness to Adversarial Examples with Differential Privacy. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 656–672.
- K. Leino, Z. Wang, and M. Fredrikson (2021). Globally-Robust Neural Networks. In International Conference on Machine Learning, pp. 6212–6222.
- A. Levine and S. Feizi (2020). (De)Randomized Smoothing for Certifiable Defense against Patch Attacks. Advances in Neural Information Processing Systems 33, pp. 6465–6475.
- B. Li, C. Chen, W. Wang, and L. Carin (2019). Certified Adversarial Robustness with Additive Noise. In Advances in Neural Information Processing Systems, Vol. 32, pp. 9459–9469.
- Z. Lyu, M. Guo, T. Wu, G. Xu, K. Zhang, and D. Lin (2021). Towards Evaluating and Training Verifiably Robust Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4308–4317.
- A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018). Towards Deep Learning Models Resistant to Adversarial Attacks. In International Conference on Learning Representations.
- M. Mirman, T. Gehr, and M. Vechev (2018). Differentiable Abstract Interpretation for Provably Robust Neural Networks. In International Conference on Machine Learning, pp. 3578–3586.
- J. Mohapatra, T. Weng, P. Chen, S. Liu, and L. Daniel (2020). Towards Verifying Robustness of Neural Networks against a family of Semantic Perturbations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 244–252.
- A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32, pp. 8024–8035.
- A. Ramdas, P. Grünwald, V. Vovk, and G. Shafer (2023). Game-Theoretic Statistics and Safe Anytime-Valid Inference. Statistical Science 38(4), pp. 576–601.
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115(3), pp. 211–252.
- H. Salman, J. Li, I. Razenshteyn, P. Zhang, H. Zhang, S. Bubeck, and G. Yang (2019a). Provably Robust Deep Learning via Adversarially Trained Smoothed Classifiers. In Advances in Neural Information Processing Systems, Vol. 32, pp. 11292–11303.
- H. Salman, G. Yang, H. Zhang, C. Hsieh, and P. Zhang (2019b). A Convex Relaxation Barrier to Tight Robustness Verification of Neural Networks. In Advances in Neural Information Processing Systems.
- G. Shafer and V. Vovk (2019). Game-Theoretic Foundations for Probability and Finance. John Wiley & Sons.
- G. Shafer (2019). The Language of Betting as a Strategy for Statistical and Scientific Communication. arXiv preprint arXiv:1903.06991.
- Z. Shi, Q. Jin, H. Zhang, Z. Kolter, S. Jana, and C. Hsieh (2023). Formal Verification for Neural Networks with General Nonlinearities via Branch-And-Bound. In 2nd Workshop on Formal Verification of Machine Learning (WFVML 2023).
- G. Singh, T. Gehr, M. Püschel, and M. Vechev (2019). An Abstract Domain for Certifying Neural Networks. Proceedings of the ACM on Programming Languages 3(POPL), pp. 1–30.
- C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013). Intriguing Properties of Neural Networks. arXiv preprint arXiv:1312.6199.
- Y. Tsuzuku, I. Sato, and M. Sugiyama (2018). Lipschitz-Margin Training: Scalable Certification of Perturbation Invariance for Deep Neural Networks. In Advances in Neural Information Processing Systems, Vol. 31.
- J. Ville (1939). Etude Critique de la Notion de Collectif. Vol. 3, Gauthier-Villars Paris.
- V. Voráček (2024). Treatment of Statistical Estimation Problems in Randomized Smoothing for Adversarial Robustness. Advances in Neural Information Processing Systems 37, pp. 133464–133486.
- V. Vovk and R. Wang (2021). E-values: Calibration, Combination and Applications. The Annals of Statistics 49(3), pp. 1736–1754.
- S. Wang, H. Zhang, K. Xu, X. Lin, S. Jana, C. Hsieh, and J. Z. Kolter (2021). Beta-CROWN: Efficient Bound Propagation with Per-Neuron Split Constraints for Neural Network Robustness Verification. Advances in Neural Information Processing Systems 34.
- I. Waudby-Smith and A. Ramdas (2024). Estimating Means of Bounded Random Variables by Betting. Journal of the Royal Statistical Society: Series B-Statistical Methodology 86(1), pp. 1–27.
- L. Weng, H. Zhang, H. Chen, Z. Song, C. Hsieh, L. Daniel, D. Boning, and I. Dhillon (2018). Towards Fast Computation of Certified Robustness for ReLU Networks. In International Conference on Machine Learning, pp. 5276–5285.
- Q. Xie and A. R. Barron (2000). Asymptotic Minimax Regret for Data Compression, Gambling, and Prediction. IEEE Transactions on Information Theory 46(2), pp. 431–445.
- K. Xu, Z. Shi, H. Zhang, Y. Wang, K. Chang, M. Huang, B. Kailkhura, X. Lin, and C. Hsieh (2020). Automatic Perturbation Analysis for Scalable Certified Robustness and Beyond. Advances in Neural Information Processing Systems 33.
- R. Zhai, C. Dan, D. He, H. Zhang, B. Gong, P. Ravikumar, C. Hsieh, and L. Wang (2020). Macer: Attack-Free and Scalable Robust Training via Maximizing Certified Radius. In International Conference on Learning Representations.
- H. Zhang, T. Weng, P. Chen, C. Hsieh, and L. Daniel (2018). Efficient Neural Network Robustness Certification with General Activation Functions. In Neural Information Processing Systems (NeurIPS).
## Appendix A Expanded Related Work
### A.1 Certification Mechanisms
First introduced by Lecuyer et al. ([2019](https://arxiv.org/html/2606.27694#bib.bib6)), randomized smoothing based certified robustness builds upon Monte Carlo estimators of the expectation of a class prediction. While the original formulation was constructed in terms of differential privacy (Dwork et al., [2006](https://arxiv.org/html/2606.27694#bib.bib36)), recent approaches have improved performance through Rényi divergence (Li et al., [2019](https://arxiv.org/html/2606.27694#bib.bib35)) and parameterising worst-case behaviors (Cohen et al., [2019](https://arxiv.org/html/2606.27694#bib.bib5); Salman et al., [2019a](https://arxiv.org/html/2606.27694#bib.bib41); Cullen et al., [2022](https://arxiv.org/html/2606.27694#bib.bib7)).
Significant research has focused on improving the underlying base classifier $f$ to be more amenable to randomized smoothing. Salman et al. ([2019a](https://arxiv.org/html/2606.27694#bib.bib41)) employs adversarial training to harden the model against perturbed inputs, while MACER (Zhai et al., [2020](https://arxiv.org/html/2606.27694#bib.bib43)) directly optimizes a robustness loss that encourages large classification margins. Our work is orthogonal to these training-time interventions; we focus on the *inference-time* efficiency of the certification process itself, allowing for a 20x–45x reduction in sample complexity for *any* base model, including those that have undergone the training-time interventions of MACER and Salman et al. ([2019a](https://arxiv.org/html/2606.27694#bib.bib41)).
In all works, the primary metric for evaluating Randomized Smoothing is the Certified Accuracy at radius $r$, defined as the fraction of the test set that is both correctly classified by the smoothed model and has a certified radius $R \geq r$:

$$\text{Acc}(r) = \mathbb{P}_{(x,y)\sim\mathcal{D}}\left[g(x) = y \text{ and } R(x) \geq r\right] \tag{12}$$

where $g$ is the smoothed classifier.
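As a minimal sketch of how Eq. (12) can be estimated on a finite test set, the snippet below counts the examples that are both correctly classified and certified at radius at least $r$. The function name, the use of `None` to mark an abstention, and the toy values are assumptions made for illustration; they are not taken from the paper.

```python
def certified_accuracy(predictions, labels, radii, r):
    """Empirical version of Eq. (12): fraction of examples that are
    correctly classified by the smoothed model AND certified at radius >= r.

    An abstention is represented here by a prediction of None (an
    illustrative convention), which never matches a label.
    """
    if not labels:
        raise ValueError("the test set must not be empty")
    hits = sum(
        1
        for pred, label, radius in zip(predictions, labels, radii)
        if pred is not None and pred == label and radius >= r
    )
    return hits / len(labels)


# Toy values, purely illustrative.
preds = [0, 1, 2, None]
labels = [0, 1, 1, 3]
radii = [0.8, 0.3, 0.9, 0.0]
print(certified_accuracy(preds, labels, radii, r=0.5))
```

Sweeping `r` over a grid of radii gives the usual certified-accuracy curve.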
#### A\.1\.1Interval Bound Propagation
In the absence of probabilistic methods, conservative certificates upon the impact of norm-bounded perturbations can be constructed by way of either interval bound propagation (IBP), which propagates interval bounds through the model, or convex relaxation, which utilizes linear relaxation to construct bounding output polytopes over input bounded perturbations. In contrast to randomized smoothing, which constructs isotropic measures of $\ell_p$-robustness, interval bound propagation and its associated techniques attempt to propagate the potential influence of all possible perturbations through the model, producing an anisotropic measure of the potential response of a model to any potential perturbation (Salman et al., [2019b](https://arxiv.org/html/2606.27694#bib.bib20); Mirman et al., [2018](https://arxiv.org/html/2606.27694#bib.bib21); Weng et al., [2018](https://arxiv.org/html/2606.27694#bib.bib22); Zhang et al., [2018](https://arxiv.org/html/2606.27694#bib.bib24); Singh et al., [2019](https://arxiv.org/html/2606.27694#bib.bib25); Mohapatra et al., [2020](https://arxiv.org/html/2606.27694#bib.bib26)). Of these, IBP is more general, while convex relaxation typically provides tighter bounds (Lyu et al., [2021](https://arxiv.org/html/2606.27694#bib.bib27)).
Utilizing these techniques requires introducing an augmented loss function during training to promote tight output bounds (Xu et al., [2020](https://arxiv.org/html/2606.27694#bib.bib28)), creating significant architectural friction relative to RS style certifications, which can be applied to any model architecture. Bound propagation schemes have also, until very recently, been heavily limited in the types of network architectures that they can successfully construct bounds through, with only recent works demonstrating applicability to nonlinear activation functions beyond ReLU (Shi et al., [2023](https://arxiv.org/html/2606.27694#bib.bib37)). Moreover, both exhibit a time and memory complexity that makes them infeasible for complex model architectures or high-dimensional data (Wang et al., [2021](https://arxiv.org/html/2606.27694#bib.bib29); Chiang et al., [2020](https://arxiv.org/html/2606.27694#bib.bib30); Levine and Feizi, [2020](https://arxiv.org/html/2606.27694#bib.bib31)).
#### A\.1\.2Global Lipschitz
Global Lipschitz approaches take an alternative route to constructing certifications, which they distinguish through the framing of local and global robustness. The guarantees provided by prior works, which can take the form

$$\|\mathbf{x} - \mathbf{x}'\|_p \leq \epsilon \implies F(\mathbf{x}) = F(\mathbf{x}') \tag{13}$$

are considered to be local properties that relate $\mathbf{x}$ and $\epsilon$. Lipschitz based techniques instead attempt to construct their certifications in terms of *global* robustness, where

$$\forall \mathbf{x}_1, \mathbf{x}_2 : \|\mathbf{x}_1 - \mathbf{x}_2\|_p \leq \epsilon \implies F(\mathbf{x}_1) \overset{\perp}{=} F(\mathbf{x}_2). \tag{14}$$

Here $\perp$ is the marker for an *abstained* class prediction, and $c_1 \overset{\perp}{=} c_2$ denotes that either $c_1 = \perp$, $c_2 = \perp$, or $c_1 = c_2$. In essence such a form of certification involves constructing a model whose decision boundary is not infinitesimally thin, but is instead a margin between the regions associated with each class, where $\epsilon$ then becomes the shortest $\ell_p$ distance to span the boundary. Several attempts have been made to use Lipschitz bounds during training to promote robustness. These include constructing provable lower bounds on the norm of the input manipulation required to change classifier decisions based upon the network architecture (Hein and Andriushchenko, [2017](https://arxiv.org/html/2606.27694#bib.bib38)); modifying the loss associated with logits different than the ground-truth class (Tsuzuku et al., [2018](https://arxiv.org/html/2606.27694#bib.bib39)); and GloRoNets, which add an additional logit corresponding to the predicted class at a point (Leino et al., [2021](https://arxiv.org/html/2606.27694#bib.bib40)). While these techniques can be an order of magnitude faster than randomized smoothing, they are both less flexible (in terms of the architectures they support) and often produce smaller certifications than randomized smoothing (Leino et al., [2021](https://arxiv.org/html/2606.27694#bib.bib40)).
### A\.2Anytime Valid Mechanisms
As an alternative to betting-based martingales, anytime-valid confidence sequences can be constructed using the Law of the Iterated Logarithm (LIL). Of particular note is the *stitching* construction from Howard et al. ([2021](https://arxiv.org/html/2606.27694#bib.bib34)), which provides a boundary $u_t$ such that the probability of the empirical mean ever crossing $u_t$ is bounded by $\alpha$. While LIL-stitching is robust to prior mis-specification, it is generally less efficient than Mixture E-values when a reasonably accurate prior (like our Meta-Learner) is available, as it lacks the ability to aggressively allocate its convergence on specific regions of the hypothesis space. In our testing, LIL based approaches provide a small improvement on traditional Cohen et al. ([2019](https://arxiv.org/html/2606.27694#bib.bib5)) style certified robustness; however, they are significantly slower to converge than our Meta-approach.
## Appendix BAlgorithms
To provide further details to the processes outlined in Sections[4](https://arxiv.org/html/2606.27694#S4)and[5](https://arxiv.org/html/2606.27694#S5), we present the full expanded algorithm for the training of the Meta\-Learner using a Kelly\-Optimal inspired process in Algorithm[1](https://arxiv.org/html/2606.27694#alg1), and the overall certification approach in Algorithm[2](https://arxiv.org/html/2606.27694#alg2)\.
**Algorithm 1** Meta-Learner Training (Kelly-Optimal Betting)

1. **Input:** Training dataset $\mathcal{D}_{train}$, noise level $\sigma$, base classifier $f$, augmentation factor $M$, epochs $E$.
2. **Phase I: Data Collection (Offline)**
3. $\mathcal{S} \leftarrow \emptyset$
4. **for** each image $x^{(i)} \in \mathcal{D}_{train}$ **do**
5. Extract embedding $\phi^{(i)}$ and clean-image softmax $\mathbf{p}^{(i)}$.
6. Draw $N_{max}$ samples to obtain bitstream successes $h_N^{(i)}$ and rate $p_{true}^{(i)}$.
7. $\mathcal{S} \leftarrow \mathcal{S} \cup \{(\phi^{(i)}, \mathbf{p}^{(i)}, h_N^{(i)}, p_{true}^{(i)})\}$ $\triangleright$ Store ground-truth tuples
8. **end for**
9. **Phase II: Kelly Optimization (Online)**
10. Initialize Meta-Learner $\mathcal{M}_\theta$.
11. **for** epoch $e = 1$ to $E$ **do**
12. Sample an index set $\mathcal{I} \subset \{1, \dots, |\mathcal{S}|\}$ of size $n$ uniformly at random.
13. **for** $i \in \mathcal{I}$ **do**
14. $\hat{h}_{sel}^{(i)} \sim \text{Binomial}(N_{sel}, p_{true}^{(i)})$ $\triangleright$ Sample a synthetic Phase I glimpse
15. $\hat{p}_{sel}^{(i)} \leftarrow \hat{h}_{sel}^{(i)} / N_{sel}$.
16. $(\mathbf{w}^{(i)}, \mathbf{z}^{(i)}) \leftarrow \mathcal{M}_\theta(\phi^{(i)}, \mathbf{p}^{(i)}, \hat{p}_{sel}^{(i)})$. $\triangleright$ Predict weights and raw parameters
17. $\boldsymbol{\beta}^{(i)}, \boldsymbol{\gamma}^{(i)} \leftarrow \text{Clamp}(\text{Softplus}(\mathbf{z}^{(i)}) + 0.5, [0.1, 500.0])$. $\triangleright$ Survival bias & Stability
18. **end for**
19. // Minimize negative expected log-wealth (Kelly Loss)
20. $\lambda_e \leftarrow \max\left(1.0,\ 10.0 \cdot \left(1 - \frac{e}{0.8E}\right)\right)$ $\triangleright$ Decay Penalty
21. $\mathcal{L}(\theta) = -\frac{1}{n} \sum_i \left[ \log \sum_k w_k \cdot \frac{P(h_i \mid N, \beta_k, \gamma_k)}{Z(a_k, b_k, \beta_k, \gamma_k)} - \lambda_t \cdot \text{dist}(\hat{p}_{mle}, \mathcal{R}_k) \right]$
22. $\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}(\theta)$.
23. **end for**
24. **Return:** Trained Meta-Learner $\mathcal{M}_\theta$.
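The scalar bookkeeping in Phase II of Algorithm 1 (steps 14, 15, 17 and 20) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the meta-learner forward pass and the Kelly loss are omitted, and the paper's own implementation presumably operates on batched tensors inside an automatic-differentiation framework.

```python
import math
import random


def softplus(z):
    """Numerically stable softplus, log(1 + exp(z))."""
    return math.log1p(math.exp(-abs(z))) + max(z, 0.0)


def beta_parameter(z, shift=0.5, lo=0.1, hi=500.0):
    """Step 17: Clamp(Softplus(z) + 0.5, [0.1, 500.0]) for one raw output z."""
    return min(max(softplus(z) + shift, lo), hi)


def penalty_weight(epoch, total_epochs):
    """Step 20: lambda_e = max(1.0, 10.0 * (1 - e / (0.8 * E)))."""
    return max(1.0, 10.0 * (1.0 - epoch / (0.8 * total_epochs)))


def synthetic_glimpse(p_true, n_sel, rng):
    """Steps 14-15: draw Binomial(n_sel, p_true) successes, return the rate."""
    successes = sum(1 for _ in range(n_sel) if rng.random() < p_true)
    return successes / n_sel


rng = random.Random(0)  # seed chosen only to make the sketch repeatable
print(beta_parameter(0.0), penalty_weight(1, 60), synthetic_glimpse(0.9, 100, rng))
```

Because the softplus output is non-negative, the shifted value is never below 0.5, so in this sketch only the upper clamp at 500.0 can become active.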
**Algorithm 2** Sequential Radius Estimation with Dynamic Fast-Exits

1. **Input:** Test image $x$, noise level $\sigma$, target classifier $f$, Meta-Learner $\mathcal{M}$.
2. **Parameters:** Significance $\alpha$, batch size $B$, max samples $N_{max}$, failure threshold $\epsilon_{fail}$, parameters $\epsilon_{start}, \epsilon_{end}$.
3. **Phase I: Holdout and Prior Initialization**
4. Extract $\phi(x)$ and clean-image softmax $\mathbf{p}(x)$ from $f$.
5. Draw $N_{sel}$ samples to identify target class $c_A$ and empirical mean $\hat{p}_{sel}$.
6. $\mathcal{M}(\phi(x), \mathbf{p}(x), \hat{p}_{sel}) \to \{w_k, \beta_k, \gamma_k, \mathcal{R}_k\}_{k=1}^{K}$.
7. **Phase II: Sequential Certification**
8. $W_0(p_0) \leftarrow 1$ for all $p_0 \in [0, 1]$; $r_{lcb} \leftarrow 0$; $\mathcal{H} \leftarrow \emptyset$ $\triangleright$ Initialize wealth, radius, and radius history
9. **for** $t = 1, 2, \ldots, N_{max} - N_{sel}$ **do**
10. Compute $W_t(p_0)$ using Mixture of Truncated Betas.
11. **if** $t \bmod B = 0$ **or** $t = N_{max} - N_{sel}$ **then**
12. $\underline{p_t} \leftarrow \text{BrentSolver}(W_t(p_0) = 1/\alpha)$; $r_{lcb} \leftarrow \sigma \Phi^{-1}(\max(\underline{p_t}, 0.5))$.
13. $r_{mle} \leftarrow \sigma \Phi^{-1}(\max(\hat{p}_{mle}, 0.5))$; Append $r_{lcb}$ to $\mathcal{H}$.
14. **if** $W_t(0.5) \geq 1/\alpha$ **and** $\hat{p}_{mle} < 0.5$ **then Return** $r = 0$ $\triangleright$ UCB Exit
15. **else if** $W_t(0.5) \leq \epsilon_{fail}$ **then Return** $r = 0$ $\triangleright$ Bankruptcy Exit
16. **else**
17. $\epsilon_t \leftarrow \Delta \cdot \left[\epsilon_{start} - (\epsilon_{start} - \epsilon_{end}) \cdot \frac{t}{N_{max} - N_{sel}}\right] \cdot b(r_{mle})$.
18. Plateau $\leftarrow |\mathcal{H}| \geq 4$ **and** $(\mathcal{H}_{last} - \mathcal{H}_{last-3}) < 0.05 \cdot \mathcal{H}_{last-3}$.
19. **if** $(r_{mle} - r_{lcb}) \leq \epsilon_t$ **or** Plateau **then Return** $r = r_{lcb}$
20. **end if**
21. **end if**
22. **end if**
23. **end for**
24. **Return** $r = r_{lcb}$.
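To make steps 12, 13 and 18 of Algorithm 2 concrete, the following sketch computes an anytime-valid lower bound on the success probability, converts it to a radius, and evaluates the plateau test. Two substitutions are made and should be read as assumptions rather than the paper's method: a single untruncated Beta(1/2, 1/2) prior (the KT prior used as a baseline in the paper) replaces the meta-learned mixture of truncated Betas, and plain bisection replaces the Brent solver. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def log_beta(a, b):
    """Log of the Beta function via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_wealth(p0, successes, trials, a=0.5, b=0.5):
    """Log wealth of a Beta(a, b) mixture bet against the point null p = p0."""
    failures = trials - successes
    log_mixture = log_beta(a + successes, b + failures) - log_beta(a, b)
    log_null = successes * math.log(p0) + failures * math.log1p(-p0)
    return log_mixture - log_null


def lower_bound(successes, trials, alpha=0.001, tol=1e-9):
    """Lower confidence bound: the point below the MLE where W_t(p0) = 1/alpha.

    Wealth decreases monotonically in p0 on (0, MLE), so bisection suffices.
    """
    if successes == 0:
        return 0.0
    threshold = math.log(1.0 / alpha)
    lo, hi = 1e-12, min(successes / trials, 1.0 - 1e-12)
    if log_wealth(lo, successes, trials) < threshold:
        return 0.0  # too little evidence to reject any positive p0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if log_wealth(mid, successes, trials) >= threshold:
            lo = mid  # mid is rejected, so the bound lies above it
        else:
            hi = mid
    return lo  # return the rejected side to stay conservative


def certified_radius(p_lower, sigma):
    """Steps 12-13: r = sigma * Phi^{-1}(max(p, 0.5))."""
    return sigma * NormalDist().inv_cdf(max(p_lower, 0.5))


def plateau(history, window=4, rel_tol=0.05):
    """Step 18: the last radius improved by < 5% over the one three checks back."""
    if len(history) < window:
        return False
    old, new = history[-window], history[-1]
    return (new - old) < rel_tol * old


p_low = lower_bound(successes=950, trials=1000)
print(p_low, certified_radius(p_low, sigma=0.25))
```

The bound is valid at every checking time simultaneously, which is what allows the exit tests in steps 14 to 19 to be evaluated repeatedly without a multiple-testing correction.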
##### Datasets and Backbone Architecture
Our experiments consider attacks against MNIST (LeCun et al., [1998](https://arxiv.org/html/2606.27694#bib.bib49)) (GNU v3.0 license), CIFAR-10 (Krizhevsky et al., [2009](https://arxiv.org/html/2606.27694#bib.bib50)) (MIT license), and ImageNet (Russakovsky et al., [2015](https://arxiv.org/html/2606.27694#bib.bib53)) (which uses a custom, non-commercial license). In the case of models defended by randomised smoothing, each model was trained in PyTorch (Paszke et al., [2019](https://arxiv.org/html/2606.27694#bib.bib52)).
We trained three backbone models as the base classifiers, each of which was trained to be robust under Gaussian noise through augmentations drawn from $\mathcal{N}(0, \sigma^2 I)$. For MNIST and CIFAR-10, we employed a ResNet-18 architecture, which was modified for a single input channel for MNIST. In the case of ImageNet, we considered a ResNet-50 architecture. Features employed by the Meta-Learner are extracted from the final avgpool layer of their respective backbones (512-dimensional for ResNet-18, or 2048-dimensional with a linear projection to 512-dimensional for ResNet-50).
##### Meta\-Learner Architecture
The Meta model $\mathcal{M}_\theta$ builds upon a LayerNorm stage followed by a linear projection to 512 dimensions. This is then followed by a sequential MLP mapping, containing two linear layers (512 units) with LayerNorm and ReLU activations, before branching into independent linear heads for mixture weights ($\pi$), Beta parameters ($\beta, \gamma$), and optionally, dynamic support boundaries ($a, b$).
##### Optimization and Training
As discussed in the main body of the text, the meta-learner is trained to maximize the Expected Log-Wealth (Kelly Criterion) over full bitstream sequences. Optimization employed the Adam optimizer for $60$ epochs with a fixed learning rate of $10^{-3}$ and a weight decay of $10^{-4}$. To support training, we employed randomized deterministic data splits, with $5{,}000$ samples employed for training with MNIST and CIFAR-10, and $10{,}000$ samples for ImageNet. In all cases, the number of evaluation samples was set to $10\%$ of the training samples. The base models for MNIST and CIFAR-10 were trained on a 12GB RTX 2080 Ti GPU, with a training time of less than $1$ hour per $\sigma$. For speed, ImageNet was trained on an 80GB H100 GPU, with a total training time of $2$ hours per $\sigma$. Meta-learner training was performed on the 2080 Ti, taking about $2$ minutes per Meta-learner for ImageNet, and less than $90$ seconds for MNIST and CIFAR-10.
##### Differentiable Truncation via Series Expansion
Training the meta-learner with a Kelly-optimal loss on truncated Beta distributions requires a differentiable implementation of the regularized incomplete Beta function, $I_x(a, b)$. As implementations of $I_x$ are typically not compatible with automatic differentiation, we employ a $4^{th}$-order Taylor series expansion to approximate the integral mass in the cases where derivatives are required. For a given support boundary $x \in [a_k, b_k]$, the log-mass is approximated as

$$\log I_x(a, b) \approx a \log x + b \log(1 - x) - \log a - \log B(a, b) + \log\left(1 + \sum_{n=1}^{4} T_n\right), \tag{15}$$

where $T_n$ corresponds to the Pochhammer ratios

$$T_1 = \frac{a + b}{a + 1} x, \quad T_n = T_{n-1} \cdot \frac{a + b + n - 1}{a + n} x. \tag{16}$$

To ensure numerical stability across the entire $[0, 1]$ probability range, we leverage the symmetry property $I_x(a, b) = 1 - I_{1-x}(b, a)$. When the truncation boundary is $x > 0.5$, the model computes the mass in the complement space using the same expansion, preventing the vanishing gradient issues associated with the high-probability tails of the Beta distribution.
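The expansion in Eqs. (15) and (16), together with the symmetry switch at $x > 0.5$, can be checked numerically with a short standard-library sketch. The function names are hypothetical, and the code uses plain floats rather than the differentiable tensors the training loop would need; it is meant only to verify the arithmetic of the series.

```python
import math


def log_beta_fn(a, b):
    """log B(a, b) via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_inc_beta_series(x, a, b, order=4):
    """Truncated series for log I_x(a, b), following Eqs. (15)-(16)."""
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    term, total = 1.0, 1.0
    for n in range(1, order + 1):
        # T_n = T_{n-1} * (a + b + n - 1) / (a + n) * x, with T_0 = 1
        term *= (a + b + n - 1) / (a + n) * x
        total += term
    return (
        a * math.log(x)
        + b * math.log1p(-x)
        - math.log(a)
        - log_beta_fn(a, b)
        + math.log(total)
    )


def inc_beta(x, a, b):
    """I_x(a, b), evaluated in the complement space when x > 0.5."""
    if x > 0.5:
        return 1.0 - math.exp(log_inc_beta_series(1.0 - x, b, a))
    return math.exp(log_inc_beta_series(x, a, b))


# Closed forms for comparison: I_x(1, 1) = x and I_x(2, 1) = x**2.
print(inc_beta(0.1, 1.0, 1.0), inc_beta(0.9, 1.0, 1.0), inc_beta(0.2, 2.0, 1.0))
```

With only four terms the truncation error is largest near $x = 0.5$, which is consistent with the choice to switch to the complement at that point.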
##### Certification
For certifications, we set the error probability $\alpha$ to $0.001$, corresponding to a $99.9\%$ confidence interval. Unless otherwise stated, the selection glimpse took place over $N_{sel} = 100$, and similarly the checking interval $B$ was also set to $100$. All certifications were capped at $10{,}000$ max samples.
### B\.1Radius\-Specialized Stopping Dynamics
To articulate the influence of focusing computational allocation across varying difficulty levels, we define two specialists, Small-R and Large-R, using a biased stopping criterion. These conditions incorporate an augmented sequential exit condition defined by the precision threshold $\epsilon_t$, whereby

$$(R_{mle} - R_{lcb}) \leq \epsilon_t \cdot b(R_{mle}). \tag{17}$$

Here $R_{mle}$ is the current estimated radius, $R_{lcb}$ is the anytime-valid lower bound, and $b(R_{mle})$ is a radius-dependent bias factor.
For the Small-R specialist, we optimize for $R < 0.5$. This specialist employs an aggressive bias $b = 0.2$ when $R_{mle} < 0.5$, facilitating rapid exit for low-margin samples. The Large-R specialist is optimized for $R > 1.0$ by way of a bias $b = 0.6$ when $R_{mle} > 1.0$.
For samples falling outside the radius regime, the bias $b$ is increased to $2.0$, to force the specialist to either exit immediately (if the plateau condition is met), or to continue sampling until high precision is achieved. In doing so, the specialists effectively de-prioritize non-target compute, allowing us to demonstrate that specialization is a tool that can be employed to leverage the anytime-valid properties of E-values to dynamically adjust sample budgets without sacrificing statistical integrity.
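The bias values quoted for the two specialists can be written out as a small lookup, with the exit test applied exactly as stated in Eq. (17). This is a sketch only: the function names and the string labels `"small"` and `"large"` are hypothetical, and the example radii and threshold are illustrative.

```python
def bias_factor(r_mle, specialist):
    """Radius-dependent bias b(R_mle) for the two specialists described above."""
    if specialist == "small":
        # Small-R targets R < 0.5; outside that regime the bias is 2.0.
        return 0.2 if r_mle < 0.5 else 2.0
    if specialist == "large":
        # Large-R targets R > 1.0; outside that regime the bias is 2.0.
        return 0.6 if r_mle > 1.0 else 2.0
    raise ValueError("specialist must be 'small' or 'large'")


def precision_exit(r_mle, r_lcb, eps_t, specialist):
    """Eq. (17): exit when (R_mle - R_lcb) <= eps_t * b(R_mle)."""
    return (r_mle - r_lcb) <= eps_t * bias_factor(r_mle, specialist)


# Illustrative values: the same gap is judged against different thresholds.
print(precision_exit(0.30, 0.25, eps_t=0.1, specialist="small"))
print(precision_exit(0.80, 0.70, eps_t=0.1, specialist="small"))
```

Because the exit rule only decides when to stop sampling, and the returned radius is always the anytime-valid lower bound, changing `bias_factor` alters the sample budget without touching the statistical guarantee.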
### B\.2Computational Costs
A critical consideration for sequential certification frameworks is the wall-clock overhead introduced by the decision-making logic (meta-learning and interval checks) relative to the time saved by reducing base model forward passes. In this section, we provide a holistic breakdown of the computational costs associated with the Meta-RS framework. At inference time, the Meta-RS model involves Feature Extraction of the penultimate layer embedding from the backbone model (conducted at $N = 100$); and Meta-Learner Inference as a single forward pass through the 4-layer Residual MLP to predict the image-specific prior. After the Meta-RS model has produced the prediction of the prior, Brent-Dekker Root Finding is performed every $B$ steps, which is required to search for the LCB.
We benchmarked these components on an NVIDIA 2080Ti GPU using ResNet\-18 \(CIFAR\-10\) and ResNet\-50 \(ImageNet\) backbones\. Table[3](https://arxiv.org/html/2606.27694#A2.T3)summarizes the results\.
Table 3: Wall-Clock Timing Breakdown. We compare the marginal overhead of the Meta-RS components against the cost of base model forward passes on a 2080Ti GPU.

The total computational overhead for a full $N = 10{,}000$ ImageNet certification is 10.4 s (comprising 1 batch for class prediction and 100 batches). In contrast, if the Meta learner is able to produce a $10$-fold reduction in the number of samples, then the total cost is 1.15 s, an $89\%$ reduction in the computational cost for ImageNet. This confirms that the computational cost of the anytime-valid logic is negligible, accounting for less than $2\%$ of the total budget, allowing the sample complexity gains to translate directly into massive operational speedups in highly optimized inference systems.
## Appendix CChampion Selection Processes
To evaluate the operational versatility of our framework, we identify three distinct*Champion*archetypes\. Each represents a specific optimization of our Meta\-learner architecture designed to address different real\-world deployment constraints\. In short, the Global Champion demonstrates generalisability of the meta\-learner, and provides a universal speedup for standard certification; while the Specialists demonstrate the framework’s ability to maximize performance\.
Of these, the Global Champion was set as Meta-1-Margin-Hybrid across all experiments. This configuration was the optimum of

$$\min_{\text{Architecture}} \bar{N} \quad \text{subject to} \quad \text{Acc}_{\text{Meta}} \geq \text{Acc}_{\text{Cohen, 10k}} - 0.02 \tag{18}$$

when testing across all datasets and noise levels, based upon a sweep over the architectural space ($K \in \{1, 3, 6, 10\}$, Entropy or Margin feature modes, and ranges). This configuration demonstrates that a single meta-learned prior can capture universal statistical convictions across multiple semantic domains.
For each dataset, we constructed an Accuracy Champion \(otherwise labeled as the Dataset Specialist\), which restricted the above optimization criteria to a single dataset\-and\-noise configuration\. The choice of champion for each dataset can be found in Table[4](https://arxiv.org/html/2606.27694#A3.T4)\.
Finally, the Efficiency Champion is optimized using a similar objective, targeting parity with a moderate Cohen-1k baseline

$$\min_{\text{Architecture}} \bar{N} \quad \text{subject to} \quad \text{Acc}_{\text{Meta}} \geq \text{Acc}_{\text{Cohen, 1k}}. \tag{19}$$
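The selection rules in Eqs. (18) and (19) are both constrained minimizations over a finite sweep, so they reduce to a filter followed by a minimum. The sketch below assumes each candidate is summarized by its average sample count and certified accuracy; the candidate names and every number are hypothetical placeholders, not results from the paper.

```python
def select_champion(candidates, baseline_acc, tolerance=0.0):
    """Return the candidate with the smallest average sample count among
    those with acc >= baseline_acc - tolerance, or None if none qualify.

    tolerance=0.02 mirrors Eq. (18); tolerance=0.0 mirrors Eq. (19).
    """
    feasible = [c for c in candidates if c["acc"] >= baseline_acc - tolerance]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c["avg_n"])


# Hypothetical sweep results, for illustration only.
sweep = [
    {"name": "config-A", "avg_n": 300.0, "acc": 0.55},
    {"name": "config-B", "avg_n": 450.0, "acc": 0.61},
    {"name": "config-C", "avg_n": 900.0, "acc": 0.64},
]
print(select_champion(sweep, baseline_acc=0.62, tolerance=0.02)["name"])
print(select_champion(sweep, baseline_acc=0.63)["name"])
```

Restricting `candidates` to a single dataset-and-noise configuration gives the per-dataset variant of the same rule.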
Table 4: Champion Configurations by Dataset and Noise Level. We define three champion archetypes: Global (fixed Meta-1-margin-hybrid), Efficiency (absolute minimum average sample count $\bar{N}$), and Accuracy (minimum $\bar{N}$ maintaining accuracy parity with Cohen-10k). All configurations use an aggression factor $\lambda = 1.2$.
## Appendix DDetailed Results and Aggregated Performance
The full suite of certification and efficiency results for individual datasets and noise levels is provided in Figures[2](https://arxiv.org/html/2606.27694#A4.F2)through to[9](https://arxiv.org/html/2606.27694#A4.F9)\. These figures are included to demonstrate the framework’s consistent performance across three distinct data manifolds \(MNIST, CIFAR\-10, ImageNet\) and various noise levels\. By visualizing the*Average Sample Count*\(dashed lines\) alongside the*Certified Accuracy*\(solid lines\), we provide empirical proof that the meta\-learned priors effectively deliver 20x–45x speedups overCohenet al\.\([2019](https://arxiv.org/html/2606.27694#bib.bib5)\)while maintaining a similar accuracy profile\. When comparing against the KT prior, our Meta\-learner is able to consistently reduce the required number of samples to both certify and reject samples\. This serves as the primary evidence for the system’s operational viability in high\-throughput environments\.
Table 5: Comprehensive performance comparison across datasets and noise levels.

Figure 2: Certification and efficiency results for ImageNet at $\sigma = 0.25$. Curves compare the Anytime-Valid Global Champ and Specialist Meta-models against Cohen-10k/1k and KT Prior baselines.

Figure 3: Certification and efficiency results for ImageNet at $\sigma = 0.5$. Curves compare the Anytime-Valid Global Champ and Specialist Meta-models against Cohen-10k/1k and KT Prior baselines.

Figure 4: Certification and efficiency results for ImageNet at $\sigma = 1.0$. Curves compare the Anytime-Valid Global Champ and Specialist Meta-models against Cohen-10k/1k and KT Prior baselines.

Figure 5: Certification and efficiency results for MNIST at $\sigma = 0.25$. Curves compare the Anytime-Valid Global Champ and Specialist Meta-models against Cohen-10k/1k and KT Prior baselines.

Figure 6: Certification and efficiency results for MNIST at $\sigma = 0.5$. Curves compare the Anytime-Valid Global Champ and Specialist Meta-models against Cohen-10k/1k and KT Prior baselines.

Figure 7: Certification and efficiency results for MNIST at $\sigma = 1.0$. Curves compare the Anytime-Valid Global Champ and Specialist Meta-models against Cohen-10k/1k and KT Prior baselines.

Figure 8: Certification and efficiency results for CIFAR10 at $\sigma = 0.25$. Curves compare the Anytime-Valid Global Champ and Specialist Meta-models against Cohen-10k/1k and KT Prior baselines.

Figure 9: Certification and efficiency results for CIFAR10 at $\sigma = 0.5$. Curves compare the Anytime-Valid Global Champ and Specialist Meta-models against Cohen-10k/1k and KT Prior baselines.
## Appendix EAblation Studies and Sensitivity Analysis
This section provides empirical justification for the architectural and heuristic choices in the Meta-RS framework. All studies are conducted using ImageNet at $\sigma = 0.25$ unless otherwise specified.
##### Statistical Universality and Zero\-Shot Transferability
\(Figure[10](https://arxiv.org/html/2606.27694#A5.F10)\)\. This log\-log scatter plot correlates termination latencies across datasets, and demonstrates that the meta\-learner may be able to capture fundamental statistical properties of classifier conviction rather than dataset\-specific artifacts, justifying its use as a plug\-and\-play engine for new models\.
##### Algorithmic Logic Waterfall
Figure [11](https://arxiv.org/html/2606.27694#A5.F11) breaks down the efficiency of our approach into its drivers. It is included to demonstrate the additive value of our Dual-Exit strategy, showing that while precision stopping drives certification, performance is further refined by the Bankruptcy/UCB suite.
##### Precision\-Efficiency Pareto Frontier
(Figure [12](https://arxiv.org/html/2606.27694#A5.F12)). This plot maps sample complexity against radius precision target $\epsilon$, confirming the influence of this parameterization upon resource demands.

Figure 10: Zero-Shot Transferability. A log-log scatter plot correlating termination latencies of an ImageNet-native prior vs. a CIFAR-trained prior tested on ImageNet. These results suggest that the meta-learner captures universal statistical properties of classifier conviction that transcend specific datasets.

Figure 11: Algorithmic Logic Waterfall. Incremental gains from Anytime-Valid LCB, Precision Stopping, and Rejection Heuristics. The comparison shows that while Precision Stopping provides the bulk of certification speedup, the Bankruptcy/UCB rejection suite is critical for minimizing the cost of non-robust samples.

Figure 12: Precision-Efficiency Pareto Frontier. This plot maps the scaling of sample complexity against the radius precision target $\epsilon$. As the requirement for tightness increases ($\epsilon \to 0.01$), the sample cost grows logarithmically, allowing practitioners to choose an operating point based on their computational budget.
### E\.1Parameter Influence
To interrogate the influence of different parameter strategies, we considered meta-model variants across slices in terms of the range (Table [6](https://arxiv.org/html/2606.27694#A5.T6)), mixture component counts (Table [7](https://arxiv.org/html/2606.27694#A5.T7)), and input features (Table [8](https://arxiv.org/html/2606.27694#A5.T8)). The clear signal from these is the *lack of signal*, in that on average there are no consistent trends for any of these approaches. This is not to say that these factors are not influential, but rather that the drivers of performance are multifactorial.
Table 6: Global Influence of prior range strategies. Average influence of different range settings for the Meta-model, radii relative to Cohen et al. ([2019](https://arxiv.org/html/2606.27694#bib.bib5)).

Table 7: Global Influence of mixture components (K). Average influence of different range settings for the Meta-model, radii relative to Cohen et al. ([2019](https://arxiv.org/html/2606.27694#bib.bib5)).

Table 8: Global Influence of input features. Average influence of different range settings for the Meta-model, radii relative to Cohen et al. ([2019](https://arxiv.org/html/2606.27694#bib.bib5)).