False Sense of Safety in Selective Signal Classification: Auditing Bound Tightness and Exchangeability for Risk Control
Summary
This paper audits the reliability of distribution-free risk control methods for selective classification in signal-domain detectors, finding that naive thresholding often exceeds its declared budget and that exchangeability violations cause certificate failures.
Cached at: 06/16/26, 11:38 AM
# False sense of safety in selective signal classification: auditing bound tightness and exchangeability for risk control
Source: [https://arxiv.org/html/2606.15153](https://arxiv.org/html/2606.15153)
###### Abstract
Selective prediction with distribution-free risk control promises that, with confidence $1-\delta$ over the calibration draw, the error rate of accepted inputs stays below a user budget $\alpha$. We audit this promise on two signal-domain detectors, machine anomalous-sound detection (ASD) and AI-generated-image forensics, for four calibration rules: uncertified empirical thresholding (Naive) and certified Hoeffding, Clopper–Pearson (CP), and betting (WSR) upper confidence bounds. We report three findings. (i) Naive thresholding, common in practice, exceeds its declared budget in 49–73% of synthetic trials ($n=200$ calibration points) and in up to 68% of real-data splits: a false sense of safety rather than a broken theorem, since the rule never had a certificate. (ii) Tightness matters: CP and WSR certify substantial coverage where Hoeffding certifies none, with zero observed budget overruns under exchangeable splits. (iii) Under grouped deployment (unseen machine types or generators), certified rules overrun in 9–30% of trials, far above $\delta$, showing the failure lies in the broken exchangeability premise, not in the bounds; a conservative per-group threshold restores validity at a severe coverage cost.
Index Terms: selective prediction, distribution-free risk control, conformal prediction, anomalous sound detection, image forensics
## 1 Introduction
Signal-domain detectors are rarely perfect, and many applications would rather abstain than err: an anomalous-sound-detection (ASD) system can defer a machine to human inspection, and an image-forensics system can flag an image as “undecided”. Selective classification, from Chow’s classical reject option [[1](https://arxiv.org/html/2606.15153#bib.bib1)] to its modern treatments [[2](https://arxiv.org/html/2606.15153#bib.bib2),[3](https://arxiv.org/html/2606.15153#bib.bib3)], formalizes this with an accept/abstain threshold on a confidence score. Distribution-free risk control, comprising risk-controlling prediction sets (RCPS) [[4](https://arxiv.org/html/2606.15153#bib.bib4)], Learn-then-Test (LTT) [[5](https://arxiv.org/html/2606.15153#bib.bib5)], and conformal risk control [[6](https://arxiv.org/html/2606.15153#bib.bib6)], all rooted in conformal prediction [[7](https://arxiv.org/html/2606.15153#bib.bib7),[8](https://arxiv.org/html/2606.15153#bib.bib8)] and its split variant [[9](https://arxiv.org/html/2606.15153#bib.bib9)], promises a finite-sample certificate: with probability at least $1-\delta$ over the calibration draw, the error rate among accepted inputs (the *selective risk*) is at most a user budget $\alpha$. Such certificates now reach applications from set-valued classification [[10](https://arxiv.org/html/2606.15153#bib.bib10)] and biomedical imaging [[11](https://arxiv.org/html/2606.15153#bib.bib11)] to language-model factuality [[12](https://arxiv.org/html/2606.15153#bib.bib12)], making their reliability on deployed signal systems a timely question.
In practice, two failure modes undermine this promise and are routinely ignored in signal-domain deployments. First, practitioners often tune the threshold so that the *empirical* calibration risk is below $\alpha$ and report the system as “risk-controlled”. This rule, which we call Naive, carries no certificate, so its failures break no theorem; the harm is the *false sense of safety* created when its declared budget is read as a guarantee, much as miscalibrated confidence scores mislead downstream decisions [[13](https://arxiv.org/html/2606.15153#bib.bib13)]. How often, and by how much, the budget is exceeded at test time has not been quantified on signal data. Second, all certificates assume calibration and test data are exchangeable, but realistic signal-domain splits are *grouped*: the deployed system meets machine types or generative models absent from calibration. Weighted and adaptive conformal methods for covariate shift [[14](https://arxiv.org/html/2606.15153#bib.bib14),[15](https://arxiv.org/html/2606.15153#bib.bib15)], robust prediction sets under $f$-divergence shifts [[16](https://arxiv.org/html/2606.15153#bib.bib16)], and conformal inference beyond exchangeability [[17](https://arxiv.org/html/2606.15153#bib.bib17)] all relax this premise for coverage guarantees, but the magnitude of the damage to selective-risk certificates on signal data is unknown.
Closest to our work, Basu [[18](https://arxiv.org/html/2606.15153#bib.bib18)] ablates nine finite-sample bound families (including Hoeffding, Clopper–Pearson, and WSR betting) for selective prediction on NLP intent-classification benchmarks and proposes a transfer-informed warm start for the betting bound. Our audit is complementary and differs in three ways: we target signal domains (ASD, image forensics) with their characteristic grouped deployment structure; we quantify the false-safety behaviour of the uncertified Naive rule, which [[18](https://arxiv.org/html/2606.15153#bib.bib18)] does not consider; and rather than assuming a benign source domain, we isolate, via matched random-vs-grouped splits on the same data, how much of the observed failure is attributable to broken exchangeability rather than to the bound itself.
Contributions. (1) A unified audit of four selective thresholds (Naive; Hoeffding [[19](https://arxiv.org/html/2606.15153#bib.bib19)]; Clopper–Pearson [[20](https://arxiv.org/html/2606.15153#bib.bib20)]; the Waudby-Smith–Ramdas betting bound, WSR [[21](https://arxiv.org/html/2606.15153#bib.bib21)]) under one protocol on synthetic data, real ASD scores (DCASE 2023 Task 2 development set, 7 machine types, BEATs embeddings, 4 score backends) [[22](https://arxiv.org/html/2606.15153#bib.bib22),[23](https://arxiv.org/html/2606.15153#bib.bib23)], and real forensics scores (GenImage, 7 generators) [[24](https://arxiv.org/html/2606.15153#bib.bib24)]. (2) Quantitative evidence of the false sense of safety and of the coverage recovered by tight bounds. (3) A controlled exchangeability ablation showing that budget overruns under group shift are caused by the broken premise, not the rules, plus a simple per-group conservative mitigation and its trade-off.
## 2 Selective risk control: rules and bounds
### 2.1 Setup and notation
A frozen detector emits a scalar risk score $s(x)$, where larger means more likely to be an error event $y=1$ (an anomaly missed, a fake accepted as real, etc.). Given a threshold $\lambda$, the system accepts $x$ iff $s(x)\leq\lambda$. The *selective risk* is $R(\lambda)=\Pr\{y=1\mid s(x)\leq\lambda\}$ and the *coverage* is $C(\lambda)=\Pr\{s(x)\leq\lambda\}$, i.e. the fraction of inputs the system answers. Given calibration data $\{(s_i,y_i)\}_{i=1}^{n}$, budget $\alpha$, and confidence parameter $\delta$, every rule picks the largest threshold whose certified (or, for Naive, empirical) risk is within budget:

$$\hat{\lambda}=\max\bigl\{\lambda\in\{s_{(k)}\}_{k=1}^{n}:\mathrm{UCB}_{\delta/n}\bigl(\{y_i:s_i\leq\lambda\}\bigr)\leq\alpha\bigr\},\tag{1}$$

where $s_{(1)}\leq\dots\leq s_{(n)}$ are the sorted calibration scores and $\mathrm{UCB}_{\delta'}(\cdot)$ is an upper confidence bound at level $\delta'$ on the mean of the indicated losses. The Bonferroni correction $\delta/n$ over the $n$ candidate thresholds makes the certificate valid simultaneously over the search, as in LTT [[5](https://arxiv.org/html/2606.15153#bib.bib5)]. If no candidate passes, the system always abstains ($C=0$, trivially safe). We call a test run a *budget violation* if the empirical test selective risk exceeds $\alpha$ with at least one accepted sample. For certified rules this event has probability at most $\delta$ under exchangeability; for Naive no such promise exists, and its observed violation rate quantifies the false sense of safety, not a broken proof.
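The selection rule in Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `select_threshold`, `is_budget_violation`, `naive_ucb`, and `hoeffding_ucb` are hypothetical, the tie handling is an assumption, and any bound from Sec. 2.2 can be passed in as `ucb`.

```python
import math
from typing import Callable, Optional, Sequence, Tuple

# A bound maps (errors in prefix, prefix size k, delta') to an upper bound on risk.
Ucb = Callable[[int, int, float], float]


def naive_ucb(errors: int, k: int, delta_prime: float) -> float:
    """Empirical risk of the accepted prefix; carries no certificate."""
    return errors / k


def hoeffding_ucb(errors: int, k: int, delta_prime: float) -> float:
    """Empirical risk plus the Hoeffding deviation term at level delta'."""
    return errors / k + math.sqrt(math.log(1.0 / delta_prime) / (2.0 * k))


def select_threshold(pairs: Sequence[Tuple[float, int]], alpha: float,
                     delta: float, ucb: Ucb) -> Optional[float]:
    """Largest calibration score whose bounded prefix risk is <= alpha (Eq. 1).

    `pairs` holds (score, label) with label 1 for an error event.
    Returns None when no candidate passes, i.e. the system always abstains.
    """
    ordered = sorted(pairs)
    n = len(ordered)
    best, errors = None, 0
    for k, (score, label) in enumerate(ordered, start=1):
        errors += label
        # Evaluate only at the last element of a tie, so that the prefix
        # is exactly {i : s_i <= score}.
        last_of_tie = k == n or ordered[k][0] != score
        if last_of_tie and ucb(errors, k, delta / n) <= alpha:
            best = score  # ascending scan: the last pass is the maximum
    return best


def is_budget_violation(test_pairs: Sequence[Tuple[float, int]],
                        lam: Optional[float], alpha: float) -> bool:
    """True if at least one test input is accepted and their error rate exceeds alpha."""
    if lam is None:
        return False
    accepted = [label for score, label in test_pairs if score <= lam]
    return bool(accepted) and sum(accepted) / len(accepted) > alpha
```

Because the scan runs over candidates in ascending order and keeps the last one that passes, the returned value is the maximum in Eq. (1); a return value of `None` corresponds to the always-abstain case ($C=0$).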
### 2.2 The four rules
Naive: $\mathrm{UCB}:=\hat{p}$, the empirical risk of the accepted prefix.

Hoeffding: $\mathrm{UCB}:=\hat{p}+\sqrt{\ln(1/\delta')/(2k)}$ for a prefix of size $k$ [[19](https://arxiv.org/html/2606.15153#bib.bib19)]; valid but loose at small $\alpha$. (Empirical-Bernstein refinements [[25](https://arxiv.org/html/2606.15153#bib.bib25)] sharpen the constant but not the qualitative picture below.)

CP: the exact binomial upper limit [[20](https://arxiv.org/html/2606.15153#bib.bib20)], $\mathrm{Beta}^{-1}(1-\delta';x+1,k-x)$ for $x$ errors among $k$; admissible for binary losses.

WSR: the betting/hedged bound of [[21](https://arxiv.org/html/2606.15153#bib.bib21)], an instance of the confidence-sequence and e-value machinery of game-theoretic statistics [[26](https://arxiv.org/html/2606.15153#bib.bib26),[27](https://arxiv.org/html/2606.15153#bib.bib27)]. With running mean $\hat{\mu}_t=\tfrac{1/2+\sum_{i\leq t}x_i}{t+1}$ and variance $\hat{\sigma}_t^{2}=\tfrac{1/4+\sum_{i\leq t}(x_i-\hat{\mu}_i)^{2}}{t+1}$, form bets and wealth

$$\nu_t=\min\Bigl\{1,\sqrt{\tfrac{2\ln(1/\delta')}{n\hat{\sigma}_{t-1}^{2}}}\Bigr\},\quad K_t(R)=\prod_{i\leq t}\bigl(1-\nu_i(x_i-R)\bigr).\tag{2}$$

$K_t(R)$ is a nonnegative supermartingale when $\mathbb{E}[x]\geq R$, so $\mathrm{UCB}:=\inf\{R:\max_t K_t(R)>1/\delta'\}$, found by bisection since the wealth is monotone in $R$. Three conservative implementation choices: the bisection returns the rejected endpoint (upward bias); each candidate prefix is fed to the martingale in a data-independent random order (score-sorted order would break the martingale property); and we keep the Bonferroni correction over candidate thresholds rather than exploiting time-uniformity. A Monte-Carlo self-test ($p=0.1$, $n=200$, $\delta=0.05$, 200 draws) gives mean upper bounds $0.187$ (Hoeffding) $>0.156$ (WSR) $>0.143$ (CP), with empirical coverage of the WSR bound $\geq 94.5\%$; Fig. [1](https://arxiv.org/html/2606.15153#S2.F1) traces the gap over calibration sizes.
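The three certified bounds can be sketched with the Python standard library alone. This is an illustrative reading of the formulas above, not the authors' code: the function names are hypothetical, the Clopper–Pearson limit is obtained by bisection on the binomial CDF instead of a Beta quantile routine, and $n$ in Eq. (2) is assumed to be the number of losses passed to `wsr_ucb`. The caller is assumed to have already put the losses into a data-independent random order.

```python
import math
from typing import Sequence


def hoeffding_ucb(errors: int, k: int, delta_prime: float) -> float:
    """Hoeffding upper bound on the mean of k losses in [0, 1]."""
    return errors / k + math.sqrt(math.log(1.0 / delta_prime) / (2.0 * k))


def cp_ucb(errors: int, k: int, delta_prime: float) -> float:
    """Clopper-Pearson upper limit: the p solving P(Bin(k, p) <= errors) = delta'."""
    if errors >= k:
        return 1.0

    def cdf(p: float) -> float:
        return sum(math.comb(k, i) * p**i * (1.0 - p) ** (k - i)
                   for i in range(errors + 1))

    lo, hi = errors / k, 1.0  # the CDF decreases in p, from above delta' to 0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if cdf(mid) > delta_prime:
            lo = mid
        else:
            hi = mid
    return hi  # upper endpoint, so the rounding is conservative


def wsr_ucb(losses: Sequence[float], delta_prime: float, tol: float = 1e-4) -> float:
    """Betting upper bound following Eq. (2); losses must lie in [0, 1]."""
    n = len(losses)
    log_target = math.log(1.0 / delta_prime)

    # Predictable bets: nu_t uses the variance estimate from steps < t only.
    bets, total, sq_dev, var = [], 0.5, 0.25, 0.25
    for t, x in enumerate(losses, start=1):
        bets.append(min(1.0, math.sqrt(2.0 * log_target / (n * var))))
        total += x
        mu = total / (t + 1)
        sq_dev += (x - mu) ** 2
        var = sq_dev / (t + 1)

    def rejected(r: float) -> bool:
        """True if the wealth K_t(r) exceeds 1/delta' at some step t."""
        log_wealth = 0.0
        for nu, x in zip(bets, losses):
            factor = 1.0 - nu * (x - r)
            if factor <= 0.0:
                return False  # wealth is zero from this step on
            log_wealth += math.log(factor)
            if log_wealth > log_target:
                return True
        return False

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if rejected(mid):
            hi = mid
        else:
            lo = mid
    return hi  # rejected endpoint (or 1.0 if nothing is rejected)
```

Each wealth factor increases with $R$, so the rejection test is monotone and bisection applies. To match the random-order choice described above, a prefix can be shuffled with a seeded `random.Random` before it is passed to `wsr_ucb`.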
When can a bound certify anything? The zero-error prefix gives a useful closed form. With $x=0$ errors among $k$ accepted calibration points and $\delta'=\delta/n$, Hoeffding can certify $\sqrt{\ln(1/\delta')/(2k)}\leq\alpha$ only when $k\geq\ln(1/\delta')/(2\alpha^{2})$, whereas CP needs only $k\geq\ln(1/\delta')/\ln\tfrac{1}{1-\alpha}$, smaller by a factor of about $2\alpha$. At $n=200$ and $\delta=0.1$ ($\delta'=5\times10^{-4}$) the requirements are $380$ vs. $73$ accepted points at $\alpha=0.1$ and $1520$ vs. $149$ at $\alpha=0.05$: on $200$ calibration points Hoeffding cannot certify $\alpha\leq0.1$ *even for a perfect detector*, while CP succeeds as soon as a clean prefix of $73$ points exists. This single calculation predicts the pattern of zeros in Table [1](https://arxiv.org/html/2606.15153#S3.T1).
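The zero-error requirements can be reproduced with a short calculation. The sketch below assumes only the two closed forms stated above; the function name `min_clean_prefix` is hypothetical.

```python
import math
from typing import Tuple


def min_clean_prefix(alpha: float, delta: float, n: int) -> Tuple[float, float]:
    """Real-valued lower limits on k for a zero-error prefix.

    Returns (Hoeffding requirement, Clopper-Pearson requirement), both using
    the Bonferroni level delta' = delta / n.
    """
    log_term = math.log(n / delta)  # ln(1 / delta')
    hoeffding = log_term / (2.0 * alpha**2)
    clopper_pearson = log_term / math.log(1.0 / (1.0 - alpha))
    return hoeffding, clopper_pearson


for alpha in (0.1, 0.05):
    hoeff, cp = min_clean_prefix(alpha, delta=0.1, n=200)
    print(f"alpha={alpha}: Hoeffding k >= {hoeff:.1f}, CP k >= {cp:.1f}")
```

The raw values are roughly 380.0 and 72.1 at $\alpha=0.1$ and roughly 1520.2 and 148.2 at $\alpha=0.05$. Rounding the CP values up gives the 73 and 149 points quoted above, and the Hoeffding values are within one point of the quoted 380 and 1520.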
Fig. 1: Bound tightness: mean $95\%$ upper confidence bound on a Bernoulli rate $p=0.1$ vs. calibration size $n$ (200 Monte-Carlo draws per point). CP is tightest for binary losses; WSR tracks it closely and Hoeffding lags, which explains its zero certified coverage at small $\alpha$ in Table [1](https://arxiv.org/html/2606.15153#S3.T1).
### 2.3 Exchangeability and a conservative mitigation
Certificate ([1](https://arxiv.org/html/2606.15153#S2.E1)) requires calibration and test points to be exchangeable. In grouped deployments (calibrate on machine types or generators $\mathcal{G}_{\mathrm{cal}}$, test on disjoint $\mathcal{G}_{\mathrm{te}}$) this premise fails. As a baseline mitigation we compute a per-group threshold $\hat{\lambda}_g$ on each calibration group $g$ with budget split $\delta/|\mathcal{G}_{\mathrm{cal}}|$ and deploy the most conservative one, $\hat{\lambda}_{\mathrm{mit}}=\min_g\hat{\lambda}_g$. This guards against the calibration mixture hiding a hard group but still carries no formal guarantee for unseen groups.
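A minimal sketch of the per-group rule follows. It is an assumption-laden illustration: the names `per_group_threshold` and `select_threshold` are hypothetical, the bound is passed in as a callable, and a group for which no threshold passes is treated as forcing the deployed system to abstain everywhere.

```python
import math
from typing import Callable, Dict, Optional, Sequence, Tuple

# A bound maps (errors in prefix, prefix size k, delta') to an upper bound on risk.
Ucb = Callable[[int, int, float], float]
Pairs = Sequence[Tuple[float, int]]  # (score, label), label 1 = error event


def hoeffding_ucb(errors: int, k: int, delta_prime: float) -> float:
    """One possible certified bound; CP or WSR could be substituted."""
    return errors / k + math.sqrt(math.log(1.0 / delta_prime) / (2.0 * k))


def select_threshold(pairs: Pairs, alpha: float, delta: float,
                     ucb: Ucb) -> Optional[float]:
    """Largest score whose bounded prefix risk is within budget; None = abstain."""
    ordered = sorted(pairs)
    n = len(ordered)
    best, errors = None, 0
    for k, (score, label) in enumerate(ordered, start=1):
        errors += label
        last_of_tie = k == n or ordered[k][0] != score
        if last_of_tie and ucb(errors, k, delta / n) <= alpha:
            best = score
    return best


def per_group_threshold(groups: Dict[str, Pairs], alpha: float, delta: float,
                        ucb: Ucb) -> Optional[float]:
    """Most conservative per-group threshold, with delta split across groups."""
    group_delta = delta / len(groups)
    thresholds = [select_threshold(pairs, alpha, group_delta, ucb)
                  for pairs in groups.values()]
    if any(t is None for t in thresholds):
        return None  # some group certifies nothing, so abstain everywhere
    return min(thresholds)
```

Taking the minimum means the hardest calibration group alone sets the deployed threshold, which is the source of the coverage cost reported in Sec. 3.5.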
## 3 Experiments
All experiments are post hoc on frozen scores and CPU-only; $\delta=0.1$ and $\alpha\in\{0.05,0.1,0.2\}$ (plus $0.4$ for the weak real scores) throughout. Violation rates are over independent repetitions (100 synthetic, 50 real); under exchangeability, certified rules are allowed up to $\delta=10\%$. As a global negative control, pooling *every* certified-rule run under exchangeable random splits yields 0 violations in 2700 synthetic runs, 2 in 900 controlled-shift runs, and 2 in 3000 real-score runs, so the violations reported below are not implementation artifacts.
Data. *Synthetic*: $y\sim\mathrm{Bern}(\pi)$, $s\mid y\sim\mathcal{N}(d'y,1)$ with $d'=1.5$, $\pi\in\{0.1,0.3,0.5\}$, $n_{\mathrm{cal}}=200$, $n_{\mathrm{te}}=5000$. *ASD*: the DCASE 2023 Task 2 development set [[22](https://arxiv.org/html/2606.15153#bib.bib22)], built on ToyADMOS2 [[28](https://arxiv.org/html/2606.15153#bib.bib28)] and MIMII DG [[29](https://arxiv.org/html/2606.15153#bib.bib29)] (7 machine types: ToyCar, ToyTrain, bearing, fan, gearbox, slider, valve; 200 clips each, balanced normal/anomalous, base rate $0.5$); we extract BEATs embeddings [[23](https://arxiv.org/html/2606.15153#bib.bib23)] and compute four standard anomaly-score backends (two $k$-NN variants, Mahalanobis, PCA residual; test AUC 0.60–0.65). *Forensics*: a GenImage [[24](https://arxiv.org/html/2606.15153#bib.bib24)] subset with 800 real images and 7 generators (ADM, BigGAN, GLIDE, Midjourney, SD1.5, VQDM, Wukong) $\times$ 100 fakes each, scored by a VGG [[30](https://arxiv.org/html/2606.15153#bib.bib30)] feature detector trained independently of calibration (AUC $0.835$); although such detectors can transfer across GAN architectures with careful augmentation [[31](https://arxiv.org/html/2606.15153#bib.bib31)], generalization to unseen (e.g. diffusion-based) generators remains the known hard case [[32](https://arxiv.org/html/2606.15153#bib.bib32)], which is precisely the grouped deployment we audit. Error events are $y=1$: a missed anomaly (ASD) or a fake accepted as real (forensics).
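The synthetic generator is simple enough to sketch directly from its definition. The function name `sample_synthetic` and the seeding scheme are assumptions for illustration; only the distributional form comes from the text.

```python
import random
from typing import List, Optional, Tuple


def sample_synthetic(n: int, pi: float, d_prime: float = 1.5,
                     seed: Optional[int] = None) -> List[Tuple[float, int]]:
    """Draw n (score, label) pairs with y ~ Bern(pi) and s | y ~ N(d' * y, 1).

    Label 1 marks an error event, so error events tend to receive larger
    risk scores, as the setup in Sec. 2.1 requires.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = 1 if rng.random() < pi else 0
        s = rng.gauss(d_prime * y, 1.0)
        data.append((s, y))
    return data


# One repetition of the synthetic protocol: 200 calibration and 5000 test points.
calibration = sample_synthetic(200, pi=0.3, seed=0)
test = sample_synthetic(5000, pi=0.3, seed=1)
```

Calibration and test sets drawn this way are i.i.d. from the same distribution, so exchangeability holds by construction in the synthetic audit.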
Table 1: Synthetic audit: mean coverage / violation rate over 100 repetitions ($n_{\mathrm{cal}}=200$, $n_{\mathrm{te}}=5000$, $\delta=0.1$).

Fig. 2: Certified coverage vs. risk budget $\alpha$ on synthetic data ($n_{\mathrm{cal}}=200$, mean over 100 repetitions, $\delta=0.1$). Naive (orange diamonds) buys its high coverage with budget violations (annotated rates); certified rules never violated in these runs, and CP/WSR recover most of the coverage that Hoeffding forfeits.

### 3.1 Synthetic audit: false safety and bound tightness
Table [1](https://arxiv.org/html/2606.15153#S3.T1) and Fig. [2](https://arxiv.org/html/2606.15153#S3.F2) show the main audit. Naive attains high coverage but exceeds its declared budget in 49–73% of trials at $\alpha=0.05$ across base rates, worst exactly where a guarantee is needed most. Hoeffding is safe but certifies zero coverage in 8 of 9 settings. CP recovers substantial coverage at zero observed violations ($0\to0.545$ at $\pi=0.1,\alpha=0.1$; $0\to0.481$ at $\pi=0.3,\alpha=0.2$), and WSR tracks it closely ($0.494$, $0.436$). For binary losses CP is exact, so WSR’s value lies in extending to bounded non-binary losses at near-CP tightness.
### 3.2 Real scores under exchangeable splits
Table [2](https://arxiv.org/html/2606.15153#S3.T2) evaluates the frozen real scores with random 50/50 calibration/test splits, where exchangeability holds by construction. The false sense of safety reproduces: Naive violates its budget in 20–68% of ASD splits and 34–52% of forensics splits. All certified rules stay at 0–2%, well within $\delta$. The price of certification depends on score quality: on the strong forensics score, CP certifies coverage $0.103$ at $\alpha=0.2$ and $0.720$ at $\alpha=0.4$; on the weak ASD scores (AUC $\leq0.65$ at base rate $0.5$) every certified rule abstains almost everywhere for $\alpha\leq0.2$. We report this honest negative result deliberately: with weak detectors there is nothing to certify, and only Naive pretends otherwise.
Table 2: Real scores, random splits (exchangeable), 50 repetitions, $\delta=0.1$: mean coverage / violation rate per backend (two of four ASD backends shown; the others behave alike).
### 3.3 Exchangeability ablation: shift, not the bounds
We isolate the effect of grouped deployment with matched splits on the same data. *Controlled synthetic shift*: six groups with heterogeneous base rates ($0.05$–$0.45$), separations ($d'=1.4$–$2.5$), and score offsets (from $-0.4$ to $0.6$); random mixes all groups into calibration and test, grouped calibrates on 3 groups and tests on the other 3 (100 repetitions). Table [3](https://arxiv.org/html/2606.15153#S3.T3) and Fig. [3](https://arxiv.org/html/2606.15153#S3.F3): under random splits CP/WSR violate in at most 1% of trials (within $\delta=10\%$); under grouped splits the *same rules on the same data* violate in 9–30%, exceeding $\delta$ at $\alpha\geq0.1$. Binomial uncertainty does not explain this: the exact 95% CI for the grouped CP rate is $[0.19,0.37]$ at $\alpha=0.1$ and $[0.21,0.40]$ at $\alpha=0.2$, both excluding $\delta$. The failure is therefore a property of the broken exchangeability premise, not of the bounds. Naive overruns in both conditions (44–51%).
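An exact (Clopper–Pearson) interval for a violation rate can be computed with the standard library, as in the sketch below. The function names are hypothetical, and the count of 30 violations in 100 repetitions is an illustrative assumption taken from the upper end of the reported 9–30% range, not a value read from Table 3.

```python
import math
from typing import Callable, Tuple


def binom_cdf(x: int, n: int, p: float) -> float:
    """P(Bin(n, p) <= x)."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(x + 1))


def _solve(f: Callable[[float], float], target: float, increasing: bool) -> float:
    """Bisection for f(p) = target on [0, 1], with f monotone in p."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(x: int, n: int, conf: float = 0.95) -> Tuple[float, float]:
    """Exact two-sided confidence interval for a binomial proportion x / n."""
    tail = (1.0 - conf) / 2.0
    # Lower limit: P(X >= x | p) = tail, which increases in p.
    lower = 0.0 if x == 0 else _solve(
        lambda p: 1.0 - binom_cdf(x - 1, n, p), tail, increasing=True)
    # Upper limit: P(X <= x | p) = tail, which decreases in p.
    upper = 1.0 if x == n else _solve(
        lambda p: binom_cdf(x, n, p), tail, increasing=False)
    return lower, upper


lower, upper = clopper_pearson(30, 100)  # assumed count, see the note above
print(f"30/100 violations: 95% CI = [{lower:.2f}, {upper:.2f}]")
```

The check that matters for the argument is whether the lower limit stays above $\delta=0.1$; if it does, sampling noise over 100 repetitions cannot account for the overrun.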
On the real scores the pattern recurs at practical budgets. Forensics (generator holdout: calibrate on Real + 4 generators, test on Real + 3 unseen generators): CP violations rise from 0% (random) to 2% at $\alpha=0.1$ and 6% at $\alpha=0.2$; WSR to 2% / 4%. ASD (machine-type holdout, $\alpha=0.4$ where certified coverage is non-zero): CP rises from 0–2% to 18–26% on three of four backends, and WSR behaves similarly.
Table 3: Controlled synthetic group shift, 100 repetitions, $\delta=0.1$: violation rate by split protocol (mean coverage in parentheses for the $\alpha=0.1$ rows).

Fig. 3: Group shift, not the bounds, causes budget violations (controlled synthetic shift, 100 repetitions; dotted line marks the tolerance $\delta=0.1$ that certified rules are allowed under exchangeability). Random splits keep CP/WSR at $\leq1\%$; grouped splits push the same rules to 9–30%; the per-group conservative threshold (mitig.) restores validity. Naive overruns its budget throughout.
### 3.4 Failure mechanism: the hardest group breaks the budget
The grouped-split violations are not diffuse: they trace to identifiable hard groups that the pooled calibration mixture averages away. On the forensics score the seven generators are far from homogeneous: the per-generator AUC against the shared real class spans $0.690$ (VQDM) to $0.980$ (GLIDE), and at the pooled CP threshold for $\alpha=0.2$ the fake-acceptance rate among accepted images ranges from $0$–$2\%$ (BigGAN, GLIDE) to $23$–$30\%$ (Midjourney, VQDM). Whether a grouped run violates is then almost a deterministic function of *which* generators are held out: over 100 grouped repetitions, CP at $\alpha=0.2$ violated in $0\%$ of the runs whose test set contained at most one of the three hardest generators (VQDM, ADM, SD1.5), in $21\%$ of those containing two, and in $100\%$ of those containing all three. In ASD the anatomy concentrates in one group: the weakest machine type (valve, per-machine AUC $0.545$, $k$-NN backend) accounts for essentially all overruns. At $\alpha=0.4$, CP violated in $63\%$ of grouped runs with valve in the test set and in $0\%$ of the rest. The general reading: a pooled certificate is an *on-average* statement over the calibration mixture, and grouped deployment re-weights the test distribution toward exactly the groups the certificate never audited. It also explains why the controlled shift violates more often than the forensics holdout: its groups differ in base rate and offset simultaneously, the worst case for a pooled threshold.
### 3.5 Mitigation trade-off
The per-group conservative threshold of Sec. [2.3](https://arxiv.org/html/2606.15153#S2.SS3) restores empirical validity: on the controlled shift it pushes CP violations from 11–30% back to 0–7% and WSR from 9–29% to 0–3% (Table [3](https://arxiv.org/html/2606.15153#S3.T3)); on the forensics holdout it removes all observed CP/WSR violations. The price is severe: certified coverage collapses (CP at $\alpha=0.1$: $0.62\to0.02$; forensics at $\alpha=0.4$: $0.70\to0.03$), because each group alone offers few calibration points and the minimum is dominated by the hardest group. Two observations sharpen this trade-off. First, the collapse is partly structural: each cell retains only $n_g=200$ of the calibration points, so by the closed form of Sec. 2.2 CP needs a clean prefix of $\approx39$ accepted points per cell at $\alpha=0.2$, which the hardest cell rarely offers; the minimum then certifies little (mean coverage $0.90\to0.28$ at $\alpha=0.2$, now valid). Second, the mitigation only repairs rules that had a certificate to begin with: Naive under the same per-group minimum still violated in 20–23% of the controlled-shift runs (Table [3](https://arxiv.org/html/2606.15153#S3.T3)), so added conservatism is no substitute for a bound. Together these quantify an open robustness–utility gap that group-shift-aware certificates (e.g. weighted, adaptive, robust, or beyond-exchangeability conformal methods [[14](https://arxiv.org/html/2606.15153#bib.bib14),[15](https://arxiv.org/html/2606.15153#bib.bib15),[16](https://arxiv.org/html/2606.15153#bib.bib16),[17](https://arxiv.org/html/2606.15153#bib.bib17)]) would need to close for selective risk on signals.
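The figure of about 39 clean points per cell can be checked against the zero-error closed form of Sec. 2.2. The working below assumes that the cell-level confidence combines the per-group split $\delta/|\mathcal{G}_{\mathrm{cal}}|$ with the Bonferroni correction over the $n_g$ candidate thresholds, and that $|\mathcal{G}_{\mathrm{cal}}|=3$ as in the controlled shift; the text does not state the group count behind this number, so the value 3 is an assumption.

$$\delta'=\frac{\delta}{|\mathcal{G}_{\mathrm{cal}}|\,n_g}=\frac{0.1}{3\times200}=\frac{1}{6000},\qquad k\geq\frac{\ln(1/\delta')}{\ln\frac{1}{1-\alpha}}=\frac{\ln 6000}{\ln 1.25}\approx\frac{8.70}{0.223}\approx 39.0$$

Under these assumptions a cell needs roughly 39 error-free accepted calibration points before CP can certify $\alpha=0.2$, which matches the quoted value.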
## 4 Conclusion and limitations
We audited selective-prediction risk certificates on signal-domain scores. Uncertified empirical thresholding exceeds its declared budget in up to 73% of trials, a false sense of safety whenever it is reported as “risk-controlled”. Exact and betting bounds (CP, WSR) are the practical choices, recovering coverage that Hoeffding forfeits while keeping violations within $\delta$ under exchangeability. Grouped deployment breaks the premise of all certificates, and a conservative per-group fix trades almost all coverage for validity. We recommend that papers reporting risk-controlled selective systems state the bound type, $\delta$, and the split protocol; report violation rates over repeated splits with binomial confidence intervals; include a random-split negative control to separate shift effects from implementation error; and release frozen scores so the audit is reproducible. *Limitations*: our losses are binary, where CP is already exact and WSR’s advantage is its generality to bounded losses; the real-data detectors are deliberately simple frozen baselines, and stronger scores would enlarge the certified regimes; the mitigation is a heuristic without guarantees for unseen groups; and our group-shift evidence covers two signal tasks, not all deployment regimes.
## References
- [1] C. K. Chow, “On optimum recognition error and reject tradeoff,” IEEE Transactions on Information Theory, vol. 16, no. 1, pp. 41–46, 1970.
- [2] Ran El-Yaniv and Yair Wiener, “On the foundations of noise-free selective classification,” Journal of Machine Learning Research, vol. 11, pp. 1605–1641, 2010.
- [3] Yonatan Geifman and Ran El-Yaniv, “Selective classification for deep neural networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2017.
- [4] Stephen Bates, Anastasios Angelopoulos, Lihua Lei, Jitendra Malik, and Michael I. Jordan, “Distribution-free, risk-controlling prediction sets,” Journal of the ACM, vol. 68, no. 6, 2021.
- [5] Anastasios N. Angelopoulos, Stephen Bates, Emmanuel J. Candès, Michael I. Jordan, and Lihua Lei, “Learn then test: Calibrating predictive algorithms to achieve risk control,” Annals of Applied Statistics, vol. 19, no. 2, pp. 1641–1662, 2025.
- [6] Anastasios N. Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster, “Conformal risk control,” in Proc. International Conference on Learning Representations (ICLR), 2024.
- [7] Vladimir Vovk, Alexander Gammerman, and Glenn Shafer, Algorithmic Learning in a Random World, Springer, 2005.
- [8] Anastasios N. Angelopoulos and Stephen Bates, “Conformal prediction: A gentle introduction,” Foundations and Trends in Machine Learning, vol. 16, no. 4, pp. 494–591, 2023.
- [9] Harris Papadopoulos, Kostas Proedrou, Vladimir Vovk, and Alexander Gammerman, “Inductive confidence machines for regression,” in Proc. European Conference on Machine Learning (ECML), 2002.
- [10] Mauricio Sadinle, Jing Lei, and Larry Wasserman, “Least ambiguous set-valued classifiers with bounded error levels,” Journal of the American Statistical Association, vol. 114, no. 525, pp. 223–234, 2019.
- [11] Anastasios N. Angelopoulos, Amit Pal Kohli, Stephen Bates, Michael I. Jordan, Jitendra Malik, Thayer Alshaabi, Srigokul Upadhyayula, and Yaniv Romano, “Image-to-image regression with distribution-free uncertainty quantification and applications in imaging,” in Proc. International Conference on Machine Learning (ICML), 2022.
- [12] Christopher Mohri and Tatsunori Hashimoto, “Language models with conformal factuality guarantees,” in Proc. International Conference on Machine Learning (ICML), 2024, pp. 36029–36047.
- [13] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger, “On calibration of modern neural networks,” in Proc. International Conference on Machine Learning (ICML), 2017.
- [14] Ryan J. Tibshirani, Rina Foygel Barber, Emmanuel J. Candès, and Aaditya Ramdas, “Conformal prediction under covariate shift,” in Advances in Neural Information Processing Systems (NeurIPS), 2019.
- [15] Isaac Gibbs and Emmanuel J. Candès, “Adaptive conformal inference under distribution shift,” in Advances in Neural Information Processing Systems (NeurIPS), 2021.
- [16] Maxime Cauchois, Suyash Gupta, Alnur Ali, and John C. Duchi, “Robust validation: Confident predictions even when distributions shift,” Journal of the American Statistical Association, vol. 119, no. 548, pp. 3033–3044, 2024.
- [17] Rina Foygel Barber, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani, “Conformal prediction beyond exchangeability,” Annals of Statistics, vol. 51, no. 2, pp. 816–845, 2023.
- [18] Abhinaba Basu, “Cross-domain uncertainty quantification for selective prediction: A comprehensive bound ablation with transfer-informed betting,” arXiv preprint arXiv:2603.08907, 2026.
- [19] Wassily Hoeffding, “Probability inequalities for sums of bounded random variables,” Journal of the American Statistical Association, vol. 58, no. 301, pp. 13–30, 1963.
- [20] C. J. Clopper and E. S. Pearson, “The use of confidence or fiducial limits illustrated in the case of the binomial,” Biometrika, vol. 26, no. 4, pp. 404–413, 1934.
- [21] Ian Waudby-Smith and Aaditya Ramdas, “Estimating means of bounded random variables by betting,” Journal of the Royal Statistical Society: Series B, vol. 86, no. 1, pp. 1–27, 2024.
- [22] Kota Dohi, Keisuke Imoto, Noboru Harada, Daisuke Niizumi, Yuma Koizumi, Tomoya Nishida, Harsh Purohit, Ryo Tanabe, Takashi Endo, and Yohei Kawaguchi, “Description and discussion on DCASE 2023 challenge task 2: First-shot unsupervised anomalous sound detection for machine condition monitoring,” in Proc. Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2023.
- [23] Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, Wanxiang Che, Xiangzhan Yu, and Furu Wei, “BEATs: Audio pre-training with acoustic tokenizers,” in Proc. International Conference on Machine Learning (ICML), 2023.
- [24] Mingjian Zhu, Hanting Chen, Qiangyu Yan, Xudong Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu, and Yunhe Wang, “GenImage: A million-scale benchmark for detecting AI-generated image,” in Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks, 2023.
- [25] Andreas Maurer and Massimiliano Pontil, “Empirical Bernstein bounds and sample variance penalization,” in Proc. Conference on Learning Theory (COLT), 2009.
- [26] Steven R. Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon, “Time-uniform, nonparametric, nonasymptotic confidence sequences,” Annals of Statistics, vol. 49, no. 2, pp. 1055–1080, 2021.
- [27] Aaditya Ramdas, Peter Grünwald, Vladimir Vovk, and Glenn Shafer, “Game-theoretic statistics and safe anytime-valid inference,” Statistical Science, vol. 38, no. 4, pp. 576–601, 2023.
- [28] Noboru Harada, Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Masahiro Yasuda, and Shoichiro Saito, “ToyADMOS2: Another dataset of miniature-machine operating sounds for anomalous sound detection under domain shift conditions,” in Proc. Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2021, pp. 1–5.
- [29] Kota Dohi, Tomoya Nishida, Harsh Purohit, Ryo Tanabe, Takashi Endo, Masaaki Yamamoto, Yuki Nikaido, and Yohei Kawaguchi, “MIMII DG: Sound dataset for malfunctioning industrial machine investigation and inspection for domain generalization task,” in Proc. Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2022.
- [30] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. International Conference on Learning Representations (ICLR), 2015.
- [31] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A. Efros, “CNN-generated images are surprisingly easy to spot… for now,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- [32] Utkarsh Ojha, Yuheng Li, and Yong Jae Lee, “Towards universal fake image detectors that generalize across generative models,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 24480–24489.