When Can Conformal Risk Control Certify LLM Outputs? Bounds, Impossibility, and Adaptation for Structured Generation

arXiv cs.LG Papers

Summary

This paper characterizes when conformal risk control can certify structured LLM outputs, proving an impossibility bound on the abstention required and analyzing a certification hierarchy across Hoeffding, empirical Bernstein, and betting-based bounds. Empirical validation on six open-weight models shows that hard configurations are uncertifiable at low risk levels but practical certification is achievable at relaxed targets.

arXiv:2606.29054v1 Announce Type: new Abstract: Large language models (LLMs) deployed for structured generation (NER, JSON extraction, QA, and classification) lack formal reliability guarantees, and standard heuristic abstention policies miss user-specified risk targets by 7.5–12.5%. We characterize when conformal risk control (CRC) can certify structured LLM outputs and when it provably cannot. First, we prove an impossibility result: when the base risk $\mu > \alpha$, any distribution-free method must abstain on at least $(\mu-\alpha)/(1-\alpha)$ examples, yielding a closed-form feasibility test: one can check whether CRC will work before running it. Second, we analyze a certification hierarchy across Hoeffding, empirical Bernstein, and a betting-based e-CRC bound, with strict gains in low-variance/large-sample regimes: the Hoeffding-to-Bernstein step delivers the largest gain (+37% certified configurations), while e-CRC adds value when calibration data is scarce (10% certification at 20% data versus 0% for Hoeffding). Third, we validate adaptive conformal inference (ACI) under cross-dataset shift, reducing risk-target violations from 71% to 21%, with residual failures concentrated exactly where the impossibility bound predicts. Across six open-weight models (3B–72B parameters), eight datasets, four tasks, and six nonconformity scores, hard NER/QA/CLS configurations are uncertifiable at $\alpha = 0.10$; relaxing to $\alpha = 0.30$–$0.40$ unlocks practical certification (47% NER, 40% QA, 60% CLS). The framework gives a three-step deployment recipe: check feasibility, select the bound and score, then mitigate shift.

Cached at: 06/30/26, 05:31 AM

# When Can Conformal Risk Control Certify LLM Outputs? Bounds, Impossibility, and Adaptation for Structured Generation
Source: [https://arxiv.org/html/2606.29054](https://arxiv.org/html/2606.29054)
###### Abstract

Large language models (LLMs) deployed for structured generation (NER, JSON extraction, QA, and classification) lack formal reliability guarantees, and standard heuristic abstention policies miss user-specified risk targets by 7.5–12.5%. We characterize *when* conformal risk control (CRC) can certify structured LLM outputs and when it provably cannot. First, we prove an impossibility result: when the base risk $\mu > \alpha$, any distribution-free method must abstain on at least a $(\mu-\alpha)/(1-\alpha)$ fraction of examples, yielding a closed-form feasibility test: one can check whether CRC will work before running it. Second, we analyze a certification hierarchy $\Lambda^{*}_{\mathrm{Hoeff}} \subseteq \Lambda^{*}_{\mathrm{Bern}} \subseteq \Lambda^{*}_{\mathrm{e\text{-}CRC}}$ across Hoeffding, empirical Bernstein, and a betting-based e-CRC bound, with strict gains in low-variance/large-sample regimes: the Hoeffding $\to$ Bernstein step delivers the largest gain (+37% certified configurations), while e-CRC adds value when calibration data is scarce (10% certification at 20% data versus 0% for Hoeffding). Third, we validate adaptive conformal inference (ACI) under cross-dataset shift, reducing risk-target violations from 71% to 21%, with residual failures concentrated exactly where the impossibility bound predicts. Across six open-weight models (3B–72B parameters), eight datasets, four tasks, and six nonconformity scores, hard NER/QA/CLS configurations are uncertifiable at $\alpha = 0.10$; relaxing to $\alpha = 0.30$–$0.40$ unlocks practical certification (47% NER, 40% QA, 60% CLS). The framework gives a three-step deployment recipe: check feasibility, select the bound and score, then mitigate shift. Code and configurations will be released after cleanup.

conformal risk control, conformal prediction, large language models, selective prediction, uncertainty quantification, named entity recognition, e\-values, betting, distribution shift, adaptive conformal inference, structured generation

## 1 Introduction

Large language models \(LLMs\) are used for structured generation: extracting named entities, parsing JSON, answering factual questions, and classifying inputs, where the output must match a schema and can be scored against ground truth\. But they give no guarantee on output quality\. A model can emit a wrong entity span, or a wrong answer, with full confidence and no signal that anything is off\. In medical NER, legal extraction, or financial parsing, a silent error is costly, and a heuristic gives no guarantee it will not happen\.

Standard uncertainty quantification (token probabilities, entropy thresholds, temperature scaling) is miscalibrated for structured outputs where the loss is defined at the level of entities, fields, or semantic units (Guo et al., [2017](https://arxiv.org/html/2606.29054#bib.bib30); Kadavath et al., [2022](https://arxiv.org/html/2606.29054#bib.bib29); Kuhn et al., [2023](https://arxiv.org/html/2606.29054#bib.bib32); Farquhar et al., [2024](https://arxiv.org/html/2606.29054#bib.bib33)). Heuristic abstention policies frequently miss user-specified risk targets by 7.5–12.5%. Medical NER makes this concrete: suppose we want to guarantee $\leq 10\%$ entity-level error. If the model’s base risk $\mu$ on the deployment population is above $10\%$, no post-hoc method can hit that target without abstaining on a nonzero fraction of inputs, and that fraction grows with $\mu$. A heuristic never sees this floor; our impossibility result computes it in closed form.

We ask a prior question, before “which bound is tightest?”: *when* can conformal risk control (CRC) (Angelopoulos et al., [2024](https://arxiv.org/html/2606.29054#bib.bib3); Bates et al., [2021](https://arxiv.org/html/2606.29054#bib.bib6)) certify structured LLM outputs at all, and when is certification provably impossible regardless of method? This characterization is our primary contribution. We analyze a certification hierarchy (Hoeffding $\subseteq$ Bernstein $\subseteq$ e-CRC, our betting-based bound) with strict gains in low-variance/large-sample regimes, showing where tighter bounds help: predominantly in low-variance regimes where the Hoeffding $\to$ Bernstein step alone recovers 37% more certifications. The e-CRC bound completes the hierarchy as its tightest member, with additional value in data-scarce settings. The algebra behind our impossibility bound (Proposition [3](https://arxiv.org/html/2606.29054#Thmtheorem3)) is elementary; the contribution is recognizing it as a deployment-facing characterization: when base risk $\mu > \alpha$, no distribution-free method can certify without abstaining on $\geq (\mu-\alpha)/(1-\alpha)$ of examples. Appendix Table [6](https://arxiv.org/html/2606.29054#A8.T6) shows these predictions closely match observed behavior across 3B–72B models. This reorders the questions: the first is not “which bound?” but “is $\mu < \alpha$?”. If yes, use variance-aware bounds; if no, improve the model, relax $\alpha$, or deploy adaptive methods. We validate this framework with a broad empirical study of CRC for structured generation, including an evaluation of adaptive conformal inference (ACI) (Gibbs and Candès, [2021](https://arxiv.org/html/2606.29054#bib.bib22)) under cross-dataset transfer.

The framing complements concurrent CRC-style work for LLMs: Quach et al. ([2024](https://arxiv.org/html/2606.29054#bib.bib7)) apply conformal prediction to language model generation by calibrating sampling stopping and rejection rules; Mohri and Hashimoto ([2024](https://arxiv.org/html/2606.29054#bib.bib8)) certify factual claims via conformal back-off; Abbasi-Yadkori et al. ([2024](https://arxiv.org/html/2606.29054#bib.bib9)) use conformal prediction to abstain on hallucinated outputs; and Gui et al. ([2024](https://arxiv.org/html/2606.29054#bib.bib10)) extend the conformal lens to alignment guarantees. These methods focus on *which* conformal procedure to apply to LLM outputs; we study *when* any such procedure can succeed under task-specific structured losses, and what minimum abstention is required when it cannot. The companion work of Kotte ([2026](https://arxiv.org/html/2606.29054#bib.bib38)) addresses joint coverage across multi-stage pipelines and is complementary to the single-output certification setting analyzed here.

#### Contributions\.

1. Impossibility result and feasibility characterization. We prove that when $\mu > \alpha$, any valid method must abstain on at least a $(\mu-\alpha)/(1-\alpha)$ fraction of examples (Proposition [3](https://arxiv.org/html/2606.29054#Thmtheorem3)), providing a closed-form feasibility test. We show this bound closely matches observed certification across 3B–72B models, and that relaxing $\alpha$ to $0.30$–$0.40$ makes hard tasks certifiable (47% NER, 40% QA, 60% CLS).
2. Bound hierarchy. We establish the certification hierarchy $\Lambda^{*}_{\mathrm{Hoeff}} \subseteq \Lambda^{*}_{\mathrm{Bern}} \subseteq \Lambda^{*}_{\mathrm{e\text{-}CRC}}$ (in the low-variance/large-sample regime) and verify it across 656 configurations: the Hoeffding $\to$ Bernstein step provides the largest gain (+37%), while e-CRC completes the hierarchy with additional value in data-scarce regimes.
3. ACI under shift. We validate ACI empirically for structured generation under temporal shift (60% $\to$ 4% violations), a severity sweep, and cross-dataset transfer (71% $\to$ 21%).
4. Score fusion and benchmark. Calibrated score fusion improves AUROC on 89% of configurations. Evaluation spans six models (3B–72B) from three families, eight datasets, four tasks, and six scores.

Table 1: Positioning relative to closely related lines of work. Our goal is not a new conformal primitive, but a deployment-facing characterization for structured LLM generation: feasibility (via $\mu$ vs. $\alpha$), tightest bounds, and shift adaptation.

## 2 Method

#### Problem setup\.

An LLM $f:\mathcal{X}\to\mathcal{Y}$ produces prediction $\hat{y}=f(x)$. A confidence score $s(x,\hat{y})\in[0,1]$ quantifies reliability; a selective predictor emits $\hat{y}$ when $s\geq\lambda$ and abstains otherwise. We seek $\lambda^{*}$ such that

$$\mathbb{E}[R_{\mathrm{task}}(\hat{Y},Y)\mid s(X)\geq\lambda^{*}]\leq\alpha, \tag{1}$$

with probability $\geq 1-\delta$, where $R_{\mathrm{task}}$ is a task-specific risk and $\alpha$ is the user-specified target. We define four risk functions: $R_{\mathrm{NER}}=1-F_{1}^{\mathrm{entity}}$, $R_{\mathrm{JSON}}=1-F_{1}^{\mathrm{field}}$, $R_{\mathrm{QA}}=1-\mathbb{I}(\mathrm{EM})$, $R_{\mathrm{CLS}}=1-\mathbb{I}(\mathrm{correct})$; all are bounded in $[0,1]$, satisfying CRC requirements (Angelopoulos et al., [2024](https://arxiv.org/html/2606.29054#bib.bib3), [2025](https://arxiv.org/html/2606.29054#bib.bib4)). Background on CRC and the Learn-Then-Test framework is in Appendix [A](https://arxiv.org/html/2606.29054#A1).

#### CRC calibration\.

Given calibration data $\{(x_i,y_i)\}_{i=1}^{n}$, we compute scores $s_i$ and risks $r_i$. For each candidate threshold $\lambda$, let $E_\lambda=\{i:s_i\geq\lambda\}$ be the emit set and $\hat{R}(\lambda)$ the empirical mean risk over $E_\lambda$. We compute the Hoeffding upper confidence bound (UCB) $U_H(\lambda)=\hat{R}(\lambda)+\sqrt{\log(2/\delta)/(2|E_\lambda|)}$ and select $\lambda^{*}=\min\{\lambda:U_H(\lambda)\leq\alpha\}$. At test time, we emit predictions with $s\geq\lambda^{*}$ and abstain otherwise. This provides the finite-sample guarantee $\mathbb{P}(\mathbb{E}[R\mid s\geq\lambda^{*}]>\alpha)\leq\delta$ under exchangeability.
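The calibration step can be illustrated with a short Python sketch that scans candidate thresholds in increasing order and returns the first whose Hoeffding UCB is at most $\alpha$. The function names, the choice of the distinct calibration scores as the candidate set, and the `min_emit` cutoff (mirroring the $|E_\lambda|\geq 10$ non-degeneracy rule used in the protocol) are assumptions made for illustration rather than the authors' implementation, and any multiple-testing treatment from the Learn-Then-Test framework is left out.

```python
import math


def hoeffding_ucb(risks, delta):
    """Hoeffding upper confidence bound on the mean of risks in [0, 1]."""
    n = len(risks)
    return sum(risks) / n + math.sqrt(math.log(2 / delta) / (2 * n))


def select_threshold(scores, risks, alpha, delta, min_emit=10):
    """Smallest candidate threshold whose UCB is <= alpha, or None if none exists."""
    # ASSUMPTION: candidates are the distinct calibration scores.
    for lam in sorted(set(scores)):
        emitted = [r for s, r in zip(scores, risks) if s >= lam]
        if len(emitted) < min_emit:
            break  # larger thresholds only shrink the emit set further
        if hoeffding_ucb(emitted, delta) <= alpha:
            return lam
    return None


# Toy calibration set: low-score examples are wrong, high-score ones are right.
scores = [i / 100 for i in range(100)]
risks = [1.0 if i < 50 else 0.0 for i in range(100)]
print(select_threshold(scores, risks, alpha=0.2, delta=0.1))
```

A return value of `None` corresponds to an uncertifiable configuration: no threshold with a non-degenerate emit set meets the target.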

### 2.1 Betting-Based CRC (e-CRC)

Standard CRC uses the Hoeffding UCB $U_H(\lambda)=\hat{R}(\lambda)+\sqrt{\log(2/\delta)/(2n_\lambda)}$, which ignores variance. The empirical Bernstein bound (Maurer and Pontil, [2009](https://arxiv.org/html/2606.29054#bib.bib19)) incorporates sample variance $\hat{\sigma}^{2}$:

$$U_B(\lambda)=\hat{R}+\sqrt{\frac{2\hat{\sigma}^{2}\log(2/\delta)}{n_\lambda}}+\frac{7\log(2/\delta)}{3(n_\lambda-1)}, \tag{2}$$

which is tighter than Hoeffding whenever $\hat{\sigma}^{2}<1/4$ (holding for all our datasets; see Table [3](https://arxiv.org/html/2606.29054#S3.T3)). We go further using the testing-by-betting framework (Shafer, [2021](https://arxiv.org/html/2606.29054#bib.bib13); Waudby-Smith and Ramdas, [2024](https://arxiv.org/html/2606.29054#bib.bib14); Ramdas et al., [2023](https://arxiv.org/html/2606.29054#bib.bib15); Grünwald et al., [2024](https://arxiv.org/html/2606.29054#bib.bib16); Vovk and Wang, [2021](https://arxiv.org/html/2606.29054#bib.bib18)). For each candidate $\lambda$, we construct a wealth process over the emit set $E_\lambda$: $W_0=1$, $W_j=W_{j-1}(1+\kappa_j(\alpha-r_j))$, where $\kappa_j$ is the Kelly-optimal bet clipped to $[0,0.5]$. If $W_m\geq 1/\delta$, we certify $\lambda$ (see Appendix [B](https://arxiv.org/html/2606.29054#A2) for full pseudocode).
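A minimal Python sketch of the two variance-aware certificates follows. `bernstein_ucb` evaluates Eq. (2) directly; `final_wealth` runs the wealth recursion on the risks of one emit set. The text does not spell out how the Kelly-optimal bet is computed, so the plug-in rule below (a ratio built from the running mean and variance of past risks only, clipped to $[0,0.5]$) is an assumption standing in for it, as are the function names and the use of the unbiased sample variance.

```python
import math


def bernstein_ucb(risks, delta):
    """Empirical Bernstein UCB of Eq. (2) for risks in [0, 1]; needs len(risks) >= 2."""
    n = len(risks)
    mean = sum(risks) / n
    # ASSUMPTION: sigma-hat^2 is the unbiased sample variance.
    var = sum((r - mean) ** 2 for r in risks) / (n - 1)
    log_term = math.log(2 / delta)
    return mean + math.sqrt(2 * var * log_term / n) + 7 * log_term / (3 * (n - 1))


def final_wealth(risks, alpha, kappa_max=0.5):
    """Final wealth W_m of W_j = W_{j-1} * (1 + kappa_j * (alpha - r_j)), W_0 = 1."""
    wealth, total, total_sq = 1.0, 0.0, 0.0
    for j, r in enumerate(risks):
        kappa = 0.0  # no bet before any risk has been observed
        if j > 0:
            mean = total / j
            var = max(total_sq / j - mean ** 2, 0.0)
            denom = var + (alpha - mean) ** 2
            # ASSUMPTION: plug-in stand-in for the Kelly bet, using past risks only
            # so that the bet is fixed before r_j is seen.
            if denom > 0:
                kappa = min(max((alpha - mean) / denom, 0.0), kappa_max)
        wealth *= 1 + kappa * (alpha - r)
        total += r
        total_sq += r * r
    return wealth


def betting_certifies(risks, alpha, delta):
    """Certify the threshold when the final wealth reaches 1 / delta."""
    return final_wealth(risks, alpha) >= 1 / delta
```

For 100 emitted examples with zero observed risk and $\delta=0.1$, `bernstein_ucb` evaluates to about 0.071, against about 0.122 from the Hoeffding formula, which illustrates the low-variance gain.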

###### Theorem 1 (Betting-Based Risk Validity).

For exchangeable calibration data, the e-CRC procedure satisfies $\mathbb{P}(\mathbb{E}[R(\lambda^{*})]>\alpha\text{ and }\lambda^{*}\text{ certified})\leq\delta$.

###### Proof sketch\.

Under $H_0:\mathbb{E}[R]\geq\alpha$, the wealth $\{W_j\}$ is a non-negative supermartingale. By Ville’s inequality (Howard et al., [2021](https://arxiv.org/html/2606.29054#bib.bib17)), $\mathbb{P}(\sup_j W_j\geq 1/\delta)\leq\delta$. Full proof in Appendix [E](https://arxiv.org/html/2606.29054#A5). ∎

###### Proposition 2 (Bound Ordering).

Fix a calibration set with bounded risks $r_i\in[0,1]$, a shared candidate threshold set, and confidence level $\delta$ (with $n_\lambda\geq 2$ so Bernstein is defined). Then the set of thresholds certified on this calibration set satisfies

$$\Lambda^{*}_{\mathrm{Hoeff}}\subseteq\Lambda^{*}_{\mathrm{Bern}}\subseteq\Lambda^{*}_{\mathrm{e\text{-}CRC}}.$$

In the regime where the variance-aware bounds are tighter (large $n_\lambda$, or $\hat{\sigma}^{2}$ well below $1/4$), the ordering holds deterministically conditional on the calibration sample; exchangeability is only needed for each method’s validity guarantee. The inclusions are strict when the conditional variance on the emit set is small relative to $1/4$.

Intuition: Hoeffding allows for the worst-case variance ($1/4$) of a risk $R\in[0,1]$; Bernstein exploits low variance to tighten the UCB; e-CRC adapts fully to the empirical distribution via sequential bets. Each strictly reduces the confidence interval width, certifying thresholds the previous method cannot. Proof in Appendix [E](https://arxiv.org/html/2606.29054#A5).

### 2.2 Fundamental Limits of Certification

###### Proposition 3 (Minimum Abstention Lower Bound).

For any distribution-free selective predictor with $\mathbb{E}[R\mid\mathrm{emit}]\leq\alpha$ holding with probability $\geq 1-\delta$:

$$\mathrm{abstention}\geq\frac{\mu-\alpha}{1-\alpha}-O\!\left(\sqrt{\frac{\log(1/\delta)}{n}}\right),$$

where $\mu=\mathbb{E}[R]$ is the base risk.

###### Proof sketch\.

Decompose base risk over emit and abstain: $\mu=\mathbb{E}[R\mid\mathrm{emit}]\cdot p+\mathbb{E}[R\mid\mathrm{abstain}](1-p)$ where $p=\mathbb{P}(\mathrm{emit})$. The risk guarantee requires $\mathbb{E}[R\mid\mathrm{emit}]\leq\alpha$; since $R\in[0,1]$ we have $\mathbb{E}[R\mid\mathrm{abstain}]\leq 1$, so $\mu\leq\alpha p+(1-p)$, forcing $p\leq(1-\mu)/(1-\alpha)$ and hence abstention $1-p\geq(\mu-\alpha)/(1-\alpha)$. Full proof with finite-sample constants in Appendix [E](https://arxiv.org/html/2606.29054#A5). ∎
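The population-level part of this bound is simple enough to evaluate before any calibration run. The sketch below is a hypothetical helper (the name `min_abstention` is ours) that drops the finite-sample $O(\cdot)$ term and clips at zero, matching the $\max\{0,(\mu-\alpha)/(1-\alpha)\}$ form used for Table 3. It is applied to the task-level base risks quoted in Section 3.3.

```python
def min_abstention(mu, alpha):
    """Abstention floor (mu - alpha) / (1 - alpha), clipped at 0.

    Population-level form of Proposition 3; the finite-sample correction is omitted.
    """
    if not (0.0 <= mu <= 1.0 and 0.0 <= alpha < 1.0):
        raise ValueError("need 0 <= mu <= 1 and 0 <= alpha < 1")
    return max(0.0, (mu - alpha) / (1.0 - alpha))


# Base risks reported for the hard tasks at alpha = 0.10.
for task, mu in [("NER", 0.48), ("QA", 0.55), ("CLS", 0.34)]:
    print(f"{task}: {min_abstention(mu, 0.10):.0%}")
# prints NER: 42%, QA: 50%, CLS: 27%
```

A floor of zero means the target is feasible in principle; whether a given bound certifies it still depends on the calibration size and variance.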

#### Why this matters \(beyond the algebra\)\.

The inequality itself follows from a simple decomposition, but its role here is new: it is a deployment-feasibility test for risk-controlled abstention in structured LLM generation. Classical reject-option and selective prediction work studies accuracy–coverage tradeoffs (Chow, [1970](https://arxiv.org/html/2606.29054#bib.bib24); Shalev-Shwartz and Ben-David, [2014](https://arxiv.org/html/2606.29054#bib.bib28); Geifman and El-Yaniv, [2017](https://arxiv.org/html/2606.29054#bib.bib26)), but does not connect this tradeoff to CRC-style distribution-free risk guarantees for complex structured losses. In our setting, the bound explains in closed form why many NER/QA/CLS configurations cannot be certified at strict $\alpha$ even with sophisticated bounds, and it predicts the abstention levels we observe across 3B–72B models (Appendix Table [6](https://arxiv.org/html/2606.29054#A8.T6)).

### 2.3 Score Fusion and Adaptive Inference

#### Calibrated score fusion.

Individual scores capture complementary signal: Self-Consistency measures inter-sample agreement, Token Margin captures per-token confidence, and Semantic Entropy estimates output diversity (Wang et al., [2023](https://arxiv.org/html/2606.29054#bib.bib31); Kuhn et al., [2023](https://arxiv.org/html/2606.29054#bib.bib32); Farquhar et al., [2024](https://arxiv.org/html/2606.29054#bib.bib33)). We learn a fused score $\tilde{s}(x)=\sigma(w^{\top}[s_1,\ldots,s_K]+b)$ via logistic regression on a disjoint calibration split predicting $\mathbb{I}(R\leq\alpha)$, then run CRC on the remaining calibration split using $\tilde{s}$. This preserves standard CRC validity while improving ranking quality.

###### Proposition 5 (Fusion Preserves CRC Guarantee).

Let $\tilde{s}(x)=g(s_1(x),\ldots,s_K(x))$ where $g$ is a fixed, deterministic function independent of the calibration sample used by CRC (e.g., learned on disjoint data or a held-out split). Then running CRC with $\tilde{s}$ preserves the finite-sample guarantee: $\mathbb{P}(\mathbb{E}[R(\tilde{\lambda}^{*})]>\alpha)\leq\delta$.

###### Proof sketch\.

Conditioned on fixed $g$, the transformed scores $\tilde{s}_i$ are a deterministic function of $(x_i,y_i)$ and remain exchangeable across calibration examples. CRC validity then follows by the standard exchangeability argument for the chosen UCB/betting bound. Full proof in Appendix [E](https://arxiv.org/html/2606.29054#A5). ∎
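The split discipline behind Proposition 5 can be sketched in plain Python: fit the fusion weights on one split, freeze them, and only then score the CRC split. The paper states that logistic regression is used; the batch gradient-descent fit, the learning rate, the epoch count, and all function names below are assumptions chosen to keep the example dependency-free.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_fusion(score_rows, labels, lr=0.5, epochs=500):
    """Fit logistic-regression weights (w, b) by batch gradient descent.

    score_rows: list of [s_1, ..., s_K] from the FUSION split only.
    labels: 1 if the example's risk is <= alpha, else 0.
    """
    k, n = len(score_rows[0]), len(score_rows)
    w, b = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for row, y in zip(score_rows, labels):
            err = sigmoid(sum(wi * si for wi, si in zip(w, row)) + b) - y
            for i in range(k):
                grad_w[i] += err * row[i]
            grad_b += err
        w = [wi - lr * gi / n for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def fused_score(row, w, b):
    """Apply the frozen fusion function g to one example's scores."""
    return sigmoid(sum(wi * si for wi, si in zip(w, row)) + b)


# Fusion split: the first score is informative, the second is mostly noise.
fusion_rows = [[0.9, 0.5], [0.8, 0.4], [0.2, 0.5], [0.1, 0.6]]
fusion_labels = [1, 1, 0, 0]
w, b = fit_fusion(fusion_rows, fusion_labels)

# CRC split: scored with the frozen (w, b); thresholds are calibrated on these.
crc_rows = [[0.85, 0.5], [0.15, 0.5]]
crc_scores = [fused_score(row, w, b) for row in crc_rows]
```

Refitting `w` and `b` on the CRC split would make $g$ depend on the calibration sample, which is exactly the situation the proposition excludes.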

#### Adaptive conformal inference \(ACI\)\.

Under distribution shift, the exchangeability assumption underlying CRC is violated, and static thresholds can fail badly (our experiments show 60–71% violation rates). ACI (Gibbs and Candès, [2021](https://arxiv.org/html/2606.29054#bib.bib22); Tibshirani et al., [2019](https://arxiv.org/html/2606.29054#bib.bib23); Barber et al., [2023](https://arxiv.org/html/2606.29054#bib.bib5)) addresses this by updating the threshold online:

$$\lambda_{t+1}=\lambda_t+\gamma(r_t-\alpha), \tag{3}$$

where $\gamma>0$ is the step size and $r_t$ is the observed risk at time $t$ (requiring feedback labels or a verification signal). Intuitively, when the risk exceeds $\alpha$, the threshold increases (more abstention); when risk is below $\alpha$, the threshold decreases (less abstention). Under bounded risk, the time-averaged risk satisfies $|\bar{R}_T-\alpha|\leq(\lambda_{\max}-\lambda_{\min})/(\gamma T)$, providing asymptotic (not finite-sample) control regardless of shift magnitude (Gibbs and Candès, [2021](https://arxiv.org/html/2606.29054#bib.bib22)). We set $\gamma=0.01$ and clip $\lambda_t$ to $[\lambda_{\min},\lambda_{\max}]$ (the score range).
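The update in Eq. (3) with clipping can be written in a few lines of Python. The function name and the default score range $[0,1]$ are assumptions for illustration; the step size $\gamma=0.01$ is the value stated above, and the sketch assumes a risk signal $r_t$ is available at every step.

```python
def aci_thresholds(risks, alpha, lam0, gamma=0.01, lam_min=0.0, lam_max=1.0):
    """Return the threshold trajectory [lam_0, lam_1, ..., lam_T] under Eq. (3).

    risks: observed risks r_1..r_T in [0, 1], in arrival order.
    ASSUMPTION: feedback is available at every step and scores lie in [lam_min, lam_max].
    """
    lam = lam0
    trajectory = [lam]
    for r in risks:
        # Raise the threshold after a risky output, lower it after a safe one.
        lam = lam + gamma * (r - alpha)
        lam = min(max(lam, lam_min), lam_max)  # clip to the score range
        trajectory.append(lam)
    return trajectory


# Two consecutive failures push the threshold up by gamma * (1 - alpha) each.
print(aci_thresholds([1.0, 1.0], alpha=0.1, lam0=0.5))
```

With $\alpha=0.1$, each failure moves the threshold up by 0.009 while each success moves it down by only 0.001, so the threshold reacts faster to errors than it relaxes.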

## 3 Experiments

#### Roadmap.

We first use Proposition [3](https://arxiv.org/html/2606.29054#Thmtheorem3) as a feasibility test: when $\mu>\alpha$, substantial abstention is unavoidable, so low certification at strict targets validates (rather than contradicts) the theory. We then lead with the constructive result that relaxing the risk target (§[3.3](https://arxiv.org/html/2606.29054#S3.SS3)) unlocks practical certification on hard tasks (e.g., 47% NER, 40% QA, 60% CLS at $\alpha=0.40$), before comparing bound tightness and shift-handling.

### 3.1 Setup

#### Models.

Six open-weight models from three families: Qwen2.5-Instruct (3B, 7B, 14B, 72B-AWQ) (Team, [2024](https://arxiv.org/html/2606.29054#bib.bib35)), Gemma-3 (4B) (Gemma Team, Google DeepMind, [2025](https://arxiv.org/html/2606.29054#bib.bib36)), and Ministral (8B), spanning 3B–72B parameters (all publicly available for reproducibility). Inference uses vLLM (Kwon et al., [2023](https://arxiv.org/html/2606.29054#bib.bib37)) on A100-80GB with greedy decoding plus $K=10$ temperature samples ($T=0.7$).

#### Datasets\.

Eight datasets across four tasks \(Table[2](https://arxiv.org/html/2606.29054#S3.T2)\): CoNLL\-2003 \(NER, 3,453, news\), Few\-NERD \(NER, 2,000, Wikipedia\), WNUT\-17 \(NER, 1,287, social media\), TriviaQA \(QA, 1,000, trivia\), NQ Open \(QA, 3,610, Wikipedia\), MMLU\-STEM \(CLS, 400, STEM\), MMLU\-Humanities \(CLS, 400, humanities\), JSON\-Extract \(JSON, 600, structured\)\.

#### Scores\.

Six nonconformity scores: Token Margin (TM), NLL, Self-Consistency (SC), Semantic Entropy (SE), Entity Agreement (EA, NER-specific), Field Completeness (FC, JSON-specific). All are normalized to $[0,1]$; higher = more reliable. Details in Appendix [D](https://arxiv.org/html/2606.29054#A4).

#### Protocol\.

60/40 calibration–test split (seed 42). Thresholds selected on calibration only; test used for evaluation. $\delta=0.1$. We report 656 non-degenerate model-dataset-score-$\alpha$ configurations; none excluded post hoc. A configuration is “non-degenerate” if $|E_\lambda|\geq 10$. We evaluate three bound types (Hoeffding, Bernstein, e-CRC) $\times$ six models $\times$ eight datasets $\times$ up to six scores per dataset, minus configurations with degenerate or missing scores (e.g., Entity Agreement is undefined for non-NER tasks). The 72B model uses greedy decoding only (Token Margin and NLL scores); smaller models additionally use sampling-based scores (SC, SE, EA, FC).

Table 2: Datasets used in our evaluation.

### 3.2 LLMs Are Miscalibrated for Structured Tasks

Models report average confidence of 74–97% but actual risk ranges from 3% (JSON) to 80% (NQ Open). Expected Calibration Error (ECE) ranges from 0.05 (JSON) to 0.61 (NQ Open), showing that raw model confidence is unreliable for structured output quality. This miscalibration is task-dependent: JSON extraction (structured, low-entropy) has ECE $0.05$, while open-domain QA (high-entropy, diverse) has ECE $0.61$. The severity of miscalibration correlates with $\mu$: tasks where models are most wrong are also where they are most overconfident, making heuristic abstention doubly unreliable. This pattern aligns with prior calibration findings on LLMs (Guo et al., [2017](https://arxiv.org/html/2606.29054#bib.bib30); Kadavath et al., [2022](https://arxiv.org/html/2606.29054#bib.bib29); Xiong et al., [2024](https://arxiv.org/html/2606.29054#bib.bib34)).

Table 3: Base risk $\mu$ and risk variance by dataset (averaged over the five sampling-based models; the 72B model is greedy-only). All datasets satisfy $\mathrm{Var}(R)<1/4$, the condition for strict Bernstein improvement. The minimum abstention lower bound (Prop. [3](https://arxiv.org/html/2606.29054#Thmtheorem3)) $\mathrm{LB}=\max\{0,(\mu-\alpha)/(1-\alpha)\}$ gives the minimum abstention required by any valid method at $\alpha=0.10$.

### 3.3 Extended Risk Targets Unlock Hard Tasks

At $\alpha=0.10$, NER/QA/CLS certifications are near-zero because $\mu\gg\alpha$, exactly as Proposition [3](https://arxiv.org/html/2606.29054#Thmtheorem3) predicts. The impossibility bound $(\mu-\alpha)/(1-\alpha)$ evaluates to $42\%$ minimum abstention (NER, $\mu=0.48$), $50\%$ (QA, $\mu=0.55$), and $27\%$ (CLS, $\mu=0.34$). This is a basic limit of distribution-free abstention, not a limitation of any particular method.

Relaxing the target changes this. At $\alpha=0.30$, CRC certifies 33% NER, 30% QA, 30% CLS configurations. At $\alpha=0.40$: NER 47%, QA 40%, CLS 60%. These targets are realistic in practice (e.g., 70% accuracy or 60% F1), and are useful when the alternative is no guarantee at all. In medical NER, guaranteeing $\geq 60\%$ entity-level F1 (rather than requiring $\geq 90\%$) still provides actionable quality assurance for initial triage, with flagged low-confidence outputs routed to human review.

Scaling to 72B ($\mu=0.24$ on CoNLL vs. $0.26$ at 14B) enables NER certification at $\alpha=0.25$ where 14B fails, directly confirming Proposition [3](https://arxiv.org/html/2606.29054#Thmtheorem3): the path to certification for hard tasks runs through reducing $\mu$, not through tighter bounds.

### 3.4 CRC Validity

Across all four $\alpha$ levels (full per-cell results in Appendix Table [8](https://arxiv.org/html/2606.29054#A8.T8)), CRC maintains risk control: empirically, certified configurations exhibit near-zero test violations (0/57 for Hoeffding, 0/78 for Bernstein, and 1/80 for e-CRC). For JSON, the 14B model achieves guaranteed control with zero abstention, reflecting $\mu\approx 0$ (the task is easy for all models). For NQ-Open ($\mu=0.80$), all models must fully abstain even at $\alpha=0.20$, the impossibility bound at work. Binomial validation: under $\delta=0.1$, observing $\geq 1$ violation in $n=80$ trials has probability $\mathbb{P}(X\geq 1\mid n=80,p=0.1)=0.9998$, so $1/80$ is well within expectation; likewise $0/78$ (Bernstein) and $0/57$ (Hoeffding) are statistically consistent.
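The binomial figure quoted above can be reproduced with one line of arithmetic. The helper below is a hypothetical illustration that treats each certified configuration as an independent trial whose violation probability sits exactly at the worst case $\delta$; both the independence and the worst-case rate are simplifying assumptions, since the guarantee only bounds the rate from above.

```python
def prob_at_least_one(n, p):
    """P(X >= 1) for X ~ Binomial(n, p), via the complement of zero violations."""
    return 1.0 - (1.0 - p) ** n


# 80 certified e-CRC configurations, worst-case violation rate delta = 0.1.
print(round(prob_at_least_one(80, 0.1), 4))  # 0.9998
```

Because the true per-configuration violation rate can be far below $\delta$ when the bounds are conservative, seeing zero violations in 57 or 78 configurations does not contradict the guarantee.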

The certification hierarchy predicted by Proposition [2](https://arxiv.org/html/2606.29054#Thmtheorem2) holds: Hoeffding certifies 57 (8.7%), Bernstein 78 (11.9%), e-CRC 80 (12.2%) of 656 configurations, with strictly nested certified sets (sizes $57<78<80$). The Hoeffding $\to$ Bernstein step delivers the largest improvement (+37% certified) and is the default upgrade. e-CRC is provably tighter and adds additional certified settings, with its practical value growing as calibration shrinks (Table [4](https://arxiv.org/html/2606.29054#S3.T4)). Among jointly-certified configs, e-CRC reduces abstention by 4.8pp vs. Hoeffding.

The improvement concentrates in the low-variance regime: among 72 configs with $\hat{\sigma}^{2}<0.1$, Hoeffding certifies 51 (71%), Bernstein 69 (96%), e-CRC 70 (97%). For high-variance configs, all methods certify $<2\%$, confirming the impossibility bound’s prediction.

#### Theory matches practice (Appendix Table [6](https://arxiv.org/html/2606.29054#A8.T6)).

The impossibility bound $(\mu-\alpha)/(1-\alpha)$ yields a minimum abstention floor across all six datasets and four model scales. $\mu$ decreases with scale across the NER/QA/CLS tasks: CoNLL NER $0.42\to 0.33\to 0.26\to 0.24$, TriviaQA $0.55\to 0.45\to 0.34\to 0.27$, MMLU-STEM $0.53\to 0.47\to 0.38\to 0.28$. As long as $\mu>\alpha=0.10$ for NER/QA/CLS, non-trivial abstention is unavoidable, so low certification at strict targets is expected. The closest approach is MMLU-Humanities at 72B ($\mu=0.17$, floor 8%), while NQ Open remains hardest: even 72B reaches only $\mu=0.70$, requiring $\geq 67\%$ abstention. For JSON ($\mu\approx 0$), certification succeeds at all scales with zero abstention.

#### Three-way bound comparison (Table [4](https://arxiv.org/html/2606.29054#S3.T4)).

Among the 57 configurations certified by all three methods (the Hoeffding set, by nesting), average abstention drops from 12.9% (Hoeffding) to 8.2% (Bernstein) to 8.1% (e-CRC): tighter bounds require less abstention to achieve the same guarantee. At this calibration size ($n\approx 600$), Bernstein captures most of the improvement, while e-CRC provides an additional tightening of the feasible set. e-CRC’s advantage is most pronounced in data-scarce settings: with only 20% calibration data, e-CRC certifies 10% of configs vs. 0% for Hoeffding, which matters in low-resource extraction.

Table 4: Three-way comparison of certification bounds across 656 model-dataset-score-$\alpha$ configurations. All three bounds maintain near-zero violation rates among certified predictions, but variance-aware methods certify substantially more configurations and achieve lower abstention.

#### Model scale determines controllability.

Qwen2.5-3B has average base risk 48%, requiring near-total abstention on most tasks. Scaling to 7B (42%) and 14B (36%) progressively improves controllability. The 72B model ($\mu=0.24$ on CoNLL NER) enables certification at $\alpha=0.25$ where 14B fails, and achieves near-zero abstention at $\alpha=0.30$. This confirms Proposition [3](https://arxiv.org/html/2606.29054#Thmtheorem3): reducing $\mu$ directly expands certifiable configurations.

### 3.5 ACI Under Distribution Shift

We evaluate ACI under three shift types with increasing realism, each corresponding to a concrete deployment scenario\.

#### \(i\) Temporal shift

models data pipeline drift: in production, new documents arrive over time with gradually changing characteristics\. We use index\-order splitting as a proxy: CoNLL articles are ordered by publication date \(Reuters, 1996–97\) and TriviaQA by source document\. Static CRC violates 60% \(15/25\) of configs; ACI reduces to 4% \(1/25\)\.

#### \(ii\) Severity sweep

models gradual quality degradation. We oversample high-risk examples 1–3$\times$; static CRC rises from 8% to 56% violations, while ACI stays $<12\%$.

#### \(iii\) Cross\-dataset transfer

models genuine domain shift: a system calibrated on one dataset is deployed in a different domain without recalibration, the most common failure mode in practice. We test three cross-dataset transfers: CoNLL $\to$ WNUT (newswire $\to$ social media), CoNLL $\to$ FewNERD (newswire $\to$ Wikipedia), and TriviaQA $\to$ NQ (trivia $\to$ Wikipedia questions). Across 14 cross-dataset configs, static CRC violates 71% (10/14); ACI reduces to 21% (3/14), a $3.4\times$ reduction. The 3 ACI violations occur where $\mu>0.5$, in the regime where the impossibility bound forces heavy abstention. Full results in Appendix Table [10](https://arxiv.org/html/2606.29054#A9.T10).

### 3.6 Score Fusion

Calibrated score fusion improves AUROC on 33/37 model-dataset configs (89%), with an average gain of $+0.026$ and up to $+0.14$ on individual configs. The largest gains occur where individual scores are weakest: classification ($+0.09$ on MMLU-STEM with Gemma-3-4B) and NER ($+0.07$ on WNUT with Qwen2.5-3B). This indicates that individual scores capture complementary signal: Self-Consistency captures inter-sample agreement while Token Margin captures per-token confidence, and their correlation is only $0.31$–$0.58$ across tasks, leaving substantial room for fusion. Logistic regression fusion is a simple, low-cost way to combine them: the logistic coefficients $w$ are fixed on calibration data, so the fused score $\tilde{s}$ is a deterministic function of the individual scores and the CRC guarantee is preserved (Proposition [5](https://arxiv.org/html/2606.29054#Thmtheorem5)). The 4 configs where fusion hurts (all $< 0.01$ AUROC decrease) correspond to cases where a single dominant score (Token Margin for JSON) already captures nearly all relevant signal.
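The fusion step can be illustrated with a minimal sketch. It assumes two per-example scores in $[0,1]$ and binary correctness labels on the calibration set; the function names (`fit_fusion`, `fuse`), the plain gradient-descent fit, and the toy data are all hypothetical and are not the paper's implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_fusion(scores, correct, lr=0.5, epochs=2000):
    """Fit logistic-regression weights on calibration data only.

    scores:  list of per-example score vectors (e.g. [SC, TM]), each in [0, 1]
    correct: list of 0/1 labels (1 = output was correct)
    """
    d, n = len(scores[0]), len(scores)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for s, y in zip(scores, correct):
            err = sigmoid(sum(wi * si for wi, si in zip(w, s)) + b) - y
            for k in range(d):
                grad_w[k] += err * s[k]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def fuse(w, b, s):
    """Fused score: a deterministic function of the individual scores once w, b are fixed."""
    return sigmoid(sum(wi * si for wi, si in zip(w, s)) + b)

# Toy calibration data (illustrative only): columns are (Self-Consistency, Token Margin).
cal_scores = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.6), (0.2, 0.3),
              (0.3, 0.1), (0.1, 0.2), (0.6, 0.7), (0.4, 0.2)]
cal_correct = [1, 1, 1, 0, 0, 0, 1, 0]
w, b = fit_fusion(cal_scores, cal_correct)
print(fuse(w, b, (0.9, 0.9)), fuse(w, b, (0.1, 0.1)))
```

Because `w` and `b` are frozen after fitting, `fuse` can be passed to a threshold-calibration routine in place of any single score.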

### 3.7 Analysis

#### Score comparison\.

Table [5](https://arxiv.org/html/2606.29054#S3.T5) compares the discriminative power (AUROC) of six confidence scores. Self-Consistency is the strongest general-purpose score (highest AUROC on 7/8 datasets, up to 0.82 on QA). Entity Agreement is a strong NER-specific score (0.69 on Few-NERD), though Self-Consistency edges it out on all three NER datasets. Token Margin is best for JSON (AUROC 0.82). Semantic Entropy appears weak under our exact-match clustering proxy (AUROC $\approx 0.50$); this reflects the coarseness of exact-match clustering rather than a limitation of semantic entropy itself (Kuhn et al., [2023](https://arxiv.org/html/2606.29054#bib.bib32); Farquhar et al., [2024](https://arxiv.org/html/2606.29054#bib.bib33)), and motivates embedding-based semantic clustering for a fair SE evaluation.

Table 5: AUROC of confidence scores for risk discrimination (averaged across models). Higher values indicate better ability to distinguish correct from incorrect outputs. Bold: best per dataset. – indicates that the score is not applicable.
#### Data efficiency\.

Variance-aware bounds are substantially more data-efficient. With only 20% of the calibration data, e-CRC certifies 10% of configs vs. 0% for Hoeffding. At 40% calibration, e-CRC (13%) already exceeds Hoeffding at 80% (10%), a clear advantage when labeled data is scarce.

#### Baselines\.

We compare CRC against four baselines: Always Answer (no abstention), and entropy, Token Margin (TM), and Self-Consistency (SC) thresholds tuned to minimize abstention subject to $R \leq \alpha$ on the calibration set. Always Answer violates the target on 7/8 datasets, confirming the need for abstention. Threshold baselines achieve low risk on the test set by abstaining aggressively (85–87% average abstention), but this is not guaranteed: they happened to be conservative on these particular test splits. CRC achieves comparable or lower abstention on certifiable tasks with formal guarantees. The key distinction: a threshold baseline's 0% violation rate on one test set provides no assurance for future data, while CRC's guarantee holds for any exchangeable test set with probability $\geq 1 - \delta$.

#### Ministral violations expose framework limits\.

Ministral-8B is the main source of CRC violations: it accounts for the single e-CRC certified-set violation and shows the highest violation rate across the per-$\alpha$ grid (Table [8](https://arxiv.org/html/2606.29054#A8.T8)), concentrated in NER where token-level logprob calibration is weakest. This is within the $\delta = 0.1$ budget, but it suggests score calibration can be brittle: AWQ quantization (e.g., 72B-AWQ) and tokenizer/model-specific effects can perturb nonconformity scores and weaken exchangeability, yielding occasional violations. This motivates monitoring plus ACI in deployment, and warrants further study of quantization- and tokenizer-aware confidence scores.

#### Practical decision framework\.

Our results suggest a three\-step decision process for deploying CRC in structured generation:

- Step 1: Feasibility check. Compute the base risk $\mu$ on a held-out set. If $\mu < \alpha$, certification is feasible; proceed to Step 2. If $\mu > \alpha$, the impossibility bound (Proposition [3](https://arxiv.org/html/2606.29054#Thmtheorem3)) shows any valid method must abstain on at least $(\mu - \alpha)/(1 - \alpha)$ of examples. If that floor is unacceptable, the practitioner should either (a) scale to a stronger model (Appendix Table [6](https://arxiv.org/html/2606.29054#A8.T6)), (b) relax $\alpha$ to match $\mu$, or (c) accept that strict risk targets will require heavy abstention for this task–model pair.
- Step 2: Bound and score selection. Use Bernstein or e-CRC rather than Hoeffding, which both dominate under the bound ordering of Proposition [2](https://arxiv.org/html/2606.29054#Thmtheorem2). Choose e-CRC when calibration data is scarce ($n < 200$). Use Self-Consistency as the default nonconformity score, or score fusion when multiple scores are available (fusion improves AUROC on 89% of configs). For JSON tasks, Token Margin alone suffices (AUROC 0.82).
- Step 3: Shift mitigation. If the deployment distribution may differ from calibration (the common case), deploy ACI with $\gamma = 0.01$. Our cross-dataset experiments show static CRC fails badly under domain shift (71% violations), while ACI maintains control in the majority of settings (21% violations, with the residual violations confined to the high-base-risk regime).
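The three steps above can be sketched as a small helper. This is a hedged illustration, not the paper's tooling: `min_abstention` implements the closed-form floor from Proposition 3, while `recommend`, its `max_abstention` tolerance, and the choice of Bernstein as the default when $n \geq 200$ are assumptions made for the example.

```python
def min_abstention(mu, alpha):
    """Abstention floor (mu - alpha) / (1 - alpha); zero when mu <= alpha."""
    if not (0.0 <= mu <= 1.0 and 0.0 <= alpha < 1.0):
        raise ValueError("need 0 <= mu <= 1 and 0 <= alpha < 1")
    return max(0.0, (mu - alpha) / (1.0 - alpha))

def recommend(mu, alpha, n_cal, shift_expected, max_abstention=0.5):
    """Hypothetical wrapper around the three-step recipe.

    max_abstention is an assumed, user-chosen tolerance for the abstention floor.
    """
    floor = min_abstention(mu, alpha)
    return {
        "abstention_floor": floor,                 # Step 1: feasibility check
        "acceptable": floor <= max_abstention,
        "bound": "e-CRC" if n_cal < 200 else "Bernstein",   # Step 2
        "use_aci": shift_expected,                 # Step 3
        "aci_gamma": 0.01 if shift_expected else None,
    }

# Example: base risk 0.42, target 0.10, 120 calibration examples, shift expected.
print(recommend(mu=0.42, alpha=0.10, n_cal=120, shift_expected=True))
```

The floor is only a lower bound: a configuration that passes this check can still fail to certify if the score discriminates poorly or the calibration set is small.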

## 4 Discussion

There is a tension here: the tasks that most need a guarantee (high-risk NER, QA) are the hardest to certify. The impossibility bound (Proposition [3](https://arxiv.org/html/2606.29054#Thmtheorem3)) says why: difficulty is set by $\mu$ relative to $\alpha$, not by the bound or the score. Validated across 24 model-task-scale configurations (Appendix Table [6](https://arxiv.org/html/2606.29054#A8.T6)), it lets practitioners read off feasibility, the minimum abstention, and the payoff from scaling the model before running any experiment.

#### Why the bound hierarchy matters despite small headline gains\.

The bound hierarchy establishes the containment $\Lambda^{*}_{\mathrm{Hoeff}} \subseteq \Lambda^{*}_{\mathrm{Bern}} \subseteq \Lambda^{*}_{\mathrm{e\text{-}CRC}}$, so practitioners can safely upgrade bounds without sacrificing validity. At typical calibration sizes ($n \approx 600$), Bernstein captures most gains (+37% certified configurations), but e-CRC becomes decisive when calibration is scarce: at $n = 120$ (20% of the data), e-CRC certifies 10% of configs vs. 0% for Hoeffding, providing actionable certificates in low-resource settings (e.g., medical NER, legal extraction).

#### When hard tasks are uncertifiable\.

At $\alpha = 0.10$, NER/QA/CLS certification rates are near-zero. This is expected, not a shortcoming: the impossibility bound tells practitioners the outcome is unavoidable before running any experiment, preventing wasted effort on bound improvements. Because difficulty is set by $\mu$ relative to $\alpha$, the way out follows from the bound itself: lower $\mu$ by scaling the model (72B cuts $\mu$ from 0.42 to 0.24 on NER and from 0.55 to 0.27 on QA), raise $\alpha$ to meet it, or adapt online under shift. The three-step framework of §[3.7](https://arxiv.org/html/2606.29054#S3.SS7) turns this into an operational recipe.

#### ACI’s 21% residual violation rate\.

Under cross-dataset transfer, ACI still violates on 3/14 configs, all with $\mu > 0.5$, in the regime where the impossibility bound forces heavy abstention. ACI provides long-run (asymptotic) control under shift within the feasible regime but cannot overcome this feasibility limit.

## 5 Related Work

#### Conformal prediction for LLMs\.

Conformal prediction (Vovk et al., [2005](https://arxiv.org/html/2606.29054#bib.bib1); Angelopoulos and Bates, [2023](https://arxiv.org/html/2606.29054#bib.bib2)) provides distribution-free guarantees. Recent LLM adaptations include conformal language modeling (Quach et al., [2024](https://arxiv.org/html/2606.29054#bib.bib7)), conformal factuality (Mohri and Hashimoto, [2024](https://arxiv.org/html/2606.29054#bib.bib8)), conformal abstention (Abbasi-Yadkori et al., [2024](https://arxiv.org/html/2606.29054#bib.bib9)), conformal alignment (Gui et al., [2024](https://arxiv.org/html/2606.29054#bib.bib10)), and surveys for NLP (Campos et al., [2024](https://arxiv.org/html/2606.29054#bib.bib11)). CRC (Angelopoulos et al., [2024](https://arxiv.org/html/2606.29054#bib.bib3); Bates et al., [2021](https://arxiv.org/html/2606.29054#bib.bib6)) handles arbitrary losses; Barber et al. ([2023](https://arxiv.org/html/2606.29054#bib.bib5)) relax exchangeability. In contrast to factuality/abstention pipelines that rely on entailment models or semantic clustering (Mohri and Hashimoto, [2024](https://arxiv.org/html/2606.29054#bib.bib8); Abbasi-Yadkori et al., [2024](https://arxiv.org/html/2606.29054#bib.bib9)), we study schema-validated structured generation (NER/QA/CLS/JSON) with exact-match/structured losses. The companion work of Kotte ([2026](https://arxiv.org/html/2606.29054#bib.bib38)) addresses the orthogonal problem of joint coverage across multi-stage NER $\to$ NED $\to$ typing pipelines using a maximum-nonconformity reduction; here we focus on single-output certification and its information-theoretic feasibility.

#### Selective prediction and uncertainty\.

Selective prediction (Geifman and El-Yaniv, [2017](https://arxiv.org/html/2606.29054#bib.bib26); El-Yaniv and Wiener, [2010](https://arxiv.org/html/2606.29054#bib.bib25); Chow, [1970](https://arxiv.org/html/2606.29054#bib.bib24)) studies coverage–risk trade-offs; Geifman and El-Yaniv ([2019](https://arxiv.org/html/2606.29054#bib.bib27)) and Lee et al. ([2024](https://arxiv.org/html/2606.29054#bib.bib12)) train rejection jointly. LLM confidence is miscalibrated (Kadavath et al., [2022](https://arxiv.org/html/2606.29054#bib.bib29); Guo et al., [2017](https://arxiv.org/html/2606.29054#bib.bib30)); self-consistency (Wang et al., [2023](https://arxiv.org/html/2606.29054#bib.bib31)), semantic entropy (Kuhn et al., [2023](https://arxiv.org/html/2606.29054#bib.bib32); Farquhar et al., [2024](https://arxiv.org/html/2606.29054#bib.bib33)), and elicitation methods (Xiong et al., [2024](https://arxiv.org/html/2606.29054#bib.bib34)) offer alternatives. We evaluate these as nonconformity scores within CRC for structured generation.

#### Concentration inequalities and e\-values\.

Empirical Bernstein (Maurer and Pontil, [2009](https://arxiv.org/html/2606.29054#bib.bib19)) improves on Hoeffding by incorporating variance; testing-by-betting yields e-values (Shafer, [2021](https://arxiv.org/html/2606.29054#bib.bib13); Grünwald et al., [2024](https://arxiv.org/html/2606.29054#bib.bib16); Ramdas et al., [2023](https://arxiv.org/html/2606.29054#bib.bib15); Vovk and Wang, [2021](https://arxiv.org/html/2606.29054#bib.bib18)) with strong guarantees (Howard et al., [2021](https://arxiv.org/html/2606.29054#bib.bib17); Waudby-Smith and Ramdas, [2024](https://arxiv.org/html/2606.29054#bib.bib14)). Our contribution is not a new inequality but a deployment-facing characterization: when CRC can certify structured LLM outputs, and when it is provably impossible.

## 6 Conclusion

We provide a decision framework for when conformal risk control (CRC) can certify structured LLM outputs, and when meaningful certification necessarily comes with heavy abstention. The impossibility bound (Proposition [3](https://arxiv.org/html/2606.29054#Thmtheorem3)) is the key deployment test: if $\mu > \alpha$, any distribution-free method must abstain on at least $(\mu - \alpha)/(1 - \alpha)$ of examples, so low certification on hard NER/QA/CLS at small $\alpha$ is expected. In feasible regimes, Bernstein improves over Hoeffding at negligible cost, and e-CRC helps when calibration data is scarce.

#### Threats to validity and limitations\.

Thresholds are selected on calibration data only; we report all 656 configurations. CRC guarantees are probabilistic ($\delta = 0.1$) and assume exchangeability; ACI provides asymptotic (not finite-sample) control under shift. Sampling-based scores require $K$ forward passes. Score fusion uses calibration data twice (conservative but valid; see Proposition [5](https://arxiv.org/html/2606.29054#Thmtheorem5)). One dataset (JSON-Extract) uses template-based extraction tasks. Evaluation spans 3B–72B; larger models (175B+) or closed-source APIs would further test generalizability.

#### Reproducibility\.

All experiments use publicly available datasets and open-weight models, with a fixed random seed of 42. Code and configurations will be released after cleanup.

#### Ethics\.

By making the feasibility of a risk guarantee explicit, our framework discourages deploying CRC where it cannot hold, for example advertising a $10\%$ error guarantee on a task whose base risk is far higher.

## References

- Y. Abbasi-Yadkori, I. Kuzborskij, D. Stutz, A. György, A. Fisch, A. Doucet, I. Beloshapka, W. Weng, Y. Yang, C. Szepesvári, A. T. Cemgil, and N. Tomasev (2024). Mitigating LLM hallucinations via conformal abstention. In NeurIPS 2024 Workshop on Statistical Frontiers in LLMs and Foundation Models. arXiv:2405.01563.
- A. N. Angelopoulos, S. Bates, E. J. Candès, M. I. Jordan, and L. Lei (2025). Learn then test: calibrating predictive algorithms to achieve risk control. Annals of Applied Statistics 19(2), pp. 1641–1662. arXiv:2110.01052.
- A. N. Angelopoulos, S. Bates, A. Fisch, L. Lei, and T. Schuster (2024). Conformal risk control. In International Conference on Learning Representations (ICLR). arXiv:2208.02814.
- A. N. Angelopoulos and S. Bates (2023). Conformal prediction: a gentle introduction. Foundations and Trends in Machine Learning 16(4), pp. 494–591.
- R. F. Barber, E. J. Candès, A. Ramdas, and R. J. Tibshirani (2023). Conformal prediction beyond exchangeability. Annals of Statistics 51(2), pp. 816–845.
- S. Bates, A. Angelopoulos, L. Lei, J. Malik, and M. Jordan (2021). Distribution-free, risk-controlling prediction sets. Journal of the ACM 68, pp. 1–34.
- M. M. Campos, A. Farinhas, C. Zerva, M. A. T. Figueiredo, and A. F. T. Martins (2024). Conformal prediction for natural language processing: a survey. Transactions of the Association for Computational Linguistics (TACL). arXiv:2405.01976.
- C. K. Chow (1970). On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory 16(1), pp. 41–46.
- R. El-Yaniv and Y. Wiener (2010). On the foundations of noise-free selective classification. Journal of Machine Learning Research 11, pp. 1605–1641.
- S. Farquhar, J. Kossen, L. Kuhn, and Y. Gal (2024). Detecting hallucinations in large language models using semantic entropy. Nature 630, pp. 625–630.
- Y. Geifman and R. El-Yaniv (2017). Selective classification for deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS).
- Y. Geifman and R. El-Yaniv (2019). SelectiveNet: a deep neural network with an integrated reject option. In International Conference on Machine Learning (ICML).
- Gemma Team, Google DeepMind (2025). Gemma 3 technical report. Technical report, Google DeepMind. arXiv:2503.19786.
- I. Gibbs and E. Candès (2021). Adaptive conformal inference under distribution shift. In Advances in Neural Information Processing Systems (NeurIPS).
- P. Grünwald, R. de Heide, and W. M. Koolen (2024). Safe testing. Journal of the Royal Statistical Society: Series B 86(5), pp. 1091–1128.
- Y. Gui, Y. Jin, and Z. Ren (2024). Conformal alignment: knowing when to trust foundation models with guarantees. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:2405.10301.
- C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017). On calibration of modern neural networks. In International Conference on Machine Learning (ICML).
- S. R. Howard, A. Ramdas, J. McAuliffe, and J. Sekhon (2021). Time-uniform, nonparametric, nonasymptotic confidence sequences. Annals of Statistics 49(2), pp. 1055–1080.
- S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnson, et al. (2022). Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
- V. Kotte (2026). PASC: pipeline-aware split conformal prediction with joint coverage guarantees for multi-stage NLP and LLM pipelines. arXiv preprint arXiv:2605.18812.
- L. Kuhn, Y. Gal, and S. Farquhar (2023). Semantic uncertainty: linguistic invariances for uncertainty estimation in natural language generation. In International Conference on Learning Representations (ICLR).
- W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica (2023). Efficient memory management for large language model serving with PagedAttention. In ACM Symposium on Operating Systems Principles (SOSP).
- M. Lee, K. Kim, T. Kim, and S. Park (2024). Selective generation for controllable language models. In Advances in Neural Information Processing Systems (NeurIPS).
- A. Maurer and M. Pontil (2009). Empirical Bernstein bounds and sample variance penalization. In Conference on Learning Theory (COLT), pp. 115–124.
- C. Mohri and T. Hashimoto (2024). Language models with conformal factuality guarantees. In Proceedings of the 41st International Conference on Machine Learning (ICML), Vol. 235, pp. 36029–36047. arXiv:2402.10978.
- V. Quach, A. Fisch, T. Schuster, A. Yala, J. H. Sohn, T. S. Jaakkola, and R. Barzilay (2024). Conformal language modeling. In International Conference on Learning Representations (ICLR). arXiv:2306.10193.
- A. Ramdas, P. Grünwald, V. Vovk, and G. Shafer (2023). Game-theoretic statistics and safe anytime-valid inference. Statistical Science 38(4), pp. 576–601.
- G. Shafer (2021). Testing by betting: a strategy for statistical and scientific communication. Journal of the Royal Statistical Society: Series A 184(2), pp. 407–431.
- S. Shalev-Shwartz and S. Ben-David (2014). Understanding machine learning: from theory to algorithms. Cambridge University Press.
- Qwen Team (2024). Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.
- R. J. Tibshirani, R. F. Barber, E. Candès, and A. Ramdas (2019). Conformal prediction under covariate shift. In Advances in Neural Information Processing Systems (NeurIPS).
- V. Vovk, A. Gammerman, and G. Shafer (2005). Algorithmic learning in a random world. Springer.
- V. Vovk and R. Wang (2021). E-values: calibration, combination and applications. Annals of Statistics 49(3), pp. 1736–1754.
- X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou (2023). Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations (ICLR).
- I. Waudby-Smith and A. Ramdas (2024). Estimating means of bounded random variables by betting. Journal of the Royal Statistical Society: Series B 86(1), pp. 1–27.
- M. Xiong, Z. Hu, X. Lu, Y. Li, J. Fu, J. He, and B. Hooi (2024). Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. In International Conference on Learning Representations (ICLR).

## Appendix A Background

### A.1 Conformal Risk Control

Conformal risk control (Angelopoulos et al., [2024](https://arxiv.org/html/2606.29054#bib.bib3)) provides distribution-free guarantees on the expected risk of a selective predictor. Given a calibration set of exchangeable examples $(X_i, Y_i)_{i=1}^{n}$, a risk function $R(\hat{Y}, Y)$, and a target risk level $\alpha$, CRC selects a threshold $\lambda^{*}$ such that the conditional risk bound $\mathbb{E}[R \mid \mathrm{emit}] \leq \alpha$ holds with high probability. The key insight is that by calibrating on the empirical risk of a nested family of selectors, one obtains finite-sample validity without distributional assumptions.

### A.2 Learn-Then-Test Framework

The Learn-Then-Test (LTT) framework (Angelopoulos et al., [2025](https://arxiv.org/html/2606.29054#bib.bib4)) extends conformal prediction to risk control. For each candidate threshold $\lambda$, LTT tests the null hypothesis $H_0 : R(\lambda) > \alpha$ using Hoeffding-based upper confidence bounds. With a Bonferroni (union-bound) correction over $m$ thresholds, the probability that every selected $\lambda^{*}$ satisfies $R(\lambda^{*}) \leq \alpha$ is at least $1 - \delta$. Setting $\delta = 1/(n+1)$ yields finite-sample validity. The monotonicity of risk in the threshold $\lambda$ can be exploited to avoid the Bonferroni correction entirely, as we do in this work.

### A.3 Selective Prediction

Selective prediction (Geifman and El-Yaniv, [2017](https://arxiv.org/html/2606.29054#bib.bib26)) studies the trade-off between coverage (fraction of examples answered) and risk (error rate among answered examples). The goal is to abstain on uncertain inputs so that the risk of emitted predictions meets a target. Our work bridges selective prediction with conformal methods to obtain distribution-free guarantees for structured LLM outputs.

## Appendix B Algorithm Pseudocode

Algorithm 1 CRC Procedure: Calibration for risk control at level $\alpha$.

1: Input: Calibration set $\{(x_i, y_i)\}_{i=1}^{n}$, risk $R$, target $\alpha$, thresholds $\Lambda$, failure prob. $\delta$.

2: For each $(x_i, y_i)$: compute $s_i = s(x_i, \hat{y}_i)$.

3: For each $\lambda \in \Lambda$: compute $\hat{R}(\lambda) = \frac{1}{|E_\lambda|} \sum_{i \in E_\lambda} R(\hat{y}_i, y_i)$ where $E_\lambda = \{i : s_i \geq \lambda\}$.

4: Compute Hoeffding UCB: $U(\lambda) = \hat{R}(\lambda) + \sqrt{\log(2/\delta) / (2|E_\lambda|)}$.

5: $\lambda^{*} = \min\{\lambda \in \Lambda : U(\lambda) \leq \alpha\}$ (lowest threshold with guaranteed risk).

6: Return $\lambda^{*}$.
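Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration that follows the listed steps literally; the function names and the toy data are hypothetical, and returning `None` when no threshold certifies is an assumption about how an empty feasible set is reported.

```python
import math

def hoeffding_ucb(risks, delta):
    """Empirical mean plus the Hoeffding correction sqrt(log(2/delta) / (2m))."""
    m = len(risks)
    return sum(risks) / m + math.sqrt(math.log(2.0 / delta) / (2.0 * m))

def calibrate_crc(scores, risks, alpha, thresholds, delta=0.1):
    """Return the lowest threshold whose emitted-set UCB is <= alpha, else None."""
    for lam in sorted(thresholds):
        # E_lambda: calibration examples that would be emitted at this threshold.
        emitted = [r for s, r in zip(scores, risks) if s >= lam]
        if emitted and hoeffding_ucb(emitted, delta) <= alpha:
            return lam
    return None

# Toy calibration set: 400 confident, correct outputs and 100 unconfident, wrong ones.
toy_scores = [0.9] * 400 + [0.1] * 100
toy_risks = [0.0] * 400 + [1.0] * 100
print(calibrate_crc(toy_scores, toy_risks, alpha=0.10, thresholds=[0.0, 0.5]))
```

At test time an output is emitted when its score is at least the returned threshold and abstained on otherwise.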

Algorithm 2 ACI-SG: Adaptive Conformal Inference for Structured Generation.

1: Input: Target $\alpha$, step $\gamma > 0$, initial $\lambda_0$.

2: for $t = 1, 2, \ldots$ do

3: Receive $(x_t, y_t)$; compute $\hat{y}_t$, $s_t$, $r_t = R(\hat{y}_t, y_t)$.

4: $\lambda_t = \lambda_{t-1} - \gamma(\alpha - r_t)$. (lower $\lambda$ $\to$ accept more)

5: Emit if $s_t \geq \lambda_t$, else abstain.

6: end for
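The update rule of Algorithm 2 can be sketched as follows. One deliberate deviation is labeled here as an assumption: the sketch makes the emit decision with the threshold available before the label for step $t$ is seen, and only then applies the update, since in deployment the risk $r_t$ is typically observed after the decision. The function name `aci_run` and the toy stream are hypothetical.

```python
def aci_run(stream, alpha, gamma=0.01, lam0=0.5):
    """Run the ACI threshold update over a stream of (score, risk) pairs.

    Assumption: the decision at step t uses the threshold from step t-1,
    because the risk r_t is only known after the output has been emitted or withheld.
    """
    lam = lam0
    decisions = []
    for score, risk in stream:
        decisions.append(score >= lam)        # emit if the score clears the current threshold
        lam = lam - gamma * (alpha - risk)    # risk above alpha raises the threshold
    return decisions, lam

# Toy stream: ten consecutive failures with a fixed score of 0.55.
decisions, final_lam = aci_run([(0.55, 1.0)] * 10, alpha=0.10)
print(decisions, final_lam)
```

With every observed risk above $\alpha$, the threshold climbs by $\gamma(1 - \alpha)$ per step until the fixed score no longer clears it, which is the abstention behavior the update is designed to produce under shift.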

Algorithm 3 e-CRC: Betting-Based Risk Control at level $\alpha$.

1: Input: Calibration set $\{(x_i, y_i)\}_{i=1}^{n}$, risk $R$, target $\alpha$, thresholds $\Lambda$, failure prob. $\delta$.

2: for $\lambda \in \Lambda$ with $|E_\lambda| \geq 5$ do

3: Take risks $r_1, \ldots, r_m$ in $E_\lambda$ in calibration order.

4: Set wealth $W_0 = 1$.

5: for $j = 1, \ldots, m$ do

6: Compute running mean $\hat{\mu}_j = \frac{1}{j-1} \sum_{k<j} r_k$ (predictable, for $j \geq 2$).

7: Kelly bet: $\kappa_j = \mathrm{clip}\!\left(\frac{\alpha - \hat{\mu}_j}{\alpha(1-\alpha)}, 0, 0.5\right)$.

8: Update: $W_j = W_{j-1} \cdot (1 + \kappa_j(\alpha - r_j))$.

9: end for

10: If $W_m \geq 1/\delta$, certify $\lambda$.

11: end for

12: $\lambda^{*} = \min\{\lambda \in \Lambda : \mathrm{certified}\}$.

13: Return $\lambda^{*}$.
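A minimal Python sketch of Algorithm 3 follows. Because the running mean is only defined for $j \geq 2$, the sketch assumes no bet is placed at $j = 1$ ($\kappa_1 = 0$); that choice, the function names, and the toy inputs are illustrative assumptions rather than the paper's implementation.

```python
def ecrc_certifies(risks, alpha, delta=0.1, kappa_max=0.5):
    """Run the betting wealth process over risks in calibration order.

    Returns (certified, final_wealth); certified means final wealth >= 1/delta.
    """
    wealth, running_sum = 1.0, 0.0
    for j, r in enumerate(risks, start=1):
        if j == 1:
            kappa = 0.0  # assumption: no bet before any risk has been observed
        else:
            mu_hat = running_sum / (j - 1)  # predictable: uses only earlier risks
            raw = (alpha - mu_hat) / (alpha * (1.0 - alpha))
            kappa = min(max(raw, 0.0), kappa_max)
        wealth *= 1.0 + kappa * (alpha - r)
        running_sum += r
    return wealth >= 1.0 / delta, wealth

def calibrate_ecrc(scores, risks, alpha, thresholds, delta=0.1, min_emitted=5):
    """Lowest threshold whose emitted set is certified by the wealth test, else None."""
    for lam in sorted(thresholds):
        emitted = [r for s, r in zip(scores, risks) if s >= lam]
        if len(emitted) >= min_emitted and ecrc_certifies(emitted, alpha, delta)[0]:
            return lam
    return None

print(ecrc_certifies([0.0] * 30, alpha=0.30))
```

The cap on the bet keeps every wealth factor strictly positive for risks in $[0,1]$, so the process stays non-negative as the validity argument requires.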

## Appendix C Task-Specific Risk Functions

We define risk as one minus the task metric, so lower risk corresponds to better performance:

- NER Risk: $R_{\mathrm{NER}} = 1 - F_1^{\mathrm{entity}}$, where $F_1^{\mathrm{entity}}$ is entity-level micro F1 (exact span match).
- JSON Risk: $R_{\mathrm{JSON}} = 1 - F_1^{\mathrm{field}}$, where $F_1^{\mathrm{field}}$ is field-level F1 over key-value pairs.
- QA Risk: $R_{\mathrm{QA}} = 1 - \mathbb{I}(\mathrm{exact\_match})$.
- Classification Risk: $R_{\mathrm{CLS}} = 1 - \mathbb{I}(\mathrm{correct})$.

All risk functions are bounded in $[0,1]$, satisfying the requirements for conformal risk control.
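The risk definitions above can be written as small functions. This sketch assumes predictions and references are already normalized, computes F1 per example over sets of items (entity spans or key-value pairs), and treats two empty sets as a perfect match; these conventions and the function names are assumptions, not details stated in the text.

```python
def exact_match_risk(pred, gold):
    """QA / classification risk: 0 if the prediction equals the reference, else 1."""
    return 0.0 if pred == gold else 1.0

def f1_risk(pred_items, gold_items):
    """NER / JSON risk: 1 - F1 over sets of items (spans or key-value pairs)."""
    pred, gold = set(pred_items), set(gold_items)
    if not pred and not gold:
        return 0.0  # assumption: nothing to extract and nothing extracted counts as correct
    tp = len(pred & gold)
    if tp == 0:
        return 1.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 1.0 - 2.0 * precision * recall / (precision + recall)

# JSON example: one of two fields matches exactly.
print(f1_risk({("name", "Ada"), ("age", "36")}, {("name", "Ada"), ("age", "37")}))
```

Both functions return values in $[0,1]$, matching the boundedness requirement stated above.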

## Appendix D Confidence Score Details

We evaluate six confidence scores $s(x, \hat{y}) \in [0,1]$, where higher values indicate greater reliability:

1. Token Margin (TM): Average gap between top-1 and top-2 token log-probabilities, normalized to $[0,1]$ via sigmoid. Higher margin $\to$ more confident.
2. Negative Log-Likelihood (NLL): $\exp(-\mathrm{NLL}/T)$ where $T$ is sequence length. Higher $\to$ more likely output.
3. Self-Consistency (SC): Fraction of $K = 10$ samples matching the greedy output. Higher agreement $\to$ more confident.
4. Semantic Entropy (SE): $1 - H(\mathrm{clusters})/\log_2 K$ over semantically clustered outputs. Higher $\to$ more concentrated.
5. Entity-Level Agreement (EA): Minimum entity agreement across $K$ samples. NER-specific.
6. Field-Level Completeness (FC): Minimum field agreement across $K$ samples. JSON-specific.
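Two of the sampling-based scores are simple enough to sketch directly. The example below assumes outputs are compared by exact string match, which for Semantic Entropy corresponds to the exact-match clustering proxy discussed in the score comparison rather than true semantic clustering; the function names are hypothetical.

```python
import math
from collections import Counter

def self_consistency(greedy, samples):
    """Fraction of sampled outputs that exactly match the greedy output."""
    return sum(s == greedy for s in samples) / len(samples)

def semantic_entropy_score(samples):
    """1 - H(clusters) / log2(K), with exact-match clusters standing in for semantic ones."""
    k = len(samples)
    if k < 2:
        return 1.0  # a single sample has no measurable dispersion
    counts = Counter(samples)
    entropy = -sum((c / k) * math.log2(c / k) for c in counts.values())
    return 1.0 - entropy / math.log2(k)

samples = ["A"] * 7 + ["B"] * 3   # K = 10 sampled outputs
print(self_consistency("A", samples), semantic_entropy_score(samples))
```

Both scores equal 1 when all samples agree; the entropy-based score falls to 0 when every sample is distinct.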

## Appendix E Proofs

### E.1 Theorem [1](https://arxiv.org/html/2606.29054#Thmtheorem1): Betting-Based Risk Validity (Full Proof)

###### Proof\.

Step 1: E-value validity. Fix $\lambda$ and let $r_1, \ldots, r_m$ be the risks in $E_\lambda$. Define the wealth process by $W_0 = 1$ and $W_j = W_{j-1}(1 + \kappa_j(\alpha - r_j))$ with predictable $\kappa_j$. Under $H_0 : \mathbb{E}[r_j \mid \mathcal{F}_{j-1}] \geq \alpha$:

$$\mathbb{E}[W_j \mid \mathcal{F}_{j-1}] = W_{j-1}\bigl(1 + \kappa_j(\alpha - \mathbb{E}[r_j \mid \mathcal{F}_{j-1}])\bigr) \leq W_{j-1}.$$

Since $\kappa_j \in [0, 0.5]$ and $\alpha - r_j \geq -1$, every factor $1 + \kappa_j(\alpha - r_j)$ is at least $0.5$, so $\{W_j\}$ is a non-negative supermartingale. By Ville's inequality (Howard et al., [2021](https://arxiv.org/html/2606.29054#bib.bib17)): $\mathbb{P}(\sup_j W_j \geq 1/\delta) \leq \delta$.

Step 2: Certification. Certifying when $W_m \geq 1/\delta$ gives false certification probability $\leq \delta$.

Step 3: Tightness. The Kelly-optimal bet achieves log-optimal growth rate $\approx (\alpha - \mu)^2 / (2\alpha(1-\alpha))$. When $\mathrm{Var}(R) \ll \alpha(1-\alpha)$, this exceeds the Hoeffding rate, certifying more thresholds. ∎

### E.2 Proposition [2](https://arxiv.org/html/2606.29054#Thmtheorem2): Bound Ordering (Full Proof)

###### Proof.

Hoeffding $\subseteq$ Bernstein. The Hoeffding UCB is $U_{H}=\hat{R}+\sqrt{\log(2/\delta)/(2n)}$; the Bernstein UCB is $U_{B}=\hat{R}+\sqrt{2\hat{\sigma}^{2}\log(2/\delta)/n}+7\log(2/\delta)/(3(n-1))$. For large $n$, $U_{B}<U_{H}$ when $\hat{\sigma}^{2}<1/4$. Since $\hat{\sigma}^{2}\leq\hat{R}(1-\hat{R})\leq 1/4$ for $[0,1]$-bounded risks, Bernstein certifies everything Hoeffding certifies.

Bernstein $\subseteq$ e-CRC. The Kelly-optimal strategy achieves the optimal growth rate (Waudby-Smith and Ramdas, [2024](https://arxiv.org/html/2606.29054#bib.bib14)), matching the Bernstein rate asymptotically while adapting to higher-order distribution properties, certifying strictly more thresholds for finite $n$.

Strictness. For binary risks with $\mu=0.05$, $\alpha=0.10$: $\hat{\sigma}^{2}=0.0475\ll 1/4$, so the leading Bernstein correction is $\sqrt{4\hat{\sigma}^{2}}=\sqrt{0.19}\approx 0.44$ times Hoeffding's. ∎
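The two upper confidence bounds can be compared numerically. The sketch below follows the formulas for $U_{H}$ and $U_{B}$ stated in the proof; the function names are hypothetical, and using the sample variance with an $n-1$ denominator for $\hat{\sigma}^{2}$ is an assumption.

```python
import math
import statistics


def hoeffding_ucb(risks, delta):
    """U_H = mean + sqrt(log(2/delta) / (2n))."""
    n = len(risks)
    return statistics.fmean(risks) + math.sqrt(math.log(2 / delta) / (2 * n))


def bernstein_ucb(risks, delta):
    """U_B = mean + sqrt(2 var log(2/delta) / n) + 7 log(2/delta) / (3(n-1))."""
    n = len(risks)
    log_term = math.log(2 / delta)
    var = statistics.variance(risks)  # assumption: sample variance, n-1 denominator
    return (statistics.fmean(risks)
            + math.sqrt(2 * var * log_term / n)
            + 7 * log_term / (3 * (n - 1)))


# Binary risks with empirical mean 0.05 (20 errors among 400 calibration points).
risks = [1.0] * 20 + [0.0] * 380
alpha, delta = 0.10, 0.1
u_h = hoeffding_ucb(risks, delta)
u_b = bernstein_ucb(risks, delta)
hoeffding_certifies = u_h <= alpha
bernstein_certifies = u_b <= alpha
```

In this low-variance example the Bernstein bound falls below $\alpha=0.10$ while the Hoeffding bound does not, which is the kind of configuration the strictness argument describes.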

### E.3 Proposition [3](https://arxiv.org/html/2606.29054#Thmtheorem3): Minimum Abstention Lower Bound (Full Proof)

###### Proof.

Let $p=\mathbb{P}(\mathrm{emit})$. By total expectation: $\mu=\mathbb{E}[R\mid\mathrm{emit}]\cdot p+\mathbb{E}[R\mid\mathrm{abstain}](1-p)$. Since $R\geq 0$: $\mu\geq\mathbb{E}[R\mid\mathrm{emit}]\cdot p$. With guarantee $\mathbb{E}[R\mid\mathrm{emit}]\leq\alpha$, and using $\mathbb{E}[R\mid\mathrm{abstain}]\leq 1$, we have $\mu\leq\alpha p+(1-p)$, giving $p\leq(1-\mu)/(1-\alpha)$, hence $\mathrm{abstention}=1-p\geq(\mu-\alpha)/(1-\alpha)$.

Finite-sample correction. By Hoeffding, $|\hat{\mu}-\mu|\leq\sqrt{\log(2/\delta)/(2n)}$ with probability at least $1-\delta$, so the empirical bound is $\mathrm{abstention}\geq(\mu-\alpha)/(1-\alpha)-\sqrt{\log(2/\delta)/(2n)}$. With $n=1000$, $\delta=0.1$: correction $\approx 0.039$. ∎
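The lower bound and its finite-sample correction are closed-form, so they can be evaluated before running any calibration. The following is a small sketch under that reading; the function names are hypothetical.

```python
import math


def min_abstention_lower_bound(mu, alpha):
    """Lower bound max{0, (mu - alpha) / (1 - alpha)} on the abstention rate."""
    if not (0.0 <= mu <= 1.0 and 0.0 <= alpha < 1.0):
        raise ValueError("need mu in [0, 1] and alpha in [0, 1)")
    return max(0.0, (mu - alpha) / (1.0 - alpha))


def hoeffding_correction(n, delta):
    """Finite-sample slack sqrt(log(2/delta) / (2n)) from the proof."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


# Base risk above the target forces a positive abstention rate.
lb_hard = min_abstention_lower_bound(mu=0.17, alpha=0.10)
# Base risk below the target imposes no forced abstention.
lb_easy = min_abstention_lower_bound(mu=0.05, alpha=0.10)
# Slack for the setting quoted in the proof (n = 1000, delta = 0.1).
slack = hoeffding_correction(n=1000, delta=0.1)
```

The value of `slack` reproduces the correction of about 0.039 quoted in the proof.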

### E.4 Proposition [5](https://arxiv.org/html/2606.29054#Thmtheorem5): Fusion Preserves CRC Guarantee (Full Proof)

###### Proof.

The fused score $\tilde{s}$ is fixed once the calibration set used to train $g$ is observed. Conditional on $\tilde{s}$, the values $\tilde{s}(x_{i})$ on the CRC calibration set are a deterministic function of $(x_{i},y_{i})$ and remain exchangeable across calibration examples. CRC threshold selection via any UCB then provides a valid guarantee on exchangeable test points, regardless of how $\tilde{s}$ was constructed. When the same calibration data are reused for both fusion and CRC, the procedure is conservative but valid: the guarantee still holds because the random choice of $\tilde{s}$ is then jointly accounted for in the worst-case bound on the empirical risk. ∎

### E.5 Theorem (ACI Convergence for Structured Risk)

###### Theorem 6 (ACI Convergence).

Under the ACI update $\lambda_{t}=\lambda_{t-1}+\gamma(r_{t}-\alpha)$ with bounded $r_{t}\in[0,1]$ and fixed $\gamma>0$, the time-averaged risk converges: $\frac{1}{T}\sum_{t=1}^{T}r_{t}\to\alpha$ as $T\to\infty$.

###### Proof sketch.

Follows from online convex optimization (Gibbs and Candès, [2021](https://arxiv.org/html/2606.29054#bib.bib22)). The regret of the online gradient descent update is $O(1/\sqrt{T})$, so average risk converges to $\alpha$. Our risk functions ($1-F_{1}$, $1-\mathrm{EM}$, etc.) are bounded in $[0,1]$, satisfying the bounded gradient condition. ∎
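A toy simulation makes the ACI update concrete. Everything about the data model here is an assumption for illustration only: scores are uniform on $[0,1)$, an output is emitted when its score is at least $\lambda_{t}$, an emitted output is wrong with probability $1-\mathrm{score}$, and an abstention counts as zero risk. The function name `run_aci` is hypothetical.

```python
import random


def run_aci(T=20000, alpha=0.10, gamma=0.01, lam0=0.5, seed=0):
    """Simulate lambda_t = lambda_{t-1} + gamma * (r_t - alpha) on toy data."""
    rng = random.Random(seed)
    lam = lam0
    total_risk = 0.0
    for _ in range(T):
        score = rng.random()      # toy confidence score (assumption)
        emit = score >= lam       # emit only sufficiently confident outputs
        # Toy loss: an emitted output is wrong with probability 1 - score;
        # an abstention incurs zero risk (illustrative convention).
        r = 1.0 if (emit and rng.random() < 1.0 - score) else 0.0
        total_risk += r
        lam += gamma * (r - alpha)  # ACI update from Theorem 6
    return total_risk / T, lam


avg_risk, lam_T = run_aci()
```

Summing the update over $t$ gives $\lambda_{T}-\lambda_{0}=\gamma\sum_{t}(r_{t}-\alpha)$, so the gap between the average risk and $\alpha$ equals $(\lambda_{T}-\lambda_{0})/(\gamma T)$ and shrinks as long as the threshold stays bounded.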

### E.6 Score Efficiency Ordering

###### Proposition 7 (Score Efficiency Ordering).

For two scores $s_{A}$ and $s_{B}$, if $s_{A}$ stochastically dominates $s_{B}$ in separating low-risk from high-risk examples, then CRC with $s_{A}$ yields strictly lower abstention at any $\alpha$.

###### Proof.

Stochastic dominance means $s_{A}$ retains more low-risk and excludes more high-risk examples at any threshold. The empirical risk among emitted examples is lower, the UCB tighter, leading to a lower valid threshold and less abstention. ∎

### E.7 Multi-Dimensional Risk Control

###### Proposition 8 (Multi-Dimensional Risk Control).

When simultaneously controlling $K$ risk dimensions using Bonferroni correction ($\alpha_{k}=\alpha/K$), the joint guarantee holds: $\mathbb{P}(\forall k:R_{k}(\lambda^{*}_{k})\leq\alpha/K)\geq 1-K\delta$.

###### Proof.

Apply the per-dimension CRC validity to each risk dimension independently with target $\alpha_{k}=\alpha/K$ and failure probability $\delta$. By union bound over $K$ risks, $\mathbb{P}(\text{all controlled})\geq 1-K\delta$. ∎

## Appendix F Cost-Sensitive Framework

The optimal abstention policy depends on the relative cost of errors vs. abstentions. Let $c_{\mathrm{error}}$ denote the cost of emitting an incorrect prediction and $c_{\mathrm{abstain}}$ the cost of abstaining. The expected utility is:

$$U=-c_{\mathrm{error}}\cdot\mathbb{P}(\mathrm{error}\mid\mathrm{emit})\cdot\mathbb{P}(\mathrm{emit})-c_{\mathrm{abstain}}\cdot\mathbb{P}(\mathrm{abstain}).$$

For a risk-controlling selector with conditional risk $\alpha$, the optimal target is $\alpha^{*}=c_{\mathrm{abstain}}/c_{\mathrm{error}}$. When abstention is cheap (medical NER), use stricter $\alpha$; when abstention is costly (customer-facing QA), looser $\alpha$ is preferred.
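The utility and the target $\alpha^{*}$ can be checked with a short worked example. This is a sketch with hypothetical function names and made-up cost values; it only evaluates the two formulas above.

```python
def expected_utility(risk_given_emit, p_emit, c_error, c_abstain):
    """U = -c_error * P(error | emit) * P(emit) - c_abstain * P(abstain)."""
    return -c_error * risk_given_emit * p_emit - c_abstain * (1.0 - p_emit)


def target_alpha(c_error, c_abstain):
    """alpha* = c_abstain / c_error, the target stated in the text."""
    if c_error <= 0:
        raise ValueError("c_error must be positive")
    return c_abstain / c_error


# Illustrative costs (assumption): an error is ten times as costly as abstaining.
c_error, c_abstain = 10.0, 1.0
alpha_star = target_alpha(c_error, c_abstain)

# At conditional risk alpha*, emitting and abstaining cost the same,
# so the utility no longer depends on how often the selector emits.
u_rare_emit = expected_utility(alpha_star, 0.2, c_error, c_abstain)
u_frequent_emit = expected_utility(alpha_star, 0.8, c_error, c_abstain)

# Below alpha*, emitting beats abstaining on everything.
u_strict = expected_utility(0.05, 0.8, c_error, c_abstain)
u_abstain_all = expected_utility(0.05, 0.0, c_error, c_abstain)
```

One way to read $\alpha^{*}$ is as a break-even point: when the conditional risk is below it, each emitted output has lower expected cost than an abstention.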

## Appendix G Implementation Details

#### Inference configuration.

All models are served using vLLM (Kwon et al., [2023](https://arxiv.org/html/2606.29054#bib.bib37)) v0.7+ on a single NVIDIA A100-80GB GPU with `gpu_memory_utilization=0.90`. Greedy decoding uses temperature $T=0$ with `max_tokens=512`. Sampling uses temperature $T=0.7$ with $K=10$ samples per input. The 72B model uses AWQ 4-bit quantization.

#### Prompt templates.

We use task-specific 3-shot prompt templates. For NER: entity schema plus three annotated examples. For QA: context-question-answer triples. For MMLU: question, four choices, three demonstrations. For JSON: document, target schema, three examples.

#### CRC hyperparameters.

Grid of $m=200$ candidate thresholds $\lambda\in\{0.005,0.01,\ldots,1.0\}$. Confidence level $\delta=0.1$. For e-CRC: Kelly-optimal bets with clipping $[0,0.5]$. Fixed random seed 42 for 60/40 split.
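These hyperparameters can be written down as a small configuration sketch. The helper names are hypothetical, and two details are assumptions not stated in the text: that the 60% portion is the calibration set, and that the split is a seeded shuffle of example indices using Python's `random` module.

```python
import random

M = 200                  # number of candidate thresholds
DELTA = 0.1              # confidence level
KAPPA_CLIP = (0.0, 0.5)  # clipping range for e-CRC bets
SEED = 42                # fixed seed for the 60/40 split


def threshold_grid(m=M):
    """Candidate thresholds 0.005, 0.01, ..., 1.0 (step 1/m)."""
    return [round(k / m, 3) for k in range(1, m + 1)]


def clip_bet(kappa, bounds=KAPPA_CLIP):
    """Clip a proposed bet to the allowed range."""
    low, high = bounds
    return min(max(kappa, low), high)


def calibration_test_split(n, cal_fraction=0.6, seed=SEED):
    """Seeded shuffle split; treating the 60% part as calibration is an assumption."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_cal = int(round(cal_fraction * n))
    return indices[:n_cal], indices[n_cal:]


grid = threshold_grid()
cal_idx, test_idx = calibration_test_split(1000)
```

Fixing the seed makes the split reproducible, which matters because the certified counts reported for a single split depend on which examples land in calibration.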

#### Computational cost.

Greedy inference for all 8 datasets: 45–90 min per model. Sampling ($K=10$): 3–5$\times$ longer. Score computation: $<5$ min per model (CPU). Full pipeline: $<60$ min on single CPU.

## Appendix H Additional Experimental Results

### H.1 Impossibility Bound Across Model Scales

Table 6: Minimum abstention lower bound ($\mathrm{LB}=\max\{0,(\mu-\alpha)/(1-\alpha)\}$) vs. certification across model scales at $\alpha=0.10$. When $\mu\gg\alpha$, Proposition [3](https://arxiv.org/html/2606.29054#Thmtheorem3) implies any distribution-free certificate must abstain heavily; in our experiments, these settings do not certify. As scale increases, $\mu$ decreases across NER/QA/CLS, yet remains above $\alpha$ for NER/QA/CLS even at 72B. Only JSON ($\mu\approx 0$) certifies at all scales. MMLU-Hum. at 72B ($\mu=0.17$) nears the certification frontier.

### H.2 Miscalibration Analysis

Table 7: LLM miscalibration on structured generation tasks. High ECE values show that model confidence is a poor proxy for correctness, motivating formal risk control.

### H.3 CRC Validity Across Risk Levels

Table 8: CRC risk control across $\alpha$ levels (best score per dataset). Format: test risk / abstention rate. – indicates an empty emit set (full abstention); such degenerate settings are not counted as “certified” in our aggregate rates. Bold: formally guaranteed ($\mathbb{P}(\mathrm{risk}>\alpha)\leq\delta$). $\dagger$: violation on test set.

### H.4 Cross-Task Controllability

We define controllability $C(\alpha)=1-\mathrm{abstention}$ at $\alpha$. At $\alpha=0.10$ with Qwen2.5-3B: JSON achieves $C=0.996$, CLS $0.366$, QA $0.183$, NER $0.141$. The strong correlation with base risk confirms that CRC cannot compensate for weak models.

### H.5 Baselines Comparison

Table 9: Aggregate comparison at $\alpha=0.10$ (averaged over the five sampling-based models, eight datasets; the 72B model is greedy-only). Threshold methods tune the threshold on the calibration set without formal guarantees. CRC (guaranteed) reports only formally certified configurations.

### H.6 Data Efficiency

Variance-aware bounds are substantially more data-efficient. At calibration ratio $0.2$, Hoeffding certifies $0\%$, Bernstein $8.0\%$, e-CRC $10.0\%$. At ratio $0.4$, e-CRC ($13.0\%$) already exceeds Hoeffding at ratio $0.8$ ($10.0\%$). These data are summarized in Table [4](https://arxiv.org/html/2606.29054#S3.T4).

## Appendix I Cross-Dataset ACI Transfer

### I.1 ACI Experimental Setup

For the distribution shift experiments:

- Temporal shift: Split by index order (first 60%/last 40%). CoNLL-2003 articles are ordered by publication date; TriviaQA by source document; MMLU by subject cluster. Index order captures topic drift within corpora.
- Domain shift: Calibrate on MMLU-STEM (400), test on MMLU-Humanities (400).
- Severity sweep: Oversample high-risk examples ($R>\mathrm{median}(R)$) by factors $1.0$–$3.0\times$.
- Cross-dataset: CoNLL $\to$ WNUT (newswire $\to$ social media), CoNLL $\to$ FewNERD (newswire $\to$ Wikipedia), TriviaQA $\to$ NQ (trivia $\to$ Google queries).

ACI uses $\gamma=0.01$ with initial threshold from static CRC.

Table 10: Cross-dataset ACI transfer at $\alpha=0.10$. Static CRC violates 10/14 (71%); ACI reduces to 3/14 (21%).

## Appendix J Threats to Validity

We report all 656 non-degenerate configurations; none were excluded post hoc. Thresholds $\lambda^{*}$ are determined entirely on calibration data (Algorithm [1](https://arxiv.org/html/2606.29054#alg1)); no hyperparameters are tuned on test data. “Guaranteed” means $\hat{R}_{n}^{+}(\lambda^{*})\leq\alpha$ on calibration, implying $\mathbb{E}[R(\lambda^{*})]\leq\alpha$ with probability $\geq 1-\delta$ under exchangeability. We fix $\delta=0.1$, so CRC guarantees are probabilistic: rare test-set violations can occur even for certified configurations. Empirically, violations are near-zero among certified sets, suggesting the bounds are conservative in our settings. CRC exploits risk monotonicity in $\lambda$, so no Bonferroni correction is needed (Angelopoulos et al., [2025](https://arxiv.org/html/2606.29054#bib.bib4)); cross-bound comparisons are structural (nested sets), not hypothesis tests. Bootstrap CIs over 30 resamples of the calibration/test split (mean certified count): Hoeffding 70 [66, 73], Bernstein 92 [87, 95]; the point estimates in the main text (57, 78) correspond to the fixed seed-42 split.

## Appendix K Limitations and Future Work

#### Limitations.

Static guarantees assume exchangeability; ACI provides asymptotic (not finite-sample) control under shift. Sampling-based scores require $K$ forward passes. Score fusion uses calibration data twice. One dataset (JSON-Extract) uses template-based extraction tasks. Evaluation spans 3B–72B; larger models (175B+) or closed-source APIs would further test generalizability.

#### Future work.

Learned scores optimized for risk–coverage; finite-sample ACI via e-processes; multi-task CRC with shared calibration; integration with constrained decoding; evaluation on 70B+ models.

## Appendix L Additional Baselines and Failure Analysis

#### Comparability to CP-Sets and conformal factuality.

CP-Sets (Quach et al., [2024](https://arxiv.org/html/2606.29054#bib.bib7)) focus on set-valued prediction (returning a set of candidate labels/answers) with coverage guarantees, whereas we certify single schema-validated structured outputs under task losses such as entity-level F1 and exact match. Bringing CP-Sets into our setting would require (i) a method to generate and score sets of structured candidates (e.g., multiple JSON/NER outputs), and (ii) a coverage definition tied to a structured distance (e.g., edit distance over entity sets). Conformal factuality (Mohri and Hashimoto, [2024](https://arxiv.org/html/2606.29054#bib.bib8)) certifies factual claims by composing retrieval and entailment-style checks; this is complementary to our focus on certifying format-valid structured generations with task-specific losses, and could be combined with our approach as an additional nonconformity signal.

#### Selective prediction baselines.

Learning-based rejection (e.g., SelectiveNet (Geifman and El-Yaniv, [2019](https://arxiv.org/html/2606.29054#bib.bib27)) and selective prediction for LLM systems (Lee et al., [2024](https://arxiv.org/html/2606.29054#bib.bib12))) can yield strong empirical coverage–risk tradeoffs, but typically does not provide distribution-free guarantees without an additional conformalization step. Our CRC framework can wrap any score (including learned rejectors or retrieval-based confidence signals), provided exchangeability holds; thus CRC should be viewed as a guarantee layer rather than a competing uncertainty method.

#### Why ACI can still violate under shift.

ACI (Gibbs and Candès, [2021](https://arxiv.org/html/2606.29054#bib.bib22)) adapts thresholds online using realized feedback. While this helps under gradual shift, it cannot bypass feasibility: when the base risk $\mu$ is far above the target $\alpha$, any distribution-free method must abstain heavily (Proposition [3](https://arxiv.org/html/2606.29054#Thmtheorem3)). In our cross-dataset transfers, the remaining ACI violations tend to occur in such high-$\mu$ regimes or when feedback is delayed/noisy, which limits adaptation. Practically, we recommend using the feasibility check ($\mu$ vs. $\alpha$) before deployment and treating ACI as a monitoring-and-adjustment tool when feedback is available.

#### Quantization and brittleness.

Violations are more likely for smaller or quantized models and on token-sensitive tasks (especially NER), suggesting nonconformity scores can be sensitive to tokenizer/model family and quantization artifacts (e.g., AWQ). This does not contradict CRC theory (which assumes exchangeability of examples), but it motivates using score functions that are stable across decoding/implementation variations and validating guarantees per deployment stack.
