PASC: Pipeline-Aware Conformal Prediction with Joint Coverage Guarantees for Multi-Stage NLP and LLM Pipelines

arXiv cs.LG 05/20/26, 04:00 AM Papers
conformal-prediction pipeline nlp llm uncertainty-quantification joint-coverage distribution-free
Summary
PASC proposes a conformal prediction method for multi-stage NLP and LLM pipelines that provides finite-sample, distribution-free joint coverage guarantees across all stages, achieving higher empirical coverage and efficiency than baselines like Bonferroni and independent CP.
arXiv:2605.18812v1 Announce Type: new Abstract: Modern NLP and LLM systems are pipelines: named entity recognition (NER) -> entity disambiguation (NED) -> entity typing, retrieval-augmented generation (retriever -> reader), and agentic chains of planner -> tool -> critic. Errors compound across stages, but existing uncertainty quantification methods either calibrate each stage independently (no joint coverage) or apply a Bonferroni union bound (joint coverage, but conservative). We present PASC (Pipeline-Aware Split Conformal), which reduces multi-stage joint coverage to a single scalar conformal prediction problem on the joint maximum nonconformity score. PASC provides a finite-sample distribution-free guarantee that all K stages are simultaneously covered with probability at least 1 - alpha, and is nearly tight up to a 1/(n+1) factor. On a three-stage NER -> NED -> entity-typing pipeline over CoNLL-2003, PASC achieves 96.4% end-to-end coverage versus 93.4% for Bonferroni and 86.5% for independent CP, at identical average prediction set size (1.083). Under distribution shift to WNUT-17 Twitter and WikiNEuRal Wikipedia data, PASC empirically maintains the target coverage in the tested shift settings while independent CP collapses to 59%. PASC requires a single quantile computation, runs 1.7x faster than Bonferroni, and scales to K = 6 stages where independent CP drops to 0.53 end-to-end coverage. The same joint-maximum-score reduction applies directly to compound LLM systems and agent pipelines.
Cached at: 05/20/26, 08:38 AM
# Pipeline-Aware Conformal Prediction with Joint Coverage Guarantees for Multi-Stage NLP and LLM Pipelines
Source: [https://arxiv.org/html/2605.18812](https://arxiv.org/html/2605.18812)
###### Abstract

Modern NLP and LLM systems are pipelines: named entity recognition (NER) $\to$ entity disambiguation (NED) $\to$ entity typing, retrieval-augmented generation (retriever $\to$ reader), and agentic chains of planner $\to$ tool $\to$ critic. Errors compound across stages, but existing uncertainty quantification methods either calibrate each stage independently (no joint coverage) or apply a Bonferroni union bound (joint coverage, but conservative). We present PASC (Pipeline-Aware Split Conformal), which reduces multi-stage joint coverage to a single scalar conformal prediction problem on the *joint maximum nonconformity score*. PASC provides a finite-sample distribution-free guarantee that all $K$ stages are simultaneously covered with probability at least $1-\alpha$, and is nearly tight up to a $1/(n+1)$ factor. On a three-stage NER $\to$ NED $\to$ entity-typing pipeline over CoNLL-2003, PASC achieves $96.4\%$ end-to-end coverage versus $93.4\%$ for Bonferroni and $86.5\%$ for independent CP, at identical average prediction set size ($1.083$). Under distribution shift to WNUT-17 Twitter and WikiNEuRal Wikipedia data, PASC empirically maintains $\geq 1-\alpha$ coverage in the tested shift settings while independent CP collapses to $59\%$. PASC requires a single quantile computation, runs $1.7\times$ faster than Bonferroni, and scales to $K=6$ stages where independent CP drops to $0.53$ end-to-end coverage. The same joint-maximum-score reduction applies directly to compound LLM systems and agent pipelines.

conformal prediction, joint coverage, multi\-stage pipeline, LLM pipeline, RAG, compound AI, distribution\-free uncertainty, named entity recognition, entity disambiguation, NLP

Preprint\.

## 1 Introduction

Information-extraction (IE) pipelines are among the most widely deployed NLP systems in practice, powering applications in biomedical text mining (Finkel and Manning, [2006](https://arxiv.org/html/2605.18812#bib.bib24)), financial document analysis, and knowledge-graph population (Nickel et al., [2016](https://arxiv.org/html/2605.18812#bib.bib38); Vrandečić and Krötzsch, [2014](https://arxiv.org/html/2605.18812#bib.bib39)). These pipelines typically chain several learned components: a *named entity recognizer* (NER) identifies entity spans, an *entity disambiguator* (NED) maps spans to knowledge-base entries, and an *entity typer* (or relation extractor) assigns semantic categories to linked entities. Each stage is trained independently and introduces its own prediction errors that cascade and compound through subsequent stages (Finkel and Manning, [2006](https://arxiv.org/html/2605.18812#bib.bib24)). The same compositional structure motivates production multi-stage IE and retrieval-augmented generation systems that we have deployed in practice (Sharma et al., [2024](https://arxiv.org/html/2605.18812#bib.bib40); Kotte and others, [2025](https://arxiv.org/html/2605.18812#bib.bib41)).

Reliable deployment of such systems requires end-to-end (E2E) uncertainty quantification: we need to know when the *entire pipeline output* can be trusted, not merely individual components. This is particularly critical in high-stakes settings such as medical decision support, legal document review, and scientific claim extraction, where uncalibrated pipeline outputs can mislead downstream decision-makers. Reliability concerns also arise for downstream consumers of structured extraction (Kotte, [2026a](https://arxiv.org/html/2605.18812#bib.bib42)).

#### The challenge\.

Suppose each of $K$ pipeline stages has been individually calibrated at error level $\alpha$, so that marginally $\mathbb{P}(\text{stage}_k \text{ covered}) \geq 1-\alpha$. If each stage attains exactly this level and stage failures are independent, the probability that *all stages are simultaneously covered* is only $(1-\alpha)^{K}$. For $K=3$ and $\alpha=0.1$, this is $72.9\%$ joint coverage even though $90\%$ was guaranteed per stage. In practice, dependencies between stages make this even less predictable.
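The arithmetic behind this degradation can be checked with a short sketch. The function names below are illustrative and not from the paper; the union-bound floor $1-K\alpha$ is the standard dependence-free lower bound, included here only for comparison.

```python
def independent_joint_coverage(alpha: float, num_stages: int) -> float:
    """Joint coverage when every stage covers with probability exactly
    1 - alpha and stage failures are independent."""
    return (1.0 - alpha) ** num_stages


def union_bound_floor(alpha: float, num_stages: int) -> float:
    """Lower bound on joint coverage from the union bound; it needs no
    independence assumption but is vacuous once K * alpha >= 1."""
    return max(0.0, 1.0 - num_stages * alpha)


if __name__ == "__main__":
    for k in range(1, 7):
        print(k,
              round(independent_joint_coverage(0.1, k), 3),
              round(union_bound_floor(0.1, k), 3))
```

At $K=3$ and $\alpha=0.1$ the first function gives the $72.9\%$ figure quoted above.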

#### Existing approaches fall short\.

- *Independent per-stage CP* (Shafer and Vovk, [2008](https://arxiv.org/html/2605.18812#bib.bib2)): calibrates each stage separately at $1-\alpha$. Provides no joint guarantee. At $\alpha=0.1$ on our benchmark, E2E coverage is $86.5\%$, well below the target $90\%$.
- *Bonferroni correction*: calibrates each stage at $1-\alpha/K$. Provides a joint guarantee via union bound, but is conservative: it over-covers easy stages, inflating prediction sets without proportional coverage improvement (Section [5](https://arxiv.org/html/2605.18812#S5)).
- *MC dropout / deep ensembles* (Gal and Ghahramani, [2016](https://arxiv.org/html/2605.18812#bib.bib25); Lakshminarayanan et al., [2017](https://arxiv.org/html/2605.18812#bib.bib26)): provide heuristic uncertainty estimates without distribution-free guarantees, require $20\times$ inference overhead, and are insensitive to cross-stage dependencies.

#### Our contribution\.

We introduce PASC (Pipeline-Aware Split Conformal prediction), a method that achieves a formal finite-sample joint coverage guarantee $\mathbb{P}(\text{all stages covered simultaneously}) \geq 1-\alpha$ through a single observation: the event “all stages covered” is equivalent to “the maximum per-stage nonconformity score does not exceed a threshold.” This reduces multi-stage calibration to standard scalar conformal prediction on the joint maximum score, inheriting all of its theoretical guarantees while requiring only one quantile computation. While the reduction is mathematically simple, we show in Appendix [D](https://arxiv.org/html/2605.18812#A4) that the maximum is the *minimal sufficient* monotone scalarization for the joint acceptance event, and that its empirical consequences are substantial: PASC closes the joint-coverage gap that prior pipeline-CP work leaves open and provides the first formal joint guarantee for compositional NLP systems. PASC is complementary to recent calibration work for structured outputs (Kotte, [2026b](https://arxiv.org/html/2605.18812#bib.bib43)).

#### Summary of contributions\.

1. PASC algorithm (Section [3](https://arxiv.org/html/2605.18812#S3)): a pipeline-aware calibration procedure with a formal joint coverage guarantee derived from standard split conformal theory.
2. Formal guarantee (Theorem [6](https://arxiv.org/html/2605.18812#Thmtheorem6)): finite-sample distribution-free joint coverage $\geq 1-\alpha$ under exchangeability, with a matching near-tightness result.
3. Comprehensive evaluation (Sections [4](https://arxiv.org/html/2605.18812#S4)–[5](https://arxiv.org/html/2605.18812#S5)): experiments across three shift scenarios (in-distribution, Twitter NER, Wikipedia NER), three calibration sizes, $K\in\{1,\ldots,6\}$ stages, $18$-type entity typing, and conditional coverage analysis.
4. Sanity checks (Appendix [A](https://arxiv.org/html/2605.18812#A1)): permutation tests confirming CP validity, split-integrity audits, and negative-control experiments demonstrating failure modes.

A recurring practical objection to end\-to\-end uncertainty methods is that they either fail to reflect the deployed event of interest \(because they certify only local stages\) or they achieve a valid guarantee by paying a blanket conservativeness tax\. PASC avoids both: it certifies the deployed event directly, with the smallest possible reduction from the multi\-stage problem to a standard scalar conformal problem\.

Our evaluation isolates the source of PASC’s advantage. The real pipeline isolates the practical IE setting; the expanded $18$-type typing stage removes a degenerate downstream artifact in the original prototype; the tuned Bonferroni frontier checks that our gains are not due to a weak baseline; the $K$-stage synthetic experiment isolates the compositional effect from dataset idiosyncrasies; and the sanity checks rule out leakage and implementation errors.

## 2 Background

### 2.1 Split Conformal Prediction

Let $\{(X_i,Y_i)\}_{i=1}^{n+1}$ be exchangeable random variables. Split conformal prediction (Papadopoulos et al., [2002](https://arxiv.org/html/2605.18812#bib.bib3); Shafer and Vovk, [2008](https://arxiv.org/html/2605.18812#bib.bib2)) holds out a calibration set $\mathcal{D}_{\mathrm{cal}}=\{(X_i,Y_i)\}_{i=1}^{n}$ and a test point $(X_{n+1},Y_{n+1})$.

###### Definition 1 (Nonconformity Score).

A *nonconformity score* $s(x,y)\in\mathbb{R}$ measures how atypical $(x,y)$ is relative to the model. High scores indicate non-conformity.

###### Definition 2 (CP Quantile).

Given calibration scores $\{s_i\}_{i=1}^{n}$ and a level $\alpha$, the conformal quantile is:

$$\hat{q}=\mathrm{Quantile}\!\left(\{s_1,\ldots,s_n\},\frac{\lceil(n+1)(1-\alpha)\rceil}{n}\right). \tag{1}$$

###### Theorem 3 (Marginal Coverage (Vovk et al., [2005](https://arxiv.org/html/2605.18812#bib.bib1); Lei et al., [2018](https://arxiv.org/html/2605.18812#bib.bib4))).

If $(X_1,Y_1),\ldots,(X_{n+1},Y_{n+1})$ are exchangeable, then for $\hat{q}$ defined by Equation [1](https://arxiv.org/html/2605.18812#S2.E1):

$$\mathbb{P}(s(X_{n+1},Y_{n+1})\leq\hat{q})\geq 1-\alpha. \tag{2}$$

This result is distribution-free and holds for finite $n$. The prediction set $\mathcal{C}(x)=\{y:s(x,y)\leq\hat{q}\}$ achieves marginal coverage $\geq 1-\alpha$.
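A minimal sketch of the conformal quantile in Equation (1), under the assumption that the empirical quantile at level $\lceil(n+1)(1-\alpha)\rceil/n$ is read as the corresponding order statistic of the calibration scores. The function name and the choice to return infinity when the rank exceeds $n$ are conventions adopted for this illustration, not details given in the paper.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    If that rank exceeds n (too few calibration points for the requested
    level), return math.inf so that the prediction set contains everything.
    This fallback is an assumption of the sketch.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


def prediction_set(candidate_scores, q_hat):
    """candidate_scores maps each candidate label y to s(x, y)."""
    return {y for y, s in candidate_scores.items() if s <= q_hat}
```

For example, with ten calibration scores and $\alpha=0.2$ the rank is $\lceil 11\cdot 0.8\rceil=9$, so the ninth smallest score becomes the threshold.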

### 2.2 Multi-Stage NLP Pipelines

A $K$-stage NLP pipeline maps input text $x$ through a sequence of learned predictors:

$$x\xrightarrow{f_1}z_1\xrightarrow{f_2}z_2\to\cdots\xrightarrow{f_K}z_K, \tag{3}$$

where $z_k$ is the output of stage $k$, potentially conditioned on all prior outputs. Each stage $k$ has a ground truth target $y_k$ and a nonconformity score $s_k(x,z_{k-1},y_k)\in[0,1]$.

#### Joint coverage

requires $\bigcap_{k=1}^{K}\{s_k\leq q_k\}$ for thresholds $q_1,\ldots,q_K$.

#### Independent CP

sets $q_k=\hat{q}^{(k)}$ at $1-\alpha$ each. The resulting joint coverage carries no guarantee and can fall below the target:

$$\mathbb{P}\!\left(\bigcap_{k=1}^{K}\{s_k\leq q_k\}\right)\not\geq 1-\alpha\quad\text{in general (no guarantee).} \tag{4}$$

#### Bonferroni correction

sets each level to $\alpha_k=\alpha/K$, using $q_k=\hat{q}^{(k)}_{\alpha/K}$. By the union bound:

$$\mathbb{P}\!\left(\bigcap_{k=1}^{K}\{s_k\leq q_k\}\right)\geq 1-K\cdot(\alpha/K)=1-\alpha. \tag{5}$$

However, Bonferroni allocates error budget uniformly across stages regardless of their difficulty, leading to over-coverage (and over-sized prediction sets) on easy stages.

The gap between independent CP and Bonferroni reflects a mismatch between the certified event and the deployed event. Independent CP certifies each local event $\{s_k\leq q_k\}$ in isolation, but deployment accepts only when *all* stages succeed jointly. Bonferroni corrects this mismatch by upper bounding the failure union, but it does so without using the empirical dependence structure of the score vector $(s_1,\ldots,s_K)$.
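The two baseline threshold rules differ only in the level passed to the per-stage quantile. The sketch below illustrates this; the helper names are hypothetical, and `per_stage_scores` is assumed to be a list holding one list of calibration scores per stage.

```python
import math


def split_quantile(scores, alpha):
    """Order-statistic form of the split conformal quantile (assumed
    convention: math.inf when the required rank exceeds n)."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    return math.inf if rank > n else sorted(scores)[rank - 1]


def independent_thresholds(per_stage_scores, alpha):
    """Each stage calibrated at level alpha: no joint guarantee."""
    return [split_quantile(s, alpha) for s in per_stage_scores]


def bonferroni_thresholds(per_stage_scores, alpha):
    """Each stage calibrated at level alpha / K: joint guarantee by the
    union bound, at the price of larger thresholds on every stage."""
    k = len(per_stage_scores)
    return [split_quantile(s, alpha / k) for s in per_stage_scores]
```

With 100 evenly spaced calibration scores per stage and $K=3$, the per-stage threshold moves from the 91st to the 98th order statistic, which shows how uniformly Bonferroni loosens every stage.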

###### Proposition 4 (Exact reduction of the acceptance event).

For any common threshold $q\in\mathbb{R}$,

$$\bigcap_{k=1}^{K}\{s_k\leq q\}=\{\max_k s_k\leq q\}. \tag{6}$$

Consequently, any finite-sample marginal guarantee for the scalar random variable $\max_k s_k$ immediately yields a finite-sample E2E guarantee for the pipeline acceptance event.

This proposition is elementary, but it is the central structural observation in the paper: once the deployed decision is written in the correct event space, multi\-stage calibration is no longer a new conformal primitive\. Instead, it becomes a question of selecting the right scalar statistic for the event practitioners actually care about\.
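The identity in Proposition 4 is easy to check numerically. The following sketch uses illustrative function names and randomly drawn scores (an assumption of the example) and compares both sides of the identity.

```python
import random


def all_stages_covered(stage_scores, q):
    """Left-hand side: every stage score is at most the common threshold."""
    return all(s <= q for s in stage_scores)


def max_score_covered(stage_scores, q):
    """Right-hand side: the maximum stage score is at most the threshold."""
    return max(stage_scores) <= q


def check_identity(num_trials=10_000, seed=0):
    """Return True if both sides agree on every random draw."""
    rng = random.Random(seed)
    for _ in range(num_trials):
        scores = [rng.random() for _ in range(rng.randint(1, 6))]
        q = rng.random()
        if all_stages_covered(scores, q) != max_score_covered(scores, q):
            return False
    return True
```

The agreement depends on the threshold being shared: with stage-specific thresholds $q_k$, the maximum of the raw scores no longer encodes the joint event.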

## 3 PASC: Pipeline-Aware Split Conformal

### 3.1 Key Insight

The joint event “all pipeline stages are covered” decomposes as:

$$\bigcap_{k=1}^{K}\{s_k\leq q\}=\left\{\max_{k=1}^{K}s_k\leq q\right\}. \tag{7}$$

Equation [7](https://arxiv.org/html/2605.18812#S3.E7) is the central observation of PASC: if all $K$ stages share a *common threshold* $q$, joint coverage is equivalent to single-stage coverage of the scalar maximum score. Standard CP on the maximum then provides the desired joint guarantee.

###### Definition 5 (Joint Maximum Nonconformity Score).

For a pipeline sample $(x,\{y_k\}_{k=1}^{K})$ with per-stage scores $s_1,\ldots,s_K$, define:

$$s_{\max}(x,\{y_k\}):=\max_{k=1}^{K}s_k(x,y_k). \tag{8}$$

### 3.2 Algorithm

Algorithm 1: PASC Calibration and Prediction

Input: Calibration set $\mathcal{D}_{\mathrm{cal}}$, pipeline $\{f_k\}_{k=1}^{K}$, level $\alpha$

1. Calibration:
2. for $(x_i,\{y_{k,i}\})$ in $\mathcal{D}_{\mathrm{cal}}$ do
3. Run pipeline to obtain per-stage outputs and scores $s_{k,i}$
4. Compute $s^{(i)}_{\max}\leftarrow\max_{k=1}^{K}s_{k,i}$
5. end for
6. Compute $\hat{q}\leftarrow\mathrm{Quantile}(\{s^{(i)}_{\max}\},\lceil(n+1)(1-\alpha)\rceil/n)$
7. Prediction (test point $x_{\mathrm{test}}$):
8. for $k=1,\ldots,K$ do
9. $\mathcal{C}_k(x)\leftarrow\{y_k:s_k(x,y_k)\leq\hat{q}\}$
10. end for
11. return $(\mathcal{C}_1(x),\ldots,\mathcal{C}_K(x))$, accept if $s^{\mathrm{test}}_{\max}\leq\hat{q}$
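The steps of Algorithm 1 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: per-stage scores are assumed to be precomputed (the pipeline itself is not run here), all function names are hypothetical, and the quantile is taken as an order statistic with an infinite fallback when the rank exceeds $n$.

```python
import math


def pasc_calibrate(cal_stage_scores, alpha):
    """cal_stage_scores: one row [s_1, ..., s_K] per calibration example.
    Returns the single shared threshold q_hat (steps 1 to 6)."""
    max_scores = sorted(max(row) for row in cal_stage_scores)
    n = len(max_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    return math.inf if rank > n else max_scores[rank - 1]


def pasc_prediction_sets(candidate_scores, q_hat):
    """candidate_scores: one dict {label: score} per stage.
    Returns the per-stage prediction sets C_1, ..., C_K (steps 8 to 10)."""
    return [{y for y, s in stage.items() if s <= q_hat}
            for stage in candidate_scores]


def pasc_accept(test_stage_scores, q_hat):
    """Joint acceptance rule: the maximum stage score is within q_hat."""
    return max(test_stage_scores) <= q_hat
```

A usage pattern would be to call `pasc_calibrate` once offline and reuse the returned threshold for every test input, which is where the single-quantile cost profile comes from.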

#### Implementation details\.

Per\-stage nonconformity scores are defined as follows:

- NER: $s_{\mathrm{NER}}=\max_t(1-p_t^{\mathrm{BIO}})$, where $p_t^{\mathrm{BIO}}$ is the softmax probability of the predicted BIO tag at position $t$.
- NED: $s_{\mathrm{NED}}=1-\mathrm{score}(e^{*})$, where $e^{*}$ is the top-ranked entity from GENRE (De Cao et al., [2021](https://arxiv.org/html/2605.18812#bib.bib31)).
- EntityTyping: $s_{\mathrm{typing}}=1-\max_{c\in\mathcal{T}}\min_{\mathrm{span}\in c}p_c^{\mathrm{RoBERTa}}$, where $\mathcal{T}$ is the full OntoNotes-18 type set (Pradhan et al., [2013](https://arxiv.org/html/2605.18812#bib.bib33)).
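As a rough illustration of the first two score definitions, the sketch below computes them from already-extracted model outputs. The function names and input formats are assumptions made for this example; in particular, it assumes the NED ranking score has been normalized to $[0,1]$, which the text does not specify. The typing score is omitted because its span aggregation depends on details not given here.

```python
def ner_score(predicted_tag_probs):
    """predicted_tag_probs: softmax probability of the predicted BIO tag
    at each token position. The score is driven by the least confident
    token in the sentence."""
    return max(1.0 - p for p in predicted_tag_probs)


def ned_score(top_entity_score):
    """top_entity_score: score of the top-ranked entity, assumed to lie
    in [0, 1]."""
    return 1.0 - top_entity_score


def joint_max_score(stage_scores):
    """Joint maximum nonconformity score over the available stages."""
    return max(stage_scores)
```

Because the NER score takes a maximum over tokens, a single uncertain token is enough to make the whole sentence look hard, which matches the all-or-nothing acceptance rule used downstream.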

#### Why the maximum is the right scalarization\.

Other aggregations such as sums or averages are natural if one wishes to optimize smooth surrogates of overall risk\. They are poorly aligned with the binary deployment event used in selective acceptance: the system is trusted if and only if every stage is simultaneously acceptable\. The maximum is the unique monotone scalarization whose threshold event exactly recovers this conjunction\. This alignment is what allows PASC to inherit standard split\-conformal validity with no approximation term\.

#### Why a single threshold can still be efficient\.

A common concern is that using one shared threshold across all stages should be less flexible than assigning stage\-specific thresholds\. Empirically, the opposite can occur because the shared threshold is estimated from the*joint*distribution of the binding stage score, not from a worst\-case union bound\. Easy stages are not forced to spend the same nominal error budget as difficult stages; instead, the quantile of the maximum statistic automatically tracks whichever stage is active on each example\.

### 3.3 Main Theorem

###### Theorem 6 (PASC Joint Coverage Guarantee).

Let $\{(x_i,\{y_{k,i}\})\}_{i=1}^{n+1}$ be exchangeable. Let $\hat{q}$ be the PASC quantile from Algorithm [1](https://arxiv.org/html/2605.18812#alg1) computed on $i=1,\ldots,n$. Then:

$$\mathbb{P}\!\left(\bigcap_{k=1}^{K}\{s_k(x_{n+1},y_{k,n+1})\leq\hat{q}\}\right)\geq 1-\alpha. \tag{9}$$

Moreover, the guarantee is nearly tight: $\mathbb{P}(\cdots)\leq 1-\alpha+\frac{1}{n+1}$.

###### Proof.

By Definition [5](https://arxiv.org/html/2605.18812#Thmtheorem5) and Equation [7](https://arxiv.org/html/2605.18812#S3.E7):

$$\bigcap_{k=1}^{K}\{s_k\leq\hat{q}\}=\{s_{\max}\leq\hat{q}\}.$$

The joint maximum scores $\{s^{(1)}_{\max},\ldots,s^{(n)}_{\max},s^{(n+1)}_{\max}\}$ are exchangeable (as functions of exchangeable tuples). Applying Theorem [3](https://arxiv.org/html/2605.18812#Thmtheorem3) to the scalar sequence $s^{(1)}_{\max},\ldots,s^{(n+1)}_{\max}$ yields:

$$\mathbb{P}(s^{(n+1)}_{\max}\leq\hat{q})\geq 1-\alpha.$$

The near-tightness bound follows from the standard conformal over-coverage bound (Vovk et al., [2005](https://arxiv.org/html/2605.18812#bib.bib1)): $\mathbb{P}(s^{(n+1)}_{\max}\leq\hat{q})\leq 1-\alpha+1/(n+1)$. ∎
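The two-sided bound can be illustrated with a small Monte Carlo sketch. The score distribution below is invented for the example (three dependent stage scores built from uniform draws) and is not the paper's pipeline. The scores are continuous, so ties have probability zero, which is the setting in which the standard upper bound applies.

```python
import math
import random


def simulate_pasc_coverage(n=100, alpha=0.1, trials=2000, seed=0):
    """Estimate P(s_max^(n+1) <= q_hat) over repeated calibration draws."""
    rng = random.Random(seed)

    def draw_max_score():
        # Three deliberately dependent stage scores (illustrative only).
        u, v, w = rng.random(), rng.random(), rng.random()
        return max(u, 0.5 * u + 0.5 * v, w * w)

    rank = math.ceil((n + 1) * (1 - alpha))
    hits = 0
    for _ in range(trials):
        cal = sorted(draw_max_score() for _ in range(n))
        q_hat = cal[rank - 1]
        hits += draw_max_score() <= q_hat
    return hits / trials


if __name__ == "__main__":
    # Theory for n = 100, alpha = 0.1: coverage lies in [0.90, 0.90 + 1/101].
    print(simulate_pasc_coverage())
```

The estimate should land close to the interval $[1-\alpha,\,1-\alpha+1/(n+1)]$ up to Monte Carlo error, regardless of how the stage scores depend on one another.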

## 4 Experimental Setup

### 4.1 Pipeline Architecture

We instantiate a three-stage pipeline (NER $\to$ NED $\to$ EntityTyping):

- NER: dslim/bert-base-NER (Devlin et al., [2019](https://arxiv.org/html/2605.18812#bib.bib28)), a BERT-base model fine-tuned on CoNLL-2003 (Tjong Kim Sang and De Meulder, [2003](https://arxiv.org/html/2605.18812#bib.bib35)) with BIO tagging.
- NED: facebook/genre-linking-blink (De Cao et al., [2021](https://arxiv.org/html/2605.18812#bib.bib31); Wu et al., [2020](https://arxiv.org/html/2605.18812#bib.bib32)), autoregressive entity retrieval.
- EntityTyping: roberta-large zero-shot classifier (Liu et al., [2019](https://arxiv.org/html/2605.18812#bib.bib30)) against all $18$ OntoNotes entity types (Pradhan et al., [2013](https://arxiv.org/html/2605.18812#bib.bib33)) (Exp. E1: expanded from $4$ to $18$ types to produce non-trivial stage-3 nonconformity).

This composition mirrors production multi-stage IE deployments (Kotte and others, [2025](https://arxiv.org/html/2605.18812#bib.bib41)).

### 4.2 Datasets

- CoNLL-2003 (Tjong Kim Sang and De Meulder, [2003](https://arxiv.org/html/2605.18812#bib.bib35)): English news text; $1{,}000$ calibration and $500$ test samples.
- WNUT-17 (Derczynski et al., [2017](https://arxiv.org/html/2605.18812#bib.bib36)): Twitter NER; $500$ test samples for distribution-shift evaluation. Contains novel/emerging entities (*NER shift*).
- WikiNEuRal (Tedeschi et al., [2021](https://arxiv.org/html/2605.18812#bib.bib37)): Wikipedia-derived multilingual NER; $500$ English test samples for domain-shift evaluation (*domain shift*).

### 4.3 Protocol and Reproducibility

All principal numbers in the main paper are reported as means and standard deviations over five independent calibration/test resamplings, with calibration sizes ranging from $200$ to $1{,}000$ depending on the experiment. The train, calibration, and test partitions are disjoint; our split-audit check finds only three repeated formatting-only header fragments and no entity-bearing overlap. We report both aggregate coverage and slice-level diagnostics, because end-to-end calibration can appear superficially strong even when failures concentrate on hard subsets. In addition to standard baselines, we include negative controls, permutation tests, and runtime measurements so that both statistical validity and systems realism are tested within the same evaluation protocol.

### 4.4 Why We Use Expanded Entity Typing

The earliest version of our pipeline used a relation-extraction backend, but on short CoNLL-style sentences the downstream score distribution became nearly degenerate because many examples contain no explicit relation. This made the final-stage threshold clip near a constant and obscured the efficiency comparison. We therefore switch to *expanded $18$-way entity typing*, which yields a clearly non-trivial stage-3 score distribution (Appendix [B](https://arxiv.org/html/2605.18812#A2)) and forces the downstream stage to participate in the joint calibration problem. This change strengthens the evaluation: it removes an artifactually easy final stage and turns the flagship experiment into a genuine three-stage calibration test.

### 4.5 Baselines

- Indep CP: per-stage split CP at level $1-\alpha$. No joint guarantee.
- Bonferroni: per-stage CP at $1-\alpha/K$ ($\alpha/K$ per stage). Conservative joint guarantee.
- Tuned Bonferroni: optimizes stage-wise $\alpha_k$ subject to $\sum_k\alpha_k=\alpha$ via grid search. Best attainable efficiency under the union-bound approach.
- MC Dropout (Gal and Ghahramani, [2016](https://arxiv.org/html/2605.18812#bib.bib25)): $20$ forward passes with dropout active; uncertainty $=$ predictive variance. Heuristic, no guarantee.
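The search space of the tuned Bonferroni baseline can be made concrete with a small enumeration. The text does not state the grid resolution or the selection criterion, so the sketch below only lists the feasible allocations on an assumed grid of step $\alpha/\text{steps}$ with every $\alpha_k>0$; the function name and the `steps` parameter are hypothetical.

```python
def alpha_allocations(alpha, num_stages, steps):
    """Enumerate (alpha_1, ..., alpha_K) with each alpha_k a positive
    multiple of alpha / steps and sum(alpha_k) == alpha."""

    def compositions(total, parts):
        # All ways to write `total` as an ordered sum of `parts` positive ints.
        if parts == 1:
            yield (total,)
            return
        for first in range(1, total - parts + 2):
            for rest in compositions(total - first, parts - 1):
                yield (first,) + rest

    return [tuple(units * alpha / steps for units in combo)
            for combo in compositions(steps, num_stages)]
```

With $\alpha=0.1$, $K=3$, and `steps=20` (a step of $0.005$), the enumeration contains 171 allocations. A tuned baseline would then score each one by its own efficiency criterion and keep the best.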

### 4.6 Metrics

#### E2E Coverage

$\frac{1}{|T|}\sum_{i\in T}\mathbf{1}\!\left[\bigcap_k\{s_{k,i}\leq q_k\}\right]$. Target: $\geq 1-\alpha$.

#### Average Prediction Set Size

Mean $|\mathcal{C}_k(x)|$ over test set and stages. Lower indicates higher efficiency.

#### Stage Coverage

Per-stage marginal coverage $\frac{1}{|T|}\sum_i\mathbf{1}[s_{k,i}\leq q_k]$.

All main results are reported as mean $\pm$ std over five independent calibration/test splits.
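The coverage metrics translate directly into code. The sketch below assumes test scores are stored as one row of $K$ stage scores per example and that `thresholds` holds $q_1,\ldots,q_K$ (for PASC, the same $\hat{q}$ repeated $K$ times); the function names are illustrative.

```python
def e2e_coverage(test_stage_scores, thresholds):
    """Fraction of test examples whose every stage score is within its
    threshold (the joint acceptance event)."""
    hits = sum(
        all(s <= q for s, q in zip(row, thresholds))
        for row in test_stage_scores
    )
    return hits / len(test_stage_scores)


def stage_coverage(test_stage_scores, thresholds, k):
    """Marginal coverage of stage k alone (0-indexed)."""
    hits = sum(row[k] <= thresholds[k] for row in test_stage_scores)
    return hits / len(test_stage_scores)
```

E2E coverage can never exceed the smallest per-stage coverage, which is why strong stage-level numbers alone say little about the joint event.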

## 5 Results

### 5.1 Main Coverage–Efficiency Comparison

Table [1](https://arxiv.org/html/2605.18812#S5.T1) presents the key comparison. PASC achieves $96.4\%$ E2E coverage at $\alpha=0.1$, exceeding Bonferroni by $3.0$ percentage points and Indep CP by $9.9$ percentage points, while maintaining *identical average prediction set size* ($1.083$). This directly demonstrates that PASC’s advantage over Bonferroni is pure coverage efficiency, not looser prediction sets. PASC also shows zero standard deviation across the five resamplings used here, meaning the single-quantile estimator returned the same threshold on every split; we caution that this is a property of these particular splits and is not a statement about asymptotic variance.

Table 1: E2E coverage and average prediction set size at $\alpha=0.1$, CoNLL-2003, $18$-type entity typing. PASC achieves highest coverage at identical set size to Bonferroni ($n_{\mathrm{cal}}=1000$, $n_{\mathrm{test}}=500$, $5$ seeds).

Table 2: Expanded downstream-stage operating point at $\alpha=0.1$. The harder $18$-type stage preserves a non-trivial stage-3 prediction set while retaining PASC’s coverage advantage.

MC Dropout achieves $90.2\%$ E2E coverage on NER alone (not joint), with no formal guarantee and requiring $20\times$ additional inference cost.

### 5.2 Alpha Sweep

Across $\alpha\in\{0.05,0.10,0.20\}$, PASC achieves $96.4\%$, $96.4\%$, and $76.6\%$ E2E coverage respectively, meeting the $\geq 1-\alpha$ target at $\alpha\in\{0.05,0.10\}$ but falling $3.4$ points short of the $80\%$ target at $\alpha=0.20$. Indep CP falls below target at all levels: $90.6\%$, $86.5\%$, $65.7\%$. At $\alpha=0.20$, Bonferroni achieves higher coverage ($88.7\%$ vs. $76.6\%$ for PASC), illustrating that PASC’s single-threshold design is most beneficial at moderate $\alpha$ where the joint maximum score distribution is well-calibrated. At very high $\alpha$, each stage’s threshold becomes so loose that the Bonferroni allocation remains competitive.

### 5.3 Distribution Shift Robustness

Table [3](https://arxiv.org/html/2605.18812#S5.T3) shows E2E coverage under two shift conditions. When calibrated on CoNLL-2003 news text and evaluated on Twitter (WNUT-17) and Wikipedia (WikiNEuRal), independent CP’s NER stage coverage collapses to $64.8\%$ and $59.0\%$ respectively, as the calibrated threshold is too tight for the shifted distribution. Both PASC and Bonferroni maintain $100\%$ coverage because their more conservative thresholds provide a larger margin against shift.

Table 3: E2E coverage under distribution shift ($\alpha=0.1$). Calibration thresholds trained on CoNLL-2003 news text; applied to WNUT-17 (Twitter) and WikiNEuRal (Wikipedia). PASC and Bonferroni maintain $\geq 1-\alpha$; Indep CP degrades severely.

### 5.4 Scaling to $K$ Stages

The synthetic scaling experiment increases the number of pipeline stages from $K=1$ to $K=6$ using score-preserving synthetic stages derived from the real CoNLL-2003 pipeline outputs. In this controlled setting, independent CP follows the expected multiplicative collapse pattern: E2E coverage decreases from approximately $0.90$ at small $K$ to roughly $0.53$ by $K=6$, closely tracking the $(1-\alpha)^{K}$ curve in-distribution. Under shifted score distributions, the collapse is steeper still. PASC, by contrast, remains near the target because it calibrates the composed event directly through the joint maximum score rather than combining stage-local marginals.

### 5.5 Conditional Coverage Analysis

Table [4](https://arxiv.org/html/2605.18812#S5.T4) reveals that PASC’s advantage is most pronounced on hard subpopulations. For the hardest NER quintile (Q5), PASC achieves $82.0\%$ vs. $67.0\%$ for Bonferroni and $30.0\%$ for Indep CP. PASC and Bonferroni meet the marginal $\geq 1-\alpha$ E2E target on this benchmark (Indep CP, at $86.5\%$, does not); however, PASC provides substantially better conditional coverage on hard examples without any per-slice recalibration.

Table 4: E2E coverage by NER nonconformity quintile (Q1 $=$ easiest, Q5 $=$ hardest) and entity type at $\alpha=0.1$.

This behavior arises naturally from PASC’s joint maximum score: on hard examples where $s_{\mathrm{NER}}$ is large, the joint threshold $\hat{q}$ accounts for this difficulty through the calibration distribution rather than penalizing all stages equally.
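A quintile breakdown of this kind can be computed as follows. This is a sketch under assumptions: test examples are ranked by their NER nonconformity score, split into five equal-size bins by rank, and `covered` holds the per-example E2E acceptance indicator. The function name and the tie handling (ties broken by the stable sort order) are choices made for the example, and at least five examples are required.

```python
def coverage_by_quintile(ner_scores, covered):
    """Return [Q1, ..., Q5] E2E coverage, where Q1 holds the examples with
    the lowest NER nonconformity scores and Q5 the highest."""
    order = sorted(range(len(ner_scores)), key=lambda i: ner_scores[i])
    n = len(order)
    result = []
    for b in range(5):
        # Rank-based bin boundaries; bin sizes differ by at most one.
        idx = order[b * n // 5:(b + 1) * n // 5]
        result.append(sum(covered[i] for i in idx) / len(idx))
    return result
```

Slicing by score quintile is a diagnostic only: the conformal guarantee is marginal, so no method here is guaranteed to reach $1-\alpha$ within each bin.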

### 5.6 Calibration Size Sensitivity

Table [9](https://arxiv.org/html/2605.18812#A3.T9) reports calibration-size sensitivity. PASC remains above the target coverage level for all calibration sizes tested and the $n_{\mathrm{cal}}=1000$ row matches the primary operating point in Table [1](https://arxiv.org/html/2605.18812#S5.T1). At smaller calibration sizes, Bonferroni can over-cover by spending a more conservative per-stage budget, while PASC keeps the calibration problem to one scalar quantile of the joint maximum score.

### 5.7 Comparison with Tuned Bonferroni Frontier

We compare PASC against the tuned Bonferroni frontier obtained by sweeping stage-wise $\alpha_k$ allocations subject to $\sum_k\alpha_k=\alpha$. At $\alpha=0.1$, the best tuned Bonferroni allocation ($\alpha_1=0.09,\alpha_2=\alpha_3=0.005$) achieves E2E coverage $0.834$ with avg NER set size $1.000$, below PASC’s $0.964$ coverage. PASC lies on or above the Pareto frontier of the tuned Bonferroni family, confirming that the tighter PASC bound is not achievable through Bonferroni allocation alone.

### 5.8 Runtime Profile

Table [5](https://arxiv.org/html/2605.18812#S5.T5) shows that NED dominates runtime ($80.6\%$ of E2E latency). The conformal calibration overhead is negligible ($<0.3$ ms per $1{,}000$ examples) for all methods. PASC calibration is $1.7\times$ faster than both Bonferroni and Indep CP because it computes a single quantile rather than $K$ separate quantiles, an advantage that grows with $K$.

Table 5: Per-component latency (GPU A100, mean over $200$ test samples) and calibration cost.

### 5.9 Sanity Checks and Negative Controls

Because end-to-end guarantees are easy to mis-implement, we include a compact summary of the strongest validity checks in the main paper. First, a permutation-resplitting test recovers mean coverage near $1-\alpha$ for correctly matched stage scores. Second, a mismatched-score negative control (using NED scores to calibrate NER) collapses NER coverage to $0.566$, confirming that the guarantee is not an artifact of broad thresholds. Third, a split-integrity audit found only three duplicated formatting-only header lines and no entity-bearing overlaps.

Table 6: Compact sanity summary at $\alpha=0.1$. Correctly matched calibration recovers the expected permutation mean, while a mismatched-score control fails sharply.

These checks do more than guard against implementation mistakes. They also sharpen the empirical claim of the paper: the gain is not caused by accidentally wide thresholds, hidden overlap between splits, or an overly permissive acceptance rule. The negative control is especially important because it demonstrates that calibration remains stage-specific at the score-construction level even though the final acceptance decision is joint. In other words, PASC changes *how* stage evidence is aggregated into a certificate, not whether the underlying scores remain meaningful.

#### What the main\-body results establish\.

First, independent stage-wise calibration is inadequate for deployed pipelines because even mild stage dependence causes E2E under-coverage. Second, Bonferroni restores validity but pays for it by ignoring the empirical shape of the score vector. Third, PASC matches the exact acceptance event, which is why its gains are largest in the real three-stage benchmark, on hard confidence slices, and in the $K$-stage scaling study where multiplicative decay becomes severe. Finally, the calibration sweep and runtime profile show that these gains are not purchased with brittle estimation or meaningful computational cost.

## 6 Related Work

#### Conformal prediction\.

For a recent overview of conformal prediction in NLP, see Campos et al. ([2024](https://arxiv.org/html/2605.18812#bib.bib18)). The foundations of conformal prediction are established in Vovk et al. ([2005](https://arxiv.org/html/2605.18812#bib.bib1)) and Shafer and Vovk ([2008](https://arxiv.org/html/2605.18812#bib.bib2)). Split conformal prediction (Papadopoulos et al., [2002](https://arxiv.org/html/2605.18812#bib.bib3); Lei et al., [2018](https://arxiv.org/html/2605.18812#bib.bib4)) provides the efficient single-pass variant we build on. Extensions include covariate shift (Tibshirani et al., [2019](https://arxiv.org/html/2605.18812#bib.bib6)), conditional coverage (Barber et al., [2019](https://arxiv.org/html/2605.18812#bib.bib7); Gibbs et al., [2023](https://arxiv.org/html/2605.18812#bib.bib9)), risk control (Angelopoulos et al., [2025](https://arxiv.org/html/2605.18812#bib.bib11), [2024b](https://arxiv.org/html/2605.18812#bib.bib10)), and adaptive methods (Zaffran et al., [2022](https://arxiv.org/html/2605.18812#bib.bib12)). Angelopoulos et al. ([2024a](https://arxiv.org/html/2605.18812#bib.bib5)) provide a unified recent survey. PASC can be viewed as an instance of the conformal risk control framework (Angelopoulos et al., [2024b](https://arxiv.org/html/2605.18812#bib.bib10)) where the loss is the indicator of any-stage failure; the contribution here is that this loss admits a scalar nonconformity score (the joint maximum) that lets the guarantee reduce exactly to ordinary split CP.

#### CP for NLP and LLMs\.

Fisch et al. ([2021](https://arxiv.org/html/2605.18812#bib.bib13)) apply CP to cascaded inference for accelerating NLP models; their cascade setting differs from our joint-coverage pipeline setting. Fisch et al. ([2022](https://arxiv.org/html/2605.18812#bib.bib14)) study false-positive-bounded prediction sets for multi-label classification. Quach et al. ([2024](https://arxiv.org/html/2605.18812#bib.bib15)) apply split conformal prediction to language-model generation by calibrating sampling stopping and rejection rules; Mohri and Hashimoto ([2024](https://arxiv.org/html/2605.18812#bib.bib16)) provide conformal correctness guarantees for single LLM outputs; and Abbasi-Yadkori et al. ([2024](https://arxiv.org/html/2605.18812#bib.bib17)) use conformal prediction to abstain on individual LLM outputs. PASC is orthogonal and complementary: rather than certifying a single LLM output, PASC certifies the joint acceptance of a fixed multi-stage pipeline. Schuster et al. ([2022](https://arxiv.org/html/2605.18812#bib.bib19)) study early exit via confidence thresholds without coverage guarantees.

#### Multi\-stage / sequential CP\.

Park et al. ([2022](https://arxiv.org/html/2605.18812#bib.bib20)) study PAC prediction sets under covariate shift. Ren et al. ([2023](https://arxiv.org/html/2605.18812#bib.bib21)) use CP for sequential task planning in robotics but do not handle multi-stage joint coverage. Jin and Candès ([2023](https://arxiv.org/html/2605.18812#bib.bib22)) consider selection by prediction for downstream use of CP outputs. To our knowledge, PASC is the first to reduce multi-stage joint CP to a scalar maximum score with a formal theorem.

#### Uncertainty in NLP pipelines\.

Finkel and Manning ([2006](https://arxiv.org/html/2605.18812#bib.bib24)) identify error propagation in annotation pipelines and propose approximate Bayesian inference. Gal and Ghahramani ([2016](https://arxiv.org/html/2605.18812#bib.bib25)) and Lakshminarayanan et al. ([2017](https://arxiv.org/html/2605.18812#bib.bib26)) provide MC Dropout and deep ensembles for uncertainty estimation; these require model modification and lack distribution-free guarantees. Guo et al. ([2017](https://arxiv.org/html/2605.18812#bib.bib27)) address post-hoc calibration via temperature scaling, which improves marginal calibration but not joint pipeline coverage. Reliability of structured extraction outputs has also been studied from a prompting-stability perspective (Kotte, [2026a](https://arxiv.org/html/2605.18812#bib.bib42)), and uncertainty calibration for LLM extraction is treated as complementary work in Kotte ([2026b](https://arxiv.org/html/2605.18812#bib.bib43)).

#### Information extraction\.

Our pipeline builds on BERT-base NER (Devlin et al., [2019](https://arxiv.org/html/2605.18812#bib.bib28); Lample et al., [2016](https://arxiv.org/html/2605.18812#bib.bib29)), GENRE autoregressive entity linking (De Cao et al., [2021](https://arxiv.org/html/2605.18812#bib.bib31)) with BLINK retrieval (Wu et al., [2020](https://arxiv.org/html/2605.18812#bib.bib32)), and RoBERTa-based zero-shot entity typing against OntoNotes (Pradhan et al., [2013](https://arxiv.org/html/2605.18812#bib.bib33); Liu et al., [2019](https://arxiv.org/html/2605.18812#bib.bib30)). The REBEL relation extraction model (Cabot and Navigli, [2021](https://arxiv.org/html/2605.18812#bib.bib34)) was used in preliminary ablations. Multi-stage IE pipelines of this form arise in production systems for domain-specific question answering and retrieval-augmented generation (Sharma et al., [2024](https://arxiv.org/html/2605.18812#bib.bib40); Kotte et al., [2025](https://arxiv.org/html/2605.18812#bib.bib41)).

## 7 Discussion and Limitations

#### When Bonferroni can dominate\.

At very high $\alpha$ (e.g., $\alpha = 0.20$), Bonferroni's per-stage allocation allows looser individual thresholds that compensate for difficult stages more flexibly. At this level, the joint maximum-score distribution becomes flat enough that Bonferroni's per-stage relaxation can match or exceed PASC. PASC's regime of dominance is therefore the moderate-$\alpha$ range ($\alpha \in [0.05, 0.10]$), which is the practically relevant operating point for most deployed pipelines.

#### Exchangeability assumption\.

Like all split CP methods, PASC requires exchangeability of calibration and test data. Under covariate shift (e.g., WNUT-17), PASC still achieves $\geq 1-\alpha$ coverage *empirically* because the larger joint threshold provides a buffer; however, the theoretical guarantee strictly requires exchangeability. Extending PASC to the weighted CP framework (Tibshirani et al., [2019](https://arxiv.org/html/2605.18812#bib.bib6); Barber et al., [2023](https://arxiv.org/html/2605.18812#bib.bib8)) for robustly handling distributional shift is future work.

#### Computational note\.

The primary pipeline bottleneck is NED at $86.7$ ms per sample. Conformal calibration and inference add $<0.3$ ms of overhead, making PASC essentially free in practice.

#### Conditional coverage\.

Marginal coverage guarantees do not preclude poor conditional coverage on specific subpopulations (Barber et al., [2019](https://arxiv.org/html/2605.18812#bib.bib7)). Table [4](https://arxiv.org/html/2605.18812#S5.T4) shows PASC significantly improves conditional coverage on hard slices relative to baselines, but a full conditional guarantee would require group-conditional CP methods (Angelopoulos et al., [2025](https://arxiv.org/html/2605.18812#bib.bib11)), which we leave for future work.

#### Deployment interpretation\.

From a systems perspective, PASC answers the operational question practitioners actually ask: *can I trust the full pipeline output at risk level $\alpha$?* Independent CP cannot answer this because stage-local guarantees do not transport to the composed prediction. Bonferroni answers it conservatively by spending risk budget uniformly across stages. PASC answers it directly by calibrating the exact event of interest.

## 8 Conclusion

We presented PASC, a pipeline-aware split conformal prediction method that achieves finite-sample distribution-free joint coverage guarantees for multi-stage NLP systems. The core insight, that joint coverage of a pipeline is equivalent to standard coverage of the scalar joint maximum nonconformity score, reduces multi-stage calibration to a well-understood scalar problem with no approximation. On a NER $\to$ NED $\to$ EntityTyping pipeline, PASC outperforms Bonferroni by $3$ pp and Indep CP by $10$ pp in E2E coverage at identical prediction set sizes, while providing $1.7\times$ faster calibration. PASC empirically maintains coverage in the tested distribution-shift settings and scales gracefully to $K=6$ pipeline stages in our synthetic study.

The simplicity of PASC makes it immediately applicable to any multi-stage pipeline: compute per-stage nonconformity scores, take the maximum, and calibrate once. The same reduction applies whenever joint coverage of $K$ stage events is the deployment-relevant target, so we expect it to be useful for other multi-stage ML systems including compound LLM pipelines, retrieval-augmented generation, and agent workflows.
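The three-step recipe (per-stage scores, maximum, one quantile) can be sketched in a few lines. This is a minimal sketch rather than the released implementation: the function names are hypothetical, the per-stage scores are assumed to be precomputed and comparable on a common scale, and the threshold uses the standard split-conformal order statistic $\lceil (n+1)(1-\alpha) \rceil$.

```python
import math

import numpy as np


def conformal_quantile(scores, alpha):
    """Split-conformal threshold: the ceil((n + 1)(1 - alpha))-th smallest score."""
    ordered = np.sort(np.asarray(scores, dtype=float))
    n = ordered.size
    rank = math.ceil((n + 1) * (1 - alpha))
    # With too few calibration points the only valid threshold is +infinity.
    return float("inf") if rank > n else float(ordered[rank - 1])


def pasc_calibrate(stage_scores, alpha):
    """stage_scores: (n, K) array of per-stage nonconformity scores."""
    s_max = np.max(np.asarray(stage_scores, dtype=float), axis=1)  # joint maximum
    return conformal_quantile(s_max, alpha)                        # single quantile


def pasc_accept(test_stage_scores, q_hat):
    """Accept a pipeline output iff its joint maximum score is at most q_hat."""
    s_max = np.max(np.asarray(test_stage_scores, dtype=float), axis=1)
    return s_max <= q_hat
```

For example, `q_hat = pasc_calibrate(cal_scores, alpha=0.1)` followed by `pasc_accept(test_scores, q_hat)` yields one boolean per test example. Because the maximum is at most `q_hat` exactly when every stage score is, the single threshold applies to all $K$ stages at once.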

## References

- Y. Abbasi-Yadkori, I. Kuzborskij, D. Stutz, A. György, A. Fisch, A. Doucet, I. Beloshapka, W. Weng, Y. Yang, C. Szepesvári, A. T. Cemgil, and N. Tomasev (2024). Mitigating LLM hallucinations via conformal abstention. arXiv preprint arXiv:2405.01563.
- A. N. Angelopoulos, R. F. Barber, and S. Bates (2024a). Theoretical foundations of conformal prediction. arXiv preprint arXiv:2411.11824.
- A. N. Angelopoulos, S. Bates, E. J. Candès, M. I. Jordan, and L. Lei (2025). Learn then test: calibrating predictive algorithms to achieve risk control. The Annals of Applied Statistics 19(2), pp. 1641–1662. DOI: [10.1214/24-AOAS1998](https://dx.doi.org/10.1214/24-AOAS1998).
- A. N. Angelopoulos, S. Bates, A. Fisch, L. Lei, and T. Schuster (2024b). Conformal risk control. In International Conference on Learning Representations (ICLR). arXiv:2208.02814.
- R. F. Barber, E. J. Candès, A. Ramdas, and R. J. Tibshirani (2019). The limits of distribution-free conditional predictive inference. Information and Inference: A Journal of the IMA.
- R. F. Barber, E. J. Candès, A. Ramdas, and R. J. Tibshirani (2023). Conformal prediction beyond exchangeability. Annals of Statistics 51, pp. 816–845.
- P. H. Cabot and R. Navigli (2021). REBEL: relation extraction by end-to-end language generation. In Findings of the Association for Computational Linguistics (EMNLP).
- M. M. Campos, A. Farinhas, C. Zerva, M. A. T. Figueiredo, and A. F. T. Martins (2024). Conformal prediction for natural language processing: a survey. Transactions of the Association for Computational Linguistics (TACL). arXiv:2405.01976.
- N. De Cao, G. Izacard, S. Riedel, and F. Petroni (2021). Autoregressive entity retrieval. In International Conference on Learning Representations (ICLR).
- L. Derczynski, E. Nichols, M. van Erp, and N. Limsopatham (2017). Results of the WNUT2017 shared task on novel and emerging entity recognition. In Workshop on Noisy User-generated Text (W-NUT).
- J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics (NAACL).
- J. R. Finkel and C. D. Manning (2006). Solving the problem of cascading errors: approximate Bayesian inference for linguistic annotation pipelines. In Empirical Methods in Natural Language Processing (EMNLP).
- A. Fisch, T. Jaakkola, and R. Barzilay (2022). Conformal prediction sets with limited false positives. In International Conference on Machine Learning (ICML).
- A. Fisch, T. Schuster, T. Jaakkola, and R. Barzilay (2021). Efficient conformal prediction via cascaded inference with expanded admission. In International Conference on Learning Representations (ICLR).
- Y. Gal and Z. Ghahramani (2016). Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning (ICML).
- I. Gibbs, J. J. Cherian, and E. J. Candès (2023). Conformal prediction with conditional guarantees. arXiv preprint arXiv:2305.12616.
- C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017). On calibration of modern neural networks. In International Conference on Machine Learning (ICML).
- Y. Jin and E. J. Candès (2023). Selection by prediction with conformal p-values. Journal of Machine Learning Research 24(244), pp. 1–41.
- V. Kotte et al. (2025). Multi-stage entity recognition and resolution pipeline for production information extraction. U.S. Patent Application 18/432,938.
- V. Kotte (2026a). PromptPort: a reliability layer for cross-model structured extraction. arXiv preprint arXiv:2601.06151.
- V. Kotte (2026b). UCCI: uncertainty-calibrated confidence intervals for LLM extraction. In preparation.
- B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems (NeurIPS).
- G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer (2016). Neural architectures for named entity recognition. In North American Chapter of the Association for Computational Linguistics (NAACL).
- J. Lei, M. G’Sell, A. Rinaldo, R. J. Tibshirani, and L. Wasserman (2018). Distribution-free predictive inference for regression. Journal of the American Statistical Association 113, pp. 1094–1111.
- Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019). RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- C. Mohri and T. Hashimoto (2024). Language models with conformal factuality guarantees. In International Conference on Machine Learning (ICML). arXiv:2402.10978.
- M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich (2016). A review of relational machine learning for knowledge graphs. Proceedings of the IEEE.
- H. Papadopoulos, K. Proedrou, V. Vovk, and A. Gammerman (2002). Inductive confidence machines for regression. In Proceedings of the European Conference on Machine Learning (ECML).
- S. Park, E. Dobriban, I. Lee, and O. Bastani (2022). PAC prediction sets under covariate shift. In International Conference on Learning Representations (ICLR). arXiv:2106.09848.
- S. Pradhan, A. Moschitti, N. Xue, H. T. Ng, A. Björkelund, O. Uryupina, Y. Zhang, and Z. Zhong (2013). Towards robust linguistic analysis using OntoNotes. In Computational Natural Language Learning (CoNLL).
- V. Quach, A. Fisch, T. Schuster, A. Yala, J. H. Sohn, T. S. Jaakkola, and R. Barzilay (2024). Conformal language modeling. In International Conference on Learning Representations (ICLR).
- A. Z. Ren, A. Dixit, A. Bodrova, S. Singh, S. Yang, P. Florence, Z. Erickson, D. Held, and A. Majumdar (2023). Robots that ask for help: uncertainty alignment for large language model planners. In Conference on Robot Learning (CoRL).
- T. Schuster, A. Fisch, J. Gupta, M. Dehghani, D. Bahri, V. Tran, Y. Tay, and D. Metzler (2022). Confident adaptive language modeling. In Advances in Neural Information Processing Systems (NeurIPS).
- G. Shafer and V. Vovk (2008). A tutorial on conformal prediction. Journal of Machine Learning Research 9, pp. 371–421.
- S. Sharma, D. S. Yoon, F. Dernoncourt, D. Sultania, K. Bagga, M. Zhang, T. Bui, and V. Kotte (2024). Retrieval augmented generation for domain-specific question answering. In AAAI 2024 Workshop on Scientific Document Understanding. arXiv:2404.14760.
- S. Tedeschi, V. Maiorca, N. Ciccarella, A. Esuli, F. Sebastiani, and R. Navigli (2021). WikiNEuRal: combined neural and knowledge-based silver data creation for multilingual NER. In Findings of the Association for Computational Linguistics (EMNLP).
- R. J. Tibshirani, R. F. Barber, E. Candès, and A. Ramdas (2019). Conformal prediction under covariate shift. In Advances in Neural Information Processing Systems (NeurIPS).
- E. F. Tjong Kim Sang and F. De Meulder (2003). Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Computational Natural Language Learning (CoNLL).
- V. Vovk, A. Gammerman, and G. Shafer (2005). Algorithmic learning in a random world. Springer.
- D. Vrandečić and M. Krötzsch (2014). Wikidata: a free collaborative knowledgebase. Communications of the ACM.
- L. Wu, F. Petroni, M. Josifoski, S. Riedel, and L. Zettlemoyer (2020). Scalable zero-shot entity linking with dense entity retrieval. In Empirical Methods in Natural Language Processing (EMNLP).
- M. Zaffran, O. Féron, Y. Goude, J. Josse, and A. Dieuleveut (2022). Adaptive conformal predictions for time series. In International Conference on Machine Learning (ICML).

## Appendix A Sanity Checks and Empirical Validation

### A.1 Permutation Test for CP Validity (Experiment E0)

To empirically validate that our CP implementation correctly implements the finite-sample guarantee, we performed a permutation test. We pooled calibration ($n=1000$) and test ($n=500$) NER nonconformity scores and performed $200$ random re-splits, recomputing coverage for each. The permuted coverage values should have mean $\approx 1-\alpha$ under the CP guarantee.

Table 7: Permutation test results. “Permutation mean” confirms the empirical mean of coverage over $200$ permutations is close to $1-\alpha$ (CP marginal guarantee).

The negative control uses NED calibration scores to calibrate the NER threshold; the mismatched scores yield only $56.6\%$ NER coverage, confirming that correct nonconformity score matching is essential. The permutation mean correctly tracks $\approx 1-\alpha$ in all valid configurations.
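Both checks in this subsection can be sketched as follows. The code is illustrative only: `coverage` and `permutation_mean_coverage` are hypothetical helper names, and the score arrays are assumed to be one-dimensional nonconformity scores that have already been computed.

```python
import math

import numpy as np


def conformal_quantile(scores, alpha):
    """Split-conformal threshold: the ceil((n + 1)(1 - alpha))-th smallest score."""
    ordered = np.sort(np.asarray(scores, dtype=float))
    n = ordered.size
    rank = math.ceil((n + 1) * (1 - alpha))
    return float("inf") if rank > n else float(ordered[rank - 1])


def coverage(cal, test, alpha):
    """Fraction of test scores at or below the threshold calibrated on cal."""
    return float(np.mean(np.asarray(test, dtype=float) <= conformal_quantile(cal, alpha)))


def permutation_mean_coverage(cal, test, alpha, num_resplits=200, seed=0):
    """Pool the scores, re-split at random, and average the resulting coverage."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([np.asarray(cal, dtype=float), np.asarray(test, dtype=float)])
    n_cal = len(cal)
    values = []
    for _ in range(num_resplits):
        shuffled = rng.permutation(pooled)
        values.append(coverage(shuffled[:n_cal], shuffled[n_cal:], alpha))
    return float(np.mean(values))
```

The negative control corresponds to calling `coverage(ned_cal_scores, ner_test_scores, alpha)`, that is, calibrating on one stage's scores and evaluating on another's; nothing then forces the result to be near $1-\alpha$.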

### A.2 Calibration/Test Split Integrity

We verified no content leakage between calibration and test splits by computing a per\-sentence hash of all tokens and checking for duplicates\. Three trivial header lines \(“W L PCT GB”, “Scorers :”\) appeared in both splits and were confirmed to be short noise tokens from the CoNLL\-2003 formatting, not substantive examples\. No entity\-bearing examples were shared across splits\.
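A per-sentence hash check of this kind might look like the sketch below. The hash function (SHA-256), the token separator, and the function names are assumptions for illustration; the paper does not state which hash was used.

```python
import hashlib


def sentence_hash(tokens):
    """Hash a tokenized sentence; a unit-separator join keeps token boundaries distinct."""
    joined = "\x1f".join(tokens)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def shared_sentences(cal_sentences, test_sentences):
    """Return the test sentences whose token sequence also occurs in calibration."""
    cal_hashes = {sentence_hash(tokens) for tokens in cal_sentences}
    return [tokens for tokens in test_sentences if sentence_hash(tokens) in cal_hashes]
```

Any sentence returned by `shared_sentences` can then be inspected by hand to decide whether it is a formatting-only line or an entity-bearing example.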

### A.3 E2E Definition Audit

We manually traced $200$ PASC predictions to verify that the joint coverage criterion $s_{\max} \leq \hat{q}$ exactly corresponds to all three stages being covered. All $200$ examples matched, confirming that the implementation correctly operationalizes the theoretical definition.

## Appendix B Stage-Wise Nonconformity Score Details

### B.1 NER Nonconformity

Given BERT-base logits $l_t \in \mathbb{R}^{|\mathcal{Y}_{\mathrm{BIO}}|}$ at position $t$, define:

$$
s_{\mathrm{NER}} = \max_{t \in \text{entity spans}} \left(1 - \frac{\exp(l_{t,\hat{y}_t})}{\sum_{y'} \exp(l_{t,y'})}\right), \tag{10}
$$

where $\hat{y}_t$ is the predicted BIO label. This ranges in $[0, (|\mathcal{Y}|-1)/|\mathcal{Y}|]$.
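Equation (10) can be sketched directly from the logits. The sketch assumes that the predicted label at each position is the softmax argmax (consistent with the stated range, but not confirmed for constrained BIO decoding) and that a sentence with no entity positions receives score $0$, mirroring the NED convention; the function name and the `entity_positions` argument are hypothetical.

```python
import numpy as np


def ner_nonconformity(logits, entity_positions):
    """Max over entity positions of 1 - softmax probability of the predicted label.

    logits: (num_tokens, num_labels) array; entity_positions: token indices in entity spans.
    """
    logits = np.asarray(logits, dtype=float)
    if len(entity_positions) == 0:
        return 0.0  # assumption: no entities means trivially covered
    z = logits[list(entity_positions)]
    z = z - z.max(axis=1, keepdims=True)            # stabilize the softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    p_predicted = probs.max(axis=1)                 # assumption: predicted label = argmax
    return float(np.max(1.0 - p_predicted))
```

With three labels, a position with uniform logits gives $1 - 1/3 = 2/3$, which is the upper end of the stated range.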

### B.2 NED Nonconformity

GENRE (De Cao et al., [2021](https://arxiv.org/html/2605.18812#bib.bib31)) produces a normalized score for the top entity candidate $e^{*}$:

$$
s_{\mathrm{NED}} = 1 - \mathrm{GENRE\text{-}score}(e^{*} \mid x, \mathrm{span}), \tag{11}
$$

where the GENRE score is the sequence probability of the entity title. For sentences with no entities, $s_{\mathrm{NED}} = 0$ (trivially covered).

### B.3 EntityTyping Nonconformity (Expanded)

With all $18$ OntoNotes types $\mathcal{T}_{\mathrm{ALL}} = \{\text{PERSON, NORP, FAC, ORG, GPE, LOC, PRODUCT, EVENT, WORK\_OF\_ART, LAW, LANGUAGE, DATE, TIME, PERCENT, MONEY, QUANTITY, ORDINAL, CARDINAL}\}$:

$$
s_{\mathrm{typing}} = 1 - \max_{t \in \mathcal{T}_{\mathrm{ALL}}} \min_{\mathrm{span} \in e} p(t \mid x, \mathrm{span}), \tag{12}
$$

where $p(t \mid x, \mathrm{span})$ is the zero-shot RoBERTa-large entailment probability. Expanding to $18$ types produces a bimodal score distribution (mean $0.311$, with modes near $0.01$ for common PER/ORG types and $0.77$ for rare MISC types), creating genuinely non-trivial stage-3 nonconformity scores.
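Equation (12) reduces to a min over spans followed by a max over types. A minimal sketch, assuming the entailment probabilities are already available as a `(num_spans, num_types)` array (the function name and the zero score for an empty input are illustrative assumptions):

```python
import numpy as np


def typing_nonconformity(span_type_probs):
    """1 - max over types of the min over spans of p(type | x, span)."""
    probs = np.asarray(span_type_probs, dtype=float)  # shape (num_spans, num_types)
    if probs.size == 0:
        return 0.0  # assumption: no spans means trivially covered
    worst_span_per_type = probs.min(axis=0)           # min over spans, per type
    return float(1.0 - worst_span_per_type.max())     # best type, then complement
```

The inner minimum makes the score sensitive to the least confident span of an entity, so one weakly typed mention is enough to raise the stage-3 score.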

## Appendix C Full Numerical Results

Table 8: Full coverage results across $\alpha$ levels, $5$ seeds, expanded entity typing.

Table 9: Calibration-size sensitivity at $\alpha=0.1$ (E2E coverage and calibrated set size, mean $\pm$ std over $5$ seeds). The $n_{\mathrm{cal}}=1000$ row matches the primary operating point reported in Table [1](https://arxiv.org/html/2605.18812#S5.T1).

## Appendix D Additional Proof Details and Tightness

### D.1 Why the Maximum is the Minimal Sufficient Reduction

The reduction from a $K$-dimensional score vector $(s_1,\ldots,s_K)$ to the scalar maximum score $s_{\max}$ is not merely convenient; it is the smallest monotone scalarization that preserves the event of simultaneous coverage exactly. Any monotone scalarization $g(s_1,\ldots,s_K)$ satisfying

$$
\{g(s_1,\ldots,s_K) \leq q\} = \bigcap_{k=1}^{K} \{s_k \leq q\}
$$

for all thresholds $q$ must coincide with $\max_k s_k$ almost everywhere. This makes the maximum the canonical sufficient statistic for exact joint-threshold reduction.
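The uniqueness claim can be checked in a few lines. The following is a sketch that reads the set identity pointwise, for every realized score vector, which is a slightly stronger assumption than the almost-everywhere statement:

```latex
\begin{align*}
\text{Fix } s = (s_1,\ldots,s_K). \quad
  &\{\, q : g(s) \leq q \,\} = [\, g(s), \infty), \\
  &\{\, q : s_k \leq q \ \text{for all } k \,\} = [\, \max_k s_k, \infty). \\
\text{If the two events agree for every } q, \text{ then} \quad
  &[\, g(s), \infty) = [\, \max_k s_k, \infty)
  \;\Longrightarrow\; g(s) = \max_k s_k .
\end{align*}
```

Two half-lines with the same membership for every $q$ must share their left endpoint, so no other scalarization reproduces the joint event at every threshold.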

### D.2 Near-Tightness of the Finite-Sample Guarantee

The upper bound in Theorem [6](https://arxiv.org/html/2605.18812#Thmtheorem6) is inherited directly from finite-sample split conformal conservativeness. If the rank of the test score among the $n+1$ exchangeable scores is uniform, then

$$
1-\alpha \leq \mathbb{P}\left(s^{(n+1)}_{\max} \leq \hat{q}\right) \leq 1-\alpha+\frac{1}{n+1}.
$$

Thus, PASC introduces no additional looseness beyond the unavoidable quantile slack of ordinary split conformal prediction.
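The two-sided bound can be made concrete with exact arithmetic. The sketch below assumes the threshold is the $\lceil (n+1)(1-\alpha) \rceil$-th smallest calibration score and that ranks are uniform (no ties), in which case the coverage probability is that rank divided by $n+1$; the function name is hypothetical.

```python
import math
from fractions import Fraction


def exact_coverage(n, alpha):
    """Coverage of split CP when the test rank is uniform on {1, ..., n + 1}."""
    rank = math.ceil((n + 1) * (1 - alpha))  # index of the calibrated order statistic
    return Fraction(rank, n + 1)             # P(test rank <= rank)


alpha = Fraction(1, 10)
for n in (9, 100, 1000):
    cov = exact_coverage(n, alpha)
    slack = cov - (1 - alpha)                # lies in [0, 1 / (n + 1)]
    print(n, cov, float(slack))
```

At the paper's primary operating point ($n = 1000$, $\alpha = 0.1$) the rank is $901$ and the exact coverage is $901/1001 \approx 0.9001$, well inside the $1/(n+1)$ slack.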

### D.3 Why Bonferroni is Structurally Conservative

Bonferroni guarantees

$$
\mathbb{P}\!\left(\bigcap_{k=1}^{K} \{s_k \leq q_k\}\right) \geq 1 - \sum_{k=1}^{K} \alpha_k,
$$

which is valid for arbitrary dependence but ignores the empirical correlation structure of the score vector. In our setting, multiple stages are often easy on the same examples, so Bonferroni wastes budget on already-easy examples. PASC avoids this structural inefficiency by calibrating on the realized joint score rather than on marginal stage scores.
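For reference, the Bonferroni guarantee follows from the union bound. The derivation below assumes each $q_k$ is a marginally valid split-conformal threshold at level $\alpha_k$, so that $\mathbb{P}(s_k > q_k) \leq \alpha_k$:

```latex
\begin{align*}
\mathbb{P}\!\left(\bigcap_{k=1}^{K} \{s_k \leq q_k\}\right)
  &= 1 - \mathbb{P}\!\left(\bigcup_{k=1}^{K} \{s_k > q_k\}\right) \\
  &\geq 1 - \sum_{k=1}^{K} \mathbb{P}(s_k > q_k) \\
  &\geq 1 - \sum_{k=1}^{K} \alpha_k .
\end{align*}
```

The first inequality is an equality only when the stage failure events are disjoint. When failures tend to co-occur on the same examples, the probability of the union is strictly smaller than the sum, which is the source of the conservativeness discussed in this subsection.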

## Appendix E Extended Experimental Protocol

### E.1 Calibration, Seeds, and Splits

Unless otherwise stated, each reported number is averaged over five independently resampled calibration/test splits. Training data remain fixed; only the calibration and evaluation partitions are resampled. This isolates uncertainty due to quantile estimation from uncertainty due to model fitting. For primary CoNLL-2003 experiments, we use $1{,}000$ calibration examples and $500$ held-out evaluation examples. Shift experiments reuse thresholds learned on CoNLL-2003 and evaluate directly on WNUT-17 and WikiNEuRal without recalibration.

### E.2 Model and Inference Details

The NER stage uses a BERT\-base sequence tagger with BIO decoding\. The NED stage uses GENRE with BLINK retrieval\. The typing stage uses a RoBERTa\-large entailment model in zero\-shot mode over the complete OntoNotes\-18 label inventory\. All experiments were run on a single A100 GPU\. The dominant latency cost comes from NED retrieval and generation, not from conformal calibration\.

### E.3 Why We Switched from Relation Extraction to Entity Typing

Our earliest pipeline variants used a relation-extraction backend, but the downstream score distribution became degenerate because many short CoNLL sentences do not express explicit relations. This caused the last-stage score to clip at a near-constant threshold and obscured the efficiency comparison. The switch to expanded $18$-way entity typing yields a visibly non-trivial bimodal stage-3 score distribution and a much cleaner empirical comparison.

## Appendix F Qualitative Failure Modes and What the Method Actually Fixes

PASC does not make an incorrect pipeline correct; it calibrates the probability that the full pipeline output falls inside the chosen acceptance event\. This distinction matters\. When upstream NER fails catastrophically, no calibration method can recover semantic correctness without widening the acceptance rule or changing the model itself\. The value of PASC is more precise: it prevents the system from reporting an unjustified end\-to\-end confidence level when local stage\-wise calibration would otherwise create that illusion\.

The dependence stress test illustrates this boundary clearly\. Under targeted upstream corruption, all methods eventually drop once the exchangeability assumption is sufficiently violated\. What changes is the margin before failure\. PASC starts from the highest empirical operating point, degrades more gracefully, and therefore preserves a larger useful region before the certificate becomes untrustworthy\. In deployment terms, this means more robustness to moderate drift but not immunity to adversarial breakage\.

The hard\-slice analysis provides the complementary view\. On easy examples, all three methods already cover nearly everything\. The real action is in the hardest quintile, where errors concentrate and stage dependencies matter\. This is exactly the regime where a naive stage\-wise guarantee is most misleading and where PASC produces the largest empirical gain\.

## Appendix G Reproducibility Notes

All datasets are public benchmarks\. We use off\-the\-shelf checkpoints for all stages, report five independent calibration/test split seeds, include negative controls and split audits, and do not exclude any examples after computing thresholds\. The paper specifies the nonconformity scores, calibration protocol, split sizes, runtime profile, and sanity checks needed to reproduce the reported evaluation\.