Beyond Surface Statistics: Robust Conformal Prediction for LLMs via Internal Representations
Summary
This paper proposes a conformal prediction framework for LLMs that leverages internal representations rather than output-level statistics, introducing Layer-Wise Information (LI) scores as nonconformity measures to improve validity-efficiency trade-offs under distribution shift. Across QA benchmarks, the method is more robust to calibration-deployment mismatch than text-level baselines.
# Beyond Surface Statistics: Robust Conformal Prediction for LLMs via Internal Representations

Source: https://arxiv.org/html/2604.16217

Peng Kuang (Zhejiang University), Xiaoyu Han (University of Illinois Urbana-Champaign), Kaidi Xu (City University of Hong Kong), Haohan Wang (University of Illinois Urbana-Champaign)

###### Abstract

Large language models are increasingly deployed in settings where reliability matters, yet output-level uncertainty signals such as token probabilities, entropy, and self-consistency can become brittle under calibration–deployment mismatch. Conformal prediction provides finite-sample validity under exchangeability, but its practical usefulness depends on the quality of the nonconformity score. We propose a conformal framework for LLM question answering that uses internal representations rather than output-facing statistics: specifically, we introduce Layer-Wise Information (LI) scores, which measure how conditioning on the input reshapes predictive entropy across model depth, and use them as nonconformity scores within a standard split conformal pipeline. Across closed-ended and open-domain QA benchmarks, with the clearest gains under cross-domain shift, our method achieves a better validity–efficiency trade-off than strong text-level baselines while maintaining competitive in-domain reliability at the same nominal risk level. These results suggest that internal representations can provide more informative conformal scores when surface-level uncertainty is unstable under distribution shift.

## 1 Introduction

Large language models (LLMs) are increasingly deployed in settings where reliability matters, from general question answering and decision support to higher-stakes domains such as law, finance, and medicine, where users care not only about average accuracy but also about whether a model's output should be trusted (Huang et al., 2024; Maccha et al., 2026).
Yet uncertainty quantification for LLM generation remains difficult. Common confidence proxies based on token probabilities, entropy, or self-consistency can become brittle under distribution shift, precisely when deployment risk is highest (Kuhn et al., 2023). Moreover, because many surface forms can express the same meaning, output-level uncertainty need not align with uncertainty over the underlying semantic decision.

Conformal prediction (CP) is appealing because it wraps arbitrary predictors with finite-sample validity guarantees under exchangeability (Angelopoulos and Bates, 2022). In LLM deployment, however, calibration and test data often differ across domains, topics, and prompting styles, so this guarantee can degrade sharply (Gibbs and Candès, 2021). Shift-aware conformal methods can help, but they typically require informative covariates that partition the input space or accurate importance weights that characterize the calibration–test shift (Tibshirani et al., 2020; Barber et al., 2023). For LLMs, deriving such structure from text alone is difficult: prompt similarity and lexical overlap are often only shallow proxies for the latent factors that govern reliability. This suggests that the bottleneck is not only how CP is adapted to shift, but also which uncertainty signal is being conformalized.

Recent conformal methods for LLMs make this limitation especially clear. API-only approaches conformalize black-box uncertainty signals without logit access (Su et al., 2024); sampling-based methods build correctness-oriented uncertainty sets from multiple generations (Wang et al., 2024b); and selective-answering methods calibrate thresholds to control downstream risk for a single returned answer (Wang et al., 2025a). Even domain-shift-aware methods for LLMs still rely mainly on surface representations to assess similarity or reweight calibration data (Lin et al., 2025).
Across these settings, the dominant signals remain output-facing (Quach et al., 2024), so when reliability-relevant shift is not captured by observable surface features, these methods may inherit the fragility of text-level statistics.

A complementary line of work suggests that reliability signals may reside inside the model rather than in the final output alone. Prior studies show that LLM internal representations preserve semantic and reliability-relevant structure that is only partially visible from decoded text or final-layer statistics (Azaria and Mitchell, 2023; Chen et al., 2024). Recent layer-wise analyses further suggest that hallucinations and unanswerable cases manifest as information deficiency or instability across depth, and that aggregating evidence over layers can be more informative than probing only the final layer (Kim et al., 2025b). These observations motivate a conformal perspective built directly on internal representations.

In this work, we operationalize that perspective within a standard conformal pipeline for LLM question answering. We introduce *Layer-wise Information* (LI) scores, computed from how input context reshapes predictive entropy across model depth, and use them as nonconformity measures in split conformal prediction. The conformal wrapper itself is unchanged: our contribution is to replace an output-level uncertainty score with an internal, answer-level reliability score aggregated from hidden-state trajectories across sampled candidate answers. Accordingly, we do not claim that internal representations remove the need for conformal assumptions or restore formal validity under domain shift. Our claim is narrower and empirical: if layer-wise information ranks admissible answers more faithfully than output-level scores, then the same conformal wrapper can yield better validity–efficiency trade-offs, especially when calibration and deployment domains differ.
Our contributions are threefold: (1) we propose LI-based nonconformity scores that move conformal uncertainty estimation from output-level statistics to internal layer-wise signals; (2) across closed-ended, open-domain, and cross-domain QA benchmarks, we show that internal-score conformalization achieves a stronger empirical validity–efficiency trade-off than baselines based on API-only, sampling-based, and selective-answering uncertainty measures, with the clearest gains under cross-domain shift; and (3) we position internal representations as a practical interface between mechanistic reliability signals in LLMs and conformal uncertainty quantification, highlighting a path toward more robust conformal scoring beyond the calibration distribution.

## 2 Related Work

#### Conformal uncertainty quantification for LLMs

Conformal prediction (CP) provides finite-sample guarantees under exchangeability and is now a standard tool for uncertainty quantification (Angelopoulos and Bates, 2022; Angelopoulos et al., 2026). A broad literature studies how CP behaves beyond the classical exchangeable setting, including adaptive conformal inference under distribution shift, covariate-shift-aware reweighting, and more general analyses beyond exchangeability (Gibbs and Candès, 2021; Tibshirani et al., 2020; Barber et al., 2023). Related work also develops conditional or approximate-conditional guarantees and risk-control formulations beyond standard marginal coverage (Gibbs et al., 2025; Plassier et al., 2024). These ideas have recently been adapted to LLMs and language generation, from early work on closed-ended and multi-choice QA (Kumar et al., 2023) to open-ended generation, API-only settings, and correctness-oriented uncertainty sets for free-form QA (Quach et al., 2024; Su et al., 2024; Wang et al., 2024b).
Other lines study factuality, long-form generation, and selective or abstaining deployment, including COIN and SConU, which calibrate thresholds or detect uncertainty outliers to improve robustness in QA settings (Mohri and Hashimoto, 2024; Cherian et al., 2024; Wang et al., 2025a,b). Despite differences in interface and objective, most existing methods remain output-facing, deriving nonconformity or confidence from final-output statistics. Our work is similar in goal but different in mechanism: instead of proposing another output-level proxy, we study whether internal layer-wise scores can better support conformal prediction for LLMs.

#### Internal representations as reliability signals

A complementary line of work suggests that the most informative reliability signals may lie in the model's internal representations rather than in decoded outputs alone. Early evidence showed that hidden activations can reveal latent knowledge and truthfulness signals that are only weakly reflected in surface probabilities or generated text (Azaria and Mitchell, 2023; Burns et al., 2024). This view is reinforced in hallucination detection: Chen et al. (2024) show that internal states retain substantial detection power even when output-level statistics are weak, and related activation-based approaches likewise probe internal computation rather than final responses alone.

An information-theoretic line of work provides the conceptual basis for our method. Predictive V-usable information formalizes how much label-relevant information a model family can exploit under computational constraints, and its pointwise extension characterizes instance-level difficulty (Xu et al., 2020; Ethayarajh et al., 2022). Building on this perspective, Kim et al. (2025a) argue that hallucination is fundamentally a layer-wise information-deficiency phenomenon: usable information evolves non-monotonically across depth, so final-layer analysis can miss reliability-relevant gains and losses arising during intermediate computation. This line of work provides the closest conceptual basis for our method. Our contribution is to move from diagnosis to conformalization, using layer-wise internal information directly as the answer-level nonconformity score that drives prediction-set construction.

## 3 Methodology

### 3.1 Preliminaries

We work in the standard split conformal prediction (SCP) setting for question answering. Let $\mathcal{D}_{\mathrm{cal}} = \{(x_i, y_i^*)\}_{i=1}^N$ be a held-out calibration set, where $x_i \in \mathcal{X}$ is the $i$-th question and $y_i^* \in \mathcal{Y}$ its ground-truth answer. For each calibration question $x_i$, we sample $M$ responses from the deployed language model $\mathcal{M}: \mathcal{X} \to \mathcal{Y}$, producing a candidate pool $\{y_j^{(i)}\}_{j=1}^M$. For multiple-choice QA, each sampled response is parsed into one answer option; for open-domain QA, sampled responses are grouped into semantic answer units following prior conformal QA protocols (Quach et al., 2024; Su et al., 2024). Let $\mathcal{A}(x_i)$ denote the set of distinct candidate answer units induced by the sampled responses for $x_i$, and define

$$\mathcal{A}^*(x_i, y_i^*) := \{a \in \mathcal{A}(x_i) : a \text{ is admissible for } y_i^*\},$$

the subset of admissible answer units in the sampled pool. Here admissibility is defined by exact match in MCQA and by semantic admissibility in open-domain QA. Let $F(a; x_i)$ be any fixed answer-level reliability score, with larger values indicating more trustworthy answers.
The corresponding calibration nonconformity score is

$$s_i(F) = \begin{cases} 1 - \max_{a \in \mathcal{A}^*(x_i, y_i^*)} F(a; x_i), & \text{if } \mathcal{A}^*(x_i, y_i^*) \neq \emptyset, \\ \infty, & \text{if } \mathcal{A}^*(x_i, y_i^*) = \emptyset. \end{cases}$$

Given a target risk level $\alpha \in (0,1)$, define the conformal threshold

$$\widehat{q}_\alpha(F) := \operatorname{Quantile}\left(1 - \alpha; \{s_i(F)\}_{i=1}^N \cup \{\infty\}\right),$$

and the resulting prediction set for a new question $x$ by

$$\widehat{C}_\alpha(x; F) = \left\{a \in \mathcal{A}(x) : 1 - F(a; x) \leq \widehat{q}_\alpha(F)\right\}.$$

Under exchangeability of the calibration and test examples, the resulting set predictor satisfies the standard marginal coverage guarantee

$$\mathbb{P}\left(\widehat{C}_\alpha(X; F) \cap \mathcal{A}^*(X, Y^*) \neq \emptyset\right) \geq 1 - \alpha,$$

where $(X, Y^*)$ denotes a fresh test question-answer pair. Equivalently, with probability at least $1-\alpha$, the conformal prediction set contains at least one answer unit admissible for the ground truth (Angelopoulos and Bates, 2022). Our contribution is thus a principled internal reliability score $F$ that instantiates this otherwise standard conformal construction.

A practical complication in conformal QA is that candidate sets are formed from a finite number $M$ of sampled responses. Consequently, some calibration questions may contain no admissible answer unit in the sampled pool. We retain such examples rather than filtering them out, since filtering changes the calibration distribution and can be especially harmful in cross-domain QA. This induces a minimum manageable risk level

$$\alpha_l = \frac{N}{N+1} \cdot \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\left\{\mathcal{A}^*(x_i, y_i^*) = \emptyset\right\},$$

so practical guarantees based on sampled candidate pools are meaningful only when $\alpha \geq \alpha_l$. Appendix A gives a short derivation and interpretation of this finite-sampling risk floor.
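The split conformal construction in this subsection can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`nonconformity`, `conformal_threshold`, `prediction_set`, `risk_floor`), the dictionary representation of $F(\cdot; x)$, and the toy scores are all assumptions made for illustration, and the quantile is taken as the usual rank-based empirical quantile over the $N+1$ augmented scores.

```python
import math

# Hypothetical helper names; F is represented as a dict {answer_unit: reliability score}.

def nonconformity(F, admissible):
    """s_i(F): 1 - max F over admissible candidates, or +inf if the pool has none."""
    vals = [F[a] for a in admissible if a in F]
    return 1.0 - max(vals) if vals else math.inf

def conformal_threshold(cal_scores, alpha):
    """(1 - alpha) empirical quantile of the calibration scores augmented with +inf."""
    augmented = sorted(list(cal_scores) + [math.inf])
    k = math.ceil((1.0 - alpha) * len(augmented))  # 1-indexed rank in the sorted list
    k = min(max(k, 1), len(augmented))
    return augmented[k - 1]

def prediction_set(F, q_hat):
    """Keep every candidate whose score 1 - F(a; x) is at most the threshold."""
    return {a for a, f in F.items() if 1.0 - f <= q_hat}

def risk_floor(cal_scores):
    """alpha_l as written above: N/(N+1) times the fraction of empty admissible sets.
    An infinite score is used here as the marker for an empty admissible set."""
    n = len(cal_scores)
    n_empty = sum(1 for s in cal_scores if math.isinf(s))
    return (n / (n + 1)) * (n_empty / n)

# Toy calibration data (made up): (candidate scores, admissible answer units).
calibration = [
    ({"A": 0.9, "B": 0.3}, {"A"}),
    ({"A": 0.2, "C": 0.6}, {"C"}),
    ({"B": 0.7, "D": 0.5}, {"D"}),
    ({"A": 0.8, "B": 0.4}, set()),  # no admissible answer was sampled
]
cal_scores = [nonconformity(F, adm) for F, adm in calibration]

alpha = 0.5
q_hat = conformal_threshold(cal_scores, alpha)

test_F = {"A": 0.55, "B": 0.45, "C": 0.5}  # scores for a new question (made up)
print("risk floor:", risk_floor(cal_scores))
print("threshold:", q_hat)
print("prediction set:", sorted(prediction_set(test_F, q_hat)))
```

Swapping in a different reliability score only changes how the `F` dictionaries are produced; the calibration and set-construction steps stay the same, which mirrors the point that the conformal wrapper itself is unchanged.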
### 3.2 Layer-wise Usable Information

We now define the internal reliability signal used by