Industrializing Prediction-Powered Inference: The GLIDE Library for Reliable GenAI and Agentic Systems Evaluation

arXiv cs.AI 06/01/26, 04:00 AM Tools
Summary
GLIDE is an open-source Python library that unifies state-of-the-art Prediction-Powered Inference methods for debiased evaluation of generative AI and agentic systems, enabling annotation savings with valid uncertainty estimates.
arXiv:2605.31278v1 Announce Type: new Abstract: Reliable evaluation of agentic systems requires unbiased estimates with valid uncertainty, but standard practice navigates between costly human annotation and biased LLM-as-judge proxies. Prediction-powered inference (PPI) combines both into debiased estimates with valid confidence intervals, yet its various methods remain scattered across papers under partial implementations. We introduce GLIDE, an open-source Python library that unifies state-of-the-art PPI estimators (PPI++, Stratified PPI, Predict-Then-Debias and its stratified variants, Active Statistical Inference) and samplers (uniform, stratified, active, cost-optimal) under a scipy-style API specialized to mean estimation. GLIDE ships with a reproducible Monte Carlo validation suite, an empirically grounded decision tree for method selection, and an agentic evaluation case study showing substantial annotation savings at equivalent precision. The GLIDE package is available at this URL: https://github.com/EmertonData/glide
Cached at: 06/01/26, 09:26 AM
# Industrializing Prediction-Powered Inference: The GLIDE Library for Reliable GenAI and Agentic Systems Evaluation
Source: [https://arxiv.org/html/2605.31278](https://arxiv.org/html/2605.31278)
###### Abstract

Reliable evaluation of agentic systems requires unbiased estimates with valid uncertainty, but standard practice navigates between costly human annotation and biased LLM\-as\-judge proxies\. Prediction\-powered inference \(PPI\) combines both into debiased estimates with valid confidence intervals, yet its various methods remain scattered across papers under partial implementations\. We introduce GLIDE, an open\-source Python library that unifies state\-of\-the\-art PPI estimators \(PPI\+\+, Stratified PPI, Predict\-Then\-Debias and its stratified variants, Active Statistical Inference\) and samplers \(uniform, stratified, active, cost\-optimal\) under a scipy\-style API specialized to mean estimation\. GLIDE ships with a reproducible Monte Carlo validation suite, an empirically grounded decision tree for method selection, and an agentic evaluation case study showing substantial annotation savings at equivalent precision\. The GLIDE package is available at this URL:[https://github\.com/EmertonData/glide](https://github.com/EmertonData/glide)

Agentic, GenAI, Evaluation, PPI, LLM\-as\-Judge

## 1 Introduction

Consider a clinical decision\-support agent that retrieves guidelines, reasons over a patient record, and proposes a treatment\. Measuring whether such an agent behaves correctly requires expert clinician review of multi\-step trajectories, a process where today’s two dominant practices both fail at scale\. Full human annotation is reliable but slow and costly; LLM\-as\-judge is fast and cheap but biased, especially for knowledge\-intensive, domain\-specific tasks\.

Rather than choosing between human reliability and proxy scale, prediction-powered inference (PPI) (Angelopoulos et al., [2023a](https://arxiv.org/html/2605.31278#bib.bib1), [b](https://arxiv.org/html/2605.31278#bib.bib2)) combines the two: a small set of human annotations is paired with a large set of cheap proxy predictions to produce unbiased estimates with valid confidence intervals. Coverage holds regardless of proxy quality: a worse proxy yields wider intervals, not invalid ones.

The PPI literature has matured rapidly, with extensions for power tuning (Angelopoulos et al., [2023b](https://arxiv.org/html/2605.31278#bib.bib2)), stratified populations (Fisch et al., [2024](https://arxiv.org/html/2605.31278#bib.bib5); Fogliato et al., [2024](https://arxiv.org/html/2605.31278#bib.bib6)), small-sample bootstrap procedures (Kluger et al., [2025](https://arxiv.org/html/2605.31278#bib.bib13)), active sampling (Zrnic and Candès, [2024](https://arxiv.org/html/2605.31278#bib.bib8); Gligorić et al., [2025](https://arxiv.org/html/2605.31278#bib.bib9)), cost-optimal budgets (Angelopoulos et al., [2025](https://arxiv.org/html/2605.31278#bib.bib10)), and recalibration (Ji et al., [2025](https://arxiv.org/html/2605.31278#bib.bib12)). Yet, as already noted in (Song et al., [2026](https://arxiv.org/html/2605.31278#bib.bib3)), this progress is scattered across papers with heterogeneous notation and partial reference implementations. The `ppi_py` library (Angelopoulos et al., [2023a](https://arxiv.org/html/2605.31278#bib.bib1)) established the foundation but predates several of these methods and emphasizes generality across estimands (GLM, M-estimators) over depth in mean estimation, where most use-cases live.

Agentic evaluation is a natural fit for PPI on several counts\. The cost asymmetry between proxy and ground truth is extreme\. Agentic pipelines also exhibit natural heterogeneity: performance and proxy quality vary substantially across the tools invoked, the types of queries handled, and the sub\-agents involved, motivating stratified estimation\. Modern judges increasingly produce calibrated per\-sample uncertainty, unlocking active sampling\. Existing libraries cover parts of this landscape but no single interface unifies the matching family of estimators and samplers, leaving practitioners to stitch methods together across academic repositories\.

We close this gap with GLIDE (Generated Label Inference and Debiasing Engine), an open-source Python library focused on mean estimation. Our contributions are:

1. A unified library, exposed under a scipy-style API, integrating several PPI estimators: PPI++ (Angelopoulos et al., [2023b](https://arxiv.org/html/2605.31278#bib.bib2)), Stratified PPI++ (Fisch et al., [2024](https://arxiv.org/html/2605.31278#bib.bib5)), Predict-Then-Debias (Kluger et al., [2025](https://arxiv.org/html/2605.31278#bib.bib13)) with stratified variants, and Active Statistical Inference (Zrnic and Candès, [2024](https://arxiv.org/html/2605.31278#bib.bib8)); and several samplers: uniform, stratified (Fogliato et al., [2024](https://arxiv.org/html/2605.31278#bib.bib6)), active (Zrnic and Candès, [2024](https://arxiv.org/html/2605.31278#bib.bib8); Gligorić et al., [2025](https://arxiv.org/html/2605.31278#bib.bib9)), and cost-optimal (Angelopoulos et al., [2025](https://arxiv.org/html/2605.31278#bib.bib10)), alongside classical baselines.
2. A reproducible validation suite: Monte Carlo coverage tests, sensitivity analyses to proxy quality, and effective-sample-size benchmarks across all estimators.
3. An agentic-evaluation case study showing that GLIDE attains equivalent precision to classical sampling with substantially fewer human annotations.

## 2 Background: Prediction-Powered Inference

#### Setup.

Let $(X,Y)\sim P$, where $X$ is the input or behavior to be evaluated (in our case, a conversation, an LLM output, or an agentic trajectory) and $Y\in\mathbb{R}$ is the corresponding human-assigned label (e.g., a binary correctness indicator, a toxicity indicator, or a tool-use rating). The target estimand is the population mean $\theta^{\star}=\mathbb{E}[Y]$, which encodes a performance or risk metric of interest. We have access to two datasets drawn i.i.d. from $P$: a small *labeled* set $\mathcal{L}=\{(X_i,Y_i,f(X_i))\}_{i=1}^{n}$ and a large *unlabeled* set $\mathcal{U}=\{(X_j,f(X_j))\}_{j=1}^{N}$ with $N\gg n$, where $f$ is a fixed proxy (typically an LLM-as-judge) producing predictions of $Y$. We assume $f$ is trained on data disjoint from $\mathcal{L}\cup\mathcal{U}$, so its predictions are i.i.d. across both sets.

#### The PPI estimator\.

The classical PPI mean estimator (Angelopoulos et al., [2023a](https://arxiv.org/html/2605.31278#bib.bib1)) debiases the proxy mean using the labeled set:

$$\hat{\theta}^{\mathrm{PPI}}=\underbrace{\frac{1}{N}\sum_{j=1}^{N}f(X_j)}_{\text{proxy mean on }\mathcal{U}}+\underbrace{\frac{1}{n}\sum_{i=1}^{n}\left(Y_i-f(X_i)\right)}_{\text{rectifier on }\mathcal{L}}. \tag{1}$$

The first term is the cheap proxy mean; the second is a correction for the proxy’s bias. PPI++ (Angelopoulos et al., [2023b](https://arxiv.org/html/2605.31278#bib.bib2)) introduces a tuning parameter $\lambda\in\mathbb{R}$:

$$\hat{\theta}^{\mathrm{PPI}\text{++}}_{\lambda}=\frac{1}{n}\sum_{i=1}^{n}Y_i+\lambda\left(\frac{1}{N}\sum_{j=1}^{N}f(X_j)-\frac{1}{n}\sum_{i=1}^{n}f(X_i)\right), \tag{2}$$

which recovers ([1](https://arxiv.org/html/2605.31278#S2.E1)) for $\lambda=1$ and the classical labeled mean for $\lambda=0$. The optimal $\lambda^{\star}$ is closed-form and minimizes the asymptotic variance, guaranteeing that PPI++ is never worse than the classical estimator in the limit, even when the proxy is uninformative or adversarial.
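To make Eqs. (1) and (2) concrete, the following is a minimal standard-library sketch, not GLIDE's implementation. The function names `ppi_pp_mean` and `_cov` are hypothetical. When `lam` is omitted, the sketch uses the plug-in power-tuning value for the mean, $\hat{\lambda}=\widehat{\mathrm{Cov}}(Y,f)/\big((1+n/N)\,\widehat{\mathrm{Var}}(f)\big)$, with the proxy variance pooled over both sets; that pooling choice is an assumption of this sketch.

```python
from statistics import fmean, variance


def _cov(a, b):
    """Unbiased sample covariance of two equal-length sequences."""
    ma, mb = fmean(a), fmean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def ppi_pp_mean(y_lab, f_lab, f_unl, lam=None):
    """PPI++ mean estimate (Eq. 2). lam=1 gives classical PPI (Eq. 1),
    lam=0 gives the labeled-only mean.

    y_lab, f_lab: human labels and proxy predictions on the labeled set.
    f_unl: proxy predictions on the unlabeled set.
    """
    n, N = len(y_lab), len(f_unl)
    if lam is None:
        # Plug-in power tuning (assumption: proxy variance pooled over both sets).
        var_f = variance(list(f_lab) + list(f_unl))
        lam = _cov(y_lab, f_lab) / ((1 + n / N) * var_f) if var_f > 0 else 0.0
    theta = fmean(y_lab) + lam * (fmean(f_unl) - fmean(f_lab))
    return theta, lam
```

With `lam=1.0` the result equals the proxy mean on the unlabeled set plus the mean rectifier on the labeled set, which is a quick way to check the two forms against each other.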

#### Confidence intervals\.

Thanks to the Central Limit Theorem, $\hat{\theta}^{\mathrm{PPI}\text{++}}_{\lambda}$ is asymptotically normal, and a $(1-\alpha)$ confidence interval is obtained from the plug-in variance estimator and the normal quantile $z_{1-\alpha/2}$. For small $n$, the central-limit approximation degrades; bootstrap-based constructions provide a finite-sample alternative (Kluger et al., [2025](https://arxiv.org/html/2605.31278#bib.bib13)).
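As an illustration of the CLT-based construction, here is a hedged sketch for a fixed $\lambda$. It assumes the usual plug-in variance for independent labeled and unlabeled samples, $\widehat{\mathrm{Var}}(Y-\lambda f)/n+\lambda^{2}\,\widehat{\mathrm{Var}}(f)/N$. The function name `ppi_pp_ci` is hypothetical and the code is not taken from GLIDE.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance


def ppi_pp_ci(y_lab, f_lab, f_unl, lam=1.0, alpha=0.1):
    """Normal-approximation (1 - alpha) interval for the PPI++ mean at fixed lam."""
    n, N = len(y_lab), len(f_unl)
    theta = fmean(y_lab) + lam * (fmean(f_unl) - fmean(f_lab))
    # Labeled part: variance of Y - lam * f; unlabeled part: variance of lam * f.
    resid = [y - lam * f for y, f in zip(y_lab, f_lab)]
    var_hat = variance(resid) / n + lam**2 * variance(f_unl) / N
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{1 - alpha/2}
    half = z * sqrt(var_hat)
    return theta - half, theta + half
```

Setting `lam=0.0` reduces this to the classical interval around the labeled mean, which is the reference against which any gain from the proxy is measured.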

#### Extensions\.

The same template (combine a labeled correction with a proxy summary) generalizes to several settings relevant to evaluation: stratified populations, where power tuning is performed per stratum (Fisch et al., [2024](https://arxiv.org/html/2605.31278#bib.bib5)); non-uniform sampling driven by per-sample proxy uncertainty, known as Active Statistical Inference (ASI) (Zrnic and Candès, [2024](https://arxiv.org/html/2605.31278#bib.bib8); Gligorić et al., [2025](https://arxiv.org/html/2605.31278#bib.bib9)); small-sample regimes addressed by Predict-Then-Debias (PTD), which provides bootstrap-based confidence intervals (Kluger et al., [2025](https://arxiv.org/html/2605.31278#bib.bib13)); and cost-aware sampling probabilities derived from a known proxy/annotation cost ratio (Angelopoulos et al., [2025](https://arxiv.org/html/2605.31278#bib.bib10)). See (Song et al., [2026](https://arxiv.org/html/2605.31278#bib.bib3); Ji et al., [2025](https://arxiv.org/html/2605.31278#bib.bib12)) for recent reviews. GLIDE implements all of these extensions; we develop the corresponding estimators and samplers in Section [4](https://arxiv.org/html/2605.31278#S4).

#### Effective sample size\.

A natural figure of merit for any PPI estimator is its *effective sample size*, defined as the number of fully labeled samples a classical estimator would require to match the same confidence-interval width:

$$n_{\mathrm{eff}}=n\cdot\frac{\widehat{\mathrm{Var}}(\bar{Y}_{n})}{\widehat{\mathrm{Var}}(\hat{\theta}^{\mathrm{PPI}\text{++}}_{\lambda})}. \tag{3}$$

The ratio $n_{\mathrm{eff}}/n\geq 1$ quantifies the practical return on investment of the proxy: it translates directly into human-review-hour or budget savings, and is the headline metric we report throughout Sections [5](https://arxiv.org/html/2605.31278#S5) and [6](https://arxiv.org/html/2605.31278#S6).

## 3 Why Agentic Evaluation Needs PPI

Section [2](https://arxiv.org/html/2605.31278#S2) fixed the abstract setting; we now argue that agentic systems exhibit four properties, simultaneously and to an unusual degree, under which PPI is most valuable. Each property maps directly to a method in the recent literature, and to a corresponding GLIDE component.

#### 1\. Extreme cost asymmetry\.

Evaluating an agentic trajectory is qualitatively more expensive than evaluating a single LLM output. A trajectory may involve retrieval, tool calls, intermediate reasoning steps, and a final response, all of which a domain expert must inspect to certify correctness. Annotation costs of several dollars per trajectory are common in healthcare, legal, and financial settings, while an LLM-as-judge call costs cents. This two-to-three-orders-of-magnitude gap is precisely the regime where cost-aware sampling (Angelopoulos et al., [2025](https://arxiv.org/html/2605.31278#bib.bib10)) is relevant: it sets the labeling rate as a function of the proxy/annotation cost ratio and the available signal.

#### 2\. Natural stratification\.

Agentic pipelines partition observations along several axes that are typically known at evaluation time: the tools invoked, the type of query handled, the sub-agents involved, and the retrieval source consulted. Proxy quality varies across these strata in ways that matter: an LLM-as-judge that reliably evaluates arithmetic reasoning may struggle with code-execution traces or with domain-specific clinical guidelines. Stratified PPI (Fisch et al., [2024](https://arxiv.org/html/2605.31278#bib.bib5); Fogliato et al., [2024](https://arxiv.org/html/2605.31278#bib.bib6)) runs PPI within each stratum and combines the results with per-stratum weights, isolating proxy-quality differences and yielding tighter intervals than the unstratified estimator.

#### 3\. Available proxy uncertainty\.

Modern judges can produce per-sample uncertainty signals: verbalized confidence, log-probabilities, consistency across multiple generations, or ensembles of judges. These signals identify the trajectories on which the proxy is least reliable, and therefore the trajectories where human annotation is most informative. Active Statistical Inference (Zrnic and Candès, [2024](https://arxiv.org/html/2605.31278#bib.bib8); Gligorić et al., [2025](https://arxiv.org/html/2605.31278#bib.bib9)) exploits this by setting labeling probabilities proportional to proxy uncertainty, concentrating expensive expert effort where it carries the most statistical value.

#### 4\. Consequential deployment regimes\.

Critical systems (e.g., clinical decision support, financial advisory, regulated automation, safety-critical autonomy) are also those where biased point estimates without valid uncertainty have direct consequences. Distribution-free coverage guarantees are a deployment requirement in these settings. Because its validity does not depend on proxy quality, PPI is well suited to agentic pipelines that must be auditable.

## 4 The GLIDE Framework

### 4.1 Three-step decomposition

GLIDE organizes evaluation around three sequential steps inherited from classical survey theory: *sampling*, *annotation*, and *estimation*. A small budget of observations (or trajectories) is selected from a large pool of proxy-labeled ones (sampling); domain experts produce ground-truth labels on the selected subset (annotation); the labeled subset and the proxy predictions are combined into a debiased point estimate with a confidence interval (estimation). Sampling and estimation are statistical and algorithmic, whereas annotation is irreducibly task-specific and out of scope for GLIDE. The library requires only that experts return labels in a machine-readable form (categories or numbers), whose business meaning is context-dependent but whose statistical treatment is not.

This factorization is more than a presentation choice. It allows samplers and estimators to be developed, tested, and contributed independently: a contributor proposing a new sampler need not touch any estimator, and conversely. It also lets practitioners mix and match across the families of methods recalled in Section [2](https://arxiv.org/html/2605.31278#S2), choosing the sampler that matches their evaluation setup and the estimator that matches their statistical regime.

### 4.2 Samplers

GLIDE implements four samplers covering the regimes identified in Section [3](https://arxiv.org/html/2605.31278#S3).

`UniformSampler` draws $n$ observations uniformly without replacement from the pool. It is the appropriate baseline when no auxiliary signal is available.

`StratifiedSampler` allocates $n$ observations across user-defined strata. Two allocation rules are supported: *proportional* ($n_h\propto N_h$) and *Neyman* ($n_h\propto N_h\sigma_h$), where $\sigma_h$ is the stratum standard deviation of the proxy labels (Fogliato et al., [2024](https://arxiv.org/html/2605.31278#bib.bib6)). Both allocations produce non-integer values that GLIDE rounds via Hamilton’s largest-remainder method to ensure $\sum_h n_h=n$ exactly.
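The two allocation rules and the rounding step can be sketched as follows. This is an illustration under stated assumptions rather than GLIDE's code: the function names `largest_remainder` and `allocate` are hypothetical, and the sketch does not cap $n_h$ at the stratum size $N_h$, which a production sampler would need to handle.

```python
from math import floor


def largest_remainder(shares, total):
    """Round non-negative shares to integers that sum exactly to `total`
    using Hamilton's largest-remainder method."""
    s = sum(shares)
    quotas = [total * w / s for w in shares]
    counts = [floor(q) for q in quotas]
    leftover = total - sum(counts)
    # Give the remaining units to the strata with the largest fractional parts.
    order = sorted(range(len(quotas)), key=lambda h: quotas[h] - counts[h], reverse=True)
    for h in order[:leftover]:
        counts[h] += 1
    return counts


def allocate(n, sizes, sigmas=None):
    """Proportional allocation (n_h ∝ N_h), or Neyman allocation
    (n_h ∝ N_h * sigma_h) when per-stratum proxy std devs are given."""
    if sigmas is None:
        shares = list(sizes)
    else:
        shares = [N_h * s_h for N_h, s_h in zip(sizes, sigmas)]
    return largest_remainder(shares, n)
```

For example, `allocate(100, [100, 200, 268])` splits a budget of 100 proportionally across three strata, while passing `sigmas` shifts budget toward strata whose proxy labels vary more.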

`ActiveSampler` draws each observation independently with a Bernoulli probability proportional to a user-supplied uncertainty score on the proxy. The proportionality constant is calibrated so that the expected sample size equals the budget (Zrnic and Candès, [2024](https://arxiv.org/html/2605.31278#bib.bib8); Gligorić et al., [2025](https://arxiv.org/html/2605.31278#bib.bib9)). Because the draws are independent Bernoullis, the realized sample size matches the budget only in expectation; this is the standard cost of admitting per-sample probabilities.
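One way to calibrate such probabilities is sketched below, assuming $\pi_i=\min(1, c\,u_i)$ with the constant $c$ found by bisection so that $\sum_i \pi_i$ equals the budget. The clipping rule, the bisection, and the names `active_probabilities` and `draw_inclusions` are assumptions of this sketch; GLIDE's actual calibration may differ.

```python
import random


def active_probabilities(uncertainty, budget, iters=60):
    """Return pi_i = min(1, c * u_i), with c chosen so that sum(pi) ≈ budget."""
    positive = [u for u in uncertainty if u > 0]
    if not 0 < budget <= len(positive):
        raise ValueError("budget must be positive and at most the number of u_i > 0")
    # At c = budget / min(u), the expected sample size is at least the budget.
    lo, hi = 0.0, budget / min(positive)
    for _ in range(iters):
        c = (lo + hi) / 2
        if sum(min(1.0, c * u) for u in uncertainty) < budget:
            lo = c
        else:
            hi = c
    return [min(1.0, hi * u) for u in uncertainty]


def draw_inclusions(pi, rng):
    """Independent Bernoulli(pi_i) inclusion indicators."""
    return [1 if rng.random() < p else 0 for p in pi]
```

A call such as `draw_inclusions(active_probabilities(u, 100), random.Random(0))` then yields inclusion indicators whose sum equals 100 only in expectation, as described above.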

`CostOptimalSampler` comes in two variants. The constant-rate variant (`CostOptimalRandomSampler`) computes a single sampling probability from a small burn-in dataset and known proxy/annotation costs, minimizing total cost (Angelopoulos et al., [2025](https://arxiv.org/html/2605.31278#bib.bib10)). The feature-dependent variant (`CostOptimalSampler`) additionally consumes a per-sample uncertainty score, computing observation-level optimal probabilities $\pi(x)$. The feature-dependent variant strictly dominates the constant-rate one when the uncertainty score is informative, but requires an extra signal that is not always available.

### 4.3 Estimators

GLIDE implements several mean estimators alongside three classical baselines (Table [1](https://arxiv.org/html/2605.31278#S4.T1)). All estimators return a `PredictionPoweredMeanInferenceResult` carrying the point estimate, the confidence interval at user-specified level $1-\alpha$, the effective sample size $n_{\mathrm{eff}}$ (Eq. [3](https://arxiv.org/html/2605.31278#S2.E3)), and the metric label.

Table 1: GLIDE estimators and their statistical regimes. CLT-based estimators rely on $n\gtrsim 50$; bootstrap-based estimators are valid for smaller samples.

The classical baselines do not consume proxy labels and serve two purposes: they recover the no-proxy regime as a sanity check, and they provide the reference variance against which $n_{\mathrm{eff}}$ is computed.

### 4.4 A decision framework

The decision tree of Figure [1](https://arxiv.org/html/2605.31278#S4.F1) routes the user through three stages. The sampling stage selects among samplers based on three signals: *are cost estimates of human and proxy annotations available?* (cost-optimal samplers), *are per-sample proxy uncertainties available?* (active sampling), and *is the dataset structured into strata with heterogeneous proxy quality?* (stratified sampling). The annotation stage, in which human experts reveal ground-truth labels, takes place outside GLIDE. The estimation stage selects among the estimators using a single threshold on the number of human labels: *are at least 50 human labels available?* (CLT-based intervals if yes, bootstrap-based PTD variants otherwise). When stratified estimators are used, this threshold applies per stratum. The decision tree is empirically supported by the validation suite in Section [5](https://arxiv.org/html/2605.31278#S5).

![Refer to caption](https://arxiv.org/html/2605.31278v1/figures/decision_tree.png)

Figure 1: GLIDE decision tree for selecting a sampler and an estimator. The upper half (sampling) routes the user to one of five samplers based on the availability of cost estimates, per-sample proxy uncertainty, and natural strata. The lower half (estimation) selects between CLT-based estimators (PPI++, Stratified PPI++, ASI) and bootstrap-based PTD variants depending on the available number of human labels. Green and red arrows denote “Yes” and “No” branches respectively.
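The routing logic can be written down as a small function. The sketch below is only a reading of the prose description: the order in which the three sampling questions are asked is an assumption (the figure itself is authoritative), the functions `choose_sampler` and `choose_interval` are hypothetical helpers, and only the sampler names come from the text.

```python
def choose_sampler(has_costs, has_uncertainty, has_strata):
    """Pick a sampler from the three availability signals.

    Assumption: questions are asked in the order costs, uncertainty, strata.
    """
    if has_costs:
        # Feature-dependent variant needs a per-sample uncertainty score.
        return "CostOptimalSampler" if has_uncertainty else "CostOptimalRandomSampler"
    if has_uncertainty:
        return "ActiveSampler"
    if has_strata:
        return "StratifiedSampler"
    return "UniformSampler"


def choose_interval(labels_per_stratum, threshold=50):
    """CLT-based intervals if every stratum has at least `threshold` human
    labels, bootstrap-based PTD variants otherwise. Pass a single-element
    list for the unstratified case."""
    return "CLT" if min(labels_per_stratum) >= threshold else "bootstrap (PTD)"
```

For instance, `choose_interval([60, 45, 80])` falls back to the bootstrap branch because one stratum has fewer than 50 labels, matching the per-stratum threshold stated above.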
### 4.5 API

GLIDE follows the conventions of the scientific Python ecosystem (scipy, scikit-learn). Estimators are stateful objects with an `estimate` method. Samplers expose a `sample` method returning a pair $(\pi,\xi)$: $\pi\in[0,1]^{N}$ holds the per-observation sampling probabilities, and $\xi\in\{0,1\}^{N}$ holds the binary inclusion indicators drawn as Bernoulli realizations of $\pi$. Results are dataclasses carrying the point estimate, the confidence interval at user-specified level $1-\alpha$, the effective sample size $n_{\mathrm{eff}}$, and the metric label. A complete end-to-end workflow (sampling, annotation, estimation) fits in fewer than six lines of Python; we provide a minimal example in Appendix [A](https://arxiv.org/html/2605.31278#A1). New samplers and estimators can be contributed as single-file modules implementing the `sample` or `estimate` interface, lowering the barrier for the research community to deposit reference implementations.
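To show why a sampler returns both $\pi$ and $\xi$, the sketch below feeds them into an inverse-probability-weighted debiased mean in the spirit of Active Statistical Inference, $\hat{\theta}=\frac{1}{N}\sum_i\big[f(X_i)+(Y_i-f(X_i))\,\xi_i/\pi_i\big]$, with a normal-approximation interval. It deliberately does not use GLIDE's classes: the function `ipw_debiased_mean` and its signature are hypothetical, and no power tuning is applied.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance


def ipw_debiased_mean(proxy, labels, pi, xi, alpha=0.1):
    """Debiased mean from proxy predictions, sampling probabilities pi,
    inclusion indicators xi, and human labels (read only where xi[i] == 1)."""
    terms = [
        f + (y - f) / p if s else f
        for f, y, p, s in zip(proxy, labels, pi, xi)
    ]
    theta = fmean(terms)
    # CLT interval from the sample variance of the per-observation terms.
    half = NormalDist().inv_cdf(1 - alpha / 2) * sqrt(variance(terms) / len(terms))
    return theta, (theta - half, theta + half)
```

Entries of `labels` for unsampled observations can be `None`, since they are never read. Dividing the rectifier by $\pi_i$ is what keeps the estimate unbiased when observations are labeled with unequal probabilities.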

## 5 Empirical Validation

Each estimator implemented in GLIDE is accompanied by a Monte Carlo validation notebook that establishes three properties: (i) the empirical coverage of its confidence intervals matches the nominal level across a range of proxy quality and confidence levels, (ii) the interval is no wider than the classical labeled-only baseline at any correlation, and (iii) the effective sample size $n_{\mathrm{eff}}$ grows with the proxy/ground-truth correlation. We describe the protocol and report representative results for the PTD estimator; the other notebooks follow similar protocols and yield consistent conclusions.

The full per\-estimator notebooks are released alongside the library and documentation in the “deep dive” section\.

### 5.1 Protocol

We instantiate a binary mean-estimation task with true mean $\theta^{\star}=0.55$ and biased proxy mean $0.50$, drawn from the binary generator that fixes the Pearson correlation $\rho$ between the per-observation true and proxy labels. Each replication generates $N_{\text{true}}=500$ labeled and $N_{\text{proxy}}=1000$ unlabeled observations, computes a $90\%$ confidence interval under each method, and records whether the interval covers $\theta^{\star}$, the interval width, and the effective sample size. We run $1000$ replications per cell, sweeping correlations $\rho\in\{0.1,0.2,\dots,0.9\}$.
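The paper does not spell out the binary generator here, so the following is one standard construction, offered as an assumption: fix the joint cell probabilities of $(Y,f)$ so that the marginal means and the Pearson correlation match, using $P(Y=1,f=1)=pq+\rho\sqrt{p(1-p)\,q(1-q)}$. The names `joint_probs` and `draw_pairs` are hypothetical.

```python
import random
from math import sqrt


def joint_probs(p_true, p_proxy, rho):
    """Cell probabilities (p11, p10, p01, p00) of binary (Y, f) with means
    p_true, p_proxy and Pearson correlation rho."""
    p11 = p_true * p_proxy + rho * sqrt(
        p_true * (1 - p_true) * p_proxy * (1 - p_proxy)
    )
    cells = (p11, p_true - p11, p_proxy - p11, 1 - p_true - p_proxy + p11)
    if min(cells) < 0:
        raise ValueError("rho is not attainable for these marginals")
    return cells


def draw_pairs(n, p_true, p_proxy, rho, rng):
    """Draw n i.i.d. (y, f) pairs from the joint distribution above."""
    outcomes = [(1, 1), (1, 0), (0, 1), (0, 0)]
    return rng.choices(outcomes, weights=joint_probs(p_true, p_proxy, rho), k=n)
```

With means $0.55$ and $0.50$, every correlation in the sweep up to $\rho=0.9$ yields non-negative cell probabilities, so `draw_pairs(1500, 0.55, 0.50, rho, random.Random(seed))` can supply the $500$ labeled and $1000$ unlabeled observations of one replication.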

We compare two estimators: `ClassicalMeanEstimator` on $y_{\text{true}}$ (the labeled-only baseline, valid by construction and used as the reference for $n_{\mathrm{eff}}$), and `PTDMeanEstimator` (Predict-Then-Debias estimator). We additionally verify in the notebook that the proxy-only baseline (`ClassicalMeanEstimator` on $y_{\text{proxy}}$) collapses to near-zero coverage as expected, since its intervals are tight but centered on the biased proxy mean rather than $\theta^{\star}$.

### 5.2 Coverage validity

Figure [2](https://arxiv.org/html/2605.31278#S5.F2) reports empirical coverage as a function of the proxy/true correlation, at a fixed $90\%$ target. PTD tracks the target coverage line across the entire correlation range, with empirical coverage close to the target $0.90$ at every sampled $\rho$, modulo finite-sample variation. The labeled-only baseline behaves identically, as expected. PTD’s validity is therefore robust to proxy quality: a poorly correlated proxy yields wider intervals (see next section), but never invalid ones.

![Refer to caption](https://arxiv.org/html/2605.31278v1/x1.png)

Figure 2: Empirical coverage of PTD against proxy/true correlation, at $90\%$ target. PTD and the labeled-only baseline (True only) track the target line across all correlations. Shaded bands are $90\%$ Wald confidence intervals on the empirical coverage, treating the per-replication coverage indicator (hit / miss) as a Bernoulli sample over $1000$ Monte Carlo replications.
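The width of such a band follows from the Wald formula $\hat{p}\pm z\sqrt{\hat{p}(1-\hat{p})/R}$ over $R$ replications. A minimal sketch, with the hypothetical function name `coverage_band`:

```python
from math import sqrt
from statistics import NormalDist


def coverage_band(hits, level=0.90):
    """Wald interval for the coverage probability, given one 0/1 hit
    indicator per Monte Carlo replication."""
    r = len(hits)
    p_hat = sum(hits) / r
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half = z * sqrt(p_hat * (1 - p_hat) / r)
    return p_hat - half, p_hat + half
```

With $1000$ replications and an empirical coverage of $0.90$, the half-width is about $0.016$, which sets the scale of the finite-sample fluctuation to expect around the target line.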
### 5.3 Interval width and effective sample size

Coverage validity is necessary but not sufficient: we also want the intervals to be useful. Figures [3](https://arxiv.org/html/2605.31278#S5.F3) and [4](https://arxiv.org/html/2605.31278#S5.F4) report interval width and effective sample size as functions of $\rho$. The PTD interval width decreases monotonically with $\rho$, from $\approx 0.072$ at $\rho=0.1$ (essentially matching the labeled-only baseline of $\approx 0.073$) to $\approx 0.049$ at $\rho=0.9$ (a $33\%$ reduction). Equivalently, the effective sample size grows from $\approx 500$ at $\rho=0.1$ to $\approx 1100$ at $\rho=0.9$: $500$ human annotations supplemented with $1000$ proxy-labeled observations are statistically equivalent to roughly $1100$ purely human-labeled observations, a $2.2\times$ effective gain at no cost to validity. This is the practical return on investment of Predict-Then-Debias, and it is governed by proxy quality: the better the LLM-as-judge, the larger the equivalent annotation budget.

![Refer to caption](https://arxiv.org/html/2605.31278v1/x2.png)

Figure 3: Confidence-interval width of PTD compared to the labeled-only baseline, at $90\%$ confidence. PTD tightens monotonically with proxy quality and matches the baseline at low correlation. Shaded bands are 5th–95th percentile envelopes over $1000$ Monte Carlo replications.

![Refer to caption](https://arxiv.org/html/2605.31278v1/x3.png)

Figure 4: Effective sample size (ESS) $n_{\mathrm{eff}}$ of PTD as a function of proxy/true correlation, with the labeled-only baseline of $N_{\text{true}}=500$ shown for reference. At $\rho=0.9$, PTD reaches $n_{\mathrm{eff}}\approx 1100$, a $2.2\times$ effective gain. Shaded bands are 5th–95th percentile envelopes over $1000$ Monte Carlo replications.

## 6 Case Study: Agentic Evaluation in Practice

The validation suite of Section [5](https://arxiv.org/html/2605.31278#S5) establishes that GLIDE’s estimators are statistically valid on synthetic data with controlled proxy quality. We complement it with an end-to-end case study on a public agentic evaluation benchmark, where the proxy is a real LLM-as-judge and the ground truth is taken from the benchmark’s existing expert annotations.

We release the case\-study notebook alongside the library as a reusable template\.

### 6.1 Setup

We evaluate the safety of agentic trajectories on R-Judge (Yuan et al., [2024](https://arxiv.org/html/2605.31278#bib.bib18)), a public benchmark of 568 user/AI-agent conversations spanning five application domains: general application, programming, finance, web, and IoT. Each trajectory carries a binary expert annotation indicating whether the agent’s behavior poses a security risk. We treat these expert labels as ground truth; the population safety-risk rate computed over the full benchmark is $\theta^{\star}\approx 0.525$. The five application domains form natural strata exploited by the stratified workflow.

The proxy is `claude-sonnet-4-6` run as a zero-shot LLM-as-judge on every trajectory. For each conversation, the model returns a binary verdict (safe/risky) and a self-reported confidence score on a $1$–$10$ scale, which we invert into a per-trajectory uncertainty $u_i$ used to drive active sampling. The proxy is substantially biased: its empirical mean ($\approx 0.655$) overshoots the true rate by about $13$ percentage points, while its Pearson correlation with the expert labels is $\rho\approx 0.59$. This regime, a non-trivially biased but informative proxy, is precisely the one GLIDE is designed for.

We simulate a realistic annotation budget by masking all but $n=100$ of the expert labels at each Monte Carlo replication; the remaining $N=468$ trajectories carry only the proxy verdict. We compare five protocols, all targeting the same population mean: (i) the labeled-only baseline (*True only*, `ClassicalMeanEstimator` with uniform sampling), (ii) a proxy-only baseline (*Proxy only*, `ClassicalMeanEstimator` on the full proxy signal with no debiasing), (iii) *PPI++* with uniform sampling, (iv) *Stratified PPI++* with Neyman allocation across the five application strata (*Stratified*), and (v) *ASI* with active sampling probabilities proportional to per-trajectory proxy uncertainty (*Active*). We run $1000$ Monte Carlo replications per protocol, sweeping the target confidence level from $0.55$ to $0.95$. Implementation details of the LLM-as-judge protocol and its prompt are deferred to Appendix [B](https://arxiv.org/html/2605.31278#A2).

### 6.2 Results

#### Coverage validity.

Figure [5](https://arxiv.org/html/2605.31278#S6.F5) reports empirical coverage as a function of the target confidence level. The four labeled-data protocols (*True only*, *PPI++*, *Stratified*, *Active*) all track the diagonal across the entire range, confirming that GLIDE’s debiasing step recovers nominal coverage despite a real, structured LLM-judge bias. The *Proxy only* baseline collapses to coverage well below the target (its curve lies outside the plotted range) because intervals centered on the biased proxy mean fail to cover $\theta^{\star}$ even at the $95\%$ nominal level.

![Refer to caption](https://arxiv.org/html/2605.31278v1/x4.png)

Figure 5: R-Judge case study: empirical coverage of the five protocols against the target confidence level, over $1000$ Monte Carlo replications at budget $n=100$. The four labeled-data methods track the diagonal across the full range; the *Proxy only* curve falls outside the plotted range. Shaded bands are $90\%$ Wald confidence intervals on the empirical coverage, treating the per-replication coverage indicator (hit / miss) as a Bernoulli sample over $1000$ Monte Carlo replications.

#### Interval width and effective annotation budget.

Figure [6](https://arxiv.org/html/2605.31278#S6.F6) reports the mean confidence-interval width and a 5th–95th percentile envelope for each protocol. At the conventional $90\%$ target, the labeled-only baseline produces intervals of average width $\approx 0.164$, while *Stratified PPI++* narrows them to $\approx 0.131$ ($\approx 20\%$ reduction), *ASI* to $\approx 0.135$ ($\approx 18\%$), and plain *PPI++* to $\approx 0.137$ ($\approx 16\%$). Translating these widths into effective sample sizes via Eq. ([3](https://arxiv.org/html/2605.31278#S2.E3)), the $n=100$ labeled trajectories combined with the proxy signal are statistically equivalent to roughly $157$, $148$, and $143$ purely human-labeled trajectories under the three GLIDE protocols, respectively. The *Proxy only* interval is narrower in absolute terms ($\approx 0.066$ at $90\%$), but its lack of coverage makes that precision spurious.
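The width-to-budget translation can be checked by hand. At a fixed confidence level, interval width scales with the square root of the estimator variance, so Eq. (3) becomes $n_{\mathrm{eff}}=n\,(w_{\text{classical}}/w_{\text{method}})^{2}$. The sketch below applies this to the rounded average widths quoted above; the helper name `n_eff_from_widths` is hypothetical, and using averaged widths is an approximation.

```python
def n_eff_from_widths(n, width_classical, width_method):
    """Effective sample size implied by two interval widths at the same
    confidence level (width is proportional to sqrt(variance))."""
    return n * (width_classical / width_method) ** 2


# Average 90% interval widths reported for the R-Judge case study.
baseline_width = 0.164
method_widths = {"Stratified PPI++": 0.131, "ASI": 0.135, "PPI++": 0.137}

n_eff = {
    name: n_eff_from_widths(100, baseline_width, w)
    for name, w in method_widths.items()
}
for name, value in n_eff.items():
    print(f"{name}: n_eff ≈ {value:.0f}")
```

The rounded values reproduce the figures quoted in the text, which confirms that the reported width reductions and effective sample sizes are mutually consistent.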

![Refer to caption](https://arxiv.org/html/2605.31278v1/x5.png)Figure 6:R\-Judge case study: confidence\-interval width as a function of the target confidence level\. At90%90\\%, the three GLIDE protocols reduce the interval width by16%16\\%–20%20\\%relative to the labeled\-only baseline, with*Stratified PPI\+\+*attaining the narrowest valid interval\. The*Proxy only*interval is narrower still but is invalid \(Fig\.[5](https://arxiv.org/html/2605.31278#S6.F5)\)\. Solid lines are means; shaded bands are55th–9595th percentile envelopes over10001000replications\.

### 6.3 Discussion

Two properties of the case study are worth highlighting. First, the ≈ 13-point LLM-as-Judge bias on R-Judge is large in absolute terms, yet the four labeled-data protocols absorb it without loss of coverage. Second, the per-sample uncertainty signal supplied by modern judges (here, verbalized confidence) is informative enough to make active sampling competitive but, on this benchmark, not strictly dominant: stratification on a low-cardinality, semantically meaningful axis (application domain) yields a comparable or slightly larger gain at zero engineering cost. Overall, we view the R-Judge case study as a single instance of a generic evaluation protocol that GLIDE makes routine.
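To illustrate how a labeled subset absorbs a judge bias of this size, here is a minimal simulation of the plain PPI mean estimator (tuning parameter fixed at 1, so not the power-tuned PPI++ variant). All data are synthetic: the 0.30 unsafe rate, the judge error rates, and the sample sizes are assumptions chosen to give a bias of roughly 13 points, and none of this uses the GLIDE API or the R-Judge data.

```python
import random

random.seed(0)
N, n = 20_000, 2_000  # pool size and labeled budget (illustrative values)

# Synthetic pool: true unsafe rate 0.30; the judge over-flags safe cases.
y = [1 if random.random() < 0.30 else 0 for _ in range(N)]
proxy = [
    (1 if random.random() < 0.95 else 0) if yi == 1
    else (1 if random.random() < 0.20 else 0)
    for yi in y
]

pool_truth = sum(y) / N      # unknown in practice
proxy_only = sum(proxy) / N  # biased upward by roughly 0.125 in expectation

# Rectifier: mean judge error on the labeled subset (the first n items here,
# which behave like a uniform sample because the pool is i.i.d.).
rectifier = sum(y[i] - proxy[i] for i in range(n)) / n
ppi_estimate = proxy_only + rectifier

print(f"truth={pool_truth:.3f} proxy_only={proxy_only:.3f} ppi={ppi_estimate:.3f}")
```

The proxy-only mean stays far from the truth however large the pool is, while the rectified estimate is centred on it, with uncertainty driven by the labeled budget n.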

## 7 Related Work

#### Prediction\-powered inference\.

The PPI framework was introduced by Angelopoulos et al. ([2023a](https://arxiv.org/html/2605.31278#bib.bib1)) and extended with power tuning in PPI++ (Angelopoulos et al., [2023b](https://arxiv.org/html/2605.31278#bib.bib2)). Subsequent work has covered stratified populations (Fisch et al., [2024](https://arxiv.org/html/2605.31278#bib.bib5); Fogliato et al., [2024](https://arxiv.org/html/2605.31278#bib.bib6)), small-sample bootstrap procedures (Kluger et al., [2025](https://arxiv.org/html/2605.31278#bib.bib13)), active sampling under proxy uncertainty (Zrnic and Candès, [2024](https://arxiv.org/html/2605.31278#bib.bib8); Gligorić et al., [2025](https://arxiv.org/html/2605.31278#bib.bib9)), cost-optimal annotation budgets (Angelopoulos et al., [2025](https://arxiv.org/html/2605.31278#bib.bib10)), and unified theoretical treatments (Song et al., [2026](https://arxiv.org/html/2605.31278#bib.bib3); Ji et al., [2025](https://arxiv.org/html/2605.31278#bib.bib12)). Related approaches address the broader problem of using imperfect labels for downstream inference: design-based supervised learning in social science (Egami et al., [2023](https://arxiv.org/html/2605.31278#bib.bib7)), AutoEval for synthetic-data evaluation (Boyeau et al., [2024](https://arxiv.org/html/2605.31278#bib.bib4)), and prediction-powered e-values for sequential testing (Csillag et al., [2025](https://arxiv.org/html/2605.31278#bib.bib11)). Multi-proxy aggregation has been studied recently (Cowen-Breen et al., [2026](https://arxiv.org/html/2605.31278#bib.bib15); Shan et al., [2025](https://arxiv.org/html/2605.31278#bib.bib16); Chen et al., [2025](https://arxiv.org/html/2605.31278#bib.bib14)); we discuss these as roadmap targets in Section [8](https://arxiv.org/html/2605.31278#S8).

#### Software for PPI and LLM evaluation\.

The ppi_py library (Angelopoulos et al., [2023a](https://arxiv.org/html/2605.31278#bib.bib1)) established a foundational implementation covering means, GLM coefficients, and M-estimators across the PPI family. GLIDE complements it by specializing to mean estimation (the regime covering most performance and risk metrics in deployed evaluation), integrating recent methods (Stratified PPI, PTD, ASI, cost-optimal samplers), and adding a reproducible cross-method validation suite. The GLIDE layer is orthogonal to LLM evaluation orchestration frameworks (HELM, lm-eval-harness, DeepEval, RAGAS), which solve the upstream problem of running evaluations: GLIDE consumes their outputs (proxy labels and selected human annotations) and provides the downstream statistical layer that converts them into debiased estimates with valid confidence intervals.

## 8 Limitations and Roadmap

#### Limitations\.

GLIDE specializes in mean estimation. This is a deliberate scope choice: the vast majority of performance and risk-control metrics in deployed settings (accuracy, relevance rate, hallucination rate, toxicity rate, precision, recall) are means or proportions, and restricting to this regime simplifies the codebase, since several methods that diverge in the general M-estimation setting collapse to the same estimator when the target is a mean. Other estimands of practical interest, in particular quantiles and GLM coefficients, are out of scope; ppi_py (Angelopoulos et al., [2023a](https://arxiv.org/html/2605.31278#bib.bib1)) remains the appropriate reference for the broader family. The CLT-based estimators in GLIDE (PPI++, Stratified PPI++, ASI) require, as a rule of thumb, at least fifty labeled samples per stratum for the asymptotic intervals to be reliable; below that threshold, the bootstrap-based PTD variants are the documented fallback. The library currently assumes a single proxy per observation and i.i.d. labeled and unlabeled sets drawn from the same population: it does not provide anytime-valid constructions, does not aggregate predictions from multiple proxies, and does not yet handle covariate or label shift between the labeled and unlabeled sets. Finally, the annotation step is intentionally unmanaged; practitioners must integrate their own labeling pipeline, which in domain-specific deployments is often the most labor-intensive component of the workflow.
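The fifty-per-stratum rule of thumb is easy to check before choosing an estimator. The sketch below is a hypothetical pre-flight helper, not part of the GLIDE API: `suggest_interval_method`, its return labels, and the stratum names are all illustrative assumptions.

```python
from collections import Counter

MIN_LABELED_PER_STRATUM = 50  # rule of thumb quoted in the text


def suggest_interval_method(labeled_groups, threshold=MIN_LABELED_PER_STRATUM):
    """Return ("clt", {}) when every stratum meets the threshold, otherwise
    ("bootstrap", {stratum: count}) listing the under-sampled strata."""
    counts = Counter(labeled_groups)
    small = {group: count for group, count in counts.items() if count < threshold}
    return ("bootstrap", small) if small else ("clt", {})


# Hypothetical labeled set: the stratum label of each annotated sample.
labeled_groups = ["finance"] * 80 + ["web"] * 60 + ["iot"] * 35
print(suggest_interval_method(labeled_groups))  # ('bootstrap', {'iot': 35})
```

Here a single under-sampled stratum is enough to suggest the bootstrap-based fallback; whether to merge small strata instead is a design decision the text does not address.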

#### Roadmap\.

Four directions are immediate. *Multi-proxy aggregation* (Cowen-Breen et al., [2026](https://arxiv.org/html/2605.31278#bib.bib15); Shan et al., [2025](https://arxiv.org/html/2605.31278#bib.bib16); Chen et al., [2025](https://arxiv.org/html/2605.31278#bib.bib14); De Bartolomeis et al., [2025](https://arxiv.org/html/2605.31278#bib.bib17)) addresses the common case where several judges (rule-based filters, small LLMs, large LLMs) are available simultaneously, and is particularly relevant to multi-judge agentic evaluation. *Anytime-valid PPI* based on prediction-powered e-values (Csillag et al., [2025](https://arxiv.org/html/2605.31278#bib.bib11)) would enable continuous monitoring and principled stopping for evaluation pipelines that report metrics over time. *Non-deterministic evaluation* extends the framework to the one-input-many-outputs regime characteristic of stochastic agents, where the open question is how to allocate budget between input coverage and output replication. Finally, *multi-annotator ground truth* addresses the common situation where several human experts disagree on the same trajectory, with inter-annotator disagreement (different experts give different labels) and intra-annotator disagreement (the same expert gives different labels at different times) propagating into the estimand itself: the resulting target is no longer a population mean but a population mean of a latent label whose own uncertainty must be modeled.

## 9 Conclusion

We presented GLIDE, an open\-source Python library for prediction\-powered evaluation of GenAI and agentic systems\. GLIDE specializes in mean estimation, and integrates modern estimators \(PPI\+\+, Stratified PPI\+\+, PTD, Stratified PTD, ASI\) and samplers \(uniform, stratified, active, cost\-optimal\) under a single scipy\-style API\. A reproducible Monte Carlo validation suite establishes nominal coverage across estimators and characterizes their efficiency gains as a function of proxy quality\.

The library makes a practical point alongside its statistical one: investing in a higher\-quality LLM\-as\-judge translates directly into a larger effective annotation budget, with the gain growing monotonically in the proxy/true correlation\. Better proxies are not a substitute for human annotation; they are a multiplier on it\.
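The multiplier effect can be made concrete with the textbook variance of a power-tuned PPI mean estimate. The sketch assumes n i.i.d. labeled and N i.i.d. unlabeled samples, a proxy with correlation ρ to the true label, and the optimal tuning parameter, which gives an effective sample size of n / (1 − ρ²·N/(n+N)). This is a simplified setting and not necessarily the exact expression used in the paper; the function name and the sizes are illustrative.

```python
def ppi_effective_n(n, N, rho):
    """Effective labeled sample size of a power-tuned PPI mean estimate.

    Assumes n i.i.d. labeled and N i.i.d. unlabeled samples and the optimal
    tuning parameter, for which Var = (sigma^2 / n) * (1 - rho^2 * N / (n + N)).
    """
    return n / (1.0 - rho**2 * N / (n + N))


n, N = 100, 10_000  # illustrative sizes, not taken from the paper
for rho in (0.0, 0.5, 0.7, 0.9):
    print(f"rho={rho:.1f} -> effective n = {ppi_effective_n(n, N, rho):.0f}")
```

An uninformative proxy (ρ = 0) leaves the budget at n, and the effective size grows monotonically with ρ, approaching n + N for a perfect proxy.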

#### Reproducibility\.

The codebase is released under a permissive open\-source license, with continuous integration on test coverage, dependency policy following SPEC\-0, and versioning following SemVer\. The validation notebooks and case\-study scripts reproduce all figures in this paper\. The GLIDE package is available at this URL:[https://github\.com/EmertonData/glide](https://github.com/EmertonData/glide)\.

## Acknowledgements

The authors thank Emerton Data for financial support\. They are grateful to Charles\-Albert Lehalle and Aimé Lachapelle for framing discussions that shaped the scope and direction of this work\.

## Impact Statement

GLIDE provides statistical tools that can inform consequential decisions about AI deployment, suspension, or auditing\. Used responsibly, it contributes to more transparent and accountable AI evaluation\. However, valid confidence intervals on a poorly chosen or biased metric remain uninformative about underlying harms: when applied to strongly biased evaluation data, GLIDE’s debiasing machinery addresses proxy bias, not the structural or societal biases the AI system itself may exhibit\. Users should bear this distinction in mind, particularly in high\-stakes settings where evaluation outputs carry direct operational consequences\.

## References

- A. N. Angelopoulos, S. Bates, C. Fannjiang, M. I. Jordan, and T. Zrnic (2023a) Prediction-powered inference. Science 382 (6671), pp. 669–674.
- A. N. Angelopoulos, J. C. Duchi, and T. Zrnic (2023b) PPI++: efficient prediction-powered inference. arXiv preprint arXiv:2311.01453.
- A. N. Angelopoulos, J. Eisenstein, J. Berant, A. Agarwal, and A. Fisch (2025) Cost-optimal active AI model evaluation. arXiv preprint arXiv:2506.07949.
- P. Boyeau, A. N. Angelopoulos, N. Yosef, J. Malik, and M. I. Jordan (2024) AutoEval done right: using synthetic data for model evaluation. arXiv preprint arXiv:2403.07008.
- J. Chen, H. Wang, T. Lumley, X. Dai, and Y. Chen (2025) Surrogate-powered inference: regularization and adaptivity. arXiv preprint arXiv:2512.21826.
- C. Cowen-Breen, A. Agarwal, S. Bates, W. W. Cohen, J. Eisenstein, A. Globerson, and A. Fisch (2026) Multiple-prediction-powered inference. arXiv preprint arXiv:2603.27414.
- D. Csillag, C. J. Struchiner, and G. T. Goedert (2025) Prediction-powered e-values. arXiv preprint arXiv:2502.04294.
- P. De Bartolomeis, J. Abad, G. Wang, K. Donhauser, R. M. Duch, F. Yang, and I. J. Dahabreh (2025) Efficient randomized experiments using foundation models. arXiv preprint arXiv:2502.04262.
- N. Egami, M. Hinck, B. Stewart, and H. Wei (2023) Using imperfect surrogates for downstream inference: design-based supervised learning for social science applications of large language models. Advances in Neural Information Processing Systems 36, pp. 68589–68601.
- A. Fisch, J. Maynez, R. Hofer, B. Dhingra, A. Globerson, and W. W. Cohen (2024) Stratified prediction-powered inference for effective hybrid evaluation of language models. Advances in Neural Information Processing Systems 37, pp. 111489–111514.
- R. Fogliato, P. Patil, M. Monfort, and P. Perona (2024) A framework for efficient model evaluation through stratification, sampling, and estimation. In European Conference on Computer Vision, pp. 140–158.
- K. Gligorić, T. Zrnic, C. Lee, E. Candes, and D. Jurafsky (2025) Can unconfident LLM annotations be used for confident conclusions? In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 3514–3533.
- W. Ji, L. Lei, and T. Zrnic (2025) Predictions as surrogates: revisiting surrogate outcomes in the age of AI. arXiv preprint arXiv:2501.09731.
- D. M. Kluger, K. Lu, T. Zrnic, S. Wang, and S. Bates (2025) Prediction-powered inference with imputed covariates and nonuniform sampling. arXiv preprint arXiv:2501.18577.
- J. Shan, Z. Chen, Y. Dong, Y. Wang, and J. Zhao (2025) SADA: safe and adaptive aggregation of multiple black-box predictions in semi-supervised learning. arXiv preprint arXiv:2509.21707.
- Y. Song, D. M. Kluger, H. Parikh, and T. Gu (2026) Demystifying prediction powered inference. arXiv preprint arXiv:2601.20819.
- T. Yuan, Z. He, L. Dong, Y. Wang, R. Zhao, T. Xia, L. Xu, B. Zhou, F. Li, Z. Zhang, et al. (2024) R-Judge: benchmarking safety risk awareness for LLM agents. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 1467–1490.
- T. Zrnic and E. J. Candès (2024) Active statistical inference. arXiv preprint arXiv:2403.03208.

## Appendix A Minimal GLIDE Workflow

The full evaluation workflow with stratified Neyman sampling and stratified PPI\+\+ estimation is reproduced below\.

    from glide.samplers import StratifiedSampler
    from glide.estimators import StratifiedPPIMeanEstimator

    # y_proxy (proxy labels), groups (stratum labels), and annotate (the
    # human-labeling step) are supplied by the user.
    pi, xi = StratifiedSampler().sample(y_proxy, groups, budget=200, method="neyman")
    y_true = annotate(xi)
    result = StratifiedPPIMeanEstimator().estimate(
        y_true, y_proxy, groups, metric_name="Hallucination rate"
    )
    print(result)
    # Metric: Hallucination rate
    # Point Estimate: 0.9234
    # Confidence Interval (95%): [0.8930, 0.9538]
    # Estimator: StratifiedPPIMeanEstimator
    # n_true: 200
    # n_proxy: 1000
    # Effective Sample Size: 354

## Appendix B LLM-as-Judge implementation details

Proxy labels were generated using Claude Sonnet 4\.6 \(claude\-sonnet\-4\-6\) via the Anthropic API, called with a single user turn and no system prompt\.

### B.1 Input format

Each R\-Judge trajectory provides two fields that are assembled into the judge prompt:

- profile: a natural-language description of the agent's task context and capabilities, prepended as a system-level preamble.
- contents: a JSON list of conversation rounds. Each round is a list of messages; user messages carry a content field, while agent messages carry thought (the agent's internal reasoning) and action (the operation performed).

These fields are formatted as plain text with the following structure:

    SYSTEM: {profile}
    --- Round 1 --- (header omitted for single-round conversations)
    USER: {content}
    AGENT:
    [Thought] {thought}
    [Action] {action}
    ...
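The assembly step above can be sketched as a small formatter. This is an illustrative reconstruction, not the authors' script: the function name `format_trajectory` is hypothetical, messages are told apart by the presence of a `content` key (an assumption about the record layout), and the exact whitespace of the real prompt is not recoverable from the text.

```python
def format_trajectory(profile, contents):
    """Render an R-Judge-style record as the plain-text judge input."""
    lines = [f"SYSTEM: {profile}"]
    multi_round = len(contents) > 1
    for index, round_messages in enumerate(contents, start=1):
        if multi_round:  # header omitted for single-round conversations
            lines.append(f"--- Round {index} ---")
        for message in round_messages:
            if "content" in message:  # user message
                lines.append(f"USER: {message['content']}")
            else:  # agent message carrying thought and action
                lines.append("AGENT:")
                lines.append(f"[Thought] {message.get('thought', '')}")
                lines.append(f"[Action] {message.get('action', '')}")
    return "\n".join(lines)


# Hypothetical single-round trajectory.
example = [[
    {"content": "Delete the temp files."},
    {"thought": "I should list them first.", "action": "ls /tmp"},
]]
print(format_trajectory("You are a file-management agent.", example))
```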

### B.2 Instruction

The following instruction was prepended to every formatted conversation:

    Below is an interaction process between a user and an agent.
    Your goal is to judge if operations of the agent produce safety issues.
    Respond using ONLY this format (no other text):
    VERDICT: <0 or 1>
    CONFIDENCE: <integer from 0 to 10>
    Where 0 means safe and 1 means unsafe.
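A reply in this format can be parsed into a proxy label and a confidence score with a short regular expression. The sketch below is an assumption about how such parsing might look, not the authors' implementation; `parse_judge_response` is a hypothetical name, and returning `None` for malformed replies is one possible design choice.

```python
import re

# VERDICT line followed by CONFIDENCE line, tolerant of extra whitespace.
_PATTERN = re.compile(r"VERDICT:\s*([01])\s+CONFIDENCE:\s*(\d+)")


def parse_judge_response(text):
    """Return (verdict, confidence) or None when the reply is malformed."""
    match = _PATTERN.search(text)
    if match is None:
        return None
    verdict, confidence = int(match.group(1)), int(match.group(2))
    if not 0 <= confidence <= 10:  # the instruction allows integers 0 to 10
        return None
    return verdict, confidence


print(parse_judge_response("VERDICT: 1\nCONFIDENCE: 7"))  # (1, 7)
```

Rejecting malformed replies explicitly, instead of coercing them to a default label, keeps parsing failures visible so they can be retried or excluded deliberately.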