Predicting Inference-Time Scaling Gains from Labeled Validation-Set Output Statistics
Summary
This paper introduces a method to predict best-of-N inference scaling gains for language models using cheap statistics from a single labeled validation-set sampling pass. A compact predictor with three core features achieves Spearman ρ=0.90 with actual gains, enabling screening of configurations before expensive reward-model scoring.
# Predicting Inference-Time Scaling Gains from Labeled Validation-Set Output Statistics
Source: [https://arxiv.org/html/2606.02981](https://arxiv.org/html/2606.02981)
Luyang Zhang (Carnegie Mellon University, luyangz@andrew.cmu.edu) and Jingyan Li (Johns Hopkins University, jli336@alumni.jh.edu)
###### Abstract
Best-of-$N$ inference scaling (drawing $N$ candidate answers from a language model and returning the one a reward model ranks highest) improves accuracy by an amount that varies across models, but predicting that amount in advance currently requires running the procedure end-to-end. Prior work links cheap statistics of a model's sampled outputs and validation-set correctness (how often samples agree, how diverse they are, how confident the model is, and where correct samples appear) to model behavior, but does not isolate which of these form a stable, compact predictor of best-of-$N$ gain. We fit ridge predictors on features computed from a single labeled validation-set sampling pass, use bootstrap-Lasso as a stability analysis of the candidate feature set, and give a concentration analysis with an explicit linear-approximation residual. Across three base-model families, six post-training methods, and math and reasoning task domains, the stability analysis identifies a strict three-feature core spanning prompt-level agreement spread, label-assisted first-correct-sample position, and completion-length variance; a compact ridge predictor built from this core plus an entropy add-on reaches Spearman $\rho = 0.90$ with actual best-of-$N$ gain under a reward-model verifier. The intended use is labeled validation-set screening of candidate configurations before paying the full reward-model scoring cost.
## 1 Introduction
Inference\-time scaling \(drawing many candidate answers from a language model and selecting one with a verifier or by majority vote\) has become a leading tool for deploying large language models \(LLMs\) on reasoning\-heavy tasks\. However, its benefit varies across models and tasks\(Cobbe et al\.,[2021](https://arxiv.org/html/2606.02981#bib.bib7); Wang et al\.,[2023](https://arxiv.org/html/2606.02981#bib.bib31); Snell et al\.,[2024](https://arxiv.org/html/2606.02981#bib.bib28)\), and no reliable method predicts whether scaling will help on a new model\-task pair\. Because inference\-time scaling is itself computationally expensive, running it without such a predictor wastes compute whenever the gain turns out to be small\. This raises the central question of which labeled validation\-set output properties can predict, at low computational cost, whether inference\-time scaling will improve accuracy\.
Two streams of existing work each partially address this question\. The first measures inference\-time scaling gain directly across models and tasks by running scaling end\-to\-end\(Snell et al\.,[2024](https://arxiv.org/html/2606.02981#bib.bib28); Brown et al\.,[2024](https://arxiv.org/html/2606.02981#bib.bib2); Wu et al\.,[2024](https://arxiv.org/html/2606.02981#bib.bib32)\), showing the variance we want to predict but offering no efficient predictor\. The second extracts low\-cost statistics from a model’s sampled outputs, such as agreement, diversity, and confidence\(Kadavath et al\.,[2022](https://arxiv.org/html/2606.02981#bib.bib14); Holtzman et al\.,[2020](https://arxiv.org/html/2606.02981#bib.bib11); Wang et al\.,[2023](https://arxiv.org/html/2606.02981#bib.bib31)\), and links them to model behavior but not to scaling gain\. What is missing is a labeled\-validation\-set feature predictor of scaling gain that holds across base models and post\-training methods, comes with an explicit error decomposition, and identifies which sampled\-output and validation\-set properties carry the signal rather than treating each candidate feature in isolation\.
Our framework links low\-cost validation\-set sample statistics directly to scaling\-gain prediction\. For each configuration of base model, RL method, task domain, and seed, we sample from the model at three temperatures on labeled held\-out prompts and compute statistics in two groups: label\-free summaries of how often the model produces the same answer across samples, and validation\-assisted summaries of how agreement and correctness vary from prompt to prompt\. These features describe a model’s behavior at a small fraction of the cost of running scaling end\-to\-end\.
We then fit a single ridge regression jointly over all three temperatures and use bootstrap\-Lasso as a stability analysis to identify which candidate features repeatedly carry signal\. A concentration analysis decomposes the regression’s error into an explicit linear\-approximation residual, feature\-side uncertainty in the sampled statistics, and target\-side uncertainty when the population gain is replaced by its empirical estimate; the stochastic terms shrink as we draw more prompts or samples per prompt, and restricting attention to a small stable feature set keeps the feature\-side term controlled as the number of candidate features grows\.
On math and reasoning configurations under a reward-model verifier, the compact predictor recovers held-out best-of-$N$ gain rankings at Spearman $\rho = 0.90$, with mean top-$5$ precision $0.90$ for pre-deployment screening on labeled validation prompts. Bootstrap-Lasso isolates a strict three-feature stable core: prompt-level agreement spread, label-assisted first-correct-sample position, and completion-length variance; a per-prompt entropy summary is used as a predictive add-on. The compact predictor generalizes across held-out post-training recipes (Spearman $\rho \in [+0.78, +0.94]$) and remains informative for a second reward-model target after refitting ($\rho = +0.81$, within bootstrap noise of the same-cell headline).
We make three contributions\.
- Problem framing. We treat best-of-$N$ scaling-gain prediction as regression over cheap labeled validation-set sample statistics, with bootstrap-Lasso stability selection and a concentration analysis separating approximation residual, target-side noise, and feature-side noise.
- Identification of a stable feature core. A strict three-feature core captures the stable predictive signal across the configurations we evaluate, identifying which prompt-level agreement, correctness-position, and length summaries carry the signal; entropy is reported separately as a predictive add-on rather than as part of the stability-selected core.
- Scope and failure modes. The result holds for math and reasoning under a reward-model verifier; we identify majority-vote selection and code-domain transfer as failure modes tied to the agreement-rate feature family.
## 2 Related Work
Inference-time scaling. Best-of-$N$ with a verifier was introduced for grade-school math by Cobbe et al. ([2021](https://arxiv.org/html/2606.02981#bib.bib7)), and self-consistency, which selects the majority answer among sampled chains of thought, was popularized by Wang et al. ([2023](https://arxiv.org/html/2606.02981#bib.bib31)). Stronger verifiers widen the gain further, with process reward models trained on step-level annotations (Lightman et al., [2024](https://arxiv.org/html/2606.02981#bib.bib18); Uesato et al., [2022](https://arxiv.org/html/2606.02981#bib.bib29)) sometimes adding tens of points over a single sample. Snell et al. ([2024](https://arxiv.org/html/2606.02981#bib.bib28)) study when test-time compute helps and show that the optimal allocation depends on prompt difficulty and base-model competence.
Predicting behavior from cheap signals\.Scaling laws relate loss to parameters, data, and compute\(Kaplan et al\.,[2020](https://arxiv.org/html/2606.02981#bib.bib15); Hoffmann et al\.,[2022](https://arxiv.org/html/2606.02981#bib.bib10)\), and downstream accuracy forecasts derived from these inputs have proved unreliable\(McKenzie et al\.,[2023](https://arxiv.org/html/2606.02981#bib.bib20)\); the choice of evaluation metric itself can manufacture or hide apparent ability jumps\(Schaeffer et al\.,[2023](https://arxiv.org/html/2606.02981#bib.bib26)\)\. A closer line predicts capabilities from properties of the trained model itself\.Burnell et al\. \([2023](https://arxiv.org/html/2606.02981#bib.bib3)\)factor benchmark results into latent skills, andRuan et al\. \([2024](https://arxiv.org/html/2606.02981#bib.bib25)\)fit observational scaling laws across checkpoints to extrapolate task accuracy\. None target inference\-time scaling gain\. Two studies that examine heterogeneity in BoN benefit,Brown et al\. \([2024](https://arxiv.org/html/2606.02981#bib.bib2)\)andWu et al\. \([2024](https://arxiv.org/html/2606.02981#bib.bib32)\), characterize the gain after running BoN on each model\.
Output distribution and probing\.Calibration work shows that an LLM’s confidence and entropy carry information about correctness\(Kadavath et al\.,[2022](https://arxiv.org/html/2606.02981#bib.bib14); Jiang et al\.,[2021](https://arxiv.org/html/2606.02981#bib.bib13)\)\. Probing recovers task\-relevant variables that are not obvious from outputs, including unsupervised elicitation of truthfulness\(Burns et al\.,[2023](https://arxiv.org/html/2606.02981#bib.bib4)\)and internal activations that track whether assertions are correct\(Azaria and Mitchell,[2023](https://arxiv.org/html/2606.02981#bib.bib1)\)\. At the output level, sample\-diversity measures such as self\-BLEU\(Yu et al\.,[2017](https://arxiv.org/html/2606.02981#bib.bib33)\)have long been used to characterize generative models, and self\-consistency\(Wang et al\.,[2023](https://arxiv.org/html/2606.02981#bib.bib31)\)is itself a one\-feature summary of agreement across stochastic samples\. Work on RL fine\-tuning observes that preference optimization sharpens the output distribution and suppresses useful diversity\(Rafailov et al\.,[2023](https://arxiv.org/html/2606.02981#bib.bib24); Kirk et al\.,[2024](https://arxiv.org/html/2606.02981#bib.bib16)\), which is directly relevant to whether more candidates can still help\. What this literature lacks is a quantitative link from such distributional properties to inference\-time scaling gain\.
## 3 Framework and Theoretical Analysis
This section defines configurations and sampled-output statistics ([Section 3.1](https://arxiv.org/html/2606.02981#S3.SS1)), builds predictors with stability analysis ([Section 3.2](https://arxiv.org/html/2606.02981#S3.SS2)), and gives a concentration analysis explaining when a small feature set yields reliable rankings ([Section 3.3](https://arxiv.org/html/2606.02981#S3.SS3)).
### 3.1 Configuration, Gain, and Features
Predicting whether inference-time scaling helps requires a configuration, a gain target, and inexpensive labeled-validation-set statistics. We define a configuration as a trained model and its training conditions, use best-of-$N$ accuracy minus $\mathrm{pass}@1$ as the gain, and compute one round of statistics spanning answer agreement, prompt-level variation, correctness position, and reward-model scores.
Brown et al. ([2024](https://arxiv.org/html/2606.02981#bib.bib2)); Wu et al. ([2024](https://arxiv.org/html/2606.02981#bib.bib32)); Snell et al. ([2024](https://arxiv.org/html/2606.02981#bib.bib28)) report that inference-time scaling gain varies with the base model, post-training method, and task domain; related analyses show that RLHF and BoN also change output diversity and generalization behavior (Kirk et al., [2024](https://arxiv.org/html/2606.02981#bib.bib16)). We write a configuration as $c = (\pi_{\theta}, \mathrm{RL}, \mathcal{D}, s)$, where $\pi_{\theta}$ is the fine-tuned model, $\mathrm{RL}$ is the post-training method (including supervised fine-tuning, SFT), $\mathcal{D}$ is the task domain, and $s$ is the training seed when multiple runs are available. For each configuration $c$ and temperature $T$, the trained model induces a distribution over completions. *Best-of-$N$* (BoN) draws $k$ samples and returns the highest-scoring one under a reward model; we write its accuracy as $\mathrm{BoN}@k$. *Majority voting* draws $k$ samples and returns the plurality answer, with accuracy $\mathrm{MV}@k$. We write $\mathrm{pass}@1$ for mean correctness across the same $k$ samples, the standard empirical estimate of single-sample correctness (Chen et al., [2021](https://arxiv.org/html/2606.02981#bib.bib5)).
How we summarize the gain. The standard scalar summary of best-of-$N$ improvement is the additive form $G_{\mathrm{add}} \equiv \mathrm{BoN}@k - \mathrm{pass}@1$ (Cobbe et al., [2021](https://arxiv.org/html/2606.02981#bib.bib7); Brown et al., [2024](https://arxiv.org/html/2606.02981#bib.bib2)). We use it as the primary target because it counts extra correct answers per prompt, and also consider three reparameterizations,
- $G_{\mathrm{mult}} \equiv \mathrm{BoN}@k / \mathrm{pass}@1$ (multiplicative),
- $G_{\mathrm{norm}} \equiv (\mathrm{BoN}@k - \mathrm{pass}@1) / (1 - \mathrm{pass}@1 + \varepsilon_{0})$ (fraction of remaining gap closed; the small constant $\varepsilon_{0} = 0.1$ truncates the denominator away from zero near saturated cells),
- $G_{\mathrm{log}} \equiv \log \mathrm{BoN}@k - \log \mathrm{pass}@1$ (log-ratio),

plus the majority-voting variant $G_{\mathrm{MV}} \equiv \mathrm{MV}@k - \mathrm{pass}@1$. All four belong to a single class. A *gain function* is any mapping $G : [0,1]^{2} \to \mathbb{R}$ from $(\mathrm{BoN}@k, \mathrm{pass}@1)$ to a real number; we call gains computed against a reward-model score *verifier-anchored* and gains against majority vote *vote-anchored*. The *Lipschitz family* is

$$\mathcal{G}_{L} = \bigl\{ G : G \text{ is } L_{G}\text{-Lipschitz on } [0,1]^{2} \text{ for some finite } L_{G} \bigr\}. \tag{1}$$

The additive gain

$$g(c,T) \;=\; \mathrm{BoN}@k(c,T) \;-\; \mathrm{pass}@1(c,T) \tag{2}$$

is Lipschitz on $[0,1]^{2}$ with constant $L_{g} = \sqrt{2}$.
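As a concrete illustration of the gain summaries defined above, the following minimal sketch evaluates the additive, multiplicative, normalized, and log-ratio forms for a single (configuration, temperature) cell. The function name `gain_summaries` and the choice to return NaN where a ratio or logarithm is undefined are assumptions made for illustration; only the formulas and $\varepsilon_{0} = 0.1$ come from the text.

```python
import math

EPS0 = 0.1  # epsilon_0 from the text: keeps the G_norm denominator away from zero


def gain_summaries(bon_at_k: float, pass_at_1: float) -> dict:
    """Return the four BoN gain summaries for one (configuration, temperature) cell.

    Both inputs are accuracies in [0, 1]. Returning NaN for undefined
    ratios/logs is an illustrative choice, not something the paper specifies.
    """
    if not (0.0 <= bon_at_k <= 1.0 and 0.0 <= pass_at_1 <= 1.0):
        raise ValueError("accuracies must lie in [0, 1]")
    gains = {
        "add": bon_at_k - pass_at_1,
        "norm": (bon_at_k - pass_at_1) / (1.0 - pass_at_1 + EPS0),
    }
    # The multiplicative and log-ratio forms blow up as pass@1 -> 0,
    # which is why they fall outside the Lipschitz family.
    gains["mult"] = bon_at_k / pass_at_1 if pass_at_1 > 0 else math.nan
    gains["log"] = (
        math.log(bon_at_k) - math.log(pass_at_1)
        if bon_at_k > 0 and pass_at_1 > 0
        else math.nan
    )
    return gains
```

For example, `gain_summaries(0.6, 0.4)` gives an additive gain of about 0.2 and a multiplicative gain of about 1.5, while a cell with `pass_at_1 == 0` yields NaN for the two non-Lipschitz forms.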
Agreement-rate features. The agreement-rate family measures how concentrated the model's output distribution is at each prompt. Its members are the agreement rate (the average over prompts of the fraction of samples whose extracted answer matches the most-frequent one), sample-diversity measures (self-BLEU (Yu et al., [2017](https://arxiv.org/html/2606.02981#bib.bib33)), unique-bigram ratio), an embedding-similarity score among samples, and summaries of the model's sample log-probabilities (Kadavath et al., [2022](https://arxiv.org/html/2606.02981#bib.bib14)).
Variance refinements. A second family targets how agreement *varies across prompts*, which prompt-averages miss. Its primary member is *majority-fraction spread* (the cross-prompt standard deviation of the per-prompt most-frequent-answer fraction), supplemented by variance- and entropy-based summaries plus one label-assisted statistic: the median sample index at which the first correct answer appears (full list in [Table 4](https://arxiv.org/html/2606.02981#A2.T4)). These refinements complement the agreement-rate family, and the label-assisted member makes the headline predictor a labeled-validation-set screen rather than an unlabeled diagnostic.
Cross-reward-model features. An exploratory third family compares scores from two reward models and reports their disagreement; we treat it as an under-powered robustness check rather than a primary component ([Table 12](https://arxiv.org/html/2606.02981#A3.T12) in the appendix).
What the families measure together. The agreement-rate family and its variance refinements measure the concentration, prompt-level spread, and validation-set success pattern of sampled answers, all of which depend on how reliably a reward model can separate correct from incorrect samples in a configuration's output distribution.
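The agreement rate, majority-fraction spread, and first-correct-sample position can be sketched in a few lines. This is a minimal illustration, not the paper's extraction code: the function names are hypothetical, the spread uses the population standard deviation, the sample index is 1-based, and prompts with no correct sample are dropped from the median. The text does not specify any of these three conventions.

```python
from collections import Counter
import statistics


def majority_fraction(answers):
    """Fraction of samples whose extracted answer equals the most frequent one."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


def first_correct_index(answers, gold):
    """1-based index of the first correct sample, or None if none is correct."""
    for i, answer in enumerate(answers, start=1):
        if answer == gold:
            return i
    return None


def agreement_features(samples, gold):
    """samples: one list of extracted answers per prompt; gold: one label per prompt."""
    fracs = [majority_fraction(a) for a in samples]
    firsts = [first_correct_index(a, g) for a, g in zip(samples, gold)]
    solved = [f for f in firsts if f is not None]  # assumption: unsolved prompts dropped
    return {
        "agreement_rate": statistics.fmean(fracs),
        "majority_fraction_spread": statistics.pstdev(fracs),  # population std (assumption)
        "median_first_correct": statistics.median(solved) if solved else None,
    }
```

Only `median_first_correct` needs labels; the other two summaries are label-free, which mirrors the split between the two feature groups described above.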
### 3.2 Predictor and Feature Selection
The regression pools rows from three sampling temperatures into a single fit\. Stability selection is used as a feature\-analysis layer: it identifies features that remain selected across resamples, but the LOSO predictive comparisons below evaluate fixed feature families or fixed compact designs rather than a fully nested automatic feature\-discovery procedure\.
Let the feature vector $\mathbf{x}(c,T) \in \mathbb{R}^{d}$ collect the agreement-rate and variance-refinement features for configuration $c$ at temperature $T$, where $d$ is the feature dimension. The pooled cross-temperature design is

$$g(c,T) \;=\; \boldsymbol{\beta}^{\top} \mathbf{x}(c,T) \;+\; \gamma\,(T - T_{0}) \;+\; \varepsilon(c,T), \tag{3}$$

fit by ridge regression with regularization strength $\alpha$ chosen by inner cross-validation, where $\boldsymbol{\beta}$ are the regression coefficients, $\gamma$ is a temperature main-effect coefficient, $T_{0}$ is the median operating temperature, and $\varepsilon(c,T)$ is the residual. We call this pooled regression specification the *joint cross-temperature design*. Sharing the coefficient vector $\boldsymbol{\beta}$ across temperatures triples the row count without increasing the parameter count, which keeps the regression well-posed at our sample size. Feature-by-temperature interactions are omitted because they overfit at our row count.
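A minimal NumPy sketch of the joint cross-temperature design may help make the pooling concrete. It stacks rows from all temperatures, appends the centered temperature as one extra column, and solves the ridge normal equations in closed form. The function names, the unpenalized intercept, penalizing $\gamma$ together with $\boldsymbol{\beta}$, and passing $\alpha$ in directly (rather than choosing it by inner cross-validation) are all assumptions of this sketch; $T_{0} = 0.7$ is the median of the three sampling temperatures used in the experiments.

```python
import numpy as np


def fit_pooled_ridge(X, T, g, alpha, T0=0.7):
    """Fit g ~ b0 + beta . x + gamma * (T - T0) by ridge on pooled rows.

    X: (rows, d) features; T: (rows,) temperatures; g: (rows,) gains.
    The intercept b0 is unpenalized (handled by centering); beta and gamma
    share one penalty alpha. Both choices are illustrative assumptions.
    """
    X, T, g = np.asarray(X, float), np.asarray(T, float), np.asarray(g, float)
    Z = np.column_stack([X, T - T0])        # shared beta plus one temperature column
    z_mean, g_mean = Z.mean(axis=0), g.mean()
    Zc = Z - z_mean
    A = Zc.T @ Zc + alpha * np.eye(Z.shape[1])
    w = np.linalg.solve(A, Zc.T @ (g - g_mean))
    intercept = g_mean - z_mean @ w
    return w[:-1], w[-1], intercept          # beta, gamma, b0


def predict(X, T, beta, gamma, intercept, T0=0.7):
    """Predicted gain for each row under the pooled design."""
    X, T = np.asarray(X, float), np.asarray(T, float)
    return intercept + X @ beta + gamma * (T - T0)
```

Because every temperature shares the same `beta`, three temperatures per configuration add rows but only one extra parameter (`gamma`), which is the row-count argument made in the paragraph above.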
Out\-of\-sample evaluation holds out whole configurations at two levels of granularity\.*Leave\-one\-configuration\-out*\(LOO\) drops all temperatures of a single configuration\.*Leave\-one\-set\-out*\(LOSO\) drops all rows of a \(base family, domain\) cluster and is the harder generalization test, which we report as the primary result\. Uncertainty is quantified by a*cluster bootstrap*, in which configurations are resampled with replacement at the configuration level so that all temperatures of a sampled configuration appear in the same resample\. The held\-out Spearman correlation is recomputed on each resample, and the resulting percentiles form the confidence interval\.
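The cluster bootstrap described above can be sketched with the standard library alone. Configurations, not rows, are resampled with replacement, so all temperatures of a drawn configuration enter the resample together. The helper names and the simple nearest-rank percentile rule are assumptions of this sketch; it also takes the held-out predictions as given rather than refitting inside each resample.

```python
import random


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = 0.5 * (i + j) + 1.0
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0.0 or vy == 0.0:
        return float("nan")  # undefined when one side is constant
    return cov / (vx * vy) ** 0.5


def cluster_bootstrap_ci(pred, actual, config_ids, n_boot=1000, level=0.95, seed=0):
    """Point estimate and percentile CI, resampling whole configurations."""
    rng = random.Random(seed)
    rows = {}
    for i, c in enumerate(config_ids):
        rows.setdefault(c, []).append(i)
    configs = list(rows)
    stats = []
    for _ in range(n_boot):
        idx = [i for c in rng.choices(configs, k=len(configs)) for i in rows[c]]
        rho = spearman([pred[i] for i in idx], [actual[i] for i in idx])
        if rho == rho:  # drop NaN resamples
            stats.append(rho)
    stats.sort()
    lo = stats[int(round((1 - level) / 2 * (len(stats) - 1)))]
    hi = stats[int(round((1 + level) / 2 * (len(stats) - 1)))]
    return spearman(pred, actual), (lo, hi)
```

Resampling at the configuration level matters because the three temperature rows of one configuration are strongly dependent; a row-level bootstrap would treat them as independent and understate the interval width.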
Stability selection. To identify features the regression robustly selects, we apply bootstrap-Lasso stability selection (Meinshausen and Bühlmann, [2010](https://arxiv.org/html/2606.02981#bib.bib21)). On each configuration-level bootstrap resample, we fit a Lasso with cross-validated regularization on the joint cross-temperature design and record non-zero coefficients. A feature is *stable* if its selection frequency exceeds the fixed $80\%$ threshold recommended by Meinshausen and Bühlmann ([2010](https://arxiv.org/html/2606.02981#bib.bib21)). This provides an interpretive check on whether variance refinements add value beyond agreement-rate features; the stable subset also attains a small coefficient $L^{1}$ norm, keeping feature-side error controlled.
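The selection-frequency computation can be sketched as follows. This is a simplified stand-in rather than the paper's procedure: it uses a small coordinate-descent Lasso with a fixed penalty `lam` supplied by the caller, whereas the text uses cross-validated regularization on each resample. The function names and default iteration counts are hypothetical.

```python
import numpy as np


def lasso_cd(X, y, lam, n_iter=200):
    """Lasso by cyclic coordinate descent.

    Minimizes (1 / 2n) * ||y - X w||^2 + lam * ||w||_1, assuming X is
    standardized and y is centered by the caller.
    """
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()
    for _ in range(n_iter):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * w[j]            # remove feature j's contribution
            rho = X[:, j] @ resid / n
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]  # soft threshold
            resid -= X[:, j] * w[j]            # add the updated contribution back
    return w


def selection_frequencies(X, y, config_ids, lam, n_boot=200, seed=0):
    """Share of configuration-level bootstrap resamples in which each feature is non-zero."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    config_ids = np.asarray(config_ids)
    rng = np.random.default_rng(seed)
    configs = np.unique(config_ids)
    counts = np.zeros(X.shape[1])
    for _ in range(n_boot):
        drawn = rng.choice(configs, size=len(configs), replace=True)
        idx = np.concatenate([np.flatnonzero(config_ids == c) for c in drawn])
        Xb, yb = X[idx], y[idx]
        sd = Xb.std(axis=0)
        sd[sd == 0.0] = 1.0                    # guard against constant columns
        Xb = (Xb - Xb.mean(axis=0)) / sd
        counts += (lasso_cd(Xb, yb - yb.mean(), lam) != 0.0)
    return counts / n_boot
```

A feature would then be called stable when its entry in the returned vector exceeds 0.8, matching the threshold stated above.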
### 3.3 Theoretical Analysis
This subsection states a concentration analysis that explains the predictor's behavior. The analysis decomposes the predictor's error into an explicit linear-approximation residual, a feature-side term (uncertainty in the features), and a coefficient-estimation term; when the population gain is replaced by its empirical estimate, a target-side term is added. The target-side transfer applies uniformly across the Lipschitz family $\mathcal{G}_{L}$ defined in [Section 3.1](https://arxiv.org/html/2606.02981#S3.SS1), scaling by the Lipschitz constant $L_{G}$ for any $G \in \mathcal{G}_{L}$.
Setup and assumptions. A configuration $c$ contributes population-level estimators $\mathrm{BoN}@k(c,T), \mathrm{pass}@1(c,T) \in [0,1]$, from which any $G \in \mathcal{G}_{L}$ produces a population-level gain $G(c,T)$; the empirical counterparts $\widehat{\mathrm{BoN}@k}$, $\widehat{\mathrm{pass}@1}$, and $\hat{G}$ are computed from $P$ prompts and $n_{\text{samp}}$ completions per prompt. The feature vector $\mathbf{x}(c,T) \in \mathbb{R}^{d}$ and its empirical counterpart $\hat{\mathbf{x}}$ are defined as in [Section 3.1](https://arxiv.org/html/2606.02981#S3.SS1); the joint ridge fit on $m$ training configurations produces $\hat{\boldsymbol{\beta}} \in \mathbb{R}^{d}$, with population-optimal coefficient $\boldsymbol{\beta}^{\star}$. We assume within-prompt i.i.d. completions (vLLM at a fixed temperature satisfies this by construction), prompt-level i.i.d. across the evaluation set (standard for held-out test prompts), and bounded features (each statistic in [Section 3.1](https://arxiv.org/html/2606.02981#S3.SS1) is bounded on $[0,1]$ by construction or by rescaling). We write $\sigma_{\text{tgt}}$ for the Hoeffding deviation bound on the pair $(\widehat{\mathrm{BoN}@k}, \widehat{\mathrm{pass}@1})$ at the per-configuration sample budget, define the coefficient-weighted feature-error envelope as $E_{\beta}(\hat{\boldsymbol{\beta}}) := \sum_{j} |\hat{\beta}_{j}|\,\varepsilon_{j}$, where $\varepsilon_{j}$ is a feature-specific concentration radius at the per-(configuration, feature) budget, and define the linear-approximation residual

$$A_{G}^{\star} := \sup_{c,T} \bigl| \boldsymbol{\beta}^{\star\top} \mathbf{x}(c,T) - G(c,T) \bigr|$$

over the training configurations and temperatures.
###### Proposition 1 (Joint concentration on training configurations, Lipschitz family).
Let $G \in \mathcal{G}_{L}$ be a Lipschitz gain function with constant $L_{G}$, and fix $\delta \in (0,1)$ split symmetrically as $\delta_{\text{tgt}} = \delta_{\text{feat}} = \delta/2$. Under the three assumptions above, with probability at least $1 - \delta$ simultaneously over all $m$ training configurations $c$ and temperatures $T$,

$$\bigl|\, \hat{\boldsymbol{\beta}}^{\top} \hat{\mathbf{x}}(c,T) - G(c,T) \,\bigr| \;\leq\; \underbrace{A_{G}^{\star}}_{\text{approx.}} \;+\; \underbrace{E_{\beta}(\hat{\boldsymbol{\beta}})}_{\text{data}} \;+\; \underbrace{\|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}^{\star}\|_{2} \cdot \|\mathbf{x}(c,T)\|_{2}}_{\text{coefficient}}. \tag{4}$$

Moreover, the empirical gain satisfies $|\hat{G}(c,T) - G(c,T)| \leq L_{G}\,\sigma_{\text{tgt}}$, so comparing predictions to measured gains adds the target-side radius $L_{G}\,\sigma_{\text{tgt}}$ to the right-hand side. For our primary target $g = G_{\mathrm{add}}$ the Lipschitz constant is $L_{g} = \sqrt{2}$. The target-side transfer is silent on $G \notin \mathcal{G}_{L}$ (e.g., multiplicative or log-ratio gains, which are non-Lipschitz near $\mathrm{pass}@1 = 0$); we report empirical results on those gains in [Section 4.5](https://arxiv.org/html/2606.02981#S4.SS5) as a structural test of whether the predictor's signal extends beyond the Lipschitz class.
The full proof is given in [Section B.2](https://arxiv.org/html/2606.02981#A2.SS2).
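For readers who want to check the constant $L_{g} = \sqrt{2}$ quoted for the additive gain, a short derivation is sketched here. It assumes the Lipschitz condition is taken with respect to the Euclidean norm on $[0,1]^{2}$ and that $\sigma_{\text{tgt}}$ bounds the Euclidean deviation of the pair; the text does not state the norm explicitly. Writing $a = \mathrm{BoN}@k$ and $b = \mathrm{pass}@1$, so that $g(a,b) = a - b$:

$$\bigl| g(a,b) - g(a',b') \bigr| = \bigl| (a - a') - (b - b') \bigr| \leq |a - a'| + |b - b'| \leq \sqrt{2}\,\sqrt{(a - a')^{2} + (b - b')^{2}},$$

where the last step is the Cauchy-Schwarz inequality applied to the vectors $(1,1)$ and $(|a - a'|, |b - b'|)$. Equality holds when $a - a' = -(b - b')$, so $\sqrt{2}$ cannot be improved. Applying this with $(a', b')$ set to the empirical pair gives

$$\bigl| \hat{g}(c,T) - g(c,T) \bigr| \leq \sqrt{2}\,\bigl\| (\widehat{\mathrm{BoN}@k}, \widehat{\mathrm{pass}@1}) - (\mathrm{BoN}@k, \mathrm{pass}@1) \bigr\|_{2} \leq \sqrt{2}\,\sigma_{\text{tgt}},$$

which is the target-side radius $L_{G}\,\sigma_{\text{tgt}}$ for the additive gain.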
###### Corollary 1 (Conditional held-out transfer).
Assume the training and held-out configurations are drawn i.i.d. from a common population, $\|\mathbf{x}(c,T)\|_{2} \leq R_{x}$, and the same feature-concentration radii $\varepsilon_{j}$ hold for an independent held-out configuration. Suppose further that the population approximation residual is bounded on that support,

$$A_{G,\mathrm{pop}} := \sup_{c,T} \bigl| \boldsymbol{\beta}^{\star\top} \mathbf{x}(c,T) - G(c,T) \bigr|,$$

and write $\eta_{m} := C_{\mathrm{ridge}} \sqrt{(d + \log(1/\delta_{\mathrm{est}}))/m}$ for a ridge-estimation radius satisfying $\|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}^{\star}\|_{2} \leq \eta_{m}$. Then, for a fresh held-out configuration at a fixed $T$, with probability at least $1 - \delta - \delta_{\mathrm{est}}$,

$$\bigl|\, \hat{\boldsymbol{\beta}}^{\top} \hat{\mathbf{x}}(c,T) - \hat{G}(c,T) \,\bigr| \;\leq\; A_{G,\mathrm{pop}} + E_{\beta}(\hat{\boldsymbol{\beta}}) + R_{x}\,\eta_{m} + L_{G}\,\sigma_{\mathrm{tgt}}. \tag{5}$$
[Corollary 1](https://arxiv.org/html/2606.02981#Thmcorollary1) makes the held-out requirements explicit alongside the training-cell bound: transfer depends on the population approximation residual of the selected feature span and on the usual $\sqrt{d/m}$ coefficient-estimation rate for ridge. In our small-$m$ regime, the stochastic terms are controlled by the sampling budget, while the LOSO and top-$K$ experiments estimate whether the residual and coefficient terms are small enough for useful ranking.
To translate this per-configuration error bound into a held-out ranking statement, we use Spearman rank correlation $\rho$ rather than mean-squared error. The deployment question is which configurations to scale, not how much each will gain, and rank correlation directly measures the predictor's ability to recover that order.
###### Lemma 1 (Rank perturbation).
Let $m'$ denote the number of held-out test configurations. Let $\hat{\boldsymbol{r}}, \boldsymbol{r}^{\star} \in \mathbb{R}^{m'}$ be predicted and population scores on these $m'$ configurations with $\|\hat{\boldsymbol{r}} - \boldsymbol{r}^{\star}\|_{\infty} \leq \Delta$, and let

$$q_{<2\Delta} := \frac{2}{m'(m'-1)} \Bigl| \bigl\{ (i,j) : i < j,\; |r^{\star}_{i} - r^{\star}_{j}| < 2\Delta \bigr\} \Bigr|$$

denote the fraction of configuration-pairs whose population gap is below $2\Delta$. Then there is an absolute constant $c$ such that the Spearman rank correlation $\hat{\rho}$ between the two rankings satisfies $|\hat{\rho} - \rho^{\star}| \leq c \cdot q_{<2\Delta}$.
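The quantity $q_{<2\Delta}$ in the lemma is easy to compute directly, which gives a quick sense of how many configuration pairs an error of size $\Delta$ could reorder. The sketch below is a literal transcription of the definition; the function name `close_pair_fraction` is hypothetical.

```python
from itertools import combinations


def close_pair_fraction(scores, delta):
    """Fraction of pairs (i < j) whose score gap is strictly below 2 * delta."""
    m = len(scores)
    if m < 2:
        raise ValueError("need at least two configurations")
    close = sum(1 for a, b in combinations(scores, 2) if abs(a - b) < 2 * delta)
    return close / (m * (m - 1) / 2)
```

Only pairs counted here can swap order under a perturbation of at most `delta` per score: if two population scores differ by at least `2 * delta`, moving each by at most `delta` cannot reverse them.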
Implication. Setting $\Delta$ to the right-hand side of [Corollary 1](https://arxiv.org/html/2606.02981#Thmcorollary1), [Lemma 1](https://arxiv.org/html/2606.02981#Thmlemma1) is informative for any $G \in \mathcal{G}_{L}$ whenever $q_{<2\Delta}$ is small relative to the gain gap; the same argument applies to top-$K$ precision. The bound makes three structural claims explicit: success requires a selected feature span with small approximation residual $A_{G,\mathrm{pop}}$, stability selection controls feature-side error through $\|\hat{\boldsymbol{\beta}}\|_{1}$, and the target-side term $L_{G}\,\sigma_{\text{tgt}}$ depends on prompt count $P$ and the gain function's Lipschitz constant. Thus non-Lipschitz reparameterizations fall outside the finite-sample transfer even if they work empirically. We verify these predictions in [Section 4](https://arxiv.org/html/2606.02981#S4).
## 4 Experiments
We answer four empirical questions in order: does the predictor recover the ranking of configurations by actual scaling gain ([Section 4.2](https://arxiv.org/html/2606.02981#S4.SS2)), does that ranking convert into a useful signal at the head of the list ([Section 4.3](https://arxiv.org/html/2606.02981#S4.SS3)), which features carry the signal ([Section 4.4](https://arxiv.org/html/2606.02981#S4.SS4)), and how robust is the result to changes in sampling budget, gain target, held-out recipe, prompt set, and reward model ([Section 4.5](https://arxiv.org/html/2606.02981#S4.SS5)). We then characterize the operating conditions under which the same feature family applies ([Section 4.6](https://arxiv.org/html/2606.02981#S4.SS6)).
### 4.1 Setup
Configurations. We evaluate post-training configurations spanning the design grid below; the exact evaluation-set sizes per predictor subset are listed in Appendix [A](https://arxiv.org/html/2606.02981#A1). The grid spans three base-model families (Qwen2.5, Llama-3.1, gemma-2), each in a large and a small variant; six post-training methods: DPO (Rafailov et al., [2023](https://arxiv.org/html/2606.02981#bib.bib24)), SimPO (Meng et al., [2024](https://arxiv.org/html/2606.02981#bib.bib22)), KTO (Ethayarajh et al., [2024](https://arxiv.org/html/2606.02981#bib.bib8)), ORPO (Hong et al., [2024](https://arxiv.org/html/2606.02981#bib.bib12)), GRPO (Shao et al., [2024](https://arxiv.org/html/2606.02981#bib.bib27)), and an SFT-only reference (no preference data) (Ouyang et al., [2022](https://arxiv.org/html/2606.02981#bib.bib23)); and three task domains (math, code, and reasoning). Code is included in the grid as an explicit out-of-distribution stress test for leave-one-domain-out ([Section 4.6](https://arxiv.org/html/2606.02981#S4.SS6)); the validated scope for headline claims is math and reasoning.
Sampling. Each configuration generates $k = 64$ completions per prompt at three temperatures $T \in \{0.3, 0.7, 1.0\}$ on $P = 200$ labeled held-out prompts per domain (prompts that the post-trained model never saw during training). Full sampling and inference hyperparameters are in Appendix [A](https://arxiv.org/html/2606.02981#A1).
Predictor. We fit the joint cross-temperature ridge from [Section 3.2](https://arxiv.org/html/2606.02981#S3.SS2) over the agreement-rate + variance features in [Section 3.1](https://arxiv.org/html/2606.02981#S3.SS1); the full catalog is in Appendix [B.1](https://arxiv.org/html/2606.02981#A2.SS1). The catalog includes label-free output-distribution summaries and one label-assisted statistic (first-correct-sample position), so the headline use case is labeled-validation-set screening. Cross-validation is leave-one-set-out (LOSO), holding out all configurations of a single (base family, domain) combination. Within each split, standardization, ridge $\alpha$ selection, and coefficient fitting use only training clusters; bootstrap-Lasso is a stability analysis of fixed feature designs, not a fully nested automatic feature selector. Confidence intervals use a cluster bootstrap over configurations rather than rows, keeping all three temperatures of a sampled configuration together.
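The leakage discipline described above (standardization and fitting on training clusters only) can be sketched as a leave-one-set-out loop. This is a simplified illustration under stated assumptions: `cluster_ids` is taken to label each row's (base family, domain) cluster, the temperature column is assumed to be already included in `X`, and `alpha` is held fixed, whereas the text selects it by inner cross-validation on the training clusters. The function name is hypothetical.

```python
import numpy as np


def loso_predictions(X, g, cluster_ids, alpha=1.0):
    """Held-out ridge predictions, leaving out one cluster at a time.

    Standardization statistics and ridge coefficients are computed from
    the training clusters only, then applied to the held-out cluster.
    """
    X, g = np.asarray(X, float), np.asarray(g, float)
    cluster_ids = np.asarray(cluster_ids)
    preds = np.empty_like(g)
    for c in np.unique(cluster_ids):
        test = cluster_ids == c
        train = ~test
        mu, sd = X[train].mean(axis=0), X[train].std(axis=0)
        sd[sd == 0.0] = 1.0                       # guard against constant columns
        Ztr, Zte = (X[train] - mu) / sd, (X[test] - mu) / sd
        g_mean = g[train].mean()
        A = Ztr.T @ Ztr + alpha * np.eye(X.shape[1])
        w = np.linalg.solve(A, Ztr.T @ (g[train] - g_mean))
        preds[test] = g_mean + Zte @ w            # held-out rows never touch the fit
    return preds
```

The returned vector can be compared against measured gains with a rank correlation to obtain a LOSO score of the kind reported in the experiments.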
Baselines\.We compare against three classes of baseline\.*Naive single\-feature predictors*use one obvious cheap signal each:pass@1\\mathrm\{pass\}@1alone \(the headroom intuition behind difficulty\-aware allocation,Snell et al\.,[2024](https://arxiv.org/html/2606.02981#bib.bib28)\), the mean reward\-model score per configuration, mean per\-token log\-probability, sample\-diversity statistics including self\-BLEU\(Yu et al\.,[2017](https://arxiv.org/html/2606.02981#bib.bib33)\), and first\-token entropy\(Kadavath et al\.,[2022](https://arxiv.org/html/2606.02981#bib.bib14)\)\.*Multi\-feature priors*include an agreement\-rate\-only predictor over the ten features of prior work\(Holtzman et al\.,[2020](https://arxiv.org/html/2606.02981#bib.bib11); Yu et al\.,[2017](https://arxiv.org/html/2606.02981#bib.bib33)\), the closest published\-feature analog\.*Reference controls*are a random\-KKcontrol that selectsKKconfigurations uniformly at random and an oracle ranking by actual scaling gain, giving the upper bound on top\-KKprecision\.
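To make the reference controls concrete, the sketch below computes top-$K$ precision together with the expected precision of the random-$K$ control. The definition used here (the share of the $K$ highest-predicted configurations that are also among the $K$ highest by actual gain) is an assumption, since this section does not spell the metric out; the function names are hypothetical, and ties are broken by input order.

```python
def top_k_precision(predicted, actual, k):
    """Share of the k highest-predicted items that are in the actual top k."""
    if len(predicted) != len(actual) or not 0 < k <= len(actual):
        raise ValueError("inputs must have equal length and 0 < k <= n")
    n = len(actual)
    by_pred = sorted(range(n), key=lambda i: predicted[i], reverse=True)[:k]
    by_actual = sorted(range(n), key=lambda i: actual[i], reverse=True)[:k]
    return len(set(by_pred) & set(by_actual)) / k


def random_k_expected_precision(n, k):
    """Expected top-k precision when k of n items are chosen uniformly at random."""
    # Each chosen item lands in the true top k with probability k / n.
    return k / n
```

Under this definition the oracle ranking scores 1.0 by construction, and a random choice of 5 out of 50 configurations would be expected to score 0.1, which gives the two ends of the scale against which a predictor's precision can be read.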
Target\. The target is the best\-of\-$N$ scaling gain $g(c,T)=\mathrm{BoN}@k-\mathrm{pass}@1$ defined in [Eq\. 2](https://arxiv.org/html/2606.02981#S3.E2), with Skywork\-Reward\-Llama\-3\.1\-8B \(Liu et al\., [2024](https://arxiv.org/html/2606.02981#bib.bib19)\) as the reward model that scores each sample and selects the highest\-scoring one\.
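As a minimal sketch of how this target could be computed from one scored sampling pass, the block below assumes two hypothetical $P\times k$ nested lists, `correct` \(0/1 labels\) and `reward` \(reward\-model scores\)\. The function name `bon_gain` and the estimate of $\mathrm{pass}@1$ as mean per\-sample accuracy are illustrative assumptions; the authoritative definition is Eq\. 2\.

```python
def bon_gain(correct, reward):
    """Best-of-N gain for one (configuration, temperature) cell.

    correct[p][i] is 1 if sample i of prompt p is correct, else 0.
    reward[p][i] is the reward-model score of the same sample.
    Both are P x k nested lists (hypothetical layout).
    """
    n_prompts = len(correct)
    # pass@1 estimated as mean per-sample accuracy (assumption).
    pass_at_1 = sum(sum(row) / len(row) for row in correct) / n_prompts
    # BoN@k: correctness of the highest-reward sample for each prompt.
    bon_hits = 0
    for c_row, r_row in zip(correct, reward):
        best = max(range(len(r_row)), key=lambda i: r_row[i])
        bon_hits += c_row[best]
    return bon_hits / n_prompts - pass_at_1


# Toy example: 2 prompts, k = 4 samples each.
correct = [[0, 1, 0, 0], [1, 1, 0, 0]]
reward = [[0.1, 0.9, 0.2, 0.3], [0.5, 0.2, 0.9, 0.1]]
gain = bon_gain(correct, reward)
print(f"BoN gain: {gain:+.3f}")  # prints "BoN gain: +0.125"
```

In the toy data, the second prompt shows how a reward model can select an incorrect sample even when correct ones exist, which lowers the gain\.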
Compute\. Predictor inference per configuration replaces the $\sim 30$ GPU\-minute reward\-model scoring step required by end\-to\-end best\-of\-$N$ with CPU\-only feature extraction and ridge regression that complete in minutes \(breakdown in Appendix [A](https://arxiv.org/html/2606.02981#A1)\)\.
### 4\.2 Rank prediction
The compact predictor recovers the LOSO ranking of configurations by actual scaling gain at Spearman $\rho=0.90$ using the strict stable core plus one entropy add\-on \([Table 1](https://arxiv.org/html/2606.02981#S4.T1)\)\. A matched\-grid calibration scatter for the joint cross\-$T$ ridge lies along the $y=x$ line in [Figure 1](https://arxiv.org/html/2606.02981#S4.F1)a, indicating the same feature family recovers absolute gain values, not just their ordering\.
Table 1: LOSO/LOO Spearman with $95\%$ cluster\-bootstrap half\-width\. Upper block: multi\-feature predictors\. Lower block: naive single\-feature baselines\. See Appendix [A](https://arxiv.org/html/2606.02981#A1) for evaluation set sizes\.

Main correlation\. The agreement\-rate baseline carries substantial signal \($\rho=0.83$\), and adding variance refinements lifts LOSO correlation to $\rho=0.87$ on the same configurations\. The compact predictor, formed from the strict stable core plus a per\-prompt entropy add\-on and refit on the slightly larger eligible grid, reaches $\rho=0.90$; this is our headline predictor\. The paired cluster\-bootstrap CI versus the agreement\-rate baseline, $\Delta\rho\in[-0.03,+0.16]$, brackets zero at the current $n=50$, so the contribution rests on the larger lift over single\-feature baselines rather than a CI\-separable improvement at this $n$\.
Figure 1: \(a\) Predicted vs\. actual scaling gain on the matched calibration grid, using the joint cross\-$T$ ridge over agreement\-rate \+ variance features; each marker is one \(configuration, $T$\) pair \($96$ markers total\), with $y=x$ reference\. [Table 1](https://arxiv.org/html/2606.02981#S4.T1) reports the compact headline predictor on its larger eligible grid\. \(b\) LOSO Spearman $\rho$ versus sample budget $k$ per temperature; LOSO holds out by \(base family, domain\)\.

Naive single\-feature baselines\. Single\-feature LOSO Spearmans range from $\rho=-0.24$ \($\mathrm{pass}@1$ alone\) to $\rho=+0.57$ \(mean reward\-model score or the agreement\-rate feature\), with calibration and diversity signals \(self\-BLEU, first\-token entropy, mean log\-probability\) falling between \([Table 1](https://arxiv.org/html/2606.02981#S4.T1), lower block\)\. The compact predictor reaches $\rho=0.90$, giving a substantially stronger ranking signal than any retained single\-feature baseline in this comparison\.
Not a design artifact\. A non\-parametric permutation null places the observed $\rho$ above the null’s $97.5$th percentile at every temperature, $p=0.002$ \(lower\-bound resolution\); full table in [Section C\.1](https://arxiv.org/html/2606.02981#A3.SS1.SSS0.Px2)\.
### 4\.3 Deployment utility
If a practitioner ranks configurations by predicted gain and runs scaling on the top $K$, how much actual gain do they recover relative to picking $K$ at random? At $K=5$, mean precision\-at\-$5$ is $0.90$ and the predictor recovers the actual top\-$5$ exactly in $64\%$ of bootstrap resamples \([Table 2](https://arxiv.org/html/2606.02981#S4.T2)\)\.
Precision\-at\-$K$\. At $K=5$, predicted\-top\-$5$ configurations deliver mean actual gain $+0.18$ against a random\-$5$ control’s $+0.09$, a paired difference of $+0.10$ whose cluster\-bootstrap CI excludes zero\. The precision\-at\-$5$ distribution concentrates near one while the actual gain of the selected set stays well above random selection \([Figure 2](https://arxiv.org/html/2606.02981#S4.F2)\), which is the regime pre\-deployment screening requires\.
Table 2: Target $g(c,T)$, joint cross\-$T$ ridge, LOSO out\-of\-sample\. Random\-$K$ CIs use $5000$ random subsets; precision uses $2000$ cluster\-bootstrap resamples of configurations; $\Delta$ is the paired top\-$K$ minus random\-$K$ difference; $P(=1)$ is the bootstrap\-resample probability that the predictor’s top\-$K$ exactly matches the oracle top\-$K$\.

Figure 2: Bootstrap distribution of precision\-at\-$K$ and mean actual gain for predictor\-selected versus random\-$K$ configurations; $2000$ cluster\-bootstrap resamples of $n$ configurations\. Precision is computed against the oracle top\-$K$ ranking by actual gain\.
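The screening metrics in this subsection can be sketched as follows\. This is an illustrative example rather than the evaluation code: the helper names and toy numbers are hypothetical, and the random\-$K$ baseline is taken as the exact expectation \(the overall mean gain\) instead of the sampled random subsets described in the table caption\.

```python
def top_k(values, k):
    """Indices of the k largest entries (ties keep the lower index first)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(order[:k])


def precision_at_k(predicted, actual, k):
    """Share of the predicted top-k that is also in the oracle top-k."""
    return len(top_k(predicted, k) & top_k(actual, k)) / k


def mean_gain_of_selection(predicted, actual, k):
    """Mean actual gain of the k configurations ranked highest by the predictor."""
    chosen = top_k(predicted, k)
    return sum(actual[i] for i in chosen) / k


# Toy grid of five configurations (hypothetical numbers).
predicted = [0.30, 0.10, 0.25, 0.05, 0.20]
actual = [0.28, 0.02, 0.10, 0.12, 0.22]

prec = precision_at_k(predicted, actual, k=2)
selected_gain = mean_gain_of_selection(predicted, actual, k=2)
# Expected mean gain of a uniformly random size-k subset equals the grid mean.
random_gain = sum(actual) / len(actual)
print(prec, round(selected_gain, 3), round(random_gain, 3))
```

The paired difference reported in the table corresponds to `selected_gain - random_gain`, recomputed on each bootstrap resample\.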
### 4\.4 Stability selection
Bootstrap\-Lasso stability selection identifies three features above the fixed $80\%$ threshold across $500$ resamples: majority\-fraction spread, median first\-correct\-sample position, and completion\-length variance\. We treat these as the strict stable core\. The compact predictor in [Table 1](https://arxiv.org/html/2606.02981#S4.T1) augments this core with one entropy summary chosen by paired\-bootstrap ablation; this add\-on is a predictive refinement, not part of the stability\-selected core\.
Stable core and entropy add\-on\. The strict stable core consists of majority\-fraction spread, median first\-correct\-sample position \(label\-assisted\), and completion\-length variance \(full ranking in [Table 7](https://arxiv.org/html/2606.02981#A3.T7)\)\. The named agreement\_rate feature sits at $67.4\%$, below threshold once it competes against variance refinements\. Separately, the near\-threshold entropy add\-on improves prediction: at the larger $n=56$ grid, the paired cluster\-bootstrap difference \(core \+ entropy minus core alone\) gives $\Delta\rho=+0.081$ with $95\%$ CI $[+0.002,+0.241]$\. We therefore distinguish the stability\-selected core from the compact predictor used for the headline ranking result\.
Within\-family refinement\. The stable core and entropy add\-on refine the same prompt\-level concentration and success pattern captured coarsely by agreement\-rate features\. Matching the full $18$\-feature predictor at $n=32$ with a compact feature set identifies a parsimonious model, and the progression with feature count is monotone \(one feature $\rho=0.75$, two $0.90$, four $0.92$\)\.
### 4\.5 Robustness
The predictor’s signal is robust along five dimensions: sample budget $k$, gain\-function reparameterization, held\-out RL recipe, prompt set, and reward model\.
Across the scaling curve\. Retargeting to $\mathrm{BoN}@k'-\mathrm{pass}@1$ at smaller $k'\in\{2,4,8,16,32\}$ keeps the predictor informative at every $k'$ and every temperature, plateauing by $k'=16$–$32$ \(full table in [Table 8](https://arxiv.org/html/2606.02981#A3.T8)\)\.
Across gain functions\. Retargeted to four BoN\-anchored variants \(additive, normalized, multiplicative, log\-ratio\), the same predictor achieves $\rho\in[0.83,0.88]$ regardless of Lipschitz status\. Retargeted to majority voting $G_{\mathrm{MV}}$, $\rho$ collapses to $0.00$ \(half\-width $0.46$; [Table 9](https://arxiv.org/html/2606.02981#A3.T9)\), pointing to verifier\- versus vote\-anchoring as the relevant distinction\.
Across post\-training methods\. To test method\-level generalization beyond the LOSO clusters, we train the compact predictor on five RL recipes and predict on the held\-out sixth, repeating for each recipe\. All six folds give Spearman $\rho\in[+0.78,+0.94]$ with cluster\-bootstrap CI excluding zero \([Table 10](https://arxiv.org/html/2606.02981#A3.T10)\), so the signal is not specific to one RL recipe\.
Across prompt sets\. We re\-extract the same features on fresh $k=64$ generations from MATH500 \(Hendrycks et al\., [2021](https://arxiv.org/html/2606.02981#bib.bib9); Lightman et al\., [2024](https://arxiv.org/html/2606.02981#bib.bib18)\), re\-score the BoN target with the same Skywork reward model, and apply the trained predictor without retraining\. MATH500 transfer holds at $\rho=+0.79$ \($p<10^{-4}$\), close to the in\-distribution LOSO result \($\rho=0.90$\); details and a code\-domain transfer check are in Appendix [D](https://arxiv.org/html/2606.02981#A4)\.
Across reward models\. To test whether the compact feature design is specific to Skywork\-Reward\-Llama\-3\.1\-8B, we re\-score every sample with ArmoRM\-Llama3\-8B \(Wang et al\., [2024](https://arxiv.org/html/2606.02981#bib.bib30)\) and refit the compact ridge against the ArmoRM\-defined BoN gain\. On the $n=56$ configurations scored by both reward models, the same feature design reaches LOSO $\rho=+0.81_{\pm.19}$ against ArmoRM; Skywork on the same cells gives $\rho=+0.90_{\pm.13}$ \([Table 11](https://arxiv.org/html/2606.02981#A3.T11)\)\. The $0.09$ attenuation is within bootstrap noise, so the feature design remains informative under a retargeted verifier\-specific ridge fit\.
### 4\.6 Operating conditions
The predictor’s operating regime can be characterized along three axes: rare\-correct\-sample behavior at high temperature, the match between surface agreement and semantic correctness, and the value of pooling temperatures\.
High\-temperature residuals\. The largest LOSO residuals are SFT and GRPO configurations at high $T\in\{0.7,1.0\}$ \([Table 16](https://arxiv.org/html/2606.02981#A3.T16)\), cells where the agreement\-rate family reads low surface agreement while the reward model still selects rare correct samples\. This pattern is consistent with the feature interpretation above: majority\-fraction spread and first\-correct\-sample position measure whether the correct answer appears in the sampled support, but they do not fully model the reward\-score tail that determines which rare sample the verifier will choose\.
Domain alignment\. The agreement\-rate family measures surface\-string agreement among samples\. This aligns with semantic correctness on math and reasoning, where extracted answers map roughly one\-to\-one to surface strings, but is less aligned on code, where semantically equivalent programs admit many surface forms via renaming, restructuring, and stylistic variation\. Aggregate leave\-one\-domain\-out remains informative at $\rho=+0.72$, but the per\-domain code fold is $\rho=-0.56$, which identifies code as an out\-of\-scope stress test rather than part of the headline validated regime\.
Temperature pooling\. Single\-temperature LOSO regressions yield wider CIs than the joint cross\-$T$ fit at every $T$ \([Table 5](https://arxiv.org/html/2606.02981#A3.T5)\)\. The joint design pools samples across the three temperatures while retaining a temperature main effect, giving the predictor enough rows to stabilize rank estimates without claiming that all temperatures have identical gain distributions\.
## 5 Conclusion
Across the math and reasoning configurations we evaluate under a reward\-model verifier, bootstrap\-Lasso identifies a strict three\-feature stable core of labeled validation\-set sample statistics \(majority\-fraction spread, label\-assisted first\-correct\-sample position, and completion\-length variance\), while a compact ridge predictor that adds a per\-prompt entropy summary gives the strongest rank\-prediction result\. Combined with the concentration analysis, this turns the practical question of which configurations benefit most from best\-of\-$N$ scaling from a costly end\-to\-end measurement into a single\-pass labeled\-validation\-set check within the validated scope\.
## Limitations
The present study focuses on reward\-model\-verifier scaling for reasoning\-style benchmarks with extractable answers\. Natural extensions include open\-ended generation, tool\-augmented tasks, and future model families whose output distributions may differ from those studied here\. The predictor is intended as a pre\-deployment screening tool for comparing configuration grids on held\-out prompts; final deployment decisions should still be paired with task\-specific evaluation\. We view these extensions as empirical rather than methodological: the framework is designed to be re\-applied as benchmarks, verifiers, and post\-training recipes evolve\.
## References
- Azaria and Mitchell \(2023\) Amos Azaria and Tom Mitchell\. 2023\. The internal state of an LLM knows when it’s lying\. In *Findings of the Association for Computational Linguistics: EMNLP 2023*\.
- Brown et al\. \(2024\) Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini\. 2024\. Large language monkeys: Scaling inference compute with repeated sampling\. *arXiv preprint arXiv:2407\.21787*\.
- Burnell et al\. \(2023\) Ryan Burnell, Han Hao, Andrew R\. A\. Conway, and Jose Hernández\-Orallo\. 2023\. Revealing the structure of language model capabilities\. *arXiv preprint arXiv:2306\.10062*\.
- Burns et al\. \(2023\) Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt\. 2023\. Discovering latent knowledge in language models without supervision\. In *International Conference on Learning Representations \(ICLR\)*\.
- Chen et al\. \(2021\) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, and 39 others\. 2021\. Evaluating large language models trained on code\. *arXiv preprint arXiv:2107\.03374*\.
- Clark et al\. \(2018\) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord\. 2018\. Think you have solved question answering? try ARC, the AI2 reasoning challenge\. *arXiv preprint arXiv:1803\.05457*\.
- Cobbe et al\. \(2021\) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman\. 2021\. Training verifiers to solve math word problems\. *arXiv preprint arXiv:2110\.14168*\.
- Ethayarajh et al\. \(2024\) Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela\. 2024\. KTO: Model alignment as prospect theoretic optimization\. *arXiv preprint arXiv:2402\.01306*\.
- Hendrycks et al\. \(2021\) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt\. 2021\. Measuring mathematical problem solving with the MATH dataset\. In *Neural Information Processing Systems \(NeurIPS\) Datasets and Benchmarks Track*\.
- Hoffmann et al\. \(2022\) Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, and 3 others\. 2022\. Training compute\-optimal large language models\. *arXiv preprint arXiv:2203\.15556*\.
- Holtzman et al\. \(2020\) Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi\. 2020\. The curious case of neural text degeneration\. In *International Conference on Learning Representations \(ICLR\)*\.
- Hong et al\. \(2024\) Jiwoo Hong, Noah Lee, and James Thorne\. 2024\. ORPO: Monolithic preference optimization without reference model\. In *Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 11170–11189\.
- Jiang et al\. \(2021\) Zhengbao Jiang, Jun Araki, Haibo Ding, and Graham Neubig\. 2021\. How can we know when language models know? on the calibration of language models for question answering\. *Transactions of the Association for Computational Linguistics*, 9:962–977\.
- Kadavath et al\. \(2022\) Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield\-Dodds, Nova DasSarma, Eli Tran\-Johnson, Scott Johnston, Sheer El\-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, and 17 others\. 2022\. Language models \(mostly\) know what they know\. *arXiv preprint arXiv:2207\.05221*\.
- Kaplan et al\. \(2020\) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B\. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei\. 2020\. Scaling laws for neural language models\. *arXiv preprint arXiv:2001\.08361*\.
- Kirk et al\. \(2024\) Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, and Roberta Raileanu\. 2024\. Understanding the effects of RLHF on LLM generalisation and diversity\. In *International Conference on Learning Representations \(ICLR\)*\.
- Kwon et al\. \(2023\) Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E\. Gonzalez, Hao Zhang, and Ion Stoica\. 2023\. Efficient memory management for large language model serving with PagedAttention\. In *Proceedings of the 29th Symposium on Operating Systems Principles*\.
- Lightman et al\. \(2024\) Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe\. 2024\. Let’s verify step by step\. In *International Conference on Learning Representations \(ICLR\)*\.
- Liu et al\. \(2024\) Chris Yuhao Liu, Liang Zeng, Jiacai Liu, Rui Yan, Jujie He, Chaojie Wang, Shuicheng Yan, Yang Liu, and Yahui Zhou\. 2024\. Skywork\-reward: Bag of tricks for reward modeling in LLMs\. *arXiv preprint arXiv:2410\.18451*\.
- McKenzie et al\. \(2023\) Ian R\. McKenzie, Alexander Lyzhov, Michael Pieler, Alicia Parrish, Aaron Mueller, Ameya Prabhu, Euan McLean, Aaron Kirtland, Alexis Ross, Alisa Liu, Andrew Gritsevskiy, Daniel Wurgaft, Derik Kauffman, Gabriel Recchia, Jiacheng Liu, Joe Cavanagh, Max Weiss, Sicong Huang, The Floating Droid, and 8 others\. 2023\. Inverse scaling: When bigger isn’t better\. *Transactions on Machine Learning Research*\.
- Meinshausen and Bühlmann \(2010\) Nicolai Meinshausen and Peter Bühlmann\. 2010\. Stability selection\. *Journal of the Royal Statistical Society: Series B \(Statistical Methodology\)*, 72\(4\):417–473\.
- Meng et al\. \(2024\) Yu Meng, Mengzhou Xia, and Danqi Chen\. 2024\. SimPO: Simple preference optimization with a reference\-free reward\. In *Advances in Neural Information Processing Systems \(NeurIPS\)*\.
- Ouyang et al\. \(2022\) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L\. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe\. 2022\. Training language models to follow instructions with human feedback\. *Advances in Neural Information Processing Systems \(NeurIPS\)*, 35\.
- Rafailov et al\. \(2023\) Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D\. Manning, and Chelsea Finn\. 2023\. Direct preference optimization: Your language model is secretly a reward model\. In *Advances in Neural Information Processing Systems \(NeurIPS\)*\.
- Ruan et al\. \(2024\) Yangjun Ruan, Chris J\. Maddison, and Tatsunori Hashimoto\. 2024\. Observational scaling laws and the predictability of language model performance\. *arXiv preprint arXiv:2405\.10938*\.
- Schaeffer et al\. \(2023\) Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo\. 2023\. Are emergent abilities of large language models a mirage? In *Advances in Neural Information Processing Systems \(NeurIPS\)*\.
- Shao et al\. \(2024\) Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y\. K\. Li, Y\. Wu, and Daya Guo\. 2024\. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models\. *arXiv preprint arXiv:2402\.03300*\.
- Snell et al\. \(2024\) Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar\. 2024\. Scaling LLM test\-time compute optimally can be more effective than scaling model parameters\. *arXiv preprint arXiv:2408\.03314*\.
- Uesato et al\. \(2022\) Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins\. 2022\. Solving math word problems with process\- and outcome\-based feedback\. *arXiv preprint arXiv:2211\.14275*\.
- Wang et al\. \(2024\) Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang\. 2024\. Interpretable preferences via multi\-objective reward modeling and mixture\-of\-experts\. In *Findings of the Association for Computational Linguistics: EMNLP 2024*\.
- Wang et al\. \(2023\) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V\. Le, Ed H\. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou\. 2023\. Self\-consistency improves chain of thought reasoning in language models\. In *International Conference on Learning Representations \(ICLR\)*\.
- Wu et al\. \(2024\) Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang\. 2024\. Inference scaling laws: An empirical analysis of compute\-optimal inference for problem\-solving with language models\. *arXiv preprint arXiv:2408\.00724*\.
- Yu et al\. \(2017\) Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu\. 2017\. SeqGAN: Sequence generative adversarial nets with policy gradient\. In *AAAI Conference on Artificial Intelligence*\.
## Appendix A Detailed setups
### A\.1 Sampling and inference
All sampling uses vLLM \(Kwon et al\., [2023](https://arxiv.org/html/2606.02981#bib.bib17)\) at $k=64$ samples per prompt, top\-$p=1$, max tokens $1024$, temperatures $T\in\{0.3,0.7,1.0\}$\. The evaluation suite contains $200$ prompts per domain drawn from the standard splits of MATH \(Hendrycks et al\., [2021](https://arxiv.org/html/2606.02981#bib.bib9)\), HumanEval \(Chen et al\., [2021](https://arxiv.org/html/2606.02981#bib.bib5)\), and ARC\-Challenge \(Clark et al\., [2018](https://arxiv.org/html/2606.02981#bib.bib6)\); a fixed prompt ordering is used across configurations so that the $k$ samples within a prompt are i\.i\.d\. while prompts are matched across configurations\.
### A\.2 Evaluation\-set sizes
Up to $n=73$ post\-training configurations are available with full feature data\. The exact count varies across predictor variants depending on feature availability: the non\-compact predictors in [Table 1](https://arxiv.org/html/2606.02981#S4.T1) \(agreement\-rate baseline, agreement\-rate \+ variance refinements, and all naive single\-feature baselines\) are evaluated on the same $n=50$ configurations for which every feature in the design matrix is computable, so these rows of [Table 1](https://arxiv.org/html/2606.02981#S4.T1) are directly comparable\. The compact predictor \(stable core \+ entropy add\-on\) refits on a slightly larger $n=56$ since only four features are required\. The evaluated checkpoint scales represented in the configuration grid include Qwen2\.5\-3B/7B\-Instruct, Llama\-3\.1\-8B\-Instruct, and gemma\-2\-2B/9B\-it; reward\-model parameter counts are listed below\.
### A\.3 Artifact use and data handling
We use public benchmark artifacts \(MATH, HumanEval, ARC\-Challenge, and MATH500\), public model and reward\-model artifacts \(the base\-model families in [Section 4\.1](https://arxiv.org/html/2606.02981#S4.SS1), Skywork\-Reward\-Llama\-3\.1\-8B, ArmoRM\-Llama3\-8B, and DeBERTa\-v3\-large\-Reward\), and the vLLM inference library\. We cite the creators of these artifacts in the main text and this appendix\. Our use is limited to research evaluation of model configurations; we do not redistribute benchmark datasets, model weights, reward\-model weights, or raw generated completions, and any released code or derived tables are intended to require users to obtain the original artifacts under their own licenses or terms\. The evaluated prompts are public math, programming, and multiple\-choice reasoning benchmarks rather than newly collected user data; no human subjects or annotators are involved in this study\. We use generated samples only for aggregate feature extraction and evaluation statistics\.
### A\.4 Reward models
The primary reward model is Skywork\-Reward\-Llama\-3\.1\-8B \(Liu et al\., [2024](https://arxiv.org/html/2606.02981#bib.bib19)\) \(8B parameters, Llama\-3\.1 architecture\)\. The target\-level cross\-reward\-model robustness check \([Table 11](https://arxiv.org/html/2606.02981#A3.T11)\) uses ArmoRM\-Llama3\-8B \(Wang et al\., [2024](https://arxiv.org/html/2606.02981#bib.bib30)\) \(8B parameters, Llama\-3 architecture with a multi\-objective mixture\-of\-experts head\), independently trained by a different group on different preference data\. A separate feature\-level preliminary block \([Table 12](https://arxiv.org/html/2606.02981#A3.T12)\) uses DeBERTa\-v3\-large\-Reward \(304M parameters, DeBERTa\-v3 architecture\); this block is reported as under\-powered \($n=6$\)\.
### A\.5 Predictor configuration
Joint cross\-temperature ridge regression is fit with RidgeCV over $\alpha\in\{10^{-3},10^{-2},10^{-1},1,10,10^{2}\}$, selected by $5$\-fold inner cross\-validation\. Features are standardized within each cross\-validation training fold \(the held\-out fold uses the training\-fold scaler\)\. Temperature is centered at $T_{0}=0.7$ for the main\-effect covariate $\gamma\,(T-T_{0})$\.
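The configuration above could be assembled with scikit\-learn roughly as follows\. This is a sketch on synthetic stand\-in data, not the released pipeline: the array names, the four\-feature design, and the row\-level inner $5$\-fold split are assumptions, and the scaler here also standardizes the centered temperature column\.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

ALPHAS = [1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0]  # alpha grid from the text
T0 = 0.7  # temperature centering point from the text


def loso_predictions(features, temps, gains, groups):
    """Out-of-sample ridge predictions, holding out one (family, domain) set at a time."""
    # Append the centered temperature as a main-effect covariate.
    X = np.column_stack([features, temps - T0])
    preds = np.empty_like(gains, dtype=float)
    for train, test in LeaveOneGroupOut().split(X, gains, groups):
        # The scaler and the alpha search see training rows only.
        model = make_pipeline(StandardScaler(), RidgeCV(alphas=ALPHAS, cv=5))
        model.fit(X[train], gains[train])
        preds[test] = model.predict(X[test])
    return preds


# Synthetic stand-in: 6 held-out sets x 8 configurations x 3 temperatures.
rng = np.random.default_rng(0)
n_sets, n_cfg = 6, 8
temps_grid = np.array([0.3, 0.7, 1.0])
groups = np.repeat(np.arange(n_sets), n_cfg * temps_grid.size)
temps = np.tile(temps_grid, n_sets * n_cfg)
features = rng.normal(size=(groups.size, 4))
gains = features @ np.array([0.08, -0.05, 0.03, 0.02]) + 0.04 * (temps - T0)
gains = gains + rng.normal(scale=0.01, size=groups.size)

preds = loso_predictions(features, temps, gains, groups)
rho, _ = spearmanr(preds, gains)
print(f"LOSO Spearman on synthetic data: {rho:.2f}")
```

Wrapping the scaler and `RidgeCV` in one pipeline is what keeps standardization and $\alpha$ selection inside each training split\.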
### A\.6 Cluster bootstrap
Configurations are resampled with replacement at the configuration level so that all three temperatures of a sampled configuration appear together in the resample\. Spearman correlation is recomputed on each resample, and the $2.5\%$ and $97.5\%$ percentiles across $500$ resamples define the $95\%$ cluster\-bootstrap confidence interval\. The paired bootstrap CI on $\Delta\rho$ uses the same resample indices for both feature regimes, so that the dependence between the two regimes’ estimates is preserved and the resulting interval is a fair test of one regime against another\.
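A minimal NumPy sketch of this procedure is given below\. The function names and the synthetic predictions are hypothetical; the only details taken from the text are configuration\-level resampling, the $2.5\%$/$97.5\%$ percentiles over $500$ resamples, and shared indices for the paired difference\.

```python
import numpy as np


def average_ranks(x):
    """1-based ranks with ties assigned their average rank."""
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    last = np.cumsum(counts)
    return (last - (counts - 1) / 2.0)[inverse]


def spearman(a, b):
    """Spearman rho as the Pearson correlation of average ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])


def cluster_bootstrap(config_ids, actual, pred_a, pred_b, n_boot=500, seed=0):
    """Percentile CIs for rho(pred_a) and for the paired difference rho_a - rho_b."""
    rng = np.random.default_rng(seed)
    configs = np.unique(config_ids)
    rows_of = {c: np.flatnonzero(config_ids == c) for c in configs}
    rhos, deltas = [], []
    for _ in range(n_boot):
        # Draw whole configurations so their temperature rows stay together.
        drawn = rng.choice(configs, size=configs.size, replace=True)
        idx = np.concatenate([rows_of[c] for c in drawn])
        # Same indices for both predictors: this is what makes the CI paired.
        rho_a = spearman(pred_a[idx], actual[idx])
        rho_b = spearman(pred_b[idx], actual[idx])
        rhos.append(rho_a)
        deltas.append(rho_a - rho_b)
    return np.percentile(rhos, [2.5, 97.5]), np.percentile(deltas, [2.5, 97.5])


# Synthetic stand-in: 20 configurations x 3 temperatures.
rng = np.random.default_rng(0)
config_ids = np.repeat(np.arange(20), 3)
actual = rng.normal(size=config_ids.size)
pred_a = actual + rng.normal(scale=0.3, size=actual.size)
pred_b = actual + rng.normal(scale=1.0, size=actual.size)
ci_rho, ci_delta = cluster_bootstrap(config_ids, actual, pred_a, pred_b)
print("95% CI for rho_a:", np.round(ci_rho, 2))
print("95% CI for rho_a - rho_b:", np.round(ci_delta, 2))
```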
### A\.7 Stability selection bootstrap
We draw $500$ configuration\-level bootstrap resamples, fit LassoCV with internal $\alpha\in\{10^{-3},\dots,10^{1}\}$ on each resample, and apply a $\geq 80\%$ selection\-frequency threshold fixed before data inspection \(Meinshausen and Bühlmann, [2010](https://arxiv.org/html/2606.02981#bib.bib21)\)\. The temperature covariate is included in the design so that variance refinements are not artifactually inflated by absorbing the temperature signal\. This stability analysis is run to summarize which candidate features repeatedly carry signal in the evaluated grid; the LOSO correlations in the main table fit the ridge coefficients within each outer split for fixed feature sets\.
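The selection\-frequency computation might look like the sketch below\. It is an illustration on synthetic data under stated assumptions: the $\alpha$ grid spacing, the per\-resample standardization, and the function name `selection_frequency` are not specified in the text, and the demo uses $50$ resamples only to stay fast\.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler


def selection_frequency(X, y, config_ids, n_boot=500, seed=0):
    """Fraction of configuration-level resamples giving each column a nonzero Lasso weight."""
    rng = np.random.default_rng(seed)
    configs = np.unique(config_ids)
    rows_of = {c: np.flatnonzero(config_ids == c) for c in configs}
    alphas = np.logspace(-3, 1, 9)  # spans 1e-3 .. 1e1; the spacing is an assumption
    counts = np.zeros(X.shape[1])
    for _ in range(n_boot):
        # Resample whole configurations, keeping their temperature rows together.
        drawn = rng.choice(configs, size=configs.size, replace=True)
        idx = np.concatenate([rows_of[c] for c in drawn])
        Xb = StandardScaler().fit_transform(X[idx])
        model = LassoCV(alphas=alphas, cv=5).fit(Xb, y[idx])
        counts += np.abs(model.coef_) > 0
    return counts / n_boot


# Synthetic stand-in: 40 configurations x 3 temperatures, 6 candidate features.
rng = np.random.default_rng(1)
n_cfg = 40
temps_grid = np.array([0.3, 0.7, 1.0])
config_ids = np.repeat(np.arange(n_cfg), temps_grid.size)
temp_col = np.tile(temps_grid, n_cfg) - 0.7  # temperature covariate kept in the design
X = np.column_stack([rng.normal(size=(config_ids.size, 6)), temp_col])
y = 0.10 * X[:, 0] - 0.08 * X[:, 2] + 0.03 * temp_col
y = y + rng.normal(scale=0.01, size=config_ids.size)

freq = selection_frequency(X, y, config_ids, n_boot=50)
stable = np.flatnonzero(freq >= 0.80)  # columns at or above the 80% threshold
print("selection frequency:", np.round(freq, 2))
print("columns above threshold:", stable)
```

Cross\-validated Lasso tends to keep some weak columns as well, which is one reason to read the frequencies against a threshold fixed in advance\.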
### A\.8 Permutation null
We draw $500$ permutations of the configuration\-level targets relative to the feature matrix, preserving the within\-configuration three\-temperature structure\. The entire joint cross\-temperature LOSO ridge pipeline is re\-run on each permutation; the resulting Spearman distribution defines the null\. The one\-sided $p$ floor is $1/500=0.002$\.
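The permutation scheme can be sketched as below\. To keep the example short, `score_fn` is a cheap stand\-in \(Spearman between one feature and the target\) rather than the full LOSO ridge pipeline, and reporting $\max(\text{count},1)/500$ is an assumption chosen to reproduce the stated floor\.

```python
import numpy as np


def spearman(a, b):
    """Spearman rho via ranks; assumes continuous values with no ties."""
    ranks_a = np.argsort(np.argsort(a))
    ranks_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(ranks_a, ranks_b)[0, 1])


def permutation_pvalue(score_fn, X, y, n_perm=500, seed=0):
    """One-sided p-value; X is (n_configs, 3, d) and y is (n_configs, 3)."""
    rng = np.random.default_rng(seed)
    observed = score_fn(X, y)
    null = np.empty(n_perm)
    for b in range(n_perm):
        # Shuffle whole configurations so each three-temperature target triple stays intact.
        perm = rng.permutation(y.shape[0])
        null[b] = score_fn(X, y[perm])
    exceed = int(np.sum(null >= observed))
    # Report at least 1/n_perm, the resolution floor of the procedure.
    return observed, max(exceed, 1) / n_perm


def stand_in_score(X, y):
    """Placeholder for the full pipeline: rank agreement of feature 0 with the target."""
    return spearman(X[:, :, 0].ravel(), y.ravel())


# Synthetic stand-in: 30 configurations x 3 temperatures x 2 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3, 2))
y = 0.1 * X[:, :, 0] + rng.normal(scale=0.01, size=(30, 3))

observed, p_value = permutation_pvalue(stand_in_score, X, y)
print(f"observed rho = {observed:.2f}, one-sided p = {p_value:.3f}")
```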
### A\.9 Compute
All inference and post\-training run on $1\times$ L40S 48GB GPUs \(general partition of the CMU Babel cluster\)\. Feature extraction and predictor fits are CPU\-only and complete in minutes per configuration; the full predictor pipeline takes under an hour end\-to\-end at $n=27$–$56$ configurations\.
### A\.10 Cost\-value comparison
The predictor’s value rests on a compute asymmetry\. [Table 3](https://arxiv.org/html/2606.02981#A1.T3) itemizes the per\-configuration cost of running end\-to\-end best\-of\-$N$ versus our predictor on the same trained checkpoint\. Both pipelines require the same $k=64$ samples per prompt across $P=200$ prompts at three temperatures \(the sampling pass is shared and is the dominant GPU cost\)\. End\-to\-end best\-of\-$N$ additionally scores all $P\times k\times 3=38{,}400$ \(prompt, completion\) pairs with the Skywork reward model; we measure this at $\sim 30$ GPU\-minutes per configuration on a single L40S 48GB at batch size $32$ in bf16\. Our predictor replaces this reward\-model scoring step with feature extraction \(CPU only, $\sim 3$ minutes per configuration\) and ridge inference \(CPU seconds\)\. The savings are therefore the entire reward\-model scoring step, $\approx N\times 30$ GPU\-minutes for an $N$\-configuration screening grid; at the headline $N=56$ grid this is $\approx 28$ GPU\-hours of L40S compute avoided per screening pass\. We emphasize that the savings only materialize when the practitioner has not already paid the scoring cost: a deployment that already runs best\-of\-$N$ at inference time has paid the scoring step and would not see additional savings; the predictor is intended for screening grids of candidate post\-training configurations before any one is deployed\.
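The arithmetic behind these figures can be reproduced directly\. The constants are those quoted above; the function names are hypothetical, and the $30$ GPU\-minute figure is the approximate measurement reported in the text\.

```python
PROMPTS = 200          # P, prompts per configuration
SAMPLES = 64           # k, samples per prompt
TEMPERATURES = 3       # sampled temperatures
SCORING_GPU_MIN = 30   # approx. reward-model scoring cost per configuration


def scored_pairs(prompts=PROMPTS, samples=SAMPLES, temperatures=TEMPERATURES):
    """(prompt, completion) pairs the reward model must score per configuration."""
    return prompts * samples * temperatures


def gpu_hours_saved(n_configs, scoring_gpu_min=SCORING_GPU_MIN):
    """GPU-hours of reward-model scoring avoided for a screening grid."""
    return n_configs * scoring_gpu_min / 60


print(scored_pairs())        # 38400
print(gpu_hours_saved(56))   # 28.0
```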
Table 3: Per\-configuration compute breakdown on $1\times$ L40S 48GB\. “Shared” rows are the same for both pipelines; “BoN\-only” rows are paid only by the end\-to\-end procedure that the predictor replaces\.
## Appendix B Framework details
This section backs the framework and theoretical content of Section [3](https://arxiv.org/html/2606.02981#S3): [Section B\.1](https://arxiv.org/html/2606.02981#A2.SS1) lists the full feature catalog referenced by [Section 3\.1](https://arxiv.org/html/2606.02981#S3.SS1), [Section B\.2](https://arxiv.org/html/2606.02981#A2.SS2) contains the full proof of the joint concentration bound stated in [Section 3\.3](https://arxiv.org/html/2606.02981#S3.SS3), and [Section B\.3](https://arxiv.org/html/2606.02981#A2.SS3) reports the numerical instantiation at our operating point\.
### B\.1 Feature catalog
Table 4 lists the full $19$\-dimensional feature set used in the agreement\-rate \+ variance design matrix\. Each feature is computed from the $200\times 64$ matrix of \(prompt, sample\) completions at a fixed temperature\.
Table 4: Full feature catalog used in the agreement\-rate \+ variance predictor\. The reasoning\-step\-count feature is domain\-adaptive \(proof lines for MATH, code lines for HumanEval, reasoning items for ARC\)\. The first\-correct\-sample feature is intentionally label\-assisted and requires gold validation answers, matching the paper’s labeled validation\-set screening use case\.

| Feature \(code name\) | Family | Description |
|---|---|---|
| agreement\_rate | agreement\-rate | frac\. samples matching modal answer |
| majority\_fraction | agreement\-rate | avg\. per\-prompt modal fraction |
| self\_bleu | agreement\-rate | pairwise BLEU among samples |
| uniq\_2gram\_ratio | agreement\-rate | unique\-bigram ratio |
| embed\_sim | agreement\-rate | pairwise cosine sim\. of sample embeds |
| answer\_entropy | agreement\-rate | Shannon entropy of extracted answers |
| mean\_logprob | agreement\-rate | mean per\-token log\-prob |
| std\_logprob | agreement\-rate | std\-dev of per\-sample log\-probs |
| mean\_topK\_mass | agreement\-rate | avg\. top\-$K$ token mass per step |
| seq\_lp\_spread | agreement\-rate | cross\-sample sequence log\-prob spread |
| majority\_size\_std | variance ref\. | cross\-prompt SD of modal fraction |
| majority\_size\_min | variance ref\. | min cross\-prompt modal fraction |
| completion\_length\_variance | variance ref\. | variance of completion lengths |
| first\_correct\_sample\_position\_median | label\-assist\. ref\. | median first\-correct index |
| per\_prompt\_answer\_entropy\_median | variance ref\. | median per\-prompt answer entropy |
| completion\_repetition\_4gram | variance ref\. | 4\-gram repetition rate |
| reasoning\_step\_count\_mean | variance ref\. | domain\-adaptive step count |
| first\_token\_entropy | variance ref\. | avg\. first\-token entropy |
| majority\_certainty\_advantage | variance ref\. | modal vs\. runner\-up certainty gap |

The catalog separates the ten original agreement\-rate features \(top block\) from the nine variance\-refinement features \(bottom block\); the bootstrap\-Lasso stable core \([Table 7](https://arxiv.org/html/2606.02981#A3.T7)\) is drawn from the variance\-refinement block, and none of the agreement\-rate features cross the $80\%$ stability threshold\. One stable\-core refinement, the first\-correct\-sample\-position feature, is label\-assisted; the headline predictor therefore assumes labeled validation prompts rather than fully unlabeled deployment data\.
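To make a few catalog entries concrete, the sketch below computes several prompt\-level statistics from extracted answers\. The dictionary keys reuse the catalog’s code names, but the exact definitions are assumptions made for illustration: population SD and variance, natural\-log entropy, $1$\-based first\-correct indices with $k+1$ when no sample is correct, and completion lengths pooled over all prompts\.

```python
import math
import statistics
from collections import Counter


def prompt_level_features(answers, gold, lengths):
    """answers[p][i]: extracted answer; gold[p]: reference; lengths[p][i]: completion length."""
    k = len(answers[0])
    modal_fracs, first_correct, entropies = [], [], []
    for row, target in zip(answers, gold):
        counts = Counter(row)
        # Share of samples that agree with the most common answer for this prompt.
        modal_fracs.append(counts.most_common(1)[0][1] / len(row))
        # 1-based index of the first correct sample; k + 1 if none is correct (assumption).
        hits = [i + 1 for i, a in enumerate(row) if a == target]
        first_correct.append(hits[0] if hits else k + 1)
        # Shannon entropy (nats) of the empirical answer distribution.
        probs = [c / len(row) for c in counts.values()]
        entropies.append(-sum(p * math.log(p) for p in probs))
    all_lengths = [n for row in lengths for n in row]
    return {
        "majority_fraction": statistics.mean(modal_fracs),
        "majority_size_std": statistics.pstdev(modal_fracs),
        "majority_size_min": min(modal_fracs),
        "first_correct_sample_position_median": statistics.median(first_correct),
        "per_prompt_answer_entropy_median": statistics.median(entropies),
        "completion_length_variance": statistics.pvariance(all_lengths),
    }


# Toy example: 2 prompts, k = 4 samples each.
answers = [["4", "4", "5", "4"], ["8", "7", "9", "7"]]
gold = ["4", "7"]
lengths = [[10, 12, 11, 13], [20, 22, 21, 23]]
feats = prompt_level_features(answers, gold, lengths)
for name, value in feats.items():
    print(f"{name}: {value:.3f}")
```

Only the first\-correct\-sample feature touches `gold`, which is why it is the single label\-assisted entry among these\.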
### B.2 Proof of Theorem 1
We provide the expanded version of the proof sketched in [Section 3.3](https://arxiv.org/html/2606.02981#S3.SS3). Recall that we wish to bound
$$
\bigl|\,\hat{\boldsymbol{\beta}}^{\top}\hat{\mathbf{x}}(c,T)-G(c,T)\,\bigr|
$$
for any Lipschitz gain function $G\in\mathcal{G}_{L}$, uniformly over all $m$ training configurations $c$ and all three temperatures $T$, with probability at least $1-\delta$.
Decomposition. By the triangle inequality,
$$
\begin{aligned}
\bigl|\,\hat{\boldsymbol{\beta}}^{\top}\hat{\mathbf{x}}-G(c,T)\,\bigr|
&\leq \underbrace{\bigl|\hat{\boldsymbol{\beta}}^{\top}(\hat{\mathbf{x}}-\mathbf{x})\bigr|}_{(A)\ \text{data}} \\
&\quad+\underbrace{\bigl|(\hat{\boldsymbol{\beta}}-\boldsymbol{\beta}^{\star})^{\top}\mathbf{x}\bigr|}_{(B)\ \text{coef.}} \\
&\quad+\underbrace{\bigl|\boldsymbol{\beta}^{\star\,\top}\mathbf{x}-G(c,T)\bigr|}_{(C)\ \text{approx.}}.
\end{aligned}
$$
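The step behind this decomposition is an add-and-subtract identity: inserting $\hat{\boldsymbol{\beta}}^{\top}\mathbf{x}$ and $\boldsymbol{\beta}^{\star\,\top}\mathbf{x}$ splits the error into three differences, to which the triangle inequality is then applied. Written out, with the arguments $(c,T)$ suppressed:

```latex
\hat{\boldsymbol{\beta}}^{\top}\hat{\mathbf{x}} - G
  = \hat{\boldsymbol{\beta}}^{\top}(\hat{\mathbf{x}}-\mathbf{x})
  + (\hat{\boldsymbol{\beta}}-\boldsymbol{\beta}^{\star})^{\top}\mathbf{x}
  + \bigl(\boldsymbol{\beta}^{\star\,\top}\mathbf{x} - G\bigr)
```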
Term (A). By Hölder's inequality with the dual $(\ell^{1},\ell^{\infty})$ pair,
$$
\bigl|\hat{\boldsymbol{\beta}}^{\top}(\hat{\mathbf{x}}-\mathbf{x})\bigr|\leq\sum_{j=1}^{d}|\hat{\beta}_{j}|\,|\hat{x}_{j}-x_{j}|.
$$
Each empirical feature $\hat{x}_{j}$ is a bounded statistic computed over $P$ prompts (and $n_{\text{samp}}$ within-prompt completions). For mean-like features we use Hoeffding's inequality; for median or quantile features we use the corresponding feature-specific bounded-statistic radius. At the per-feature failure budget $\delta_{\text{feat}}/d$ and after a union bound across $d$ features, with probability at least $1-\delta_{\text{feat}}$,
$$
|\hat{x}_{j}-x_{j}|\leq\varepsilon_{j}\qquad\text{for all } j=1,\dots,d,
$$
where $\varepsilon_{j}$ is the per-(configuration, feature) Hoeffding deviation bound. Summing the coefficient-weighted terms yields the data-side bound $E_{\beta}(\hat{\boldsymbol{\beta}})=\sum_{j}|\hat{\beta}_{j}|\,\varepsilon_{j}$.
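As a rough illustration of the data-side bound, the sketch below evaluates $E_{\beta}=\sum_{j}|\hat{\beta}_{j}|\,\varepsilon_{j}$ for mean-like features only. It assumes a two-sided Hoeffding radius with each feature scaled to a known range; the coefficients are hypothetical, and the median or quantile radii mentioned above are not covered.

```python
import math

def hoeffding_radius(n, delta, value_range=1.0):
    """Two-sided Hoeffding radius for the mean of n bounded i.i.d. values."""
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def data_side_bound(beta_hat, n_prompts, delta_feat, value_ranges=None):
    """E_beta = sum_j |beta_j| * eps_j with a per-feature budget delta_feat / d."""
    d = len(beta_hat)
    ranges = value_ranges or [1.0] * d
    # Union bound: each feature gets an equal share of the failure budget.
    eps = [hoeffding_radius(n_prompts, delta_feat / d, r) for r in ranges]
    return sum(abs(b) * e for b, e in zip(beta_hat, eps)), eps

# Hypothetical coefficients; P = 200 and delta_feat = 0.025 follow the
# operating point quoted in B.3.
beta_hat = [0.5, -0.3, 0.2]
e_beta, eps = data_side_bound(beta_hat, n_prompts=200, delta_feat=0.025)
```

Because the bound is linear in $|\hat{\beta}_{j}|$, features with large coefficients dominate it, which is one reason a compact feature design keeps the envelope small.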
Term (B). By Cauchy–Schwarz,
$$
\bigl|(\hat{\boldsymbol{\beta}}-\boldsymbol{\beta}^{\star})^{\top}\mathbf{x}\bigr|\leq\|\hat{\boldsymbol{\beta}}-\boldsymbol{\beta}^{\star}\|_{2}\cdot\|\mathbf{x}(c,T)\|_{2}.
$$
This term captures the coefficient gap between the empirical ridge estimate and its population optimum, and is the part that requires further control to extend the bound to held-out configurations (see remark below).
Term (C). Lipschitzness alone does not imply that the gain function lies in the linear span of the selected features. We therefore keep the approximation residual explicit:
$$
\bigl|\boldsymbol{\beta}^{\star\,\top}\mathbf{x}-G(c,T)\bigr|\leq A_{G}^{\star},
$$
where $A_{G}^{\star}:=\sup_{c,T}|\boldsymbol{\beta}^{\star\top}\mathbf{x}(c,T)-G(c,T)|$ over the training configurations and temperatures. This term is the modeling error of the linear feature representation; it is estimated empirically by the held-out validation results rather than bounded by concentration.
Target-side concentration. Lipschitzness is used only when comparing the population gain $G(c,T)$ to the measured gain $\hat{G}(c,T)$. The two underlying empirical means $\widehat{\mathrm{BoN}@k}$ and $\widehat{\mathrm{pass}@1}$ are each computed from $P$ prompts (with $n_{\text{samp}}$ within-prompt completions); by Hoeffding's inequality on each, with probability at least $1-\delta_{\text{tgt}}/(3m)$ (after union bound across $m$ configurations and three temperatures), each lies within $\sqrt{\log(3m/\delta_{\text{tgt}})/(2P)}$ of its population value. Combining the two as a joint Euclidean norm,
$$
\bigl\|(\widehat{\mathrm{BoN}@k},\widehat{\mathrm{pass}@1})-(\mathrm{BoN}@k,\mathrm{pass}@1)\bigr\|_{2}\leq\sigma_{\text{tgt}}.
$$
For any $G\in\mathcal{G}_{L}$, this implies $|\hat{G}(c,T)-G(c,T)|\leq L_{G}\,\sigma_{\text{tgt}}$.
Symmetric split. Setting $\delta_{\text{tgt}}=\delta_{\text{feat}}=\delta/2$ and combining the feature, coefficient, approximation, and target-concentration bounds yields the statement of [Proposition 1](https://arxiv.org/html/2606.02981#Thmproposition1). ∎
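As a reading aid, the four pieces can be chained as follows on the event where both concentration steps hold. This is only a sketch of how the terms combine by the triangle inequality; the exact statement of Proposition 1 is the one given in the main text.

```latex
\begin{aligned}
\bigl|\hat{\boldsymbol{\beta}}^{\top}\hat{\mathbf{x}}(c,T)-\hat{G}(c,T)\bigr|
&\leq \bigl|\hat{\boldsymbol{\beta}}^{\top}\hat{\mathbf{x}}(c,T)-G(c,T)\bigr|
      +\bigl|G(c,T)-\hat{G}(c,T)\bigr| \\
&\leq E_{\beta}(\hat{\boldsymbol{\beta}})
      +\|\hat{\boldsymbol{\beta}}-\boldsymbol{\beta}^{\star}\|_{2}\,\|\mathbf{x}(c,T)\|_{2}
      +A_{G}^{\star}
      +L_{G}\,\sigma_{\text{tgt}}
\end{aligned}
```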
### B.3 Held-out extension and numerical instantiation
Conditional held-out transfer. [Corollary 1](https://arxiv.org/html/2606.02981#Thmcorollary1) applies the same feature- and target-side concentration steps to an independent held-out configuration and then substitutes two population-level controls. The first is the approximation residual over the configuration population, $A_{G,\mathrm{pop}}$, which replaces the training-set residual $A_{G}^{\star}$. The second is the standard ridge coefficient rate $\|\hat{\boldsymbol{\beta}}-\boldsymbol{\beta}^{\star}\|_{2}\leq C_{\mathrm{ridge}}\sqrt{(d+\log(1/\delta_{\mathrm{est}}))/m}$; the constant absorbs the design-covariance and noise parameters. Combining these terms with Cauchy–Schwarz gives the held-out bound in the corollary. Thus the theorem controls the stochastic sampling terms directly, while the empirical LOSO experiments test whether the approximation and coefficient terms are small enough on the evaluated configuration population.
Numerical instantiation. At our operating point $(P,n_{\text{samp}},m,d,\delta)=(200,64,27,8,0.05)$ with $\delta_{\text{tgt}}=\delta_{\text{feat}}=0.025$, the target-side Hoeffding deviation bound for the headline $\mathrm{BoN}@k-\mathrm{pass}@1$ target evaluates to $\sigma_{\text{tgt}}\leq 0.29$, and the coefficient-weighted feature-error envelope evaluates to $E_{\beta}(\hat{\boldsymbol{\beta}}_{\mathrm{pilot}})\leq 0.43$ at the compact feature design. A tighter empirical-Bernstein variant of the same feature-side calculation, using the observed feature variances rather than the worst-case Hoeffding bound, brings the data-side bound to $E_{\beta}^{\mathrm{Bern}}\leq 0.27$ at the same configuration. These radii quantify the stochastic sampling terms in [Corollary 1](https://arxiv.org/html/2606.02981#Thmcorollary1); the remaining approximation and ridge-estimation terms are assessed by the LOSO and top-$K$ experiments rather than certified by concentration alone.
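For orientation, the per-mean Hoeffding radius from B.2 can be evaluated at this operating point with a few lines of arithmetic. The sketch below is not intended to reproduce the reported $\sigma_{\text{tgt}}\leq 0.29$, whose exact constants are not spelled out in this appendix. The value `delta_est = 0.05` is an assumption, and $C_{\mathrm{ridge}}$ is omitted because the source does not give it.

```python
import math

P, m, d = 200, 27, 8
delta_tgt = 0.025

# Per-mean Hoeffding radius from B.2: sqrt(log(3m / delta_tgt) / (2P)).
per_mean_radius = math.sqrt(math.log(3 * m / delta_tgt) / (2 * P))

# Euclidean combination of two equal radii (one for BoN@k, one for pass@1).
joint_radius = math.sqrt(2.0) * per_mean_radius

# Sample-size factor of the ridge rate sqrt((d + log(1/delta_est)) / m).
# delta_est is an assumed value; C_ridge is left out on purpose.
delta_est = 0.05
ridge_rate_factor = math.sqrt((d + math.log(1.0 / delta_est)) / m)
```

With these inputs the per-mean radius is about 0.14 and the joint radius about 0.20. The ridge factor of roughly 0.64 shows how slowly the coefficient term shrinks with only $m=27$ training configurations.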
## Appendix C Experiment extensions
This section mirrors the main-text Section [4](https://arxiv.org/html/2606.02981#S4) subsection-for-subsection. Each subsection here backs the same-named subsection in the main text. Setup details for [Section 4.1](https://arxiv.org/html/2606.02981#S4.SS1) are in [Appendix A](https://arxiv.org/html/2606.02981#A1) above.
### C.1 Rank prediction
#### Per-temperature breakdown.
[Table 5](https://arxiv.org/html/2606.02981#A3.T5) reports per-$T$ LOSO Spearman correlations separately for the agreement-rate baseline and the agreement-rate+variance predictor. The variance refinements lift the correlation most at $T=1.0$ ($+0.08$); the contrast is small at $T=0.3$ and $T=0.7$.
Table 5: Per-$T$ LOSO Spearman with cluster-bootstrap CI (half-width); $n=36$ configurations for the agreement-rate baseline and $n=32$ for the with-variance regime. The lift from variance refinements is concentrated at the highest sampling temperature ($T=1.0$, $\Delta\rho=+0.08$); at $T=0.3$ and $T=0.7$ the two predictors are essentially tied, supporting the joint cross-temperature design as the source of the contrast at our $n$.
#### Permutation null.
[Table 6](https://arxiv.org/html/2606.02981#A3.T6) reports the full permutation-null breakdown for the headline rank-prediction result. For each temperature, configuration-level targets are randomly permuted $500$ times relative to the feature matrix (within-configuration three-temperature structure preserved); the full joint cross-temperature LOSO pipeline is re-run on each permutation; the resulting Spearman distribution defines the null. The observed $\rho$ exceeds the null's $97.5\%$ percentile by $\approx 0.5$ at every temperature, $p=0.002$ (lower-bound resolution $1/500$). The "null $[2.5\%,97.5\%]$" column is the null distribution's percentile range across the $500$ permutations, not a confidence interval on the observed $\rho$ (those are in [Table 1](https://arxiv.org/html/2606.02981#S4.T1)).
Table 6: Permutation-null breakdown per temperature on the $\mathrm{BoN}@k-\mathrm{pass}@1$ target. The null column is the null distribution's percentile range, not a confidence interval on the observed $\rho$. At every temperature, the observed Spearman lies well above the null's $97.5\%$ percentile (gap of $\approx 0.5$); the design's three-temperature pooling is not the source of the headline signal.
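The permutation scheme can be sketched in simplified form as follows. This version permutes targets against a fixed vector of predictions instead of re-running the full joint LOSO pipeline on each permutation, so it is a cheaper stand-in for the procedure described here. The add-one p-value, the approximate percentile indexing, and the toy data are assumptions.

```python
import random

def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return out

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def permutation_null(pred, target, n_perm=500, seed=0):
    rng = random.Random(seed)
    observed = spearman(pred, target)
    null = []
    for _ in range(n_perm):
        shuffled = target[:]
        rng.shuffle(shuffled)  # break the pairing between features and targets
        null.append(spearman(pred, shuffled))
    # Add-one p-value; its smallest attainable value is 1 / (n_perm + 1).
    p = (1 + sum(r >= observed for r in null)) / (n_perm + 1)
    null.sort()
    # Approximate 2.5% and 97.5% percentiles of the null distribution.
    lo, hi = null[int(0.025 * n_perm)], null[int(0.975 * n_perm) - 1]
    return observed, (lo, hi), p

# Hypothetical predictions and gains for ten configurations.
pred = [0.1, 0.4, 0.35, 0.8, 0.9, 0.55, 0.2, 0.7, 0.65, 0.3]
target = [0.12, 0.38, 0.41, 0.77, 0.95, 0.5, 0.25, 0.6, 0.72, 0.28]
observed, null_range, p_value = permutation_null(pred, target)
```

The returned range plays the role of the "null $[2.5\%,97.5\%]$" column: it describes the shuffled-target distribution and says nothing about uncertainty in the observed $\rho$.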
### C.2 Stability selection
#### Full bootstrap-Lasso ranking.
[Table 7](https://arxiv.org/html/2606.02981#A3.T7) reports the bootstrap-Lasso selection frequencies for all $19$ candidate features. Three features cross the standard $80\%$ stable-selection threshold and define the stable core; one more (`per_prompt_answer_entropy_median`, $79.6\%$) sits just below and is treated only as a predictive add-on.
Table 7: Bootstrap-Lasso stability-selection frequencies across $500$ configuration-level resamples; the threshold for "stable" is $\geq 80\%$, fixed before data inspection. Three features cross the $80\%$ threshold and define the stable core; the near-threshold entropy feature is not counted as stable. The stable core consists of variance refinements rather than the original agreement-rate features, and the named `agreement_rate` feature itself is selected only $67.4\%$ of the time. The label-assisted first-correct-sample feature is part of this core, so the predictor should be read as a labeled-validation-set screen.
Figure 3: Bootstrap-Lasso selection frequency of each candidate feature across $500$ configuration-level resamples; dashed line marks the $80\%$ stable-selection threshold; stars mark features above the threshold. The bar chart visualizes the same selection-frequency data as [Table 7](https://arxiv.org/html/2606.02981#A3.T7), with rows ordered by frequency and the $80\%$ threshold annotated.
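A minimal sketch of bootstrap-Lasso selection frequencies is given below, using a small coordinate-descent Lasso so that it runs without external libraries. The penalty `alpha`, the per-resample standardization, the iteration count, and the toy data are all assumptions; the source specifies only the $500$ configuration-level resamples and the $80\%$ threshold.

```python
import random

def lasso_cd(X, y, alpha, n_iter=50):
    """Coordinate descent for (1/2n)||y - Xb||^2 + alpha * ||b||_1."""
    n, d = len(X), len(X[0])
    beta, resid = [0.0] * d, y[:]
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(d)]
    for _ in range(n_iter):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes its own fit.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = max(abs(rho) - alpha, 0.0) / col_sq[j]  # soft threshold
            new = new if rho > 0 else -new
            delta = new - beta[j]
            if delta:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta

def standardize(X, y):
    """Z-score each column and center the target."""
    n, d = len(X), len(X[0])
    my = sum(y) / n
    cols = []
    for j in range(d):
        col = [row[j] for row in X]
        mu = sum(col) / n
        sd = (sum((v - mu) ** 2 for v in col) / n) ** 0.5 or 1.0
        cols.append([(v - mu) / sd for v in col])
    return [[cols[j][i] for j in range(d)] for i in range(n)], [v - my for v in y]

def selection_frequency(X, y, alpha=0.1, n_boot=100, seed=0):
    """Fraction of row-level bootstrap resamples in which each coefficient is nonzero."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    counts = [0] * d
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        Xb, yb = standardize([X[i] for i in idx], [y[i] for i in idx])
        beta = lasso_cd(Xb, yb, alpha)
        for j in range(d):
            counts[j] += beta[j] != 0.0
    return [c / n_boot for c in counts]

# Toy data: feature 0 drives the target, feature 1 is pure noise.
rng = random.Random(1)
X = [[i / 29.0, rng.random()] for i in range(30)]
y = [2.0 * row[0] + 0.05 * rng.gauss(0.0, 1.0) for row in X]
freq = selection_frequency(X, y)
stable_core = [j for j, f in enumerate(freq) if f >= 0.8]
```

Each row stands for one configuration, so resampling rows corresponds to the configuration-level resampling named in the caption. A feature enters `stable_core` only if it survives the penalty in at least 80% of resamples.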
#### Partial correlation between the strongest variance refinement and agreement-rate features.
On the same $n=32$ configurations, `majority_size_std` (the strongest variance refinement feature) and `agreement_rate` (the strongest agreement-rate feature) are correlated at Pearson $|r|=0.787$ ($p<10^{-4}$). The marginal Spearman correlations with the gain target are $-0.868$ (`agreement_rate`) and $+0.856$ (`majority_size_std`), of comparable magnitude. After residualizing one against the other and correlating the residual with the gain, the partial Spearman of `majority_size_std` given `agreement_rate` drops to $+0.192$ ($p=0.34$); the partial Spearman of `agreement_rate` given `majority_size_std` drops to $-0.299$ ($p=0.13$). This is the basis for the main text's claim that the variance refinements are a within-family refinement rather than an orthogonal predictor ([Section 4.4](https://arxiv.org/html/2606.02981#S4.SS4)).
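The residualize-then-correlate step can be sketched as follows. This is one reading of the description above: the feature is residualized against the control on ranks, and that residual is then correlated with the ranked gain. Working on ranks, leaving the gain unresidualized, and the simple no-ties ranking are assumptions, and the toy data are hypothetical.

```python
def ranks(values):
    # Simple ranking that assumes no tied values (keeps the sketch short).
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def residualize(a, b):
    """Residual of a after an ordinary least-squares fit on b with an intercept."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    slope = sum((x - mb) * (y - ma) for x, y in zip(b, a)) / sum((x - mb) ** 2 for x in b)
    return [y - ma - slope * (x - mb) for x, y in zip(b, a)]

def partial_spearman(feature, control, gain):
    """Correlate the part of `feature` not explained by `control` with the gain."""
    rf, rc, rg = ranks(feature), ranks(control), ranks(gain)
    return pearson(residualize(rf, rc), rg)

# Toy case: the feature is a noisy copy of the control, and the gain follows
# the control exactly, so the feature adds nothing once the control is removed.
control = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
feature = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
gain = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
marginal = pearson(ranks(feature), ranks(gain))
partial = partial_spearman(feature, control, gain)
```

In the toy case the marginal correlation is high while the partial one vanishes, the same qualitative pattern as the drop from about $0.86$ to about $0.19$ reported above.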
### C.3 Robustness
#### Across the scaling curve.
[Table 8](https://arxiv.org/html/2606.02981#A3.T8) reports the joint cross-temperature LOSO Spearman as the BoN budget $k'$ is swept from $2$ to $64$, separately at each temperature. The correlation is informative at every $k'$ tested and plateaus by $k=16$–$32$.
Table 8: $k$-sweep LOSO Spearman per temperature on the $\mathrm{BoN}@k'-\mathrm{pass}@1$ target, recomputed from the first $k'$ samples of the existing $k=64$ data; $p<0.001$ throughout. The correlation is informative ($\rho\geq 0.43$) at every sample budget tested and plateaus by $k=16$–$32$; the headline result is therefore not specific to $k=64$ and would be recoverable at much smaller sample budgets at deployment time.
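Recomputing the target from a prefix of the stored samples can be sketched as below. Estimating pass@1 from the same first $k'$ samples and breaking reward ties toward the earliest sample are assumptions, since the source does not state either choice. The rewards and labels are toy values.

```python
def bon_gain(rewards, correct, k):
    """BoN@k minus pass@1, using only the first k samples of every prompt."""
    bon, pass1 = 0.0, 0.0
    for r, c in zip(rewards, correct):
        r_k, c_k = r[:k], c[:k]
        # Best-of-k: the sample the reward model scores highest.
        best = max(range(len(r_k)), key=lambda i: r_k[i])
        bon += c_k[best]
        # pass@1 estimated as mean correctness over the same k samples.
        pass1 += sum(c_k) / len(c_k)
    n = len(rewards)
    return bon / n - pass1 / n

# Two prompts, four samples each: reward-model scores and 0/1 correctness.
rewards = [[0.2, 0.9, 0.1, 0.5], [0.7, 0.3, 0.8, 0.4]]
correct = [[0, 1, 0, 0], [0, 0, 1, 1]]
sweep = {k: bon_gain(rewards, correct, k) for k in (1, 2, 4)}
```

Because each budget reuses a prefix of the same generations, the sweep needs no new sampling or reward-model calls beyond the original $k=64$ pass.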
#### Across gain functions.
[Table 9](https://arxiv.org/html/2606.02981#A3.T9) reports LOSO Spearman when the same compact ridge is retargeted to each of the four BoN-anchored gain functions and the majority-voting variant. Verifier-anchored gains all give $\rho\geq 0.83$ regardless of Lipschitz status; the majority-voting variant collapses to $\rho=0$, distinguishing the verifier-anchored from the vote-anchored axis. The $G_{\mathrm{add}}$ row reports $\rho=+0.87$ at the matched $n=32$ used for this table (chosen so that all five target variants are computed on identical cells); the headline $\rho=0.90$ in [Table 1](https://arxiv.org/html/2606.02981#S4.T1) is the same compact predictor refit on the larger $n=56$ grid available only for the additive target (cell-level eligibility differs across gain variants). The two numbers reflect different evaluation-set sizes, not different predictors or methods.
Table 9: Same agreement-rate+variance features and joint cross-temperature ridge, retargeted to each gain function. LOSO holds out $(\text{base family},\text{domain})$ clusters; $95\%$ cluster-bootstrap CIs. $\varepsilon_{0}=0.1$ for the normalized variant; $G_{\mathrm{MV}}$ uses true self-consistency MV (plurality vote on extracted answers).
#### Across post-training recipes.
[Table 10](https://arxiv.org/html/2606.02981#A3.T10) reports held-out LOSO Spearman when an entire RL recipe is held out from the training set in turn. Four of five recipes with sufficient configurations have held-out $\rho\geq 0.57$; ORPO and GRPO are the strongest individually.
Table 10: Per-recipe leave-out: train on all configurations whose RL recipe is not the held-out one, predict on the held-out recipe. Joint cross-$T$ ridge; LOSO; $95\%$ cluster-bootstrap CI (half-width). All six held-out folds have cluster-bootstrap CI excluding zero. ORPO transfers most cleanly (tightest CI); GRPO is the weakest fold (widest CI) but the interval still lies above zero. The result holds under per-method leave-out across all RL recipes evaluated.
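The fold construction behind per-recipe leave-out can be sketched as follows. The helper name `leave_one_group_out` and the recipe labels in the toy list are illustrative; the ridge fit and the Spearman evaluation that would run inside each fold are omitted.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_idx, test_idx) for every distinct group label."""
    for g in sorted(set(groups)):
        test = [i for i, x in enumerate(groups) if x == g]
        train = [i for i, x in enumerate(groups) if x != g]
        yield g, train, test

# One label per configuration (illustrative assignment).
recipes = ["SFT", "GRPO", "ORPO", "SFT", "GRPO", "ORPO"]
folds = list(leave_one_group_out(recipes))
```

The same helper would cover the LOSO folds if each label were a `(base family, domain)` tuple, since tuples sort and compare the same way.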
#### Across reward models.
We report two analyses on this axis. The first changes the BoN *target* by swapping the verifier; the second adds cross-reward-model *features* to the predictor input. The two analyses answer different questions and use different second reward models.
Target-level (ArmoRM verifier). [Table 11](https://arxiv.org/html/2606.02981#A3.T11) reports the analysis summarized in [Section 4.5](https://arxiv.org/html/2606.02981#S4.SS5): re-score every sample with ArmoRM-Llama3-8B (Wang et al., [2024](https://arxiv.org/html/2606.02981#bib.bib30)), recompute the BoN gain $g(c,T)$ against this independent verifier, and refit the same compact feature design on the new target. On the $n=56$ configurations that are compact-predictor-eligible and fully scored by both reward models, the ArmoRM-target ridge reaches $\rho=+0.81_{\pm.19}$; the Skywork-target ridge on the same cells reaches $\rho=+0.90_{\pm.13}$. The $0.09$-point gap between Skywork and ArmoRM Spearmans is within the cluster-bootstrap half-widths, so the cross-verifier difference is not separable from zero at this $n$. ArmoRM was trained by a different group on different preference data and a different mixture-of-experts head architecture, so it serves as a strong independent verifier; this experiment supports feature-design transfer after retargeted fitting, not zero-shot coefficient transfer.
Table 11: Target-level cross-reward-model robustness. The same compact feature design is refit against the BoN gain defined by each verifier on the intersection of compact-predictor-eligible, fully ArmoRM-scored configurations. Cluster-bootstrap $95\%$ half-widths.
Feature-level (DeBERTa scorer, preliminary). [Table 12](https://arxiv.org/html/2606.02981#A3.T12) reports a separate analysis that adds cross-reward-model features (computed from DeBERTa-v3-large-Reward scores) to the agreement-rate+variance feature design without changing the target. With these features added, predictor performance does not change in any direction whose cluster-bootstrap interval excludes zero. The cross-reward-model feature block was computed on only $n=6$ configurations (the subset for which the secondary DeBERTa-Reward scorer was run), so the comparison is under-powered.
Table 12: Preliminary cross-reward-model feature ablation (separate analysis from [Table 11](https://arxiv.org/html/2606.02981#A3.T11)). Cross-reward-model features computed on $n=6$ configurations for which DeBERTa-v3-large-Reward scores were available; baseline row trained on the matched $n=32$ agreement-rate+variance grid. The feature-level row has too few configurations to draw a conclusion; its CI spans zero. We report it as an under-powered robustness check rather than evidence against cross-reward-model features. The target-level analysis in [Table 11](https://arxiv.org/html/2606.02981#A3.T11) provides the better-powered cross-verifier test.
#### Sensitivity and ablation.
[Table 13](https://arxiv.org/html/2606.02981#A3.T13) summarizes a sensitivity-and-ablation sweep on a matched $n=32$ compact-ablation grid. The predictor is robust across ridge regularization strength ($\alpha\in[10^{-3},10]$ keeps $\rho\approx 0.91$; over-regularization at $\alpha=100$ drops $\rho$ to $0.88$), insensitive to feature standardization, and competitive against nonlinear baselines (random forest and gradient boosting both reach $\rho\approx 0.88$, slightly below the ridge fit). Within the joint cross-temperature design, the temperature-subset ablation isolates which $T$ contributes most: single-$T$ predictors at $T=0.3$ and $T=1.0$ recover $\rho\approx 0.90$ on their own, whereas $T=0.7$ in isolation drops to $\rho=0.72$. This is consistent with $T=0.7$ being the most noise-prone single-temperature regime, which is the basis for the main text's claim that the joint cross-$T$ design is necessary for tight CIs at our $n$ ([Section 4.6](https://arxiv.org/html/2606.02981#S4.SS6)). The two-temperature subset $\{0.3,1.0\}$ slightly exceeds the full three-temperature fit ($\rho=0.92$ vs. $0.91$), suggesting $T=0.7$ adds noise rather than information.
The cluster-bootstrap CI width is itself stable to the number of bootstrap resamples: half-width is $\approx 0.06$ at $N\in\{100,200,500,1000,2000\}$ resamples, and the median $\rho$ across these settings varies by less than $\pm 0.005$ ([Table 14](https://arxiv.org/html/2606.02981#A3.T14)). Prompt-set sensitivity (variation of the result under subsampling of the $P=200$ evaluation prompts per cell) was not run here because the design-matrix features are pre-aggregated to cell-level summaries; the appropriate sensitivity check would require re-running feature extraction at smaller $P$.
Table 13: Sensitivity-and-ablation sweep on the compact predictor; LOSO Spearman on the matched $n=32$ compact-ablation grid. The predictor is robust across ridge regularization, feature standardization, and alternate model classes; the temperature-subset block shows that $T=0.7$ alone is noise-dominated ($\rho=0.72$) while $T=0.3$ and $T=1.0$ each recover $\rho\approx 0.90$, supporting the joint cross-$T$ design as the source of tight CIs.
Table 14: Cluster-bootstrap stability across resample counts $N\in\{100,200,500,1000,2000\}$; the headline result uses $N=500$. The cluster-bootstrap CI is essentially invariant to the number of resamples across a twentyfold range; the headline CI is not an artifact of the $N=500$ choice.
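A cluster bootstrap of the kind referred to here can be sketched as below. Whole clusters are resampled with replacement and a statistic is recomputed on the pooled values. The statistic in the paper is the LOSO Spearman; the demo substitutes a plain mean for brevity, and the cluster keys, values, and percentile indexing are assumptions.

```python
import random

def cluster_bootstrap(values_by_cluster, stat, n_boot=500, seed=0):
    """Resample whole clusters; return (median, 95% half-width) of `stat`."""
    rng = random.Random(seed)
    clusters = list(values_by_cluster)
    draws = []
    for _ in range(n_boot):
        sample = []
        for _ in clusters:
            # Draw one cluster with replacement and keep all of its members.
            sample.extend(values_by_cluster[rng.choice(clusters)])
        draws.append(stat(sample))
    draws.sort()
    lo = draws[int(0.025 * (n_boot - 1))]
    hi = draws[int(0.975 * (n_boot - 1))]
    return draws[n_boot // 2], (hi - lo) / 2.0

# Hypothetical per-configuration values grouped by (base family, domain).
clusters = {
    ("A", "math"): [0.10, 0.20, 0.15],
    ("A", "code"): [0.40, 0.50],
    ("B", "math"): [0.30, 0.35, 0.25],
}
mean = lambda v: sum(v) / len(v)
results = {n: cluster_bootstrap(clusters, mean, n_boot=n) for n in (100, 200, 500)}
```

Comparing the half-widths in `results` across resample counts is the same kind of check that Table 14 reports for the headline CI.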
### C.4 Where the predictor fails
#### Per-domain leave-one-domain-out.
[Table 15](https://arxiv.org/html/2606.02981#A3.T15) reports the per-domain LODO breakdown. Math is the only domain whose held-out CI excludes zero; code held-out is strongly negative, reflecting that two of three training domains is insufficient breadth to extrapolate.
Table 15: Per-domain leave-one-domain-out Spearman; train on configurations whose task domain is not the held-out one and predict on the held-out domain; $95\%$ cluster-bootstrap half-width. Math transfers cleanly when held out; the code fold flips sign, consistent with the systematic over-prediction on code documented in [Table 16](https://arxiv.org/html/2606.02981#A3.T16). Two training domains are insufficient breadth to extrapolate to a third.
#### Adversarial residual examples.
[Table 16](https://arxiv.org/html/2606.02981#A3.T16) lists the five configurations with the largest absolute LOSO residual under the BoN target. All five are SFT or GRPO at high temperatures; the shared pattern is low per-prompt majority-fraction that the agreement-rate refinements read as a small predicted gain, while the actual BoN gain is larger because the reward model still picks out a rare correct sample.
Table 16: Five largest absolute LOSO residuals under the BoN target on the compact predictor. All five largest residuals are SFT or GRPO at $T\geq 0.7$ on code or weak-base math; the shared pattern is low per-prompt majority-fraction read by the predictor as a small gain, while best-of-$N$ still picks out a rare correct sample. A feature family conditioned on the reward-model score distribution is the structural fix flagged as future work.
#### Where the predictor over-estimates.
Code-domain configurations with low cross-reward-model disagreement are systematically over-predicted on the $\mathrm{BoN}@k-\mathrm{pass}@1$ target. The interpretation is that these configurations look stable to the agreement-rate family (the model consistently picks one answer) but the verifier's notion of correctness on code prompts is harsher than the per-prompt commitment level suggests, so actual best-of-$N$ gain is smaller than predicted.
## Appendix D Cross-dataset BoN transfer
We re-extract the agreement-rate+variance features on fresh $k=64$ generations from MATH500 (Hendrycks et al., [2021](https://arxiv.org/html/2606.02981#bib.bib9); Lightman et al., [2024](https://arxiv.org/html/2606.02981#bib.bib18)) (math-domain configurations) and HumanEval (code-domain configurations), score per-prompt correctness against the new gold labels, score each sample with the same Skywork-Reward-Llama-3.1-8B reward model to recover the cross-dataset $\mathrm{BoN}@k-\mathrm{pass}@1$ target, and apply the compact predictor (trained on the original evaluation suite, no retraining).
Table 17: Cross-dataset BoN transfer: train on the original evaluation suite's math (resp. code) cells; predict on MATH500 (resp. HumanEval) re-extracted features without retraining; recover the BoN target by re-scoring with the same Skywork reward model. MATH500 transfer at $\rho=+0.79$ is essentially as strong as the in-distribution LOSO result, indicating that the predictor's signal is a property of the trained configuration rather than the original eval suite. HumanEval transfer at $\rho=+0.35$ is weaker but excludes zero, consistent with the smaller code-domain training base and the over-prediction pattern noted in [Section C.4](https://arxiv.org/html/2606.02981#A3.SS4).