You Don't Need to Run Every Eval

arXiv cs.LG Papers

Summary

This research paper demonstrates that the scores of frontier AI models across 133 benchmarks are approximately rank-2, meaning only two latent factors explain over 90% of variation. The authors introduce BenchPress, a logit-space matrix completion method that predicts a model's full scorecard from just a few benchmarks, significantly reducing the cost of evaluation.

arXiv:2606.24020v1. Abstract: A modern model release reports scores on 40+ benchmarks, and the same evaluations were run many more times before it: to track training progress, compare design choices, and select the checkpoint for the release. But do we need to run every eval? We compile a public score matrix of 84 frontier models on 133 benchmarks (2,604 cells, 23.3% filled) and find it is approximately rank-2: a model's scores across all 133 benchmarks are largely determined by just two numbers. We confirm this in two ways: scores hidden from the matrix are best recovered using two factors, and two factors already explain over 90% of the variation among models on the benchmarks they share. Building on this, we design BenchPress: a logit-space rank-2 matrix completion method that recovers held-out scores to within 4.6 points, and a confidence layer that says when each prediction can be trusted. Using BenchPress, we find a subset of five benchmarks {GPQA-D, HLE, Codeforces, MMLU-Pro, ARC-AGI-1} that can recover the rest of a model's public scorecard to within 3.93 points. For a tighter inference budget, a cheaper set {GPQA-D, MMLU-Pro, Aider Polyglot, MATH-500, AIME 2026} can predict a model's evals to within 4.55 points. We release the score matrix, the BenchPress code, and an interactive tool that predicts any model's score on any benchmark.


Source: [https://arxiv.org/html/2606.24020](https://arxiv.org/html/2606.24020)
Emails: {zengyuchen, dimitriosp}@microsoft.com

![Refer to caption](https://arxiv.org/html/2606.24020v1/bp_hero_panel_a_examples.pdf) (a) Accuracy of predicting one (model, benchmark) score.
![Refer to caption](https://arxiv.org/html/2606.24020v1/bp_hero_panel_b_overall.pdf) (b) Reporting overall score-prediction error.

Figure 1: BenchPress predicts unseen benchmark scores from a handful of revealed ones. Left: For four model–benchmark cells, we hide the target score and reveal $k$ other scores from the *same model's row*, in a random order. The y-axis is absolute prediction error on the held-out target cell. Error drops sharply once a few same-row scores are revealed, and reaches zero whenever the target cell itself appears in the revealed prefix. Right: A complementary setting that mimics how a practitioner would run BenchPress in practice. A fixed set of $k$ benchmarks is chosen as the probe set, and every model is evaluated on whichever probe scores it has observed; BenchPress predicts the rest of each model's scores, and we report pooled error across all evaluated cells. With only five benchmark probes selected on the current matrix, pooled MedAE drops to 3.93 score points (4.55 when restricted to a lower inference cost list; see [Section 5.1](https://arxiv.org/html/2606.24020#S5.SS1)). See [Section A.1](https://arxiv.org/html/2606.24020#A1.SS1) for the detailed experiment setting.

###### Contents

1. [1 Introduction](https://arxiv.org/html/2606.24020#S1)
2. [2 Related Work](https://arxiv.org/html/2606.24020#S2)
3. [3 The Score Matrix and Its Geometry](https://arxiv.org/html/2606.24020#S3)
   1. [3.1 Data Collection](https://arxiv.org/html/2606.24020#S3.SS1)
   2. [3.2 The Final Score Matrix](https://arxiv.org/html/2606.24020#S3.SS2)
   3. [3.3 Rank-2 Geometry](https://arxiv.org/html/2606.24020#S3.SS3)
4. [4 BenchPress: A Low-rank Benchmark Score Predictor](https://arxiv.org/html/2606.24020#S4)
   1. [4.1 Candidate Methods](https://arxiv.org/html/2606.24020#S4.SS1)
   2. [4.2 From Candidate Methods to BenchPress](https://arxiv.org/html/2606.24020#S4.SS2)
   3. [4.3 BenchPress vs. LLMs as Benchmark Score Predictors](https://arxiv.org/html/2606.24020#S4.SS3)
5. [5 What BenchPress Enables for Model Evaluation](https://arxiv.org/html/2606.24020#S5)
   1. [5.1 Budgeted Scorecard Recovery](https://arxiv.org/html/2606.24020#S5.SS1)
   2. [5.2 Preserving Model Rankings](https://arxiv.org/html/2606.24020#S5.SS2)
   3. [5.3 Predicting Newly Released Models](https://arxiv.org/html/2606.24020#S5.SS3)
6. [6 When to Trust BenchPress's Predictions](https://arxiv.org/html/2606.24020#S6)
   1. [6.1 What Affects Prediction Reliability](https://arxiv.org/html/2606.24020#S6.SS1)
   2. [6.2 Estimating Prediction Reliability](https://arxiv.org/html/2606.24020#S6.SS2)
7. [7 Discussion](https://arxiv.org/html/2606.24020#S7)
8. [References](https://arxiv.org/html/2606.24020#bib)
9. [A Supplemental to Section 1: Introduction](https://arxiv.org/html/2606.24020#A1)
   1. [A.1 Experiment Setting for Figure 1](https://arxiv.org/html/2606.24020#A1.SS1)
10. [B Supplemental to Section 3: The Score Matrix and Its Geometry](https://arxiv.org/html/2606.24020#A2)
    1. [B.1 Data Collection](https://arxiv.org/html/2606.24020#A2.SS1)
    2. [B.2 The Final Score Matrix](https://arxiv.org/html/2606.24020#A2.SS2)
11. [C Supplemental to Section 4: BenchPress: A Low-rank Benchmark Score Predictor](https://arxiv.org/html/2606.24020#A3)
    1. [C.1 Candidate Methods](https://arxiv.org/html/2606.24020#A3.SS1)
    2. [C.2 From Candidate Methods to BenchPress](https://arxiv.org/html/2606.24020#A3.SS2)
    3. [C.3 BenchPress vs. LLMs as Benchmark Score Predictors](https://arxiv.org/html/2606.24020#A3.SS3)
12. [D Supplemental to Section 5: What BenchPress Enables for Model Evaluation](https://arxiv.org/html/2606.24020#A4)
    1. [D.1 Budgeted Scorecard Recovery](https://arxiv.org/html/2606.24020#A4.SS1)
    2. [D.2 Preserving Model Rankings](https://arxiv.org/html/2606.24020#A4.SS2)
    3. [D.3 Predicting Newly Released Models](https://arxiv.org/html/2606.24020#A4.SS3)
13. [E Supplemental to Section 6: When to Trust BenchPress's Predictions](https://arxiv.org/html/2606.24020#A5)
    1. [E.1 What Affects Prediction Reliability](https://arxiv.org/html/2606.24020#A5.SS1)
    2. [E.2 Estimating Prediction Reliability](https://arxiv.org/html/2606.24020#A5.SS2)

## 1 Introduction

LLM evaluation is expensive and growing more so. A frontier model release now routinely reports scores on dozens of benchmarks: Qwen3.5 reports 40 language benchmark rows (qwen35), and Kimi K2.5 reports 43 benchmark rows (kimik25). Being this thorough is good for science. But a public model release is the visible tip of a much larger measurement effort. Researchers compare checkpoints and design choices, and downstream consumers shortlist models for deployment and use. Across these settings, subsets of the same evaluation suite are run and re-run many more times than any single release reports. A full evaluation suite can therefore cost thousands of dollars and days of wall-clock time per run.

This raises a question: do we always need to run every evaluation, or are there settings where an approximate score, available for free, would be enough?

Benchmark scores are clearly not independent measurements. Strong performance on coding and agentic benchmarks often co-occurs with strong performance on competition-math benchmarks: for example, SWE-bench Verified (jimenez2024swebench; openai2024swebenchverified) is strongly correlated with AIME (aime) and MATH-500 (lightman2023math500), and Terminal-Bench variants (terminalbench) show similar but noisier trends. What is unclear is whether this dependence extends across the full landscape of benchmarks.

Why would one care? If a few observed scores can predict the rest of a model's benchmark profile to useful accuracy, practitioners have a new option for evals: run a small set of probes and infer the rest, instead of running every evaluation independently. We first build a score predictor, then ask what it enables in practice and when its predictions should be trusted. [Figure 1](https://arxiv.org/html/2606.24020#S0.F1) previews both the single-cell prediction task and the probe-set recovery setting.

##### Contributions:

1. We compile a public score matrix and show that it is effectively rank-2. We collect scores from public sources, canonicalize near-duplicate model variants and benchmark configurations, and filter out models and benchmarks with insufficient observations to obtain an $84\times 133$ matrix with 2,604 observed entries (23.3% of all model–benchmark cells). Two independent diagnostics on this matrix show that it is effectively rank-2: rank-sweeping Soft-Impute matrix completion minimizes held-out prediction error at rank 2, and SVDs of the largest fully-observed submatrices show that two factors explain more than 90% of variance ([Section 3](https://arxiv.org/html/2606.24020#S3)).
2. We build BenchPress, a benchmark score predictor. We evaluate seven feature transforms and twelve prediction methods, finding that the best full-coverage score predictor is a rank-2 alternating least squares (ALS) matrix-completion method in logit space (koren2009). It predicts every missing model–benchmark cell, reaching 4.6 score-point median absolute error on held-out entries at 100% coverage ([Section 4](https://arxiv.org/html/2606.24020#S4)).
3. We show what BenchPress enables for model evaluation. (i) We *select compact probe sets* that recover a model's scorecard under an evaluation budget: even when restricted to a low-cost benchmark allowlist, five probes lead to pooled MedAE of 4.55 score points ([Section 5.1](https://arxiv.org/html/2606.24020#S5.SS1)). (ii) We *verify ranking preservation*: allowing a five-point margin on the true scores, completed scores by BenchPress preserve 92.1% of pairwise model rankings on the same benchmark ([Section 5.2](https://arxiv.org/html/2606.24020#S5.SS2)). (iii) We *stress-test predictions on newly released models*: even when the training matrix predates the release, five seed scores lead to median absolute error of 5.0 points ([Section 5.3](https://arxiv.org/html/2606.24020#S5.SS3)).
4. We characterize when predictions should be trusted. We first identify the matrix-support factors that consistently affect prediction quality: target-model and target-benchmark coverage, the availability of similar peer models and neighboring benchmarks, and recency of training anchors. We then use these factors together with ensemble spread, a reliability signal measuring how much plausible score predictors disagree, to estimate trust probabilities and conformally-calibrated 90% prediction intervals for BenchPress predictions ([Section 6](https://arxiv.org/html/2606.24020#S6)).

##### Scope and caveats\.

Our claims should be read within four limits. *(i) Public-score heterogeneity:* the matrix mixes vendor-reported and third-party scores under varying evaluation configurations, so BenchPress predicts what this public matrix would extrapolate to rather than what a controlled re-evaluation would yield. *(ii) Snapshot dependence:* the rank-2 structure and prediction errors are conditional on the 84 models and 133 benchmarks in this snapshot; future frontier releases with capability profiles unlike anything in the current matrix can break this geometry. *(iii) Score inferability:* our analysis identifies benchmark *scores* that are currently inferable from others, not benchmarks whose *existence* is unnecessary. Benchmarks still serve purposes beyond score prediction, including failure-mode discovery, contamination and distribution-shift monitoring, and incentive shaping for model developers. *(iv) Probe-set specificity:* compact probe sets are selected for the current matrix and should be re-derived as the matrix grows or the model population drifts.

## 2 Related Work

##### Low\-rank structure in evaluation\.

Burnell et al. (burnell2023) argued that evaluation reporting is redundant. A follow-up by the same group (burnell2023structure) found that three latent factors (reasoning, comprehension, core language modeling) explain most of the variance across 27 HELM (liang2023helm) tasks evaluated on 29 models. Ilić & Gignac (ilic2024) applied psychometric factor analysis to 591 models from the Open LLM Leaderboard, finding a $g$-factor (borrowing the term from human intelligence research) that accounts for 85% of variance across 12 benchmarks. Burnham (burnham2025) independently arrived at a closely related rank-2 decomposition of the Epoch AI Capabilities Index into "general capability + provider-specific residual" via PCA, consistent with the rank-2 geometry we recover in [Section 3.3](https://arxiv.org/html/2606.24020#S3.SS3) on a different (heterogeneous, frontier-era) matrix. These studies establish that low-rank structure exists; we build a benchmark score-prediction system on top of it, show what it enables, and characterize where it breaks ([Sections 4](https://arxiv.org/html/2606.24020#S4), [5](https://arxiv.org/html/2606.24020#S5) and [6](https://arxiv.org/html/2606.24020#S6)).

##### Benchmark compression and design\.

Perlitz et al. (perlitz2024) showed HELM evaluation can be compressed $100\times$ with minimal ranking reliability loss; their follow-up (perlitz2024bat) formalized best practices for evaluating whether benchmarks agree with one another. Ni et al. (ni2024mixeval) (MixEval) constructed a single compact benchmark from web-query-matched items, achieving 0.96 Chatbot Arena correlation. These approaches select or design a fixed evaluation suite *a priori*. BenchPress instead predicts missing scores from whatever benchmarks happen to be available, requiring no fixed probe set: a practitioner can feed in MMLU (hendrycks2021mmlu) and GPQA (rein2024gpqa) today, or LiveCodeBench and AIME tomorrow, without reconfiguration.

##### Item\-level subset selection\.

A complementary line of work reduces cost *within* individual benchmarks by selecting which test items to run. *IRT-based* methods include MetaBench (kipnis2024), which uses item response theory to keep 3% of items across six benchmarks while preserving aggregate conclusions, and tinyBenchmarks (polo2024), which builds on Anchor Points using IRT to pick informative items. *Correlation- or embedding-based* methods include Anchor Points (vivek2024anchorpoints), which selects items via cross-model correlations; Scales++ (bean2025scalespp), which uses cognitive-scale embeddings to reduce cost $18\times$ at 2.9% MAE without prior model evaluations; DISCO (rubinstein2025), which condenses sets by selecting items where models disagree most; SubLIME (saranathan2025sublime), which trains a correlation predictor for compact subsets; EssenceBench (wang2026essencebench), which applies genetic algorithms for up to $200\times$ compression; and Zhou et al. (zhou2025), which exploits low-rank structure at the example level for up to $20\times$ speedups. Most of these methods require instance-level pass/fail data across many models to calibrate item selection; Scales++ is a notable exception. BenchPress requires only aggregate scores and predicts *across* benchmarks, a complementary approach that could be combined with item-level methods for end-to-end savings.

##### Score prediction\.

Closest to our work are methods that predict aggregate benchmark scores directly. Schram et al. (schram2023) applied Bayesian matrix factorisation to predict cross-lingual NLP performance, the nearest methodological predecessor, though in a different domain (languages $\times$ tasks, not LLMs $\times$ benchmarks). Zhang et al. (zhang2024cpp) applied collaborative filtering to LLM scores; Ruan et al. (ruan2024) showed performance is a function of a low-dimensional capability space; Polo et al. (polo2024sloth) used latent skill models for cross-benchmark prediction; Ye et al. (ye2023) showed BIG-bench is 95%+ predictable. Park et al. (park2025precog) took a different approach entirely, using LLMs to predict benchmark scores from text descriptions alone, with no execution needed; we revisit this LLM-as-predictor comparison empirically in [Section 4.3](https://arxiv.org/html/2606.24020#S4.SS3). Koh et al. (koh2026rbridge) (rBridge) use a small proxy model to predict large-model reasoning performance via scaling-law-like transfer; this requires actually training the proxy, while BenchPress requires no model access at all. We differ from these score-prediction methods in three ways: (1) we operate at substantially larger scale and on a frontier-era snapshot (84 models $\times$ 133 benchmarks, including post-2024 reasoning, coding, and agentic suites), (2) we compare 84 transform–method configurations head-to-head on the same data, and (3) we provide explicit failure analysis, showing where and why prediction breaks.

## 3 The Score Matrix and Its Geometry

In this section we ask: given a collection of existing LLM benchmark scores, can we predict the missing ones from a small subset? Our starting point is a *score matrix* with models on one axis and benchmarks on the other, populated from publicly available evaluations. [Section 3.1](https://arxiv.org/html/2606.24020#S3.SS1) describes how each cell is sourced and audited. [Section 3.2](https://arxiv.org/html/2606.24020#S3.SS2) introduces the resulting matrix and discusses its data quality limitations. [Section 3.3](https://arxiv.org/html/2606.24020#S3.SS3) reveals that the score matrix is effectively rank-2. Appendix details for data collection and the released score matrix appear in [Sections B.1](https://arxiv.org/html/2606.24020#A2.SS1) and [B.2](https://arxiv.org/html/2606.24020#A2.SS2). Throughout, $M$ denotes the number of models, $B$ denotes the number of benchmarks, $s_{mb}$ denotes the observed score of model $m$ on benchmark $b$, and $\hat{s}_{mb}$ denotes its prediction.

### 3.1 Data Collection

Our data collection proceeds in four steps\.*First*, we seed a queue with a small initial set of models \(GPT\-5\.5\(openai2026gpt55\), Claude Opus 4\.7\(anthropic2026claude47\), Gemini 3\.1 Pro\(google2026gemini31\), DeepSeek\-V4\-Pro\(deepseek2026v4pro\), and a few other widely\-discussed recent releases\) and crawl every official source attached to each: the release blog, the system card, the technical report, and the Hugging Face model card\.*Second*, we recurse: each source typically reports the model’s own scores together with a handful of competitor baselines, so any newly mentioned model is added to the queue and crawled in turn, until no further round introduces an unvisited model\.*Third*, we sweep a fixed list of primary leaderboards \(MathArena\(matharena\), ARC\-Prize\(arcprize\), Terminal\-Bench\(terminalbench\), LMArena\(lmarena\), Epoch AI\(epochai\_frontiermath\), LiveBench\(white2024livebench\)\) to fill remaining gaps for models and benchmarks that vendors do not directly cover\.*Fourth*, we filter the resulting raw matrix to a dense subset that supports the analyses in the rest of the paper\.

When the same model-benchmark cell is reported by multiple sources we resolve conflicts as best we can: we keep the highest-priority value, with priority order release blog, then system card, then technical report, then Hugging Face model card, then primary leaderboards, breaking ties by recency. We also fix one canonical configuration per model (typically the setting the vendor itself foregrounds, such as a specific reasoning effort level) to avoid inflating coverage with effectively duplicate rows from near-duplicate variants. Every retained value carries the URL it was sourced from ([Section B.1](https://arxiv.org/html/2606.24020#A2.SS1) shows the released record format), and roughly half come from a source outside the model's own provider. Alongside the score itself, each entry records the model's release date, provider, and canonical evaluation setting (mode, reasoning effort, sampling, judge, harness, prompt style, temperature, tool use), and each benchmark records its category, metric, problem count, and reference link, so downstream analyses can condition on release timing, evaluation regime, or benchmark type without re-crawling the sources.
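The conflict-resolution rule above can be sketched in a few lines. This is a minimal illustration, not the released pipeline: the record keys (`model`, `benchmark`, `score`, `source_type`, `reported_on`) and the source-type labels are hypothetical names chosen here, and "breaking ties by recency" is assumed to mean that the more recently reported value wins.

```python
from datetime import date

# Priority order from the text; a lower index wins. The labels are hypothetical.
PRIORITY = ["release_blog", "system_card", "technical_report",
            "hf_model_card", "leaderboard"]
RANK = {name: i for i, name in enumerate(PRIORITY)}

def resolve(records):
    """Keep one record per (model, benchmark) cell.

    Each record is a dict with the hypothetical keys
    model, benchmark, score, source_type, reported_on (a datetime.date).
    """
    best = {}
    for rec in records:
        cell = (rec["model"], rec["benchmark"])
        # Smaller tuple is better: higher-priority source first,
        # then the more recent date (assumed tie-break direction).
        key = (RANK[rec["source_type"]], -rec["reported_on"].toordinal())
        if cell not in best or key < best[cell][0]:
            best[cell] = (key, rec)
    return {cell: rec for cell, (key, rec) in best.items()}
```

A system-card value therefore overrides a leaderboard value for the same cell, and between two system-card values the later one is kept.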

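Table 1: The threshold filter trades coverage for density. Each row applies a minimum-observation requirement to model rows and benchmark columns, iterated to a fixed point. We use the bolded setting throughout.

| Min. obs. per model | Min. obs. per bench. | # Models | # Bench. | # Obs. | Fill |
|---|---|---|---|---|---|
| (unfiltered) | (unfiltered) | 188 | 316 | 4,493 | 7.6% |
| 10 | 8 | 130 | 141 | 3,201 | 17.5% |
| 10 | 12 | 124 | 104 | 2,795 | 21.7% |
| 10 | 16 | 112 | 70 | 2,246 | 28.6% |
| **15** | **8** | **84** | **133** | **2,604** | **23.3%** |
| 15 | 12 | 61 | 81 | 1,811 | 36.7% |
| 15 | 16 | 53 | 53 | 1,337 | 47.6% |
| 20 | 8 | 49 | 110 | 1,885 | 35.0% |
| 20 | 12 | 41 | 69 | 1,353 | 47.8% |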

![Refer to caption](https://arxiv.org/html/2606.24020v1/bp_matrix_clean_white.pdf)

Figure 2: Observation pattern of the $84\times 133$ score matrix, sorted by coverage. Each row is a model and each column is a benchmark; dark cells are observed scores. Only 23.3% of entries are filled.

##### Data quality caveat\.

Public benchmark scores are heterogeneous measurements, and the matrix treats all of them as comparable\. Every score has a source URL for auditability, but users should be aware of the following noise sources, an inherent limitation of any cross\-provider benchmark analysis:

- Source heterogeneity. Prompting strategies, evaluation harnesses, reasoning budgets, and evaluation dates differ across sources. Some scores are vendor-reported; others come from independent third parties. Vendor-reported scores may be optimistically biased relative to independent reproductions, potentially inflating apparent cross-benchmark correlations.
- Measurement noise. Benchmark scores have inherent noise from non-deterministic decoding, prompt sensitivity, and evaluation harness differences. The same model evaluated with identical prompts can produce scores varying by 1–3 points across runs due to sampling temperature, and different harnesses can shift scores by 5+ points on the same benchmark.
- Structured missingness. Popular models $\times$ popular benchmarks are over-represented, violating the uniform sampling assumption underlying standard matrix completion guarantees.

We do not attempt to correct for any of these effects; our error estimates conflate prediction error with measurement noise, and prediction accuracy should be interpreted as an upper bound on what a fully standardized evaluation would achieve\.

##### Filtering\.

The raw matrix at this point (May 2026) contains 188 models and 316 benchmarks but only 4,493 of the 59,408 cells are filled (7.6%), with the long tail dominated by barely-observed rows and columns. This raw audit pool is useful for provenance, but it is not yet the analysis matrix: some rows or columns are alternate views of the same underlying signal. We first canonicalize these cases. For model setting variants, we keep one representative row rather than making one mode trivially predictable from another. For benchmark variants, we keep one canonical column per task family; same-scale versions may fill missing canonical cells as non-canonical measurements, while different-scale variants are excluded. This yields a canonicalized pool with 181 models, 304 benchmarks, and 4,177 observed cells. The canonicalized pool is still very sparse, so we then filter to a dense subset by requiring every retained model to be observed on at least a minimum number of benchmarks, and every retained benchmark to be observed on at least a minimum number of models. [Table 1](https://arxiv.org/html/2606.24020#S3.T1) sweeps a grid of these requirements; we adopt 15 observations per model and 8 observations per benchmark for all analyses in this paper. The resulting matrix has 84 models $\times$ 133 benchmarks with 2,604 observed cells (23.3% fill).
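The threshold filter described above is applied repeatedly, because dropping a sparse benchmark can push a model below its own minimum and vice versa. A minimal sketch of that fixed-point iteration follows; the function name and the boolean-mask input are assumptions made for illustration, and the released code may organize this differently.

```python
import numpy as np

def filter_to_fixed_point(observed, min_per_model=15, min_per_bench=8):
    """Iteratively drop sparse rows and columns until none remain.

    observed: boolean array of shape (models, benchmarks), True where a
    score exists. Returns the indices of the retained rows and columns.
    """
    rows = np.arange(observed.shape[0])
    cols = np.arange(observed.shape[1])
    while True:
        sub = observed[np.ix_(rows, cols)]
        keep_rows = sub.sum(axis=1) >= min_per_model
        keep_cols = sub.sum(axis=0) >= min_per_bench
        if keep_rows.all() and keep_cols.all():
            return rows, cols  # fixed point: every survivor meets its minimum
        rows, cols = rows[keep_rows], cols[keep_cols]

def fill_rate(observed, rows, cols):
    """Fraction of observed cells in the retained submatrix."""
    return float(observed[np.ix_(rows, cols)].mean())
```

With the defaults matching the adopted 15-per-model and 8-per-benchmark setting, the fill rate quoted in the text is simply 2,604 / (84 × 133) = 2,604 / 11,172 ≈ 23.3%.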

### 3.2 The Final Score Matrix

Throughout, $s_{mb}$ denotes the observed score of model $m$ on benchmark $b$ and $\hat{s}_{mb}$ its prediction. After the curation pipeline of [Section 3.1](https://arxiv.org/html/2606.24020#S3.SS1), the score matrix contains 2,604 observed entries out of 11,172 cells (23.3% fill rate). [Figure 2](https://arxiv.org/html/2606.24020#S3.F2) shows the sparsity pattern: popular models and benchmarks are well-covered, but the lower-right corner is almost entirely empty. [Figure 3](https://arxiv.org/html/2606.24020#S3.F3) summarizes what this adopted matrix contains: a broad benchmark mix, with observed cells concentrated in math, coding, agentic/tool-use, and knowledge-oriented evaluations; a model set concentrated in recent releases, with coverage varying by release time; and score provenance dominated by model-provider materials, with the remainder split between benchmark leaderboards and third-party aggregators.

##### Benchmarks\.

The 133 benchmarks span all major LLM evaluation axes: agentic tasks and tool use, math, coding, multimodal and vision, long context, instruction following, knowledge and QA, reasoning, hallucination and factuality, science, composite indices, human preference, safety, and other specialized categories. The full benchmark inventory with metrics, item counts, and source links is provided in the benchmark table of [Appendix B](https://arxiv.org/html/2606.24020#A2).

##### Models\.

The 84 models span 13 providers: OpenAI \(20\), Google \(12\), Anthropic \(11\), Alibaba/Qwen \(11\), DeepSeek \(9\), Meta \(6\), Zhipu AI \(4\), Moonshot AI \(3\), xAI \(3\), MiniMax \(2\), Cohere \(1\), ByteDance \(1\), and Mistral \(1\)\. Among models with annotated type, 51 are reasoning models \(chain\-of\-thought\) and 31 are non\-reasoning\. Among models with annotated release status, 35 are open\-weight and 47 are closed\. Where parameter counts are disclosed, they range from 1B \(e\.g\., Gemma 3 1B\(google2025gemma3\)\) to 1\.6T \(e\.g\., DeepSeek\-V4\-Pro\(deepseek2026v4pro\)\)\.

![Refer to caption](https://arxiv.org/html/2606.24020v1/bp_bench_mix.pdf) (a) Benchmark mix by category.
![Refer to caption](https://arxiv.org/html/2606.24020v1/bp_obs_concentrate.pdf) (b) Where observed scores concentrate.
![Refer to caption](https://arxiv.org/html/2606.24020v1/bp_releases.pdf) (c) Model releases over time.
![Refer to caption](https://arxiv.org/html/2606.24020v1/bp_coverage.pdf) (d) Coverage by model release time.
![Refer to caption](https://arxiv.org/html/2606.24020v1/bp_source_provenance.pdf) (e) Source provenance of observed cells.

Figure 3: Composition and coverage of the adopted score matrix (84 models $\times$ 133 benchmarks, 2,604 observed cells, 23.3% fill). The matrix spans a broad mix of benchmark categories (a), with most observed cells in math, coding, agentic/tool-use, and knowledge-oriented evaluations (b). Models are concentrated in recent releases (c), and coverage varies by release time because newer models are often reported on different benchmark suites than older baselines (d). Roughly four in five scores come from the model provider's own materials, with the remainder split between benchmark leaderboards and third-party aggregators (e). Full benchmark and model details are provided in [Appendix B](https://arxiv.org/html/2606.24020#A2) (the benchmark inventory table and [Table 10](https://arxiv.org/html/2606.24020#A2.T10)).

##### Living dataset\.

The score matrix is designed to grow as new models and benchmarks appear\. All results in this paper are based on the May 2026 snapshot, which after filtering contains 84 models and 133 benchmarks \([Figure˜2](https://arxiv.org/html/2606.24020#S3.F2)\); the unfiltered audit pool covers 188 models across 316 benchmarks\. The data format, evaluation harness code, and prediction methods are all open\-source, allowing others to extend the matrix and reproduce all experiments\. Community contributions via pull request are welcome\.

### 3.3 Rank-2 Geometry

If model capabilities lie in a low-dimensional space, the score matrix should be predictable from a low-rank completion. The operational question is therefore not only whether observed scores can be compressed, but which rank best predicts held-out benchmark scores. We establish rank 2 through two lines of evidence: *(i)* rank-2 matrix completion minimizes held-out prediction error in raw-score and logit-score spaces ([Figure 4](https://arxiv.org/html/2606.24020#S3.F4)), and *(ii)* *Singular Value Decomposition* (SVD) of fully-observed submatrices shows matching rank-2 geometry.

*Evidence 1: Rank-2 completion minimizes held-out prediction error.* We use Soft-Impute (mazumder2010), a standard matrix-completion method that alternates between filling missing entries and taking a low-rank SVD approximation. We sweep its rank in raw-score and logit-transformed score spaces; the latter linearizes percentage scores before standardization. We evaluate on held-out entries using Median Absolute Percentage Error (MedAPE), the median of absolute percentage errors $|\mathrm{predicted}-\mathrm{true}|/|\mathrm{true}|\times 100\%$. [Figure 4](https://arxiv.org/html/2606.24020#S3.F4) shows the same pattern in both score spaces: held-out error is minimized at rank 2 and rises for higher ranks.
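The rank sweep can be sketched as follows. This is a simplified illustration under stated assumptions rather than the paper's implementation: `soft_impute` here is a rank-capped variant of the fill-then-SVD iteration with an optional singular-value shrinkage `lam`, the fixed iteration count stands in for a convergence check, and `rank_sweep` and the boolean `holdout` mask are hypothetical names.

```python
import numpy as np

def soft_impute(X, mask, rank, lam=0.0, n_iter=100):
    """Rank-capped fill-then-SVD iteration (simplified Soft-Impute sketch).

    X: score matrix; entries where mask is False are ignored (may be NaN).
    mask: boolean array, True where a score is observed.
    """
    est = np.zeros(X.shape, dtype=float)
    for _ in range(n_iter):
        # Keep observed entries, fill the rest with the current estimate.
        filled = np.where(mask, X, est)
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        # Keep the top `rank` components, shrinking singular values by lam.
        s = np.maximum(s[:rank] - lam, 0.0)
        est = (U[:, :rank] * s) @ Vt[:rank]
    return est

def medape(pred, true):
    """Median of |pred - true| / |true|, expressed in percent."""
    return float(np.median(np.abs(pred - true) / np.abs(true)) * 100.0)

def rank_sweep(X, mask, holdout, ranks=(1, 2, 3, 4, 5)):
    """Held-out MedAPE per rank; holdout marks observed cells to hide."""
    train = mask & ~holdout
    return {r: medape(soft_impute(X, train, r)[holdout], X[holdout])
            for r in ranks}
```

Picking the rank with the smallest value in the returned dictionary mirrors the selection rule in the text; the same sweep can be run on logit-transformed scores by transforming `X` first.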

![[Uncaptioned image]](https://arxiv.org/html/2606.24020v1/bp_rank_ucurve_raw_logit.pdf)

Figure 4: Held-out MedAPE vs. rank for raw- and logit-space Soft-Impute matrix completion. Both curves bottom out at rank 2.

Table 2: Complete submatrix SVD analysis across different trade-off points between benchmark coverage and model count. "Var (top $k$)" denotes the cumulative variance explained by the first $k$ singular components.

*Evidence 2: Complete submatrices show matching rank-2 geometry.* Recall that the best rank-$r$ approximation retains the top $r$ singular values, and the *stable rank* $\|M\|_F^2/\sigma_1^2$ measures effective dimensionality: values near 1 mean that one component dominates. We mean-center each benchmark column before computing SVD, so that the leading component reflects directions of model variation rather than the shared average score level (equivalent to PCA on the submatrix). We then take the largest fully-observed model subsets available at several benchmark-coverage levels and compute the SVD of each submatrix. [Table 2](https://arxiv.org/html/2606.24020#S3.T2) shows the same pattern across these different shapes: the spectra are dominated by a single direction, and two components explain more than 90% of the variance in every submatrix.
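Both diagnostics in this paragraph follow directly from the singular values of the column-centered submatrix, since $\|M\|_F^2 = \sum_i \sigma_i^2$. The sketch below assumes a fully observed NumPy array as input; the function name `spectrum_summary` is hypothetical.

```python
import numpy as np

def spectrum_summary(S, k=2):
    """Stable rank and top-k explained variance of a complete submatrix.

    S: fully observed (models x benchmarks) score array.
    """
    # Mean-center each benchmark column, as in PCA.
    centered = S - S.mean(axis=0, keepdims=True)
    sigma = np.linalg.svd(centered, compute_uv=False)
    energy = sigma ** 2                      # squared singular values
    stable_rank = energy.sum() / energy[0]   # ||M||_F^2 / sigma_1^2
    var_top_k = energy[:k].sum() / energy.sum()
    return float(stable_rank), float(var_top_k)
```

A stable rank close to 1 together with a top-2 variance share above 0.9 is the pattern the text reports for every submatrix.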

Taken together, the held\-out completion sweep gives the operational reason to use rank 2, and the fully\-observed SVDs show the matching geometry behind that choice:

Finding 1: The $84\times 133$ model–benchmark score matrix behaves as an effectively rank-2 prediction problem.

## 4 BenchPress: A Low-rank Benchmark Score Predictor

Because the held-out rank sweep in [Section 3.3](https://arxiv.org/html/2606.24020#S3.SS3) selects rank 2, we now ask: can we turn this structure into a predictor for missing benchmark scores? [Section 4.1](https://arxiv.org/html/2606.24020#S4.SS1) introduces the candidate prediction methods; [Section 4.2](https://arxiv.org/html/2606.24020#S4.SS2) moves from these candidates to BenchPress by comparing all transform–method combinations on a common experiment setting and selecting the default score predictor; and [Section 4.3](https://arxiv.org/html/2606.24020#S4.SS3) compares BenchPress against LLMs as benchmark score predictors.

### 4.1 Candidate Methods

Suppose a new model arrives with scores on $k$ benchmarks. How do we predict its scores on the remaining benchmarks? We decompose the problem into two design choices: (i) a *feature transform* that reshapes raw scores into a space amenable to linear methods, and (ii) a *prediction method* that exploits correlations across benchmarks or models.

##### Feature transforms\.

Let $s\in[0,100]$ denote a benchmark score. Scores are bounded percentages, and models cluster near the ceiling on easy benchmarks and near the floor on hard ones. We evaluate seven transforms that address this nonlinearity in different ways:

- *Identity.* Use scores $s$ as-is, with no transformation.
- *Log.* Apply $\log(s+1)$, compressing high scores and stretching low ones.
- *Logit.* Apply $\log\bigl(s/(100-s)\bigr)$, mapping the bounded score range to an unbounded scale that symmetrically spreads apart scores near both 0 and 100.
- *Arcsinh.* Apply $\operatorname{arcsinh}(s/50)$, a smooth approximation to log that is defined at zero.
- *Square root.* Apply $\sqrt{s}$, a mild compression that reduces the influence of high scores.
- *Probit.* Apply $\Phi^{-1}(s/100)$, where $\Phi$ is the standard normal CDF. Similar to logit but with lighter tails, so scores near 0 and 100 are spread apart less aggressively.
- *Quantile.* Replace each score with its within-benchmark rank divided by $n+1$, producing uniform marginals. Non-parametric but discards magnitude information.

Transforms that assume a $[0,100]$ range (logit, probit, arcsinh, square root) are applied only to percentage-scale benchmarks; the few non-percentage benchmarks (Codeforces rating (codeforces), Chatbot Arena Elo (chiang2024), GDPval (Artificial Analysis ELO) (gdpval)) do not suffer from ceiling or floor effects and are left untransformed. For logit and probit, scores are clipped slightly away from the endpoints before transformation to avoid infinite values. After applying the chosen transform, we standardize each benchmark column to zero mean and unit variance, so every prediction method operates entirely in the transformed, standardized space. After prediction, we invert the pipeline in reverse order: first undo the standardization (restoring each column's stored mean and standard deviation), then apply the inverse feature transform (e.g., sigmoid for logit) to map predictions back to the original score scale. For percentage-scale benchmarks we clip the final predictions to $[0,100]$; non-percentage benchmarks are left unconstrained.
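The forward and inverse pipeline for one percentage-scale column can be sketched as follows. This is a minimal illustration for the logit transform only; the clipping margin `EPS` is a hypothetical value, since the text says only that scores are clipped "slightly away from the endpoints", and the function names are illustrative.

```python
import numpy as np

EPS = 0.5  # hypothetical clipping margin in score points (not stated in the paper)

def forward_logit(s):
    """Clip percentage scores away from 0 and 100, then apply the logit transform."""
    s = np.clip(np.asarray(s, float), EPS, 100 - EPS)
    return np.log(s / (100 - s))

def fit_column(s):
    """Mean and standard deviation of the observed (non-NaN) logit scores of one column."""
    z = forward_logit(s)
    return float(np.nanmean(z)), float(np.nanstd(z))

def to_model_space(s, mu, sigma):
    """Transform, then standardize: the space in which every predictor operates."""
    return (forward_logit(s) - mu) / sigma

def to_score_space(x, mu, sigma):
    """Invert in reverse order: undo standardization, apply sigmoid, clip to [0, 100]."""
    z = np.asarray(x, float) * sigma + mu
    return np.clip(100.0 / (1.0 + np.exp(-z)), 0.0, 100.0)

column = np.array([20.0, 50.0, 90.0, np.nan])   # NaN marks a missing score
mu, sigma = fit_column(column)
x = to_model_space(column, mu, sigma)
back = to_score_space(x, mu, sigma)
print(x, back)
```

The stored `mu` and `sigma` are what make the inversion exact for observed cells; a prediction made in the standardized space is mapped back through the same two numbers.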

##### Prediction methods\.

We compare the following methods, each evaluated across multiple feature transforms:

- *Benchmark mean.* Predict each missing score as the column average. No tunable parameters.
- *Model mean.* Adjust the benchmark mean by each model's overall strength percentile. No tunable parameters.
- *Benchmark-KNN* (*Bench-KNN*). For each missing entry, find the $k$ benchmarks most correlated with the target benchmark and predict from the model's observed scores on those neighbors, using correlation-based weights. Hyperparameter: $k$ (number of neighbors). Footnote 1: Throughout the paper, "correlation" refers to the Pearson correlation: for two columns $a,b$ of length $n$, $\rho(a,b)=\frac{\sum_{i}(a_i-\bar{a})(b_i-\bar{b})}{\sqrt{\sum_{i}(a_i-\bar{a})^{2}\cdot\sum_{i}(b_i-\bar{b})^{2}}}\in[-1,1]$, where $\bar{a},\bar{b}$ are the column means.
- *Model-KNN.* Find the $k$ models closest to the target model by root-mean-square distance over shared observed benchmarks, then average their scores on the target benchmark. Hyperparameter: $k$.
- *Per-benchmark regression* (*BenchReg*). For each target benchmark, BenchReg selects the $k$ most correlated predictor benchmarks, fits one univariate regression per predictor benchmark, and combines the available predictions with $R^2$ weights. Targets and predictor pairs with fewer than five observations are skipped. When a model lacks observations on some predictors, BenchReg uses only the observed predictors; if none are observed, the cell is left unpredicted (coverage $<100\%$). We use an ensemble of univariate regressions rather than a single multivariate model because the number of shared observations per benchmark pair is often very small (5–12 models), making joint estimation of $k$ coefficients prone to overfitting. Hyperparameters: $k\in\{3,5,7\}$, $R^2_{\min}\in\{0.1,0.2,0.3\}$. Footnote 2: We use the coefficient of determination $R^2=1-\mathrm{SSE}/\mathrm{SST}$, where $\mathrm{SST}=\sum_i (y_i-\bar{y})^2$ is the total variance of the target around its mean $\bar{y}$ and $\mathrm{SSE}=\sum_i (y_i-\hat{y}_i)^2$ is the residual variance left by the fit, with $y_i$ the values we are trying to predict (the target benchmark's observed scores across shared models in the univariate-regression case) and $\hat{y}_i$ the corresponding predicted values. $R^2=1$ means a perfect fit ($\mathrm{SSE}=0$), $R^2=0$ means the fit does no better than predicting the constant mean $\bar{y}$, and $R^2<0$ means it does worse than the constant mean.
- *Per-model regression* (*ModelReg*). ModelReg is the row-wise counterpart to BenchReg. For each target model, it selects the $k$ most correlated predictor models over shared benchmarks, fits univariate regressions from each predictor model's benchmark scores to the target model's scores, and combines the resulting predictions with $R^2$ weights. Like BenchReg, it skips targets and predictor pairs with fewer than five observations and can leave a cell unpredicted when no usable predictor has enough shared observations. Hyperparameters: $k\in\{3,5,7\}$, $R^2_{\min}\in\{0.1,0.2,0.3\}$.
- *Soft-Impute.* Soft-Impute (mazumder2010) iterates between SVD truncation at a chosen rank and re-imputation of missing entries until convergence. We fix the rank to 2 following the held-out rank sweep in [Section 3.3](https://arxiv.org/html/2606.24020#S3.SS3) (no tunable hyperparameters).
- *NMF.* Non-negative matrix factorization (lee1999nmf), constraining both factors to be non-negative. Hyperparameter: rank $r$.
- *PMF.* Probabilistic matrix factorization (salakhutdinov2008pmf) with Gaussian priors on both factors. Hyperparameter: rank $r$.
- *Nuclear norm minimization.* Convex relaxation of rank minimization (candes2009): minimize the nuclear norm of the completed matrix plus a squared-error fit on observed entries, with $\lambda$ trading off low rank against data fidelity. Hyperparameter: $\lambda$.
- *Bias-decomposed alternating least squares (ALS)* (koren2009). Bias ALS is the full-coverage method that we later adopt. After the feature transform and column standardization, let $X\in\mathbb{R}^{M\times B}$ be the transformed score matrix over $M$ models and $B$ benchmarks, let $x_{mb}$ be the observed entry for model $m$ and benchmark $b$, and let $\Omega$ be the set of observed cells. Let $\bar{x}$, $\bar{x}_{m\cdot}$, and $\bar{x}_{\cdot b}$ be the observed global, model, and benchmark means in this transformed space. For rank $R$ and regularization $\lambda$, Bias ALS fits $U\in\mathbb{R}^{M\times R}$ and $V\in\mathbb{R}^{B\times R}$ by

  $$(U,V)=\arg\min_{U,V}\sum_{(m,b)\in\Omega}\Bigl[x_{mb}-\bigl(\bar{x}+(\bar{x}_{m\cdot}-\bar{x})+(\bar{x}_{\cdot b}-\bar{x})\bigr)-(UV^{\top})_{mb}\Bigr]^{2}+\lambda\left(\|U\|_F^{2}+\|V\|_F^{2}\right). \tag{2}$$

  Its prediction is the sum of a global level, a model offset, a benchmark offset, and a rank-$R$ residual correction:

  $$\hat{x}_{mb}=\underbrace{\bar{x}}_{\text{global level}}+\underbrace{(\bar{x}_{m\cdot}-\bar{x})}_{\text{model }m\text{ offset}}+\underbrace{(\bar{x}_{\cdot b}-\bar{x})}_{\text{benchmark }b\text{ offset}}+\underbrace{(UV^{\top})_{mb}}_{\text{rank-}R\text{ residual correction}}. \tag{3}$$

  The biases absorb row and column offsets, so the low-rank term only has to model residual model–benchmark interaction structure. ALS updates each block in closed form via ridge regression on observed entries; we ensemble-average over multiple random initializations to reduce sensitivity to local minima. We fix the rank to 2 following the held-out rank sweep in [Section 3.3](https://arxiv.org/html/2606.24020#S3.SS3); the only tunable hyperparameter is the regularization $\lambda$.
- *Neural baseline* (*MLP*). A 2-layer MLP (hidden dimension 32) with a binary mask for missing entries, trained for 500 epochs. Hyperparameter: learning rate.

Full definitions of all prediction methods, including equations and fallback rules, are in [Section C.1](https://arxiv.org/html/2606.24020#A3.SS1).
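To make the bias-decomposed ALS bullet concrete, here is a minimal NumPy sketch of Eqs. (2) and (3): offsets come from observed means, and the rank-$R$ factors are fit to the residual by alternating ridge solves. It is an illustration under assumptions, not the released BenchPress code: the iteration count, number of initializations, and initialization scale are hypothetical, and every row and column is assumed to have at least one observed entry.

```python
import numpy as np

def bias_als(X, rank=2, lam=0.1, n_iters=50, n_inits=5, seed=0):
    """Complete X (NaN = missing) as global level + row offset + column offset + U V^T."""
    obs = ~np.isnan(X)
    M, B = X.shape
    g = np.nanmean(X)                                    # global level
    row_off = np.nanmean(X, axis=1) - g                  # model offsets
    col_off = np.nanmean(X, axis=0) - g                  # benchmark offsets
    baseline = g + row_off[:, None] + col_off[None, :]
    resid = np.where(obs, X - baseline, 0.0)             # residual on observed cells
    ridge = lam * np.eye(rank)
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_inits):                             # ensemble over random inits
        U = 0.1 * rng.standard_normal((M, rank))
        V = 0.1 * rng.standard_normal((B, rank))
        for _ in range(n_iters):
            for m in range(M):                           # ridge solve for each model row
                Vo = V[obs[m]]
                U[m] = np.linalg.solve(Vo.T @ Vo + ridge, Vo.T @ resid[m, obs[m]])
            for b in range(B):                           # ridge solve for each benchmark
                Uo = U[obs[:, b]]
                V[b] = np.linalg.solve(Uo.T @ Uo + ridge, Uo.T @ resid[obs[:, b], b])
        preds.append(baseline + U @ V.T)
    return np.mean(preds, axis=0)

# Toy check: a purely additive matrix is reproduced by the bias terms alone.
r = np.array([0.0, 1.0, 2.0, 3.0])
c = np.array([10.0, 20.0, 30.0])
additive = r[:, None] + c[None, :]
full_pred = bias_als(additive)

with_gap = additive.copy()
with_gap[0, 0] = np.nan                                  # hide one cell
gap_pred = bias_als(with_gap)
print(gap_pred[0, 0])
```

Because the offsets are fixed from observed means before the factors are fit, the low-rank term only sees what the additive baseline cannot explain, which mirrors the decomposition in Eq. (3).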

### 4.2 From Candidate Methods to BenchPress

We evaluate all combinations of the seven feature transforms and twelve prediction methods, selecting hyperparameters independently for each combination. For each model, we randomly hide half of its known benchmark scores, train on the remaining half (plus all other models' data), predict the hidden scores, and measure error. We use 3 folds per seed and 10 seeds ($\sim 20{,}000$ test predictions per pair). Each (transform, method) pair follows a four-stage pipeline: (i) apply the feature transform, (ii) standardize each column, (iii) run the prediction method, (iv) invert both transforms to recover original-scale predictions. Predictions on percentage-scale benchmarks are clipped to $[0,100]$; omitting standardization degrades most methods, especially NMF and PMF.

##### Metrics\.

Using the notation from [Section 3.2](https://arxiv.org/html/2606.24020#S3.SS2), we measure prediction quality with two score-error metrics: (i) *median absolute percentage error* ($\mathsf{MedAPE}\downarrow$), computed within each held-out fold as the median of $|\hat{s}-s|/|s|\times 100\%$ and then summarized by the median over folds, and (ii) *median absolute error* ($\mathsf{MedAE}\downarrow$), computed analogously from $|\hat{s}-s|$ in raw score points. We use median-based score-error metrics because the error distribution is heavy-tailed: across many models and benchmarks, near-zero denominators and hard outliers can make averages volatile and unrepresentative of the typical prediction quality. For benchmark- and model-level analyses below, paper-facing curves, bars, and headline deltas aggregate error records with medians rather than raw-record averages. We also report *coverage*, the fraction of held-out entries for which a method produces a finite prediction, because some methods (e.g., regression) cannot predict when insufficient correlated data exists.
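The fold-then-summarize aggregation can be sketched as below. The per-fold medians and the median over folds follow the description above; averaging coverage across folds is an assumption of this sketch, as are the function names and the toy numbers.

```python
import numpy as np

def fold_metrics(pred, true):
    """MedAPE (%), MedAE (score points), and coverage for one held-out fold."""
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    ok = np.isfinite(pred)                      # cells the method actually predicted
    abs_err = np.abs(pred[ok] - true[ok])
    return {
        "medape": float(np.median(abs_err / np.abs(true[ok])) * 100.0),
        "medae": float(np.median(abs_err)),
        "coverage": float(ok.mean()),
    }

def summarize(folds):
    """Median of per-fold medians; coverage is averaged (an assumption of this sketch)."""
    per_fold = [fold_metrics(p, t) for p, t in folds]
    return {
        "medape": float(np.median([f["medape"] for f in per_fold])),
        "medae": float(np.median([f["medae"] for f in per_fold])),
        "coverage": float(np.mean([f["coverage"] for f in per_fold])),
    }

# Two toy folds; NaN marks a cell the method left unpredicted.
folds = [
    ([52.0, np.nan, 90.0], [50.0, 70.0, 100.0]),
    ([30.0, 44.0], [40.0, 40.0]),
]
summary = summarize(folds)
print(summary)
```

Unpredicted cells lower coverage but do not enter the error medians, which is why the two quantities need to be read together when comparing regression-style methods with full-coverage ones.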

##### Hyperparameter selection\.

For each (transform, method) pair we grid-search over the hyperparameters listed below and select the configuration with the lowest pooled MedAPE:

- *Benchmark Mean, Model Mean.* No tunable parameters.
- *Bench-KNN, Model-KNN.* Number of neighbors $k\in\{3,5,7,10\}$.
- *BenchReg, ModelReg.* Number of predictors $k\in\{3,5,7\}$, minimum correlation $R^2_{\min}\in\{0.1,0.2,0.3\}$ (9 configurations each).
- *Soft-Impute.* No tunable hyperparameters; rank fixed at 2.
- *Bias-decomposed ALS.* Regularization $\lambda\in\{0.01,0.1,1.0\}$; rank fixed at 2.
- *NMF.* Rank $r\in\{1,2,3,5\}$.
- *PMF.* Rank $r\in\{1,2,3,5\}$.
- *Nuclear Norm.* Regularization $\lambda\in\{0.1,0.5,1.0,5.0\}$.
- *MLP.* Learning rate $\in\{10^{-4},10^{-3},10^{-2}\}$; architecture fixed at 2 layers with hidden dimension 32 and 500 training epochs.

##### Results\.

[Table 3](https://arxiv.org/html/2606.24020#S4.T3) ranks the best-performing configurations by the two score-error metrics. Several patterns emerge: (i) BenchReg and ModelReg dominate the top score-error entries, but their coverage is not always complete: some cells are left blank rather than predicted. (ii) Among methods that predict every missing cell, Logit Bias ALS is the strongest and remains very close to the best regression entries. (iii) The leading configurations are not separated by a large qualitative gap: related logit/probit variants and regularization choices give similar performance. We therefore use Logit Bias ALS with $\lambda=0.1$ and rank 2 as BenchPress's default score predictor because it sits near the top of the leaderboard, has full coverage, and keeps the downstream error and reliability analyses tied to a single simple configuration. The full $7\times 12$ transform–method grid is in [Section C.2](https://arxiv.org/html/2606.24020#A3.SS2).

Finding 2: Logit-transformed, bias-decomposed alternating least squares (ALS) matrix completion (koren2009), with rank 2 and regularization 0.1, gives near-best score-prediction accuracy while predicting every missing model–benchmark score.

Table 3: Top-15 transform–method configurations ranked independently by score-error metric. # is per-metric rank; values are shown as metric value followed by coverage in parentheses, e.g. 7.8 (100%). All results use standardization and report the median over 10 seeds $\times$ 3 folds. Pink highlights mark the Logit Bias ALS configuration adopted as BenchPress's main score predictor.

##### The BenchPress predictor.

Because Logit Bias ALS is the strongest full-coverage configuration in the comparison above, we adopt this configuration as BenchPress's default score predictor for the rest of the paper.

BenchPress: Point Prediction Recipe

Given a partially observed model–benchmark score matrix, BenchPress predicts every missing cell as follows:

1. Transform percentage scores with the logit transform; leave non-percentage scores on their native scale.
2. Standardize each benchmark column using the observed training entries.
3. Fit bias-decomposed alternating least squares (ALS) matrix completion (koren2009) with a global level, model offsets, benchmark offsets, and a rank-2 residual interaction ($\lambda=0.1$).
4. Invert the standardization and feature transform to return predictions on the original benchmark scale.

### 4.3 BenchPress vs. LLMs as Benchmark Score Predictors

BenchPress predicts from the observed score matrix alone. Another natural question is whether a frontier LLM can predict a benchmark score directly from the target model, the target benchmark, and a few nearest-peer examples. This is a per-cell LLM predictor: each target score is queried separately, and the prompt may expose public model and benchmark identities.

##### Experiment setting\.

We use the same held-out cells as the method-comparison experiment in [Section 4.2](https://arxiv.org/html/2606.24020#S4.SS2). For each target cell (model, benchmark), we select peer examples using only the training matrix for that fold. A candidate peer model must have an observed score on the target benchmark and share at least five visible benchmarks with the target model. Among eligible peers, we choose the five models with the highest Pearson correlation to the target model over shared visible scores. The prompt then asks GPT-5.5 to predict the target model's score on the target benchmark from these five peer examples. We consider two scenarios. In the *informed* condition, the prompt keeps the real model and benchmark names. In the *blind* condition, model and benchmark identifiers are anonymized within the prompt, while the scores and peer-example structure are preserved. The informed condition tests whether a frontier LLM can exploit public model and benchmark semantics; the blind condition tests whether the numerical peer structure alone is enough. We score both conditions on the same held-out cells as BenchPress using MedAPE and MedAE. The exact prompt template is given in [Section C.3](https://arxiv.org/html/2606.24020#A3.SS3).
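The peer-selection rule described above can be sketched as follows. The function name `select_peers`, the NaN-for-missing convention, and the skipping of constant score vectors (for which Pearson correlation is undefined) are assumptions of this sketch; prompt construction and the LLM call are not shown.

```python
import numpy as np

def select_peers(train, target_row, target_col, n_peers=5, min_shared=5):
    """Return row indices of the peers most correlated with the target model.

    `train` is the fold's training matrix with NaN for missing or held-out cells.
    """
    target = train[target_row]
    scored = []
    for m in range(train.shape[0]):
        if m == target_row or np.isnan(train[m, target_col]):
            continue                                   # peer must have the target benchmark
        shared = ~np.isnan(target) & ~np.isnan(train[m])
        if shared.sum() < min_shared:
            continue                                   # too few shared visible benchmarks
        a, b = target[shared], train[m, shared]
        if a.std() == 0 or b.std() == 0:
            continue                                   # correlation undefined (assumption: skip)
        scored.append((float(np.corrcoef(a, b)[0, 1]), m))
    scored.sort(reverse=True)                          # highest Pearson correlation first
    return [m for _, m in scored[:n_peers]]

nan = np.nan
train = np.array([
    [1, 2, 3, 4, 5, nan],      # target model; last column is the held-out cell
    [2, 4, 6, 8, 10, 50],      # perfectly correlated peer
    [5, 4, 3, 2, 1, 60],       # anti-correlated peer
    [1, 2, 3, nan, nan, 70],   # only three shared benchmarks: ineligible
    [1, 2, 3, 4, 5, nan],      # no score on the target benchmark: ineligible
    [1, 3, 2, 5, 4, 40],       # moderately correlated peer
], dtype=float)
peers = select_peers(train, target_row=0, target_col=5, n_peers=2)
print(peers)
```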

##### Results\.

Table 4: LLM score prediction from peer examples. The informed prompt sees real model and benchmark names; the blind prompt does not. Lower is better.

[Table 4](https://arxiv.org/html/2606.24020#S4.T4) shows that the informed prompt is a strong baseline and confirms that a frontier LLM can often predict held-out benchmark scores from nearest-peer examples. But this is not the same capability as score prediction from the matrix alone: names give the LLM access to public model reputations, benchmark semantics, and possibly memorized leaderboard facts. The blind condition removes that channel and is therefore the diagnostic comparison. There, the LLM is close to BenchPress but not a cheaper or more reliable replacement: it still requires paid generation over many target cells, while BenchPress fits the score matrix once and predicts every missing cell deterministically.

Finding 3: A frontier LLM can predict benchmark scores when real names are visible, but that advantage depends on model and benchmark memory. In the blind setting, BenchPress is more accurate and more scalable because it fits the score matrix once rather than querying an LLM over target cells.

## 5 What BenchPress Enables for Model Evaluation

With the default score predictor fixed, we now evaluate what BenchPress enables across realistic model-evaluation tasks. Three questions guide this section. First, under a fixed evaluation budget, which probe benchmarks should a practitioner run so that BenchPress best recovers the model's scorecard ([Section 5.1](https://arxiv.org/html/2606.24020#S5.SS1))? Second, do BenchPress's predicted scores preserve same-benchmark model rankings well enough that practitioners can use them to compare models ([Section 5.2](https://arxiv.org/html/2606.24020#S5.SS2))? Third, when a brand-new model is released after the matrix was assembled, can BenchPress still produce useful predictions from a small seed evaluation ([Section 5.3](https://arxiv.org/html/2606.24020#S5.SS3))?

Unless otherwise stated, all analyses in this section use the default BenchPress score predictor from [Section 4.2](https://arxiv.org/html/2606.24020#S4.SS2), and final metrics are reported after mapping predictions back to the original raw-score scale.

### 5.1 Budgeted Scorecard Recovery

The rank-2 structure identified in [Section 3.3](https://arxiv.org/html/2606.24020#S3.SS3) suggests that benchmark scores contain substantial shared information, rather than 133 independent measurements. Direct evidence supports this: most benchmarks can be predicted from the rest with low error ([Section E.1.1](https://arxiv.org/html/2606.24020#A5.SS1.SSS1.Px2)), and almost every benchmark column has at least one strongly correlated peer ([Section D.1](https://arxiv.org/html/2606.24020#A4.SS1)). Many scores can therefore plausibly be inferred from a small amount of carefully chosen evidence. We therefore ask the operational version of the same question: if a practitioner can run only a few benchmarks on a new model, which scores should be measured, and which scores can BenchPress infer from them?

##### Experiment setting\.

We simulate a practitioner who evaluates each target model on a fixed probe set and then asks BenchPress to complete the rest of that model's public scorecard. For a target model, only the probe columns remain visible in its row; all other models keep their observed rows. We evaluate every observed model-benchmark cell at every budget, so the denominator is fixed across probe-set sizes. Observed probe cells are counted as exact predictions with zero error, and all remaining observed cells for the target model are predicted from the masked matrix. Thus the curves measure how much of the current score matrix can be reconstructed from a small number of selected probes. They should not be read as held-out transfer estimates for a fixed universal probe set, because the probe identities are themselves selected on this current score matrix.

Starting from an empty probe set, we build a ten\-benchmark set greedily\. At each step, we try every remaining candidate benchmark, temporarily add it to the current probe set, evaluate pooled error on the fixed universe, and keep the candidate with the lowest error\. We compare two greedy probe\-set methods against a random baseline:

- Cost-unaware greedy: any benchmark can be selected as the next probe.
- Cost-aware greedy: candidates are restricted by the low-cost allowlist.
- Random baseline: we run 10 seeds. Each seed draws one global random benchmark ordering, and each budget uses the corresponding prefix for every target model. The plotted line is the mean across seeds; the shaded band is the 25th–75th percentile range.
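The greedy construction can be sketched generically as below. The callback `pooled_error` is hypothetical: in the paper's setting it would mask each target row down to the given probe set, run BenchPress, and return the pooled MedAPE or MedAE over the fixed universe of observed cells. Here a toy error function with made-up informativeness weights stands in for it.

```python
def greedy_probes(candidates, pooled_error, budget=10):
    """Forward greedy selection: at each step add the candidate with the lowest error."""
    selected, history = [], []
    remaining = list(candidates)
    while remaining and len(selected) < budget:
        best = min(remaining, key=lambda c: pooled_error(selected + [c]))
        selected.append(best)
        remaining.remove(best)
        history.append(pooled_error(selected))   # error after this budget step
    return selected, history

# Toy stand-in for the real objective: made-up informativeness weights per benchmark.
WEIGHTS = {"bench_a": 3.0, "bench_b": 1.0, "bench_c": 2.0}

def toy_error(probes):
    return 10.0 / (1.0 + sum(WEIGHTS[p] for p in probes))

# Cost-unaware greedy: every benchmark is a candidate.
unaware, unaware_curve = greedy_probes(WEIGHTS, toy_error, budget=2)

# Cost-aware greedy: candidates restricted to a (hypothetical) low-cost allowlist.
allowlist = ["bench_b", "bench_c"]
aware, aware_curve = greedy_probes(allowlist, toy_error, budget=2)
print(unaware, aware)
```

The only difference between the two greedy variants is the candidate pool passed in, which matches the description that the objective and procedure stay the same.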

Table 5: Top-10 probe sets selected by each objective. Each row uses the same set of observed model–benchmark cells and the same greedy procedure; only the objective and candidate pool change.

![[Uncaptioned image]](https://arxiv.org/html/2606.24020v1/bp_probe_evaluation_cost_aware.pdf)

Figure 5: MedAPE during probe-set construction. Pooled MedAPE decreases as selected benchmark scores are revealed; every budget is evaluated on the same observed cells.

![[Uncaptioned image]](https://arxiv.org/html/2606.24020v1/bp_ranking_preservation_overall.pdf)

Figure 6: Overall ranking preservation. Pairwise ranking accuracy as the probe budget grows.

![[Uncaptioned image]](https://arxiv.org/html/2606.24020v1/bp_temporal_deployment_boxplot.pdf)

Figure 7: Predicting newly released models under a pre-specified temporal window. Each dot is one target model; boxes show the median and interquartile range across 27 targets.

##### Results\.

The greedy procedure yields four ten-probe sets ([Table 5](https://arxiv.org/html/2606.24020#S5.T5)), one for each combination of *objective* (MedAPE or MedAE) and *candidate pool* (any benchmark, or the low-cost allowlist). All four sets lean heavily on reasoning- and math-oriented benchmarks (GPQA Diamond, ARC-AGI, MATH-500, multiple AIME and HMMT contests, and similar): reasoning is a dominant axis of variation in the score matrix, so these benchmarks supply the cleanest signal for BenchPress to triangulate the rest of a model's profile. [Figure 5](https://arxiv.org/html/2606.24020#S5.F5) (MedAPE) and [Figure 1(b)](https://arxiv.org/html/2606.24020#S0.F1.sf2) (MedAE) trace the corresponding error curves as the probe budget grows from one to ten. Two qualitative trends emerge. First, cost-aware greedy tracks the cost-unaware curve closely at every budget; restricting probes to the low-cost allowlist costs surprisingly little, because the cleanest reasoning-axis probes are already low-cost. Second, both greedy curves clearly separate from the random baseline at every budget: *which* benchmarks are chosen matters far more than how many. A more exhaustive probe-selection analysis in [Section D.1](https://arxiv.org/html/2606.24020#A4.SS1) shows that greedy probe selection is cost-effective: its performance matches or differs only marginally from more exhaustive search.

Finding 4: With only five probe benchmarks, BenchPress predicts the remaining score profile to a pooled MedAE of 3.93 score points; restricting probes to the low-cost allowlist still reaches 4.55 score points.

### 5.2 Preserving Model Rankings

The practical value of score prediction is not only numerical accuracy, but whether the predictions support the same evaluation decisions. Here the operational question is simple: when two models differ meaningfully on the same benchmark, does BenchPress preserve which model is better?

##### Experiment setting\.

We reuse the holdout setting of [Section 4.2](https://arxiv.org/html/2606.24020#S4.SS2) (10 seeds, three folds per model) and complete each benchmark leaderboard with true scores on seen cells and BenchPress predictions on held-out cells. Because adjacent leaderboard slots are often separated by tiny gaps that a small prediction error can flip, we evaluate margin-aware pairwise ordering rather than exact ranks; a shortlist-recovery view is reported in [Section D.2](https://arxiv.org/html/2606.24020#A4.SS2). For each benchmark, we form all same-benchmark model pairs with at least one held-out cell, keep those whose true score gap is at least the row's margin, and report the median across benchmarks of

$$\text{pairwise accuracy}=\frac{\#\{\text{comparable pairs whose completed order matches the true order}\}}{\#\{\text{comparable pairs}\}}. \tag{4}$$

The margin-0 row includes every non-tied pair and is most sensitive to near-ties; larger margins focus on clearer model differences.
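For a single benchmark, the pairwise accuracy of Eq. (4) can be sketched as follows. The dictionaries, the model names, and the choice to count a predicted tie as a mismatch are assumptions of this illustration; the paper then takes the median of this quantity across benchmarks.

```python
from itertools import combinations

def pairwise_accuracy(true, completed, held_out, margin=0.0):
    """Fraction of comparable pairs whose completed order matches the true order.

    true / completed: dict mapping model name to score on one benchmark.
    held_out: set of models whose cell was predicted rather than observed.
    """
    correct = total = 0
    for a, b in combinations(true, 2):
        if a not in held_out and b not in held_out:
            continue                      # pair needs at least one held-out cell
        gap = true[a] - true[b]
        if gap == 0 or abs(gap) < margin:
            continue                      # skip ties and gaps below the margin
        total += 1
        if (completed[a] - completed[b]) * gap > 0:
            correct += 1                  # same sign: order preserved
    return correct / total if total else float("nan")

true = {"m1": 80.0, "m2": 70.0, "m3": 69.0, "m4": 50.0}
completed = dict(true, m2=68.0)           # m2's cell was held out and predicted as 68
acc_margin0 = pairwise_accuracy(true, completed, {"m2"}, margin=0.0)
acc_margin2 = pairwise_accuracy(true, completed, {"m2"}, margin=2.0)
print(acc_margin0, acc_margin2)
```

In the toy leaderboard the only flipped pair is the near-tie between `m2` and `m3`, so it counts against the margin-0 accuracy and drops out at a two-point margin.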

##### Results\.

Table 6: Pairwise ranking preservation.

The margin-aware results in [Table 6](https://arxiv.org/html/2606.24020#S5.T6) show that score-prediction errors rarely overturn meaningful ordering decisions. The margin-0 row is lower because it includes many near-tied model pairs where either ordering is fragile. At a two-point score margin, BenchPress achieves 88.0% pairwise ranking accuracy across 531,498 comparable pairs. When the true score gap is at least five points, pairwise ranking accuracy rises to 92.1%. We additionally plot pairwise ranking accuracy as the probe budget grows from one to ten benchmarks ([Figure 6](https://arxiv.org/html/2606.24020#S5.F6)); both informativeness-greedy and cost-aware probe sets stay well above the random baseline, and the underlying greedy probe-set selection is detailed in [Section D.2](https://arxiv.org/html/2606.24020#A4.SS2).

Finding 5: For same-benchmark model pairs separated by at least five score points, BenchPress's predicted scores yield the correct ordering 92.1% of the time.

### 5.3 Predicting Newly Released Models

The evaluations so far hide cells from the same matrix used to fit BenchPress, so every model in the test set has already contributed training signal elsewhere in the matrix. Real deployment is stricter: when a new model is released, the matrix was assembled before that release and contains no information about the new model. We therefore ask whether BenchPress can still produce useful scores for a brand-new model from the historical matrix plus a small seed evaluation on that model.

##### Experiment setting\.

We evaluate an intermediate segment of the release timeline, chosen using only release metadata and matrix coverage before inspecting prediction errors. This choice avoids two uninformative extremes: very early targets leave too few older models in the training matrix, while the latest releases would be predicted from almost the full snapshot rather than a meaningfully historical matrix. Concretely, we use models from the post-DeepSeek-R1 reasoning era through GPT-5.1, and keep only models with more than 20 observed benchmark scores. The coverage threshold ensures that, after revealing up to ten seed scores, each target still has enough hidden cells for a meaningful per-model error estimate. This yields 27 target models across the recent reasoning-era release window. For each target model, we train BenchPress on only the models released before the target's release date, so the predictor sees no information about the new release beyond what we explicitly reveal. We then reveal $k\in\{1,5,10\}$ of that target model's observed benchmark scores and predict the rest, repeating each setting over 10 random seeds and reporting the median. Revealed cells contribute zero error to the pooled metric, hidden cells with finite predictions enter MedAPE and MedAE, and hidden cells without a finite deployment prediction are dropped.
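The temporal protocol can be sketched with a few helper functions. All model names, dates, and counts below are placeholders invented for illustration, and the helper names are hypothetical; only the rules (window plus coverage threshold, training strictly on earlier releases, random reveal of $k$ seed scores) come from the description above.

```python
import random
from datetime import date

def eligible_targets(release_dates, n_observed, window_start, window_end, min_observed=20):
    """Targets inside the release window with more than `min_observed` observed scores."""
    return sorted(m for m, d in release_dates.items()
                  if window_start <= d <= window_end and n_observed[m] > min_observed)

def training_models(target, release_dates):
    """Only models released strictly before the target's release date."""
    cutoff = release_dates[target]
    return sorted(m for m, d in release_dates.items() if d < cutoff)

def reveal_seed_scores(observed_benchmarks, k, seed):
    """Randomly pick k of the target's observed benchmarks to reveal."""
    rng = random.Random(seed)
    return sorted(rng.sample(sorted(observed_benchmarks), k))

# Placeholder metadata (not real release dates or coverage counts).
release_dates = {
    "model_a": date(2025, 1, 1),
    "model_b": date(2025, 3, 1),
    "model_c": date(2025, 6, 1),
}
n_observed = {"model_a": 30, "model_b": 15, "model_c": 25}

targets = eligible_targets(release_dates, n_observed, date(2025, 2, 1), date(2025, 12, 31))
history = training_models("model_c", release_dates)
revealed = reveal_seed_scores({"bench_1", "bench_2", "bench_3", "bench_4"}, k=2, seed=0)
print(targets, history, revealed)
```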

##### Results\.

Two patterns stand out across the 27 targets in [Figure 7](https://arxiv.org/html/2606.24020#S5.F7). First, even with strict time cutoffs, revealing a small seed set sharply reduces prediction error: the median target drops from 9.20 MedAE at $k=1$ to 4.83 at $k=5$ and 2.57 at $k=10$. The corresponding MedAPE drops from 15.57% to 8.40% and then 5.02%. Second, the distribution narrows as more seed scores are revealed, showing that the gain is not driven by only a few easy releases. A small seed evaluation on the new release contributes more than additional historical models.

Finding 6: Across 27 target models in a pre-specified temporal window, five seed scores bring BenchPress's predictions within 4.83 points of the true value, and ten seeds tighten this to 2.57 points.

## 6 When to Trust BenchPress's Predictions

The practical question is not only whether BenchPress can fill in missing scores, but when those filled-in scores are safe to use. This section explains when to trust the default BenchPress score predictor selected in [Section 4.2](https://arxiv.org/html/2606.24020#S4.SS2). We first identify benchmark- and model-side factors associated with prediction quality, then use those signals together with predictor disagreement to estimate prediction reliability.

### 6.1 What Affects Prediction Reliability

The natural next question is where prediction error comes from\. Some sources of error may be benchmark\-side: sparse observations, weak benchmark neighbors, or score distributions that are intrinsically hard to interpolate\. Others may be model\-side: sparse model rows, weak peers, provider effects, scale, or recency\.

##### Methodology\.

In what follows, we propose a set of hypotheses about why BenchPress mispredicts certain cells: seven targeting benchmark-side factors and nine targeting model-side factors. Each hypothesis is phrased as a no-effect claim: a candidate factor is not associated with BenchPress's prediction quality, measured by MedAPE and MedAE.

We assess each one with a standard statistical hypothesis test that returns a $p$-value: assuming the no-effect hypothesis were true, this is the probability that random chance alone would produce data deviating from "no effect" by at least as much as ours. A small $p$ means such data would be extremely unlikely under the hypothesis, so the observation is inconsistent with the hypothesis and we reject it; a large $p$ means our data is well within what random chance could produce, so we have no grounds to reject.

Throughout this section we treat $p<0.01$ as our rejection threshold: when $p<0.01$ we reject the hypothesis, i.e., the data provide strong evidence that the factor *does* matter; otherwise we fail to reject (which may either mean the factor truly has no effect, or that we lack the sample size to detect one). Depending on the shape of the hypothesis, we use one of two tests:

- •*Spearman rank correlation test\.*We use this for observational hypotheses, where each benchmark or model contributes a measured feature and itsBenchPressprediction error\. Spearman tests whether higher feature values are monotonically associated with higher or lower error, while being less sensitive to outliers than raw\-value correlation\.
- •*Paired Wilcoxon signed\-rank test\.*We use this for intervention\-style hypotheses, where the same benchmark or model is evaluated under a baseline setting and an ablated setting\. The paired design controls for inherent target difficulty, and the rank\-based test is more reliable than a pairedtt\-test for our heavy\-tailed error shifts\.

[Section E\.1](https://arxiv.org/html/2606.24020#A5.SS1) gives the full test definitions, approximations, and $p$\-value calculations\.
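The two tests can be sketched with the Python standard library alone. This is an illustrative approximation, not the paper's implementation: the Spearman $p$-value here comes from a Monte Carlo permutation test, and the Wilcoxon $p$-value uses the plain normal approximation without tie or continuity corrections. Both choices are assumptions of this sketch, and the toy data are made up.

```python
import math
import random


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_test(feature, error, n_perm=2000, seed=0):
    """Spearman rho with a two-sided Monte Carlo permutation p-value."""
    rx, ry = average_ranks(feature), average_ranks(error)
    rho = pearson(rx, ry)
    rng = random.Random(seed)
    shuffled = ry[:]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(rx, shuffled)) >= abs(rho) - 1e-12:
            hits += 1
    return rho, (hits + 1) / (n_perm + 1)


def wilcoxon_signed_rank(baseline, ablated):
    """Two-sided paired Wilcoxon test via the normal approximation.

    Zero differences are dropped; at least one nonzero difference is assumed.
    """
    diffs = [a - b for b, a in zip(baseline, ablated) if a != b]
    n = len(diffs)
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return w_plus, math.erfc(abs(z) / math.sqrt(2))


# Toy observational data: error grows monotonically with the feature.
feature = [1, 2, 3, 4, 5, 6, 7, 8]
error = [f ** 2 for f in feature]
rho, p_spearman = spearman_test(feature, error)

# Toy intervention data: the ablation raises error for every target.
baseline = [10, 12, 9, 14, 11, 13, 8, 15, 10.5, 12.5]
ablated = [b + shift for b, shift in zip(baseline, range(1, 11))]
w_plus, p_wilcoxon = wilcoxon_signed_rank(baseline, ablated)
```

With only ten pairs, the normal approximation is coarse; an exact or library implementation would be preferable for small samples.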

##### Benchmark analysis\.

What determines whether a benchmark is easy or hard for BenchPress to predict? We test seven hypotheses split into two families\. H1–H3 probe benchmark\-intrinsic features \(low\-rank fit, score level, score spread\), each evaluated by univariate Spearman correlation between the feature and BenchPress’s per\-benchmark error across all 133 targets\. H4–H7 probe data availability and structural overlap with other benchmarks, each evaluated by a paired hide\-half ablation: we intervene on the training matrix and compare prediction quality against an unintervened baseline \(paired Wilcoxon over benchmarks\)\.

- *H1 Low\-rank fit\.* *Hypothesis:* a benchmark’s column $R^2$ under the rank\-2 SVD reconstruction is not associated with how well it can be predicted\. *Feature:* column $R^2$ under the rank\-2 reconstruction of the standardized, zero\-imputed score matrix\. *Test:* Spearman\.
- *H2 Score level\.* *Hypothesis:* the overall score level \(i\.e\., difficulty\) of a benchmark is not associated with how well it can be predicted\. *Feature:* median observed score per benchmark\. *Test:* Spearman\.
- *H3 Score spread\.* *Hypothesis:* the spread of scores across models on a benchmark is not associated with how well it can be predicted\. *Feature:* standard deviation of observed scores per benchmark\. *Test:* Spearman\.
- *H4 Target coverage\.* *Hypothesis:* reducing the amount of training evidence for a target benchmark does not change its prediction error\. *Intervention:* for each target benchmark we first split its observed cells in half: one half is held out for evaluation and the other half remains available for training; we then compare the full training half against a version where three quarters of those training cells are removed\. *Test:* paired Wilcoxon\.
- *H5 Strong\-neighbor presence\.* *Hypothesis:* masking strongly correlated neighbor benchmarks does not change a target benchmark’s prediction error\. *Intervention:* for each target benchmark, mask every neighbor benchmark whose Pearson correlation with the target is at least 0\.85 on shared models, then rerun BenchPress on the target’s held\-out cells\. *Test:* paired Wilcoxon\.
- *H6 Strong\-neighbor support\.* *Hypothesis:* reducing overlapping evidence from the strongest neighbor does not change a target benchmark’s prediction error\. *Intervention:* for each target benchmark, identify its strongest neighbor, keep only the models scored by both benchmarks, and compare the full shared\-evidence condition against a version where three quarters of those overlapping neighbor cells are removed\. *Test:* paired Wilcoxon\.
- *H7 Same\-category evidence\.* *Hypothesis:* masking same\-category benchmarks does not change a target benchmark’s prediction error\. *Intervention:* mask all same\-category benchmarks during training \(43 benchmarks $\times$ 10 seeds\)\. *Test:* paired Wilcoxon\.
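The H5 intervention (mask every neighbor whose Pearson correlation with the target is at least 0.85 on shared models) can be sketched as follows. The score-matrix layout (a dict from benchmark name to a dict of model scores), the `min_shared` guard, and all benchmark and model names are assumptions made for illustration; only the 0.85 threshold comes from the text.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def strong_neighbors(scores, target, threshold=0.85, min_shared=3):
    """Benchmarks whose correlation with `target` on shared models >= threshold."""
    target_col = scores[target]
    neighbors = []
    for bench, col in scores.items():
        if bench == target:
            continue
        shared = sorted(set(target_col) & set(col))
        if len(shared) < min_shared:  # too little overlap to estimate r
            continue
        r = pearson([target_col[m] for m in shared], [col[m] for m in shared])
        if r >= threshold:
            neighbors.append(bench)
    return neighbors


def mask_benchmarks(scores, to_mask):
    """Return a copy of the matrix without the masked benchmark columns."""
    return {b: dict(col) for b, col in scores.items() if b not in to_mask}


# Toy matrix with hypothetical benchmarks and models.
scores = {
    "target": {"m1": 10, "m2": 20, "m3": 30, "m4": 40},
    "close": {"m1": 12, "m2": 19, "m3": 33, "m4": 41},
    "unrelated": {"m1": 50, "m2": 10, "m3": 45, "m4": 20},
    "sparse": {"m1": 5, "m2": 9},
}
to_mask = strong_neighbors(scores, "target")
ablated_scores = mask_benchmarks(scores, to_mask)
```

The ablated matrix would then be passed to the score predictor, and the per-benchmark errors with and without masking compared with the paired test.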

Table 7: Which features predict per\-benchmark prediction quality? H1–H3 use the Spearman rank correlation test across target benchmarks; H4–H7 use paired Wilcoxon signed\-rank tests on hide\-half ablations\. The $p$\-value is the probability of seeing an effect at least this large by chance if the listed \(no\-effect\) hypothesis were true; smaller $p$ means stronger evidence against it\. We reject a hypothesis when $p<0.01$ \(bold\), i\.e\., the data provide strong evidence that the factor does matter\. Rows rejected under both MedAPE and MedAE \(pink in the original\) are visualized in [Figure 8](https://arxiv.org/html/2606.24020#S6.F8)\.

| Hypothesis | MedAPE ↓ \($p$\-value\) | MedAE ↓ \($p$\-value\) |
| --- | --- | --- |
| H1 Low\-rank fit | $p=0.150$ | $p=0.036$ |
| H2 Score level | **$p<0.001$** | $p=0.054$ |
| H3 Score spread | **$p<0.001$** | **$p<0.001$** |
| H4 Target coverage | **$p<0.001$** | **$p<0.001$** |
| H5 Strong\-neighbor presence | **$p<0.001$** | **$p<0.001$** |
| H6 Strong\-neighbor support | $p=0.209$ | $p=0.035$ |
| H7 Same\-category evidence | $p=0.832$ | $p=0.725$ |

![Refer to caption](https://arxiv.org/html/2606.24020v1/bp_predictability_factors_51.pdf)

Figure 8: Benchmark\-level prediction\-error patterns\. Three benchmark\-side factors jointly affect how hard a benchmark is to predict: a wider score spread across models makes prediction harder \(H3\), more observed model scores on the target benchmark makes it easier \(H4\), and having at least one strongly correlated neighbor benchmark in the training matrix makes it easier \(H5\)\.

[Table 7](https://arxiv.org/html/2606.24020#S6.T7) reports the test results: three benchmark\-side hypotheses are rejected under both error metrics \(H3 score spread, H4 target coverage, and H5 strong\-neighbor presence\), and [Figure 8](https://arxiv.org/html/2606.24020#S6.F8) visualizes the corresponding effects\.

Three hypotheses are rejected jointly, i\.e\., these factors do affect prediction quality\. Rejecting H3 \(score spread\) means benchmarks with wider score ranges across models are harder to predict\. Rejecting H4 \(target coverage\) and H5 \(strong\-neighbor presence\) means a benchmark is easier to predict when it has many observed model scores, and when at least one strongly correlated neighbor remains in the training matrix\. The remaining hypotheses \(H1 low\-rank fit, H2 score level, H6 strong\-neighbor support, H7 same\-category evidence\) are rejected under at most one metric; in particular, failing to reject H7 indicates that BenchPress benefits from observed correlations among benchmarks, not from category metadata\. The full $7\times 2$ hypothesis $\times$ metric grid is reported in [Section E\.1\.1](https://arxiv.org/html/2606.24020#A5.SS1.SSS1.Px1)\.

##### Model analysis\.

Symmetrically, what determines whether a model is easy or hard for BenchPress to predict? Per\-model prediction error varies by $\sim 40\times$ across the 84 models in our matrix, so this matters in practice: harder\-to\-predict models warrant less trust in their point estimates\. For each model we run the same hide\-half evaluation \(10 random splits with the full Logit Bias ALS pipeline\) and aggregate the held\-out predictions into per\-model MedAPE and MedAE\.

We test nine hypotheses split into three families\. H1–H4 probe model\-intrinsic features \(size, type, score level, low\-rank fit\), each evaluated by univariate Spearman correlation against per\-model error \(H1 uses the $n=25$ models with disclosed parameter counts; H2–H4 use all 84\)\. H5–H8 probe data availability and overlap with other models, each evaluated by a paired hide\-half ablation that modifies the training matrix and measures the change in error\. H9 tests temporal generalization via a rolling simulation: train only on older models and predict newer ones\.

- *H1 Model size\.* *Hypothesis:* model size is not associated with how well a model can be predicted\. *Feature:* parameter count for the 25 models with disclosed sizes\. *Test:* Spearman\.
- *H2 Model type\.* *Hypothesis:* whether a model is a reasoning model is not associated with how well it can be predicted\. *Feature:* binary reasoning vs\. non\-reasoning indicator among models with annotated type\. *Test:* Spearman\.
- *H3 Score level\.* *Hypothesis:* the overall score level \(i\.e\., capability\) of a model is not associated with how well it can be predicted\. *Feature:* per\-model median observed score\. *Test:* Spearman\.
- *H4 Low\-rank fit\.* *Hypothesis:* a model’s row $R^2$ under the rank\-2 SVD reconstruction is not associated with how well it can be predicted\. *Feature:* row\-level $R^2$ under the rank\-2 reconstruction of the standardized, zero\-imputed score matrix\. *Test:* Spearman\.
- *H5 Strong\-peer presence\.* *Hypothesis:* masking strongly correlated peer models does not change a target model’s prediction error\. *Intervention:* for each target model, mask all peer models whose Pearson correlation with the target is at least 0\.95 on shared benchmarks, then rerun BenchPress on the target’s hide\-half cells\. *Test:* paired Wilcoxon\.
- *H6 Strong\-peer support\.* *Hypothesis:* reducing overlapping evidence from the strongest peer does not change a target model’s prediction error\. *Intervention:* for each target model and hide\-half split, identify the strongest peer model \(highest $|r|$, requiring $|r| \geq 0.95$\), restrict to benchmarks observed by both the target and that peer, and drop nested prefixes $f \in \{0, 0.25, 0.5, 0.75\}$ of those overlapping peer cells before rerunning BenchPress on the target’s held\-out cells; we compare $f=0$ against $f=0.75$\. *Test:* paired Wilcoxon\.
- *H7 Same\-provider evidence\.* *Hypothesis:* masking same\-provider variants does not change a target model’s prediction error\. *Intervention:* mask all same\-provider rows \(e\.g\. all GPT variants when predicting a GPT model\) and rerun BenchPress on the target’s hide\-half cells\. *Test:* paired Wilcoxon\.
- *H8 Observation count\.* *Hypothesis:* reducing the amount of training evidence for a target model does not change its prediction error\. *Intervention:* compare the standard hide\-half split against a more severe split that hides three quarters of each model’s observed scores \([Figure 9](https://arxiv.org/html/2606.24020#S6.F9) shows the full trajectory across the evaluated hide fractions\)\. *Test:* paired Wilcoxon\.
- *H9 Training\-anchor recency\.* *Hypothesis:* the recency of the training matrix is not associated with how well newly released models can be predicted\. *Intervention:* sort all 84 models by release date and split into oldest, middle, and newest thirds; train the BenchPress score predictor using only the oldest third or only the middle third, reveal a small number of benchmark scores for each newest\-third target model, and predict the rest\. The table reports the condition with three revealed benchmarks\. *Test:* paired Wilcoxon\.
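Two of the interventions above are mostly bookkeeping, and a short sketch may make them concrete: the H9 split of models into oldest, middle, and newest thirds by release date, and the H6 nested drop sets, where each larger fraction $f$ removes a superset of the cells removed at the smaller fraction. The function names, the shuffling seed, the rounding rule, and the toy model and benchmark names are assumptions of this sketch.

```python
import random
from datetime import date


def temporal_thirds(release_dates):
    """Sort model ids by release date and split into three consecutive groups."""
    ordered = sorted(release_dates, key=lambda m: release_dates[m])
    n = len(ordered)
    a, b = n // 3, 2 * n // 3
    return ordered[:a], ordered[a:b], ordered[b:]


def nested_drop_sets(cells, fractions=(0.0, 0.25, 0.5, 0.75), seed=0):
    """Map each fraction f to the cells to drop.

    All drop sets are prefixes of one shuffled order, so the set for a
    larger f always contains the set for a smaller f.
    """
    order = list(cells)
    random.Random(seed).shuffle(order)
    return {f: order[: int(f * len(order))] for f in fractions}


# Hypothetical models with made-up release dates.
release_dates = {
    "model_a": date(2024, 3, 1),
    "model_b": date(2025, 6, 15),
    "model_c": date(2024, 9, 10),
    "model_d": date(2025, 1, 20),
    "model_e": date(2025, 11, 2),
    "model_f": date(2024, 5, 5),
}
oldest, middle, newest = temporal_thirds(release_dates)

# Hypothetical benchmarks scored by both the target and its strongest peer.
overlap_cells = [f"bench_{i}" for i in range(8)]
drops = nested_drop_sets(overlap_cells)
```

For the 84 models in the matrix, the same split yields three groups of 28. Nesting the drop sets means that differences between $f=0$ and $f=0.75$ reflect the amount of removed evidence rather than which cells happened to be sampled.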

Table 8: What makes a model easy or hard to predict? H1–H4 use the Spearman rank correlation test; H5–H8 use paired Wilcoxon signed\-rank tests on hide\-half ablations; H9 compares older vs\. more recent training data\. The $p$\-value is the probability of seeing an effect at least this large by chance if the listed \(no\-effect\) hypothesis were true; smaller $p$ means stronger evidence against it\. We reject a hypothesis when $p<0.01$ \(bold\), i\.e\., the data provide strong evidence that the factor does matter\. Rows rejected under both MedAPE and MedAE \(pink in the original\) are visualized in [Figure 9](https://arxiv.org/html/2606.24020#S6.F9)\.

| Hypothesis | MedAPE ↓ \($p$\-value\) | MedAE ↓ \($p$\-value\) |
| --- | --- | --- |
| H1 Model size \($\log_{10}$ params\) | $p=0.101$ | $p=0.263$ |
| H2 Model type | **$p<0.001$** | **$p=0.003$** |
| H3 Score level | **$p<0.001$** | **$p<0.001$** |
| H4 Low\-rank fit | $p=0.389$ | $p=0.042$ |
| H5 Strong\-peer presence | **$p<0.001$** | **$p=0.004$** |
| H6 Strong\-peer support | $p=0.033$ | $p=0.309$ |
| H7 Same\-provider evidence | $p=0.011$ | $p=0.094$ |
| H8 Observation count | **$p<0.001$** | **$p<0.001$** |
| H9 Training\-anchor recency | **$p=0.002$** | **$p=0.002$** |

![Refer to caption](https://arxiv.org/html/2606.24020v1/bp_error_hypotheses_52.pdf)

Figure 9: Representative model\-level prediction\-error patterns\. Five model\-side factors jointly affect how hard a model is to predict: reasoning models are easier than non\-reasoning ones \(H2\), higher\-scoring models are easier than lower\-scoring ones \(H3\), having at least one strongly correlated peer model in the training matrix makes prediction easier \(H5\), more observed benchmark scores on the target model makes prediction easier \(H8\), and a training matrix containing models recently released relative to the target makes prediction easier \(H9\)\.

[Table 8](https://arxiv.org/html/2606.24020#S6.T8) reports the full results, and [Figure 9](https://arxiv.org/html/2606.24020#S6.F9) visualizes the model\-side hypotheses that are rejected under both error metrics: H2, H3, H5, H8, and H9\.

Five hypotheses are rejected jointly, i\.e\., these factors do affect prediction quality\. Reasoning models are easier to predict than non\-reasoning ones \(H2\), and higher\-scoring models are easier than lower\-scoring ones \(H3\)\. Among the ablations, models with at least one strongly correlated peer in the matrix are easier to predict \(H5\), models with more observed benchmark scores are easier \(H8\), and the temporal experiment shows that prediction quality on newer models improves when the training matrix contains more recent anchors rather than only the oldest third \(H9\)\. The remaining hypotheses \(H1 model size, H4 low\-rank fit, H6 strong\-peer support, H7 same\-provider evidence\) are rejected under neither metric; in particular, failing to reject H7 indicates that BenchPress uses capability\-profile similarity rather than provider identity\. The full $9\times 2$ hypothesis $\times$ metric grid is reported in [Section E\.1\.2](https://arxiv.org/html/2606.24020#A5.SS1.SSS2.Px1), and a per\-model predictability ranking is given in [Section E\.1\.2](https://arxiv.org/html/2606.24020#A5.SS1.SSS2.Px2)\.
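The joint-rejection rule used in Tables 7 and 8 (reject only when $p<0.01$ under both MedAPE and MedAE) can be checked mechanically. The sketch below copies the reported $p$-values; entries reported as $p<0.001$ are stored as the upper bound 0.001, which is an encoding choice of this sketch and does not change any decision at the 0.01 threshold.

```python
ALPHA = 0.01  # rejection threshold used throughout Section 6.1

# (MedAPE p-value, MedAE p-value); "<0.001" entries stored as 0.001.
TABLE_7 = {
    "H1 Low-rank fit": (0.150, 0.036),
    "H2 Score level": (0.001, 0.054),
    "H3 Score spread": (0.001, 0.001),
    "H4 Target coverage": (0.001, 0.001),
    "H5 Strong-neighbor presence": (0.001, 0.001),
    "H6 Strong-neighbor support": (0.209, 0.035),
    "H7 Same-category evidence": (0.832, 0.725),
}

TABLE_8 = {
    "H1 Model size": (0.101, 0.263),
    "H2 Model type": (0.001, 0.003),
    "H3 Score level": (0.001, 0.001),
    "H4 Low-rank fit": (0.389, 0.042),
    "H5 Strong-peer presence": (0.001, 0.004),
    "H6 Strong-peer support": (0.033, 0.309),
    "H7 Same-provider evidence": (0.011, 0.094),
    "H8 Observation count": (0.001, 0.001),
    "H9 Training-anchor recency": (0.002, 0.002),
}


def jointly_rejected(table, alpha=ALPHA):
    """Hypotheses rejected under both error metrics."""
    return [h for h, (p_mape, p_mae) in table.items()
            if p_mape < alpha and p_mae < alpha]


def rejected_under_one(table, alpha=ALPHA):
    """Hypotheses rejected under exactly one of the two metrics."""
    return [h for h, (p_mape, p_mae) in table.items()
            if (p_mape < alpha) != (p_mae < alpha)]


benchmark_side = jointly_rejected(TABLE_7)
model_side = jointly_rejected(TABLE_8)
split_decisions = rejected_under_one(TABLE_7)
```

Requiring agreement across both metrics is the stricter reading: benchmark-side H2 (score level) clears the threshold under MedAPE only and is therefore not counted among the joint rejections.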

Finding 7: A predicted score is most trustworthy when both sides of the matrix are well supported: the target benchmark has many observations and correlated neighbors, the target model has many observed scores and correlated peers, and the training matrix contains recent models near the target\.

### 6\.2 Estimating Prediction Reliability

![Refer to caption](https://arxiv.org/html/2606.24020v1/bp_confidence_calibration.pdf)

Figure 10: Reliability estimates identify safer predictions\. Lower curves mean safer subsets are identified more reliably; the hybrid estimator gives the cleanest ordering\.

Point predictions alone are not enough for deciding when to skip benchmark runs\. If a predicted score will be used to decide whether to skip an expensive evaluation, the useful question is not only “what score would this model get?”, but also “how much should we trust this prediction?” We therefore train reliability estimators for predictions from the default BenchPress score predictor in [Section 4\.2](https://arxiv.org/html/2606.24020#S4.SS2)\. Each estimator assigns a predicted cell a risk score, where larger risk means the score prediction is less reliable\. We compare the reliability estimators by asking which one best identifies safe\-to\-use predictions before the benchmark is run\. The same risk score can also be calibrated into a trust probability and a conformal prediction interval; those details and interval\-width results are reported in [Section E\.2](https://arxiv.org/html/2606.24020#A5.SS2)\. Throughout this subsection we report MedAE only, since the risk\-coverage curve is measured on the absolute score\-point scale\.

##### Reliability estimators\.

We compare three lightweight ways to compute this risk score, all sharing the same setup\. Each one is a small model trained on held\-out cells from the training folds: given features about a cell, predict how far off the Logit Bias ALS point estimate will be on that cell\. At test time it sees only what is available *before* running the benchmark, and outputs a larger risk for cells where the point prediction is likely to be less accurate\. The three methods differ only in which features they use\.

- *Ensemble\-spread reliability estimator* uses only disagreement among score predictors\. For the same hidden cell, we collect the point predictions made by the Logit Bias ALS regularization settings around the selected one and by the strongest full\-coverage methods from [Section 4\.2](https://arxiv.org/html/2606.24020#S4.SS2)\. The features are simple summaries of how much those predictions spread out: their standard deviation, median absolute deviation, central 80% span, and distance between the selected Logit Bias ALS prediction and the median prediction\. If many plausible predictors disagree, this estimator should assign higher risk\.
- *Matrix\-support reliability estimator* ignores predictor disagreement and instead reuses the model\- and benchmark\-side signals from [Section 6\.1](https://arxiv.org/html/2606.24020#S6.SS1), in the same order as the hypothesis tables\. From the benchmark side: median score \(H2\), score spread \(H3\), observation count \(H4\), and strongest\-neighbor correlation \(H5\)\. From the model side: median score \(H3\), strongest\-peer correlation \(H5\), strongest\-peer overlap \(H6\), and observation count \(H8\)\. Median score on either axis did not reach joint significance, but we keep it as a low\-cost control\. This estimator asks whether the cell is easy to infer from observed scores on correlated peer models and benchmarks\.
- *Hybrid reliability estimator* uses both feature groups in one model\. It can learn cases where predictor disagreement is enough to flag risk, cases where sparse structural support is enough to flag risk, and cases where both signals reinforce each other\.
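The four ensemble-spread features named above can be sketched as follows. The exact conventions are assumptions here: population standard deviation, a linearly interpolated 10th-to-90th percentile span for the "central 80% span", and an absolute distance to the median. The paper's feature definitions are in its appendix and may differ in these details; the predictions are made up.

```python
from statistics import median, pstdev


def quantile(sorted_vals, q):
    """Linearly interpolated quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def ensemble_spread_features(predictions, selected):
    """Summaries of disagreement among point predictions for one hidden cell.

    predictions: scores from all ensemble members (including the selected one)
    selected:    the prediction of the selected predictor
    """
    vals = sorted(predictions)
    med = median(vals)
    return {
        "std": pstdev(vals),
        "mad": median(abs(v - med) for v in vals),
        "span_80": quantile(vals, 0.9) - quantile(vals, 0.1),
        "selected_to_median": abs(selected - med),
    }


# Five hypothetical predictors scoring the same hidden cell.
features = ensemble_spread_features([60.0, 62.0, 64.0, 66.0, 68.0], selected=62.0)
```

A risk model would take a feature vector like this one per cell and regress the realized absolute error on it, so that wide disagreement translates into higher predicted risk.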

All three reliability estimators use the same fold\-internal model selection over a linear ridge baseline and small ReLU MLPs\. For each evaluated fold, we train candidate risk models on the other folds only, standardize features using that training split, select the risk\-model architecture inside the training folds, and then predict risks for the held\-out fold\. This keeps the reliability experiment honest: the risk score for a hidden cell is learned from other cells, not from its own error\. Full feature lists and training details for all three estimators are in [Section E\.2](https://arxiv.org/html/2606.24020#A5.SS2)\.

Once the reliability estimator outputs a risk score, the main\-text evaluation treats it as a ranking signal: predictions with lower risk should have lower realized error\.

##### Evaluation setting\.

All reliability estimators are evaluated on the same held\-out folds as the point\-prediction comparison\. For every hidden score, the reliability estimator may use the training matrix, the fixed Logit Bias ALS point prediction, and auxiliary quantities derived from the training fold, but not the held\-out value\. We evaluate reliability ranking with a risk\-coverage curve: sort cells from lowest to highest risk and plot MedAE after keeping the most trusted 100%, 80%, 60%, 40%, or 20% of cells\. This asks whether the risk score can identify predictions that are safe enough to use for triage, while flagging predictions that should still trigger benchmark runs\.
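The risk-coverage evaluation described above amounts to a sort followed by a few medians. A minimal sketch, assuming per-cell risk scores and realized absolute errors are already available as parallel lists; the function name, the rounding of the kept count, and the toy numbers are illustrative.

```python
from statistics import median


def risk_coverage_curve(risks, abs_errors,
                        keep_fractions=(1.0, 0.8, 0.6, 0.4, 0.2)):
    """MedAE over the most trusted fraction of cells, for each fraction.

    Cells are sorted from lowest to highest risk, and the first
    round(fraction * n) cells (at least one) are kept.
    """
    order = sorted(range(len(risks)), key=lambda i: risks[i])
    curve = {}
    for frac in keep_fractions:
        n_keep = max(1, round(frac * len(order)))
        kept = [abs_errors[i] for i in order[:n_keep]]
        curve[frac] = median(kept)
    return curve


# Toy example in which risk is perfectly informative about the error.
risks = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 1.0, 0.4, 0.8, 0.6]
abs_errors = [9, 1, 5, 3, 7, 2, 10, 4, 8, 6]
curve = risk_coverage_curve(risks, abs_errors)
```

A useful risk score produces a curve that falls as coverage shrinks; an uninformative one stays roughly flat at the 100% value.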

##### Results\.

When keeping only the most\-trusted 20% of cells, the hybrid estimator lowers selective MedAE to 1\.83 score points, beating both single\-feature variants \([Figure 10](https://arxiv.org/html/2606.24020#S6.F10)\); at 40% and 60% kept, its MedAE is 2\.51 and 3\.10\. We therefore use the hybrid reliability estimator as the reliability layer: it takes both ensemble spread and matrix support into account, and assigns each prediction a risk score that identifies whether the prediction is safe to use\.

Finding 8: A hybrid reliability estimator uses ensemble spread and matrix support to identify low\-risk score predictions before running the benchmark\.

## 7 Discussion

This paper shows that the public benchmark landscape has enough shared structure to support score prediction at scale\. Starting from an $84\times 133$ public score matrix, we find that its dominant variation is effectively rank\-2 and build BenchPress, a matrix\-completion predictor for missing model–benchmark scores\. With this predictor fixed, we show that a small probe set can recover much of a model’s scorecard, that predicted scores preserve most meaningful same\-benchmark rankings, and that a few seed evaluations can anchor predictions for models in a pre\-specified temporal window\. Finally, the reliability analysis identifies when those point predictions are well supported by the observed matrix and when the benchmark should still be run\.

##### Limitations and future work\.

We close with four pairings of a current limitation and the most natural extension it suggests\. *First*, BenchPress cannot reliably predict a candidate model whose capability profile lacks a close neighbor in the matrix; incorporating external signals about the model \(training\-data composition, architecture, model size, and other published metadata\) and computing model\-to\-model similarity from these features could anchor outliers even before any benchmark scores are observed\. *Second*, benchmark\-level predictions are only as good as the benchmarks themselves: a noisy or poorly constructed benchmark is faithfully predicted as such; pushing score prediction beyond aggregated benchmark scores to instance\-level outcomes would let BenchPress capture within\-benchmark structure and improve predictions on the hardest tails\. *Third*, our matrix already covers mainstream text and vision\-language benchmarks, but more specialized ecosystems \(audio and speech, robotics and embodied agents, scientific simulators\) remain untested; whether the same low\-rank treatment carries over to these settings is an open question\. *Fourth*, the rank\-2 geometry is a property of the current snapshot rather than a guarantee for future releases; as the matrix grows, tracking whether the rank stays at two, or whether a third latent factor emerges, will determine the long\-term viability of this approach and signal when a refresh of the score\-prediction recipe is warranted\.

## References

Appendix

The appendix is organized as supplements to the main text\. [Appendix A](https://arxiv.org/html/2606.24020#A1) supplements [Section 1](https://arxiv.org/html/2606.24020#S1) with the experiment setting for [Figure 1](https://arxiv.org/html/2606.24020#S0.F1)\. [Appendix B](https://arxiv.org/html/2606.24020#A2) supplements [Section 3](https://arxiv.org/html/2606.24020#S3) with data\-collection provenance, full benchmark and model catalogs, and additional evidence for the low\-rank structure\. [Appendix C](https://arxiv.org/html/2606.24020#A3) supplements [Section 4](https://arxiv.org/html/2606.24020#S4) with a comprehensive method comparison and the additional LLM baseline prompt template\. [Appendix D](https://arxiv.org/html/2606.24020#A4) supplements [Section 5](https://arxiv.org/html/2606.24020#S5) with budgeted scorecard recovery details and ranking preservation\. [Appendix E](https://arxiv.org/html/2606.24020#A5) supplements [Section 6](https://arxiv.org/html/2606.24020#S6) with prediction\-error and prediction\-reliability details\.

Contents

- [A Supplemental to Section 1: Introduction](https://arxiv.org/html/2606.24020#A1)
  - [A\.1 Experiment Setting for Figure 1](https://arxiv.org/html/2606.24020#A1.SS1)
- [B Supplemental to Section 3: The Score Matrix and Its Geometry](https://arxiv.org/html/2606.24020#A2)
  - [B\.1 Data Collection](https://arxiv.org/html/2606.24020#A2.SS1)
  - [B\.2 The Final Score Matrix](https://arxiv.org/html/2606.24020#A2.SS2)
- [C Supplemental to Section 4: BenchPress: A Low\-rank Benchmark Score Predictor](https://arxiv.org/html/2606.24020#A3)
  - [C\.1 Candidate Methods](https://arxiv.org/html/2606.24020#A3.SS1)
  - [C\.2 From Candidate Methods to BenchPress](https://arxiv.org/html/2606.24020#A3.SS2)
  - [C\.3 BenchPress vs\. LLMs as Benchmark Score Predictors](https://arxiv.org/html/2606.24020#A3.SS3)
- [D Supplemental to Section 5: What BenchPress Enables for Model Evaluation](https://arxiv.org/html/2606.24020#A4)
  - [D\.1 Budgeted Scorecard Recovery](https://arxiv.org/html/2606.24020#A4.SS1)
  - [D\.2 Preserving Model Rankings](https://arxiv.org/html/2606.24020#A4.SS2)
  - [D\.3 Predicting Newly Released Models](https://arxiv.org/html/2606.24020#A4.SS3)
- [E Supplemental to Section 6: When to Trust BenchPress’s Predictions](https://arxiv.org/html/2606.24020#A5)
  - [E\.1 What Affects Prediction Reliability](https://arxiv.org/html/2606.24020#A5.SS1)
  - [E\.2 Estimating Prediction Reliability](https://arxiv.org/html/2606.24020#A5.SS2)

## Appendix A Supplemental to [Section 1](https://arxiv.org/html/2606.24020#S1): Introduction

### A\.1 Experiment Setting for [Figure 1](https://arxiv.org/html/2606.24020#S0.F1)

##### Left panel \(per\-cell error\)\.

We pick four highlighted cells: \(Claude Opus 4\.7, SWE\-bench Verified\), \(GPT\-5\.5, Terminal\-Bench\), \(Gemini 3\.1 Pro, LiveCodeBench\), and \(DeepSeek\-V4\-Pro, HLE Text\)\. For each cell and each $k \in \{1, \dots, 10\}$, we \(i\) hide the target cell, \(ii\) sample $k$ scores uniformly at random from the same model’s other observed cells, \(iii\) feed the resulting masked matrix to BenchPress, and \(iv\) record the absolute error on the held\-out target cell\. We repeat over 10 seeds \(base seed 42\); the line is the per\-cell median and the shaded band is the 25–75 percentile range\. Whenever the target cell itself appears in the revealed prefix the error drops to zero\. The diamond at $k=0$ marks the benchmark\-median baseline \(no BenchPress\)\.
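Steps (i) to (iv) can be sketched as a small loop over $k$ and seeds. The predictor is passed in as a callable, and the stand-in used here is the benchmark-median baseline rather than BenchPress itself, so the resulting curve is flat in $k$. The matrix layout, the way per-run seeds are derived from the base seed, the quartile convention, and all model and benchmark names are assumptions of this sketch.

```python
import random
from statistics import median, quantiles


def benchmark_median_predictor(matrix, model, bench):
    """Stand-in predictor: median of other models' scores on the benchmark."""
    col = [row[bench] for m, row in matrix.items() if m != model and bench in row]
    return median(col)


def per_cell_error_curve(matrix, model, bench, predict,
                         ks=range(1, 11), n_seeds=10, base_seed=42):
    """For each k, return (median, 25th pct, 75th pct) of the absolute error."""
    truth = matrix[model][bench]
    others = sorted(b for b in matrix[model] if b != bench)
    curve = {}
    for k in ks:
        errors = []
        for s in range(n_seeds):
            rng = random.Random(base_seed + s)  # assumed seed derivation
            revealed = rng.sample(others, min(k, len(others)))
            # Copy the matrix; the target row keeps only the k revealed
            # scores, so the target cell itself is hidden.
            masked = {m: dict(row) for m, row in matrix.items()}
            masked[model] = {b: matrix[model][b] for b in revealed}
            errors.append(abs(predict(masked, model, bench) - truth))
        q1, _, q3 = quantiles(errors, n=4, method="inclusive")
        curve[k] = (median(errors), q1, q3)
    return curve


# Toy matrix with hypothetical models and benchmarks.
matrix = {
    "model_x": {"b1": 74.0, "b2": 55.0, "b3": 40.0},
    "model_y": {"b1": 60.0, "b2": 50.0, "b3": 30.0},
    "model_z": {"b1": 80.0, "b2": 65.0, "b3": 50.0},
}
curve = per_cell_error_curve(matrix, "model_x", "b1",
                             benchmark_median_predictor, ks=(1, 2))
```

Swapping in a predictor that actually uses the revealed scores would make the error depend on $k$, which is the effect the left panel visualizes.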

##### Right panel \(overall pooled error\)\.

The right panel reuses the global probe-set setting of [Sections 5.1](https://arxiv.org/html/2606.24020#S5.SS1) and [D.1](https://arxiv.org/html/2606.24020#A4.SS1): a fixed probe set of $k$ benchmarks is chosen, every model is evaluated on whichever probe scores it has observed, and BenchPress predicts the rest of each model’s observed cells. Pooled MedAE is reported across all evaluated cells. The greedy curves use the cost-aware and cost-unaware MedAE orderings from [Table 5](https://arxiv.org/html/2606.24020#S5.T5); the gray random baseline draws one global random benchmark ordering per seed (10 seeds, base seed 42) and uses the corresponding prefix for every target model. The shaded band is the 25–75 percentile range across seeds.

## Appendix B Supplemental to [Section 3](https://arxiv.org/html/2606.24020#S3): The Score Matrix and Its Geometry

### B.1 Data Collection

[Section 3.1](https://arxiv.org/html/2606.24020#S3.SS1) describes how we crawl public model releases, technical reports, model cards, and primary leaderboards, then canonicalize and filter the resulting raw matrix. Here we make explicit what information is preserved in the released data, because later analyses reuse these fields without re-crawling the original sources. The released data keeps three linked record types: one record per model, one record per benchmark, and one record per observed model–benchmark score.

Example model record: `{ "id": "gpt-5.2", "name": "GPT-5.2", "provider": "OpenAI", "release_date": "2025-12-11", "is_reasoning": true, "open_weights": false, "canonical_setting": { "mode": "thinking", "effort": "xhigh", "tools": "none", "sampling": "pass@1", "judge": "rule-based", "harness": "official", "prompt_style": "default" } }`

Example benchmark record: `{ "id": "aime_2025", "name": "AIME 2025", "category": "Math", "metric": "% correct (pass@1)", "num_problems": 30, "source_url": "https://artofproblemsolving.com/wiki/index.php/2025_AIME", "canonical_setting": { "version": "AIME-2025-I+II", "metric_type": "pct", "range": [0, 100], "higher_is_better": true, "multimodal_input": false, "tools": "none" } }`

Example observed cell record: `{ "model_id": "gpt-5.2", "benchmark_id": "aime_2025", "score": 100.0, "reference_url": "https://openai.com/index/introducing-gpt-5-2/", "source_type": "official_blog", "audit_status": "verified", "reported_setting": { "mode": "thinking", "effort": "xhigh", "tools": "none", "sampling": "pass@1", "judge": "rule-based", "harness": "official" }, "matches_canonical": true, "candidates": [{"score": 99.0, "source_type": "model_card"}, ...] }`

This structure separates entity metadata from score provenance\. Benchmark\-level fields, such as item count and modality, support analyses that reason about the columns of the matrix\. Model\-level fields, such as provider, release date, and reasoning capability, support analyses that reason about the rows\. Cell\-level fields keep the audit trail for the actual number used in the matrix: where it came from, how the model was run, whether that setting matches the canonical setting, and which alternative values were seen but not selected as the primary score\.
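To make the link between the three record types concrete, the sketch below assembles a model-by-benchmark matrix from cell records, leaving unobserved cells as NaN. The field names follow the example records; the function name `build_matrix`, the second model, and the second benchmark are hypothetical and only illustrate the join.

```python
import numpy as np

# Trimmed records; "model-b" and "bench-x" are hypothetical placeholders.
models = [{"id": "gpt-5.2"}, {"id": "model-b"}]
benchmarks = [{"id": "aime_2025"}, {"id": "bench-x"}]
cells = [{"model_id": "gpt-5.2", "benchmark_id": "aime_2025",
          "score": 100.0, "matches_canonical": True}]

def build_matrix(models, benchmarks, cells):
    """Join cell records onto model rows and benchmark columns."""
    row = {m["id"]: i for i, m in enumerate(models)}
    col = {b["id"]: j for j, b in enumerate(benchmarks)}
    S = np.full((len(row), len(col)), np.nan)  # NaN marks an unobserved cell
    for c in cells:
        S[row[c["model_id"]], col[c["benchmark_id"]]] = c["score"]
    return S, row, col

S, row, col = build_matrix(models, benchmarks, cells)
fill_rate = float(np.mean(~np.isnan(S)))  # fraction of observed cells
```

Keeping provenance fields such as `matches_canonical` on the cell record, rather than in the matrix, lets a later analysis filter cells without rebuilding the model or benchmark tables.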

### B.2 The Final Score Matrix

[Section 3.2](https://arxiv.org/html/2606.24020#S3.SS2) introduced the benchmark score matrix as an $84\times 133$ table with $s_{mb}$ the score of model $m$ on benchmark $b$, populated from publicly available evaluations and 23.3% filled. This appendix provides the complete benchmark and model inventories underlying that matrix. Table 9 lists all 133 benchmarks with their categories, metrics, item counts, and source links. [Table 10](https://arxiv.org/html/2606.24020#A2.T10) enumerates all 84 retained models with parameter counts, reasoning capability, open-weight status, release dates, and source links. Every score in the matrix is attributed to one of these sources; the full (model, benchmark) $\to$ URL mapping is released with the accompanying repository.

Table 9: Benchmark inventory. All 133 benchmarks in the adopted score matrix. Categories are grouped to match the main-text summary.

| Category | Benchmark | Metric | Items | Link |
| --- | --- | --- | --- | --- |
| Agentic & tool use (26) | BFCL | — | — | <https://cohere.com/research/papers/command-a-technical-report.pdf> |
| | BFCL v3 | — | — | <https://gorilla.cs.berkeley.edu/leaderboard.html> |
| | BrowseComp | % correct | 1,266 | <https://openai.com/index/browsecomp/> |
| | BrowseComp-ZH | — | 1,156 | <https://github.com/PALIN2018/BrowseComp-ZH> |
| | ComplexFuncBench | % | 1,000 | <https://github.com/THUDM/ComplexFuncBench> |
| | CyberGym | % solved | 1,507 | <https://www.cybergym.io/> |
| | DeepSearchQA (Accuracy) | % | 900 | <https://huggingface.co/datasets/google/deepsearchqa> |
| | Finance Agent v1.1 | % solved | 537 | <https://arxiv.org/abs/2508.00828> |
| | FinSearchComp-Global | % | 317 | <https://arxiv.org/abs/2509.13160> |
| | Frames | % | 824 | <https://arxiv.org/abs/2409.12941> |
| | GAIA (text only) | % | 103 | <https://arxiv.org/abs/2509.06501> |
| | MCPAtlas Public | % correct (pass@1) | 500 | <https://huggingface.co/datasets/ScaleAI/MCP-Atlas> |
| | MCPMark | % success (pass@1) | 127 | <https://github.com/eval-sys/mcpmark> |
| | OSWorld | % success | 369 | <https://os-world.github.io/> |
| | tau-bench Airline | % success | 50 | <https://arxiv.org/abs/2406.12045> |
| | Tau-Bench Retail | % success | 115 | <https://arxiv.org/abs/2406.12045> |
| | Terminal-Bench 1.0 | % solved | — | <https://terminal-bench.com/> |
| | Terminal-Bench 2.0 | % solved | — | <https://www.tbench.ai/leaderboard/terminal-bench/2.0> |
| | Toolathlon | % correct (pass@1) | 108 | <https://toolathlon.github.io/> |
| | Vending-Bench 2 | — | 15,000 | <https://andonlabs.com/evals/vending-bench-2> |
| | WideSearch (item-F1) | % | 200 | <https://huggingface.co/datasets/ByteDance-Seed/WideSearch> |
| | xbench-DeepSearch | % | 100 | <https://huggingface.co/datasets/xbench/DeepSearch> |
| | $\tau^{2}$-bench Airline | % success | 50 | <https://arxiv.org/abs/2506.07982> |
| | $\tau^{2}$-bench Retail | % success | 115 | <https://arxiv.org/abs/2506.07982> |
| | $\tau^{2}$-bench Telecom | % success | 114 | <https://arxiv.org/abs/2506.07982> |
| | $\tau^{3}$-Bench | % | 1,500 | <https://z.ai/blog/glm-5.1> |
| Math (23) | AIME 2024 | % correct (pass@1) | 30 | <https://artofproblemsolving.com/wiki/index.php/2024_AIME> |
| | AIME 2025 | % correct (pass@1) | 30 | <https://artofproblemsolving.com/wiki/index.php/2025_AIME> |
| | AIME 2026 | % correct (pass@1) | 30 | <https://huggingface.co/datasets/MathArena/aime_2026_I> |
| | Beyond AIME | % | 100 | <https://huggingface.co/datasets/ByteDance-Seed/BeyondAIME> |
| | BRUMO 2025 | % correct (pass@1) | 30 | <https://huggingface.co/datasets/MathArena/brumo_2025> |
| | CMIMC 2025 | % correct (pass@1) | 40 | <https://huggingface.co/datasets/MathArena/cmimc_2025> |
| | CNMO 2024 | % | 6 | <https://www.cms.org.cn/Home/comp/comp_details/id/1253.html> |
| | FrontierMath | % correct T1-3 | 300 | <https://epoch.ai/benchmarks/frontiermath> |
| | FrontierMath Tier 4 | % | 48 | <https://epoch.ai/benchmarks/frontiermath> |
| | GSM8K | % correct | 1,319 | <https://arxiv.org/abs/2110.14168> |
| | HMMT Feb 2025 | % | 30 | <https://huggingface.co/datasets/MathArena/hmmt_feb_2025> |
| | HMMT Feb 2026 | % correct (pass@1) | 33 | <https://huggingface.co/datasets/MathArena/hmmt_feb_2026> |
| | HMMT Nov 2025 | % correct | 30 | <https://huggingface.co/datasets/MathArena/hmmt_nov_2025> |
| | IMO-AnswerBench | — | 400 | <https://imobench.github.io/> |
| | MATH | — | 12,500 | <https://github.com/hendrycks/math> |
| | MATH-500 | % correct | 500 | <https://arxiv.org/abs/2103.03874> |
| | MathArena Apex 2025 | % correct | 12 | <https://matharena.ai/apex/> |
| | MathVision | % correct | 3,040 | <https://huggingface.co/datasets/MathLLMs/MathVision> |
| | MathVista | % | 1,000 | <https://huggingface.co/datasets/AI4Math/MathVista> |
| | MGSM | exact match (%) | 2,500 | <https://github.com/google-research/url-nlp/tree/main/mgsm> |
| | MT-AIME2024 | % | 1,650 | <https://huggingface.co/datasets/amphora/MCLM> |
| | SMT 2025 | % correct (pass@1) | 53 | <https://huggingface.co/datasets/MathArena/smt_2025> |
| | USAMO 2025 | % of 42 points | 6 | <https://huggingface.co/datasets/MathArena/usamo_2025> |
| Coding (21) | Aider Polyglot (diff mode) | % | 450 | <https://aider.chat/2024/12/21/polyglot.html> |
| | Aider Polyglot (whole mode) | % | 450 | <https://aider.chat/2024/12/21/polyglot.html> |
| | ArtifactsBench | % | 5,475 | <https://github.com/Tencent-Hunyuan/ArtifactsBenchmark> |
| | BigCodeBench | pass@1 % | 1,140 | <https://bigcode-bench.github.io/> |
| | Bird-SQL (Dev) | — | — | <https://bird-bench.github.io/> |
| | Codeforces Rating | Elo rating | — | <https://codeforces.com/> |
| | HumanEval | pass@1 % | 164 | <https://github.com/openai/human-eval> |
| | LiveCodeBench | pass@1 % | 1,055 | <https://livecodebench.github.io/> |
| | MBPP+ | — | — | <https://cohere.com/research/papers/command-a-technical-report.pdf> |
| | Multi-SWE-bench | % | 1,632 | <https://huggingface.co/datasets/ByteDance-Seed/Multi-SWE-bench> |
| | MultiPL-E (average) | % | 12,667 | <https://huggingface.co/datasets/nuprl/MultiPL-E> |
| | NL2Repo-Bench | % | 104 | <https://arxiv.org/abs/2512.12730> |
| | OJBench | % | 232 | <https://arxiv.org/abs/2506.16395> |
| | RepoQA | — | 500 | <https://arxiv.org/abs/2406.06025> |
| | SciCode | % correct | 338 | <https://scicode-bench.github.io/> |
| | SWE-bench Multilingual | % resolved | — | <https://www.swebench.com/> |
| | SWE-bench Pro | % resolved | 731 | <https://scale.com/leaderboard/swe_bench_pro_public> |
| | SWE-bench Verified | % resolved | 500 | <https://www.swebench.com/> |
| | SWE-Lancer IC Diamond | % | 198 | <https://github.com/openai/frontier-evals/tree/main/project/swelancer> |
| | SWE-Lancer IC SWE Diamond Freelance (\$) | dollars | 198 | <https://github.com/openai/frontier-evals/tree/main/project/swelancer> |
| | Terminal-Bench Hard | % | — | <https://z.ai/blog/glm-4.7> |
| Multimodal & vision (12) | BabyVision | % accuracy | 388 | <https://huggingface.co/datasets/UnipatAI/BabyVision> |
| | CharXiv Descriptive | % accuracy | 4,000 | <https://charxiv.github.io/> |
| | CharXiv Reasoning | % accuracy | 1,000 | <https://charxiv.github.io/> |
| | ERQA | % | 400 | <https://github.com/embodiedreasoning/ERQA> |
| | MMMU | % correct | 900 | <https://mmmu-benchmark.github.io/> |
| | MMMU-Pro | % correct | 3,460 | <https://huggingface.co/datasets/MMMU/MMMU_Pro> |
| | OmniDocBench (normalized edit distance, lower is better) | edit distance (lower=better) | 1,651 | <https://huggingface.co/datasets/opendatalab/OmniDocBench> |
| | OmniDocBench 1.5 | edit distance (lower=better) | 1,355 | <https://github.com/opendatalab/OmniDocBench> |
| | ScreenSpot-Pro | — | 1,581 | <https://github.com/likaixin2000/ScreenSpot-Pro-GUI-Grounding> |
| | Vibe-Eval | — | 269 | <https://github.com/reka-ai/reka-vibe-eval> |
| | Video-MME | — | 2,700 | <https://video-mme.github.io/> |
| | Video-MMMU | % | 900 | <https://videommmu.github.io/> |
| Long context (9) | AA Long Context Reasoning | % correct | 300 | <https://artificialanalysis.ai/methodology/intelligence-benchmarking> |
| | BrowseComp Long Context 128k | % accuracy | 1,266 | <https://openai.com/index/gpt-5-1-for-developers/> |
| | GraphWalks BFS 0-128K | % | 300 | <https://huggingface.co/datasets/openai/graphwalks> |
| | GraphWalks parents 0-128K | % | 350 | <https://huggingface.co/datasets/openai/graphwalks> |
| | LongBench-V2 | % | 503 | <https://huggingface.co/datasets/THUDM/LongBench-v2> |
| | MRCR v1 | — | 2,000 | <https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf> |
| | MRCR v2 | % correct | 2,400 | <https://huggingface.co/datasets/openai/mrcr> |
| | OpenAI MRCR v2 (2 needle, 128k) | % | 500 | <https://huggingface.co/datasets/openai/mrcr> |
| | OpenAI MRCR v2 (8-needle) | % | 800 | <https://huggingface.co/datasets/openai/mrcr> |
| Instruction following (9) | Arena-Hard Auto | % win rate | 500 | <https://lmarena.ai/> |
| | COLLIE | % | 2,080 | <https://arxiv.org/abs/2307.08689> |
| | IFBench | % correct | 300 | <https://github.com/allenai/IFBench> |
| | IFEval | % correct (prompt strict) | 541 | <https://arxiv.org/abs/2311.07911> |
| | InFoBench | — | 2,250 | <https://github.com/qinyiwei/InfoBench> |
| | Internal API IF Hard | % | — | <https://openai.com/index/introducing-gpt-5-for-developers/> |
| | Multi-IF | % | 13,503 | <https://huggingface.co/datasets/facebook/Multi-IF> |
| | MultiChallenge | % | 273 | <https://github.com/ekwinox117/multi-challenge> |
| | MultiChallenge (o3-mini grader) | % | 273 | <https://github.com/ekwinox117/multi-challenge> |
| Knowledge & QA (9) | C-Eval (Chinese) | % | 12,342 | <https://huggingface.co/datasets/ceval/ceval-exam> |
| | Chinese-SimpleQA | % | 3,000 | <https://huggingface.co/datasets/OpenStellarTeam/Chinese-SimpleQA> |
| | GDPval (Artificial Analysis ELO) | score | 220 | <https://huggingface.co/datasets/openai/gdpval> |
| | HealthBench | % | 5,000 | <https://huggingface.co/datasets/openai/healthbench> |
| | MMLU-Pro | % correct | 12,032 | <https://arxiv.org/abs/2406.01574> |
| | MMMLU | % correct | 258,090 | <https://huggingface.co/datasets/openai/MMMLU> |
| | PopQA | — | 14,267 | <https://huggingface.co/datasets/akariasai/PopQA> |
| | SimpleQA | % correct | 4,326 | <https://openai.com/index/introducing-simpleqa/> |
| | SimpleQA-Verified | % correct (pass@1) | — | <https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro> |
| Reasoning (8) | ARC-AGI-1 | % correct | 400 | <https://arcprize.org/arc-agi/1/> |
| | ARC-AGI-2 | % correct | 400 | <https://arcprize.org/arc-agi/2/> |
| | BigBench Hard (BBH) | — | — | <https://arxiv.org/abs/2210.09261> |
| | DROP | % | 9,536 | <https://huggingface.co/datasets/EleutherAI/drop> |
| | Global PIQA | — | 6,283 | <https://huggingface.co/datasets/mrlbenchmarks/global-piqa-parallel> |
| | HLE (Humanity’s Last Exam) | % correct | 2,500 | <https://lastexam.ai/> |
| | HLE (w/ tools) | accuracy (%) | 2,500 | <https://huggingface.co/moonshotai/Kimi-K2.5> |
| | HLE Text | % | 2,158 | <https://labs.scale.com/leaderboard/humanitys_last_exam_text_only> |
| Hallucination & factuality (5) | FACTS Grounding | — | 1,719 | <https://arxiv.org/abs/2501.03200> |
| | FActScore (hallucination rate) | % | 500 | <https://github.com/shmsw25/FActScore> |
| | LongFact-Concepts (hallucination rate) | % | 1,140 | <https://github.com/google-deepmind/long-form-factuality/tree/main/longfact> |
| | LongFact-Objects (hallucination rate) | % | 1,140 | <https://github.com/google-deepmind/long-form-factuality/tree/main/longfact> |
| | TruthfulQA | — | 817 | <https://github.com/sylinrl/TruthfulQA> |
| Science (4) | CritPt | % correct | 70 | <https://huggingface.co/datasets/CritPt-Benchmark/CritPt> |
| | GPQA Diamond | % correct | 198 | <https://arxiv.org/abs/2311.12022> |
| | GPQA Main (full set) | — | 448 | <https://arxiv.org/abs/2311.12022> |
| | SuperGPQA | % | 26,529 | <https://huggingface.co/datasets/m-a-p/SuperGPQA> |
| Other (7) | AA Intelligence Index | index score | 12,826 | <https://artificialanalysis.ai/methodology/intelligence-benchmarking> |
| | AlpacaEval 2.0 (LC-winrate) | % | — | <https://arxiv.org/abs/2501.12948> |
| | Bullshit-Bench (Clear Pushback) | % clear pushback | 55 | <https://github.com/petergpt/bullshit-benchmark> |
| | Chatbot Arena Elo | Elo rating | 8,000 | <https://arxiv.org/abs/2403.04132> |
| | CLUEWSC | % | 2,574 | <https://huggingface.co/datasets/clue/clue> |
| | LiveBench | overall score | 1,000 | <https://github.com/LiveBench/LiveBench> |
| | Safety (OLMES suite) | — | — | <https://arxiv.org/abs/2501.00656> |

Table 10: Model inventory. All 84 models from 13 providers. *R* = reasoning (chain-of-thought). *O* = open-weight. Parameter counts in billions; “—” = undisclosed. Active parameters shown only for MoE models.

| Provider | Model | Params (B) | Active (B) | R | O | Released | Source |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI | GPT-3.5 Turbo (0125) | — | — | ✗ | ✗ | 2024-01 | <https://platform.openai.com/docs/models> |
| | GPT-4o (2024-05-13) | — | — | ✗ | ✗ | 2024-05 | <https://openai.com/index/hello-gpt-4o/> |
| | GPT-4o mini | — | — | ✗ | ✗ | 2024-07 | <https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/> |
| | GPT-4o (2024-11-20) | — | — | ✗ | ✗ | 2024-11 | <https://openai.com/index/hello-gpt-4o/> |
| | OpenAI o1 (high) | — | — | ✓ | ✗ | 2024-12 | |
| | o3-mini (high) | — | — | ✓ | ✗ | 2025-01 | <https://github.com/openai/simple-evals> |
| | GPT-4.5 | — | — | ✗ | ✗ | 2025-02 | <https://www.helicone.ai/blog/gpt-4.5-benchmarks> |
| | GPT-4.1 | — | — | ✗ | ✗ | 2025-04 | <https://arxiv.org/abs/2507.20534> |
| | GPT-4.1 mini | — | — | ✗ | ✗ | 2025-04 | <https://www.helicone.ai/blog/gpt-4.1-full-developer-guide> |
| | GPT-4.1 nano | — | — | ✗ | ✗ | 2025-04 | <https://www.datacamp.com/blog/gpt-4-1> |
| | o3 (high) | — | — | ✓ | ✗ | 2025-04 | <https://openai.com/index/introducing-o3-and-o4-mini/> |
| | o4-mini (high) | — | — | ✓ | ✗ | 2025-04 | <https://www.datacamp.com/blog/o4-mini> |
| | GPT-5 mini | — | — | ✓ | ✗ | 2025-07 | <https://openai.com/index/introducing-gpt-5-2/> |
| | GPT-5 nano | — | — | ✓ | ✗ | 2025-07 | <https://openai.com/index/introducing-gpt-5-2/> |
| | GPT-5 | — | — | ✓ | ✗ | 2025-08 | <https://openai.com/index/introducing-gpt-5/> |
| | gpt-oss-120B | 116.8 | 5.1 | ✓ | ✓ | 2025-08 | <https://arxiv.org/abs/2508.10925> |
| | GPT-5.1 | — | — | ✓ | ✗ | 2025-11 | <https://www.vellum.ai/blog/gpt-5-2-benchmarks> |
| | GPT-5.2 | — | — | ✓ | ✗ | 2025-12 | <https://openai.com/index/introducing-gpt-5-2/> |
| | GPT-5.4 | — | — | ✓ | ✗ | 2026-03 | |
| | GPT-5.5 | — | — | ✓ | ✗ | 2026-04 | |
| Google | Gemini 1.5 Flash | — | — | ✗ | ✗ | 2024-05 | <https://deepmind.google/technologies/gemini/flash/> |
| | Gemini 1.5 Pro | — | — | ✗ | ✗ | 2024-05 | <https://deepmind.google/technologies/gemini/pro/> |
| | Gemma 2 27B | 27 | 27 | ✗ | ✓ | 2024-06 | <https://blog.google/technology/developers/google-gemma-2/> |
| | Gemma 2 9B | 9 | 9 | ✗ | ✓ | 2024-06 | <https://blog.google/technology/developers/google-gemma-2/> |
| | Gemma 3 1B | — | — | ✓ | ✓ | 2025 | <https://blog.google/technology/developers/gemma-3/> |
| | Gemini 2.0 Flash | — | — | ✗ | ✗ | 2025-02 | <https://artificialanalysis.ai/models/gemini-2-0-flash> |
| | Gemma 3 27B | 27 | 27 | ✗ | ✓ | 2025-03 | <https://llm-stats.com/benchmarks> |
| | Gemini 2.5 Flash | — | — | ✓ | ✗ | 2025-05 | <https://llm-stats.com/models/gemini-2.5-flash> |
| | Gemini 2.5 Pro (GA) | — | — | ✓ | ✗ | 2025-06 | <https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/> |
| | Gemini 3 Flash | — | — | ✓ | ✗ | 2025-11 | <https://www.vellum.ai/blog/google-gemini-3-benchmarks> |
| | Gemini 3 Pro | — | — | ✓ | ✗ | 2025-11 | <https://www.vellum.ai/blog/google-gemini-3-benchmarks> |
| | Gemini 3.1 Pro | — | — | ✓ | ✗ | 2026-02 | <https://www.digitalapplied.com/blog/google-gemini-3-1-pro-benchmarks-pricing-guide> |
| Anthropic | Claude 3.5 Sonnet (1022) | — | — | ✗ | ✗ | 2024-10 | <https://www.anthropic.com/news/claude-3-5-sonnet> |
| | Claude 3.7 Sonnet | — | — | ✓ | ✗ | 2025-02 | <https://www.anthropic.com/news/claude-3-7-sonnet> |
| | Claude Opus 4 | — | — | ✓ | ✗ | 2025-05 | <https://arxiv.org/abs/2507.20534> |
| | Claude Sonnet 4 | — | — | ✗ | ✗ | 2025-05 | <https://arxiv.org/abs/2507.20534> |
| | Claude Opus 4.1 | — | — | ✓ | ✗ | 2025-08 | <https://www.anthropic.com/news/claude-opus-4-1> |
| | Claude Sonnet 4.5 | — | — | ✓ | ✗ | 2025-09 | <https://www.anthropic.com/news/claude-sonnet-4-5> |
| | Claude Haiku 4.5 | — | — | ✓ | ✗ | 2025-10 | <https://www.anthropic.com/news/claude-haiku-4-5> |
| | Claude Opus 4.5 | — | — | ✓ | ✗ | 2025-11 | <https://www.anthropic.com/news/claude-opus-4-5> |
| | Claude Opus 4.6 | — | — | ✓ | ✗ | 2026-02 | <https://www.vellum.ai/blog/claude-opus-4-6-benchmarks> |
| | Claude Sonnet 4.6 | — | — | ✓ | ✗ | 2026-02 | <https://www.anthropic.com/news/claude-sonnet-4-6> |
| | Claude Opus 4.7 | — | — | ✓ | ✗ | 2026-04 | |
| Alibaba | Qwen2.5 72B Instruct | — | — | ✗ | ✓ | 2024-09 | <https://qwenlm.github.io/blog/qwen2.5/> |
| | Qwen2.5-14B | 14 | 14 | ✗ | ✓ | 2024-09 | |
| | Qwen2.5-32B-Instruct | 32 | 32 | ✗ | ✓ | 2024-09 | |
| | Qwen2.5-7B-Instruct | 7 | 7 | ✗ | ✓ | 2024-09 | |
| | QwQ-32B | 32.8 | 32.8 | ✓ | ✓ | 2025-03 | <https://qwenlm.github.io/blog/qwq-32b/> |
| | Qwen3-235B-A22B | 235 | 22 | ✓ | ✓ | 2025-05 | <https://qwenlm.github.io/blog/qwen3/> |
| | Qwen3-30B-A3B | 30 | 3 | ✓ | ✓ | 2025-05 | |
| | Qwen3-32B | 32 | 32 | ✓ | ✓ | 2025-05 | <https://arxiv.org/abs/2505.09388> |
| | Qwen3-8B | 8 | 8 | ✓ | ✓ | 2025-05 | |
| | Qwen3.5-397B-A17B | 397 | 17 | ✓ | ✓ | 2026-02 | <https://venturebeat.com/technology/alibabas-qwen-3-5-397b-a17/> |
| | Qwen3.6-Plus | — | — | ✓ | ✗ | 2026-03 | <https://docs.apiyi.com/en/news/qwen-3-6-plus-launch> |
| DeepSeek | DeepSeek-V2-0506 | — | — | ✗ | ✓ | 2024-05 | <https://github.com/deepseek-ai/DeepSeek-V2> |
| | DeepSeek-V2.5-0905 | — | — | ✗ | ✓ | 2024-09 | <https://github.com/deepseek-ai/DeepSeek-V2.5> |
| | DeepSeek-V3 | 671 | 37 | ✗ | ✓ | 2025-01 | <https://github.com/deepseek-ai/DeepSeek-V3> |
| | DeepSeek-R1 | 671 | 37 | ✓ | ✓ | 2025-01 | <https://github.com/deepseek-ai/DeepSeek-R1> |
| | DeepSeek-R1-Distill-Llama-70B | 70 | 70 | ✓ | ✓ | 2025-01 | |
| | DeepSeek-R1-0528 | 671 | 37 | ✓ | ✓ | 2025-05 | <https://huggingface.co/deepseek-ai/DeepSeek-R1-0528> |
| | DeepSeek-V3.2 | 671 | 37 | ✓ | ✓ | 2025-12 | <https://arxiv.org/abs/2512.02556> |
| | DeepSeek-V4-Flash | 284 | 13 | ✓ | ✓ | 2026-04 | |
| | DeepSeek-V4-Pro | 1600 | 49 | ✓ | ✓ | 2026-04 | |
| Meta | LLaMA-3.1 405B Instruct | — | — | ✗ | ✓ | 2024-07 | |
| | Llama 3.1 8B Instruct | 8 | 8 | ✗ | ✓ | 2024-07 | |
| | Llama 3.2 1B | — | — | ✓ | ✓ | 2024-09 | |
| | Llama-3.3-70B-Instruct | 70 | 70 | ✗ | ✓ | 2024-12 | |
| | Llama 4 Maverick | 402 | 17 | ✗ | ✓ | 2025-04 | <https://ai.meta.com/blog/llama-4-multimodal-intelligence/> |
| | Muse Spark | — | — | ✓ | ✗ | 2026-04 | <https://about.fb.com/news/2026/04/introducing-muse-spark-meta-superintelligence-labs/> |
| Zhipu AI | GLM-4.6 | — | — | ✓ | ✗ | 2025-09 | <https://llm-stats.com/models/glm-4.6> |
| | GLM-4.7 | — | — | ✓ | ✗ | 2025-12 | <https://medium.com/@leucopsis/a-technical-analysis-of-glm-4-7> |
| | GLM-5 | — | — | ✓ | ✓ | 2026-03 | |
| | GLM-5.1 | — | — | ✓ | ✓ | 2026-04 | <https://docs.apiyi.com/en/news/glm-5-1-launch> |
| Moonshot AI | Kimi K2 | — | — | ✗ | ✓ | 2025-07 | <https://arxiv.org/abs/2507.20534> |
| | Kimi K2.5 | — | — | ✓ | ✓ | 2026-01 | <https://huggingface.co/moonshotai/Kimi-K2.5> |
| | Kimi K2.6 | — | — | ✓ | ✓ | 2026-04 | |
| xAI | Grok 3 Beta | — | — | ✗ | ✗ | 2025-02 | <https://x.ai/news/grok-3> |
| | Grok 4 | — | — | ✓ | ✗ | 2025-07 | <https://matharena.ai/> |
| | Grok 4.1 | — | — | ✓ | ✗ | 2025-11 | <https://matharena.ai/> |
| MiniMax | MiniMax-M2 | — | — | ✓ | ✗ | 2025-10 | <https://artificialanalysis.ai/models/minimax-m2> |
| | MiniMax M2.1 | — | — | ✓ | ✓ | 2025-12 | |
| Cohere | Command A | 111 | — | ✗ | ✓ | 2025-03 | <https://cohere.com/research/papers/command-a-technical-report.pdf> |
| ByteDance | Doubao Seed 2.0 Pro | — | — | ✓ | ✗ | 2026-02 | <https://www.digitalapplied.com/blog/bytedance-seed-2-doubao-ai-model-benchmarks-guide> |
| Mistral | Ministral 8B Instruct 2410 | 8 | 8 | ✗ | ✓ | 2024-10 | |

## Appendix C Supplemental to [Section 4](https://arxiv.org/html/2606.24020#S4): BenchPress: A Low-rank Benchmark Score Predictor

### C.1 Candidate Methods

This appendix gives the formal definitions of the matrix-completion methods compared in [Section 4.2](https://arxiv.org/html/2606.24020#S4.SS2). All methods below operate on the same transformed, column-standardized matrix.

We use the following notation throughout this subsection:

- $X\in\mathbb{R}^{M\times B}$ is the transformed, column-standardized version of the adopted score matrix, with $M=84$ models and $B=133$ benchmarks.
- $x_{mb}$ is the entry for model $m$ on benchmark $b$ in this transformed space.
- $\Omega$ is the set of observed cells.
- $\Omega_{m}=\{b:(m,b)\in\Omega\}$ is the set of benchmarks observed for model $m$.
- $\Omega^{b}=\{m:(m,b)\in\Omega\}$ is the set of models observed for benchmark $b$.
- $\bar{x}_{m\cdot}=|\Omega_{m}|^{-1}\sum_{b\in\Omega_{m}}x_{mb}$ is the observed mean for model $m$.
- $\bar{x}_{\cdot b}=|\Omega^{b}|^{-1}\sum_{m\in\Omega^{b}}x_{mb}$ is the observed mean for benchmark $b$.
- $\bar{x}=|\Omega|^{-1}\sum_{(m,b)\in\Omega}x_{mb}$ is the observed global mean.
- $\hat{x}_{mb}$ is a method’s prediction for cell $(m,b)$.
- $R$ is the rank used by low-rank methods in this subsection.
- $N_{k}(t;\rho)$ is a method-local top-$k$ neighbor set. For benchmark targets, $t=b\in\{1,\ldots,B\}$ and candidate neighbors are $u\in\{1,\ldots,B\}\setminus\{b\}$; for model targets, $t=m\in\{1,\ldots,M\}$ and candidate neighbors are $u\in\{1,\ldots,M\}\setminus\{m\}$. The scoring function $\rho(t,u)\in\mathbb{R}$ ranks each eligible candidate $u$ for target $t$; larger scores are preferred, so distances are used with a minus sign.

After a method produces predictions in this space, the pipeline maps them back to the original score scale by undoing the standardization and feature transform\.
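The round trip between score space and the transformed space can be sketched as below. This section does not restate the feature transform, so the sketch makes loud assumptions: scores are percentages in $[0, 100]$, the transform is a logit applied after clipping by a hypothetical `eps`, and standardization uses each column's observed mean and standard deviation. Benchmarks on other scales (for example Elo ratings) would need their own rescaling first. The function names are illustrative.

```python
import numpy as np

def to_model_space(S, eps=1e-3):
    """Assumed forward map: percent -> logit -> column z-score (NaN = missing)."""
    p = np.clip(S / 100.0, eps, 1.0 - eps)  # keep the logit finite at 0 and 100
    Z = np.log(p / (1.0 - p))
    mu = np.nanmean(Z, axis=0)              # observed column means
    sd = np.nanstd(Z, axis=0)               # observed column standard deviations
    sd = np.where(sd > 0, sd, 1.0)          # guard constant or single-entry columns
    return (Z - mu) / sd, mu, sd

def to_score_space(X, mu, sd):
    """Inverse map: undo the standardization, then apply the logistic function."""
    Z = X * sd + mu
    return 100.0 / (1.0 + np.exp(-Z))
```

Predicting in logit space and mapping back through the logistic function keeps every prediction strictly between 0 and 100, which a linear model on raw percentages does not guarantee.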

##### Benchmark mean\.

The benchmark\-mean baseline predicts each missing cell from the observed mean of the target benchmark:

$$\hat{x}_{mb}=\bar{x}_{\cdot b}. \tag{5}$$

##### Model mean\.

The model\-mean baseline predicts each missing cell from the observed mean of the target model:

$$\hat{x}_{mb}=\bar{x}_{m\cdot}. \tag{6}$$
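Both mean baselines, Eqs. (5) and (6), reduce to a NaN-aware mean along one axis of the transformed matrix. The sketch below assumes missing cells are stored as NaN; the function names are illustrative.

```python
import numpy as np

def benchmark_mean(X):
    """Eq. (5): every cell in column b is predicted by that column's observed mean."""
    return np.broadcast_to(np.nanmean(X, axis=0), X.shape)

def model_mean(X):
    """Eq. (6): every cell in row m is predicted by that row's observed mean."""
    return np.broadcast_to(np.nanmean(X, axis=1, keepdims=True), X.shape)

X = np.array([[1.0, np.nan],
              [3.0, 4.0]])
pred_bench = benchmark_mean(X)  # missing cell (0, 1) uses the column-1 mean
pred_model = model_mean(X)      # missing cell (0, 1) uses the row-0 mean
```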

##### Bench\-KNN\.

Let $\Omega^{bb'}=\Omega^{b}\cap\Omega^{b'}$, and let $\operatorname{corr}(b,b')\in[-1,1]$ be the Pearson correlation between benchmark columns $b$ and $b'$ over $\Omega^{bb'}$ when this correlation is defined. Here $N_{k}(b;\operatorname{corr})$ selects benchmark neighbors $b'\neq b$ using score $\rho(b,b')=\operatorname{corr}(b,b')$. For a missing cell $(m,b)$, Bench-KNN predicts by the correlation-weighted average over $N_{k}(b;\operatorname{corr})\cap\Omega_{m}$:

$$\hat{x}_{mb}=\frac{\sum_{b'\in N_{k}(b;\operatorname{corr})\cap\Omega_{m}}\max(\operatorname{corr}(b,b'),0.01)\,x_{mb'}}{\sum_{b'\in N_{k}(b;\operatorname{corr})\cap\Omega_{m}}\max(\operatorname{corr}(b,b'),0.01)}. \tag{7}$$

If $N_{k}(b;\operatorname{corr})\cap\Omega_{m}$ is empty, it falls back to $\bar{x}_{\cdot b}$.
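A minimal sketch of the Bench-KNN rule in Eq. (7), assuming a NaN-masked NumPy matrix. The name `bench_knn_predict` is hypothetical, and treating the correlation as "defined" whenever at least two shared models exist with non-constant scores is an assumption made for this example.

```python
import numpy as np

def bench_knn_predict(X, m, b, k=10, floor=0.01):
    """Predict cell (m, b) from the top-k most correlated benchmark columns."""
    obs = ~np.isnan(X)
    scores = {}
    for u in range(X.shape[1]):
        if u == b:
            continue
        shared = obs[:, b] & obs[:, u]          # models observed on both columns
        if shared.sum() < 2:
            continue                            # assumption: correlation undefined
        xb, xu = X[shared, b], X[shared, u]
        if xb.std() == 0 or xu.std() == 0:
            continue                            # constant column: correlation undefined
        scores[u] = np.corrcoef(xb, xu)[0, 1]
    # N_k(b; corr): the k candidates with the largest correlation
    neighbors = sorted(scores, key=scores.get, reverse=True)[:k]
    usable = [u for u in neighbors if obs[m, u]]  # intersect with Omega_m
    if not usable:
        return float(np.nanmean(X[:, b]))         # fallback: benchmark mean
    w = np.array([max(scores[u], floor) for u in usable])
    vals = np.array([X[m, u] for u in usable])
    return float(w @ vals / w.sum())

X = np.array([[1.0, 2.0, 10.0],
              [2.0, 4.0, 8.0],
              [3.0, 6.0, 6.0],
              [4.0, np.nan, 0.0]])
pred = bench_knn_predict(X, m=3, b=1, k=2)
```

In this toy matrix, column 0 correlates perfectly with the target and gets weight 1, while the anti-correlated column 2 is floored to weight 0.01, so the prediction is dominated by column 0.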

##### Model\-KNN\.

Let $\Omega_{mm'}=\Omega_{m}\cap\Omega_{m'}$. For each model pair $(m,m')$ with $|\Omega_{mm'}|\geq 3$, define the shared-benchmark distance function $\Delta(m,m')\in\mathbb{R}_{\geq 0}$ by

$$\Delta(m,m')=\sqrt{\frac{1}{|\Omega_{mm'}|}\sum_{b\in\Omega_{mm'}}(x_{mb}-x_{m'b})^{2}}. \tag{8}$$

Here $N_{k}(m;-\Delta)$ selects model neighbors $m'\neq m$ using score $\rho(m,m')=-\Delta(m,m')$. For a missing cell $(m,b)$, Model-KNN predicts by the average over $N_{k}(m;-\Delta)\cap\Omega^{b}$:

$$\hat{x}_{mb}=\frac{1}{|N_{k}(m;-\Delta)\cap\Omega^{b}|}\sum_{m'\in N_{k}(m;-\Delta)\cap\Omega^{b}}x_{m'b}. \tag{9}$$

If $N_{k}(m;-\Delta)\cap\Omega^{b}$ is empty, it falls back to $\bar{x}_{\cdot b}$.
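The Model-KNN rule of Eqs. (8) and (9) can be sketched the same way. The function name `model_knn_predict` and the NaN-mask convention are assumptions for illustration; the minimum overlap of three shared benchmarks follows the text.

```python
import numpy as np

def model_knn_predict(X, m, b, k=10, min_shared=3):
    """Predict cell (m, b) as the mean score of the k nearest models on b."""
    obs = ~np.isnan(X)
    dist = {}
    for u in range(X.shape[0]):
        if u == m:
            continue
        shared = obs[m] & obs[u]                 # benchmarks observed for both models
        if shared.sum() < min_shared:
            continue
        # Eq. (8): root-mean-square difference on shared benchmarks
        dist[u] = np.sqrt(np.mean((X[m, shared] - X[u, shared]) ** 2))
    neighbors = sorted(dist, key=dist.get)[:k]   # smallest distance first
    usable = [u for u in neighbors if obs[u, b]] # intersect with Omega^b
    if not usable:
        return float(np.nanmean(X[:, b]))        # fallback: benchmark mean
    return float(np.mean([X[u, b] for u in usable]))

X = np.array([[1.0, 2.0, 3.0, np.nan],
              [1.0, 2.0, 3.0, 10.0],
              [2.0, 3.0, 4.0, 20.0],
              [5.0, 6.0, 7.0, 30.0]])
pred_k2 = model_knn_predict(X, m=0, b=3, k=2)
pred_k1 = model_knn_predict(X, m=0, b=3, k=1)
```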

##### BenchReg\.

For each target benchmark $b$ and predictor benchmark $b'$, BenchReg fits a one-dimensional linear predictor $f_{bb'}:\mathbb{R}\to\mathbb{R}$ on $\Omega^{bb'}=\Omega^{b}\cap\Omega^{b'}$. Here $R^{2}(f_{bb'})\in\mathbb{R}$ is the coefficient of determination of this linear fit on the shared observations. For BenchReg, $N_{k}(b;R^{2})$ selects benchmark neighbors $b'\neq b$ with $|\Omega^{bb'}|\geq 5$ and $R^{2}(f_{bb'})\geq R^{2}_{\min}$ using score $\rho(b,b')=R^{2}(f_{bb'})$. For a missing cell $(m,b)$, BenchReg predicts by the $R^{2}$-weighted average over $N_{k}(b;R^{2})\cap\Omega_{m}$:

$$\hat{x}_{mb}=\frac{\sum_{b'\in N_{k}(b;R^{2})\cap\Omega_{m}}R^{2}(f_{bb'})\,f_{bb'}(x_{mb'})}{\sum_{b'\in N_{k}(b;R^{2})\cap\Omega_{m}}R^{2}(f_{bb'})}. \tag{10}$$

If $N_{k}(b;R^{2})\cap\Omega_{m}$ is empty, BenchReg leaves the cell unpredicted.

##### ModelReg\.

ModelReg is the row-wise counterpart of BenchReg. For each target model $m$ and predictor model $m'$, it fits a one-dimensional linear predictor $f_{mm'}:\mathbb{R}\to\mathbb{R}$ on $\Omega_{mm'}=\Omega_{m}\cap\Omega_{m'}$. Here $R^{2}(f_{mm'})\in\mathbb{R}$ is the coefficient of determination of this linear fit on the shared benchmarks. For ModelReg, $N_{k}(m;R^{2})$ selects model neighbors $m'\neq m$ with $|\Omega_{mm'}|\geq 5$ and $R^{2}(f_{mm'})\geq R^{2}_{\min}$ using score $\rho(m,m')=R^{2}(f_{mm'})$. For a missing cell $(m,b)$, ModelReg predicts by the $R^{2}$-weighted average over $N_{k}(m;R^{2})\cap\Omega^{b}$:

$$\hat{x}_{mb}=\frac{\sum_{m'\in N_{k}(m;R^{2})\cap\Omega^{b}}R^{2}(f_{mm'})\,f_{mm'}(x_{m'b})}{\sum_{m'\in N_{k}(m;R^{2})\cap\Omega^{b}}R^{2}(f_{mm'})}. \tag{11}$$

If $N_{k}(m;R^{2})\cap\Omega^{b}$ is empty, ModelReg leaves the cell unpredicted.
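Both regression baselines can be sketched with one function, since ModelReg is BenchReg applied to the transposed matrix. The names `benchreg_predict` and `modelreg_predict`, the NaN-mask convention, and the default values of `k` and `r2_min` are illustrative assumptions; returning `None` stands in for "unpredicted".

```python
import numpy as np

def benchreg_predict(X, m, b, k=7, r2_min=0.2, min_shared=5):
    """R^2-weighted average of per-neighbor linear fits, as in Eq. (10)."""
    obs = ~np.isnan(X)
    fits = {}
    for u in range(X.shape[1]):
        if u == b:
            continue
        shared = obs[:, b] & obs[:, u]
        if shared.sum() < min_shared:
            continue
        x, y = X[shared, u], X[shared, b]
        if x.std() == 0 or y.std() == 0:
            continue                               # degenerate fit, skip
        slope, intercept = np.polyfit(x, y, 1)     # one-dimensional least squares
        resid = y - (slope * x + intercept)
        r2 = 1.0 - resid.var() / y.var()           # coefficient of determination
        if r2 >= r2_min:
            fits[u] = (r2, slope, intercept)
    neighbors = sorted(fits, key=lambda u: fits[u][0], reverse=True)[:k]
    usable = [u for u in neighbors if obs[m, u]]
    if not usable:
        return None                                # cell left unpredicted
    w = np.array([fits[u][0] for u in usable])
    preds = np.array([fits[u][1] * X[m, u] + fits[u][2] for u in usable])
    return float(w @ preds / w.sum())

def modelreg_predict(X, m, b, **kwargs):
    """Row-wise counterpart, Eq. (11): swap the roles of models and benchmarks."""
    return benchreg_predict(X.T, b, m, **kwargs)

X = np.array([[1.0, 3.0], [2.0, 5.0], [3.0, 7.0],
              [4.0, 9.0], [5.0, 11.0], [6.0, np.nan]])
pred = benchreg_predict(X, m=5, b=1)
```

Because these methods can abstain, their coverage can fall below 100%, unlike the KNN and mean baselines, which always fall back to a column mean.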

##### Soft\-Impute\.

Soft-Impute [mazumder2010] initializes missing cells and then alternates between a rank-$R$ SVD projection and clamping the observed entries:

$$X^{(\ell+1)}_{mb}=x_{mb}\quad\text{for }(m,b)\in\Omega,\qquad X^{(\ell+1)}_{\Omega^{c}}=\bigl[\operatorname{SVD}_{R}(X^{(\ell)})\bigr]_{\Omega^{c}}. \tag{12}$$

The fixed point gives predictions $\hat{x}_{mb}=X^{(\infty)}_{mb}$.
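The iteration in Eq. (12) can be sketched as follows. The initialization with column means, the stopping tolerance, and the function name `rank_r_impute` are assumptions for illustration; the source only states that missing cells are initialized and the two steps are alternated.

```python
import numpy as np

def rank_r_impute(X, rank=2, n_iter=500, tol=1e-8):
    """Alternate a rank-R truncated SVD with clamping of observed entries."""
    obs = ~np.isnan(X)
    col_mean = np.nanmean(X, axis=0)
    Z = np.where(obs, X, col_mean)                 # assumed initialization
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        low = (U[:, :rank] * s[:rank]) @ Vt[:rank] # rank-R projection SVD_R(Z)
        Z_new = np.where(obs, X, low)              # keep observed cells fixed
        converged = np.max(np.abs(Z_new - Z)) < tol
        Z = Z_new
        if converged:
            break
    return Z

# Toy check: an exactly rank-1 matrix with one hidden cell (true value 12).
A = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
X = A.copy()
X[3, 2] = np.nan
Z = rank_r_impute(X, rank=1)
```

Observed entries are reproduced exactly by construction, so only the cells in $\Omega^{c}$ carry predictions.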

##### Bias\-decomposed ALS\.

Bias-decomposed ALS adds a residual correction $UV^{\top}$, with $U\in\mathbb{R}^{M\times R}$ and $V\in\mathbb{R}^{B\times R}$, fitted by

$$\begin{aligned}(U,V)=\arg\min_{U,V}\;&\sum_{(m,b)\in\Omega}\Bigl[x_{mb}-\bigl(\bar{x}+(\bar{x}_{m\cdot}-\bar{x})+(\bar{x}_{\cdot b}-\bar{x})\bigr)-(UV^{\top})_{mb}\Bigr]^{2}\\&+\lambda\left(\|U\|_{F}^{2}+\|V\|_{F}^{2}\right).\end{aligned} \tag{13}$$

The prediction is therefore

$$\hat{x}_{mb}=\underbrace{\bar{x}}_{\text{global level}}+\underbrace{(\bar{x}_{m\cdot}-\bar{x})}_{\text{model }m\text{ offset}}+\underbrace{(\bar{x}_{\cdot b}-\bar{x})}_{\text{benchmark }b\text{ offset}}+\underbrace{(UV^{\top})_{mb}}_{\text{rank-}R\text{ residual correction}}. \tag{14}$$

The correction satisfies $UV^{\top}\in\mathbb{R}^{M\times B}$ and has rank at most $R$ because $U$ and $V$ each have $R$ columns. The default predictor uses rank $R=2$, $\lambda=0.1$, and averages the completed matrices from 10 random initializations (seeds 42–51).
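A minimal sketch of the objective in Eq. (13), solved by alternating ridge regressions. The function name `bias_als`, the number of sweeps, and the scale of the random initialization are assumptions; the rank, $\lambda$, and the seed range follow the defaults stated above.

```python
import numpy as np

def bias_als(X, rank=2, lam=0.1, n_iter=50, seed=42):
    """Additive bias terms plus a rank-R residual fitted by ALS (Eqs. 13-14)."""
    obs = ~np.isnan(X)
    M, B = X.shape
    g = np.nanmean(X)                               # global level
    row = np.nanmean(X, axis=1)                     # model means
    col = np.nanmean(X, axis=0)                     # benchmark means
    base = g + (row[:, None] - g) + (col[None, :] - g)
    resid = np.where(obs, X - base, 0.0)            # residual on observed cells
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((M, rank))        # assumed init scale
    V = 0.1 * rng.standard_normal((B, rank))
    reg = lam * np.eye(rank)
    for _ in range(n_iter):
        for m in range(M):                          # ridge solve for each model row
            Vo = V[obs[m]]
            U[m] = np.linalg.solve(Vo.T @ Vo + reg, Vo.T @ resid[m, obs[m]])
        for b in range(B):                          # ridge solve for each benchmark row
            Uo = U[obs[:, b]]
            V[b] = np.linalg.solve(Uo.T @ Uo + reg, Uo.T @ resid[obs[:, b], b])
    return base + U @ V.T

X = np.array([[1.0, 2.0, np.nan],
              [2.0, np.nan, 6.0],
              [np.nan, 6.0, 9.0],
              [4.0, 8.0, 12.0]])
# Average the completed matrices over seeds 42..51, as described in the text.
completed = np.mean([bias_als(X, seed=s) for s in range(42, 52)], axis=0)
```

With a very large $\lambda$ the residual factors shrink to zero and the prediction reduces to the additive model-plus-benchmark offset.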

##### NMF\.

Non-negative matrix factorization (NMF) [lee1999nmf] first shifts each benchmark column, if needed, so that the observed entries are non-negative. Writing the shifted observed entries as $x'_{mb}\in\mathbb{R}_{\geq 0}$, it solves

$$(U,V)=\arg\min_{\substack{U\in\mathbb{R}_{+}^{M\times R}\\ V\in\mathbb{R}_{+}^{B\times R}}}\sum_{(m,b)\in\Omega}\bigl[x'_{mb}-(UV^{\top})_{mb}\bigr]^{2}+\lambda\left(\|U\|_{F}^{2}+\|V\|_{F}^{2}\right), \tag{15}$$

then subtracts the column shifts from $UV^{\top}$ to obtain predictions in the original transformed space.

##### PMF\.

Probabilistic matrix factorization (PMF) [salakhutdinov2008pmf] uses the same factor-matrix dimensions without the non-negativity constraint:

$$(U,V)=\arg\min_{\substack{U\in\mathbb{R}^{M\times R}\\ V\in\mathbb{R}^{B\times R}}}\sum_{(m,b)\in\Omega}\bigl[x_{mb}-(UV^{\top})_{mb}\bigr]^{2}+\lambda\left(\|U\|_{F}^{2}+\|V\|_{F}^{2}\right), \tag{16}$$

with prediction $\hat{x}_{mb}=(UV^{\top})_{mb}$.

##### Nuclear norm minimization\.

The nuclear-norm baseline [candes2009] solves the convex low-rank surrogate

$$Z^{\star}=\arg\min_{Z\in\mathbb{R}^{M\times B}}\frac{1}{2}\sum_{(m,b)\in\Omega}(Z_{mb}-x_{mb})^{2}+\lambda\|Z\|_{*}, \tag{17}$$

and predicts $\hat{x}_{mb}=Z^{\star}_{mb}$.

##### Neural baseline\.

Let $\tilde{x}_{m}\in\mathbb{R}^{B}$ be row $m$ with missing entries filled by zero, and let $o_{m}\in\{0,1\}^{B}$ be its binary observation mask. The MLP baseline trains a two-layer network $f_{\theta}$ with masked reconstruction loss, where $\odot$ denotes elementwise multiplication:

$$\min_{\theta}\sum_{m}\bigl\|o_{m}\odot(f_{\theta}(\tilde{x}_{m})-\tilde{x}_{m})\bigr\|_{2}^{2}, \tag{18}$$

and predicts $\hat{x}_{mb}=[f_{\theta}(\tilde{x}_{m})]_{b}$, averaged over three random seeds.

### C.2 From Candidate Methods to BenchPress

This appendix expands the head-to-head comparison of [Section 4.2](https://arxiv.org/html/2606.24020#S4.SS2) from a small selected set into the full transform $\times$ method grid, reported both as a heatmap and as a sortable table.

The full transform $\times$ method grid ([Figure 11](https://arxiv.org/html/2606.24020#A3.F11)) evaluates all 84 combinations on a common experiment setting; Table 11 reports the same numbers in tabular form, sorted by $\mathsf{MedAPE}$.

![Refer to caption](https://arxiv.org/html/2606.24020v1/bp_transform_method_grid_scores.pdf)

Figure 11: Full transform $\times$ method grid (7 transforms, 12 methods) across the Section 4 metrics: MedAPE, MedAE, and coverage. Each cell reports the best hyperparameter configuration for that pair, evaluated as the median over 10 seeds $\times$ 3 folds. All methods operate in standardized space. Green = better.

Table 11: Full transform $\times$ method grid: all 84 configurations from [Section 4.2](https://arxiv.org/html/2606.24020#S4.SS2), sorted by $\mathsf{MedAPE}$. Each row reports the best hyperparameter setting for that transform–method pair, evaluated as the median over 10 seeds $\times$ 3 folds in standardized space.

| # | Transform | Method | Hyperparameter | MedAPE (%) ↓ | MedAE ↓ | Cov. |
|---|---|---|---|---|---|---|
| 1 | Probit | ModelReg | $R^{2}_{\min}=0.2$, $k=7$ | 7.69 | 4.74 | 82% |
| 2 | Probit | BenchReg | $R^{2}_{\min}=0.2$, $k=7$ | 7.72 | 4.66 | 85% |
| 3 | Logit | ModelReg | $R^{2}_{\min}=0.3$, $k=5$ | 7.76 | 4.73 | 74% |
| 4 | Logit | BenchReg | $R^{2}_{\min}=0.1$, $k=7$ | 7.77 | 4.67 | 86% |
| 5 | Logit | Bias ALS | $\lambda=0.1$, $r=2$ | 7.77 | 4.63 | 100% |
| 6 | Probit | Bias ALS | $\lambda=0.1$, $r=2$ | 7.79 | 4.62 | 100% |
| 7 | Quantile | Bias ALS | $\lambda=0.1$, $r=2$ | 7.90 | 4.66 | 100% |
| 8 | Identity | ModelReg | $R^{2}_{\min}=0.2$, $k=7$ | 7.95 | 5.02 | 83% |
| 9 | Identity | BenchReg | $R^{2}_{\min}=0.1$, $k=7$ | 8.00 | 4.95 | 86% |
| 10 | Quantile | ModelReg | $R^{2}_{\min}=0.3$, $k=7$ | 8.06 | 4.87 | 80% |
| 11 | Quantile | BenchReg | $R^{2}_{\min}=0.3$, $k=7$ | 8.09 | 4.80 | 85% |
| 12 | Identity | Bias ALS | $\lambda=0.1$, $r=2$ | 8.17 | 5.15 | 100% |
| 13 | Square root | BenchReg | $R^{2}_{\min}=0.3$, $k=3$ | 8.18 | 4.98 | 68% |
| 14 | Arcsinh | BenchReg | $R^{2}_{\min}=0.3$, $k=5$ | 8.22 | 5.08 | 79% |
| 15 | Square root | ModelReg | $R^{2}_{\min}=0.2$, $k=5$ | 8.33 | 5.18 | 74% |
| 16 | Arcsinh | ModelReg | $R^{2}_{\min}=0.3$, $k=7$ | 8.39 | 5.17 | 82% |
| 17 | Quantile | Soft-Impute | $r=2$ | 8.42 | 5.04 | 100% |
| 18 | Log | BenchReg | $R^{2}_{\min}=0.3$, $k=7$ | 8.44 | 5.07 | 84% |
| 19 | Logit | Soft-Impute | $r=2$ | 8.49 | 5.10 | 100% |
| 20 | Probit | Soft-Impute | $r=2$ | 8.53 | 5.08 | 100% |
| 21 | Arcsinh | Bias ALS | $\lambda=0.1$, $r=2$ | 8.80 | 5.36 | 100% |
| 22 | Square root | Bias ALS | $\lambda=0.1$, $r=2$ | 8.81 | 5.31 | 100% |
| 23 | Identity | Soft-Impute | $r=2$ | 8.89 | 5.40 | 100% |
| 24 | Log | ModelReg | $R^{2}_{\min}=0.3$, $k=5$ | 9.01 | 5.49 | 73% |
| 25 | Logit | Model-KNN | $k=10$ | 9.06 | 5.39 | 100% |
| 26 | Probit | Model-KNN | $k=10$ | 9.17 | 5.44 | 100% |
| 27 | Identity | Model-KNN | $k=10$ | 9.24 | 5.60 | 100% |
| 28 | Quantile | Model-KNN | $k=10$ | 9.30 | 5.53 | 100% |
| 29 | Arcsinh | Model-KNN | $k=10$ | 9.49 | 5.81 | 100% |
| 30 | Square root | Model-KNN | $k=10$ | 9.51 | 5.77 | 100% |
| 31 | Quantile | NMF | $r=1$ | 9.51 | 5.93 | 100% |
| 32 | Quantile | MLP | lr $=0.01$ | 9.55 | 5.77 | 100% |
| 33 | Arcsinh | Soft-Impute | $r=2$ | 9.55 | 5.74 | 100% |
| 34 | Square root | Soft-Impute | $r=2$ | 9.57 | 5.74 | 100% |
| 35 | Logit | MLP | lr $=0.001$ | 9.59 | 5.71 | 100% |
| 36 | Log | Bias ALS | $\lambda=0.1$, $r=2$ | 9.62 | 5.61 | 100% |
| 37 | Probit | MLP | lr $=0.01$ | 9.64 | 5.75 | 100% |
| 38 | Identity | MLP | lr $=0.01$ | 9.84 | 6.24 | 100% |
| 39 | Logit | NMF | $r=1$ | 10.07 | 6.08 | 100% |
| 40 | Probit | NMF | $r=1$ | 10.17 | 6.31 | 100% |
| 41 | Log | Model-KNN | $k=10$ | 10.21 | 6.03 | 100% |
| 42 | Quantile | Bench-KNN | $k=10$ | 10.42 | 6.23 | 100% |
| 43 | Arcsinh | MLP | lr $=0.001$ | 10.44 | 6.44 | 100% |
| 44 | Square root | MLP | lr $=0.01$ | 10.45 | 6.42 | 100% |
| 45 | Log | Soft-Impute | $r=2$ | 10.53 | 6.10 | 100% |
| 46 | Probit | Nuclear | $\lambda=5.0$ | 10.87 | 6.82 | 100% |
| 47 | Identity | NMF | $r=2$ | 10.88 | 7.08 | 100% |
| 48 | Logit | Nuclear | $\lambda=5.0$ | 10.94 | 6.71 | 100% |
| 49 | Quantile | Nuclear | $\lambda=1.0$ | 11.07 | 6.87 | 100% |
| 50 | Logit | Bench-KNN | $k=10$ | 11.16 | 6.54 | 100% |
| 51 | Log | MLP | lr $=0.01$ | 11.26 | 6.77 | 100% |
| 52 | Probit | Bench-KNN | $k=10$ | 11.27 | 6.69 | 100% |
| 53 | Identity | Nuclear | $\lambda=5.0$ | 11.29 | 7.46 | 100% |
| 54 | Identity | Bench-KNN | $k=10$ | 11.74 | 7.20 | 100% |
| 55 | Arcsinh | NMF | $r=2$ | 11.76 | 7.47 | 100% |
| 56 | Square root | NMF | $r=2$ | 11.82 | 7.35 | 100% |
| 57 | Arcsinh | Nuclear | $\lambda=5.0$ | 11.99 | 7.69 | 100% |
| 58 | Logit | Model-Mean | - | 12.06 | 7.69 | 100% |
| 59 | Square root | Nuclear | $\lambda=5.0$ | 12.09 | 7.60 | 100% |
| 60 | Arcsinh | Bench-KNN | $k=10$ | 12.11 | 7.42 | 100% |
| 61 | Square root | Bench-KNN | $k=10$ | 12.12 | 7.32 | 100% |
| 62 | Probit | Model-Mean | - | 12.29 | 7.87 | 100% |
| 63 | Quantile | Model-Mean | - | 12.42 | 7.80 | 100% |
| 64 | Quantile | PMF | $r=5$ | 12.43 | 7.77 | 100% |
| 65 | Logit | PMF | $r=5$ | 12.61 | 7.90 | 100% |
| 66 | Log | NMF | $r=2$ | 12.70 | 7.68 | 100% |
| 67 | Probit | PMF | $r=5$ | 12.77 | 8.05 | 100% |
| 68 | Log | Bench-KNN | $k=10$ | 12.81 | 7.53 | 100% |
| 69 | Identity | Model-Mean | - | 12.94 | 8.66 | 100% |
| 70 | Log | Nuclear | $\lambda=5.0$ | 13.19 | 7.93 | 100% |
| 71 | Identity | PMF | $r=5$ | 13.23 | 8.97 | 100% |
| 72 | Square root | Model-Mean | - | 13.52 | 8.76 | 100% |
| 73 | Arcsinh | Model-Mean | - | 13.53 | 8.92 | 100% |
| 74 | Arcsinh | PMF | $r=5$ | 14.25 | 9.33 | 100% |
| 75 | Square root | PMF | $r=5$ | 14.31 | 9.17 | 100% |
| 76 | Log | Model-Mean | - | 14.57 | 8.91 | 100% |
| 77 | Quantile | Bench-Mean | - | 15.26 | 9.65 | 100% |
| 78 | Logit | Bench-Mean | - | 15.56 | 10.04 | 100% |
| 79 | Log | PMF | $r=5$ | 15.63 | 9.57 | 100% |
| 80 | Probit | Bench-Mean | - | 15.70 | 10.21 | 100% |
| 81 | Identity | Bench-Mean | - | 16.29 | 11.04 | 100% |
| 82 | Arcsinh | Bench-Mean | - | 17.06 | 11.40 | 100% |
| 83 | Square root | Bench-Mean | - | 17.28 | 11.18 | 100% |
| 84 | Log | Bench-Mean | - | 18.72 | 11.56 | 100% |

### C.3 BenchPress vs. LLMs as Benchmark Score Predictors

The LLM diagnostic in [Section 4.3](https://arxiv.org/html/2606.24020#S4.SS3) uses no separate system prompt: the API call passes `system_message=None`, and all task instructions are contained in the user message. The user prompt is generated once per batch of target cells. In the named condition, model and benchmark fields use the real names and the benchmark line also includes the benchmark scale. In the blind condition, those fields are replaced by local labels such as `Target model q0`, `Benchmark A`, and `Peer model q0-1`; scores and the five-shot peer-example structure are unchanged. Here `query_id` is the local identifier for a target cell within the batch, such as `q0`; it is used only to match the returned JSON value to the corresponding query.

Five-shot LLM user prompt template:

> You are estimating benchmark results before running expensive evaluations. Each query gives compact known scores for a target model and five nearest peer-model examples. Make a quick numerical estimate from the nearest peers; do not explain or show calculations. Return ONLY valid JSON mapping each query id to a numeric score, e.g. {"q0": 72.5}.
>
> Query {query id, e.g. q0}
>
> Target model: {target model name or blind target label}
>
> Target known scores: [{benchmark label: score}, ...]
>
> Estimate the target model's score on: {target benchmark name and scale, or blind benchmark label}
>
> Nearest peer examples:
>
> Example 1: model={peer model name or blind peer label}; shared_scores=[{benchmark label: score}, ...]; {target benchmark label} score: {peer target score}
>
> ...
>
> Example 5: model={peer model name or blind peer label}; shared_scores=[{benchmark label: score}, ...]; {target benchmark label} score: {peer target score}

## Appendix D Supplemental to [Section 5](https://arxiv.org/html/2606.24020#S5): What BenchPress Enables for Model Evaluation

This section provides additional details for the model-evaluation analyses in [Section 5](https://arxiv.org/html/2606.24020#S5). [Section D.1](https://arxiv.org/html/2606.24020#A4.SS1) supplements [Section 5.1](https://arxiv.org/html/2606.24020#S5.SS1) with a pairwise-redundancy diagnostic and more exhaustive probe-selection checks. [Section D.2](https://arxiv.org/html/2606.24020#A4.SS2) reports an auxiliary shortlist-recovery metric for [Section 5.2](https://arxiv.org/html/2606.24020#S5.SS2). [Section D.3](https://arxiv.org/html/2606.24020#A4.SS3) gives the full per-target table for [Section 5.3](https://arxiv.org/html/2606.24020#S5.SS3).

### D.1 Budgeted Scorecard Recovery

This appendix supplements [Section 5.1](https://arxiv.org/html/2606.24020#S5.SS1) in two ways. First, a pairwise-redundancy diagnostic explains why a small probe set can recover most of the matrix: the typical benchmark already has another benchmark column that predicts it nearly perfectly. Second, a more exhaustive probe-selection analysis asks whether greedy's computational simplicity comes at a material accuracy cost, comparing it with exact enumeration over the low-cost allowlist and exact search after pruning in the unrestricted setting.

##### Widespread redundancy across benchmarks\.

Before choosing a probe set, we first ask whether many benchmark columns are redundant enough that a small probe set could plausibly recover the rest. For every ordered pair of benchmarks $(b,b')$, we collect the scores $s_{mb}$ and $s_{mb'}$ of all models $m$ evaluated on both ($n\geq 5$ shared models required), apply a logit transform followed by per-column $z$-scoring, fit a univariate linear regression in this transformed space, and invert the transform to obtain predicted raw scores $\hat{s}_{mb}$. For each target benchmark $b$, we identify its *best predictor benchmark*, the predictor $b'$ that maximizes the absolute Pearson correlation, and report the signed correlation, $\mathsf{MedAE}$, and $\mathsf{MedAPE}$. Of the 133 benchmarks, 132 have at least one neighbor pair with $\geq 5$ shared models; 1 is excluded for insufficient overlap. Among these 132, 127 have a best-neighbor absolute correlation $\geq 0.85$ (129 reach $\geq 0.80$), and the median best-neighbor absolute correlation is $0.97$. [Table 12](https://arxiv.org/html/2606.24020#A4.T12) lists the five most and five least predictable benchmarks. The most predictable pair, the LongFact Concepts and LongFact Objects hallucination-rate benchmarks, achieves a correlation of $0.997$. At the other extreme, Safety (OLMES suite) has the weakest best-neighbor correlation ($0.62$), followed by MRCR v1 ($0.68$).
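The per-pair diagnostic described above can be sketched as follows. Several details are assumptions made for this example rather than facts from the paper: scores are taken to lie on a 0–100 scale, the logit is clipped with a small `eps`, MedAPE is computed relative to the true score, and the helper names are hypothetical.

```python
import numpy as np

def logit(s, eps=1e-3):
    p = np.clip(s / 100.0, eps, 1 - eps)   # assumed 0-100 scale with clipping
    return np.log(p / (1 - p))

def inv_logit(z):
    return 100.0 / (1.0 + np.exp(-z))

def pair_diagnostic(s_b, s_u, min_shared=5):
    """Predict benchmark b from benchmark u via regression in logit + z-score space."""
    shared = ~np.isnan(s_b) & ~np.isnan(s_u)
    if shared.sum() < min_shared:
        return None                         # insufficient overlap
    yb, yu = logit(s_b[shared]), logit(s_u[shared])
    mu_b, sd_b, mu_u, sd_u = yb.mean(), yb.std(), yu.mean(), yu.std()
    if sd_b == 0 or sd_u == 0:
        return None
    zb, zu = (yb - mu_b) / sd_b, (yu - mu_u) / sd_u
    r = float(np.mean(zb * zu))             # Pearson r; also the OLS slope in z-space
    s_hat = inv_logit(r * zu * sd_b + mu_b) # undo z-scoring, then undo the logit
    err = np.abs(s_hat - s_b[shared])
    medae = float(np.median(err))
    medape = float(np.median(err / np.abs(s_b[shared])) * 100.0)
    return r, medae, medape

su = np.array([20.0, 35.0, 50.0, 65.0, 80.0])
sb = inv_logit(2.0 * logit(su) + 0.5)       # exactly linear in logit space
r, medae, medape = pair_diagnostic(np.append(sb, 50.0), np.append(su, np.nan))
```

After z-scoring both columns, the least-squares slope equals the Pearson correlation and the intercept is zero, which is why a single number `r` defines the fitted line.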

Table 12: Five most and least predictable benchmarks identified by pairwise linear regression in logit + $z$-score space. Rows are selected by absolute Pearson correlation between the target and its best predictor benchmark; the Corr. column reports the signed value.

Caveat: high cross-category correlation does not imply semantic similarity. Some cross-category pairs appear surprisingly predictable; for example, GDPval (Artificial Analysis ELO) and WideSearch have correlation $0.99$. This does not mean GDP-style task performance predicts search-agent performance. The regression is fit on only 5 models that have scores on both benchmarks, all of which are frontier models whose scores are dominated by a single general-capability axis: whichever model is strongest overall tends to score highest on both. With so few data points and so little capability diversity, nearly any two benchmarks will correlate. These inflated cross-category correlation values reflect sample composition, not a meaningful relationship between the benchmarks.

Finding 9: Most benchmark scores are inferable from one well-chosen peer, reflecting shared variation across the matrix.

##### Pruned exhaustive probe selection\.

The budgeted scorecard recovery experiment in [Section 5.1](https://arxiv.org/html/2606.24020#S5.SS1) selects probe sets greedily. Greedy selection is simple and fast, but it does not certify that the selected five probes are the best possible subset. We therefore run two exhaustive checks, one where exact enumeration is feasible and one where the unrestricted search space first has to be pruned.

- Cost-aware exact exhaustive search. The low-cost allowlist has only 25 candidate probes, so exact enumeration is feasible without pruning. We search all $\binom{25}{5}=53{,}130$ five-probe subsets to test whether the cost-aware greedy prefix misses a better cheap combination.
- Cost-unaware pruned exhaustive search. The unrestricted setting has 133 candidate probes, making exact search over all $\binom{133}{5}$ five-probe subsets too large. We therefore build a top-30 candidate pool from the full ten-step MedAE greedy trajectory: at each step, every remaining candidate is ranked by the pooled MedAE it would achieve if added next, and ranks are averaged across steps. This pruning keeps the full MedAE greedy top-10 prefix, reduces the exact search to $\binom{30}{5}=142{,}506$ subsets, and lets us test whether the greedy five-probe solution is preserved when the final subset is selected exactly.
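The subset counts quoted in the two checks above are easy to verify, and the contrast between greedy and exhaustive selection can be illustrated on a toy objective. The `score` function below is a hypothetical stand-in for the pooled MedAE of a probe set; it is not the paper's objective.

```python
from itertools import combinations
from math import comb

n_allowlist = comb(25, 5)      # exact search over the low-cost allowlist
n_pruned = comb(30, 5)         # exact search over the pruned top-30 pool
n_unrestricted = comb(133, 5)  # full unrestricted search, hundreds of millions of subsets

def exhaustive_subset(candidates, size, score):
    """Evaluate every subset of the given size and keep the lowest score."""
    return min(combinations(candidates, size), key=score)

def greedy_subset(candidates, size, score):
    """Add one candidate at a time, each time the one that lowers the score most."""
    chosen = []
    for _ in range(size):
        rest = [c for c in candidates if c not in chosen]
        chosen.append(min(rest, key=lambda c: score(tuple(chosen) + (c,))))
    return tuple(chosen)

# Toy objective (hypothetical): distance of the subset sum from a target value.
score = lambda subset: abs(sum(subset) - 11)
candidates = [1, 4, 6, 7, 9]
best = exhaustive_subset(candidates, 2, score)
greedy = greedy_subset(candidates, 2, score)
```

On this toy problem greedy commits to 9 first and ends with a score of 1, while exhaustive search finds (4, 7) with a score of 0, which is the kind of gap the two exhaustive checks are designed to detect.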

The results are reported in [Table 13](https://arxiv.org/html/2606.24020#A4.T13). In the unrestricted setting, pruned exhaustive search returns the same five probes as greedy. In the low-cost setting, exact exhaustive search improves slightly over greedy, but the gain is small; the greedy probe sets are already close to optimal under both policies.

Table 13: Five-probe MedAE selections. Cost-unaware greedy already matches the best five-probe set found by exact search over the pruned top-30 universe. In the low-cost allowlist, exhaustive search slightly improves over cost-aware greedy.

### D\.2Preserving Model Rankings

##### Auxiliary metric: shortlist recovery\.

[Section 5.2](https://arxiv.org/html/2606.24020#S5.SS2) reports same-benchmark pairwise ranking accuracy as the main ranking metric. As an auxiliary view, we also measure shortlist recovery. For each benchmark and held-out fold, we complete the full observed leaderboard by keeping seen scores fixed and replacing held-out cells with BenchPress predictions. We then compare the completed top fraction with the true top fraction on that same observed leaderboard. Because the predicted and true shortlists have the same size, the overlap rate is the fraction of true top models recovered by the predicted shortlist. [Table 14](https://arxiv.org/html/2606.24020#A4.T14) reports two summaries of this overlap: *recovery* computes top-fraction recovery separately for each benchmark and then reports the median across benchmarks, while *shortlist slots* counts selection positions rather than unique models, so a benchmark-fold group contributing four top-$20\%$ positions contributes four slots.
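For a single benchmark and fold, the overlap computation can be sketched as below. The rounding rule for the shortlist size and the tie-breaking by sort order are assumptions; the function name `shortlist_recovery` is hypothetical.

```python
import numpy as np

def shortlist_recovery(true_scores, completed_scores, top_frac=0.2):
    """Fraction of the true top-k models that also appear in the completed top-k."""
    true_scores = np.asarray(true_scores, dtype=float)
    completed_scores = np.asarray(completed_scores, dtype=float)
    n = len(true_scores)
    k = max(1, int(round(top_frac * n)))             # assumed rounding rule
    true_top = set(np.argsort(-true_scores)[:k])     # indices of the true shortlist
    pred_top = set(np.argsort(-completed_scores)[:k])
    return len(true_top & pred_top) / k

true = [90, 80, 70, 60, 50, 40, 30, 20, 10, 5]
# Completed leaderboard: models 1 and 9 were held out and replaced by predictions.
completed = [90, 65, 70, 60, 50, 40, 30, 20, 10, 85]
overlap = shortlist_recovery(true, completed, top_frac=0.2)
```

Here the true top two are models 0 and 1, the completed top two are models 0 and 9, so one of two shortlist slots is recovered.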

Table 14: Auxiliary shortlist recovery. Recovery is median benchmark-level overlap; slots count top-fraction positions.

##### Probe selection for ranking preservation.

We also run a probe-selection diagnostic that optimizes the ranking metric directly. This greedy search evaluates the same set of observed model–benchmark cells as [Section 5.1](https://arxiv.org/html/2606.24020#S5.SS1), but scores each candidate prefix by same-benchmark pairwise ranking accuracy with a five-point score margin. Probe cells are revealed exactly and remain in the denominator, so the question is which known benchmark scores most improve the completed leaderboard. [Table 15](https://arxiv.org/html/2606.24020#A4.T15) reports two top-10 prefixes selected by this ranking-aware objective: a cost-unaware search that may choose any benchmark, and a cost-aware search restricted by the low-cost allowlist. The cost-aware constraint costs only a small amount of ranking accuracy at ten probes ($86.2\%$ versus $88.9\%$) while selecting a more practical benchmark set.

Table 15: Top-10 probe sets selected for ranking preservation. Both greedy searches optimize same-benchmark pairwise ranking accuracy at a five-point score margin. The cost-unaware search may choose any benchmark; the cost-aware search is restricted by the low-cost allowlist. At each step, the selected probe prefix is evaluated on the same fixed universe of observed cells; revealed probe cells are exact and unrevealed observed cells are predicted by BenchPress.

### D.3 Predicting Newly Released Models

The main text summarizes the temporal deployment stress test as a distribution across target models. [Table 16](https://arxiv.org/html/2606.24020#A4.T16) lists the per-target results. Each target model is selected by the same pre-specified temporal-window rule used in [Section 5.3](https://arxiv.org/html/2606.24020#S5.SS3): we use models from the post-DeepSeek-R1 reasoning era through GPT-5.1 and keep models with more than 20 observed scores in the final matrix.

Table 16: Full temporal deployment results. For each target model, Obs. is the number of observed benchmark scores in the final matrix and Train is the number of older models available before the target's release date. Each $k$ column reveals that many seed scores from the target model and predicts the rest; numbers are medians across 10 random seeds.

## Appendix E Supplemental to [Section 6](https://arxiv.org/html/2606.24020#S6): When to Trust BenchPress's Predictions

This section provides additional details for the trust analyses in [Section 6](https://arxiv.org/html/2606.24020#S6). [Section E.1](https://arxiv.org/html/2606.24020#A5.SS1) supplements [Section 6.1](https://arxiv.org/html/2606.24020#S6.SS1) with expanded prediction-error diagnostics. [Section E.2](https://arxiv.org/html/2606.24020#A5.SS2) spells out the reliability estimators used in [Section 6.2](https://arxiv.org/html/2606.24020#S6.SS2).

### E.1 What Affects Prediction Reliability

This subsection provides extended experimental analysis that complements the prediction-error analysis in [Section 6.1](https://arxiv.org/html/2606.24020#S6.SS1). [Section E.1.1](https://arxiv.org/html/2606.24020#A5.SS1.SSS1) extends the benchmark analysis of [Section 6.1](https://arxiv.org/html/2606.24020#S6.SS1.SSS0.Px2) with the full benchmark-side hypothesis grid. [Section E.1.2](https://arxiv.org/html/2606.24020#A5.SS1.SSS2) extends the model analysis of [Section 6.1](https://arxiv.org/html/2606.24020#S6.SS1.SSS0.Px3) with the full model-side hypothesis grid.

##### Spearman rank correlation tests\.

[Section 6.1](https://arxiv.org/html/2606.24020#S6.SS1) uses two hypothesis-test families. For observational hypotheses, each target contributes a single pair $(x_{i},y_{i})$: a feature $x_{i}$ that we measure but cannot intervene on (e.g. a benchmark's inherent rank-2 reconstruction quality, or a model's median observed score), and its BenchPress prediction error $y_{i}$. We use the Spearman rank correlation test to ask whether targets with a higher feature value tend to have higher or lower prediction error. We measure this association via the *rank* of each value, its position in the sorted ordering of its column, where the smallest value has rank $1$ and the largest has rank $n$ (this sense of "rank" is unrelated to the matrix-rank quantity used in [Section 3.3](https://arxiv.org/html/2606.24020#S3.SS3)). Using ranks rather than raw values makes the test sensitive to monotonic relationships, not only linear ones, and more robust to heavy-tailed errors and outliers. Concretely, we replace each column by its within-column ranks and compute the Pearson correlation between the two ranked columns; the result $\rho\in[-1,1]$ is the Spearman correlation. Intuitively, $\rho>0$ means targets with a higher feature value tend to also have higher prediction error, $\rho<0$ means the opposite, and $|\rho|$ measures how consistently the ranking holds. The $p$-value asks: if there were truly no monotonic association ($\rho_{\text{true}}=0$), how often would we see a sample correlation at least as extreme as $\rho$? It is computed from the $t$-approximation $t=\rho\sqrt{(n-2)/(1-\rho^{2})}$, which under $H_{0}$ approximately follows a Student-$t$ distribution with $n-2$ degrees of freedom [hollander2014nonparametric]. The approximation is reliable when $n$ is at least a few dozen (all our targets satisfy $n\geq 25$).
Since we test for any deviation from $\rho=0$, the $p$-value doubles the upper-tail probability, $p=2\bigl(1-F_{t_{n-2}}(|t|)\bigr)$, where $F_{t_{n-2}}$ is the CDF of the Student-$t_{n-2}$ distribution. The intuition is: $|t|$ increases with both $|\rho|$ and the sample size $n$, so the $p$-value gets smaller only when the correlation is meaningful and backed by enough targets.
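The rank-then-correlate recipe and the $t$-statistic can be sketched with the standard library alone. Sharing the average rank among tied values is an assumption (the text does not say how ties are handled), and the helper names are hypothetical. Converting $t$ to a $p$-value additionally requires a Student-$t$ CDF from a statistics library, which is left out to keep the sketch dependency-free.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def spearman_t(x, y):
    """Spearman rho and the statistic t = rho * sqrt((n - 2) / (1 - rho^2))."""
    rho = pearson(average_ranks(x), average_ranks(y))
    n = len(x)
    if abs(rho) >= 1.0:
        return rho, math.copysign(math.inf, rho)   # perfect monotone association
    return rho, rho * math.sqrt((n - 2) / (1 - rho ** 2))

# Toy data only; the text notes the t-approximation needs n of a few dozen.
rho, t = spearman_t([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```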

##### Paired Wilcoxon signed\-rank tests\.

For intervention-style hypotheses, each target contributes a pair of errors $(y^{\text{baseline}}_{i},y^{\text{intervention}}_{i})$, measured on the same target under two different settings (e.g. BenchPress trained on the original matrix vs. on a matrix where every benchmark highly correlated with the target has been masked out). We use the paired Wilcoxon signed-rank test to ask whether the intervention shifts each target's error in a consistent direction. Because the comparison is within-target, each target serves as its own control, removing the effect of inherent target difficulty. We form per-target differences $\Delta_{i}=y^{\text{intervention}}_{i}-y^{\text{baseline}}_{i}$ and ask whether their median is zero (i.e. the intervention has no typical effect on prediction error). The Wilcoxon signed-rank test ranks $|\Delta_{i}|$ from smallest to largest, denotes the rank of pair $i$ by $R_{i}$, and uses as its statistic the sum of ranks for positive differences, $W^{+}=\sum_{i:\,\Delta_{i}>0}R_{i}$ (in the rare case of any $\Delta_{i}=0$, that pair is dropped before ranking). Under $H_{0}$ (the distribution of $\Delta$ is symmetric about $0$), $W^{+}$ has mean $n(n+1)/4$ and variance $n(n+1)(2n+1)/24$, and the standardized statistic $z=(W^{+}-\mathbb{E}[W^{+}])/\sqrt{\mathrm{Var}(W^{+})}$ is approximately standard normal for $n\gtrsim 25$ [hollander2014nonparametric]. Since we test for any deviation from $\operatorname{median}\Delta=0$, the $p$-value doubles the upper-tail probability, $p=2\bigl(1-\Phi(|z|)\bigr)$, where $\Phi$ is the standard normal CDF.
The same intuition applies: $|z|$ grows with both the magnitude of the per-target shift and the sample size $n$, and only the combination of a substantial intervention effect and enough paired targets drives $|z|$ large enough, and the $p$-value small enough, to rule out chance. We use Wilcoxon rather than a paired $t$-test because $\Delta$ is heavy-tailed and not approximately Gaussian across targets.
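The signed-rank statistic and its normal approximation follow directly from the formulas above and need only the standard library. This sketch applies no tie or continuity correction, matching the stated mean and variance; averaging ranks among tied $|\Delta_{i}|$ is an assumption, and the function name is hypothetical.

```python
import math

def wilcoxon_signed_rank(baseline, intervention):
    """Return (W+, z, two-sided p) using the normal approximation from the text."""
    # Per-target differences; zero differences are dropped before ranking.
    diffs = [yi - yb for yb, yi in zip(baseline, intervention) if yi != yb]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                                    # average ranks for tied |diffs|
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24
    z = (w_plus - mean) / math.sqrt(var)
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2)))   # standard normal CDF
    return w_plus, z, 2.0 * (1.0 - phi)

# Toy data only; the normal approximation is intended for roughly n >= 25.
baseline = [0.0] * 10
intervention = [1, 2, 3, 4, 5, 6, 7, 8, -9, -10]
w_plus, z, p = wilcoxon_signed_rank(baseline, intervention)
```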

#### E\.1\.1Benchmark analysis

This appendix reports two extensions of the benchmark\-side analysis in[Section˜6\.1](https://arxiv.org/html/2606.24020#S6.SS1.SSS0.Px2): the full7×27\\times 2hypothesis×\\timesmetric grid, and a per\-benchmark predictability ranking that names which benchmarks are easiest and hardest forBenchPressto predict\.

##### Full hypothesis $\times$ metric grid.

[Figure 8](https://arxiv.org/html/2606.24020#S6.F8) in the main text visualizes the benchmark-side patterns that pass the joint-support criterion (H3, H4, and H5). For completeness, [Figure 12](https://arxiv.org/html/2606.24020#A5.F12) expands this to the full $7\times 2$ grid: every active benchmark-side hypothesis (H1–H7) against both score-error metrics, using the same correlational and ablation setups as [Table 7](https://arxiv.org/html/2606.24020#S6.T7).

![Refer to caption](https://arxiv.org/html/2606.24020v1/bp_predictability_factors_full.pdf)

Figure 12: All seven active benchmark-level hypotheses against both score-error metrics. The left block shows H1–H3 (correlational hypotheses) and the right block shows H4–H7 (ablations). Columns within each block are $\mathsf{MedAPE}$ ($\downarrow$) and $\mathsf{MedAE}$ ($\downarrow$). Correlational rows show scatter + binned trend; ablation rows show line plots across drop fractions (H4, H6) or paired bars at the headline intervention (H5 with $|r| \geq 0.85$ peers; H7 with same-category peers).

##### Per-benchmark predictability.

We apply a direct cell-holdout test for each benchmark column. For each model we randomly hide half of its observed scores and predict them via the BenchPress predictor from [Section 4.2](https://arxiv.org/html/2606.24020#S4.SS2); we then aggregate the test-cell errors by benchmark column. This is repeated over 10 random seeds for stability.

[Figure 13](https://arxiv.org/html/2606.24020#A5.F13) shows the per-benchmark $\mathsf{MedAPE}$ for the evaluated benchmark columns. Roughly 71% (35/49) of benchmarks fall below the 15% $\mathsf{MedAPE}$ threshold, indicating they are inferable with limited additional information loss.
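
The half-per-model holdout and per-column aggregation can be sketched as follows. This is an illustrative sketch under stated assumptions, not the released BenchPress code: `predict` is a hypothetical callable standing in for the BenchPress predictor, $\mathsf{MedAPE}$ is taken to be the median of $100\cdot|\hat{s}-s|/|s|$, "half" is rounded down for models with an odd number of observed scores, and cells whose true score is zero are skipped.

```python
import random
from statistics import median

def half_holdout_mask(scores, seed):
    """Pick half of each model's observed cells to hide.

    scores: dict model -> dict benchmark -> observed score.
    Returns a set of (model, benchmark) cells held out for testing.
    """
    rng = random.Random(seed)
    hidden = set()
    for model, row in scores.items():
        benchmarks = sorted(row)  # sorted for reproducibility
        # Assumption: round down when the count is odd.
        for benchmark in rng.sample(benchmarks, len(benchmarks) // 2):
            hidden.add((model, benchmark))
    return hidden

def medape_by_benchmark(scores, predict, seeds=range(10)):
    """Aggregate held-out absolute percentage errors by benchmark column."""
    errors = {}  # benchmark -> list of absolute percentage errors
    for seed in seeds:
        hidden = half_holdout_mask(scores, seed)
        # The training matrix contains only the cells that stay visible.
        train = {
            m: {b: s for b, s in row.items() if (m, b) not in hidden}
            for m, row in scores.items()
        }
        for model, benchmark in sorted(hidden):
            truth = scores[model][benchmark]
            if truth == 0:
                continue  # percentage error is undefined at zero
            pred = predict(train, model, benchmark)  # hypothetical predictor
            ape = 100.0 * abs(pred - truth) / abs(truth)
            errors.setdefault(benchmark, []).append(ape)
    return {b: median(e) for b, e in errors.items()}
```

Aggregating the same held-out errors by `model` instead of `benchmark` would give the per-model view.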

![Refer to caption](https://arxiv.org/html/2606.24020v1/bp_benchmark_predictability.pdf)

Figure 13: Per-benchmark predictability. For each model, half of observed scores are held out and predicted via BenchPress; errors are aggregated by benchmark column (10 seeds). Benchmarks below the 15% $\mathsf{MedAPE}$ threshold (dashed red line) are well-predicted by others. Color = benchmark category.

#### E.1.2 Model analysis

This appendix mirrors the benchmark-side extensions for the model-side analysis in [Section 6.1](https://arxiv.org/html/2606.24020#S6.SS1.SSS0.Px3): the full $9\times 2$ hypothesis $\times$ metric grid, and a per-model predictability ranking that names which models are easiest and hardest for BenchPress to predict.

##### Full hypothesis $\times$ metric grid.

[Figure 9](https://arxiv.org/html/2606.24020#S6.F9) in the main text visualizes five representative model-level hypotheses (H2, H3, H5, H8, H9) under a single metric per panel. For completeness, [Figure 14](https://arxiv.org/html/2606.24020#A5.F14) expands this to the paper-facing setting for each of the nine hypotheses (H1–H9) against both score-error metrics, using the same correlational, ablation, and temporal setups as [Table 8](https://arxiv.org/html/2606.24020#S6.T8).

![Refer to caption](https://arxiv.org/html/2606.24020v1/bp_model_predictability_factors_full.pdf)

Figure 14: Selected settings for all nine model-level hypotheses against both score-error metrics. The left block shows H1–H5 and the right block shows H6–H9. Columns within each block are $\mathsf{MedAPE}$ ($\downarrow$) and $\mathsf{MedAE}$ ($\downarrow$). H1–H4 are correlational rows (H2 grouped bars), H5–H8 are ablations, and H9 is temporal. Ablation rows show paired bars at the headline intervention (H5: $|r| \geq 0.95$ peers; H6: 75% strongest-peer overlap mask; H7: same-provider evidence) or a line across hide fractions (H8). H9 compares oldest vs. middle training anchors for the displayed revealed-benchmark counts $k \in \{1,3,5,10\}$; secondary H9 settings with $k=8$ and $k=15$ are not plotted in this figure.

##### Per-model predictability.

Mirroring the per-benchmark probe in [Section E.1.1](https://arxiv.org/html/2606.24020#A5.SS1.SSS1.Px2), we apply the same half-per-model holdout but aggregate the test-cell errors by *model row* instead of benchmark column. For each model we randomly hide half of its observed scores and predict them via the BenchPress predictor from [Section 4.2](https://arxiv.org/html/2606.24020#S4.SS2), repeating over 10 random seeds for stability.

[Figure 15](https://arxiv.org/html/2606.24020#A5.F15) shows the per-model $\mathsf{MedAPE}$ for the 84 evaluated models. Roughly 88% (74/84) fall below the 15% $\mathsf{MedAPE}$ threshold and the median per-model $\mathsf{MedAPE}$ is $7.6\%$, indicating that most models are inferable from the rest of the matrix at limited additional information loss.

![Refer to caption](https://arxiv.org/html/2606.24020v1/bp_model_predictability.pdf)

Figure 15: Per-model predictability. For each model, half of observed scores are held out and predicted via BenchPress; errors are aggregated by model row (10 seeds). Models below the 15% $\mathsf{MedAPE}$ threshold (dashed red line) are well-predicted by others. Color = provider.

### E.2 Estimating Prediction Reliability

[Section 6.2](https://arxiv.org/html/2606.24020#S6.SS2) adds a reliability estimator to the default BenchPress score predictor from [Section 4.2](https://arxiv.org/html/2606.24020#S4.SS2). This appendix spells out the three reliability estimators used there. All three models solve the same task: for a hidden model–benchmark cell, predict how large the absolute error of the fixed point prediction is likely to be. During training, the target is the held-out absolute error after a $\log(1+x)$ transform. During evaluation, the reliability estimator may use the training matrix, the fixed Logit Bias ALS prediction, and auxiliary predictions computed from the training fold, but it never sees the hidden score itself.

##### Ensemble\-spread reliability estimator\.

The ensemble-spread model asks whether plausible score predictors agree on the same hidden cell. It builds two stacks of alternative point predictions. The first stack measures local sensitivity of the selected score predictor: the three Logit Bias ALS configurations in the [Section 4.2](https://arxiv.org/html/2606.24020#S4.SS2) grid with rank 2 and $\lambda \in \{0.01, 0.1, 1.0\}$. The $\lambda = 0.1$ member is the selected BenchPress score predictor, and the other two show how much the prediction moves under the adjacent regularization strengths in the grid. The second stack measures disagreement with other strong full-coverage predictors. We sort transform–method configurations by median percentage error in the full-grid table (tab:full_grid), require coverage of at least 99.9%, remove the selected Logit Bias ALS predictor, and keep the first 12 remaining configurations. In the checked-in run, these are Probit Bias ALS, Quantile Bias ALS, Identity Bias ALS, Quantile Soft-Impute, Logit Soft-Impute, Probit Soft-Impute, Arcsinh Bias ALS, Square root Bias ALS, Identity Soft-Impute, Logit Model-KNN, Probit Model-KNN, and Identity Model-KNN. For each prediction stack, we record four spread summaries: standard deviation, median absolute deviation, central 80% span, and the distance between the selected Logit Bias ALS prediction and the stack median. These eight nonnegative features are transformed with $\log(1+x)$ before split-local standardization.
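
The four spread summaries per stack can be sketched as below. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, the standard deviation is the population form, and the "central 80% span" is taken to be the 90th minus the 10th percentile with linear interpolation. The source does not specify any of these conventions.

```python
import math
from statistics import median, pstdev

def percentile(values, q):
    """Percentile with linear interpolation between order statistics."""
    xs = sorted(values)
    position = (len(xs) - 1) * q / 100.0
    lower, upper = math.floor(position), math.ceil(position)
    return xs[lower] + (xs[upper] - xs[lower]) * (position - lower)

def spread_features(stack, selected):
    """Four log(1+x)-transformed spread summaries for one prediction stack.

    stack: point predictions for a single hidden cell from several predictors.
    selected: the selected (Logit Bias ALS) prediction for the same cell.
    """
    center = median(stack)
    std = pstdev(stack)                                 # assumed population std
    mad = median([abs(x - center) for x in stack])      # median absolute deviation
    span80 = percentile(stack, 90) - percentile(stack, 10)  # assumed 10th-90th span
    distance = abs(selected - center)                   # selected vs. stack median
    return [math.log1p(f) for f in (std, mad, span80, distance)]

def ensemble_spread_features(sensitivity_stack, alternative_stack, selected):
    """Concatenate both stacks' summaries into the eight-feature vector."""
    return (spread_features(sensitivity_stack, selected)
            + spread_features(alternative_stack, selected))
```

For example, `ensemble_spread_features([61.0, 62.5, 63.0], [58.0, 60.0, 64.0, 66.0], 62.5)` returns eight nonnegative values, four per stack; the numbers here are made up for illustration.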

##### Matrix\-support reliability estimator\.

The matrix-support model ignores alternative predictors and uses only evidence available in the observed score matrix. For the target model, it records the number of observed benchmark scores and the median observed score. For the target benchmark, it records the number of observed model scores, the median observed score, and the standard deviation of observed scores. It also records the strongest peer model for the target model and the strongest neighboring benchmark for the target benchmark, where “strongest” means highest absolute Pearson correlation over shared observed scores in the training matrix. The peer-model features are its absolute correlation with the target model and the number of shared observed benchmarks. The benchmark-neighbor feature is its absolute correlation with the target benchmark. We do not include benchmark-neighbor overlap because the stricter H7 ablation in [Section 6.1](https://arxiv.org/html/2606.24020#S6.SS1.SSS0.Px2) does not support it as a joint benchmark-side reliability factor.
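
The strongest-peer lookup described above can be sketched as follows. This is a minimal illustration, not the released code: `strongest_peer` and its `min_shared` threshold are hypothetical (the source does not state a minimum overlap), and peers with a constant score vector on the shared benchmarks are skipped because the Pearson correlation is undefined there.

```python
import math

def pearson(xs, ys):
    """Pearson correlation, or None when either vector is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

def strongest_peer(scores, target, min_shared=3):
    """Find the peer model with the highest |Pearson r| over shared benchmarks.

    scores: dict model -> dict benchmark -> observed training score.
    Returns (peer, abs_correlation, n_shared) or None if no peer qualifies.
    min_shared is an assumed threshold, not a value from the paper.
    """
    best = None
    target_row = scores[target]
    for peer, row in scores.items():
        if peer == target:
            continue
        shared = sorted(set(target_row) & set(row))
        if len(shared) < min_shared:
            continue
        r = pearson([target_row[b] for b in shared], [row[b] for b in shared])
        if r is None:
            continue
        if best is None or abs(r) > best[1]:
            best = (peer, abs(r), len(shared))
    return best
```

The same routine applied to the transposed matrix (benchmark -> model -> score) would give the strongest neighboring benchmark, of which only the absolute correlation is used as a feature.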

##### Hybrid reliability estimator and calibration\.

The hybrid reliability estimator concatenates the ensemble-spread features and the matrix-support features, then uses the same risk-model selection procedure as the two single-signal models. Before fitting any of the three learned reliability estimators, each feature column is clipped at zero, transformed with $\log(1+x)$, and standardized using only the training split. For every evaluated fold, candidate risk models are trained on the other folds only, and the architecture is selected inside those training folds from four candidates: a linear ridge model with zero hidden layers, one ReLU MLP layer of 16 units, one ReLU MLP layer of 32 units, or two ReLU MLP layers of 64 and 32 units. Concretely, after holding out the evaluated fold, the remaining folds are split by fold index for architecture selection: cells with fold index divisible by 5 form the inner validation set and the rest form the inner training set, giving roughly a 4:1 split. If this modulo-5 split leaves too few validation or training cells, we fall back to fold index modulo 3, giving roughly a 2:1 split. Across the evaluated folds, only MLP configurations were selected: the hybrid estimator selected 16, 32, and 64/32 hidden units in 7, 15, and 8 folds; the ensemble-spread estimator selected them in 12, 5, and 13 folds; the matrix-support estimator selected 32 and 64/32 hidden units in 3 and 27 folds. After the architecture is selected, the MLP variants use ReLU activations, Adam, $\ell_{2}$ penalty $10^{-3}$, learning rate $3\times 10^{-3}$, a separate 15% early-stopping validation fraction within the fitting routine, 25 no-improvement iterations, at most 500 iterations, and deterministic seeds derived from base seed 42. Each fitted model outputs a risk score, where larger values mean less reliable point predictions.
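
The feature preprocessing and the fold-index inner split described above can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names are hypothetical, `min_cells` stands in for the unspecified "too few cells" threshold, and constant columns are given unit scale to avoid division by zero (an assumption).

```python
import math
from statistics import mean, pstdev

def fit_preprocessor(train_rows):
    """Clip at zero, apply log(1+x), and standardize with training-split stats."""
    def transform(x):
        return math.log1p(max(x, 0.0))

    columns = list(zip(*[[transform(x) for x in row] for row in train_rows]))
    means = [mean(col) for col in columns]
    # Assumption: constant columns get unit scale instead of zero.
    scales = [pstdev(col) or 1.0 for col in columns]

    def apply(rows):
        # The same training-split statistics are reused for any later split.
        return [
            [(transform(x) - m) / s for x, m, s in zip(row, means, scales)]
            for row in rows
        ]
    return apply

def inner_split(fold_ids, min_cells=1):
    """Split cell indices by fold index: modulo 5 first, then modulo 3.

    fold_ids: fold index of each remaining cell (evaluated fold already removed).
    min_cells: hypothetical threshold for "too few" cells on either side.
    """
    for modulus in (5, 3):
        validation = [i for i, f in enumerate(fold_ids) if f % modulus == 0]
        training = [i for i, f in enumerate(fold_ids) if f % modulus != 0]
        if len(validation) >= min_cells and len(training) >= min_cells:
            return training, validation
    raise ValueError("not enough cells for an inner split")
```

Fitting the preprocessor on the inner training cells and applying it unchanged to the validation and evaluated cells keeps held-out statistics out of the standardization.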
For display, we calibrate this ordering into a trust probability: the probability that predictions with similar risk fall within a chosen number of score points of the reported score. The display calibration bins held-out cells by hybrid risk, estimates the empirical within-tolerance probability in each bin, enforces a monotone nonincreasing map from risk to trust probability, and interpolates this map for displayed cells. For prediction intervals, we apply the same leave-fold-out conformal wrapper to each reliability estimator: on all folds except the evaluated fold, take the 90th percentile of $|\hat{s}-s|/r$, multiply the evaluated fold’s risk score $r$ by that scale, and center the resulting 90% interval at the fixed Logit Bias ALS point prediction $\hat{s}$. [Table 17](https://arxiv.org/html/2606.24020#A5.T17) reports the resulting interval widths at three coverage levels.
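
The conformal wrapper can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the released implementation: the function names are hypothetical, risk scores are assumed strictly positive, and the percentile is the plain empirical quantile without a finite-sample correction, since the source does not say which convention is used.

```python
import math

def conformal_scale(abs_errors, risks, coverage=0.90):
    """Empirical `coverage` quantile of |s_hat - s| / r on calibration folds.

    abs_errors, risks: per-cell absolute errors and risk scores from all
    folds except the evaluated one. Risks are assumed strictly positive.
    """
    ratios = sorted(e / r for e, r in zip(abs_errors, risks))
    # Smallest ratio with at least `coverage` of the ratios at or below it;
    # the tiny offset guards against floating-point overshoot in the product.
    k = math.ceil(coverage * len(ratios) - 1e-9) - 1
    return ratios[max(k, 0)]

def conformal_interval(point_prediction, risk, scale):
    """Symmetric interval centered at the fixed point prediction."""
    half_width = scale * risk
    return point_prediction - half_width, point_prediction + half_width
```

Because the half-width is the risk score times a single scale, cells the estimator considers riskier receive proportionally wider intervals, which is why a better risk ordering gives sharper intervals at the same coverage.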

Table 17: Conformal interval widths (score points) at three coverage levels; lower is sharper. The hybrid row is shaded.
