Forecasting Downstream Performance of LLMs With Proxy Metrics
Summary
This paper introduces proxy metrics based on token-level statistics from expert-written solutions to forecast downstream LLM performance, significantly outperforming loss-based methods in model selection, pretraining data selection, and training-time forecasting.
Cached at: 05/22/26, 02:20 PM
Source: https://huggingface.co/papers/2605.18607
Abstract
Proxy metrics based on token-level statistics from expert-written solutions provide more reliable model performance forecasting than traditional loss-based methods across multiple development stages.
Progress in language model development is often driven by comparative decisions: which architecture to adopt, which pretraining corpus to use, or which training recipe to apply. Making these decisions well requires reliable performance forecasts, yet the two commonly used signals are fundamentally limited. Cross-entropy loss is poorly aligned with downstream capabilities, and direct downstream evaluation is expensive, sparse, and often uninformative at early training stages. Instead, we propose to construct proxy metrics by aggregating token-level statistics, such as entropy, top-k accuracy, and expert token rank, from a candidate model's next token distribution over expert-written solutions. Across three settings, our proxies consistently outperform loss- and compute-based baselines: 1) For cross-family model selection, they rank a heterogeneous population of reasoning models with mean Spearman Rho = 0.81 (vs. Rho = 0.36 for cross-entropy loss); 2) For pretraining data selection, they reliably rank 25 candidate corpora for a target model at roughly 10,000× less compute than direct evaluation, pushing the Pareto frontier beyond existing methods; and 3) for training-time forecasting, they extrapolate downstream accuracy across an 18× compute horizon with roughly half the error of existing alternatives. Together, these results suggest that expert trajectories are a broadly useful source of signal for assessing model capabilities, enabling reliable performance forecasting throughout the model development life cycle.
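The abstract names the token-level statistics but not how they are computed or aggregated, so the following is only a minimal sketch under stated assumptions. It assumes the candidate model's next-token logits over one expert-written solution are already available as an array of shape (T, V), and it aggregates by a plain mean over positions. The function names `token_stats` and `spearman_rho`, the dictionary keys, and the log-rank aggregation are hypothetical choices for illustration, not the paper's definitions.

```python
import numpy as np


def token_stats(logits, expert_ids, k=5):
    """Aggregate token-level statistics over one expert-written solution.

    logits:     (T, V) next-token logits, one row per expert token position.
    expert_ids: (T,) vocabulary ids of the expert tokens at those positions.
    Mean aggregation is an assumption; the source does not specify it.
    """
    logits = np.asarray(logits, dtype=np.float64)
    expert_ids = np.asarray(expert_ids)
    positions = np.arange(len(expert_ids))

    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)

    # Entropy of the predicted distribution at each position.
    entropy = -(probs * log_probs).sum(axis=-1)

    # Rank of the expert token: 1 means no other token has a higher logit.
    expert_logit = logits[positions, expert_ids]
    rank = 1 + (logits > expert_logit[:, None]).sum(axis=-1)

    return {
        "cross_entropy": float(-log_probs[positions, expert_ids].mean()),
        "mean_entropy": float(entropy.mean()),
        "top_k_accuracy": float((rank <= k).mean()),
        "mean_log_rank": float(np.log(rank).mean()),
    }


def spearman_rho(x, y):
    """Spearman rank correlation; this simple version assumes no ties."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])
```

In a model-selection setting, one would compute a proxy such as `top_k_accuracy` for each candidate model and pass the proxy values together with the measured downstream accuracies to `spearman_rho` to check how well the proxy ranks the candidates.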
Similar Articles
TEMPO: Temporal Enforcement via Mode-Separated Policy Optimization for Trustworthy LLM Backtesting
Proposes TEMPO, a policy optimization method that trains LLMs to reason exclusively from pre-cutoff information by using a two-mode reward and GRPO-based training, reducing knowledge leakage by 2–13% while improving task performance by 6–13%.
@AlphaSignalAI: You can now boost any LLM's accuracy 2-10x without training it. Most teams improve model accuracy by fine-tuning or swa…
OptiLLM is an open-source proxy that boosts any LLM's accuracy 2-10x by adding extra compute at inference time, using techniques like multi-agent cross-verification and Monte Carlo tree search.
Evaluating LLMs as Human Surrogates in Controlled Experiments
This paper evaluates whether off-the-shelf LLMs can reliably simulate human responses in controlled behavioral experiments by comparing LLM-generated data with human survey responses on accuracy perception. The findings show that while LLMs capture directional effects and aggregate belief-updating patterns, they do not consistently match human-scale effect magnitudes, clarifying when synthetic LLM data can serve as behavioral proxies.
Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks
This paper identifies systemic measurement bias in production LLM inference benchmarks caused by single-process Python clients using asyncio, and proposes a multi-process evaluation framework and a new metric (NTPOT) to accurately profile serving engines at scale.
Beyond Surface Statistics: Robust Conformal Prediction for LLMs via Internal Representations
This paper proposes a conformal prediction framework for LLMs that leverages internal representations rather than output-level statistics, introducing Layer-Wise Information (LI) scores as nonconformity measures to improve validity-efficiency trade-offs under distribution shift. The method demonstrates stronger robustness to calibration-deployment mismatch compared to text-level baselines across QA benchmarks.