Forecasting Downstream Performance of LLMs With Proxy Metrics

Hugging Face Daily Papers 05/18/26, 12:00 AM Papers

llm forecasting proxy-metrics performance token-statistics training model-selection

Summary

This paper introduces proxy metrics based on token-level statistics from expert-written solutions to forecast downstream LLM performance, significantly outperforming loss-based methods in model selection, pretraining data selection, and training-time forecasting.

Progress in language model development is often driven by comparative decisions: which architecture to adopt, which pretraining corpus to use, or which training recipe to apply. Making these decisions well requires reliable performance forecasts, yet the two commonly used signals are fundamentally limited. Cross-entropy loss is poorly aligned with downstream capabilities, and direct downstream evaluation is expensive, sparse, and often uninformative at early training stages. Instead, we propose to construct proxy metrics by aggregating token-level statistics, such as entropy, top-k accuracy, and expert token rank, from a candidate model's next token distribution over expert-written solutions. Across three settings, our proxies consistently outperform loss- and compute-based baselines: 1) for cross-family model selection, they rank a heterogeneous population of reasoning models with mean Spearman ρ = 0.81 (vs. ρ = 0.36 for cross-entropy loss); 2) for pretraining data selection, they reliably rank 25 candidate corpora for a target model at roughly 10,000× less compute than direct evaluation, pushing the Pareto frontier beyond existing methods; and 3) for training-time forecasting, they extrapolate downstream accuracy across an 18× compute horizon with roughly half the error of existing alternatives. Together, these results suggest that expert trajectories are a broadly useful source of signal for assessing model capabilities, enabling reliable performance forecasting throughout the model development life cycle.
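To make the token-level statistics named above concrete, here is a minimal sketch of how entropy, top-k accuracy, and expert token rank could be computed from a model's next-token logits at each position of an expert-written solution. The function names (`token_stats`, `aggregate`), the toy logits, and the choice of a plain mean as the aggregation are illustrative assumptions; the abstract does not say how the paper aggregates these statistics.

```python
import math
from statistics import mean


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_stats(logits, expert_id, k=5):
    """Statistics for one position, given logits over the vocabulary
    and the index of the token the expert actually wrote."""
    probs = softmax(logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Rank 1 means no other token is strictly more likely than the expert token.
    rank = 1 + sum(1 for p in probs if p > probs[expert_id])
    return {
        "entropy": entropy,
        "top_k_hit": 1.0 if rank <= k else 0.0,
        "expert_rank": float(rank),
        # Per-token cross-entropy, i.e. the loss-style signal for comparison.
        "nll": -math.log(probs[expert_id]),
    }


def aggregate(per_position_logits, expert_ids, k=5):
    """Average each statistic over all positions (assumed aggregation)."""
    rows = [token_stats(l, t, k) for l, t in zip(per_position_logits, expert_ids)]
    return {name: mean(r[name] for r in rows) for name in rows[0]}


# Toy example: 3 positions, vocabulary of 4 tokens (made-up numbers).
toy_logits = [[2.0, 1.0, 0.0, -1.0], [0.0, 0.0, 0.0, 0.0], [0.0, 3.0, 1.0, 2.0]]
toy_expert_ids = [0, 2, 3]
summary = aggregate(toy_logits, toy_expert_ids, k=2)
print(summary)
```

In practice the logits would come from a forward pass of the candidate model over the tokenized expert solution (teacher forcing), so no generation or answer grading is needed.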

Original Article


Cached at: 05/22/26, 02:20 PM


Source: https://huggingface.co/papers/2605.18607

Abstract

Proxy metrics based on token-level statistics from expert-written solutions provide more reliable model performance forecasting than traditional loss-based methods across multiple development stages.

Progress in language model development is often driven by comparative decisions: which architecture to adopt, which pretraining corpus to use, or which training recipe to apply. Making these decisions well requires reliable performance forecasts, yet the two commonly used signals are fundamentally limited. Cross-entropy loss is poorly aligned with downstream capabilities, and direct downstream evaluation is expensive, sparse, and often uninformative at early training stages. Instead, we propose to construct proxy metrics by aggregating token-level statistics, such as entropy, top-k accuracy, and expert token rank, from a candidate model's next token distribution over expert-written solutions. Across three settings, our proxies consistently outperform loss- and compute-based baselines: 1) for cross-family model selection, they rank a heterogeneous population of reasoning models with mean Spearman ρ = 0.81 (vs. ρ = 0.36 for cross-entropy loss); 2) for pretraining data selection, they reliably rank 25 candidate corpora for a target model at roughly 10,000× less compute than direct evaluation, pushing the Pareto frontier beyond existing methods; and 3) for training-time forecasting, they extrapolate downstream accuracy across an 18× compute horizon with roughly half the error of existing alternatives. Together, these results suggest that expert trajectories are a broadly useful source of signal for assessing model capabilities, enabling reliable performance forecasting throughout the model development life cycle.
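The model-selection result above is reported as a Spearman rank correlation between a proxy and downstream performance. As a rough illustration of that evaluation, the sketch below computes Spearman's coefficient (Pearson correlation of average ranks) between hypothetical proxy scores and hypothetical benchmark accuracies. All numbers and function names are made up for illustration and are not taken from the paper.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical proxy scores and downstream accuracies for five models.
proxy_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
accuracies = [0.42, 0.37, 0.61, 0.55, 0.70]
rho = spearman_rho(proxy_scores, accuracies)
print(round(rho, 3))
```

A value near 1 means the proxy orders the models almost exactly as the downstream benchmark does, which is the property that matters when the goal is to pick the best candidate rather than to predict its exact score.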


Get this paper in your agent:

hf papers read 2605.18607

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

TEMPO: Temporal Enforcement via Mode-Separated Policy Optimization for Trustworthy LLM Backtesting

@AlphaSignalAI: You can now boost any LLM's accuracy 2-10x without training it. Most teams improve model accuracy by fine-tuning or swa…

Evaluating LLMs as Human Surrogates in Controlled Experiments

Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks

Beyond Surface Statistics: Robust Conformal Prediction for LLMs via Internal Representations
