Item Response Scaling Laws: A Measurement Theory Approach for Efficient and Generalizable Neural Scaling Estimation

arXiv cs.LG 06/09/26, 04:00 AM Papers
Summary
Introduces Item Response Scaling Laws (IRSL), a framework that integrates Item Response Theory to estimate neural scaling laws efficiently, reducing the number of evaluation questions required by 99.9% while achieving comparable accuracy.
arXiv:2606.07616v1 Announce Type: new Abstract: Scaling laws provide a fundamental framework for understanding the performance of Language Models (LMs), yet deriving them requires prohibitively expensive evaluations across thousands of checkpoints or millions of inference samples. To address this, we introduce Item Response Scaling Laws (IRSL), a unified framework that integrates Item Response Theory (IRT) within the scaling law framework. Unlike traditional approaches that treat each model-benchmark pair in isolation, IRSL disentangles latent model ability from question characteristics, factorizing the scaling law estimation for $M$ models and $N$ questions to significantly reduce parameter complexity from $O(M \times N)$ to $O(M + N)$. We instantiate IRSL with Beta-IRT, which leverages the empirical probability responses of LMs -- such as token probabilities in pre-training and pass rates in test-time sampling -- to capture richer signals than binary responses. We validate our approach across two prevalent scaling paradigms: (1) pre-training downstream scaling, using 6,612 LM checkpoints and 37,682 questions from 10 benchmarks; and (2) test-time scaling, using 12 LMs and 120 questions from 4 benchmarks with up to 2,500 samples per question. Given a one-time calibration on existing model responses, IRSL yields more reliable scaling estimates using only 50 questions per benchmark (a 99.9\% reduction), achieving comparable or superior decision accuracy to traditional approaches. Furthermore, we show that the estimated latent model abilities are generalizable, enabling accurate performance forecasting across benchmarks that share the same measurement objective.
Cached at: 06/09/26, 08:51 AM
# Item Response Scaling Laws: A Measurement Theory Approach for Efficient and Generalizable Neural Scaling Estimation
Source: [https://arxiv.org/html/2606.07616](https://arxiv.org/html/2606.07616)
###### Abstract

Scaling laws provide a fundamental framework for understanding the performance of Language Models (LMs), yet deriving them requires prohibitively expensive evaluations across thousands of checkpoints or millions of inference samples. To address this, we introduce Item Response Scaling Laws (IRSL), a unified framework that integrates Item Response Theory (IRT) within the scaling law framework. Unlike traditional approaches that treat each model-benchmark pair in isolation, IRSL disentangles latent model ability from question characteristics, factorizing the scaling law estimation for $M$ models and $N$ questions to significantly reduce parameter complexity from $O(M \times N)$ to $O(M + N)$. We instantiate IRSL with Beta-IRT, which leverages the empirical probability responses of LMs, such as token probabilities in pre-training and pass rates in test-time sampling, to capture richer signals than binary responses. We validate our approach across two prevalent scaling paradigms: (1) pre-training downstream scaling, using 6,612 LM checkpoints and 37,682 questions from 10 benchmarks; and (2) test-time scaling, using 12 LMs and 120 questions from 4 benchmarks with up to 2,500 samples per question. Given a one-time calibration on existing model responses, IRSL yields more reliable scaling estimates using only 50 questions per benchmark (a 99.9% reduction), achieving comparable or superior decision accuracy to traditional approaches. Furthermore, we show that the estimated latent model abilities are generalizable, enabling accurate performance forecasting across benchmarks that share the same measurement objective.

AI measurement, Scaling law, AI evaluation

## 1 Introduction

Figure 1: IRSL reduces scaling law estimation from $O(M \times N)$ to $O(M + N)$ by factorizing model ability from question difficulty. *Left:* The response matrix $\mathbf{R}$ ($M \times N$) records empirical probabilities across LMs and benchmark questions; sparse rows for new LMs illustrate query efficiency via adaptive testing. *Center-left:* IRT decomposes $\mathbf{R}$ into LM abilities $\bm{\theta}$ (orange) and question difficulties $\bm{z}$ (blue), so that $R_{ij} \approx \sigma(\theta_i - z_j)$. *Center-right:* The estimated $\theta$ values scale predictably with the scaling variable $x$ (e.g., pre-training compute or test-time samples). *Right:* Recomposing $\theta(x)$ with the calibrated $z$ yields per-question scaling predictions $\hat{R}_{ij}(x) = \sigma(\theta_i(x) - z_j)$, where questions of varying difficulty produce distinct curves.

Scaling laws provide a principled framework for predicting performance and allocating resources in Language Models (LMs). We focus on two primary forms: pre-training downstream scaling, which characterizes how performance on downstream tasks improves with pre-training compute (Kaplan et al., [2020](https://arxiv.org/html/2606.07616#bib.bib20); Hoffmann et al., [2022](https://arxiv.org/html/2606.07616#bib.bib22); Biderman et al., [2023](https://arxiv.org/html/2606.07616#bib.bib2); Grattafiori et al., [2024](https://arxiv.org/html/2606.07616#bib.bib63)), and test-time scaling, which describes how performance improves with the number of independent inference samples. Test-time scaling encompasses diverse strategies including chain-of-thought prompting, tree-of-thought search, repeated sampling, and reinforcement learning-based reasoning (Brown et al., [2024](https://arxiv.org/html/2606.07616#bib.bib16); Hughes et al., [2024](https://arxiv.org/html/2606.07616#bib.bib47); Levi, [2024](https://arxiv.org/html/2606.07616#bib.bib70)); in this work, we focus specifically on the repeated sampling paradigm.

Deriving these laws is computationally expensive. A pre-training scaling study typically requires evaluating thousands of model checkpoints across tens of thousands of questions. Similarly, establishing test-time scaling laws requires a massive number of queries: number of models $\times$ number of questions $\times$ number of samples per question (typically $10^{2} \times 10^{5} \times 10^{4} = 10^{11}$). Consequently, practical studies are often constrained to small experimental scales (Chen et al., [2024](https://arxiv.org/html/2606.07616#bib.bib39); Brown et al., [2024](https://arxiv.org/html/2606.07616#bib.bib16), [2020](https://arxiv.org/html/2606.07616#bib.bib68)). The laws derived from such limited scales can exhibit unintuitive behaviors. For example, Brown et al. ([2024](https://arxiv.org/html/2606.07616#bib.bib16)) empirically find a power-law test-time scaling relationship that, as Schaeffer et al. ([2025](https://arxiv.org/html/2606.07616#bib.bib15)) demonstrate, holds only for specific, ill-structured distributions of single-sample success rates.

To address the cost of evaluation, we turn to Item Response Theory (IRT). Originating in psychology and human testing, IRT is a probabilistic framework that models the interaction between test takers and questions, and it is known for significantly reducing the number of queries required to reliably estimate the ability of test takers. It has been highly successful in both human testing (Lord, [1980](https://arxiv.org/html/2606.07616#bib.bib35)) and recent LM leaderboard evaluations (Truong et al., [2025](https://arxiv.org/html/2606.07616#bib.bib7); Hofmann et al., [2025](https://arxiv.org/html/2606.07616#bib.bib49); Kipnis et al., [2025](https://arxiv.org/html/2606.07616#bib.bib58)). Building on this, we introduce Item Response Scaling Laws (IRSL), a methodology that integrates IRT into the scaling law framework. IRSL leverages the property of IRT to disentangle the ability of LMs from the characteristics of the questions, factorizing the problem into $M$ sets of LM-specific parameters and $N$ sets of question-specific parameters, reducing the complexity from $O(M \times N)$ to $O(M + N)$. This factorization allows the estimated ability to be transferred across benchmarks that share the same measurement objective.

Prior applications of IRT typically rely on binary responses (where the response of a test taker to a question is either correct or incorrect). However, unlike human testing, LMs provide empirical probability responses. In pre-training, LMs yield token probabilities that offer smoother scaling signals than discrete accuracy (Schaeffer et al., [2024](https://arxiv.org/html/2606.07616#bib.bib38); Magnusson et al., [2025](https://arxiv.org/html/2606.07616#bib.bib41)). In test-time sampling, LMs provide per-attempt success rates averaged from many independent inferences. Such empirical probability responses convey richer information than binary responses. To leverage this information, we instantiate IRSL with Beta-IRT, which uses a Beta loss to model these empirical probability responses. While IRSL is a general framework compatible with any IRT model, Beta-IRT enables it to exploit the richer probability signals that LMs uniquely provide.

Our contributions are as follows:

- We conduct a large-scale study on 6,612 LM checkpoints and 37,682 questions from 10 benchmarks to demonstrate the effectiveness of our pre-training downstream IRSL. We show that it yields generalizable and robust estimates of scaling behavior with limited query budgets.
- On 12 LMs across 120 questions from 4 benchmarks with up to 2,500 samples per question, preliminary evidence suggests that IRSL similarly applies to test-time scaling.

By embedding the scaling law within the IRT framework, instantiated here via Beta-IRT, our approach provides a theoretically principled and empirically validated alternative to traditional aggregate performance scaling. Our code is released at [https://github.com/aims-foundations/irsl](https://github.com/aims-foundations/irsl).

## 2 Related Work

#### Pre\-training Loss Scaling Laws

Many neural networks exhibit power-law scaling for the pre-training loss as a function of compute, data, or parameters (Hestness et al., [2017](https://arxiv.org/html/2606.07616#bib.bib26); Kaplan et al., [2020](https://arxiv.org/html/2606.07616#bib.bib20); Bahri et al., [2021](https://arxiv.org/html/2606.07616#bib.bib27); Hernandez et al., [2021](https://arxiv.org/html/2606.07616#bib.bib28); Hoffmann et al., [2022](https://arxiv.org/html/2606.07616#bib.bib22); Muennighoff et al., [2024](https://arxiv.org/html/2606.07616#bib.bib23); Brown et al., [2020](https://arxiv.org/html/2606.07616#bib.bib68)).

#### Downstream Performance Scaling Laws

Unlike predicting loss, predicting downstream performance from scale is generally harder (Lourie et al., [2025](https://arxiv.org/html/2606.07616#bib.bib40); Schaeffer et al., [2024](https://arxiv.org/html/2606.07616#bib.bib38)). However, recent work has demonstrated that it can be done with a two-step prediction that chains together predictions from scale to loss and from loss to downstream performance (Biderman et al., [2023](https://arxiv.org/html/2606.07616#bib.bib2); Magnusson et al., [2025](https://arxiv.org/html/2606.07616#bib.bib41); Gadre et al., [2024](https://arxiv.org/html/2606.07616#bib.bib29)).

#### Test\-time Scaling Law

Test-time scaling laws characterize how a model's performance on a benchmark (e.g., success rate) improves as the number of stochastic samples drawn at inference increases, typically following a power law (Brown et al., [2024](https://arxiv.org/html/2606.07616#bib.bib16); Snell et al., [2024](https://arxiv.org/html/2606.07616#bib.bib66); Hughes et al., [2024](https://arxiv.org/html/2606.07616#bib.bib47)). Later works demonstrate that such a power-law relationship holds only for ill-structured distributions of single-sample success rates (Schaeffer et al., [2025](https://arxiv.org/html/2606.07616#bib.bib15); Levi, [2024](https://arxiv.org/html/2606.07616#bib.bib70)).

#### Efficient LM Evaluation

Several recent works adopt Item Response Theory (IRT) as a foundation for LM evaluation using binary responses and Bernoulli loss, which we refer to as Binary-IRT (Truong et al., [2025](https://arxiv.org/html/2606.07616#bib.bib7); Hofmann et al., [2025](https://arxiv.org/html/2606.07616#bib.bib49); Kipnis et al., [2025](https://arxiv.org/html/2606.07616#bib.bib58); Polo et al., [2024](https://arxiv.org/html/2606.07616#bib.bib60)). Binary-IRT has been shown to outperform many efficient evaluation methods, such as Anchor Points (Vivek et al., [2024](https://arxiv.org/html/2606.07616#bib.bib71)), SMART (Gupta et al., [2025](https://arxiv.org/html/2606.07616#bib.bib72)), MAGI (Paech, [2024](https://arxiv.org/html/2606.07616#bib.bib73)), and Stratified Sampling (Perlitz et al., [2024](https://arxiv.org/html/2606.07616#bib.bib74)). Our contribution integrates this framework into the scaling law estimation scenario and further uses Beta-IRT, which leverages empirical probability responses unique to LMs to achieve better performance than Binary-IRT.

#### Continuous IRT Models

Traditional IRT relies on binary responses. Chen et al. ([2019](https://arxiv.org/html/2606.07616#bib.bib69)) propose $\beta^{3}$-IRT, which uses a three-parameter Beta distribution to model continuous responses. Our Beta-IRT differs in that we parameterize the Beta distribution mean via the standard IRT logistic function $\sigma(d(\theta - z))$, preserving the interpretability of ability $\theta$ and difficulty $z$ while coupling naturally with scaling law estimation. The key novelty of IRSL is not IRT for evaluation *per se*, but the integration of IRT into the scaling law framework for prediction.

## 3 Method

Item Response Theory (IRT) provides an elegant mathematical framework to model the interaction of LMs and benchmark questions. We show how, under this framework, various known scaling laws arise naturally, and how the framework facilitates efficient and generalizable scaling law estimation. Table [1](https://arxiv.org/html/2606.07616#S3.T1) lists the definitions, traditional fitting approaches, and IRT-based fitting approaches of the scaling laws.

| Setting and metric | Definition | Traditional Fitting Approach | IRT-based Fitting Approach |
| --- | --- | --- | --- |
| Pre-training, $\mathrm{Acc}$ | $\mathrm{Acc}(i, \mathcal{D}) = \frac{1}{N}\sum_{j=1}^{N} Y_{ij}$ | $a \cdot \sigma(b \cdot (\alpha \cdot \mathrm{FLOP}^{-\beta} + \gamma - l_0)) + c$ | $\frac{1}{N}\sum_{j=1}^{N}\sigma(d_j \cdot (a \cdot \log(\mathrm{FLOP}_i) + b - z_j))$ |
| Pre-training, $p_{\text{Correct Choice}}$ | $p_{\text{Correct Choice}}(i, \mathcal{D}) = \frac{1}{N}\sum_{j=1}^{N} p_{\text{Correct Choice}}(i, j)$ | $a \cdot \sigma(b \cdot (\alpha \cdot \mathrm{FLOP}^{-\beta} + \gamma - l_0)) + c$ | $\frac{1}{N}\sum_{j=1}^{N}\sigma(d_j \cdot (a \cdot \log(\mathrm{FLOP}_i) + b - z_j))$ |
| Test-time, $\operatorname{pass@k}$ | $\operatorname{pass@k}(i, \mathcal{D}) = \frac{1}{N}\sum_{j=1}^{N}\operatorname{pass@k}(i, j)$ | $\frac{1}{N}\sum_{j=1}^{N}(1 - (1 - \operatorname{pass@1}(i, j))^{k})$ | $\frac{1}{N}\sum_{j=1}^{N}(1 - (1 - \sigma(d_j \cdot (\theta_i - z_j)))^{k})$ |

Table 1: IRSL learns question-level parameters, enabling generalization across question sets with the same measurement objective. Definitions, traditional fitting approach, and IRT-based fitting approach for $\mathrm{Acc}$ and $p_{\text{Correct Choice}}$ (pre-training downstream scaling law), and $\operatorname{pass@k}$ (test-time scaling law), using the 2PL model. Traditional approaches fit parameters specific to LMs and benchmarks.

### 3.1 Traditional Binary-IRT

Evaluating $M$ models on benchmarks with $N$ questions requires $M \times N$ queries, which is prohibitively expensive at scale. Item Response Theory addresses this by modeling the interaction between test taker ability and question difficulty, enabling reliable evaluation from far fewer queries. Formally, IRT refers to a class of probabilistic latent variable models that explain the relationship between the test taker's latent ability, the question's characteristics (e.g., difficulty), and the observed response from the test taker to the questions (Baker, [2001](https://arxiv.org/html/2606.07616#bib.bib31); Van der Linden et al., [2000](https://arxiv.org/html/2606.07616#bib.bib32)). A central model in IRT is the 1PL model (Rasch, [1993](https://arxiv.org/html/2606.07616#bib.bib17)), where each test taker has an ability parameter $\theta$, and each question has a difficulty parameter $z$. A higher $\theta$ denotes greater ability, and a higher $z$ denotes a more difficult question. Let $y$ denote the binary response of the test taker to the question, where $y = 1$ if the response is correct and $y = 0$ otherwise. The probability of a correct response is modeled by $p(y = 1 \mid \theta, z) = \sigma(\theta - z)$, where $\sigma$ is the sigmoid function. Another widely adopted model in IRT is the 2PL model (Lord, [1952](https://arxiv.org/html/2606.07616#bib.bib5); Birnbaum, [1968](https://arxiv.org/html/2606.07616#bib.bib6)), which adds a discrimination parameter $d$ to capture how sharply a question differentiates between test takers of different abilities, modeling the probability of a correct response as $p(y = 1 \mid \theta, z, d) = \sigma(d \cdot (\theta - z))$. The difficulty $z$ and the discrimination $d$ are collectively referred to as the item parameters. The use of IRT consists of two phases: calibration, which estimates the item parameters, and adaptive testing, which enables efficient ability estimation for new test takers.
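The 1PL and 2PL response functions can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation; the function names and the numeric values are assumptions chosen for the example.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def p_correct_1pl(theta: float, z: float) -> float:
    """1PL: p(y = 1 | theta, z) = sigma(theta - z)."""
    return sigmoid(theta - z)


def p_correct_2pl(theta: float, z: float, d: float) -> float:
    """2PL: p(y = 1 | theta, z, d) = sigma(d * (theta - z))."""
    return sigmoid(d * (theta - z))


# Illustrative values (assumptions, not taken from the paper).
theta, z = 1.0, 0.0
p_1pl = p_correct_1pl(theta, z)
p_2pl_sharp = p_correct_2pl(theta, z, d=2.0)  # more discriminating item
p_matched = p_correct_1pl(0.5, 0.5)           # ability equals difficulty
```

When ability equals difficulty the predicted probability is one half, and a larger discrimination $d$ pushes the probability further from one half for the same gap $\theta - z$.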

During calibration, a binary response matrix $Y$ of size $M \times N$ is collected, where $M$ and $N$ denote the number of test takers and questions, respectively. Entry $Y_{ij}$ represents the response of test taker $i$ to question $j$. With the binary response matrix, the item parameters can be estimated via either MLE or EM by minimizing the Bernoulli loss between the IRT predicted probabilities and the observed binary responses, $\mathcal{L}_{\text{Bernoulli}} = -\sum_{i=1}^{M}\sum_{j=1}^{N}\left[Y_{ij}\log p_{ij} + (1 - Y_{ij})\log(1 - p_{ij})\right]$, where $p_{ij} = \sigma(d_j(\theta_i - z_j))$ for the 2PL model (or $p_{ij} = \sigma(\theta_i - z_j)$ for 1PL) (Bock and Aitkin, [1981](https://arxiv.org/html/2606.07616#bib.bib18); Chalmers, [2012](https://arxiv.org/html/2606.07616#bib.bib19); Wu et al., [2020](https://arxiv.org/html/2606.07616#bib.bib33)).
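As a rough sketch of calibration, the snippet below evaluates the Bernoulli loss for the 1PL model and reduces it with plain gradient descent on both abilities and difficulties. Joint gradient descent is an assumption made for brevity (the text mentions MLE or EM without fixing an optimizer), and the toy response matrix, step size, and function names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bernoulli_loss(Y, theta, z):
    """Negative Bernoulli log-likelihood with p_ij = sigma(theta_i - z_j)."""
    total = 0.0
    for i, row in enumerate(Y):
        for j, y in enumerate(row):
            p = min(max(sigmoid(theta[i] - z[j]), 1e-12), 1.0 - 1e-12)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def gradient_step(Y, theta, z, lr=0.1):
    """One gradient-descent step; d(loss)/d(theta_i - z_j) = p_ij - Y_ij."""
    g_theta = [0.0] * len(theta)
    g_z = [0.0] * len(z)
    for i, row in enumerate(Y):
        for j, y in enumerate(row):
            residual = sigmoid(theta[i] - z[j]) - y
            g_theta[i] += residual
            g_z[j] -= residual
    new_theta = [t - lr * g for t, g in zip(theta, g_theta)]
    new_z = [v - lr * g for v, g in zip(z, g_z)]
    return new_theta, new_z


# Toy binary response matrix: 3 test takers (rows) x 4 questions (columns).
Y = [[1, 1, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 0]]
theta, z = [0.0] * 3, [0.0] * 4
loss_before = bernoulli_loss(Y, theta, z)
for _ in range(200):
    theta, z = gradient_step(Y, theta, z)
loss_after = bernoulli_loss(Y, theta, z)
```

In this toy run the test taker with more correct answers ends up with a higher estimated ability, which is the behavior the 1PL model is designed to capture. A practical calibration would also fix the location of the scale (for example by centering the abilities), since the loss depends only on the differences $\theta_i - z_j$.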

During adaptive testing, the ability of a new test taker is efficiently estimated through an iterative procedure that alternates between ability update and question selection. In the ability update step, the test taker's ability is estimated from their responses to all previously asked questions. In the question selection step, the most informative question is selected for the query based on the current ability estimate. Consequently, significantly fewer questions are required to obtain a reliable estimate of the new test taker's ability (e.g., 50 out of 37,682 in our experiments; see Section [4.2](https://arxiv.org/html/2606.07616#S4.SS2)) (Meijer and Nering, [1999](https://arxiv.org/html/2606.07616#bib.bib55); Chang, [2015](https://arxiv.org/html/2606.07616#bib.bib56)).
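The alternation between ability update and question selection can be sketched as follows. The selection criterion here is the Fisher information of a 2PL item, $d^2 p (1 - p)$, which is a common choice in adaptive testing but is an assumption on our part, since the text does not name the criterion. The item bank, the grid-search ability update, and the deterministic simulated test taker are likewise hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fisher_information(theta, z, d):
    """Fisher information of a 2PL item at ability theta: d^2 * p * (1 - p)."""
    p = sigmoid(d * (theta - z))
    return d * d * p * (1.0 - p)


def select_question(theta, items, asked):
    """Question selection: the unasked item that is most informative at theta."""
    remaining = [j for j in range(len(items)) if j not in asked]
    return max(remaining, key=lambda j: fisher_information(theta, *items[j]))


def update_ability(responses, items, grid):
    """Ability update: grid-search MLE from responses {item index: 0 or 1}."""
    def log_likelihood(theta):
        total = 0.0
        for j, y in responses.items():
            z, d = items[j]
            p = min(max(sigmoid(d * (theta - z)), 1e-12), 1.0 - 1e-12)
            total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
        return total
    return max(grid, key=log_likelihood)


# Hypothetical calibrated item bank: (difficulty z, discrimination d).
items = [(-2.0, 1.0), (-1.0, 1.0), (0.2, 1.0), (1.0, 1.0), (2.0, 1.0)]
grid = [g / 10.0 for g in range(-40, 41)]
true_theta = 0.5  # hidden ability of the simulated test taker

theta_hat, responses, order = 0.0, {}, []
for _ in range(3):  # query budget of three questions
    j = select_question(theta_hat, items, responses)
    # Deterministic stand-in for querying a model: correct iff ability > difficulty.
    responses[j] = 1 if true_theta > items[j][0] else 0
    order.append(j)
    theta_hat = update_ability(responses, items, grid)
```

Each round asks the question whose difficulty is closest to the current ability estimate, so the queries concentrate where a response is most informative instead of sweeping the full item bank.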

### 3.2 Traditional Scaling Laws

We investigate two scaling laws: the pre-training downstream scaling law and the test-time scaling law. The pre-training downstream scaling law characterizes how the performance of an LM $i$ on a benchmark $\mathcal{D}$ scales with the pre-training compute $\mathrm{FLOP}$. The traditional approach involves a two-step fitting process: first modeling the relationship between pre-training loss $L$ and compute $\mathrm{FLOP}$, and subsequently mapping the loss $L$ to benchmark performance $\mathrm{Performance}(i, \mathcal{D})$ (Bhagia et al., [2024](https://arxiv.org/html/2606.07616#bib.bib21)):

$$
\begin{aligned}
L &\approx \alpha \cdot \mathrm{FLOP}^{-\beta} + \gamma, \\
\mathrm{Performance}(i, \mathcal{D}) &\approx a \cdot \sigma(b \cdot (L - l_0)) + c,
\end{aligned} \tag{1}
$$

where $\sigma$ denotes the sigmoid function, and $\alpha, \beta, \gamma, a, b, c$, and $l_0$ are learnable parameters. The sigmoid mapping in the second step is known to be sensitive to initialization and hyperparameters; our IRT-based approach (Section [3.4](https://arxiv.org/html/2606.07616#S3.SS4)) avoids this two-step fitting. Following Bhagia et al. ([2024](https://arxiv.org/html/2606.07616#bib.bib21)), we use the benchmark-specific loss (which can be understood as the pre-training validation loss on benchmark questions) as $L$. Consequently, all scaling law parameters are benchmark- and LM-specific, implying that parameters derived for one LM-benchmark pair do not generalize to others. $\mathrm{Performance}(i, \mathcal{D})$ can be quantified using metrics such as accuracy ($\mathrm{Acc}$) or the average probability of the correct choice ($p_{\text{Correct Choice}}$). Previous work notes that discrete metrics like $\mathrm{Acc}$ can exhibit performance jumps across scales, whereas continuous metrics like $p_{\text{Correct Choice}}$ often reveal more predictable improvements (Schaeffer et al., [2024](https://arxiv.org/html/2606.07616#bib.bib38); Magnusson et al., [2025](https://arxiv.org/html/2606.07616#bib.bib41)). See Appendix [A](https://arxiv.org/html/2606.07616#A1) for the calculation details of $L$, $\mathrm{Acc}$, and $p_{\text{Correct Choice}}$.
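To make the chained prediction in Equation (1) concrete, here is a small sketch that maps compute to loss and loss to performance. All parameter values are made up for illustration (they are not fitted values from the paper), and a negative $b$ is assumed so that lower loss yields higher performance.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def loss_from_compute(flop, alpha, beta, gamma):
    """Step 1 of Equation (1): L ~ alpha * FLOP^(-beta) + gamma."""
    return alpha * flop ** (-beta) + gamma


def performance_from_loss(loss, a, b, c, l0):
    """Step 2 of Equation (1): Performance ~ a * sigma(b * (L - l0)) + c."""
    return a * sigmoid(b * (loss - l0)) + c


# Hypothetical parameters for one LM-benchmark pair.
step1 = dict(alpha=100.0, beta=0.1, gamma=0.5)
step2 = dict(a=0.75, b=-5.0, c=0.25, l0=1.5)  # c acts as a chance-level floor

flops = [1e18, 1e19, 1e20, 1e21, 1e22]
losses = [loss_from_compute(f, **step1) for f in flops]
performance = [performance_from_loss(l, **step2) for l in losses]
```

Because every one of the seven parameters belongs to a single LM-benchmark pair, nothing fitted here carries over to another benchmark, which is the limitation the IRT-based formulation targets.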

The test-time scaling law characterizes the relationship between the success rate of an LM $i$ on a benchmark $\mathcal{D}$ and the number of independent inference samples $k$ (Brown et al., [2024](https://arxiv.org/html/2606.07616#bib.bib16); Levi, [2024](https://arxiv.org/html/2606.07616#bib.bib70)). For an LM $i$ and a question $j$, $\operatorname{pass@1}(i, j)$ is defined as the probability that a single sample from LM $i$ correctly answers question $j$. The question-level success rate, $\operatorname{pass@k}(i, j)$, is defined as the probability that at least one of the $k$ generated responses is correct. The benchmark-level success rate $\operatorname{pass@k}(i, \mathcal{D})$ is computed by averaging the probabilities over all benchmark questions: $\operatorname{pass@k}(i, \mathcal{D}) = \frac{1}{N}\sum_{j=1}^{N}\operatorname{pass@k}(i, j)$. Previous studies empirically find that $-\log\operatorname{pass@k}$ exhibits a power-law decay with respect to $k$ (Brown et al., [2024](https://arxiv.org/html/2606.07616#bib.bib16); Hughes et al., [2024](https://arxiv.org/html/2606.07616#bib.bib47)): $-\log\operatorname{pass@k}(i, \mathcal{D}) \approx u k^{-v}$, where $u$ and $v$ are scaling law parameters. Schaeffer et al. ([2025](https://arxiv.org/html/2606.07616#bib.bib15)) note that while the question-level success rate theoretically approaches 1 exponentially fast in $k$, the benchmark-level power law emerges because the distribution of $\operatorname{pass@1}(i, j)$ is heavy-tailed towards extremely difficult questions. The relationship between $\operatorname{pass@k}(i, \mathcal{D})$ and $\operatorname{pass@1}(i, j)$ can be expressed as:

$$
\operatorname{pass@k}(i, \mathcal{D}) = \frac{1}{N}\sum_{j=1}^{N}\left(1 - (1 - \operatorname{pass@1}(i, j))^{k}\right), \tag{2}
$$

where $\operatorname{pass@1}$ is benchmark- and LM-specific.
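Equation (2) is straightforward to evaluate once per-question $\operatorname{pass@1}$ values are available. The sketch below uses four made-up $\operatorname{pass@1}$ values; the function names are our own.

```python
import math


def pass_at_k_question(pass_at_1: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - pass_at_1) ** k


def pass_at_k_benchmark(pass_at_1_per_question, k: int) -> float:
    """Equation (2): average the question-level pass@k over the benchmark."""
    values = [pass_at_k_question(p, k) for p in pass_at_1_per_question]
    return sum(values) / len(values)


# Hypothetical single-sample success rates for four questions.
pass_at_1 = [0.9, 0.5, 0.1, 0.01]
curve = {k: pass_at_k_benchmark(pass_at_1, k) for k in (1, 10, 100)}
neg_log_curve = {k: -math.log(v) for k, v in curve.items()}
```

The hardest question (with $\operatorname{pass@1} = 0.01$) dominates the remaining error at large $k$, which illustrates why the shape of the benchmark-level curve depends on how many very difficult questions the benchmark contains.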

### 3.3 Beta-IRT

Unlike human testing, LMs provide empirical probability responses that convey richer information than binary responses, such as $p_{\text{Correct Choice}}$ from pre-training downstream scaling and $\operatorname{pass@1}$ from test-time scaling. Drawing on insights from Beta regression (Ferrari and Cribari-Neto, [2004](https://arxiv.org/html/2606.07616#bib.bib65)) and related continuous IRT models (Chen et al., [2019](https://arxiv.org/html/2606.07616#bib.bib69)), we propose Beta-IRT, which replaces the standard Bernoulli loss with the Beta loss: $\mathcal{L}_{\text{Beta}} = -\sum_{i=1}^{M}\sum_{j=1}^{N}\log p(P_{ij};\, p_{ij}, \phi)$, where $P_{ij}$ denotes the empirical response probability, $p_{ij}$ denotes the IRT-predicted probability, and $\phi > 0$ is a precision parameter controlling the concentration of the Beta distribution around its mean (higher $\phi$ yields a tighter distribution). Unlike $\beta^{3}$-IRT (Chen et al., [2019](https://arxiv.org/html/2606.07616#bib.bib69)), which uses a three-parameter Beta distribution, our formulation parameterizes the Beta mean via the standard IRT logistic function, preserving the interpretability of $\theta$ and $z$. We empirically find that Beta-IRT achieves reliable calibration with significantly fewer test takers than Binary-IRT, substantially reducing calibration costs.
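A minimal sketch of the Beta loss follows. It assumes the mean-precision parameterization used in Beta regression, $\mathrm{Beta}(p_{ij}\phi,\ (1 - p_{ij})\phi)$, which matches the description of $\phi$ as a precision around the mean but is not spelled out in the text. Clipping probabilities away from 0 and 1 is also our assumption, added to keep the logarithms finite, and the toy values are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def beta_log_pdf(x: float, mean: float, phi: float) -> float:
    """Log density of Beta(mean * phi, (1 - mean) * phi) at x in (0, 1)."""
    a, b = mean * phi, (1.0 - mean) * phi
    return (math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x))


def beta_irt_loss(P, theta, z, d, phi, eps=1e-6):
    """Negative Beta log-likelihood with mean p_ij = sigma(d_j * (theta_i - z_j))."""
    total = 0.0
    for i, row in enumerate(P):
        for j, observed in enumerate(row):
            mean = sigmoid(d[j] * (theta[i] - z[j]))
            mean = min(max(mean, eps), 1.0 - eps)          # assumed clipping
            observed = min(max(observed, eps), 1.0 - eps)  # assumed clipping
            total -= beta_log_pdf(observed, mean, phi)
    return total


# Toy setup: responses generated exactly from the 2PL mean.
theta_true, z, d, phi = [1.0, -1.0], [0.0, 0.5], [1.0, 1.0], 10.0
P = [[sigmoid(d[j] * (t - z[j])) for j in range(2)] for t in theta_true]

loss_true = beta_irt_loss(P, theta_true, z, d, phi)
loss_swapped = beta_irt_loss(P, [-1.0, 1.0], z, d, phi)  # abilities swapped
```

Swapping the two abilities raises the loss sharply, because each observed probability is then compared against a Beta distribution centered on the wrong mean.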

### 3.4 Item Response Scaling Laws

The core idea is to model $p_{\text{Correct Choice}}$ and $\operatorname{pass@1}$ within the IRT framework. For the pre-training downstream scaling law, we employ a two-stage fitting procedure: first mapping pre-training compute $\mathrm{FLOP}$ to the ability $\theta$, and subsequently mapping $\theta$ to the benchmark performance $\mathrm{Performance}(i, \mathcal{D})$. Empirically, we observe that $\theta$ scales linearly with $\log\mathrm{FLOP}$ (Figure [12](https://arxiv.org/html/2606.07616#A2.F12)):

$$
\begin{aligned}
\theta_i &\approx a \cdot \log(\mathrm{FLOP}_i) + b, \\
\mathrm{Performance}(i, \mathcal{D}) &\approx \frac{1}{N}\sum_{j=1}^{N}\sigma(d_j \cdot (\theta_i - z_j)),
\end{aligned} \tag{3}
$$

where $a, b$, and $\theta_i$ are LM-specific parameters, and $d_j$ and $z_j$ are question-specific parameters. Specifically, for the baseline scenario where $\mathrm{Performance}(i, \mathcal{D})$ is measured by accuracy, we employ Binary-IRT with binary responses. For our approach, where $\mathrm{Performance}(i, \mathcal{D})$ is measured by $p_{\text{Correct Choice}}$, we employ Beta-IRT with empirical probability responses.

With calibrated item parameters, adaptive testing enables the efficient estimation of a new LM's ability using fewer questions, facilitating the rapid derivation of its pre-training downstream scaling law. Furthermore, IRSL offers generalizability across benchmarks. For a target benchmark $\mathcal{D}'$ sharing the same measurement objective as $\mathcal{D}$, the $\theta$ estimated from $\mathcal{D}$ is transferable. This allows for the prediction of performance on $\mathcal{D}'$ via $\mathrm{Performance}(i, \mathcal{D}') \approx \frac{1}{N'}\sum_{j=1}^{N'}\sigma(d_j' \cdot (\theta_i - z_j'))$, obviating the need to collect empirical responses from LM $i$ on $\mathcal{D}'$.
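The two stages of Equation (3), together with the transfer to a second benchmark, can be sketched as below. The ability values, item parameters, and compute budgets are hypothetical; ordinary least squares is assumed for the linear fit, and the natural logarithm is assumed for $\log$ (the base only rescales $a$).

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fit_line(xs, ys):
    """Ordinary least squares for y ~ a * x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def predict_performance(theta, items):
    """Second line of Equation (3): mean of sigma(d_j * (theta - z_j))."""
    return sum(sigmoid(d * (theta - z)) for z, d in items) / len(items)


# Hypothetical abilities estimated for four checkpoints of one LM family.
flops = [1e18, 1e19, 1e20, 1e21]
thetas = [-1.0, -0.4, 0.2, 0.8]
a, b = fit_line([math.log(f) for f in flops], thetas)

# Extrapolate ability to a larger compute budget.
theta_target = a * math.log(1e22) + b

# Hypothetical calibrated items (difficulty z, discrimination d) for benchmark D
# and for a second benchmark D' assumed to share the measurement objective.
bench_d = [(-1.0, 1.0), (0.0, 1.2), (1.0, 0.8)]
bench_d_prime = [(0.5, 1.0), (1.5, 1.5)]
perf_d = predict_performance(theta_target, bench_d)
perf_d_prime = predict_performance(theta_target, bench_d_prime)
```

The same extrapolated ability is reused for both benchmarks; only the item parameters change, so no responses from the target LM on the second benchmark are needed for the prediction.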

For the test-time scaling law, we model the benchmark-level success rate by substituting the Beta-IRT predicted single-attempt probability for $\operatorname{pass@1}(i, j)$:

$$
\operatorname{pass@k}(i, \mathcal{D}) = \frac{1}{N}\sum_{j=1}^{N}\left(1 - (1 - \sigma(d_j \cdot (\theta_i - z_j)))^{k}\right), \tag{4}
$$

where $\theta_i$ is an LM-specific ability parameter estimated per benchmark, and $d_j$ and $z_j$ are question-specific parameters. Similar to pre-training downstream scaling, our approach enables efficient estimation of a new LM's ability using fewer questions, and the ability can generalize across different benchmarks sharing the measurement objective. Furthermore, in test-time scaling, a binary response tensor of shape $M \times N \times K$ is first collected, where $K$ denotes the total number of samples. This tensor is averaged across the sample dimension to yield an empirical probability response matrix. In this setting, we empirically find that Beta-IRT facilitates the efficient estimation of a new LM's ability using significantly fewer samples, further enhancing query efficiency.
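The test-time pipeline can be sketched in two pieces: averaging the binary tensor over the sample axis, and evaluating Equation (4) from an ability and item parameters. The toy tensor, the ability value, and the item parameters are hypothetical, and the Beta-IRT fit that would connect the two pieces is omitted here.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def empirical_pass_at_1(samples):
    """Average an M x N x K binary tensor (nested lists) over the sample axis."""
    return [[sum(s) / len(s) for s in model] for model in samples]


def irsl_pass_at_k(theta, items, k):
    """Equation (4): mean over questions of 1 - (1 - sigma(d * (theta - z)))^k."""
    total = 0.0
    for z, d in items:
        p1 = sigmoid(d * (theta - z))
        total += 1.0 - (1.0 - p1) ** k
    return total / len(items)


# Toy binary response tensor with M = 2 LMs, N = 2 questions, K = 4 samples.
samples = [
    [[1, 1, 0, 1], [0, 0, 1, 0]],
    [[0, 1, 0, 0], [0, 0, 0, 0]],
]
P = empirical_pass_at_1(samples)  # M x N empirical probability responses

# Hypothetical ability and calibrated items (difficulty z, discrimination d).
theta_i = 0.5
items = [(0.0, 1.0), (2.0, 1.0)]
curve = [irsl_pass_at_k(theta_i, items, k) for k in (1, 10, 100)]
```

In a full pipeline, the matrix `P` would be the input to Beta-IRT calibration, and the resulting $\theta_i$, $d_j$, and $z_j$ would replace the hand-picked values used for `curve`.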

## 4 Experiments

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/power_beta_bernoulli_comparison.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/power_beta_bernoulli_2pl_comparison.png)

Figure 2: Beta-IRT achieves reliable calibration with as few as 2 test takers, requiring 30–60$\times$ fewer than Binary-IRT. We report RMSE (Left) and Correlation (Right) for both the 1PL model (Top) and the 2PL model (Bottom) as a function of the number of test takers $M$. Error bars indicate $\pm 1$ standard deviation over 10 trials.

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/decision_accuracy/decision_accuracy_all_benches.png)

Figure 3: Beta-IRT provides more robust scaling law estimates, especially on lower-quality benchmarks. Decision Accuracy vs. Proportion of Target FLOPs across 10 benchmarks. We iteratively fit scaling laws by including larger models and extrapolating to the target size to predict benchmark accuracy rankings. Results are averaged over five random train-test splits. Black lines denote Traditional Scaling; Blue and Red lines denote IRSL 1PL and 2PL, respectively. Dashed lines indicate binary responses ($\mathrm{Acc}$), while solid lines indicate empirical probability responses ($p_{\text{Correct Choice}}$).

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis_2pl/corr_scatter/prob_corr_arc_challenge.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis_2pl/corr_scatter/prob_corr_arc_easy.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis_2pl/corr_scatter/prob_corr_boolq.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis_2pl/corr_scatter/prob_corr_csqa.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis_2pl/corr_scatter/prob_corr_hellaswag.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis_2pl/corr_scatter/prob_corr_mmlu.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis_2pl/corr_scatter/prob_corr_openbookqa.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis_2pl/corr_scatter/prob_corr_piqa.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis_2pl/corr_scatter/prob_corr_socialiqa.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis_2pl/corr_scatter/prob_corr_winogrande.png)

Figure 4: Beta-IRT effectively captures the underlying response structure across all 10 benchmarks. Correlation between Beta-IRT 2PL predicted $p_{\text{Correct Choice}}$ (x-axis) and empirical $p_{\text{Correct Choice}}$ (y-axis), visualized using 2-D KDE contour plots. The Pearson correlation coefficient ($\rho$) is reported for each benchmark, with marginal histograms showing the $p_{\text{Correct Choice}}$ distribution. The corresponding results for 1PL are provided in Figure [16](https://arxiv.org/html/2606.07616#A2.F16) in Appendix [B](https://arxiv.org/html/2606.07616#A2).

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/generalization/generalization_law_curve_openbookqa_dclm-baseline-top-fw-3p.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/generalization/generalization_across_arc_dclm-baseline-top-fw-3p.png)

Figure 5: IRSL accurately predicts scaling trends on harder sets using the ability estimated from easy sets alone. (Left) Within-benchmark transfer on OpenBookQA. (Right) Cross-benchmark transfer from ARC Easy to ARC Challenge. Solid lines represent the Ground Truth (GT) scaling curves, while dashed lines represent the estimated curves where LM ability is derived solely from the easy set.

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/generalization/fig6_density.png)

Figure 6: The ability $\theta$ estimated by IRSL is robustly transferable across benchmark sets. MAE distribution for hard set estimation across all benchmarks and LM data mixtures. We report the MAE between the ground truth scaling curve and the estimated curve for two settings: Within-Benchmark Transfer (blue) and Cross-Benchmark Transfer (red). See Figure [19](https://arxiv.org/html/2606.07616#A2.F19) for the full results.

In Section [4.1](https://arxiv.org/html/2606.07616#S4.SS1), we conduct a simulation study to demonstrate the superior sample efficiency of Beta-IRT. In Section [4.2](https://arxiv.org/html/2606.07616#S4.SS2), we demonstrate the advantages of the Item Response Scaling Law (IRSL) for pre-training downstream scaling, and in Section [4.3](https://arxiv.org/html/2606.07616#S4.SS3), we preliminarily validate its effectiveness for test-time scaling.

### 4\.1Sample Efficiency of Beta\-IRT

To quantify the information gain provided by empirical response probabilities, we conduct controlled simulations comparing the standard Binary-IRT with our proposed Beta-IRT for both 1PL and 2PL models. We generate true abilities $\theta_{i}\sim\mathcal{N}(0,1)$ for $M$ test takers and question difficulties $z_{j}\sim\mathcal{N}(0,1)$ for $N=100$ questions. For the 2PL model, question discriminations are sampled from $d_{j}\sim\text{LogNormal}(0,0.5)$. We simulate binary response matrices $Y_{ij}\sim\text{Bernoulli}(p_{ij})$ and empirical probability matrices $P_{ij}=p_{ij}+\varepsilon_{ij}$, where the noise term $\varepsilon_{ij}\sim\mathcal{N}(0,0.01^{2})$ mimics empirical uncertainty.

We vary the number of test takers $M$ across the set $\{2,4,8,16,32,64,128\}$, a range chosen to reflect the typical availability of test takers in LM evaluation. We report the Root Mean Square Error (RMSE) and Pearson correlation coefficient ($\rho$) between the estimated and true item parameters, averaged over 10 independent trials. Figure [2](https://arxiv.org/html/2606.07616#S4.F2) illustrates the substantial sample efficiency advantage of Beta-IRT. In the 1PL setting, Beta-IRT achieves near-perfect parameter recovery (RMSE $<0.05$, $\rho>0.999$) with as few as $M=2$ test takers. In contrast, Binary-IRT requires significantly larger sample sizes ($M\geq 64$) to attain comparable accuracy. The 2PL model exhibits a similar trend: Beta-IRT 2PL maintains an RMSE $<0.7$ across all sample sizes, while Binary-IRT 2PL begins with a high error and only approaches the performance of Beta-IRT 2PL at $M=128$. These findings confirm that Beta-IRT significantly improves sample efficiency for calibration, reducing the high computational costs associated with large-scale LM benchmarking.
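The data-generating process of this simulation can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: it takes $p_{ij}=\sigma(d_{j}(\theta_{i}-z_{j}))$ in line with Equation (4), reads the second LogNormal argument as the standard deviation of the underlying normal, and clips the noisy probabilities into the open interval $(0,1)$; the function name `simulate_responses` is hypothetical. Fitting Binary-IRT or Beta-IRT to the generated matrices is not shown.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def simulate_responses(M, N=100, two_pl=False, noise_sd=0.01, seed=0):
    """Draw abilities, item parameters, a binary matrix Y and a noisy probability matrix P."""
    rng = random.Random(seed)
    theta = [rng.gauss(0.0, 1.0) for _ in range(M)]  # abilities ~ N(0, 1)
    z = [rng.gauss(0.0, 1.0) for _ in range(N)]      # difficulties ~ N(0, 1)
    # Assumption: LogNormal(0, 0.5) uses sigma = 0.5; 1PL fixes discrimination at 1.
    d = [rng.lognormvariate(0.0, 0.5) if two_pl else 1.0 for _ in range(N)]
    eps = 1e-6  # assumption: clip so that P stays strictly inside (0, 1)
    Y, P = [], []
    for i in range(M):
        y_row, p_row = [], []
        for j in range(N):
            p = sigmoid(d[j] * (theta[i] - z[j]))
            y_row.append(1 if rng.random() < p else 0)  # Bernoulli(p_ij)
            noisy = p + rng.gauss(0.0, noise_sd)        # p_ij + eps_ij
            p_row.append(min(1.0 - eps, max(eps, noisy)))
        Y.append(y_row)
        P.append(p_row)
    return theta, z, d, Y, P
```

Calling `simulate_responses(M=2)` yields the smallest setting studied above: two test takers, each with 100 binary responses and 100 noisy probability responses.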

### 4\.2Pre\-training Downstream IRSL

We use the data suite from DataDecide (Magnusson et al., [2025](https://arxiv.org/html/2606.07616#bib.bib41)), a large-scale controlled experiment on pre-training downstream scaling. The objective is to identify which of the 25 pre-training data mixtures yields the highest benchmark accuracy for the target model size (here, 1B). Because LMs are expensive to pretrain, standard practice involves fitting scaling laws on smaller models and extrapolating to the target size. The suite comprises models pre-trained on 25 data mixtures across 14 model sizes, ranging from 4M to 1B parameters. Each run includes 6 to 30 checkpoints depending on the model size, resulting in a total of 6,612 model checkpoints. All checkpoints are evaluated on 10 multiple-choice benchmarks, totaling 37,682 questions. From this, we extract two response matrices of shape $6612\times 37682$: a binary response matrix and an empirical probability response matrix. We randomly select 5 data mixtures to serve as the train set for calibration. The remaining 20 data mixtures constitute the test set for adaptive testing, where we estimate the ability $\theta$ using a budget of only 50 questions per benchmark.

We evaluate the effectiveness of a scaling law method using Decision Accuracy, a metric that quantifies rank consistency. Let $\mathcal{P}$ denote the set of all pairs of data mixtures $(A,B)$ in the test set. Let $y$ and $\hat{y}$ represent the ground truth benchmark accuracy at the 1B target size and the predicted performance extrapolated from the scaling law, respectively. Decision Accuracy is defined as:

$$\text{Decision Accuracy}=\frac{1}{|\mathcal{P}|}\sum_{(A,B)\in\mathcal{P}}\mathbb{I}\left(\mathrm{sign}(\hat{y}_{A}-\hat{y}_{B})=\mathrm{sign}(y_{A}-y_{B})\right).\tag{5}$$

We iteratively include larger models for the scaling law fitting and extrapolate to the target size to predict the benchmark accuracy rankings. Figure [3](https://arxiv.org/html/2606.07616#S4.F3) reports the Decision Accuracy against the proportion of target FLOPs across 10 benchmarks. We compare six scaling law methods: traditional scaling law (using $\mathrm{Acc}$ or $p_{\text{Correct Choice}}$ via Equation [1](https://arxiv.org/html/2606.07616#S3.E1)) and IRSL (Binary-IRT and Beta-IRT, using 1PL and 2PL variants via Equation [3](https://arxiv.org/html/2606.07616#S3.E3)). On ARC Challenge, ARC Easy, and MMLU, Beta-IRT matches the strong performance of Traditional $p_{\text{Correct Choice}}$, and they outperform other methods. On CommonsenseQA, OpenBookQA, PIQA, and SocialIQA, Beta-IRT demonstrates superior reliability, outperforming other methods. On BoolQ, HellaSwag, and WinoGrande, Beta-IRT fails to capture a predictive trend. We attribute this to benchmark homogeneity: the calibrated item parameters on these benchmarks are highly concentrated (e.g., BoolQ: $\sigma_{z}=0.19$, $\sigma_{d}=0.14$), meaning nearly all questions have similar difficulty and discrimination. As a result, the Test Information Function (TIF) per item is low, limiting IRT’s ability to differentiate model abilities. This contrasts with benchmarks like ARC Challenge ($\sigma_{z}=0.55$, $\sigma_{d}=0.61$), where diverse items provide substantially more information (see Figure [20](https://arxiv.org/html/2606.07616#A3.F20) in Appendix [C](https://arxiv.org/html/2606.07616#A3)). These observations align with the findings of Heineman et al. ([2025](https://arxiv.org/html/2606.07616#bib.bib43)), a follow-up study on DataDecide that introduces a signal-to-noise ratio to assess benchmark quality in downstream scaling. Specifically, we find that Beta-IRT ties with Traditional $p_{\text{Correct Choice}}$ on high-quality benchmarks, outperforms other methods on lower-quality benchmarks, and fails to capture a trend on extremely noisy benchmarks. We conclude that Beta-IRT provides a more robust estimate of the scaling law curve with limited query budget, especially for benchmarks with lower quality. We report the scaling curve fitting for the six methods in Figures [13](https://arxiv.org/html/2606.07616#A2.F13), [14](https://arxiv.org/html/2606.07616#A2.F14), and [15](https://arxiv.org/html/2606.07616#A2.F15) in Appendix [B](https://arxiv.org/html/2606.07616#A2).
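The Decision Accuracy metric of Equation (5) can be sketched in a few lines of Python. The dictionary inputs keyed by data-mixture name and the function name `decision_accuracy` are illustrative assumptions; ties are counted as agreement only when both the predicted and the true differences are exactly zero.

```python
from itertools import combinations


def sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def decision_accuracy(y_true: dict, y_pred: dict) -> float:
    """Fraction of mixture pairs (A, B) whose predicted ordering matches the true ordering."""
    pairs = list(combinations(sorted(y_true), 2))  # all unordered pairs of data mixtures
    agree = sum(
        sign(y_pred[a] - y_pred[b]) == sign(y_true[a] - y_true[b])
        for a, b in pairs
    )
    return agree / len(pairs)
```

With true accuracies `{"A": 0.5, "B": 0.4, "C": 0.3}` and predictions `{"A": 0.45, "B": 0.5, "C": 0.2}`, only the pair (A, B) is ranked incorrectly, so two of the three pairs agree.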

We report the strong correlation between Beta-IRT predicted $p_{\text{Correct Choice}}$ and the empirical $p_{\text{Correct Choice}}$ on the test set, as illustrated in Figure [4](https://arxiv.org/html/2606.07616#S4.F4) for the 2PL variant and Figure [16](https://arxiv.org/html/2606.07616#A2.F16) for the 1PL variant. We conclude that Beta-IRT effectively captures the underlying response structure. We further report the Beta-IRT curve on single questions in Figures [17](https://arxiv.org/html/2606.07616#A2.F17) and [18](https://arxiv.org/html/2606.07616#A2.F18) in Appendix [B](https://arxiv.org/html/2606.07616#A2).

Next, we demonstrate the generalizability of IRSL across benchmark sets with different difficulties. We partition each benchmark into an easy and a hard subset based on the mean $p_{\text{Correct Choice}}$ of each question across all LM checkpoints. We estimate the ability $\theta$ of each LM checkpoint using only the easy subset. Then, using these $\theta$ estimates alongside the calibrated item parameters of the hard subset, we generate the scaling curve for the hard subset without accessing the responses. Figure [5](https://arxiv.org/html/2606.07616#S4.F5) (Left) illustrates this within-benchmark transfer for OpenBookQA on a representative LM data mixture. We further demonstrate cross-benchmark transfer in Figure [5](https://arxiv.org/html/2606.07616#S4.F5) (Right), showing that $\theta$ estimated on ARC Easy effectively predicts the scaling curve on ARC Challenge. Figure [6](https://arxiv.org/html/2606.07616#S4.F6) reports the distribution of Mean Absolute Error (MAE) between the ground truth and the estimated scaling curve on the hard sets across all benchmarks and data mixtures (full results in Figure [19](https://arxiv.org/html/2606.07616#A2.F19)). We conclude that the ability estimated by IRSL is transferable, enabling reliable performance forecasting on benchmark sets with the same measurement objective. In Appendix [D](https://arxiv.org/html/2606.07616#A4), we further discuss how to assess similar measurement objectives and show that cross-benchmark transfer may be more broadly applicable.
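The easy-to-hard transfer described above can be illustrated with a simplified sketch. It does not use the paper's Beta-IRT likelihood: as a stand-in, it estimates $\theta$ by matching the mean predicted probability on the easy items to the observed mean via bisection (valid when all discriminations are positive), and then predicts the hard-subset mean from the hard items' calibrated parameters. All function names and the toy item parameters are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def mean_p(theta: float, d: list, z: list) -> float:
    """Mean predicted probability of a correct response over a set of items."""
    return sum(sigmoid(dj * (theta - zj)) for dj, zj in zip(d, z)) / len(d)


def estimate_theta(observed_mean: float, d_easy: list, z_easy: list,
                   lo: float = -10.0, hi: float = 10.0, iters: int = 60) -> float:
    """Stand-in estimator: bisection on theta, since mean_p is increasing in theta."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if mean_p(mid, d_easy, z_easy) < observed_mean:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def predict_hard_mean(observed_easy_mean: float, d_easy: list, z_easy: list,
                      d_hard: list, z_hard: list) -> float:
    """Estimate theta from the easy subset only, then score the hard subset."""
    theta_hat = estimate_theta(observed_easy_mean, d_easy, z_easy)
    return mean_p(theta_hat, d_hard, z_hard)
```

Repeating `predict_hard_mean` for each checkpoint along a training run would trace out an estimated hard-subset scaling curve without touching the hard-subset responses.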

### 4\.3Test\-time IRSL

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/testtime/3_law_curve/resmat2/mmlu_pro/law_curve_after_filter_gemma-3-27b-it.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/testtime/3_law_curve/resmat2/aime2024/law_curve_after_filter_Qwen3-4B.png)

Figure 7: IRSL yields more reliable test-time scaling estimates than traditional approaches given a limited query budget. Comparison of three test-time scaling curves: Ground Truth, Traditional scaling law, and IRSL, for two representative LM-Benchmark pairs in the test set. We plot $-\log\operatorname{pass@k}$ against the number of samples $k$.

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/testtime/irsl_testtime_resmat2_bench_distributions.png)

Figure 8: IRSL consistently outperforms Traditional scaling across nearly all LM-benchmark pairs. We visualize the distribution of the performance gap $\text{Traditional MAE}-\text{IRSL MAE}$ on four benchmarks across 100 random train-test splits. The distributions are consistently concentrated to the right of the zero line (red line), which indicates that IRSL achieves a lower MAE and thus provides a more accurate estimate.

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/testtime/2_cat_analysis/resmat2/corr_scatter/resmat2_aime2024_prob_corr.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/testtime/2_cat_analysis/resmat2/corr_scatter/resmat2_aime2025_prob_corr.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/testtime/2_cat_analysis/resmat2/corr_scatter/resmat2_global_mmlu_lite_prob_corr.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/testtime/2_cat_analysis/resmat2/corr_scatter/resmat2_mmlu_pro_prob_corr.png)

Figure 9: Beta-IRT predicted $\operatorname{pass@1}$ strongly correlates with empirical $\operatorname{pass@1}$ across all test-time benchmarks. Correlation between Beta-IRT 1PL predicted $\operatorname{pass@1}$ (x-axis) and empirical $\operatorname{pass@1}$ (y-axis), visualized using 2-D KDE contour plots. The Pearson correlation coefficient ($\rho$) is reported for each benchmark. The corresponding results for the 2PL variant are provided in Figure [24](https://arxiv.org/html/2606.07616#A5.F24).

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/testtime/5_generalization_law_curve/irsl_testtime_resmat2/mmlu_pro/hardeasy_irsl_testtime_resmat2_Qwen3-32B_mmlu_pro.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/testtime/6_generalization_across_aime/generalization_across_aime_Qwen3-4B.png)

Figure 10: Test-time IRSL ability transfers reliably from easy to hard sets and across benchmarks. (Left) Within-benchmark transfer on MMLU Pro. (Right) Cross-benchmark transfer from AIME 2024 to AIME 2025. The close alignment between Hard GT and Hard Est demonstrates that the test-time scaling trend on harder sets can be reliably forecasted using ability parameters estimated from the easy sets.

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/testtime/irsl_testtime_resmat2_transfer.png)

Figure 11: Consistently low MAE confirms that test-time IRSL ability is transferable across difficulty levels. We report the MAE between the ground truth scaling curve and the estimated curve for two settings: Within-Benchmark Transfer (blue) and Cross-Benchmark Transfer (red). The consistently low MAE values indicate that the ability $\theta$ estimated by IRSL enables reliable performance forecasting on benchmark sets with the same measurement objective.

We collect a binary response tensor of shape $12\times 120\times 2500$ (12 LMs, 120 questions from 4 benchmarks, 2500 samples, listed in Appendix [E](https://arxiv.org/html/2606.07616#A5)). We obtain the empirical $\operatorname{pass@1}$ response matrix by averaging over the last dimension. We filter out questions with extremely low $\operatorname{pass@1}$ as they offer no discriminatory power. In each train-test split, we randomly select 8 LMs to serve as the training set for calibration, while the remaining 4 LMs constitute the test set for adaptive testing with a query budget of 50 samples per question. Given the limited number of LMs for calibration, we report the 1PL model as our primary result and present the 2PL findings in Appendix [E](https://arxiv.org/html/2606.07616#A5) (the 2PL model typically requires more test takers to achieve reliable calibration).

We report three scaling curves ($-\log\operatorname{pass@k}$ versus the number of samples $k$) for LMs in the test set in Figure [7](https://arxiv.org/html/2606.07616#S4.F7): (1) The Ground Truth curve, where $\operatorname{pass@k}$ is estimated from all available samples using the unbiased and numerically stable estimator proposed by Chen et al. ([2021](https://arxiv.org/html/2606.07616#bib.bib61)): $\operatorname{pass@k}(i,j)\approx 1-\binom{H-c_{ij}}{k}/\binom{H}{k}$, where $H$ is the total number of samples and $c_{ij}$ is the number of correct samples by LM $i$ on question $j$. (2) The traditional scaling curve, where $\operatorname{pass@k}$ is estimated from the limited query budget via Equation [2](https://arxiv.org/html/2606.07616#S3.E2). (3) The IRSL curve, where the ability $\theta$ is estimated from the same limited query budget, and $\operatorname{pass@k}$ is subsequently derived using Equation [4](https://arxiv.org/html/2606.07616#S3.E4). As shown in Figure [7](https://arxiv.org/html/2606.07616#S4.F7), there is a high alignment between the IRSL curve and the Ground Truth curve. To quantify the advantage of IRSL over the traditional scaling law, we compute the MAE of $-\log\operatorname{pass@k}$ for both methods relative to the ground truth. We visualize the distribution of the performance gap $\text{Traditional MAE}-\text{IRSL MAE}$ in Figure [8](https://arxiv.org/html/2606.07616#S4.F8) across all benchmarks and test LMs over 100 random train-test splits. The performance gap is predominantly positive, confirming that IRSL yields more reliable test-time scaling law estimates given a limited query budget.
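The ground-truth estimator quoted above can be sketched as follows. The product form avoids evaluating large binomial coefficients and is algebraically equal to $1-\binom{H-c}{k}/\binom{H}{k}$. The helper `neg_log_curve`, which averages pass@k over questions before taking the negative logarithm, is an illustrative assumption about how a benchmark-level curve might be assembled; it presumes at least one correct sample overall, consistent with filtering out questions with extremely low pass@1.

```python
import math


def pass_at_k(H: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from H samples of which c are correct."""
    if H - c < k:
        return 1.0  # fewer than k incorrect samples: any k-subset contains a correct one
    prod = 1.0
    for n in range(H - c + 1, H + 1):
        prod *= 1.0 - k / n  # running product equals C(H - c, k) / C(H, k)
    return 1.0 - prod


def neg_log_curve(H: int, correct_counts: list, ks: list) -> list:
    """-log of the question-averaged pass@k for each k (hypothetical helper)."""
    curve = []
    for k in ks:
        mean_pass = sum(pass_at_k(H, c, k) for c in correct_counts) / len(correct_counts)
        curve.append(-math.log(mean_pass))
    return curve
```

For instance, `pass_at_k(4, 2, 2)` equals $1-\binom{2}{2}/\binom{4}{2}=5/6$.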

We report the strong correlation between Beta-IRT predicted $\operatorname{pass@1}$ and the empirical $\operatorname{pass@1}$ on the test set, as illustrated in Figure [9](https://arxiv.org/html/2606.07616#S4.F9) for the 1PL variant and Figure [24](https://arxiv.org/html/2606.07616#A5.F24) for the 2PL variant. We further report the Beta-IRT curve on single questions in Figures [25](https://arxiv.org/html/2606.07616#A5.F25) and [26](https://arxiv.org/html/2606.07616#A5.F26) in Appendix [E](https://arxiv.org/html/2606.07616#A5).

Next, we validate the generalizability of test-time IRSL across benchmark sets with different difficulty levels, following the same partitioning strategy used in our pre-training analysis. We estimate the ability $\theta$ using only the easy subset and transfer it to predict the scaling curve of the hard subset (or a harder benchmark) without accessing the response data. Figure [10](https://arxiv.org/html/2606.07616#S4.F10) illustrates this capability: the left panel shows within-benchmark transfer for MMLU Pro using Qwen3-32B, while the right panel demonstrates cross-benchmark transfer, where $\theta$ estimated on AIME 2024 effectively predicts performance on AIME 2025. To quantify robustness across all settings, Figure [11](https://arxiv.org/html/2606.07616#S4.F11) reports the distribution of the MAE between ground truth and estimated scaling curves for the hard sets across all benchmarks and LMs over 100 random train-test splits. The consistently low errors confirm that the ability parameters $\theta$ estimated by IRSL are robustly transferable, enabling reliable test-time forecasting on harder tasks sharing the same measurement objective.

## 5Limitations, Discussions, and Future Work

IRSL excels when benchmarks have heterogeneous question difficulty, evaluation budgets are limited, and cross-question generalization is needed. However, traditional scaling with $p_{\text{Correct Choice}}$ already performs well on high-quality benchmarks with smooth probability responses (e.g., ARC Challenge, MMLU). In such cases, IRSL offers comparable accuracy with added interpretability but may not justify calibration overhead if only aggregate metrics are needed. On extremely noisy benchmarks (e.g., BoolQ, HellaSwag), neither approach captures reliable trends. Unlike classical power-law models that extrapolate to unseen compute regimes, IRSL requires pre-calibrated item difficulties from prior model responses, limiting applicability to established benchmarks. Difficulties calibrated under one evaluation setup may also not transfer to different conditions. IRSL is thus best viewed as complementary to traditional scaling laws. The restricted data scale of the test-time scaling analysis is another primary limitation of this work.

In this work, we demonstrate that empirical probability information (either from noisy probability observations or from repeated sampling) provides additional signals that compensate for a limited number of test takers. In human testing, a test-taker sample size of 100 is typically insufficient for IRT, and practitioners can easily recruit more human test-takers. In contrast, LMs are relatively homogeneous (Kim et al., [2025](https://arxiv.org/html/2606.07616#bib.bib75)) and limited in number due to query cost. However, LMs naturally provide token probabilities and support repeated sampling, which are not feasible in human testing. A key insight is that, to achieve robust estimation, human testing increases the number of test takers, whereas LM evaluation leverages empirical probability.

Future work includes scaling up the test-time experimental setup, fitting shared latent abilities across benchmarks (Truong et al., [2025](https://arxiv.org/html/2606.07616#bib.bib7); Kipnis et al., [2025](https://arxiv.org/html/2606.07616#bib.bib58)), exploring alternative probabilistic models (e.g., Beta-Binomial, zero-inflated models), extending to other scaling laws (Ruan et al., [2024](https://arxiv.org/html/2606.07616#bib.bib3); Kaplan et al., [2020](https://arxiv.org/html/2606.07616#bib.bib20); Arora et al., [2025](https://arxiv.org/html/2606.07616#bib.bib45)), and polytomous IRT (Ostini and Nering, [2006](https://arxiv.org/html/2606.07616#bib.bib46)).

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## Acknowledgements

SK acknowledges support by NSF 2046795 and 2205329, IES R305C240046, ARPA-H, the MacArthur Foundation, Schmidt Sciences, HAI, OpenAI, Microsoft, and Google.

## References

- A. Arora, D. Jurafsky, C. Potts, and N. D. Goodman (2025). Bayesian scaling laws for in-context learning. arXiv:2410.16531. [Link](https://arxiv.org/abs/2410.16531)
- Y. Bahri, E. Dyer, J. Kaplan, J. Lee, and U. Sharma (2021). Explaining neural scaling laws. arXiv preprint arXiv:2102.06701.
- F. B. Baker (2001). The basics of item response theory. ERIC.
- A. Bhagia, J. Liu, A. Wettig, D. Heineman, O. Tafjord, A. H. Jha, L. Soldaini, N. A. Smith, D. Groeneveld, P. W. Koh, et al. (2024). Establishing task scaling laws via compute-efficient model ladders. arXiv preprint arXiv:2412.04403.
- S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, et al. (2023). Pythia: a suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp. 2397–2430.
- A. Birnbaum (1968). Some latent trait models and their use in inferring an examinee’s ability. In Statistical Theories of Mental Test Scores, F. M. Lord and M. Novick (Eds.), pp. 392–479.
- R. D. Bock and M. Aitkin (1981). Marginal maximum likelihood estimation of item parameters: application of an EM algorithm. Psychometrika 46(4), pp. 443–459.
- B. Brown, J. Juravsky, R. Ehrlich, R. Clark, Q. V. Le, C. Ré, and A. Mirhoseini (2024). Large language monkeys: scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787.
- T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems 33, pp. 1877–1901.
- R. P. Chalmers (2012). Mirt: a multidimensional item response theory package for the R environment. Journal of Statistical Software 48(6), pp. 1–29. [Link](https://www.jstatsoft.org/index.php/jss/article/view/v048i06), [Document](https://dx.doi.org/10.18637/jss.v048.i06)
- H. Chang (2015). Psychometrics behind computerized adaptive testing. Psychometrika 80(1), pp. 1–20.
- M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba (2021). Evaluating large language models trained on code. arXiv:2107.03374. [Link](https://arxiv.org/abs/2107.03374)
- Y. Chen, B. Huang, Y. Gao, Z. Wang, J. Yang, and H. Ji (2024). Scaling laws for predicting downstream performance in LLMs. arXiv preprint arXiv:2410.08527.
- Y. Chen, T. Silva Filho, R. B. C. Prudêncio, T. Diethe, and P. Flach (2019). $\beta^{3}$-IRT: a new item response model and its applications. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS). Note: arXiv:1903.04016.
- S. Ferrari and F. Cribari-Neto (2004). Beta regression for modelling rates and proportions. Journal of Applied Statistics 31(7), pp. 799–815. [Document](https://dx.doi.org/10.1080/0266476042000214501)
- S. Y. Gadre, G. Smyrnis, V. Shankar, S. Gururangan, M. Wortsman, R. Shao, J. Mercat, A. Fang, J. Li, S. Keh, R. Xin, M. Nezhurina, I. Vasiljevic, J. Jitsev, L. Soldaini, A. G. Dimakis, G. Ilharco, P. W. Koh, S. Song, T. Kollar, Y. Carmon, A. Dave, R. Heckel, N. Muennighoff, and L. Schmidt (2024). Language models scale reliably with over-training and on downstream tasks. arXiv:2403.08540. [Link](https://arxiv.org/abs/2403.08540)
- A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024). The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- V. Gupta, C. Ross, D. Pantoja, R. J. Passonneau, M. Ung, and A. Williams (2025). Improving model evaluation using smart filtering of benchmark datasets. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 4595–4615.
- D. Heineman, V. Hofmann, I. Magnusson, Y. Gu, N. A. Smith, H. Hajishirzi, K. Lo, and J. Dodge (2025). Signal and noise: a framework for reducing uncertainty in language model evaluation. arXiv:2508.13144. [Link](https://arxiv.org/abs/2508.13144)
- D. Hernandez, J. Kaplan, T. Henighan, and S. McCandlish (2021). Scaling laws for transfer. arXiv preprint arXiv:2102.01293.
- J. Hestness, S. Narang, N. Ardalani, G. Diamos, H. Jun, H. Kianinejad, M. M. A. Patwary, Y. Yang, and Y. Zhou (2017). Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409.
- J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, et al. (2022). Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
- V. Hofmann, D. Heineman, I. Magnusson, K. Lo, J. Dodge, M. Sap, P. W. Koh, C. Wang, H. Hajishirzi, and N. A. Smith (2025). Fluid language model benchmarking. arXiv:2509.11106. [Link](https://arxiv.org/abs/2509.11106)
- J. Hughes, S. Price, A. Lynch, R. Schaeffer, F. Barez, S. Koyejo, H. Sleight, E. Jones, E. Perez, and M. Sharma (2024). Best-of-n jailbreaking. arXiv:2412.03556. [Link](https://arxiv.org/abs/2412.03556)
- J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
- E. Kim, A. Garg, K. Peng, and N. Garg (2025). Correlated errors in large language models. arXiv preprint arXiv:2506.07962.
- A. Kipnis, K. Voudouris, L. M. S. Buschoff, and E. Schulz (2025). Metabench – a sparse benchmark of reasoning and knowledge in large language models. arXiv:2407.12844. [Link](https://arxiv.org/abs/2407.12844)
- N. Levi (2024). A simple model of inference scaling laws. arXiv preprint arXiv:2410.16377.
- F. M. Lord (1952). A Theory of Test Scores. Psychometric Corporation, Richmond, VA.
- F. M. Lord (1980). Applications of item response theory to practical testing problems. 1st edition, Routledge. [Document](https://dx.doi.org/10.4324/9780203056615)
- N. Lourie, M. Y. Hu, and K. Cho (2025). Scaling laws are unreliable for downstream tasks: a reality check. arXiv preprint arXiv:2507.00885.
- I. Magnusson, N. Tai, B. Bogin, D. Heineman, J. D. Hwang, L. Soldaini, A. Bhagia, J. Liu, D. Groeneveld, O. Tafjord, N. A. Smith, P. W. Koh, and J. Dodge (2025). DataDecide: how to predict best pretraining data with small experiments. arXiv:2504.11393. [Link](https://arxiv.org/abs/2504.11393)
- R\. R\. Meijer and M\. L\. Nering \(1999\)Computerized adaptive testing: Overview and introduction\.Applied Psychological Measurement23\(3\),pp\. 187–194\.Cited by:[§3\.1](https://arxiv.org/html/2606.07616#S3.SS1.p3.1)\.
- N\. Muennighoff, A\. Rush, B\. Barak, T\. Le Scao, N\. Tazi, A\. Piktus, S\. Pyysalo, T\. Wolf, and C\. A\. Raffel \(2024\)Scaling data\-constrained language models\.Advances in Neural Information Processing Systems36\.Cited by:[§2](https://arxiv.org/html/2606.07616#S2.SS0.SSS0.Px1.p1.1)\.
- R\. Ostini and M\.L\. Nering \(2006\)Polytomous item response theory models\.Polytomous Item Response Theory Models,SAGE Publications\.External Links:ISBN 9780761930686,LCCN 2005005274,[Link](https://books.google.com.hk/books?id=wS8VEMtJ3UYC)Cited by:[§5](https://arxiv.org/html/2606.07616#S5.p3.1)\.
- S\. Paech \(2024\)Creating magi: a hard subset of mmlu and agieval\.Note:[https://sampaech\.substack\.com/p/creating\-magi\-a\-hard\-subset\-of\-mmlu](https://sampaech.substack.com/p/creating-magi-a-hard-subset-of-mmlu)Cited by:[§2](https://arxiv.org/html/2606.07616#S2.SS0.SSS0.Px4.p1.1)\.
- Y\. Perlitz, E\. Bandel, A\. Gera, O\. Arviv, L\. E\. Dor, E\. Shnarch, N\. Slonim, M\. Shmueli\-Scheuer, and L\. Choshen \(2024\)Efficient benchmarking \(of language models\)\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 2519–2536\.Cited by:[§2](https://arxiv.org/html/2606.07616#S2.SS0.SSS0.Px4.p1.1)\.
- F\. M\. Polo, L\. Weber, L\. Choshen, Y\. Sun, G\. Xu, and M\. Yurochkin \(2024\)TinyBenchmarks: evaluating llms with fewer examples\.External Links:2402\.14992,[Link](https://arxiv.org/abs/2402.14992)Cited by:[§2](https://arxiv.org/html/2606.07616#S2.SS0.SSS0.Px4.p1.1)\.
- G\. Rasch \(1993\)Probabilistic models for some intelligence and attainment tests\.\.ERIC\.Cited by:[§3\.1](https://arxiv.org/html/2606.07616#S3.SS1.p1.15)\.
- Y\. Ruan, C\. J\. Maddison, and T\. Hashimoto \(2024\)Observational scaling laws and the predictability of language model performance\.arXiv preprint arXiv:2405\.10938\.Cited by:[§5](https://arxiv.org/html/2606.07616#S5.p3.1)\.
- R\. Schaeffer, J\. Kazdan, J\. Hughes, J\. Juravsky, S\. Price, A\. Lynch, E\. Jones, R\. Kirk, A\. Mirhoseini, and S\. Koyejo \(2025\)How do large language monkeys get their power \(laws\)?\.External Links:2502\.17578,[Link](https://arxiv.org/abs/2502.17578)Cited by:[§1](https://arxiv.org/html/2606.07616#S1.p2.3),[§2](https://arxiv.org/html/2606.07616#S2.SS0.SSS0.Px3.p1.1),[§3\.2](https://arxiv.org/html/2606.07616#S3.SS2.p3.21)\.
- R\. Schaeffer, H\. Schoelkopf, B\. Miranda, G\. Mukobi, V\. Madan, A\. Ibrahim, H\. Bradley, S\. Biderman, and S\. Koyejo \(2024\)Why has predicting downstream capabilities of frontier ai models with scale remained elusive?\.arXiv preprint arXiv:2406\.04391\.Cited by:[§1](https://arxiv.org/html/2606.07616#S1.p4.1),[§2](https://arxiv.org/html/2606.07616#S2.SS0.SSS0.Px2.p1.1),[§3\.2](https://arxiv.org/html/2606.07616#S3.SS2.p2.12)\.
- C\. Snell, J\. Lee, K\. Xu, and A\. Kumar \(2024\)Scaling llm test\-time compute optimally can be more effective than scaling model parameters\.arXiv preprint arXiv:2408\.03314\.Cited by:[§2](https://arxiv.org/html/2606.07616#S2.SS0.SSS0.Px3.p1.1)\.
- S\. Truong, Y\. Tu, P\. Liang, B\. Li, and S\. Koyejo \(2025\)Reliable and efficient amortized model\-based evaluation\.arXiv preprint arXiv:2503\.13335\.Cited by:[§1](https://arxiv.org/html/2606.07616#S1.p3.4),[§2](https://arxiv.org/html/2606.07616#S2.SS0.SSS0.Px4.p1.1),[§5](https://arxiv.org/html/2606.07616#S5.p3.1)\.
- W\. J\. Van der Linden, C\. A\. Glas,et al\.\(2000\)Computerized adaptive testing: theory and practice\.Vol\.13,Springer\.Cited by:[§3\.1](https://arxiv.org/html/2606.07616#S3.SS1.p1.15)\.
- R\. Vivek, K\. Ethayarajh, D\. Yang, and D\. Kiela \(2024\)Anchor points: benchmarking models with much fewer examples\.InProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 1576–1601\.Cited by:[§2](https://arxiv.org/html/2606.07616#S2.SS0.SSS0.Px4.p1.1)\.
- M\. Wu, R\. L\. Davis, B\. W\. Domingue, C\. Piech, and N\. Goodman \(2020\)Variational item response theory: fast, accurate, and expressive\.arXiv preprint arXiv:2002\.00276\.Cited by:[§3\.1](https://arxiv.org/html/2606.07616#S3.SS1.p2.10)\.

## Appendix A Pre\-training Downstream Scaling Law Metrics Calculation Details

In this section, we explain the calculation of the benchmark\-specific loss $L$, accuracy $\mathrm{Acc}$, and the average probability of the correct choice $p_{\text{Correct Choice}}$\. Consider a question from a multiple\-choice benchmark:

> \[Question Content\] A\. \[Choice A Content\] B\. \[Choice B Content\] C\. \[Choice C Content\] D\. \[Choice D Content\]

Assuming the correct answer is C, the metrics are calculated as follows:

- Benchmark\-specific Loss: Also known as bits per byte \(BPB\)\. For an individual question, this is calculated as the negative log\-likelihood of the token sequence corresponding to the correct choice content \(i\.e\., \[Choice C Content\]\) conditioned on the question content \(i\.e\., \[Question Content\]\), normalized by the length of the correct choice content in bytes\. The benchmark\-level value is the average across all questions\.
- Average Probability of Correct Choice: For an individual question, this measures the probability of the token sequence representing the correct choice content, conditioned on the question content, normalized by the character length of the choice\. The benchmark\-level value is the average across all questions\.
- Accuracy: Also known as cloze formulation accuracy or RC format accuracy\. This is determined by computing the probability of the token sequence for each choice content given the question content, normalized by the character length of each choice\. The choice with the highest probability is selected as the predicted answer\. The question is assigned a score of 1 if the prediction matches the correct choice, and 0 otherwise\. The benchmark\-level value is the average across all questions\.
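
The three per-question metrics above can be sketched as follows. This is a minimal illustration rather than the authors' evaluation code: the function name `question_metrics`, the conversion of the log-likelihood from nats to bits, and the choice to apply the character-length normalization in log space (dividing the log-probability by the number of characters) are all assumptions, and the log-probabilities in the example are made up.

```python
import math

def question_metrics(choice_logprobs, choice_texts, correct_idx):
    """Per-question BPB, p_correct, and accuracy (hypothetical helper).

    choice_logprobs[k] is the summed natural-log probability of the tokens of
    choice k, conditioned on the question content.
    """
    correct_lp = choice_logprobs[correct_idx]
    correct_text = choice_texts[correct_idx]

    # Loss: negative log-likelihood of the correct choice, in bits, per byte.
    n_bytes = len(correct_text.encode("utf-8"))
    bpb = -correct_lp / (math.log(2) * n_bytes)

    # Probability of the correct choice, length-normalized per character
    # (assumption: normalization is applied to the log-probability).
    p_correct = math.exp(correct_lp / len(correct_text))

    # Accuracy: pick the choice with the highest per-character log-probability.
    scores = [lp / len(text) for lp, text in zip(choice_logprobs, choice_texts)]
    predicted = max(range(len(scores)), key=scores.__getitem__)
    acc = 1.0 if predicted == correct_idx else 0.0

    return {"bpb": bpb, "p_correct": p_correct, "acc": acc}

# Made-up example: four choices, the third one ("green") is correct.
metrics = question_metrics(
    choice_logprobs=[-6.0, -4.0, -2.5, -8.0],
    choice_texts=["red", "blue", "green", "gold"],
    correct_idx=2,
)
```

Benchmark-level values would then be the mean of each entry over all questions in the benchmark.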

## Appendix B Additional Results for Pre\-training Downstream IRSL

Figure [12](https://arxiv.org/html/2606.07616#A2.F12) shows the empirical linear relationship between $\theta$ and $\log \mathrm{FLOP}$ for Beta\-IRT 2PL\. The trend is similar for the Binary\-IRT and 1PL variants\.

Figure [13](https://arxiv.org/html/2606.07616#A2.F13) shows the scaling curve fitting for traditional scaling law step 1\. Figure [14](https://arxiv.org/html/2606.07616#A2.F14) shows the scaling curve fitting for traditional scaling law step 2\. Figure [15](https://arxiv.org/html/2606.07616#A2.F15) shows the scaling curve fitting for IRSL step 1\. Following Bhagia et al\. \([2024](https://arxiv.org/html/2606.07616#bib.bib21)\), we fit step 1 only on final checkpoints for each model size, as the learning rate schedule prevents accurate FLOP estimation on intermediate checkpoints\.

Figure [16](https://arxiv.org/html/2606.07616#A2.F16) shows the correlation between Beta\-IRT 1PL predicted $p_{\text{Correct Choice}}$ and empirical $p_{\text{Correct Choice}}$\. Figures [17](https://arxiv.org/html/2606.07616#A2.F17) and [18](https://arxiv.org/html/2606.07616#A2.F18) show the Beta\-IRT curve on a randomly sampled question for 2PL and 1PL, respectively\.

Figure [19](https://arxiv.org/html/2606.07616#A2.F19) reports the MAE of hard set estimation across all benchmarks and LM data mixtures\.
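
As a rough illustration of IRSL step 1, the linear form $\theta_{i} \approx a \cdot \log(\mathrm{FLOP}_{i}) + b$ from the caption of Figure 15 can be fitted by ordinary least squares. The sketch below is not the authors' fitting code: the function names, the use of the natural logarithm, and the synthetic ability values are assumptions made for the example.

```python
import numpy as np

def fit_theta_vs_log_flop(flops, thetas):
    """Least-squares fit of theta ~ a * log(FLOP) + b (hypothetical helper)."""
    log_flop = np.log(np.asarray(flops, dtype=float))  # natural log is an assumption
    a, b = np.polyfit(log_flop, np.asarray(thetas, dtype=float), deg=1)
    return float(a), float(b)

def predict_theta(flops, a, b):
    """Extrapolate the latent ability to a new compute budget."""
    return a * np.log(np.asarray(flops, dtype=float)) + b

# Synthetic final-checkpoint abilities that follow an exact linear trend.
flops = np.array([1e17, 1e18, 1e19, 1e20])
thetas = 0.5 * np.log(flops) - 20.0

a, b = fit_theta_vs_log_flop(flops, thetas)
theta_at_1e21 = float(predict_theta(1e21, a, b))
```

In this setup only one point per model size enters the fit, mirroring the choice above to use final checkpoints only.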

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/theta_vs_flop/beta_2pl_dclm-baseline_arc_challenge_theta_vs_flop.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/theta_vs_flop/beta_2pl_dclm-baseline_arc_easy_theta_vs_flop.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/theta_vs_flop/beta_2pl_dclm-baseline_boolq_theta_vs_flop.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/theta_vs_flop/beta_2pl_dclm-baseline_csqa_theta_vs_flop.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/theta_vs_flop/beta_2pl_dclm-baseline_hellaswag_theta_vs_flop.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/theta_vs_flop/beta_2pl_dclm-baseline_mmlu_theta_vs_flop.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/theta_vs_flop/beta_2pl_dclm-baseline_openbookqa_theta_vs_flop.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/theta_vs_flop/beta_2pl_dclm-baseline_piqa_theta_vs_flop.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/theta_vs_flop/beta_2pl_dclm-baseline_socialiqa_theta_vs_flop.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/theta_vs_flop/beta_2pl_dclm-baseline_winogrande_theta_vs_flop.png)

Figure 12: $\theta$ scales linearly with $\log \mathrm{FLOP}$ across all benchmarks\. Beta\-IRT 2PL on the test set for a representative LM data mixture across all 10 benchmarks\. This linear trend is consistent across other data mixtures, as well as Binary\-IRT and 1PL variants\.

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/arc_challenge/classic_step1.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/arc_easy/classic_step1.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/boolq/classic_step1.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/csqa/classic_step1.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/hellaswag/classic_step1.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/mmlu/classic_step1.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/openbookqa/classic_step1.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/piqa/classic_step1.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/socialiqa/classic_step1.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/winogrande/classic_step1.png)

Figure 13: Traditional scaling law step 1: $L \approx \alpha \cdot \mathrm{FLOP}^{-\beta} + \gamma$\. Representative LM data mixture across all 10 benchmarks\. The trend is consistent across other data mixtures\.

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/arc_challenge/classic_step2.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/arc_easy/classic_step2.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/boolq/classic_step2.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/csqa/classic_step2.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/hellaswag/classic_step2.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/mmlu/classic_step2.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/openbookqa/classic_step2.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/piqa/classic_step2.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/socialiqa/classic_step2.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/winogrande/classic_step2.png)

Figure 14: Traditional scaling law step 2: $\mathrm{Performance}(i, \mathcal{D}) \approx a \cdot \sigma(b \cdot (L - l_{0})) + c$\. Representative LM data mixture across all 10 benchmarks\. The trend is consistent across other data mixtures\.

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/arc_challenge/irt_step1.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/arc_easy/irt_step1.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/boolq/irt_step1.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/csqa/irt_step1.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/hellaswag/irt_step1.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/mmlu/irt_step1.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/openbookqa/irt_step1.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/piqa/irt_step1.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/socialiqa/irt_step1.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/curve_fit/winogrande/irt_step1.png)

Figure 15: IRSL step 1: $\theta_{i} \approx a \cdot \log(\mathrm{FLOP}_{i}) + b$\. Representative LM data mixture across all 10 benchmarks\. The trend is consistent across other data mixtures\.

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis/corr_scatter/prob_corr_arc_challenge.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis/corr_scatter/prob_corr_arc_easy.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis/corr_scatter/prob_corr_boolq.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis/corr_scatter/prob_corr_csqa.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis/corr_scatter/prob_corr_hellaswag.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis/corr_scatter/prob_corr_mmlu.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis/corr_scatter/prob_corr_openbookqa.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis/corr_scatter/prob_corr_piqa.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis/corr_scatter/prob_corr_socialiqa.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis/corr_scatter/prob_corr_winogrande.png)

Figure 16: Beta\-IRT 1PL predicted $p_{\text{Correct Choice}}$ correlates strongly with empirical $p_{\text{Correct Choice}}$\.

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis_2pl/irt_curve/beta_arc_challenge_315.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis_2pl/irt_curve/beta_arc_easy_1213.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis_2pl/irt_curve/beta_boolq_1006.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis_2pl/irt_curve/beta_csqa_1035.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis_2pl/irt_curve/beta_hellaswag_2708.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis_2pl/irt_curve/beta_mmlu_11941.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis_2pl/irt_curve/beta_openbookqa_134.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis_2pl/irt_curve/beta_piqa_1168.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis_2pl/irt_curve/beta_socialiqa_1242.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis_2pl/irt_curve/beta_winogrande_1074.png)

Figure 17: Beta\-IRT 2PL curve on a single question for each benchmark\. The x\-axis is the ability parameter $\theta$, and the y\-axis is $p_{\text{Correct Choice}}$\. The red line shows the fitted Beta\-IRT curve\. The blue dots represent the empirical $p_{\text{Correct Choice}}$; each dot corresponds to an LM checkpoint in the test set\.

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis/irt_curve/beta_arc_challenge_315.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis/irt_curve/beta_arc_easy_1213.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis/irt_curve/beta_boolq_1006.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis/irt_curve/beta_csqa_1035.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis/irt_curve/beta_hellaswag_2708.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis/irt_curve/beta_mmlu_11941.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis/irt_curve/beta_openbookqa_134.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis/irt_curve/beta_piqa_1168.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis/irt_curve/beta_socialiqa_1242.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/5_cat_analysis/irt_curve/beta_winogrande_1074.png)

Figure 18: Beta\-IRT 1PL curve on a single question for each benchmark\.

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/generalization/hard_mae_heatmap_with_arc.png)

Figure 19: MAE of hard set estimation across all benchmarks and LM data mixtures\. We report the MAE between the ground truth scaling curve and the estimated curve on the hard sets\. The last row specifically corresponds to the cross\-benchmark transfer from ARC Easy to ARC Challenge\.

## Appendix C Benchmark Homogeneity Inspection for Pre\-training Downstream IRSL

To further explain why IRSL does not consistently outperform traditional scaling laws on certain benchmarks, we carry out an additional experiment on benchmark homogeneity\. Figure [20](https://arxiv.org/html/2606.07616#A3.F20) presents four rows of diagnostics per benchmark: \(1\) a response matrix heatmap of models versus questions, colored by probability of correct response; \(2\) the item difficulty distribution; \(3\) the item discrimination distribution; and \(4\) the Test Information Function \(TIF\)\. The TIF measures how precisely a benchmark estimates model ability at each point on the ability scale\. It is computed as the sum of the individual item information functions, where each item contributes more when its discrimination is high and its difficulty is well matched to the ability being estimated\. The shaded region marks the 5th–95th percentile of actual model abilities\.

On BoolQ and HellaSwag, the response matrices \(row 1\) show almost no structural gradient, consistent with their narrow difficulty distributions \(row 2; standard deviation of 0\.19 and 0\.29, respectively\) and low discrimination spread \(row 3; standard deviation of 0\.14 and 0\.30\)\. Because items are nearly homogeneous in both difficulty and discrimination, the aggregate TIF \(row 4\) is low and flat\.

ARC\-Challenge presents a starkly different picture\. The response matrix \(row 1\) shows clear diagonal stratification: a smooth gradient from easy to hard items\. Both the difficulty \(standard deviation of 0\.55\) and discrimination \(standard deviation of 0\.61\) distributions are substantially wider \(rows 2–3\)\. As a result, the TIF \(row 4\) exhibits a pronounced peak\.

We therefore view this not as a limitation of IRSL, but as a property of the benchmarks themselves\. IRSL is most effective when evaluation items are sufficiently diverse and informative, and we believe this finding itself contributes toward more principled benchmark design\.

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/resmat_heatmap_comparison.png)

Figure 20: Benchmark homogeneity analysis for BoolQ, HellaSwag, and ARC Challenge\. Top row: response matrix heatmaps with rows \(models\) sorted by mean $p_{\text{Correct Choice}}$ and columns \(items\) sorted by calibrated difficulty $z$\. Middle rows: histograms of calibrated item difficulty $z$ and discrimination $d$\. Bottom row: Test Information Function \(TIF\) per item, $I(\theta)/N$\. BoolQ and HellaSwag exhibit highly concentrated item parameters \($\sigma_{z} = 0.19, 0.29$; $\sigma_{d} = 0.14, 0.30$\), yielding near\-uniform response matrices and low per\-item information\. In contrast, ARC Challenge shows diverse item parameters \($\sigma_{z} = 0.55$, $\sigma_{d} = 0.61$\) and substantially higher TIF, enabling IRT to differentiate model abilities effectively\.

## Appendix D Construct Similarity and Cross\-Benchmark Transfer

As noted in the main text, the LM ability $\theta$ estimated from one benchmark can transfer to another benchmark with similar measurement objectives\. In this section, we show how to examine whether two benchmarks share similar measurement objectives from two complementary perspectives: empirical convergent validity and benchmark design\.

#### Empirical convergent validity\.

We provide direct empirical evidence via correlation plots of estimated LM ability $\theta$ across models\. As shown in Figure [21](https://arxiv.org/html/2606.07616#A4.F21), there are strong correlations between latent abilities estimated from paired benchmarks: $\rho = 0.99$ between ARC Easy and ARC Challenge \(pre\-training\) and $\rho = 0.80$ between AIME 2024 and AIME 2025 \(test\-time\)\. This confirms that these pairs share a consistently measured latent construct\. We extend this analysis to all benchmark pairs in Figure [22](https://arxiv.org/html/2606.07616#A4.F22)\. The full correlation heatmap shows that most of the 10 pre\-training benchmarks exhibit high pairwise $\theta$ correlations, with BoolQ as the notable exception \(BoolQ is known to have a low signal\-to\-noise ratio as a two\-choice benchmark \(Heineman et al\., [2025](https://arxiv.org/html/2606.07616#bib.bib43)\)\)\. This aligns with findings from Kipnis et al\. \([2025](https://arxiv.org/html/2606.07616#bib.bib58)\) that a single common factor underlies most benchmark scores, suggesting that cross\-benchmark transfer may be more broadly applicable\. For the test\-time benchmarks shown in Figure [23](https://arxiv.org/html/2606.07616#A4.F23), correlations are weaker, likely due to the limited experimental scale\.

#### Construct similarity as a design property\.

Beyond empirical validation, we argue that construct similarity is often a design\-level property established prior to evaluation\. The relevant construct \(e\.g\., mathematical reasoning, coding, domain\-specific knowledge, or general capability\) is defined upfront by whoever designs or uses the benchmark\. ARC Easy and ARC Challenge were explicitly built to assess the same scientific reasoning construct at different difficulty levels; AIME 2024 and 2025 share identical format and objectives\. This is analogous to how the community treats yearly administrations of standardized tests, such as the SAT, as measuring a consistent construct by design\.

For arbitrary benchmark pairs without clear design\-level similarity, we suggest that convergent validity should be empirically verified before transfer is attempted, for instance via the $\theta$ correlation analysis demonstrated above\.
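
A convergent validity check of this kind can be sketched as a Pearson correlation between the abilities that the same set of models receives on two benchmarks. The helper name `theta_correlation` and the ability values below are hypothetical; the sketch assumes that the two arrays are aligned so that position $i$ refers to the same model in both.

```python
import numpy as np

def theta_correlation(theta_source, theta_target):
    """Pearson correlation of per-model abilities on two benchmarks."""
    source = np.asarray(theta_source, dtype=float)
    target = np.asarray(theta_target, dtype=float)
    if source.ndim != 1 or source.shape != target.shape:
        raise ValueError("expected two 1-D arrays with one ability per model")
    if source.size < 2:
        raise ValueError("need at least two models to compute a correlation")
    return float(np.corrcoef(source, target)[0, 1])

# Made-up abilities for four models; the target scale is shifted and stretched,
# which leaves the Pearson correlation unchanged.
theta_on_source = np.array([-1.0, 0.0, 1.0, 2.0])
theta_on_target = 2.0 * theta_on_source + 1.0
rho = theta_correlation(theta_on_source, theta_on_target)
```

Because the correlation is invariant to shifts and positive rescaling, it tests whether two benchmarks rank and space models consistently, not whether their $\theta$ scales are numerically identical.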

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/prob_2pl_arc_easy_vs_arc_challenge.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/testtime/irsl_testtime_resmat2_aime2024_vs_aime2025.png)

Figure 21: The transfer benchmark pairs exhibit strong convergent validity in estimated LM ability\. The x\-axis shows the estimated ability $\theta$ on the source benchmark, and the y\-axis shows the estimated ability $\theta$ on the target benchmark\.

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/pretrain_downstream/prob_2pl_theta_corr_heatmap.png)

Figure 22: Most pre\-training benchmarks share a strongly aligned latent ability\. The x\-axis and y\-axis show pre\-training benchmarks, and each cell reports the Pearson correlation of estimated ability $\theta$ between the corresponding benchmark pair\.

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/testtime/irsl_testtime_resmat2_theta_corr_heatmap.png)

Figure 23: Test\-time benchmark abilities show weaker but still informative cross\-benchmark alignment\.

## Appendix E Additional Results for Test\-time IRSL

The 12 models used are: DeepSeek\-R1\-Distill\-Llama\-70B, DeepSeek\-R1\-Distill\-Llama\-8B, DeepSeek\-R1\-Distill\-Qwen\-14B, DeepSeek\-R1\-Distill\-Qwen\-32B, DeepSeek\-R1\-Distill\-Qwen\-7B, QwQ\-32B, Qwen3\-14B, Qwen3\-30B\-A3B, Qwen3\-32B, Qwen3\-4B, Qwen3\-8B, and gemma\-3\-27b\-it\. The 4 benchmarks used are: AIME 2024, AIME 2025, Global MMLU Lite, and MMLU Pro\.

Figure [24](https://arxiv.org/html/2606.07616#A5.F24) shows the correlation between Beta\-IRT 2PL predicted $\operatorname{pass@1}$ and empirical $\operatorname{pass@1}$\. Figure [25](https://arxiv.org/html/2606.07616#A5.F25) and Figure [26](https://arxiv.org/html/2606.07616#A5.F26) show the Beta\-IRT curve on a randomly sampled question for 1PL and 2PL, respectively\.

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/testtime/2_cat_analysis/resmat2_2pl/corr_scatter/resmat2_2pl_aime2024_prob_corr.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/testtime/2_cat_analysis/resmat2_2pl/corr_scatter/resmat2_2pl_aime2025_prob_corr.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/testtime/2_cat_analysis/resmat2_2pl/corr_scatter/resmat2_2pl_global_mmlu_lite_prob_corr.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/testtime/2_cat_analysis/resmat2_2pl/corr_scatter/resmat2_2pl_mmlu_pro_prob_corr.png)

Figure 24: Beta\-IRT 2PL predicted $\operatorname{pass@1}$ correlates strongly with empirical $\operatorname{pass@1}$\.

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/testtime/2_cat_analysis/resmat2/irt_curve/resmat2_aime2024_16.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/testtime/2_cat_analysis/resmat2/irt_curve/resmat2_aime2025_31.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/testtime/2_cat_analysis/resmat2/irt_curve/resmat2_global_mmlu_lite_75.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/testtime/2_cat_analysis/resmat2/irt_curve/resmat2_mmlu_pro_108.png)

Figure 25: Beta\-IRT 1PL curve on a single question for each test\-time benchmark\. The x\-axis is the ability parameter $\theta$, and the y\-axis is $\operatorname{pass@1}$\. The red line shows the fitted Beta\-IRT curve\. The blue dots represent the empirical $\operatorname{pass@1}$; each dot corresponds to an LM in the test set\.

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/testtime/2_cat_analysis/resmat2_2pl/irt_curve/resmat2_2pl_aime2024_10.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/testtime/2_cat_analysis/resmat2_2pl/irt_curve/resmat2_2pl_aime2025_47.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/testtime/2_cat_analysis/resmat2_2pl/irt_curve/resmat2_2pl_global_mmlu_lite_70.png)

![Refer to caption](https://arxiv.org/html/2606.07616v1/figures/testtime/2_cat_analysis/resmat2_2pl/irt_curve/resmat2_2pl_mmlu_pro_109.png)

Figure 26: Beta\-IRT 2PL curve on a single question for each test\-time benchmark\.