The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains
Summary
This paper argues that simple averaging in AI benchmarks fails under data sparsity and difficulty heterogeneity, proposing Item Response Theory (IRT) as a robust alternative to recover ground truth rankings.
# The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains
Source: [https://arxiv.org/html/2605.11205](https://arxiv.org/html/2605.11205)
Jung Min Kang Independent Researcher Seoul, South Korea
\(May 10, 2026\)
###### Abstract
Benchmark evaluation across AI and safety-critical domains overwhelmingly relies on simple averaging: a system's score is the arithmetic mean of its performance across test items. We demonstrate that this practice produces substantially misleading rankings when two conditions co-occur: (1) the evaluation matrix is sparse (not every system is tested on every item), and (2) items vary substantially in difficulty. Through controlled simulation experiments across four domains (Natural Language Processing (NLP/GLUE), clinical drug trials, autonomous vehicle (AV) safety, and cybersecurity product evaluation), we show that the Spearman rank correlation $\rho$ between simple-average rankings and ground-truth rankings degrades from $\rho=1.000$ at 100% data coverage to $\rho=0.809$ at 67% coverage with high difficulty heterogeneity (mean over 20 random seeds). In contrast, a standard two-parameter logistic (2PL) Item Response Theory (IRT) model maintains $\rho\geq 0.996$ across all four domains. We refer to this observed monotonic relationship as the Evaluation Failure Scaling Law: the accuracy of simple averaging is a decreasing function of the product of sparsity and difficulty gap, while IRT remains robust. A 150-condition grid sweep over sparsity $S\in[0,0.70]$ and difficulty gap $D\in[0.5,5.0]$ further confirms that ranking error forms a two-dimensional failure surface, with a strong positive $S\times D$ interaction in ranking error ($\gamma_{3}=+0.20$, $t=13.05$), while IRT maintains $\rho\geq 0.993$ across all grid conditions. We provide the complete experimental pipeline and discuss implications for Physical AI (robotics) benchmarking, where published evaluation matrices are often incomplete and difficulty gaps are extreme.
Our results constitute, to our knowledge, one of the first cross\-domain simulation studies demonstrating that IRT\-style estimation may be an important correction mechanism for fair evaluation in sparse, heterogeneous benchmark ecosystems\. We discuss implications for Physical AI and outline requirements for real\-world validation\.
Keywords: Item Response Theory, benchmark evaluation, simple averaging, data sparsity, cross-domain validation, Physical AI, evaluation methodology
## 1 Introduction
Benchmarks are the currency of progress in artificial intelligence\. The ranking of a model on a leaderboard determines publication venues, investment decisions, and deployment choices\. Yet the statistical methodology underlying nearly all benchmark rankings—computing the arithmetic mean of scores across test items—has remained largely unexamined outside of the psychometrics community\.
The fragility of simple averaging has been noted in specific contexts. Rodriguez et al. ([2021](https://arxiv.org/html/2605.11205#bib.bib19)) demonstrated that evaluation examples in NLP benchmarks are not equally informative and proposed IRT-based leaderboards. Polo et al. ([2024](https://arxiv.org/html/2605.11205#bib.bib17)) showed that IRT enables efficient evaluation with fewer examples. Zhou et al. ([2026](https://arxiv.org/html/2605.11205#bib.bib27)) revealed significant quality shortcomings in 11 LLM benchmarks using IRT analysis. More recently, Uzunoğlu et al. ([2025](https://arxiv.org/html/2605.11205#bib.bib25)) introduced the concept of benchmark harmony to quantify the non-uniformity of performance across subdomains.
However, much of the existing AI benchmark work using IRT has focused on single benchmark ecosystems or single domains, leaving open whether the failure modes of simple averaging can be characterized as a function of measurable evaluation-matrix properties across domains. Additionally, no study has characterized *when* simple averaging fails as a function of measurable dataset properties, making it impossible for practitioners to know whether their particular evaluation is trustworthy.
In this paper, we address both gaps\. We conduct controlled simulation experiments across four domains spanning the full spectrum of data density and difficulty heterogeneity:
1. NLP (GLUE benchmark): 100% data coverage, moderate difficulty heterogeneity. The easy case where simple averaging works.
2. Clinical drug trials: 65% coverage, high difficulty gap. Simulates the common scenario where not every drug is tested in every hospital.
3. Autonomous vehicle safety: 60% coverage, extreme difficulty gap across driving environments. Simulates heterogeneous reporting conditions where systems are not evaluated under identical scenarios.
4. Cybersecurity product evaluation: 67% coverage, extreme difficulty gap across attack types. Simulates the scenario where vendors are evaluated on different threat profiles.
Our key contribution is the identification and controlled simulation-based validation of what we term the Evaluation Failure Scaling Law:
> The rank-order accuracy of simple averaging degrades monotonically as a function of $S\times D$, where $S$ is the fraction of missing data (sparsity) and $D$ is the item difficulty gap. IRT-based estimation remains substantially more robust ($\rho\geq 0.993$) across the entire $(S,D)$ surface.
This law has immediate practical implications\. In Physical AI \(robotics\) benchmarking, where published evaluation matrices are often incomplete and difficulty heterogeneity can be extremeZhou et al\. \([2025](https://arxiv.org/html/2605.11205#bib.bib28)\); Liu et al\. \([2023](https://arxiv.org/html/2605.11205#bib.bib14)\), simple averaging is not merely imprecise—it is systematically biased in a predictable direction\.
Scope and non\-claims\.This paper does not claim to provide a completed Physical AI benchmark, nor does it claim that real\-world Physical AI evaluations necessarily follow a 2PL IRT model\. Instead, we use controlled simulations calibrated to realistic benchmark conditions to isolate a specific methodological failure mode: when systems are evaluated on sparse and difficulty\-biased subsets of items, simple averaging can produce misleading rankings\. IRT is evaluated here as a principled correction mechanism whose real\-world deployment requires episode\-level validation\.
## 2 Related Work
IRT in NLP evaluation\.The application of Item Response Theory to NLP evaluation was pioneered byLalor et al\. \([2016](https://arxiv.org/html/2605.11205#bib.bib10)\), who demonstrated that IRT gold standards provide more nuanced evaluation than majority voting\.Rodriguez et al\. \([2021](https://arxiv.org/html/2605.11205#bib.bib19)\)scaled this to full benchmark analysis, proposing IRT\-based leaderboards that jointly model item difficulty, discriminability, and subject ability using the 2PL model\. Their open\-source implementationRodriguez et al\. \([2021b](https://arxiv.org/html/2605.11205#bib.bib20)\)provided the foundation for subsequent work\.Lalor et al\. \([2024](https://arxiv.org/html/2605.11205#bib.bib12)\)presented a comprehensive tutorial on IRT for NLP at EACL 2024\.Polo et al\. \([2024](https://arxiv.org/html/2605.11205#bib.bib17)\)leveraged IRT to construct tiny benchmark subsets that reproduce full\-benchmark rankings with as few as 100 examples\.Zhou et al\. \([2026](https://arxiv.org/html/2605.11205#bib.bib27)\)proposed PSN\-IRT, a neural IRT variant, and conducted the most comprehensive analysis to date across 11 LLM benchmarks with 41,871 items\.Uebayashi et al\. \([2026](https://arxiv.org/html/2605.11205#bib.bib3)\)extended IRT to multimodal benchmarks with M3IRT, decomposing ability and difficulty into modality\-specific components\. However, much of this work has focused on single benchmark ecosystems or closely related domains, leaving open whether simple\-averaging failure can be characterized as a function of measurable evaluation\-matrix properties across domains\.
Critiques of averaging in benchmarks\.The limitations of simple averaging have been discussed from multiple angles\.Uzunoğlu et al\. \([2025](https://arxiv.org/html/2605.11205#bib.bib25)\)introduced benchmark HARMONY, measuring performance uniformity across subdomains, and showed that less harmonious benchmarks produce misleading results\. However, their proposed solution \(reporting harmony alongside accuracy\) does not address the ranking problem—it diagnoses but does not cure\. The flaw of averages concept from decision scienceSavage \([2009](https://arxiv.org/html/2605.11205#bib.bib22)\)provides theoretical grounding: when the relationship between inputs and outputs is nonlinear \(as with item difficulty\), plans based on averages fail on average\.
IRT beyond NLP\.Outside AI evaluation, IRT has been applied to motor carrier safety assessment by the U\.S\. Federal Motor Carrier Safety Administration \(FMCSA\), which conducted a multi\-year study comparing IRT to their existing Safety Measurement SystemsFMCSA \([2021](https://arxiv.org/html/2605.11205#bib.bib8)\)\.Luo et al\. \([2025](https://arxiv.org/html/2605.11205#bib.bib23)\)introduced MedIRT for item\-aware evaluation across medical benchmarks, demonstrating that IRT\-based rankings outperform accuracy\-based rankings across six external medical benchmarks\.Truong et al\. \([2025](https://arxiv.org/html/2605.11205#bib.bib24)\)proposed amortized model\-based evaluation using IRT with learned difficulty predictors\. Concurrently, recent work on efficient agent benchmarkingNdzomga \([2026](https://arxiv.org/html/2605.11205#bib.bib29)\)proposed evaluating AI agents on mid\-difficulty task subsets \(30–70% pass rate\) motivated by IRT, reducing evaluation cost by 44–70%\. Their approach selects*which*tasks to evaluate; our approach corrects rankings when task selection is not under the evaluator’s control—a complementary but fundamentally different problem\. To our knowledge, no prior work has systematically compared IRT performance against simple averaging across multiple domains as a function of data sparsity and difficulty heterogeneity\.
Physical AI benchmarking\.The Physical AI benchmark landscape is characterized by extreme fragmentation\.Liu et al\. \([2023](https://arxiv.org/html/2605.11205#bib.bib14)\)introduced LIBERO, now the de facto standard for vision\-language\-action \(VLA\) model evaluation\.Zhou et al\. \([2025](https://arxiv.org/html/2605.11205#bib.bib28)\)extended this with LIBERO\-PRO, revealing that models achieving\>\>90% accuracy under standard evaluation collapse to 0% under perturbation\.Fei et al\. \([2025](https://arxiv.org/html/2605.11205#bib.bib7)\)further extended robustness analysis across seven perturbation axes\. Despite this progress, no widely adopted unified aggregation methodology exists: each paper reports results on different subsets of models and tasks, creating a sparse evaluation matrix that renders cross\-paper comparison via simple averaging fundamentally unreliable\.
## 3 Methodology
### 3.1 Problem Formulation
Consider an evaluation matrix $R\in\{0,1,\text{NA}\}^{J\times I\times K}$, where $J$ systems (subjects) are evaluated on $I$ items (tasks or conditions) with $K$ binary trials each. The entry $R_{jik}=1$ if system $j$ succeeds on the $k$-th trial of item $i$, $R_{jik}=0$ if it fails, and NA if that (system, item) pair was never evaluated. The observation mask $M\in\{0,1\}^{J\times I}$ indicates which pairs are observed: $M_{ji}=1$ if system $j$ was tested on item $i$, and $M_{ji}=0$ otherwise.
The data coverage \(density\) is defined as:
$$C=\frac{\sum_{j,i}M_{ji}}{J\times I}\in[0,1] \tag{1}$$
The difficulty gap is the range of true item difficulties:
$$D=\max_{i}b_{i}-\min_{i}b_{i} \tag{2}$$
where $b_{i}$ is the difficulty parameter of item $i$ (defined below).
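As a minimal sketch of Equations (1) and (2), the snippet below computes coverage $C$, sparsity $S=1-C$, and difficulty gap $D$ for a small mask. The mask and the difficulty values are hypothetical illustrations, not data from the experiments.

```python
def coverage(mask):
    """Fraction of (system, item) pairs that were observed (Equation 1)."""
    total_pairs = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total_pairs


def difficulty_gap(b):
    """Range of item difficulties (Equation 2)."""
    return max(b) - min(b)


# Hypothetical 3-system x 4-item mask: 1 = tested, 0 = never evaluated.
mask = [
    [1, 1, 0, 0],  # system 0 tested on the two easiest items only
    [1, 1, 1, 1],  # system 1 tested on everything
    [0, 0, 1, 1],  # system 2 tested on the two hardest items only
]
b = [-1.5, -0.5, 0.5, 2.0]  # hypothetical item difficulties

C = coverage(mask)       # 8 observed pairs out of 12
S = 1 - C                # sparsity, the missing-data fraction
D = difficulty_gap(b)    # 2.0 - (-1.5) = 3.5
print(f"C={C:.3f}  S={S:.3f}  D={D:.1f}  S*D={S * D:.3f}")
```

Both quantities are cheap to compute from a published results table, which is what makes them usable as a pre-check before trusting an averaged leaderboard.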
The evaluation goal is to produce a ranking $\hat{\pi}$ over the $J$ systems that maximizes the Spearman rank correlation $\rho(\hat{\pi},\pi^{*})$ with the ground-truth ranking $\pi^{*}$.
### 3.2 Simple Averaging Baseline
The standard approach computes, for each system $j$, the mean success rate across observed items:
$$\bar{r}_{j}=\frac{1}{\sum_{i}M_{ji}}\sum_{i:M_{ji}=1}\frac{1}{K}\sum_{k=1}^{K}R_{jik} \tag{3}$$
Systems are then ranked by $\bar{r}_{j}$. This score supports an unbiased comparison between systems only when either (a) $M_{ji}=1$ for all $(j,i)$, or (b) the missing entries are missing completely at random (MCAR) with respect to item difficulty. In practice, neither condition typically holds: systems that are tested only on easy items receive inflated scores, while systems tested on hard items are penalized.
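The baseline of Equation (3) can be sketched in a few lines. The data layout (a nested list with `None` for unobserved pairs) and the two example systems are assumptions chosen for illustration.

```python
def simple_average(R, M):
    """Mean per-item success rate over observed items only (Equation 3).

    R[j][i] is a list of K binary trial outcomes, or None when the
    (system, item) pair was never evaluated; M[j][i] is the 0/1 mask.
    """
    scores = []
    for trials_row, mask_row in zip(R, M):
        rates = [sum(trials) / len(trials)
                 for trials, observed in zip(trials_row, mask_row) if observed]
        scores.append(sum(rates) / len(rates))
    return scores


# Hypothetical two-item benchmark: item 0 is easy, item 1 is hard (K = 10).
R = [
    [[1] * 9 + [0], None],            # system 0: 9/10 on the easy item only
    [None, [1] * 7 + [0] * 3],        # system 1: 7/10 on the hard item only
]
M = [[1, 0], [0, 1]]

scores = simple_average(R, M)
print(scores)  # system 0 scores higher although it never faced the hard item
```

The score ignores which items produced it, so the two numbers are not comparable unless both systems faced the same items.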
### 3.3 2PL Item Response Theory
The two-parameter logistic (2PL) IRT model Lord and Novick ([1968](https://arxiv.org/html/2605.11205#bib.bib15)); Baker and Kim ([2004](https://arxiv.org/html/2605.11205#bib.bib1)) posits that the probability of system $j$ succeeding on item $i$ is:
$$P(R_{jik}=1\mid\theta_{j},a_{i},b_{i})=\sigma\big(a_{i}(\theta_{j}-b_{i})\big) \tag{4}$$
where $\sigma(x)=1/(1+e^{-x})$ is the logistic function, $\theta_{j}\in\mathbb{R}$ is the ability of system $j$, $b_{i}\in\mathbb{R}$ is the difficulty of item $i$, and $a_{i}>0$ is the discrimination of item $i$ (how effectively it separates high-ability from low-ability systems).
The key insight is that IRT jointly estimates $\theta_{j}$, $a_{i}$, and $b_{i}$ from the observed data, automatically adjusting for the difficulty of each system's test items. A system that scores 70% on hard items ($b_{i}\gg 0$) can receive a higher $\theta_{j}$ than a system scoring 90% on easy items ($b_{i}\ll 0$).
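To make Equation (4) concrete, the sketch below evaluates the 2PL success probability for two hypothetical systems. The ability and difficulty values are illustrative assumptions chosen so that the stronger system shows the lower raw success rate.

```python
import math


def p_success(theta, a, b):
    """2PL success probability sigma(a * (theta - b)) from Equation (4)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical values: a strong system on a hard item, a weak one on an easy item.
theta_strong, b_hard = 2.0, 1.2
theta_weak, b_easy = 0.0, -2.2
a = 1.0  # common discrimination, assumed for simplicity

p_strong = p_success(theta_strong, a, b_hard)  # about 0.69
p_weak = p_success(theta_weak, a, b_easy)      # about 0.90
print(f"strong system on hard item: {p_strong:.2f}")
print(f"weak system on easy item:   {p_weak:.2f}")
```

The raw rates invert the ability ordering, which is the situation the model is meant to correct once item difficulties are estimated.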
Estimation. Parameters are estimated by maximizing the joint log-likelihood over all observed trials:
$$\mathcal{L}=\sum_{j=1}^{J}\sum_{i:M_{ji}=1}\sum_{k=1}^{K}\big[R_{jik}\log P_{ji}+(1-R_{jik})\log(1-P_{ji})\big] \tag{5}$$
where $P_{ji}=\sigma(a_{i}(\theta_{j}-b_{i}))$. We optimize using L-BFGS-B with regularization priors $\theta_{j}\sim\mathcal{N}(0,1)$, $b_{i}\sim\mathcal{N}(0,2)$, and $\log a_{i}\sim\mathcal{N}(0,0.5)$.
Handling missing data. IRT handles missing data through the likelihood formulation: the sums in Equation (5) run only over observed $(j,i)$ pairs. No imputation is required. The model leverages the structure of observed responses (which items a system succeeded and failed on) to estimate ability even from incomplete profiles. However, this does not mean arbitrary missingness is harmless: stable estimation requires sufficient overlap among systems and items (each system observed on at least 2 items, each item observed for at least 3 systems in our experiments), and missing-not-at-random patterns may still bias estimates if they violate the assumed response model.
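A minimal sketch of the likelihood in Equation (5) with missing pairs skipped is given below. It aggregates the $K$ trials of each pair into a success count, omits the prior terms and the optimizer, and uses a hypothetical one-system example, so it illustrates the masking logic and should not be read as the paper's estimation code.

```python
import math


def log_likelihood(theta, a, b, successes, K):
    """Equation (5) with the K trials of each pair aggregated.

    successes[j][i] is the number of successes out of K trials, or None
    when the (system, item) pair is unobserved and must be skipped.
    """
    total = 0.0
    for j, row in enumerate(successes):
        for i, s in enumerate(row):
            if s is None:  # unobserved pair: contributes nothing, no imputation
                continue
            p = 1.0 / (1.0 + math.exp(-a[i] * (theta[j] - b[i])))
            total += s * math.log(p) + (K - s) * math.log(1.0 - p)
    return total


# Hypothetical data: one system, two items, the second item never tested.
successes = [[7, None]]
K = 10
a, b = [1.0, 1.0], [0.0, 0.0]

ll_at_zero = log_likelihood([0.0], a, b, successes, K)
ll_at_fit = log_likelihood([math.log(0.7 / 0.3)], a, b, successes, K)
print(ll_at_zero, ll_at_fit)  # the ability matching the 70% rate fits better
```

In a full fit, this function (plus log-prior penalties) would be negated and passed to a numerical optimizer over all of $\theta$, $a$, and $b$.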
Standard errors\.We compute standard errors from the diagonal of the inverse observed Fisher information matrix, approximated via finite\-difference Hessian evaluation at the MLE\.
## 4 Experimental Design
Our experimental strategy is a controlled simulation study. For each domain, we:
1. Define ground-truth parameters $(\theta^{*}_{j},a^{*}_{i},b^{*}_{i})$ calibrated to published data and domain expertise.
2. Generate a realistic observation mask $M$ reflecting domain-specific evaluation patterns.
3. Generate binary response data $R$ from the 2PL model with these parameters.
4. Estimate rankings using both simple averaging and 2PL IRT.
5. Evaluate both methods against the known ground-truth ranking $\pi^{*}$.

The use of simulation is deliberate: it is the only design that permits definitive comparison, because only with synthetic data do we have access to the ground-truth ranking $\pi^{*}$. We calibrate simulation parameters to published real-world data to ensure ecological validity. This approach is standard in the IRT literature De Ayala ([2009](https://arxiv.org/html/2605.11205#bib.bib4)); Embretson and Reise ([2000](https://arxiv.org/html/2605.11205#bib.bib6)) and the missing-data literature Rubin ([1976](https://arxiv.org/html/2605.11205#bib.bib21)).
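The simulation steps can be sketched as follows for the simple-averaging half of the pipeline (the IRT fit of step 4 is omitted). All parameter values, the mask, and the helper names `simulate` and `simple_average` are hypothetical choices for illustration and are not the calibrated settings used in the experiments.

```python
import math
import random


def simulate(theta, a, b, mask, K, seed=0):
    """Step 3: draw K Bernoulli trials per observed pair from the 2PL model."""
    rng = random.Random(seed)
    R = []
    for j, th in enumerate(theta):
        row = []
        for i in range(len(b)):
            if mask[j][i] == 0:
                row.append(None)  # NA: pair never evaluated
                continue
            p = 1.0 / (1.0 + math.exp(-a[i] * (th - b[i])))
            row.append([1 if rng.random() < p else 0 for _ in range(K)])
        R.append(row)
    return R


def simple_average(R):
    """Step 4 (baseline only): mean success rate over observed items."""
    scores = []
    for row in R:
        rates = [sum(t) / len(t) for t in row if t is not None]
        scores.append(sum(rates) / len(rates))
    return scores


# Step 1: hypothetical ground truth (system 0 is best, system 2 is weakest).
theta = [2.0, 0.5, -0.3]
a = [1.0, 1.0, 1.0, 1.0]
b = [-1.5, -0.5, 0.5, 1.5]
# Step 2: difficulty-biased mask (weakest system sees only the easy items).
mask = [[0, 0, 1, 1], [1, 1, 1, 1], [1, 1, 0, 0]]

R = simulate(theta, a, b, mask, K=2000)
avg = simple_average(R)
print([round(x, 3) for x in avg])  # step 5: compare this order with theta
```

With this mask the weakest system is averaged over easy items only and ends up above the middle system, which is evaluated on every item.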
### 4.1 Domain 1: NLP (GLUE Benchmark)
Design rationale. The GLUE benchmark Wang et al. ([2018](https://arxiv.org/html/2605.11205#bib.bib26)) provides the control condition: all models are evaluated on all tasks (100% coverage), and task difficulty heterogeneity is moderate. Under these conditions, we expect simple averaging to perform well.
Parameters. We use 12 NLP models (ELMo through DeBERTa) and 8 GLUE tasks. Ground-truth ability parameters $\theta^{*}_{j}$ are derived from published leaderboard scores Devlin et al. ([2019](https://arxiv.org/html/2605.11205#bib.bib5)); Liu et al. ([2019](https://arxiv.org/html/2605.11205#bib.bib13)); Raffel et al. ([2020](https://arxiv.org/html/2605.11205#bib.bib18)); He et al. ([2021](https://arxiv.org/html/2605.11205#bib.bib9)). Task difficulties $b^{*}_{i}$ are calibrated so that CoLA and RTE are hard ($b>0.5$) and SST-2 and QQP are easy ($b<-0.5$), consistent with known GLUE task properties. Full coverage: $C=1.00$.
Data generation. $K=500$ binary trials per (model, task) pair, yielding $12\times 8\times 500=48{,}000$ observations.
### 4.2 Domain 2: Clinical Drug Trials
Design rationale. Clinical trials routinely produce sparse evaluation matrices: not every drug is tested at every hospital (site), and hospitals treat patient populations of vastly different severity. This creates the conditions for Simpson's paradox, where a drug that appears inferior overall is actually superior when stratified by hospital difficulty.
Parameters. 10 drugs, 6 hospitals. Ground-truth drug efficacies $\theta^{*}_{j}$ range from $-1.5$ to $+2.0$. Hospital difficulties $b^{*}_{i}$ range from $-1.0$ (community clinic, mild cases) to $+1.5$ (ICU, severe cases). A *Fake Miracle Drug* ($\theta^{*}=-0.2$, mediocre) is tested only at easy hospitals. A *True Miracle Drug* ($\theta^{*}=+2.0$, best) is tested only at hard hospitals. Coverage: $C=0.65$.
Data generation. $K=200$ binary trials (patient outcomes) per observed (drug, hospital) pair.
### 4.3 Domain 3: Autonomous Vehicle Safety
Design rationale. AV safety evaluation exhibits both sparsity (not every system is tested in every driving environment) and extreme difficulty heterogeneity. Published evaluations often cover different subsets of driving conditions, creating missingness patterns that may correlate with system strengths or reporting incentives. Autonomous driving evaluation illustrates heterogeneous reporting conditions: DMV disengagement reports include counts, circumstances, locations, and autonomous miles California DMV ([2023](https://arxiv.org/html/2605.11205#bib.bib2)), but these are not equivalent to a controlled benchmark where all systems face identical scenarios.
Parameters. 10 AV systems, 6 driving environments. Environments range from Sunny Suburb ($b^{*}=-1.5$) to Snowy Intersection ($b^{*}=+1.0$). A *Fake Safe AV* ($\theta^{*}=-0.3$) is tested only in the three easiest environments. A *True Safe AV* ($\theta^{*}=+2.0$) is tested only in the three hardest. Coverage: $C=0.60$.
Data generation. $K=1000$ binary safety interactions per observed pair.
### 4.4 Domain 4: Cybersecurity Product Evaluation
Design rationale. Cybersecurity product evaluations suffer from both sparsity and difficulty heterogeneity. Products are tested against different threat profiles (DDoS, phishing, ransomware, APT), and the difficulty gap between automated attacks and sophisticated APTs is enormous. A product that performs well on simple attacks but poorly on sophisticated threats could be misleadingly favored by aggregate rankings if task difficulty is ignored.
Parameters. 8 security products, 6 attack types. Attack difficulties range from Port Scan ($b^{*}=-1.5$) to Nation-State APT ($b^{*}=+2.0$). A *Fake Secure* product ($\theta^{*}=-0.5$) is tested only against easy attacks (Port Scan, DDoS, Basic Phishing). A *True Secure* product ($\theta^{*}=+2.0$) is tested predominantly against hard attacks (Ransomware, Zero-Day, APT). Coverage: $C=0.67$.
Data generation. $K=500$ binary detection outcomes per observed pair.
### 4.5 Evaluation Metrics
For each domain, we report: Spearman's $\rho$ (rank correlation between estimated ranking and ground-truth ranking); $\Delta\rho=\rho_{\text{IRT}}-\rho_{\text{avg}}$ (the improvement of IRT over simple averaging); critical rank displacement (whether the most dangerous ranking error is corrected by IRT); and parameter recovery (correlation between estimated and true item difficulty parameters).
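For readers who want to reproduce the headline metric, the sketch below computes Spearman's $\rho$ as the Pearson correlation of average ranks, which also handles ties. The example score vectors are hypothetical; in practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman rho: Pearson correlation between the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((u - mx) * (v - my) for u, v in zip(rx, ry))
    var_x = sum((u - mx) ** 2 for u in rx)
    var_y = sum((v - my) ** 2 for v in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical example: true abilities vs. two sets of estimated scores.
truth = [1.0, 2.0, 3.0, 4.0]
scores_avg = [0.60, 0.70, 0.90, 0.80]   # top two systems swapped
scores_irt = [-1.1, 0.1, 0.9, 2.2]      # same order as the truth

rho_avg = spearman(truth, scores_avg)
rho_irt = spearman(truth, scores_irt)
delta_rho = rho_irt - rho_avg
print(rho_avg, rho_irt, delta_rho)
```

A single adjacent swap among four systems already lowers $\rho$ to 0.8, which gives a sense of scale for the values reported in the results.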
## 5 Results
### 5.1 Main Results
Table [1](https://arxiv.org/html/2605.11205#S5.T1) presents the cross-domain comparison. The pattern is consistent: as data coverage decreases and difficulty gap increases, simple averaging degrades while IRT maintains near-perfect rank recovery.
Table 1: Cross-domain evaluation results (mean $\pm$ std over 20 random seeds). Spearman $\rho$ between estimated and ground-truth rankings.

Domain 1: NLP (GLUE). At full coverage, both methods achieve $\rho=1.000$. IRT correctly identifies CoLA as the hardest task ($\hat{b}=+0.49$) and SST-2 as the easiest ($\hat{b}=-1.31$), consistent with established GLUE properties. *Conclusion: When the evaluation matrix is complete, simple averaging is adequate.*

Domain 2: Clinical Trials. With 65% coverage and strategic sparsity, simple averaging drops to $\rho=0.922\pm 0.029$ (mean $\pm$ std over 20 seeds). The Fake Miracle Drug, tested only at easy hospitals, is ranked #6 by simple average despite being #9 by true ability, an inflation of 3 positions. The True Miracle Drug, tested only at hard hospitals, is correctly identified as #1 by IRT ($\hat{\theta}=+1.97$). IRT achieves $\rho=0.996\pm 0.005$, near-perfect rank recovery. *Conclusion: Missing data coupled with difficulty heterogeneity introduces systematic bias that IRT substantially corrects.*

Domain 3: AV Safety. At 60% coverage with extreme difficulty heterogeneity across driving conditions, simple averaging produces $\rho=0.916\pm 0.016$. The True Safe AV, tested in the most demanding conditions (night, fog, snow), is displaced from #1 to #2 by simple averaging. IRT correctly ranks all 10 systems ($\rho=1.000\pm 0.000$), recovering the True Safe AV at rank #1. *Conclusion: Heterogeneous test-condition coverage systematically distorts rankings.*

Domain 4: Cybersecurity. The most extreme difficulty gap ($D=3.5$): $\rho_{\text{avg}}=0.809\pm 0.000$. DeepScan AI, predominantly evaluated against sophisticated threats, is displaced from #2 to #5 by simple average, a 3-position drop. Enterprise Shield, evaluated on more complete but easier threat profiles, is inflated from #3 to #1. IRT correctly recovers the full ranking ($\rho=1.000\pm 0.000$), placing True Secure at #1. *Conclusion: When difficulty gaps are extreme, simple averaging systematically rewards systems that avoid hard tests.*
### 5.2 The Evaluation Failure Scaling Law
Table [2](https://arxiv.org/html/2605.11205#S5.T2) shows the relationship between the Sparsity–Difficulty product $S\times D$ and ranking accuracy, where $S=1-C$ is the missing data fraction and $D$ is the difficulty gap. The data from our four domains reveal a clear functional relationship.

Table 2: Sparsity–Difficulty product and ranking accuracy (mean $\pm$ std, 20 seeds). The product $S\times D$ predicts the failure severity of simple averaging. IRT remains robust.

The monotonic degradation pattern is consistent across all four domain-calibrated conditions. In Section [5.3](https://arxiv.org/html/2605.11205#S5.SS3), we validate this pattern with a 150-condition grid sweep that confirms the $S\times D$ interaction as a strong and statistically significant contributor to ranking failure. IRT remains near-perfect throughout, with minimum mean $\rho=0.993$ across all 150 grid conditions.

The relationship shows that increasing the difficulty gap has a stronger effect on ranking degradation than increasing sparsity alone: AV Safety and Clinical Trials have similar sparsity but AV Safety's $\rho_{\text{avg}}$ is slightly lower due to more extreme biased missingness in the observation mask. Cybersecurity, with a substantially larger difficulty gap ($D=3.5$ vs $D=2.5$), shows the largest degradation ($\rho_{\text{avg}}=0.809$). This is consistent with the expected behavior when difficulty-biased missingness interacts with item heterogeneity.
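As a quick arithmetic check, the sketch below recomputes $S\times D$ for the three sparse domains from the coverage values and difficulty ranges stated in Section 4 and pairs each product with the reported mean $\rho_{\text{avg}}$. These products are recomputed here and may differ in rounding from the entries of Table 2. For GLUE, $C=1.00$ gives $S=0$, so the product is zero regardless of $D$.

```python
# (domain, coverage C, difficulty gap D, reported mean rho for simple averaging)
# D is max(b*) - min(b*) from the stated ranges: clinical 1.5 - (-1.0),
# AV 1.0 - (-1.5), cybersecurity 2.0 - (-1.5).
domains = [
    ("Clinical trials", 0.65, 2.5, 0.922),
    ("AV safety", 0.60, 2.5, 0.916),
    ("Cybersecurity", 0.67, 3.5, 0.809),
]

rows = []
for name, C, D, rho_avg in domains:
    S = 1.0 - C  # sparsity is the missing-data fraction
    rows.append((name, S * D, rho_avg))

rows.sort(key=lambda r: r[1])  # order by increasing S x D
for name, product, rho_avg in rows:
    print(f"{name:16s} S*D={product:.3f}  rho_avg={rho_avg:.3f}")

# Monotone pattern: larger S x D goes with lower rho_avg in these three cases.
is_monotone = all(rows[k][2] >= rows[k + 1][2] for k in range(len(rows) - 1))
```

The products come out at roughly 0.875, 1.000, and 1.155, and $\rho_{\text{avg}}$ falls in the same order.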
### 5.3 Grid Sweep Over Sparsity and Difficulty Gap
To test whether the four domain-calibrated examples reflect a broader pattern, we performed a systematic grid sweep over sparsity $S$ and item difficulty gap $D$. We varied $S$ from $0.00$ to $0.70$ in increments of $0.05$ (15 values) and $D$ from $0.50$ to $5.00$ in increments of $0.50$ (10 values), yielding $15\times 10=150$ grid conditions. For each condition, we generated 15 independent datasets under both difficulty-biased missingness and MCAR missingness, then compared simple averaging against 2PL IRT using Spearman rank correlation with the known ground-truth ability ranking.
Figure [1](https://arxiv.org/html/2605.11205#S5.F1) presents the resulting failure surface. Under difficulty-biased missingness (Panel A), simple averaging remains reliable in the low-sparsity, low-difficulty-gap region but degrades sharply as both $S$ and $D$ increase, reaching $\rho=0.24$ at the most extreme condition ($S=0.70$, $D=5.0$). In contrast, IRT (Panel B) remains near-perfect across the same grid, with minimum mean $\rho=0.993$. The MCAR control (Panel C) shows milder degradation (minimum $\rho=0.77$), confirming that severe evaluation failure is driven primarily by the interaction of sparsity, difficulty heterogeneity, and *non-random* missingness. Panel D shows the IRT advantage $\Delta\rho=\rho_{\text{IRT}}-\rho_{\text{avg}}$, concentrated in the high-$S$, high-$D$ region.
Figure 1: Evaluation failure surface over sparsity and item difficulty gap (150 grid conditions, 15 seeds per cell, $J=10$ systems, $I=10$ items, $K=100$ trials). A: Under difficulty-biased missingness, simple averaging degrades sharply as sparsity and difficulty gap jointly increase. B: IRT remains substantially more stable ($\rho\geq 0.993$). C: MCAR missingness produces milder degradation, indicating that severe failure arises primarily from biased missingness. D: The IRT advantage $\Delta\rho$ concentrates in the high-sparsity, high-difficulty region where simple averaging is most unreliable.

To quantify this interaction, we fit a centered interaction regression on ranking error $(1-\rho_{\text{avg}})$:
$$1-\rho_{\text{avg}}=\gamma_{0}+\gamma_{1}S_{c}+\gamma_{2}D_{c}+\gamma_{3}(S_{c}\times D_{c})+\varepsilon \tag{6}$$
where $S_{c}=S-\bar{S}$ and $D_{c}=D-\bar{D}$ are centered variables. The results (Table [3](https://arxiv.org/html/2605.11205#S5.T3)) show a strongly significant positive interaction: $\gamma_{3}=+0.199$ ($t=13.05$), confirming that the $S\times D$ interaction contributes substantially to ranking failure beyond the main effects of sparsity and difficulty gap alone. The model explains 77.7% of the variance ($R^{2}=0.777$). A one-dimensional power-law approximation $1-\rho_{\text{avg}}=\alpha(S\cdot D)^{\beta}$ fits less well ($R^{2}=0.587$). We therefore interpret the Evaluation Failure Scaling Law as an empirical failure surface rather than a finalized closed-form law.

For MCAR missingness, the same interaction regression yields a smaller interaction coefficient ($\gamma_{3}=+0.063$, $t=15.49$) and explains 87.8% of the variance ($R^{2}=0.878$), with substantially less total degradation. At $S\geq 0.40$, difficulty-biased missingness produces mean ranking error of $0.118$ compared to $0.057$ under MCAR, an additional $0.061$ error attributable to the missingness mechanism.
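The centered interaction regression of Equation (6) can be sketched on the same 15 by 10 grid, assuming an ordinary least-squares fit. The error surface below is synthetic and its coefficients are hypothetical, chosen only to show that the fit recovers them; it is not the simulation output. On a balanced grid with centered predictors the four regressors are mutually orthogonal, so each least-squares coefficient reduces to a one-column projection.

```python
# Grid from the sweep: S = 0.00..0.70 step 0.05, D = 0.5..5.0 step 0.5.
S_vals = [0.05 * k for k in range(15)]
D_vals = [0.5 * k for k in range(1, 11)]
grid = [(s, d) for s in S_vals for d in D_vals]

S_bar = sum(s for s, _ in grid) / len(grid)
D_bar = sum(d for _, d in grid) / len(grid)

# Design rows: intercept, centered S, centered D, and their product.
rows = [(1.0, s - S_bar, d - D_bar, (s - S_bar) * (d - D_bar)) for s, d in grid]

# Synthetic, noise-free ranking error with HYPOTHETICAL coefficients.
true_gamma = (0.10, 0.20, 0.05, 0.20)
y = [sum(g * x for g, x in zip(true_gamma, row)) for row in rows]

# Orthogonal columns: each OLS coefficient is <column, y> / <column, column>.
gamma_hat = [
    sum(row[c] * yi for row, yi in zip(rows, y)) / sum(row[c] ** 2 for row in rows)
    for c in range(4)
]
print([round(g, 6) for g in gamma_hat])
```

Centering is what makes $\gamma_{1}$ and $\gamma_{2}$ interpretable as effects at the grid mean, with $\gamma_{3}$ capturing how the sparsity effect changes with the difficulty gap.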
Table 3: Interaction regression on ranking error $1-\rho_{\text{avg}}$ (150 grid cells). Centered variables: $S_{c}=S-\bar{S}$, $D_{c}=D-\bar{D}$.
### 5.4 Item Parameter Recovery
Across all four domains, IRT correctly recovers the ground-truth item difficulty ordering (Spearman $\rho(b^{*},\hat{b})=1.000$ in all cases). The estimated discrimination parameters $\hat{a}_{i}$ correctly identify which items best separate high-ability from low-ability systems: in NLP, CoLA has the highest discrimination ($\hat{a}=3.21$) and SST-2 the lowest ($\hat{a}=1.08$), consistent with CoLA's known ability to separate model quality; in clinical trials, ICU (severe patients) has the highest discrimination and the community clinic the lowest; in AV safety, dense urban night driving has the highest discrimination ($\hat{a}=3.69$) and sunny suburb the lowest ($\hat{a}=1.56$); in cybersecurity, Nation-State APT has the highest discrimination and port scanning the lowest.
These recovered parameters are not merely statistical artifacts—they provide actionable diagnostic information\. In AV safety, they tell regulators which test conditions to mandate\. In cybersecurity, they tell CISOs which threat types matter most for product differentiation\.
## 6Analysis
### 6\.1Why Simple Averaging Fails
The failure of simple averaging under sparsity can be understood through a decomposition\. For systemjjwith abilityθj\\theta\_\{j\}, the expected value of the simple average is:
𝔼\[r¯j\]=1\|Ij\|∑i∈Ijσ\(ai\(θj−bi\)\)\\mathbb\{E\}\[\\bar\{r\}\_\{j\}\]=\\frac\{1\}\{\|I\_\{j\}\|\}\\sum\_\{i\\in I\_\{j\}\}\\sigma\(a\_\{i\}\(\\theta\_\{j\}\-b\_\{i\}\)\)\(7\)whereIj=\{i:Mji=1\}I\_\{j\}=\\\{i:M\_\{ji\}=1\\\}is the set of items observed for systemjj\. WhenIjI\_\{j\}is a biased subset—containing predominantly easy items \(lowbib\_\{i\}\) or predominantly hard items \(highbib\_\{i\}\)—the expected value𝔼\[r¯j\]\\mathbb\{E\}\[\\bar\{r\}\_\{j\}\]is a biased estimator of system ability\. Critically, the bias direction depends on which items are observed, not on the system’s true ability\.
Consider two systems: system A ($\theta_{A}=2.0$, tested only on hard items with $\bar{b}_{A}=1.0$) and system B ($\theta_{B}=-0.5$, tested only on easy items with $\bar{b}_{B}=-1.5$). Evaluating each system at the mean difficulty of its observed items, with a common discrimination $a=1$:

$$\mathbb{E}[\bar{r}_{A}]=\sigma(a(2.0-1.0))\approx 0.73\tag{8}$$

$$\mathbb{E}[\bar{r}_{B}]=\sigma(a(-0.5+1.5))\approx 0.73\tag{9}$$

Both systems appear identical despite a 2.5-unit ability gap. This is precisely the mechanism producing the Fake Miracle Drug / Fake Safe AV / Fake Secure scenarios in our experiments.
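The arithmetic in Equations 8 and 9 can be checked directly. The following is a minimal sketch, assuming a common discrimination of $a=1$ and that every observed item sits exactly at the stated mean difficulty; the helper name `expected_average` and the choice of three items per system are illustrative and not taken from the paper's code.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function used by the 2PL model."""
    return 1.0 / (1.0 + math.exp(-x))


def expected_average(theta: float, items: list[tuple[float, float]]) -> float:
    """Expected simple average (Equation 7) over observed (a_i, b_i) pairs."""
    return sum(sigmoid(a * (theta - b)) for a, b in items) / len(items)


# Assumption: a = 1 for every item, and each system only sees items
# located at its mean observed difficulty.
hard_items = [(1.0, 1.0)] * 3    # observed by system A only
easy_items = [(1.0, -1.5)] * 3   # observed by system B only

avg_a = expected_average(2.0, hard_items)    # theta_A = 2.0
avg_b = expected_average(-0.5, easy_items)   # theta_B = -0.5

print(f"E[r_A] = {avg_a:.3f}, E[r_B] = {avg_b:.3f}")
```

Both expected averages reduce to $\sigma(1)$, so the simple average cannot distinguish the two systems in this configuration.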
### 6.2 Why IRT Succeeds
IRT avoids this failure by jointly estimating item and system parameters. The log-likelihood (Equation 5) encodes the constraint that if a high-ability system fails an item, that item must be difficult, and vice versa. This cross-referencing is precisely why IRT handles missing data gracefully: even if system A is tested only on hard items, the difficulty of those items is calibrated from other systems’ performance on the same items. The estimated $\theta_{A}$ is adjusted upward to account for the difficulty of A’s test items.
Formally, IRT exploits the separability of ability and difficulty in the 2PL model: the response probability depends on ability and difficulty only through the difference $\theta_{j}-b_{i}$, scaled by the item discrimination $a_{i}$, so ability estimates lie on a common scale regardless of which items are observed.
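To illustrate the common-scale property, the sketch below estimates ability by maximum likelihood for two systems with identical raw averages but different item subsets. It is a simplified stand-in for the joint estimation described above, not the paper's implementation: it assumes the item parameters have already been calibrated (here they are simply given), uses $a=1$ throughout, and the response vectors are hypothetical. The function name `estimate_theta` and the bisection bounds are illustrative choices.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def estimate_theta(responses, a, b, lo=-10.0, hi=10.0, tol=1e-10):
    """Maximum-likelihood ability under 2PL with known item parameters.

    Solves sum_i a_i * (x_i - sigmoid(a_i * (theta - b_i))) = 0 by bisection.
    The score function is strictly decreasing in theta, so bisection is safe
    whenever the responses are neither all 0 nor all 1.
    """
    def score(theta):
        return sum(ai * (xi - sigmoid(ai * (theta - bi)))
                   for xi, ai, bi in zip(responses, a, b))

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid  # log-likelihood still increasing: the root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical data: both systems pass 3 of 4 observed items (raw average 0.75).
a = [1.0, 1.0, 1.0, 1.0]
b_hard = [0.5, 1.0, 1.5, 1.0]       # items seen by system A
b_easy = [-2.0, -1.5, -1.0, -1.5]   # items seen by system B (A's items shifted by -2.5)
responses = [1, 1, 0, 1]

theta_a = estimate_theta(responses, a, b_hard)
theta_b = estimate_theta(responses, a, b_easy)
print(f"theta_A = {theta_a:.3f}, theta_B = {theta_b:.3f}, gap = {theta_a - theta_b:.3f}")
```

Because system B's items are system A's items shifted down by 2.5 difficulty units, the two ability estimates differ by about 2.5 even though the raw averages are identical.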
### 6.3 Sensitivity Analysis
We conducted additional experiments varying coverage from 30% to 100% in 10% increments, with difficulty gaps from 1.0 to 4.0. Key findings:

- At $C=100\%$ (no missing data), simple averaging achieves $\rho>0.95$ regardless of difficulty gap: missing data is the necessary condition for failure.
- At $C<100\%$ with uniform random missingness (MCAR), simple averaging degrades only mildly ($\rho>0.90$ at $C=50\%$): difficulty-biased missingness is what causes severe failure.
- IRT maintains $\rho>0.95$ even at $C=30\%$ with extreme difficulty-biased missingness, provided that each system is observed on at least 2 items and each item is observed for at least 3 systems.
The grid sweep in Section [5.3](https://arxiv.org/html/2605.11205#S5.SS3) extends this sensitivity analysis systematically: across 150 conditions spanning $S\in[0,0.70]$ and $D\in[0.5,5.0]$, the interaction regression (Equation [6](https://arxiv.org/html/2605.11205#S5.E6)) confirms that severe failure requires all three ingredients: sparsity, difficulty heterogeneity, and non-random missingness.
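The contrast between full coverage and difficulty-biased missingness can be reproduced in a noise-free toy setting. The sketch below is not the paper's pipeline: the abilities, difficulties, and mask are hypothetical values chosen for illustration, all discriminations are set to 1, and rankings are computed from expected (not sampled) success rates.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def ranks(values):
    """Rank positions (1 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman(x, y):
    """Spearman rho via the no-ties formula 1 - 6 * sum(d^2) / (n (n^2 - 1))."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def expected_averages(thetas, difficulties, mask):
    """Expected simple average per system over its observed items (a_i = 1)."""
    out = []
    for theta, row in zip(thetas, mask):
        probs = [sigmoid(theta - b) for b, seen in zip(difficulties, row) if seen]
        out.append(sum(probs) / len(probs))
    return out


thetas = [-1.0, 0.0, 1.0, 2.5]          # hypothetical true abilities
difficulties = [-2.0, -1.0, 1.0, 2.0]   # hypothetical item difficulties

full_mask = [[1, 1, 1, 1]] * 4
# Difficulty-biased mask: weak systems see only easy items, strong systems only hard ones.
biased_mask = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]

rho_full = spearman(thetas, expected_averages(thetas, difficulties, full_mask))
rho_biased = spearman(thetas, expected_averages(thetas, difficulties, biased_mask))
print(f"rho (full) = {rho_full:.3f}, rho (biased) = {rho_biased:.3f}")
```

In this toy configuration the full-coverage ranking matches the true ability ordering exactly, while the biased mask scrambles it, since the weaker systems are scored only on items they are likely to pass.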
### 6.4 Implications for Physical AI Benchmarking
The Physical AI (robotics/VLA) benchmark ecosystem exhibits the most extreme version of the conditions we have characterized. We estimate coverage well below 50% based on published evaluation tables: LIBERO-PRO Zhou et al. ([2025](https://arxiv.org/html/2605.11205#bib.bib28)) reports results for three VLA models across four perturbation axes, with substantial missing entries, and cross-paper comparison across the broader VLA literature reveals that most models are evaluated on non-overlapping subsets of benchmarks. LIBERO-PRO showed that models scoring $>$90% on standard LIBERO tasks collapse to 0% under perturbation, a large empirical performance gap between standard and perturbed evaluation conditions. Published evaluations often cover different subsets of benchmarks, creating missingness patterns that may correlate with model strengths, benchmark availability, or reporting incentives, which is precisely the type of difficulty-biased missingness that degrades simple averaging.
Extrapolating from our scaling law, at coverage well below 50% with extreme difficulty gaps, simple averaging is expected to produce substantially degraded rankings\. This suggests that leaderboard\-style comparisons that aggregate sparse, non\-overlapping Physical AI results by simple average may not reliably reflect true system quality, and that IRT\-based methods warrant serious consideration\.
## 7 Discussion
### 7.1 Limitations
Our study has several important limitations that we address transparently\.
Synthetic data and data-generating process. Our non-NLP experiments use synthetic data generated from the 2PL IRT model. This is not a tautology, but a correctly specified simulation: a controlled experiment under a 2PL-compatible data-generating process Morris et al. ([2019](https://arxiv.org/html/2605.11205#bib.bib16)). We report it as an estimator-comparison study, not as final empirical proof that real-world benchmarks obey 2PL assumptions. Three points clarify the scope of this design. First, the comparison is between IRT and simple averaging on identical observations. Both methods receive the same data; the question is which produces more accurate rankings. Second, because the data-generating process follows a 2PL structure, the experiment intentionally favors methods capable of modeling item difficulty. This is by design: the goal is to test whether simple averaging fails when ability and difficulty are separable but unevenly observed, a scenario motivated by empirical evidence from LIBERO-PRO Zhou et al. ([2025](https://arxiv.org/html/2605.11205#bib.bib28)) and AV disengagement reporting practices. Third, simulation studies with known ground truth are the standard methodology for evaluating statistical estimators De Ayala ([2009](https://arxiv.org/html/2605.11205#bib.bib4)); Morris et al. ([2019](https://arxiv.org/html/2605.11205#bib.bib16)), precisely because they are the only setting where estimator accuracy can be definitively assessed.
We do not claim that Physical AI evaluation necessarily follows a 2PL IRT model\. Rather, we use a controlled 2PL\-compatible simulation to isolate the interaction between sparsity, item difficulty gaps, and biased missingness\. This establishes a methodological failure mode of simple averaging\. Real\-world validation with episode\-level Physical AI data—binary success/failure outcomes per trial per scenario—remains necessary future work\.
Model specification. Real-world data may violate 2PL assumptions (unidimensionality, local independence). The 2PL model is a simplification. However, IRT has been shown to be robust to moderate violations of these assumptions Embretson and Reise ([2000](https://arxiv.org/html/2605.11205#bib.bib6)), and 3PL or multidimensional IRT models can be substituted when needed.
Functional form. The grid sweep (Section [5.3](https://arxiv.org/html/2605.11205#S5.SS3)) confirms the failure surface under controlled 2PL-compatible conditions but does not establish a precise closed-form law. The power-law approximation $1-\rho_{\text{avg}}=\alpha(S\cdot D)^{\beta}$ explains only part of the variance ($R^{2}=0.587$), and the failure surface is better understood as a two-dimensional interaction than a one-dimensional scaling law. We retain the term *Evaluation Failure Scaling Law* as a name for the observed monotonic relationship, while acknowledging that its precise functional form remains to be established.
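For readers who want to try a power-law fit of this form on their own sweep results, one common approach is ordinary least squares in log-log space. The sketch below is an assumption about how such a fit could be done, not the procedure used in the paper: the input values are hypothetical and generated noise-free from $\alpha=0.05$, $\beta=1.5$, the $R^{2}$ is computed in log space (which may differ from the definition behind the reported 0.587), and conditions with $S\cdot D=0$ or $\rho_{\text{avg}}=1$ would need to be excluded before taking logarithms.

```python
import math


def fit_power_law(x, y):
    """Least-squares fit of y = alpha * x**beta in log-log space."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    beta = sxy / sxx
    alpha = math.exp(mean_y - beta * mean_x)
    # R^2 of the linear fit in log space
    ss_res = sum((v - (math.log(alpha) + beta * u)) ** 2 for u, v in zip(lx, ly))
    ss_tot = sum((v - mean_y) ** 2 for v in ly)
    return alpha, beta, 1.0 - ss_res / ss_tot


# Hypothetical, noise-free inputs generated from alpha = 0.05, beta = 1.5.
sd_product = [0.25, 0.5, 1.0, 2.0, 3.5]               # S * D (must be > 0)
rank_error = [0.05 * v ** 1.5 for v in sd_product]    # stands in for 1 - rho_avg

alpha, beta, r2 = fit_power_law(sd_product, rank_error)
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}, R^2 = {r2:.3f}")
```

On noise-free inputs the fit recovers the generating parameters; on real sweep data, a low $R^{2}$ would indicate that the one-dimensional product $S\cdot D$ does not capture the full failure surface.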
Sample sizes. Our experiments use moderate numbers of systems (8–12) and items (4–8). While consistent with the Physical AI domain (where $\sim$20 models and 4 benchmark suites are available), larger-scale validation on domains with hundreds of systems would strengthen the conclusions.
### 7.2 Practical Recommendations
Based on our findings, we propose a decision rule for evaluation practitioners: (1) Compute coverage $C$ and estimate the difficulty gap $D$ (e.g., via coefficient of variation of per-item success rates). (2) If $C>0.95$ and $D$ is small: simple averaging is adequate. (3) If $C<0.95$ or $D$ is large: use IRT. Libraries such as py-irt Lalor and Rodriguez ([2022](https://arxiv.org/html/2605.11205#bib.bib11)) make this straightforward. (4) Always report both simple average and IRT rankings, allowing readers to assess the impact.
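Steps (1) through (3) of the decision rule can be sketched in a few lines. This is an illustrative implementation under stated assumptions: the score matrix is hypothetical, the function names are not from the paper's repository, and the cutoff `cv_threshold=0.25` for a "small" difficulty gap is an assumed placeholder, since the rule above does not fix a numeric threshold for $D$.

```python
import statistics


def coverage(matrix):
    """Fraction of (system, item) cells that are observed (not None)."""
    cells = [cell for row in matrix for cell in row]
    return sum(cell is not None for cell in cells) / len(cells)


def difficulty_gap_cv(matrix):
    """Coefficient of variation of per-item success rates (a proxy for D)."""
    rates = []
    for column in zip(*matrix):
        observed = [cell for cell in column if cell is not None]
        if observed:
            rates.append(sum(observed) / len(observed))
    return statistics.pstdev(rates) / statistics.mean(rates)


def recommend(matrix, cv_threshold=0.25):
    """Apply the decision rule; cv_threshold is an assumed cutoff for 'small D'."""
    c = coverage(matrix)
    d = difficulty_gap_cv(matrix)
    method = "simple average" if (c > 0.95 and d < cv_threshold) else "IRT"
    return c, d, method


# Hypothetical success rates: rows are systems, columns are items, None = not tested.
scores = [
    [0.9, 0.8, None, None],
    [0.95, 0.85, 0.4, None],
    [None, 0.9, 0.5, 0.2],
    [None, None, 0.6, 0.3],
]

c, d, method = recommend(scores)
print(f"coverage = {c:.2f}, CV = {d:.2f}, recommendation = {method}")
```

In practice the threshold for $D$ would need to be calibrated per domain, and step (4) still applies: both rankings should be reported regardless of which branch is taken.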
### 7.3 Broader Impact
The Evaluation Failure Scaling Law has implications beyond AI benchmarking: regulatory safety assessment bodies (NHTSA, EU AI Act compliance) should consider reporting IRT-adjusted aggregation alongside simple averaging of test results across driving conditions; CISOs making procurement decisions based on vendor-reported detection rates should consider IRT-adjusted scores that account for the difficulty of the threats tested; cross-site drug efficacy comparisons may benefit from IRT or analogous hierarchical models that adjust for site-level patient severity; and educational testing, where IRT originated Lord and Novick ([1968](https://arxiv.org/html/2605.11205#bib.bib15)), reminds us that the same failure modes exist in AI evaluation, where they have been largely ignored.
## 8 Conclusion
We have presented a systematic cross-domain simulation study demonstrating that simple averaging, the default evaluation methodology in AI benchmarking, can fail substantially when evaluation matrices are sparse and items vary in difficulty. Through controlled experiments across four domains (NLP, clinical trials, autonomous vehicle safety, and cybersecurity) and a 150-condition grid sweep, we showed that Spearman rank correlation between simple-average rankings and ground truth degrades from $\rho=1.000$ to $\rho=0.242$ as the sparsity–difficulty product increases under biased missingness, while IRT maintains $\rho\geq 0.993$ throughout. The $S\times D$ interaction contributes substantially to this failure ($\gamma_{3}=+0.199$, $t=13.05$).
We refer to this observed relationship as the Evaluation Failure Scaling Law and argue that for Physical AI benchmarking, where we estimate data coverage to be well below 50% and difficulty gaps are extreme, IRT-based methods warrant serious consideration as a complement or alternative to simple averaging.
Our results suggest that IRT\-style estimation may be necessary in sparse, heterogeneous evaluation settings; however, real\-world validation with episode\-level Physical AI data remains essential future work\. We do not claim to have completed a benchmark, but rather to have identified and characterized a methodological failure mode that the evaluation community should address\.
The path forward involves two steps: reporting IRT-based rankings alongside simple averages in any evaluation ecosystem with non-trivial missing data, and collecting the episode-level binary outcomes that enable production-grade IRT calibration. The mathematical framework has existed since 1968 Lord and Novick ([1968](https://arxiv.org/html/2605.11205#bib.bib15)); the computational tools are freely available Lalor and Rodriguez ([2022](https://arxiv.org/html/2605.11205#bib.bib11)); Rodriguez et al. ([2021b](https://arxiv.org/html/2605.11205#bib.bib20)); what remains is the data infrastructure and the will to adopt them.
## Reproducibility Statement
All experiments use synthetic data with fixed random seeds for exact reproducibility. The complete Python implementation, including data generation, IRT estimation, grid sweep, and evaluation, is available at [https://github.com/testofschool/evaluation-failure-scaling-law](https://github.com/testofschool/evaluation-failure-scaling-law). The codebase requires only NumPy and SciPy (no deep learning frameworks). Each experiment runs in under 60 seconds on a standard laptop CPU. The 150-condition grid sweep completes in approximately 3 minutes. We report all hyperparameters in Section 4 and provide the exact ground-truth item parameters in the Appendix; complete subject ability parameters are included in the code repository.
## References
- Baker and Kim [2004] Baker, F. B. and Kim, S.-H. *Item Analysis in Testing and Item Response Theory for Scoring, Scale Construction, and Diagnostics*. Marcel Dekker, 2nd edition, 2004.
- California DMV [2023] California Department of Motor Vehicles. Autonomous Vehicle Disengagement Reports, 2023. [https://www.dmv.ca.gov/portal/vehicle-industry-services/autonomous-vehicles/disengagement-reports/](https://www.dmv.ca.gov/portal/vehicle-industry-services/autonomous-vehicles/disengagement-reports/).
- Uebayashi et al. [2026] Uebayashi, S. et al. M3IRT: Evaluating cross-modal reasoning ability and problem characteristics with multimodal item response theory. *arXiv preprint arXiv:2603.02663*, 2026.
- De Ayala [2009] De Ayala, R. J. *The Theory and Practice of Item Response Theory*. Guilford Press, 2009.
- Devlin et al. [2019] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In *Proceedings of NAACL-HLT*, pages 4171–4186, 2019.
- Embretson and Reise [2000] Embretson, S. E. and Reise, S. P. *Item Response Theory for Psychologists*. Lawrence Erlbaum Associates, 2000.
- Fei et al. [2025] Fei, H., Wang, Z., and others. LIBERO-Plus: In-depth robustness analysis of vision-language-action models. *arXiv preprint arXiv:2510.13626*, 2025.
- FMCSA [2021] Federal Motor Carrier Safety Administration. Item Response Theory (IRT) Correlation Study. Technical report, U.S. Department of Transportation, 2021.
- He et al. [2021] He, P., Liu, X., Gao, J., and Chen, W. DeBERTa: Decoding-enhanced BERT with disentangled attention. In *Proceedings of ICLR*, 2021.
- Lalor et al. [2016] Lalor, J. P., Wu, H., and Yu, H. Building an evaluation scale using item response theory. In *Proceedings of EMNLP*, pages 648–657, 2016.
- Lalor and Rodriguez [2022] Lalor, J. P. and Rodriguez, P. py-irt: A scalable item response theory library for Python. *INFORMS Journal on Computing*, 34(5):2530–2537, 2022.
- Lalor et al. [2024] Lalor, J. P., Rodriguez, P., Sedoc, J., and Hernández-Orallo, J. Item response theory for natural language processing. Tutorial at EACL, 2024.
- Liu et al. [2019] Liu, Y. et al. RoBERTa: A robustly optimized BERT pretraining approach. *arXiv preprint arXiv:1907.11692*, 2019.
- Liu et al. [2023] Liu, B. et al. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. In *Advances in Neural Information Processing Systems*, 36:44776–44791, 2023.
- Lord and Novick [1968] Lord, F. M. and Novick, M. R. *Statistical Theories of Mental Test Scores*. Addison-Wesley, 1968.
- Morris et al. [2019] Morris, T. P., White, I. R., and Crowther, M. J. Using simulation studies to evaluate statistical methods. *Statistics in Medicine*, 38(11):2074–2102, 2019.
- Polo et al. [2024] Polo, F. M. et al. tinyBenchmarks: Evaluating LLMs with fewer examples. In *Proceedings of ICML*, PMLR 235, 2024.
- Raffel et al. [2020] Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. *JMLR*, 21(140):1–67, 2020.
- Rodriguez et al. [2021] Rodriguez, P. et al. Evaluation examples are not equally informative: How should that change NLP leaderboards? In *Proceedings of ACL-IJCNLP*, pages 4486–4503, 2021.
- Rodriguez et al. [2021b] Rodriguez, P. et al. IRT Leaderboard. [https://github.com/facebookresearch/irt-leaderboard](https://github.com/facebookresearch/irt-leaderboard), 2021.
- Rubin [1976] Rubin, D. B. Inference and missing data. *Biometrika*, 63(3):581–592, 1976.
- Savage [2009] Savage, S. L. *The Flaw of Averages*. John Wiley & Sons, 2009.
- Luo et al. [2025] Luo, Z., Wu, L., Frisch, A., and He, D. Measuring Competency, Not Performance: Item-Aware Evaluation Across Medical Benchmarks. *arXiv preprint arXiv:2509.24186*, 2025.
- Truong et al. [2025] Truong, S. et al. Reliable and efficient amortized model-based evaluation. *arXiv preprint arXiv:2503.13335*, 2025.
- Uzunoğlu et al. [2025] Uzunoğlu, A., Li, T., and Khashabi, D. The flaw of averages: Quantifying uniformity of performance on benchmarks. *arXiv preprint arXiv:2509.25671*, 2025.
- Wang et al. [2018] Wang, A. et al. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In *Proceedings of ICLR Workshop*, 2018.
- Zhou et al. [2026] Zhou, H. et al. Lost in benchmarks? Rethinking large language model benchmarking with item response theory. In *Proceedings of AAAI (Oral)*, 2026.
- Zhou et al. [2025] Zhou, X. et al. LIBERO-PRO: Towards robust and fair evaluation of vision-language-action models beyond memorization. *arXiv preprint arXiv:2510.03827*, 2025.
- Ndzomga [2026] Ndzomga, F. Efficient benchmarking of AI agents. *arXiv preprint arXiv:2603.23749*, 2026.
## Appendix A Ground-Truth Parameters
For full reproducibility, we provide the exact ground-truth parameters used in each experiment.
### A.1 Domain 1: NLP (GLUE)
Table 4: GLUE task parameters (ground truth).
### A.2 Domain 2: Clinical Drug Trials
Table 5: Hospital parameters (ground truth).
### A.3 Domain 3: Autonomous Vehicle Safety
Table 6: Driving environment parameters (ground truth).
### A.4 Domain 4: Cybersecurity
Table 7: Attack type parameters (ground truth).
## Appendix B Observation Masks
The observation masks $M$ for each domain encode realistic evaluation patterns. For Domains 2–4, the key structural feature is that certain systems are tested on difficulty-biased subsets: in clinical trials, the Fake Miracle Drug is observed at hospitals A, B, C (easy half) while the True Miracle Drug is observed at hospitals C, D, E, F (hard half); in AV safety, the Fake Safe AV is observed in environments 1, 2, 3 (easy) while the True Safe AV is observed in environments 3, 4, 5, 6 (hard); in cybersecurity, Fake Secure is tested against attacks 1, 2, 3 (trivial) while True Secure is tested against attacks 3, 4, 5, 6 (advanced). The remaining systems have randomly sampled observation patterns at the stated coverage rate, with the constraint that each system is observed on at least 2 items and each item has at least 3 observed systems.