Every Eval Ever: A Unifying Schema and Community Repository for AI Evaluation Results
Summary
Introduces Every Eval Ever, a shared schema and community-crowdsourced repository for standardizing AI evaluation results, with automatic converters and a hosted database spanning over 22k models and 2.2k benchmarks.
View Cached Full Text
Cached at: 06/15/26, 09:12 AM
# A Unifying Schema and Community Repository for AI Evaluation Results
Source: [https://arxiv.org/html/2606.14516](https://arxiv.org/html/2606.14516)
Jan Batzner\*,1\-3Sree Harsha Nelaturu\*,4Damian Stachura\*,5Anastassia Kornilova\*,6Jon Crall⋄\\diamond, 7Tommaso Cerruti⋄\\diamond, 8Yanan Long⋄\\diamond, 9Yifan Mai⋄\\diamond, 10Sanchit Ahuja⋄\\diamond, 11Asaf Yehudai⋄\\diamond, 12Marek Šuppa⋄\\diamond, 13,14John P\. Lalor⋄\\diamond, 15Oluwagbemike Olowe⋄\\diamond, 16Jatin Ganhotra12Brian H\. Hu7Eliya Habba17Andrew M\. Bean18Chang Liu19Sander Land20Steven Dillmann10Aniketh Garikaparthi21Elron Bandel12Saki Imai11James Edgell22Wm\. Matthew Kennedy18Jenny Chim23Patrick Meusling24Asteria Kaeberlein11Venkata Ramachandra Karthik Chundi16Manasi Patwardhan21Martin Ku22Austin Meek25Leon Knauer26Brian Wingenroth27Srishti Yadav28,29Usman Gohar30Felix Friedrich31Michelle Lin32,33Jennifer Mickel34Arman Cohan35Stella Biderman†\\dagger, 34Irene Solaiman†\\dagger, 36Zeerak Talat†\\dagger, 37Anka Reuel†\\dagger, 10,38Mubashara Akhtar†\\dagger, 39,8Gjergji Kasneci†\\dagger, 1,2Avijit Ghosh†\\dagger, 36Leshem Choshen†\\dagger, 40,41,12\*Lead Author⋄\\diamondTop Contributor†\\daggerAdvisorThis project was a part of the Evaluating Evaluations \(EvalEval\) Coalition:![[Uncaptioned image]](https://arxiv.org/html/2606.14516v1/figs/logo-square.png)[https://evalevalai\.com/](https://evalevalai.com/)1Technical University Munich2Munich Center for Machine Learning3Weizenbaum Institute 4Zuse Institute Berlin5Evidence Prime6Trustible7Kitware8ETH Zurich9StickFlux Labs10Stanford University11Northeastern University12IBM Research13Comenius University Bratislava14Cisco15University of Notre Dame16Independent17Hebrew University of Jerusalem 18University of Oxford19Ohio University20Writer21TCS Research22Oxford University Press23Queen Mary University of London24Technical University Berlin25University of Delaware26Cinemo27Johns Hopkins University28University of Copenhagen29ELLIS30Iowa State University 31Meta FAIR32University of Montreal33Mila Quebec AI Institute34EleutherAI35Yale University36Hugging Face37University of Edinburgh38Harvard University39ETH AI Center 40MIT41MIT\-IBM Watson Lab
###### Abstract
AI evaluations are widely used for testing and understanding progress\. However, the diverse evaluators bring with them inconsistencies that challenge analysis and comparison\. First, results are saved in incompatible formats, scattered across leaderboards, papers, blog posts, evaluation harness logs, and custom repositories\. Second, results are created by different evaluation frameworks, which produce divergent scores for nominally identical evaluations and record metadata inconsistently, hindering comparison, cross\-community evaluation science, cost reduction, and reuse\. We introduceEvery Eval Ever, the first shared schema and community\-crowdsourced repository for AI evaluation results\. The schema standardizes how evaluations are represented, in a unified, single JSON document\. It is source\-agnostic by design, ingesting results from evaluation harnesses and papers alike, and optionally stores per\-instance outputs for fine\-grained analysis\. We contribute: \(i\) a community\-governed metadata schema with a companion instance\-level schema[evaleval/every\_eval\_ever](https://github.com/evaleval/every_eval_ever), the first standardization effort of its kind; \(ii\) automatic converters from popular formats, evaluation harnesses, and leaderboards to the unified schema[https://github.com/evaleval/every_eval_ever](https://github.com/evaleval/every_eval_ever); and \(iii\) a crowdsourced community database hosted on Hugging Face, currently spanning to date 22,235 models, 2,273 unique benchmarks, and 31 evaluation formats[![[Uncaptioned image]](https://arxiv.org/html/2606.14516v1/figs/hf-logo.png)evaleval/EEE\_datastore](https://huggingface.co/datasets/evaleval/EEE_datastore)\.
## 1Introduction
Evaluations are critical for measuring AI progress, yet how they are reported is inconsistent, incomplete, and difficult to interpret\. Evaluation results are often reduced to aggregated scores in a table, with important evaluation metadata, such as generation parameters, evaluation settings, and data provenance, omitted or scattered across papers, ad hoc log files, and code repositories\. This fragmentation undermines reproducibility, complicates cross\-benchmark comparisons, and limits the potential for systematic meta\-analysis\.
In practice, this creates fundamental challenges for both researchers and practitioners\. Comparative evaluation studies are typically constrained by the subset of results that can be reliably reproduced \(e\.g\., architecture scaling\[[22](https://arxiv.org/html/2606.14516#bib.bib54),[80](https://arxiv.org/html/2606.14516#bib.bib55)\]or quantization comparisons\[[48](https://arxiv.org/html/2606.14516#bib.bib4)\]\), often requiring substantial computational and financial resources\[[34](https://arxiv.org/html/2606.14516#bib.bib71),[70](https://arxiv.org/html/2606.14516#bib.bib53)\]\. A lack of comparability is especially misleading when different parties evaluate the same model or benchmark, yet produce different scores\[see §[7\.3](https://arxiv.org/html/2606.14516#S7.SS3)and[97](https://arxiv.org/html/2606.14516#bib.bib7),[89](https://arxiv.org/html/2606.14516#bib.bib1)\]\. For example, the LLaMA 65B model has been reported to achieve both 63\.7 and 48\.8 on MMLU\[[39](https://arxiv.org/html/2606.14516#bib.bib11)\]\. On a closer look, the difference in scores was found to arise from the use of different evaluation harnesses\. Without this context, the scores are not directly comparable\[[29](https://arxiv.org/html/2606.14516#bib.bib40)\]\. Similarly, our analysis of evaluations across over 22,235 models and 2,273 benchmarks reveals 31 distinct reporting formats, highlighting the lack of standardization and motivating the need for more structured reporting practices \(See statistics in Fig\.[2](https://arxiv.org/html/2606.14516#S6.F2)\)\.
Other parts of the AI pipeline have benefited from standardization: shared metadata schemas such as DCAT, Schema\.org/Dataset, and Croissant\[[90](https://arxiv.org/html/2606.14516#bib.bib79),[81](https://arxiv.org/html/2606.14516#bib.bib80),[3](https://arxiv.org/html/2606.14516#bib.bib14)\]; documentation practices such as Datasheets for Datasets and Model Cards\[[33](https://arxiv.org/html/2606.14516#bib.bib81),[67](https://arxiv.org/html/2606.14516#bib.bib82)\]; and common evaluation and benchmarking protocols such as GLUE, SuperGLUE, HELM, BIG\-bench, and MLPerf\[[92](https://arxiv.org/html/2606.14516#bib.bib83),[91](https://arxiv.org/html/2606.14516#bib.bib84),[53](https://arxiv.org/html/2606.14516#bib.bib9),[85](https://arxiv.org/html/2606.14516#bib.bib78),[76](https://arxiv.org/html/2606.14516#bib.bib85)\]have improved reproducibility, comparability, and transparency\. In contrast, evaluation reporting remains fragmented\[[54](https://arxiv.org/html/2606.14516#bib.bib41),[18](https://arxiv.org/html/2606.14516#bib.bib42),[75](https://arxiv.org/html/2606.14516#bib.bib44),[27](https://arxiv.org/html/2606.14516#bib.bib43),[19](https://arxiv.org/html/2606.14516#bib.bib45)\], with reported implications for downstream analysis such as benchmark saturation studies\[[15](https://arxiv.org/html/2606.14516#bib.bib35),[4](https://arxiv.org/html/2606.14516#bib.bib38)\]\. Similarly, psychometric analyses in the field depend on standardized example\-level data, which is rare in current evaluation reporting\[[51](https://arxiv.org/html/2606.14516#bib.bib5),[74](https://arxiv.org/html/2606.14516#bib.bib6)\]\. Finally, governance frameworks such as the EU AI Act\[[28](https://arxiv.org/html/2606.14516#bib.bib36)\]mandate reproducible risk assessments, yet current evaluation tooling and reporting lack even the basic standardization that reproducibility requires\.
Figure 1:Every Eval Everhas four components: \(1\) heterogeneous evaluation data \(leaderboards, papers, harness logs, custom scripts\); \(2\) converters for known log formats \(HELM, Inspect AI, lm\-eval\) and metadata parsers for community formats \(Hugging Face, leaderboards\); \(3\) a unified metadata schema supporting aggregate and instance\-level results; and \(4\) a crowdsourced community database making public evaluation results accessible and processable\.Every Eval Ever\(EEE\) addresses these gaps through a shared reporting schema and a crowdsourced repository for AI evaluation results\. Just as data\[[3](https://arxiv.org/html/2606.14516#bib.bib14)\]and models\[[67](https://arxiv.org/html/2606.14516#bib.bib82)\]have documentation standards,EEEstandardizes the core aspects of evaluation: who ran it, under what settings, and what the resulting scores mean\. It ingests results from any source, like harness logs, leaderboard scrapes, and paper results, and represents them in a single, interoperable format\.
In summary,EEEmakes the following contributions:
1. 1\.Ashared, versioned JSON schemafor AI evaluation results that captures source provenance, model access mode, generation configuration, and metric semantics in a single record, with an optional instance\-level companion schema supporting single\-and multi\-turn interaction types\.
2. 2\.Automatic convertersfrom major harnesses \(HELM, lm\-eval\-harness, Inspect AI\) and common formats producing schema\-compliant records, including per\-instance outputs where source logs provide them, paired with a validation pipeline that ensures schema compliance at contribution time\.
3. 3\.Acrowdsourced, community repositoryhosted on Hugging Face, already spanning 22,235 models, 2,273 unique benchmarks, and 31 evaluation formats, that for the first time enables cross\-framework comparison of evaluation results at scale\.
4. 4\.Exemplaryempirical analyses enabled by unified repository, whereEEEcan identify cost\-accuracy tradeoffs in agentic evaluations \([7\.1](https://arxiv.org/html/2606.14516#S7.SS1)\), reveal implementation\-dependent perplexity scores \([7\.2](https://arxiv.org/html/2606.14516#S7.SS2)\), captures evaluation harness reproducibility gaps \([7\.3](https://arxiv.org/html/2606.14516#S7.SS3)\), and enable meta\-analysis using Item Response Theory \([7\.4](https://arxiv.org/html/2606.14516#S7.SS4)\), none of which were previously feasible without a unified result format\.
## 2Related Work
##### Evaluation harnesses:
Evaluation harnesses describe software to standardize model evaluation, from input prompts to output metrics\. While evaluation harnesses like lm\-eval\-harness\[[32](https://arxiv.org/html/2606.14516#bib.bib8)\], HELM\[[53](https://arxiv.org/html/2606.14516#bib.bib9)\], and InspectAI\[[2](https://arxiv.org/html/2606.14516#bib.bib10)\]have proliferated, their format for results remain mutually incompatible\[[7](https://arxiv.org/html/2606.14516#bib.bib25),[14](https://arxiv.org/html/2606.14516#bib.bib46)\]\.Every Eval Everis not a new evaluation harness, but a translation layer that sits above those and enables better aggregation of evaluation results\.
##### Evaluation sharing:
There are a few large sources that share evaluations\. The main sources for those are leaderboards\[[53](https://arxiv.org/html/2606.14516#bib.bib9),[43](https://arxiv.org/html/2606.14516#bib.bib63)\], or websites\[[6](https://arxiv.org/html/2606.14516#bib.bib18),[65](https://arxiv.org/html/2606.14516#bib.bib17),[26](https://arxiv.org/html/2606.14516#bib.bib16)\]efforts that release what they run, and two concurrent works to ours that collect instance\-level\[[41](https://arxiv.org/html/2606.14516#bib.bib15)\]or Inspect framework outputs specifically\[[1](https://arxiv.org/html/2606.14516#bib.bib19)\]and share them publicly\. We collaborate with them to aggregate their results toEvery Eval Ever\. Public Leaderboards like Open LLM Leaderboard\[[12](https://arxiv.org/html/2606.14516#bib.bib20)\], Chatbot Arena\[[99](https://arxiv.org/html/2606.14516#bib.bib21)\], AlpacaEval\[[52](https://arxiv.org/html/2606.14516#bib.bib56)\], MT\-Bench\[[98](https://arxiv.org/html/2606.14516#bib.bib57)\], aggregated results at scale but export limited structured metadata\[[93](https://arxiv.org/html/2606.14516#bib.bib50)\]\. We createdEvery Eval Everto combine all of those scores in a unified format and database, alongside local harness runs within the same format\.
##### Reproducibility:
Comparison is unreliable when different evaluation settings are underspecified and carry the same benchmark name\. Lacking standards prevents the community from reliably comparing, replicating, and reusing cost\-intensive evaluations\[[14](https://arxiv.org/html/2606.14516#bib.bib46)\]\. The same model, accessed through different providers or run with different engine configurations, can produce different outputs\[[69](https://arxiv.org/html/2606.14516#bib.bib51)\]\. Moreover, prompt ordering\[[58](https://arxiv.org/html/2606.14516#bib.bib33)\]and data contamination\[[60](https://arxiv.org/html/2606.14516#bib.bib34)\]can introduce score variance\. Large\-scale analysis of evaluation results can require weeks of data wrangling before any research can begin\[e\.g\.[80](https://arxiv.org/html/2606.14516#bib.bib55),[22](https://arxiv.org/html/2606.14516#bib.bib54),[70](https://arxiv.org/html/2606.14516#bib.bib53),[4](https://arxiv.org/html/2606.14516#bib.bib38)\], if such analysis is possible at all without rerunning full leaderboards at extreme cost\[e\.g\.[35](https://arxiv.org/html/2606.14516#bib.bib52),[70](https://arxiv.org/html/2606.14516#bib.bib53),[34](https://arxiv.org/html/2606.14516#bib.bib71)\], we estimate the inference cost to reproduce our data in §[6](https://arxiv.org/html/2606.14516#S6)\.
##### Dataset and model documentation:
Although larger efforts in the ML community centered around datasets and model documentation, evaluation, and result documentation itself remain a gap in the community\[[54](https://arxiv.org/html/2606.14516#bib.bib41),[18](https://arxiv.org/html/2606.14516#bib.bib42)\]; where multiple suggestions for metadata to report exist\[[86](https://arxiv.org/html/2606.14516#bib.bib3),[16](https://arxiv.org/html/2606.14516#bib.bib2),[84](https://arxiv.org/html/2606.14516#bib.bib49)\], but not the low\-level evaluation ones\. For datasets, Datasheets for Datasets\[[33](https://arxiv.org/html/2606.14516#bib.bib81)\]and Croissant\[[3](https://arxiv.org/html/2606.14516#bib.bib14)\]standardize metadata for ML datasets\. For benchmarks, those efforts have been tailored to benchmark needs\[[78](https://arxiv.org/html/2606.14516#bib.bib47),[40](https://arxiv.org/html/2606.14516#bib.bib48),[84](https://arxiv.org/html/2606.14516#bib.bib49)\]\. For models, Model Cards\[[67](https://arxiv.org/html/2606.14516#bib.bib82),[56](https://arxiv.org/html/2606.14516#bib.bib76)\]document artefacts and their intended uses\. For evaluation results,Every Eval Everaddresses the most pressing remaining gap: a shared schema for the run\-time context that determines whether two scores can be aggregated and compared\.
##### Agentic evaluation standardization:
Recent work begun to understand the importance of standardizing agentic evaluation\[[10](https://arxiv.org/html/2606.14516#bib.bib67),[43](https://arxiv.org/html/2606.14516#bib.bib63)\], and made first steps towards achieving it\[[8](https://arxiv.org/html/2606.14516#bib.bib58),[49](https://arxiv.org/html/2606.14516#bib.bib65),[64](https://arxiv.org/html/2606.14516#bib.bib110),[38](https://arxiv.org/html/2606.14516#bib.bib66),[96](https://arxiv.org/html/2606.14516#bib.bib68)\]\. These efforts focus on the runtime and execution layers, standardizing agent evaluation across task representation, environment type, interface protocol, and tool specification format to enable easy and scalable agent and benchmark integration\.Every Eval Everprovides a complementary focus by standardizing how agentic evaluation results are represented and stored\. Hence, allowing for easy results analysis across different sources \(Section[7\.1](https://arxiv.org/html/2606.14516#S7.SS1)\)\.
Table 1:Design decisions behindEvery Eval Everand the capabilities they enable\.
## 3TheEvery Eval EverSchema
Every Eval Everis a standardized representation of AI evaluation results across benchmarks, models, and reporting data sources \(e\.g\., public model leaderboards, research papers, evaluation harness logs, among others; Figure[1](https://arxiv.org/html/2606.14516#S1.F1)\)\. Instead of storing only final performance scores, eachEvery Eval Everrecord captures the metadata required to interpret, compare, and reuse results: Who ran the evaluation, which model was evaluated, under what generation settings, how metrics were computed, and \(if available\) instance\-level outputs\. The schema is modular and organized into reusable information blocks\. In this section, we describe how the schema was developed and the design principles that guided its construction \(Section[3\.1](https://arxiv.org/html/2606.14516#S3.SS1)\), and present the schema structure and core components \(Section[3\.2](https://arxiv.org/html/2606.14516#S3.SS2.SSS0.Px1)\); full field references appear in App\.[B](https://arxiv.org/html/2606.14516#A2)and Table[8](https://arxiv.org/html/2606.14516#A3.T8)\.
### 3\.1Schema design principles and development methodology
Every Eval Everbalances broad adoption with enough structure to support downstream comparison, auditing, and reanalysis\. Table[1](https://arxiv.org/html/2606.14516#S2.T1)summarizes the main design decisions and the capabilities they enable\. The schema was developed through an open, iterative community design process inspired by the Croissant metadata format\[[3](https://arxiv.org/html/2606.14516#bib.bib14)\]\. We gathered structured feedback from about 40 researchers and unstructured feedback from about 110 researchers, including benchmark creators, evaluation framework developers, governance experts, leaderboard operators, and industry practitioners\. The schema is open and had since discussions on improvements through GitHub and Slack\. The schema versions were openly proposed, discussed, and revised by the community, with disagreements resolved by consensus meeting among core contributors \(see Governance in App\. §[E](https://arxiv.org/html/2606.14516#A5)\)\. The fields were included if they were \(i\) reported in at least one existing evaluation framework or published result and \(ii\) considered necessary for the interpretability or reuse of the score by the majority of contributors, \(iii\) anticipated to be available for others to report in the future; fields that did not meet all criteria were moved toadditional\_detailsor excluded\.
### 3\.2Schema overview
EachEEErecord stores data on a single evaluation run\. TheEEEschema is organized into five metadata blocks: First,Source Metadata:Who produced the results and from where did it originate? Second,Model Information:Which model was evaluated and how was it accessed? Third,General Configuration:Which configuration settings were used during evaluation? Fourth,Evaluation Results:How were the results reported \(e\.g\., metrics, uncertainty estimates\)? Finally,Instance\-level Data:Optionally, what instance\-level information is available?
\{sidemarkA\}
##### 1\. Source metadata:
Thesource\_metadatafield records who produced the result and how it was collected\.source\_typedistinguishes results scraped from a leaderboard or paper \(documentation\) from those produced by a local evaluation run \(evaluation\_run\)\.evaluator\_relationshiprecords whether the evaluation was run by the model developer \(first\_party\), an independent party \(third\_party\), or the metadata contributors themselves \(self\)\. Capturing the reporting source is important since the incentives, reproducibility, and trustworthiness of the reported results can differ between them\.
\{sidemarkB\}
##### 2\. Model information and access mode:
model\_inforecords the model identifier indeveloper/nameusing a standardized developer/model naming convention and the*access mode*\. We store whether results were obtained through hosted APIs such asopenaiandanthropic\(inference\_platform\) or local inference engines like vLLM \(inference\_engine\)\. The same weights served through different providers or engine versions can produce different results \(see §[7\.3](https://arxiv.org/html/2606.14516#S7.SS3)\); recording access mode makes these hidden confounds visible\.
\{sidemarkC\}
##### 3\. Generation configuration:
Parameters such as temperature, number of samples, and stop sequences can effect benchmark outcomes substantially, yet they are frequently missing from leaderboards\.generation\_configmakes them first\-class fields\. When a parameter is unknown, the field is omitted and its absence is recorded explicitly rather than silently defaulted\.
\{sidemarkD\}
##### 4\. Evaluation results and metric semantics:
evaluation\_resultsstores one entry per scored metric\. Each entry includes ametric\_configobject capturing score direction \(lower\_is\_better\), type \(continuous, binary, ordinal\), and range\. This prevents silent ambiguity\. For example, a score of 0\.31 is favorable on toxicity metrics where lower is better, but poor on pass@1 coding metrics where higher scores are desirable\. Ordinal metrics \(e\.g\. Low/Medium/High mapped to integers vialevel\_names\), uncertainty fields \(standard errors, confidence intervals\), and per\-result timestamps are also supported \(Case Studies §[7\.2](https://arxiv.org/html/2606.14516#S7.SS2)\)\.
\{sidemarkE\}
##### 5\. Instance\-level data:
While aggregate scores support comparisons, understanding*why*scores differ often requires per\-sample data\[[19](https://arxiv.org/html/2606.14516#bib.bib45)\]\.EEEtherefore stores instance outputs in a optional companion file\_samples\.jsonl\(one JSON object per line\)\. This file can store prompts, model outputs, references, scores, and metadata needed for detailed analysis\. Three interaction types are supported: First,single\_turn— QA, MCQ, classification; uses anoutputobject\. Second,multi\_turn— multi\-exchange conversations; uses amessagesarray\. Third,agentic— tool\-using agents with full tool\-call traces and sandbox logs; usesmessageswith nestedtool\_calls\. For agentic evaluations \(e\.g\. SWE\-Bench, GAIA\), the aggregate record captures tool and sandbox configuration \(Case Study §[7\.1](https://arxiv.org/html/2606.14516#S7.SS1)\)\.
## 4Converters and validation
Every Eval Everschema is supported by two components: \(1\) converters, which automatically parse existing evaluation outputs into theEEEschema, and \(2\) automated validation, which assesses whether the submitted datasets conform to theEEEschema specification:
##### \(1\) Converters:
Manual re\-formatting of existing logs is a large barrier to adoption\. Hence, we provide converters for three widely\-used LLM evaluation frameworks: HELM\[[53](https://arxiv.org/html/2606.14516#bib.bib9)\], lm\-eval\-harness\[[32](https://arxiv.org/html/2606.14516#bib.bib8)\], and Inspect AI\[[2](https://arxiv.org/html/2606.14516#bib.bib10)\]\. App\.[C](https://arxiv.org/html/2606.14516#A3)describes these converters input formats and field mappings\. More converters from leaderboards or other sources were contributed by the community \(App\.[C\.4](https://arxiv.org/html/2606.14516#A3.SS4)\)\. Each converter produces a schema\-compliant\.jsonfile and, where source logs include per\-sample data, a\_samples\.jsonl\.
##### \(2\) Validation:
Every time a record is submitted toEEE, it is validated against the schema before being entered into the repository\. The validation step checks for the required fields, data types, enum constraints, and object consistency\. Schema compliance is enforced by Pydantic v2 models generated from[eval\.schema\.json](https://github.com/evaleval/every_eval_ever/blob/dec1ae43e0741a37003425eafe6699d3296145ec/every_eval_ever/schemas/eval.schema.json)\. Validation runs in two settings:
\(2\.1\) Locally via the command\-line:Users can validate records locally with theEvery Eval EverCLI before submission\. However, there are additional checks available which are provided by the validator\. The CLI supports rich terminal, JSON, and GitHub annotation output formats\.
\(2\.2\) EvalEvalBot:When contributors submit data via pull request to the datastore, either the author or a maintainer can request validation using the validator\. The validator checks schema syntax compliance, presence of all files mentioned and provides warning for situations presence of duplicate records\.
## 5Community approach
Governance:The community\-driven nature ofEEEnecessitates that governance is integrated intoEEE\. The project recognizes three roles:*core maintainers*,*contributors*, and*community reviewers*, where maintainers hold final authority on contested decisions \(see App\.[E\.1](https://arxiv.org/html/2606.14516#A5.SS1)for more\)\. Decisions range from routine record additions to substantive schema changes, the latter following a structured community proposal and review process \(App\.[E\.2](https://arxiv.org/html/2606.14516#A5.SS2)\)\. Records are immutable once accepted\. Errors are handled via explicit correction and retraction mechanisms that preserve immutability and reproducibility \(App\.[E\.4](https://arxiv.org/html/2606.14516#A5.SS4)\)\. For example, discrepancy between evaluation results for LLaMA\[[29](https://arxiv.org/html/2606.14516#bib.bib40)\]exists andEEEstores both evaluation results as valid records, with the discrepancy visible in metadata rather than discovered through a blog post \(App\.[E\.3](https://arxiv.org/html/2606.14516#A5.SS3),[E\.6\.1](https://arxiv.org/html/2606.14516#A5.SS6.SSS1)\)\. Full details, including worked examples, are in App\.[E](https://arxiv.org/html/2606.14516#A5)\.
Contribution model:Evaluation infrastructure is a public good: every stakeholder needs it, yet no single entity has sufficient incentive to build it alone\.EEEaddresses this collectively: contributions range from evaluation records \(aggregate JSON files from converters, instance\-level companion files, or leaderboard scrapes; \(Section[4](https://arxiv.org/html/2606.14516#S4.SS0.SSS0.Px1)\) to converter extensions, schema proposals, and tooling \(Section[3](https://arxiv.org/html/2606.14516#S3)\)\. Benchmark creators can register artifacts inEEEformat, gaining visibility in downstream use\. Evaluation framework developers \(e\.g\., HELM\[[53](https://arxiv.org/html/2606.14516#bib.bib9)\], Inspect AI\[[2](https://arxiv.org/html/2606.14516#bib.bib10)\]\) get a standardized export path for their users\. Leaderboard operators can offload comparison and cross\-source aggregation to shared infrastructure\. Evaluation researchers \(see Section[7](https://arxiv.org/html/2606.14516#S7)for examples\) gain access to otherwise unattainable meta\-evaluation data\. Significant contributions are recognized through co\-authorship and formal industry partnerships \(See App\.[E\.5](https://arxiv.org/html/2606.14516#A5.SS5)\)\.
## 6Data analytics of theEvery Eval Everdatastore
Figure 2:Overview of the scale and diversity forEvery Eval Everdata\.As of May 4th 2026, current contributions total more than 200K aggregated results across over a hundred data contributions \(see summary statistics in App\.[A](https://arxiv.org/html/2606.14516#A1)\), providing a foundation for evaluation research and a lens on community\-wide reporting trends\. In this section, we highlight how the cost of running evaluations underscores the value of a shared resource likeEEE, and what the collected data reveals about community\-wide evaluation trends\.
##### Cost of AI evals:
Several works have discussed the cost of evaluating models\[[70](https://arxiv.org/html/2606.14516#bib.bib53),[34](https://arxiv.org/html/2606.14516#bib.bib71)\]; while future work should make cost more explicitly derivable from the schema, it remains underreported and difficult to infer\. Towards thism, we provide a conservative estimate of the savings such a shared resource may offer\. We conservatively estimate that reproducing just the evaluation runs currently collected inEEE, would cost hundreds of thousands of dollars \(App\.[D](https://arxiv.org/html/2606.14516#A4)\)\. This figure considers only running costs and excludes factors that would raise it by further orders of magnitude e\.g\., agentic evaluations \(Case Study §[F\.1](https://arxiv.org/html/2606.14516#A6.SS1)\), thinking models, repeated runs, failed attempts, long benchmarks, code execution, and human labeling\[[9](https://arxiv.org/html/2606.14516#bib.bib64),[43](https://arxiv.org/html/2606.14516#bib.bib63)\]\.
##### Community trends:
Beyond its value as a shared resource, the corpus offers a data\-rich view of community\-wide evaluation practices\. While acknowledging that coverage is biased by data availability, we analyze what the collected results reveal about how the field approaches AI evaluations \(App\.[A](https://arxiv.org/html/2606.14516#A1)\)\. We find that evaluations follow a long tail: popular benchmarks and models are reported at a scale far above the rest, yet thousands of less common ones appear, and the top 25 in each category account for barely 25% of all results\. Geographically, we observe a strong concentration on U\.S\.\-based models, with GPT models dominating\. Excluding human baselines, five companies account for 23 of the 24 most frequently evaluated systems, revealing a focus not only on specific models but on specific sources of models\. This concentration carries implications beyond socio\-political findings and suggests that much of the field is evaluating commercial products rather than underlying technologies, confirming recent claims in the literature\[[66](https://arxiv.org/html/2606.14516#bib.bib23)\]\.
Table 2:Macro\-average fill rate ofEvery Eval Evermetadata fields across the 31 evaluation harnesses and formats in theEEEdatastore\. We provide three metadata field examples each for model metadata, benchmark metadata, and evaluation metadata\.
##### Format inconsistencies:
Analyzing the data also corroborates our claims about format inconsistencies\. For example, the common source of evaluations is academic papers, which are not machine\-readable, and each uses a different reporting format\. Moreover, many fields crucial for comparisons are unreported; for instance, the*inference platform*is either explicitly marked as unknown or omitted entirely in 98% of all evaluation rows \(micro\-average\); even when each of the 31 formats is weighted equally, the field is reported in only 27% of rows on average \(macro\-average; Table[6](https://arxiv.org/html/2606.14516#S6.SS0.SSS0.Px2)\)\.
## 7Case Studies
EEEdata allows AI evaluations to be compared, reproduced, and reused, enabling broader impact through additional research and meta\-analyses\. For example,EEEdata can help identify where the evaluation ecosystem is thin, which capabilities are over\-measured, and which risks are neglected\. With instance\-level data, researchers can move beyond leaderboard averages to study item difficulty, robustness, and temporal drift\.Every Eval Everalso enables meta\-evaluation: testing evaluation methods themselves to distinguish real progress from artifacts of setup and reporting\. While different works already showcased uses forEEE\-like data\[e\.g\., for efficient benchmarking;[70](https://arxiv.org/html/2606.14516#bib.bib53)\], or even already usedEEEdata\[e\.g\., to characterize benchmark saturation;[4](https://arxiv.org/html/2606.14516#bib.bib38)\], we perform several initial studies to showcase research thatEEEenables\.
### 7\.1Case Study 1:Every Eval Everidentifies cost–accuracy tradeoffs in agentic evaluation
We showEEEis useful for analyzing agentic evaluations beyond accuracy scores\.EEEalready contains results from several agentic benchmarks, including SWE\-bench\[[42](https://arxiv.org/html/2606.14516#bib.bib62)\], HAL\[[43](https://arxiv.org/html/2606.14516#bib.bib63)\], Exgentic\[[8](https://arxiv.org/html/2606.14516#bib.bib58)\], and CocoaBench\[[37](https://arxiv.org/html/2606.14516#bib.bib59)\]\.EEEalso reports diverse metadata following previous work arguing that agent evaluations should track time, cost\[[44](https://arxiv.org/html/2606.14516#bib.bib61),[95](https://arxiv.org/html/2606.14516#bib.bib60)\], and other agent metadata\[[9](https://arxiv.org/html/2606.14516#bib.bib64)\], rather than reporting accuracy alone\. We rely on the additional metadata to reveal cost\-performance tradeoffs in scaffold and backbone choice\.
\(a\)CocoaBench\.
\(b\)CORE\-Bench Hard from HAL\.
Figure 3:Every Eval Everenables cost–accuracy analysis across agent scaffolds and model backbones\. Marker shape denotes the scaffold, and color denotes the backbone, with each point corresponding to one scaffold–backbone pair\. Segments connect results sharing a backbone, isolating scaffold effects\.Fig\.[3](https://arxiv.org/html/2606.14516#S7.F3)illustrates two concrete findings about cost–accuracy tradeoffs\. First, CocoaBench\[[37](https://arxiv.org/html/2606.14516#bib.bib59)\]shows that scaffold choice has substantial implication to costs, without necessarily showing performance gains in return\. For example, Codex and OpenClaw with GPT\-5\.4 reach the same reported accuracy, but Codex costs less and is also faster on average \(Appendix[F\.1](https://arxiv.org/html/2606.14516#A6.SS1)\)\. Second, CORE\-Bench Hard\[[83](https://arxiv.org/html/2606.14516#bib.bib27)\]from HAL\[[43](https://arxiv.org/html/2606.14516#bib.bib63)\]shows that scaffold effects can depend on the model backbone\. Claude code is cheaper than CORE\-Agent for both Opus 4\.5 and 4\.1, but it substantially increases accuracy for one and decreases for the other\.
Taken together, we find agentic evaluations cannot be interpreted from scalar accuracy alone: agent scaffold and model backbone choice, runtime, and cost all significantly affect the conclusions one draws from a result\. This kind of cross\-source decomposition is precisely whatEEEenables: without a common schema, these factors are scattered across incompatible logs and leaderboards, making systematic reanalysis challenging, particularly when metadata is reported inconsistently across sources\.
### 7\.2Case Study 2:Every Eval Everreveals version\-dependent perplexity
Model compression techniques\[[31](https://arxiv.org/html/2606.14516#bib.bib72),[30](https://arxiv.org/html/2606.14516#bib.bib73),[87](https://arxiv.org/html/2606.14516#bib.bib74)\]aim to reduce the size and computational cost of models, while minimizing degradation in performance\. WikiText perplexity\[[63](https://arxiv.org/html/2606.14516#bib.bib70)\]is a widely used metric to assess the impact of model compression, where lower perplexity indicates better predictive performance\. However, reported values across papers for the same model and dataset can differ substantially based on implementation choices that often go unreported\. GPTQ\[[31](https://arxiv.org/html/2606.14516#bib.bib72)\]and SpinQuant\[[57](https://arxiv.org/html/2606.14516#bib.bib75)\]shipped model\-specific evaluation scripts that report perplexity normalized by the number of tokens\. In contrast, the LM Evaluation Harness\[[32](https://arxiv.org/html/2606.14516#bib.bib8)\]reportsbyte\_perplexity,word\_perplexity, andbits\_per\_byterather than token\-level perplexity\. “Perplexity” alone is ambiguous as the same loss normalized by tokens, words, or bytes yields different numbers that are not directly comparable, as shown in Table[3](https://arxiv.org/html/2606.14516#S7.T3)\.
GPTQ scriptvLLM \+lm\-evalToken PPLWord PPLMismatchModele\(ℒsum/Ntokens\)e^\{\(\\mathcal\{L\}\_\{\\text\{sum\}\}/N\_\{\\text\{tokens\}\}\)\}e\(ℒsum/Nwords\)e^\{\(\\mathcal\{L\}\_\{\\text\{sum\}\}/N\_\{\\text\{words\}\}\)\}OPT\-6\.7B10\.860512\.2907\\cellcolordeltahl1\.4301Llama\-2\-7B5\.46878\.7939\\cellcolordeltahl3\.3252Table 3:Perplexity on WikiText under two evaluation implementations\. The summed cross\-entropyℒsum\\mathcal\{L\}\_\{\\text\{sum\}\}is identical across columns; only the normalization denominator differs\. Yet, the resulting values are not directly comparable\.EEEmakes these distinctions explicit\. Recording the evaluation backend, dataset version, and normalization convention prevents results from being compared merely because they share the “perplexity” label, and helps avoid drawing incorrect conclusions when, for example, a vLLM\-based evaluator reports a different variant from the GPTQ\-style script needed for direct comparison \(see App\.[F\.2](https://arxiv.org/html/2606.14516#A6.SS2)for implementation details\)\.
### 7\.3Case Study 3:Every Eval Evercaptures reproducibility gaps
We useEEEto audit instance\-level reproducibility\. Although evaluation frameworks do not usually promise exact reproducibility, researchers often rerun public evaluations locally and compare them to shared results\. We reproduced three models on fourteen single\-turn HELM benchmarks and compared aligned per\-instance scores between official HELM\-released records\[[53](https://arxiv.org/html/2606.14516#bib.bib9)\]and local reproductions after converting both sides toEEE\. Model and benchmark references appear in Fig\.[4](https://arxiv.org/html/2606.14516#S7.F4); implementation details are in App\.[F\.3](https://arxiv.org/html/2606.14516#A6.SS3)\.
Entity Imputation\[[61](https://arxiv.org/html/2606.14516#bib.bib97)\]IMDB\[[59](https://arxiv.org/html/2606.14516#bib.bib100)\]Synth\. Reasoning\[[94](https://arxiv.org/html/2606.14516#bib.bib104)\]TruthfulQA\[[55](https://arxiv.org/html/2606.14516#bib.bib105)\]GSM\[[25](https://arxiv.org/html/2606.14516#bib.bib99)\]LSAT QA\[[100](https://arxiv.org/html/2606.14516#bib.bib101)\]Civil Comments\[[17](https://arxiv.org/html/2606.14516#bib.bib96)\]MMLU\[[39](https://arxiv.org/html/2606.14516#bib.bib11)\]BoolQ\[[23](https://arxiv.org/html/2606.14516#bib.bib95)\]QuAC\[[21](https://arxiv.org/html/2606.14516#bib.bib103)\]NarrativeQA\[[46](https://arxiv.org/html/2606.14516#bib.bib102)\]Natural Syn\. Reason\.\[[24](https://arxiv.org/html/2606.14516#bib.bib88)\]WikiFact\[[71](https://arxiv.org/html/2606.14516#bib.bib89)\]Entity Matching\[[47](https://arxiv.org/html/2606.14516#bib.bib87)\]Pythia\-6\.9B\[[13](https://arxiv.org/html/2606.14516#bib.bib86)\]\\cellcolorcs3blue100100\\cellcolorcs3blue100100\\cellcolorcs3blue99999\.9\\cellcolorcs3blue99899\.8\\cellcolorcs3blue99699\.6\\cellcolorcs3blue99899\.8\\cellcolorcs3blue99999\.9\\cellcolorcs3blue100100\\cellcolorcs3blue99699\.6\\cellcolorcs3blue99799\.7\\cellcolorcs3blue98498\.4\\cellcolorcs3blue99799\.7\\cellcolorcs3blue92892\.8\\cellcolorcs3naN/AVicuna\-7B v1\.3\[[20](https://arxiv.org/html/2606.14516#bib.bib91)\]\\cellcolorcs3blue100100\\cellcolorcs3blue100100\\cellcolorcs3blue99799\.7\\cellcolorcs3blue99899\.8\\cellcolorcs3blue99099\.0\\cellcolorcs3blue99899\.8\\cellcolorcs3blue99799\.7\\cellcolorcs3blue100100\\cellcolorcs3blue100100\\cellcolorcs3blue98798\.7\\cellcolorcs3blue98598\.5\\cellcolorcs3blue99899\.8\\cellcolorcs3blue92092\.0\\cellcolorcs3naN/AFalcon\-7B\[[5](https://arxiv.org/html/2606.14516#bib.bib92)\]\\cellcolorcs3blue100100\\cellcolorcs3blue99199\.1\\cellcolorcs3blue98298\.2\\cellcolorcs3blue97797\.7\\cellcolorcs3blue98398\.3\\cellcolorcs3blue96096\.0\\cellcolorcs3blue94094\.0\\cellcolorcs3blue93593\.5\\cellcolorcs3blue93493\.4\\cellcolorcs3blue94294\.2\\cellcolorcs3blue91091\.0\\cellcolorcs3orange78878\.8\\cellcolorcs3blue92292\.2\\cellcolorcs3naN/A
10075Figure 4:Instance\-level score agreement between model–benchmark pairs for official HELM records and local reproductions after conversion toEvery Eval Ever\. Values report the percentage of aligned\(instance, core metric\)score pairs with identical official and local scores up to numerical tolerance\. N/A denotes no content\-hash overlap, making Entity\-Matching\[[47](https://arxiv.org/html/2606.14516#bib.bib87)\]incomparable\.Figure[4](https://arxiv.org/html/2606.14516#S7.F4)shows thatEEEexposes score and example mismatches\. Entity\-Matching \([right column](https://arxiv.org/html/2606.14516#entity)\) is incomparable because official and reproduced records select different Abt–Buy\[[47](https://arxiv.org/html/2606.14516#bib.bib87)\]examples despite using the same HELM recipe; App\.[F\.3](https://arxiv.org/html/2606.14516#A6.SS3)traces this to row\-order changes in the data\-processing stack\. SyntheticReasoning\-Natural reveals a serving artifact: official Pythia completions are empty and receive zero scores, whereas local completions are non\-empty and receive low but non\-zero scores\. WikiFact\[[71](https://arxiv.org/html/2606.14516#bib.bib89)\]shows roughly 92% agreement across models, consistent with stochastic sampling\. Smaller residual disagreements remain after these cases are explained\. Overall,EEEenables reproducibility forensics by surfacing mismatched example sets, empty or truncated completions, stochastic disagreement, and residual score differences\. Notably,EEEdoes not replace framework\-level provenance: when serving details are missing, the schema can surface the discrepancies but cannot always determine their exact cause\.
### 7\.4Case Study 4:Every Eval Everenables meta\-analysis using Item Response Theory
Figure 5:Estimated model abilities \(left\) and item difficulties \(right\) for three datasets included inEvery Eval Ever\.We showcase how the instance\-level data inEEEcan be used for analysis across datasets with otherwise incomparable data: GPQA Diamond\[[77](https://arxiv.org/html/2606.14516#bib.bib107)\], Wordle Arena\[[68](https://arxiv.org/html/2606.14516#bib.bib108)\], and JudgeBench\[[88](https://arxiv.org/html/2606.14516#bib.bib109)\]\. We fit a unidimensional Item Response Theory \(IRT\) model to analyze model ability and example difficulty distributions \(App\.[F\.4](https://arxiv.org/html/2606.14516#A6.SS4)\)\. At the dataset level, we see that the distribution of item difficulties and model abilities varies \(Fig\.[5](https://arxiv.org/html/2606.14516#S7.F5)\)\. In particular, the Wordle Arena examples are generally more difficult and their difficulty varies more\. This suggests GPQA may quickly shift from hard to saturated, while Wordle Arena will likely continue to surface challenging cases\. Overall, this showcases how the cross benchmark consistency simplifies comparing datasets and gaining new insights on existing evaluations\.
## 8Limitations
We note several limitations to the currentEEEschema\. First, coverage is strongest for text\-based, single\-model evaluations, while multi\-modal evaluations, human preference judgments \(e\.g\. Chatbot Arena Elo\), and multi\-agent settings are only partially supported; these areas are intended to be extended by the community\. Second, the value of the schema depends on broad community adoption: although converters reduce the cost of contribution, labs and leaderboard operators may still omit important metadata, such as generation parameters for proprietary systems, and this missing metadata is explicitly recorded\. Third, the UUID\-per\-run design preserves information losslessly but shifts deduplication to the analysis layer, with reference implementations of common equivalence criteria planned as package utilities\. Fourth, while the schema aids in finding reproducibility issues, as it does not run evaluations, information unique to this running settings is likely not reported in it\. Finally, despite automatic verification mechanisms and community governance for data additions, the resource remains participation\-based, so errors, inconsistencies, and uneven reporting across evaluation areas are likely to occur\.
## 9Conclusion
We presentEvery Eval Ever: a schema, validation pipeline, and converter suite that establishes a common language for AI evaluation reporting, supported by a growing community dataset of evaluation results\. By recording the context needed to interpret a score, not just the score itself,EEEmakes existing evaluation results reusable and enables analyses that per\-paper reporting cannot support\. We invite the community to contribute results, extend the schema, and build upon the dataset in future research\.
## References
- \[1\]\(2025\)Developing and maintaining an open\-source repository of AI evaluations: challenges and insights\.InChampioning Open\-source DEvelopment in ML Workshop @ ICML25,External Links:[Link](https://openreview.net/forum?id=yw33GWAEOK)Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px2.p1.1)\.
- \[2\]U\. AI Security Institute\(2024\-05\)Inspect AI: Framework for Large Language Model Evaluations\.External Links:[Link](https://github.com/UKGovernmentBEIS/inspect_ai)Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px1.p1.1),[§4](https://arxiv.org/html/2606.14516#S4.SS0.SSS0.Px1.p1.1),[§5](https://arxiv.org/html/2606.14516#S5.p2.1)\.
- \[3\]M\. Akhtar, O\. Benjelloun, C\. Conforti, L\. Foschini, P\. Gijsbers, J\. Giner\-Miguelez, S\. Goswami, N\. Jain, M\. Karamousadakis, S\. Krishna, M\. Kuchnik, S\. Lesage, Q\. Lhoest, P\. Marcenac, M\. Maskey, P\. Mattson, L\. Oala, H\. Oderinwale, P\. Ruyssen, T\. Santos, R\. Shinde, E\. Simperl, A\. Suresh, G\. Thomas, S\. Tykhonov, J\. Vanschoren, S\. Varma, J\. van der Velde, S\. Vogler, C\. Wu, and L\. Zhang\(2024\)Croissant: a metadata format for ML\-ready datasets\.InAdvances in Neural Information Processing Systems,Vol\.37,pp\. 82133–82148\.External Links:[Document](https://dx.doi.org/10.52202/079017-2610)Cited by:[Appendix E](https://arxiv.org/html/2606.14516#A5.p1.1),[§1](https://arxiv.org/html/2606.14516#S1.p3.1),[§1](https://arxiv.org/html/2606.14516#S1.p4.1),[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px4.p1.1),[§3\.1](https://arxiv.org/html/2606.14516#S3.SS1.p1.1)\.
- \[4\]M\. Akhtar, A\. Reuel, P\. Soni, S\. Ahuja, P\. S\. Ammanamanchi, R\. Rawal, V\. Zouhar, S\. Yadav, C\. Whitehouse, D\. Ki, J\. Mickel, L\. Choshen, M\. Šuppa, J\. Batzner, J\. Chim, J\. Sania, Y\. Long, H\. A\. Rahmani, C\. Knight, Y\. Nan, J\. Raj, Y\. Fan, S\. Singh, S\. Sahoo, E\. Habba, U\. Gohar, S\. Pawar, R\. Scholz, A\. Subramonian, J\. Ni, M\. Kochenderfer, S\. Koyejo, M\. Sachan, S\. Biderman, Z\. Talat, A\. Ghosh, and I\. Solaiman\(2026\)When ai benchmarks plateau: a systematic study of benchmark saturation\.External Links:2602\.16763,[Link](https://arxiv.org/abs/2602.16763)Cited by:[§1](https://arxiv.org/html/2606.14516#S1.p3.1),[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px3.p1.1),[§7](https://arxiv.org/html/2606.14516#S7.p1.1)\.
- \[5\]E\. Almazrouei, H\. Alobeidli, A\. Alshamsi, A\. Cappelli, R\. Cojocaru, M\. Debbah, É\. Goffinet, D\. Hesslow, J\. Launay, Q\. Malartic, D\. Mazzotta, B\. Noune, B\. Pannier, and G\. Penedo\(2023\)The falcon series of open language models\.External Links:2311\.16867,[Document](https://dx.doi.org/10.48550/arXiv.2311.16867),[Link](https://arxiv.org/abs/2311.16867)Cited by:[Figure 4](https://arxiv.org/html/2606.14516#S7.F4.2.1.1.1.1.1.4.1)\.
- \[6\]Artificial Analysis\(2026\)Independent analysis of ai models and hosting providers\.Note:[https://artificialanalysis\.ai/](https://artificialanalysis.ai/)Accessed: 2026\-05\-01Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px2.p1.1)\.
- \[7\]E\. Bandel, Y\. Perlitz, E\. Venezian, R\. Friedman, O\. Arviv, M\. Orbach, S\. Don\-Yehiya, D\. Sheinwald, A\. Gera, L\. Choshen,et al\.\(2024\)Unitxt: flexible, shareable and reusable data preparation and evaluation for generative ai\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 3: System Demonstrations\),pp\. 207–215\.Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px1.p1.1)\.
- \[8\]E\. Bandel, A\. Yehudai, L\. Eden, Y\. Sagron, Y\. Perlitz, E\. Venezian, N\. Razinkov, N\. Ergas, S\. S\. Ifergan, S\. Shlomov, M\. Jacovi, L\. Choshen, L\. Ein\-Dor, Y\. Katz, and M\. Shmueli\-Scheuer\(2026\)General agent evaluation\.External Links:2602\.22953,[Link](https://arxiv.org/abs/2602.22953)Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px5.p1.1),[§7\.1](https://arxiv.org/html/2606.14516#S7.SS1.p1.1)\.
- \[9\]E\. Bandel, A\. Yehudai, A\. Lacoste, A\. Ghosh, G\. Neubig, M\. Mitchell, M\. Shmueli\-Scheuer, and L\. Choshen\(2026\)Agentic systems should be general\.SSRN Electronic Journal\.External Links:[Link](https://ssrn.com/abstract=6176178)Cited by:[§6](https://arxiv.org/html/2606.14516#S6.SS0.SSS0.Px1.p1.1),[§7\.1](https://arxiv.org/html/2606.14516#S7.SS1.p1.1)\.
- \[10\]E\. Bandel, A\. Yehudai, and M\. Shmueli\-Scheuer\(April 27, 2026\)Ready for general agents? let’s test it\.\.InICLR Blogposts 2026,Note:https://iclr\-blogposts\.github\.io/2026/blog/2026/general\-agent\-evaluation/External Links:[Link](https://iclr-blogposts.github.io/2026/blog/2026/general-agent-evaluation/)Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px5.p1.1)\.
- \[11\]J\. Batzner, L\. Choshen, S\. H\. Nelaturu, D\. Stachura, A\. Kornilova, Y\. Long, U\. Gohar, A\. Tran, and A\. Ghosh\(2026\)Shared task of every eval ever: building a unifying, standardized database of llm evaluations\.Note:PreprintCited by:[§E\.5](https://arxiv.org/html/2606.14516#A5.SS5.p1.1)\.
- \[12\]E\. Beeching, C\. Fourrier, N\. Habib, S\. Han, N\. Lambert, N\. Rajani, O\. Sanseviero, Y\. Belkada, and T\. Wolf\(2023\)Open LLM leaderboard\.Note:[https://huggingface\.co/spaces/HuggingFaceH4/open\_llm\_leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px2.p1.1)\.
- \[13\]S\. Biderman, H\. Schoelkopf, Q\. G\. Anthony, H\. Bradley, K\. O’Brien, E\. Hallahan, M\. A\. Khan, S\. Purohit, U\. S\. Prashanth, E\. Raff, A\. Skowron, L\. Sutawika, and O\. Van Der Wal\(2023\-23–29 Jul\)Pythia: a suite for analyzing large language models across training and scaling\.InProceedings of the 40th International Conference on Machine Learning,A\. Krause, E\. Brunskill, K\. Cho, B\. Engelhardt, S\. Sabato, and J\. Scarlett \(Eds\.\),Proceedings of Machine Learning Research, Vol\.202,pp\. 2397–2430\.External Links:[Link](https://proceedings.mlr.press/v202/biderman23a.html)Cited by:[Figure 4](https://arxiv.org/html/2606.14516#S7.F4.2.1.1.1.1.1.2.1)\.
- \[14\]S\. Biderman, H\. Schoelkopf, L\. Sutawika, L\. Gao, J\. Tow, B\. Abbasi, A\. F\. Aji, P\. S\. Ammanamanchi, S\. Black, J\. Clive, A\. DiPofi, J\. Etxaniz, B\. Fattori, J\. Z\. Forde, C\. Foster, J\. Hsu, M\. Jaiswal, W\. Y\. Lee, H\. Li, C\. Lovering, N\. Muennighoff, E\. Pavlick, J\. Phang, A\. Skowron, S\. Tan, X\. Tang, K\. A\. Wang, G\. I\. Winata, F\. Yvon, and A\. Zou\(2024\)Lessons from the trenches on reproducible evaluation of language models\.External Links:2405\.14782,[Link](https://arxiv.org/abs/2405.14782)Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px3.p1.1)\.
- \[15\]K\. Blagec, G\. Dorffner, M\. Moradi, M\. Alam, and M\. Samwald\(2021\)Are NLP benchmarks saturating?\.External Links:2105\.13977Cited by:[§1](https://arxiv.org/html/2606.14516#S1.p3.1)\.
- \[16\]F\. Bordes, C\. Ross, J\. T\. Kao, E\. Spiliopoulou, and A\. Williams\(2025\)Eval factsheets: a structured framework for documenting ai evaluations\.External Links:2512\.04062,[Link](https://arxiv.org/abs/2512.04062)Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px4.p1.1)\.
- \[17\]D\. Borkan, L\. Dixon, J\. Sorensen, N\. Thain, and L\. Vasserman\(2019\)Nuanced metrics for measuring unintended bias with real data for text classification\.arXiv preprint arXiv:1903\.04561\.External Links:[Link](https://arxiv.org/abs/1903.04561)Cited by:[Figure 4](https://arxiv.org/html/2606.14516#S7.F4.2.1.1.1.1.1.1.8.1.1.1.1.1.1)\.
- \[18\]S\. R\. Bowman and G\. E\. Dahl\(2021\)What will it take to fix benchmarking in natural language understanding?\.InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,Online,pp\. 4843–4855\.External Links:[Document](https://dx.doi.org/10.18653/v1/2021.naacl-main.385),[Link](https://aclanthology.org/2021.naacl-main.385/)Cited by:[§1](https://arxiv.org/html/2606.14516#S1.p3.1),[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px4.p1.1)\.
- \[19\]R\. Burnell, W\. Schellaert, J\. Burden, T\. D\. Ullman, F\. Martinez\-Plumed, J\. B\. Tenenbaum, D\. Rutar, L\. G\. Cheke, J\. Sohl\-Dickstein, M\. Mitchell, D\. Kiela, M\. Shanahan, E\. M\. Voorhees, A\. G\. Cohn, J\. Z\. Leibo, and J\. Hernandez\-Orallo\(2023\)Rethink reporting of evaluation results in AI\.Science380\(6641\),pp\. 136–138\.External Links:[Document](https://dx.doi.org/10.1126/science.adf6369)Cited by:[§1](https://arxiv.org/html/2606.14516#S1.p3.1),[§3\.2](https://arxiv.org/html/2606.14516#S3.SS2.SSS0.Px5.p1.1)\.
- \[20\]W\. Chiang, Z\. Li, Z\. Lin, Y\. Sheng, Z\. Wu, H\. Zhang, L\. Zheng, S\. Zhuang, Y\. Zhuang, J\. E\. Gonzalez, I\. Stoica, and E\. P\. Xing\(2023\-03\)Vicuna: an open\-source chatbot impressing gpt\-4 with 90%\* chatgpt quality\.External Links:[Link](https://lmsys.org/blog/2023-03-30-vicuna/)Cited by:[Figure 4](https://arxiv.org/html/2606.14516#S7.F4.2.1.1.1.1.1.3.1)\.
- \[21\]E\. Choi, H\. He, M\. Iyyer, M\. Yatskar, W\. Yih, Y\. Choi, P\. Liang, and L\. Zettlemoyer\(2018\)QuAC: question answering in context\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,pp\. 2174–2184\.External Links:[Document](https://dx.doi.org/10.18653/v1/D18-1241),[Link](https://aclanthology.org/D18-1241/)Cited by:[Figure 4](https://arxiv.org/html/2606.14516#S7.F4.2.1.1.1.1.1.1.11.1.1.1.1.1.1)\.
- \[22\]L\. Choshen, Y\. Zhang, and J\. Andreas\(2025\)A hitchhiker’s guide to scaling law estimation\.InInternational Conference on Machine Learning,pp\. 10683–10699\.Cited by:[§1](https://arxiv.org/html/2606.14516#S1.p2.1),[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px3.p1.1)\.
- \[23\]C\. Clark, K\. Lee, M\. Chang, T\. Kwiatkowski, M\. Collins, and K\. Toutanova\(2019\)BoolQ: exploring the surprising difficulty of natural yes/no questions\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\),pp\. 2924–2936\.External Links:[Link](https://aclanthology.org/N19-1300/)Cited by:[Figure 4](https://arxiv.org/html/2606.14516#S7.F4.2.1.1.1.1.1.1.10.1.1.1.1.1.1)\.
- \[24\]P\. Clark, O\. Tafjord, and K\. Richardson\(2020\)Transformers as soft reasoners over language\.InProceedings of the Twenty\-Ninth International Joint Conference on Artificial Intelligence,pp\. 3882–3890\.External Links:[Document](https://dx.doi.org/10.24963/ijcai.2020/537),[Link](https://arxiv.org/abs/2002.05867)Cited by:[Figure 4](https://arxiv.org/html/2606.14516#S7.F4.2.1.1.1.1.1.1.13.1.1.1.1.1.1)\.
- \[25\]K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman\(2021\)Training verifiers to solve math word problems\.External Links:2110\.14168,[Link](https://arxiv.org/abs/2110.14168)Cited by:[Figure 4](https://arxiv.org/html/2606.14516#S7.F4.2.1.1.1.1.1.1.6.1.1.1.1.1.1)\.
- \[26\]Epoch AI\(2026\)About us: making sense of ai\.Note:[https://epoch\.ai/about](https://epoch.ai/about)Accessed: 2026\-05\-01Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px2.p1.1)\.
- \[27\]K\. Ethayarajh and D\. Jurafsky\(2020\)Utility is in the eye of the user: A critique of NLP leaderboards\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),Online,pp\. 4846–4853\.External Links:[Document](https://dx.doi.org/10.18653/v1/2020.emnlp-main.393),[Link](https://aclanthology.org/2020.emnlp-main.393/)Cited by:[§1](https://arxiv.org/html/2606.14516#S1.p3.1)\.
- \[28\]European Parliament and Council of the European Union\(2024\)Regulation \(EU\) 2024/1689 of the European Parliament and of the Council: artificial intelligence act\.Note:[https://eur\-lex\.europa\.eu/legal\-content/EN/TXT/?uri=CELEX:32024R1689](https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689)Cited by:[§1](https://arxiv.org/html/2606.14516#S1.p3.1)\.
- \[29\]C\. Fourrier, N\. Habib, J\. Launay, and T\. Wolf\(2023\-06\-23\)What’s going on with the open LLM leaderboard?\.Note:Hugging Face BlogExternal Links:[Link](https://huggingface.co/blog/open-llm-leaderboard-mmlu)Cited by:[§E\.6\.1](https://arxiv.org/html/2606.14516#A5.SS6.SSS1.p1.1),[§1](https://arxiv.org/html/2606.14516#S1.p2.1),[§5](https://arxiv.org/html/2606.14516#S5.p1.1)\.
- \[30\]E\. Frantar and D\. Alistarh\(2023\)SparseGPT: massive language models can be accurately pruned in one\-shot\.InProceedings of the 40th International Conference on Machine Learning,ICML’23\.Cited by:[§7\.2](https://arxiv.org/html/2606.14516#S7.SS2.p1.1)\.
- \[31\]E\. Frantar, S\. Ashkboos, T\. Hoefler, and D\. Alistarh\(2023\)OPTQ: accurate quantization for generative pre\-trained transformers\.InThe Eleventh International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=tcbBPnfwxS)Cited by:[§7\.2](https://arxiv.org/html/2606.14516#S7.SS2.p1.1)\.
- \[32\]L\. Gao, J\. Tow, B\. Abbasi, S\. Biderman, S\. Black, A\. DiPofi, C\. Foster, L\. Golding, J\. Hsu, A\. Le Noac’h, H\. Li, K\. McDonell, N\. Muennighoff, C\. Ociepa, J\. Phang, L\. Reynolds, H\. Schoelkopf, A\. Skowron, L\. Sutawika, E\. Tang, A\. Thite, B\. Wang, K\. Wang, and A\. Zou\(2023\-12\)A framework for few\-shot language model evaluation\.Zenodo\.External Links:[Document](https://dx.doi.org/10.5281/zenodo.5371628),[Link](https://doi.org/10.5281/zenodo.5371628)Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px1.p1.1),[§4](https://arxiv.org/html/2606.14516#S4.SS0.SSS0.Px1.p1.1),[§7\.2](https://arxiv.org/html/2606.14516#S7.SS2.p1.1)\.
- \[33\]T\. Gebru, J\. Morgenstern, B\. Vecchione, J\. W\. Vaughan, H\. Wallach, H\. D\. III, and K\. Crawford\(2021\)Datasheets for datasets\.Communications of the ACM64\(12\),pp\. 86–92\.External Links:[Document](https://dx.doi.org/10.1145/3458723)Cited by:[§1](https://arxiv.org/html/2606.14516#S1.p3.1),[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px4.p1.1)\.
- \[34\]A\. Ghosh, Y\. Mai, G\. Channing, and L\. Choshen\(2026\-04\)AI evals are becoming the new compute bottleneck\.Note:EvalEval Coalition BlogExternal Links:[Link](https://evalevalai.com/research/2026/04/29/eval-costs-bottleneck/)Cited by:[§1](https://arxiv.org/html/2606.14516#S1.p2.1),[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px3.p1.1),[§6](https://arxiv.org/html/2606.14516#S6.SS0.SSS0.Px1.p1.1)\.
- \[35\]E\. Habba, O\. Arviv, I\. Itzhak, Y\. Perlitz, E\. Bandel, L\. Choshen, M\. Shmueli\-Scheuer, and G\. Stanovsky\(2025\)DOVE: a large\-scale multi\-dimensional predictions dataset towards meaningful llm evaluation\.InFindings of the Association for Computational Linguistics: ACL 2025,pp\. 11744–11763\.Cited by:[§F\.4](https://arxiv.org/html/2606.14516#A6.SS4.p1.5),[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px3.p1.1)\.
- \[36\]E\. Habba, I\. Itzhak, A\. Yehudai, Y\. Perlitz, E\. Bandel, M\. Shmueli\-Scheuer, L\. Choshen, and G\. Stanovsky\(2026\)Growing pains: extensible and efficient llm benchmarking via fixed parameter calibration\.arXiv preprint arXiv:2604\.12843\.Cited by:[§F\.4](https://arxiv.org/html/2606.14516#A6.SS4.p1.5)\.
- \[37\]S\. Hao, Z\. Zhang, Z\. Liang, T\. Liu, Y\. Zha, Q\. Gao, J\. Chen, Z\. Wang, Z\. Cheng, H\. Zhang, J\. Wang, H\. Jin, B\. Zheng, K\. Zhou, Y\. Wang, F\. Yao, L\. Liu, Y\. Li, Z\. Li, Z\. Han, P\. Promthaw, T\. Cerruti, X\. Fu, Z\. Ma, J\. Shang, L\. Qin, J\. McAuley, E\. P\. Xing, Z\. Liu, R\. K\. Srivastava, and Z\. Hu\(2026\)CocoaBench: evaluating unified digital agents in the wild\.External Links:2604\.11201,[Link](https://arxiv.org/abs/2604.11201)Cited by:[§F\.1](https://arxiv.org/html/2606.14516#A6.SS1.p1.1),[§7\.1](https://arxiv.org/html/2606.14516#S7.SS1.p1.1),[§7\.1](https://arxiv.org/html/2606.14516#S7.SS1.p2.1)\.
- \[38\]Harbor Framework Team\(2026\-01\)Harbor: A framework for evaluating and optimizing agents and models in container environments\.External Links:[Link](https://github.com/harbor-framework/harbor)Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px5.p1.1)\.
- \[39\]D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt\(2021\)Measuring massive multitask language understanding\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.14516#S1.p2.1),[Figure 4](https://arxiv.org/html/2606.14516#S7.F4.2.1.1.1.1.1.1.9.1.1.1.1.1.1)\.
- \[40\]A\. Hofmann, I\. Vejsbjerg, D\. Salwala, and E\. M\. Daly\(2025\)Auto\-benchmarkcard: automated synthesis of benchmark documentation\.External Links:2512\.09577,[Link](https://arxiv.org/abs/2512.09577)Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px4.p1.1)\.
- \[41\]H\. Jiang, S\. Zhang, X\. Yi, X\. Xie, and Z\. Xiao\(2026\)Position: science of ai evaluation requires item\-level benchmark data\.External Links:2604\.03244,[Link](https://arxiv.org/abs/2604.03244)Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px2.p1.1)\.
- \[42\]C\. E\. Jimenez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. Narasimhan\(2024\)SWE\-bench: can language models resolve real\-world github issues?\.External Links:2310\.06770,[Link](https://arxiv.org/abs/2310.06770)Cited by:[§7\.1](https://arxiv.org/html/2606.14516#S7.SS1.p1.1)\.
- \[43\]S\. Kapoor, B\. Stroebl, P\. Kirgis, N\. Nadgir, Z\. S\. Siegel, B\. Wei, T\. Xue, Z\. Chen, F\. Chen, S\. Utpala, F\. Ndzomga, D\. Oruganty, S\. Luskin, K\. Liu, B\. Yu, A\. Arora, D\. Hahm, H\. Trivedi, H\. Sun, J\. Lee, T\. Jin, Y\. Mai, Y\. Zhou, Y\. Zhu, R\. Bommasani, D\. Kang, D\. Song, P\. Henderson, Y\. Su, P\. Liang, and A\. Narayanan\(2025\)Holistic agent leaderboard: the missing infrastructure for ai agent evaluation\.External Links:2510\.11977,[Link](https://arxiv.org/abs/2510.11977)Cited by:[§F\.1](https://arxiv.org/html/2606.14516#A6.SS1.p1.1),[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px5.p1.1),[§6](https://arxiv.org/html/2606.14516#S6.SS0.SSS0.Px1.p1.1),[§7\.1](https://arxiv.org/html/2606.14516#S7.SS1.p1.1),[§7\.1](https://arxiv.org/html/2606.14516#S7.SS1.p2.1)\.
- \[44\]S\. Kapoor, B\. Stroebl, Z\. S\. Siegel, N\. Nadgir, and A\. Narayanan\(2024\)AI agents that matter\.External Links:2407\.01502,[Link](https://arxiv.org/abs/2407.01502)Cited by:[§7\.1](https://arxiv.org/html/2606.14516#S7.SS1.p1.1)\.
- \[45\]A\. Kipnis, K\. Voudouris, L\. M\. S\. Buschoff, and E\. Schulz\(2024\)Metabench–a sparse benchmark of reasoning and knowledge in large language models\.arXiv preprint arXiv:2407\.12844\.Cited by:[§F\.4](https://arxiv.org/html/2606.14516#A6.SS4.p1.5)\.
- \[46\]T\. Kočiský, J\. Schwarz, P\. Blunsom, C\. Dyer, K\. M\. Hermann, G\. Melis, and E\. Grefenstette\(2018\)The NarrativeQA reading comprehension challenge\.Transactions of the Association for Computational Linguistics6,pp\. 317–328\.External Links:[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00023),[Link](https://aclanthology.org/Q18-1023/)Cited by:[Figure 4](https://arxiv.org/html/2606.14516#S7.F4.2.1.1.1.1.1.1.12.1.1.1.1.1.1)\.
- \[47\]H\. Köpcke, A\. Thor, and E\. Rahm\(2010\)Evaluation of entity resolution approaches on real\-world match problems\.Proceedings of the VLDB Endowment3\(1–2\),pp\. 484–493\.External Links:[Document](https://dx.doi.org/10.14778/1920841.1920904)Cited by:[Figure 4](https://arxiv.org/html/2606.14516#S7.F4),[Figure 4](https://arxiv.org/html/2606.14516#S7.F4.13.2),[Figure 4](https://arxiv.org/html/2606.14516#S7.F4.2.1.1.1.1.1.1.15.1.1.1.1.1.1),[§7\.3](https://arxiv.org/html/2606.14516#S7.SS3.p2.1)\.
- \[48\]E\. Kurtic, A\. N\. Marques, S\. Pandit, M\. Kurtz, and D\. Alistarh\(2025\)“Give me bf16 or give me death”? accuracy\-performance trade\-offs in llm quantization\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 26872–26886\.Cited by:[§1](https://arxiv.org/html/2606.14516#S1.p2.1)\.
- \[49\]A\. Lacoste, N\. Gontier, O\. Shliazhko, A\. Jaiswal, K\. Sareen, S\. Nanisetty, J\. Cabezas, M\. D\. Verme, O\. G\. Younis, S\. Baratta, M\. Avalle, I\. Kerboua, X\. H\. Lù, E\. Bandel, M\. Shmueli\-Scheuer, A\. Yehudai, L\. Choshen, J\. Lebensold, S\. Hughes, M\. Caccia, A\. Drouin, S\. Reddy, T\. Yu, Y\. Su, G\. Neubig, and D\. Song\(2026\)CUBE: a standard for unifying agent benchmarks\.External Links:2603\.15798,[Link](https://arxiv.org/abs/2603.15798)Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px5.p1.1)\.
- \[50\]J\. P\. Lalor and P\. Rodriguez\(2023\)Py\-irt: a scalable item response theory library for python\.INFORMS Journal on Computing35\(1\),pp\. 5–13\.Cited by:[§F\.4](https://arxiv.org/html/2606.14516#A6.SS4.p4.4)\.
- \[51\]P\. Li, X\. Tang, S\. Chen, Y\. Cheng, R\. Metoyer, T\. Hua, and N\. V\. Chawla\(2026\)Adaptive testing for llm evaluation: a psychometric alternative to static benchmarks\.External Links:2511\.04689,[Link](https://arxiv.org/abs/2511.04689)Cited by:[§1](https://arxiv.org/html/2606.14516#S1.p3.1)\.
- \[52\]X\. Li, T\. Zhang, Y\. Dubois, R\. Taori, I\. Gulrajani, C\. Guestrin, P\. Liang, and T\. B\. Hashimoto\(2023\)AlpacaEval: an automatic evaluator of instruction\-following models\.External Links:[Link](https://github.com/tatsu-lab/alpaca_eval)Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px2.p1.1)\.
- \[53\]P\. Liang, R\. Bommasani, T\. Lee, D\. Tsipras, D\. Soylu, M\. Yasunaga, Y\. Zhang, D\. Narayanan, Y\. Wu, A\. Kumar, B\. Newman, B\. Yuan, B\. Yan, C\. Zhang, C\. Cosgrove, C\. D\. Manning, C\. Ré, D\. Acosta\-Navas, D\. Hudson, E\. Zelikman, E\. Durmus, F\. Ladhak, F\. Rong, H\. Ren, H\. Yao, J\. Wang, K\. Santhanam, L\. Orr, L\. Zheng, M\. Yüksekgönül, M\. Suzgun, N\. Kim, N\. Guha, N\. Chatterji, O\. Khattab, P\. Henderson, Q\. Huang, R\. Chi, S\. M\. Xie, S\. Santurkar, S\. Ganguli, T\. Hashimoto, T\. Icard, T\. Zhang, V\. Chaudhary, W\. Wang, X\. Li, Y\. Mai, Y\. Zhang, and Y\. Koreeda\(2023\)Holistic evaluation of language models\.Transactions on Machine Learning Research\.Cited by:[§1](https://arxiv.org/html/2606.14516#S1.p3.1),[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px2.p1.1),[§4](https://arxiv.org/html/2606.14516#S4.SS0.SSS0.Px1.p1.1),[§5](https://arxiv.org/html/2606.14516#S5.p2.1),[§7\.3](https://arxiv.org/html/2606.14516#S7.SS3.p1.1)\.
- \[54\]T\. Liao, R\. Taori, I\. D\. Raji, and L\. Schmidt\(2021\)Are we learning yet? A meta review of evaluation failures across machine learning\.InProceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks,External Links:[Link](https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/757b505cfd34c64c85ca5b5690ee5293-Abstract-round2.html)Cited by:[§1](https://arxiv.org/html/2606.14516#S1.p3.1),[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px4.p1.1)\.
- \[55\]S\. Lin, J\. Hilton, and O\. Evans\(2022\)TruthfulQA: measuring how models mimic human falsehoods\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 3214–3252\.External Links:[Document](https://dx.doi.org/10.18653/v1/2022.acl-long.229),[Link](https://aclanthology.org/2022.acl-long.229/)Cited by:[Figure 4](https://arxiv.org/html/2606.14516#S7.F4.2.1.1.1.1.1.1.5.1.1.1.1.1.1)\.
- \[56\]J\. Liu, W\. Li, Z\. Jin, and M\. Diab\(2024\-06\)Automatic generation of model and data cards: a step towards responsible AI\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),K\. Duh, H\. Gomez, and S\. Bethard \(Eds\.\),Mexico City, Mexico,pp\. 1975–1997\.External Links:[Link](https://aclanthology.org/2024.naacl-long.110/),[Document](https://dx.doi.org/10.18653/v1/2024.naacl-long.110)Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px4.p1.1)\.
- \[57\]Z\. Liu, C\. Zhao, I\. Fedorov, B\. Soran, D\. Choudhary, R\. Krishnamoorthi, V\. Chandra, Y\. Tian, and T\. Blankevoort\(2025\)SpinQuant: llm quantization with learned rotations\.External Links:2405\.16406,[Link](https://arxiv.org/abs/2405.16406)Cited by:[§7\.2](https://arxiv.org/html/2606.14516#S7.SS2.p1.1)\.
- \[58\]Y\. Lu, M\. Bartolo, A\. Moore, S\. Riedel, and P\. Stenetorp\(2022\)Fantastically ordered prompts and where to find them: overcoming few\-shot prompt order sensitivity\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 8086–8098\.Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px3.p1.1)\.
- \[59\]A\. L\. Maas, R\. E\. Daly, P\. T\. Pham, D\. Huang, A\. Y\. Ng, and C\. Potts\(2011\)Learning word vectors for sentiment analysis\.InProceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies,pp\. 142–150\.External Links:[Link](https://aclanthology.org/P11-1015/)Cited by:[Figure 4](https://arxiv.org/html/2606.14516#S7.F4.2.1.1.1.1.1.1.3.1.1.1.1.1.1)\.
- \[60\]I\. Magar and R\. Schwartz\(2022\)Data contamination: from memorization to exploitation\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\),pp\. 157–165\.Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px3.p1.1)\.
- \[61\]Y\. Mei, S\. Song, C\. Fang, H\. Yang, J\. Fang, and J\. Long\(2021\)Capturing semantics for imputation with pre\-trained language models\.In2021 IEEE 37th International Conference on Data Engineering \(ICDE\),pp\. 61–72\.External Links:[Document](https://dx.doi.org/10.1109/ICDE51399.2021.00013),[Link](https://doi.org/10.1109/ICDE51399.2021.00013)Cited by:[Figure 4](https://arxiv.org/html/2606.14516#S7.F4.2.1.1.1.1.1.1.2.1.1.1.1.1.1)\.
- \[62\]G\. Meng, Q\. Zeng, J\. P\. Lalor, and H\. Yu\(2025\)A psychology\-based unified dynamic framework for curriculum learning\.Computational Linguistics,pp\. 1–49\.Cited by:[§F\.4](https://arxiv.org/html/2606.14516#A6.SS4.p1.5)\.
- \[63\]S\. Merity, C\. Xiong, J\. Bradbury, and R\. Socher\(2016\)Pointer sentinel mixture models\.External Links:1609\.07843,[Link](https://arxiv.org/abs/1609.07843)Cited by:[§7\.2](https://arxiv.org/html/2606.14516#S7.SS2.p1.1)\.
- \[64\]M\. A\. Merrill, A\. G\. Shaw, N\. Carlini, B\. Li, H\. Raj, I\. Bercovich, L\. Shi, J\. Y\. Shin, T\. Walshe, E\. K\. Buchanan, J\. Shen, G\. Ye, H\. Lin, J\. Poulos, M\. Wang, M\. Nezhurina, J\. Jitsev, D\. Lu, O\. M\. Mastromichalakis, Z\. Xu, Z\. Chen, Y\. Liu, R\. Zhang, L\. L\. Chen, A\. Kashyap, J\. Uslu, J\. Li, J\. Wu, M\. Yan, S\. Bian, V\. Sharma, K\. Sun, S\. Dillmann, A\. Anand, A\. Lanpouthakoun, B\. Koopah, C\. Hu, E\. Guha, G\. H\. S\. Dreiman, J\. Zhu, K\. Krauth, L\. Zhong, N\. Muennighoff, R\. Amanfu, S\. Tan, S\. Pimpalgaonkar, T\. Aggarwal, X\. Lin, X\. Lan, X\. Zhao, Y\. Liang, Y\. Wang, Z\. Wang, C\. Zhou, D\. Heineman, H\. Liu, H\. Trivedi, J\. Yang, J\. Lin, M\. Shetty, M\. Yang, N\. Omi, N\. Raoof, S\. Li, T\. Y\. Zhuo, W\. Lin, Y\. Dai, Y\. Wang, W\. Chai, S\. Zhou, D\. Wahdany, Z\. She, J\. Hu, Z\. Dong, Y\. Zhu, S\. Cui, A\. Saiyed, A\. Kolbeinsson, J\. Hu, C\. M\. Rytting, R\. Marten, Y\. Wang, A\. Dimakis, A\. Konwinski, and L\. Schmidt\(2026\)Terminal\-bench: benchmarking agents on hard, realistic tasks in command line interfaces\.External Links:2601\.11868,[Link](https://arxiv.org/abs/2601.11868)Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px5.p1.1)\.
- \[65\]METR\(2026\-01\)Time horizon 1\.1\.Note:[https://metr\.org/blog/2026\-1\-29\-time\-horizon\-1\-1/](https://metr.org/blog/2026-1-29-time-horizon-1-1/)Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px2.p1.1)\.
- \[66\]J\. A\. Michaelov, C\. Arnett, T\. A\. Chang, P\. D\. Rivière, S\. M\. Taylor, C\. R\. Jones, S\. Trott, R\. P\. Levy, B\. K\. Bergen, and M\. Altman\(2026\)How open must language models be to enable reliable scientific inference?\.arXiv preprint arXiv:2603\.26539\.Cited by:[§6](https://arxiv.org/html/2606.14516#S6.SS0.SSS0.Px2.p1.1)\.
- \[67\]M\. Mitchell, S\. Wu, A\. Zaldivar, P\. Barnes, L\. Vasserman, B\. Hutchinson, E\. Spitzer, I\. D\. Raji, and T\. Gebru\(2019\)Model cards for model reporting\.InProceedings of the Conference on Fairness, Accountability, and Transparency,pp\. 220–229\.External Links:[Document](https://dx.doi.org/10.1145/3287560.3287596)Cited by:[§1](https://arxiv.org/html/2606.14516#S1.p3.1),[§1](https://arxiv.org/html/2606.14516#S1.p4.1),[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px4.p1.1)\.
- \[68\]C\. Murphy and C\. Liu\(2025\)AI\-assisted wordle demo: combining llms and rule\-based solvers for enhanced gameplay\.In2025 IEEE Conference on Games \(CoG\),pp\. 1–2\.Cited by:[§F\.4](https://arxiv.org/html/2606.14516#A6.SS4.p3.2),[§7\.4](https://arxiv.org/html/2606.14516#S7.SS4.p1.1)\.
- \[69\]S\. H\. Nelaturu, N\. K\. Ravichandran, C\. Tran, S\. Hooker, and F\. Fioretto\(2024\)On the fairness impacts of hardware selection in machine learning\.Proceedings of the 41st International Conference on Machine Learning \(ICML\)\.External Links:[Link](https://proceedings.mlr.press/v235/nelaturu24a.html)Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px3.p1.1)\.
- \[70\]Y\. Perlitz, E\. Bandel, A\. Gera, O\. Arviv, L\. Ein\-Dor, E\. Shnarch, N\. Slonim, M\. Shmueli\-Scheuer, and L\. Choshen\(2024\-06\)Efficient benchmarking \(of language models\)\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),K\. Duh, H\. Gomez, and S\. Bethard \(Eds\.\),Mexico City, Mexico,pp\. 2519–2536\.External Links:[Link](https://aclanthology.org/2024.naacl-long.139/),[Document](https://dx.doi.org/10.18653/v1/2024.naacl-long.139)Cited by:[§1](https://arxiv.org/html/2606.14516#S1.p2.1),[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px3.p1.1),[§6](https://arxiv.org/html/2606.14516#S6.SS0.SSS0.Px1.p1.1),[§7](https://arxiv.org/html/2606.14516#S7.p1.1)\.
- \[71\]F\. Petroni, T\. Rocktäschel, S\. Riedel, P\. Lewis, A\. Bakhtin, Y\. Wu, and A\. Miller\(2019\)Language models as knowledge bases?\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing,pp\. 2463–2473\.External Links:[Document](https://dx.doi.org/10.18653/v1/D19-1250),[Link](https://aclanthology.org/D19-1250/)Cited by:[Figure 4](https://arxiv.org/html/2606.14516#S7.F4.2.1.1.1.1.1.1.14.1.1.1.1.1.1),[§7\.3](https://arxiv.org/html/2606.14516#S7.SS3.p2.1)\.
- \[72\]F\. M\. Polo, R\. Xu, L\. Weber, M\. Silva, O\. Bhardwaj, L\. Choshen, A\. F\. de Oliveira, Y\. Sun, and M\. Yurochkin\(2024\)Efficient multi\-prompt evaluation of llms\.Advances in Neural Information Processing Systems37,pp\. 22483–22512\.Cited by:[§F\.4](https://arxiv.org/html/2606.14516#A6.SS4.p1.5)\.
- \[73\]F\. M\. Polo, L\. Choshen, Y\. Sun, and K\. Greenewald\(2025\)A statistical framework for game\-based ai evaluation\.InNeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling,Cited by:[§F\.4](https://arxiv.org/html/2606.14516#A6.SS4.p1.5)\.
- \[74\]F\. M\. Polo, L\. Weber, L\. Choshen, Y\. Sun, G\. Xu, and M\. Yurochkin\(2024\)TinyBenchmarks: evaluating llms with fewer examples\.External Links:2402\.14992,[Link](https://arxiv.org/abs/2402.14992)Cited by:[§F\.4](https://arxiv.org/html/2606.14516#A6.SS4.p1.5),[§1](https://arxiv.org/html/2606.14516#S1.p3.1)\.
- \[75\]I\. D\. Raji, E\. Denton, E\. M\. Bender, A\. Hanna, and A\. Paullada\(2021\)AI and the everything in the whole wide world benchmark\.InProceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks,External Links:[Link](https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/084b6fbb10729ed4da8c3d3f5a3ae7c9-Abstract-round2.html)Cited by:[§1](https://arxiv.org/html/2606.14516#S1.p3.1)\.
- \[76\]V\. J\. Reddi, C\. Cheng, D\. Kanter, P\. Mattson, G\. Schmuelling, C\. Wu, B\. Anderson, M\. Breughe, M\. Charlebois, W\. Chou, R\. Chukka, C\. Coleman, S\. Davis, P\. Deng, G\. Diamos, J\. Duke, D\. Fick, J\. S\. Gardner, I\. Hubara, S\. Idgunji, T\. B\. Jablin, J\. Jiao, T\. St\. John, P\. Kanwar, D\. Lee, J\. Liao, A\. Lokhmotov, F\. Massa, P\. Meng, P\. Micikevicius, C\. Osborne, G\. Pekhimenko, A\. T\. R\. Rajan, D\. Sequeira, A\. Sirasao, F\. Sun, H\. Tang, M\. Thomson, F\. Wei, E\. Wu, L\. Xu, K\. Yamada, B\. Yu, G\. Yuan, A\. Zhong, P\. Zhang, and Y\. Zhou\(2020\)MLPerf inference benchmark\.InProceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture,External Links:[Document](https://dx.doi.org/10.1109/ISCA45697.2020.00045),[Link](https://arxiv.org/abs/1911.02549)Cited by:[§1](https://arxiv.org/html/2606.14516#S1.p3.1)\.
- \[77\]D\. Rein, B\. L\. Hou, A\. C\. Stickland, J\. Petty, R\. Y\. Pang, J\. Dirani, J\. Michael, and S\. R\. Bowman\(2024\)GPQA: a graduate\-level google\-proof q&a benchmark\.InFirst Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=Ti67584b98)Cited by:[§F\.4](https://arxiv.org/html/2606.14516#A6.SS4.p3.2),[§7\.4](https://arxiv.org/html/2606.14516#S7.SS4.p1.1)\.
- \[78\]A\. Reuel, A\. Hardy, C\. Smith, M\. Lamparth, M\. Hardy, and M\. J\. Kochenderfer\(2024\)BetterBench: assessing ai benchmarks, uncovering issues, and establishing best practices\.External Links:2411\.12990,[Link](https://arxiv.org/abs/2411.12990)Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px4.p1.1)\.
- \[79\]P\. Rodriguez, J\. Barrow, A\. M\. Hoyle, J\. P\. Lalor, R\. Jia, and J\. L\. Boyd\-Graber\(2021\)Evaluation examples are not equally informative: how should that change nlp leaderboards?\.InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\),pp\. 4486–4503\.Cited by:[§F\.4](https://arxiv.org/html/2606.14516#A6.SS4.p1.5)\.
- \[80\]Y\. Ruan, C\. J\. Maddison, and T\. B\. Hashimoto\(2024\)Observational scaling laws and the predictability of language model performance\.Advances in Neural Information Processing Systems37,pp\. 15841–15892\.Cited by:[§1](https://arxiv.org/html/2606.14516#S1.p2.1),[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px3.p1.1)\.
- \[81\]Schema\.org\(2026\)Dataset \- Schema\.org Type\.Note:Schema\.org vocabulary documentationAccessed 2026\-05\-01External Links:[Link](https://schema.org/Dataset)Cited by:[§1](https://arxiv.org/html/2606.14516#S1.p3.1)\.
- \[82\]N\. Shabtay, F\. M\. Polo, S\. Doveh, W\. Lin, M\. J\. Mirza, L\. Choshen, M\. Yurochkin, Y\. Sun, A\. Arbelle, L\. Karlinsky,et al\.\(2024\)LiveXiv\-a multi\-modal live benchmark based on arxiv papers content\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§F\.4](https://arxiv.org/html/2606.14516#A6.SS4.p1.5)\.
- \[83\]Z\. S\. Siegel, S\. Kapoor, N\. Nadgir, B\. Stroebl, and A\. Narayanan\(2024\)CORE\-bench: fostering the credibility of published research through a computational reproducibility agent benchmark\.Transactions on Machine Learning Research\.Cited by:[§7\.1](https://arxiv.org/html/2606.14516#S7.SS1.p2.1)\.
- \[84\]A\. Sokol, E\. Daly, M\. Hind, D\. Piorkowski, X\. Zhang, N\. Moniz, and N\. Chawla\(2025\)BenchmarkCards: standardized documentation for large language model benchmarks\.External Links:2410\.12974,[Link](https://arxiv.org/abs/2410.12974)Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px4.p1.1)\.
- \[85\]A\. Srivastava, A\. Rastogi, A\. Rao, A\. A\. M\. Shoeb, A\. Abid, A\. Fisch, A\. R\. Brown, A\. Santoro, A\. Gupta, A\. Garriga\-Alonso, A\. Kluska, A\. Lewkowycz, A\. Agarwal, A\. Power, A\. Ray, A\. Warstadt, A\. W\. Kocurek, A\. Safaya, A\. Tazarv, A\. Xiang, A\. Parrish, A\. Nie, A\. Hussain, A\. Askell, A\. Dsouza, A\. Slone, A\. Rahane, A\. S\. Iyer, A\. Andreassen, A\. Madotto, A\. Santilli, A\. Stuhlmüller, A\. Dai, A\. La, A\. Lampinen, A\. Zou, A\. Jiang, A\. Chen, A\. Vuong, A\. Gupta, A\. Gottardi, A\. Norelli, A\. Venkatesh, A\. Gholamidavoodi, A\. Tabassum, A\. Menezes, A\. Kirubarajan, A\. Mullokandov, A\. Sabharwal, A\. Herrick, A\. Efrat, A\. Erdem, A\. Karakaş, B\. R\. Roberts, B\. S\. Loe, B\. Zoph, B\. Bojanowski, B\. Özyurt, B\. Hedayatnia, B\. Neyshabur, B\. Inden, B\. Stein, B\. Ekmekci, B\. Y\. Lin, B\. Howald, B\. Orinion, C\. Diao, C\. Dour, C\. Stinson, C\. Argueta, C\. F\. Ramírez, C\. Singh, C\. Rathkopf, C\. Meng, C\. Baral, C\. Wu, C\. Callison\-Burch, C\. Waites, C\. Voigt, C\. D\. Manning, C\. Potts, C\. Ramirez, C\. E\. Rivera, C\. Siro, C\. Raffel, C\. Ashcraft, C\. Garbacea, D\. Sileo, D\. Garrette, D\. Hendrycks, D\. Kilman, D\. Roth, D\. Freeman, D\. Khashabi, D\. Levy, D\. M\. González, D\. Perszyk, D\. Hernandez, D\. Chen, D\. Ippolito, D\. Gilboa, D\. Dohan, D\. Drakard, D\. Jurgens, D\. Datta, D\. Ganguli, D\. Emelin, D\. Kleyko, D\. Yuret, D\. Chen, D\. Tam, D\. Hupkes, D\. Misra, D\. Buzan, D\. C\. Mollo, D\. Yang, D\. Lee, D\. Schrader, E\. Shutova, E\. D\. Cubuk, E\. Segal, E\. Hagerman, E\. Barnes, E\. Donoway, E\. Pavlick, E\. Rodola, E\. Lam, E\. Chu, E\. Tang, E\. Erdem, E\. Chang, E\. A\. Chi, E\. Dyer, E\. Jerzak, E\. Kim, E\. E\. Manyasi, E\. Zheltonozhskii, F\. Xia, F\. Siar, F\. Martínez\-Plumed, F\. Happé, F\. Chollet, F\. Rong, G\. Mishra, G\. I\. Winata, G\. de Melo, G\. Kruszewski, G\. Parascandolo, G\. Mariani, G\. Wang, G\. Jaimovitch\-López, G\. Betz, G\. Gur\-Ari, H\. Galijasevic, H\. Kim, H\. Rashkin, H\. Hajishirzi, H\. Mehta, H\. Bogar, H\. Shevlin, H\. Schütze, H\. Yakura, H\. Zhang, H\. M\. Wong, I\. Ng, I\. Noble, J\. Jumelet, J\. Geissinger, J\. Kernion, J\. Hilton, J\. Lee, J\. F\. Fisac, J\. B\. Simon, J\. Koppel, J\. Zheng, J\. Zou, J\. Kocoń, J\. Thompson, J\. Wingfield, J\. Kaplan, J\. Radom, J\. Sohl\-Dickstein, J\. Phang, J\. Wei, J\. Yosinski, J\. Novikova, J\. Bosscher, J\. Marsh, J\. Kim, J\. Taal, J\. Engel, J\. Alabi, J\. Xu, J\. Song, J\. Tang, J\. Waweru, J\. Burden, J\. Miller, J\. U\. Balis, J\. Batchelder, J\. Berant, J\. Frohberg, J\. Rozen, J\. Hernandez\-Orallo, J\. Boudeman, J\. Guerr, J\. Jones, J\. B\. Tenenbaum, J\. S\. Rule, J\. Chua, K\. Kanclerz, K\. Livescu, K\. Krauth, K\. Gopalakrishnan, K\. Ignatyeva, K\. Markert, K\. D\. Dhole, K\. Gimpel, K\. Omondi, K\. Mathewson, K\. Chiafullo, K\. Shkaruta, K\. Shridhar, K\. McDonell, K\. Richardson, L\. Reynolds, L\. Gao, L\. Zhang, L\. Dugan, L\. Qin, L\. Contreras\-Ochando, L\. Morency, L\. Moschella, L\. Lam, L\. Noble, L\. Schmidt, L\. He, L\. O\. Colón, L\. Metz, L\. K\. Şenel, M\. Bosma, M\. Sap, M\. ter Hoeve, M\. Farooqi, M\. Faruqui, M\. Mazeika, M\. Baturan, M\. Marelli, M\. Maru, M\. J\. R\. Quintana, M\. Tolkiehn, M\. Giulianelli, M\. Lewis, M\. Potthast, M\. L\. Leavitt, M\. Hagen, M\. Schubert, M\. O\. Baitemirova, M\. Arnaud, M\. McElrath, M\. A\. Yee, M\. Cohen, M\. Gu, M\. Ivanitskiy, M\. Starritt, M\. Strube, M\. Swędrowski, M\. Bevilacqua, M\. Yasunaga, M\. Kale, M\. Cain, M\. Xu, M\. Suzgun, M\. Walker, M\. Tiwari, M\. Bansal, M\. Aminnaseri, M\. Geva, M\. Gheini, M\. V\. T, N\. Peng, N\. A\. Chi, N\. Lee, N\. G\. Krakover, N\. Cameron, N\. Roberts, N\. Doiron, N\. Martinez, N\. Nangia, N\. Deckers, N\. Muennighoff, N\. S\. Keskar, N\. S\. Iyer, N\. Constant, N\. Fiedel, N\. Wen, O\. Zhang, O\. Agha, O\. Elbaghdadi, O\. Levy, O\. Evans, P\. A\. M\. Casares, P\. Doshi, P\. Fung, P\. P\. Liang, P\. Vicol, P\. Alipoormolabashi, P\. Liao, P\. Liang, P\. Chang, P\. Eckersley, P\. M\. Htut, P\. Hwang, P\. Miłkowski, P\. Patil, P\. Pezeshkpour, P\. Oli, Q\. Mei, Q\. Lyu, Q\. Chen, R\. Banjade, R\. E\. Rudolph, R\. Gabriel, R\. Habacker, R\. Risco, R\. Millière, R\. Garg, R\. Barnes, R\. A\. Saurous, R\. Arakawa, R\. Raymaekers, R\. Frank, R\. Sikand, R\. Novak, R\. Sitelew, R\. LeBras, R\. Liu, R\. Jacobs, R\. Zhang, R\. Salakhutdinov, R\. Chi, R\. Lee, R\. Stovall, R\. Teehan, R\. Yang, S\. Singh, S\. M\. Mohammad, S\. Anand, S\. Dillavou, S\. Shleifer, S\. Wiseman, S\. Gruetter, S\. R\. Bowman, S\. S\. Schoenholz, S\. Han, S\. Kwatra, S\. A\. Rous, S\. Ghazarian, S\. Ghosh, S\. Casey, S\. Bischoff, S\. Gehrmann, S\. Schuster, S\. Sadeghi, S\. Hamdan, S\. Zhou, S\. Srivastava, S\. Shi, S\. Singh, S\. Asaadi, S\. S\. Gu, S\. Pachchigar, S\. Toshniwal, S\. Upadhyay, Shyamolima, Debnath, S\. Shakeri, S\. Thormeyer, S\. Melzi, S\. Reddy, S\. P\. Makini, S\. Lee, S\. Torene, S\. Hatwar, S\. Dehaene, S\. Divic, S\. Ermon, S\. Biderman, S\. Lin, S\. Prasad, S\. T\. Piantadosi, S\. M\. Shieber, S\. Misherghi, S\. Kiritchenko, S\. Mishra, T\. Linzen, T\. Schuster, T\. Li, T\. Yu, T\. Ali, T\. Hashimoto, T\. Wu, T\. Desbordes, T\. Rothschild, T\. Phan, T\. Wang, T\. Nkinyili, T\. Schick, T\. Kornev, T\. Tunduny, T\. Gerstenberg, T\. Chang, T\. Neeraj, T\. Khot, T\. Shultz, U\. Shaham, V\. Misra, V\. Demberg, V\. Nyamai, V\. Raunak, V\. Ramasesh, V\. U\. Prabhu, V\. Padmakumar, V\. Srikumar, W\. Fedus, W\. Saunders, W\. Zhang, W\. Vossen, X\. Ren, X\. Tong, X\. Zhao, X\. Wu, X\. Shen, Y\. Yaghoobzadeh, Y\. Lakretz, Y\. Song, Y\. Bahri, Y\. Choi, Y\. Yang, Y\. Hao, Y\. Chen, Y\. Belinkov, Y\. Hou, Y\. Hou, Y\. Bai, Z\. Seid, Z\. Zhao, Z\. Wang, Z\. J\. Wang, Z\. Wang, and Z\. Wu\(2023\)Beyond the imitation game: quantifying and extrapolating the capabilities of language models\.External Links:2206\.04615,[Link](https://arxiv.org/abs/2206.04615)Cited by:[§1](https://arxiv.org/html/2606.14516#S1.p3.1)\.
- \[86\]L\. Staufer, M\. Yang, A\. Reuel, and S\. Casper\(2025\)Audit cards: contextualizing ai evaluations\.arXiv preprint arXiv:2504\.13839\.Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px4.p1.1)\.
- \[87\]M\. Sun, Z\. Liu, A\. Bair, and J\. Z\. Kolter\(2023\)A simple and effective pruning approach for large language models\.InWorkshop on Efficient Systems for Foundation Models @ ICML2023,External Links:[Link](https://openreview.net/forum?id=tz9JV2PRSv)Cited by:[§7\.2](https://arxiv.org/html/2606.14516#S7.SS2.p1.1)\.
- \[88\]S\. Tan, S\. Zhuang, K\. Montgomery, W\. Y\. Tang, A\. Cuadron, C\. Wang, R\. A\. Popa, and I\. Stoica\(2024\)JudgeBench: a benchmark for evaluating llm\-based judges\.ArXivabs/2410\.12784\.External Links:[Link](https://api.semanticscholar.org/CorpusID:273374769)Cited by:[§F\.4](https://arxiv.org/html/2606.14516#A6.SS4.p3.2),[§7\.4](https://arxiv.org/html/2606.14516#S7.SS4.p1.1)\.
- \[89\]F\. A\. E\. Team\(2025\)Quality\-first with kimi k2\.5: the importance of post\-training and serving infrastructure\.Fireworks AI\.External Links:[Link](https://fireworks.ai/blog/quality-first-with-kimi-k2p5)Cited by:[§1](https://arxiv.org/html/2606.14516#S1.p2.1)\.
- \[90\]W3C Dataset Exchange Working Group\(2024\)Data Catalog Vocabulary \(DCAT\) – Version 3\.Note:W3C RecommendationAccessed 2026\-05\-01External Links:[Link](https://www.w3.org/TR/vocab-dcat-3/)Cited by:[§1](https://arxiv.org/html/2606.14516#S1.p3.1)\.
- \[91\]A\. Wang, Y\. Pruksachatkun, N\. Nangia, A\. Singh, J\. Michael, F\. Hill, O\. Levy, and S\. R\. Bowman\(2019\)SuperGLUE: a stickier benchmark for general\-purpose language understanding systems\.InAdvances in Neural Information Processing Systems,External Links:[Link](https://arxiv.org/abs/1905.00537)Cited by:[§1](https://arxiv.org/html/2606.14516#S1.p3.1)\.
- \[92\]A\. Wang, A\. Singh, J\. Michael, F\. Hill, O\. Levy, and S\. R\. Bowman\(2019\)GLUE: a multi\-task benchmark and analysis platform for natural language understanding\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=rJ4km2R5t7)Cited by:[§1](https://arxiv.org/html/2606.14516#S1.p3.1)\.
- \[93\]A\. Wang, A\. Hertzmann, and O\. Russakovsky\(2024\)Benchmark suites instead of leaderboards for evaluating AI fairness\.Patterns5\(11\),pp\. 101080\.External Links:[Document](https://dx.doi.org/10.1016/j.patter.2024.101080)Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px2.p1.1)\.
- \[94\]Y\. Wu, M\. N\. Rabe, W\. Li, J\. Ba, R\. B\. Grosse, and C\. Szegedy\(2021\)LIME: learning inductive bias for primitives of mathematical reasoning\.InProceedings of the 38th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.139,pp\. 11251–11262\.External Links:[Link](https://proceedings.mlr.press/v139/wu21c.html)Cited by:[Figure 4](https://arxiv.org/html/2606.14516#S7.F4.2.1.1.1.1.1.1.4.1.1.1.1.1.1)\.
- \[95\]A\. Yehudai, L\. Eden, A\. Li, G\. Uziel, Y\. Zhao, R\. Bar\-Haim, A\. Cohan, and M\. Shmueli\-Scheuer\(2025\)Survey on evaluation of llm\-based agents\.External Links:2503\.16416,[Link](https://arxiv.org/abs/2503.16416)Cited by:[§7\.1](https://arxiv.org/html/2606.14516#S7.SS1.p1.1)\.
- \[96\]A\. Yehudai, L\. Eden, and M\. Shmueli\-Scheuer\(2026\)Agentic clear: automating multi\-level evaluation of llm agents\.External Links:2605\.22608,[Link](https://arxiv.org/abs/2605.22608)Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px5.p1.1)\.
- \[97\]J\. Yuan, H\. Li, X\. Ding, W\. Xie, Y\. Li, W\. Zhao, K\. Wan, J\. Shi, X\. Hu, and Z\. Liu\(2025\)Understanding and mitigating numerical sources of nondeterminism in llm inference\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,Cited by:[§1](https://arxiv.org/html/2606.14516#S1.p2.1)\.
- \[98\]L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica\(2023\)Judging LLM\-as\-a\-judge with MT\-Bench and chatbot arena\.InAdvances in Neural Information Processing Systems,Vol\.36\.Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px2.p1.1)\.
- \[99\]L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica\(2024\)Chatbot arena: an open platform for evaluating LLMs by human preference\.InProceedings of the 41st International Conference on Machine Learning,Cited by:[§2](https://arxiv.org/html/2606.14516#S2.SS0.SSS0.Px2.p1.1)\.
- \[100\]W\. Zhong, S\. Wang, D\. Tang, Z\. Xu, D\. Guo, J\. Wang, J\. Yin, M\. Zhou, and N\. Duan\(2021\)AR\-LSAT: investigating analytical reasoning of text\.External Links:2104\.06598,[Link](https://arxiv.org/abs/2104.06598)Cited by:[Figure 4](https://arxiv.org/html/2606.14516#S7.F4.2.1.1.1.1.1.1.7.1.1.1.1.1.1)\.
Appendix
## Appendix ASummary Statistics Every Eval Ever Datastore
We present a high level summary and breakdown of key fields inEvery Eval Everin Tables[4](https://arxiv.org/html/2606.14516#A1.T4),[5](https://arxiv.org/html/2606.14516#A1.T5),[6](https://arxiv.org/html/2606.14516#A1.T6),[7](https://arxiv.org/html/2606.14516#A1.T7)\.
### A\.1Overview of inference platform distribution
Table 4:Inference platform distribution by evaluation runs and model diversity\. Over 98% of runs fall underUnreportedorUnknowncategories\. Among identified providers, Ollama and OpenAI lead in both volume and unique model representation\.Inference PlatformEval\. RunsModels\\rowcolorgray\!10 Unreported184,929\(80\.56%\)17,101\(75\.71%\)Unknown42,260\(18\.41%\)5,282\(23\.38%\)\\rowcolorgray\!10 Ollama849\(0\.37%\)55\(0\.24%\)OpenAI411\(0\.18%\)32\(0\.14%\)\\rowcolorgray\!10 Google312\(0\.14%\)28\(0\.12%\)Together191\(0\.08%\)13\(0\.06%\)\\rowcolorgray\!10 Anthropic152\(0\.07%\)16\(0\.07%\)Mistral100\(0\.04%\)10\(0\.04%\)\\rowcolorgray\!10 DeepSeek56\(0\.02%\)4\(0\.02%\)Cohere48\(0\.02%\)5\(0\.02%\)\\rowcolorgray\!10 xAI31\(0\.01%\)3\(0\.01%\)OpenRouter30\(0\.01%\)10\(0\.04%\)\\rowcolorgray\!10 AWS30\(0\.01%\)3\(0\.01%\)Gemini27\(0\.01%\)3\(0\.01%\)\\rowcolorgray\!10 Aliyun22\(0\.01%\)5\(0\.02%\)Perplexity21\(0\.01%\)2\(0\.01%\)\\rowcolorgray\!10 Local20\(0\.01%\)1\(0\.00%\)Ark15\(0\.01%\)5\(0\.02%\)\\rowcolorgray\!10 Moonshot13\(0\.01%\)2\(0\.01%\)MiniMax12\(0\.01%\)1\(0\.00%\)\\rowcolorgray\!10 StepFun11\(0\.00%\)1\(0\.00%\)Qwen10\(0\.00%\)2\(0\.01%\)\\rowcolorgray\!10 Tencent10\(0\.00%\)1\(0\.00%\)Zhipu9\(0\.00%\)1\(0\.00%\)\\rowcolorgray\!10 Kuaishou3\(0\.00%\)1\(0\.00%\)
### A\.2Overview of the top 25 models inEvery Eval Ever
Table 5:This shows the breakdown of the top 25 models inEvery Eval Everand the total number of evaluations across all runs\. The data highlights a strong concentration of evaluation runs for the GPT\-4 family, while also showing the emergence of frontier models like DeepSeek\-R1 and Gemini\-3 previews in current evaluation cycles\.
### A\.3Overview of evaluation runs by source organization
Table 6:The table shows individual evaluation runs by source organization\. The dataset is characterized by a significant volume of records from alphaXiv, exceeding 160,000 entries\. Notably, this schema includes a university consortium comprising Princeton University, New York University, University of Washington, University of California San Diego, and Canyon Crest Academy, which together contribute to the diverse academic representation\.\*
### A\.4Overview of evaluation activity across the top 25 benchmarks
Table 7:Distribution of evaluation activity across the top 25 most popular benchmarks\. The data shows a high density of testing within the Artificial Analysis LLM API framework, followed by foundational reasoning and knowledge benchmarks such as GPQA, IFEval, and MMLU\-PRO, reflecting their role as industry standards for model performance assessment\.
## Appendix BFull Schema Field Reference
This appendix describes the top\-level interface of the unified evaluation schema used in Every Eval Ever\. The schema is splitted into two linked records: an aggregate evaluation record for run\-level metadata and summary metrics, and a companion instance\-level record for per\-sample outcomes\. In the current release, the canonical interfaces areeval\.schema\.json\(version0\.2\.2\) andinstance\_level\_eval\.schema\.json\(versioninstance\_level\_eval\_0\.2\.2\)\. Both schemas define closed top\-level interfaces: unspecified top\-level fields are not permitted\.
### B\.1Aggregate Evaluation Records
The aggregate record represents a single evaluation run for one model and stores the provenance, model context, evaluation framework, and one or more reported metric results\. It is defined byeval\.schema\.jsonversion0\.2\.2\. Its top\-level fields are summarized in Table[8](https://arxiv.org/html/2606.14516#A3.T8)\.
Each element ofevaluation\_resultsis anevaluation\_resultobject\. Its top\-level structure is summarized in Table[9](https://arxiv.org/html/2606.14516#A3.T9)\.
Taken together, the fields in Tables[8](https://arxiv.org/html/2606.14516#A3.T8)and[9](https://arxiv.org/html/2606.14516#A3.T9)define the aggregate representation of an evaluation run\. The separation betweenevaluation\_timestampandretrieved\_timestampdistinguishes when the evaluation was executed from when the standardized record was created, while theevaluation\_resultsarray allows multiple benchmark outcomes or metrics to be attached to the same run\-level record\.
### B\.2Instance\-Level Evaluation Records
The instance\-level record represents a single benchmark sample associated with an aggregate evaluation run\. It is defined byinstance\_level\_eval\.schema\.jsonversioninstance\_level\_eval\_0\.2\.2and is typically stored as a companion JSONL file\. Its top\-level fields are summarized in Table[10](https://arxiv.org/html/2606.14516#A3.T10)\.
The instance\-level schema is intentionally aligned with the aggregate schema but preserves sample\-level detail needed for auditing and re\-analysis\. Theevaluation\_idfield links each row back to the aggregate JSON, whileevaluation\_result\_idprovides the preferred deterministic link to one specific aggregate metric result\. The conditional use ofoutputversusmessagesmakes the schema applicable to standard single\-turn tasks as well as conversational and tool\-using evaluations\.
## Appendix CConverter Implementation Details
This section describes how evaluation logs from HELM, lm\-eval\-harness, and Inspect AI are mapped into the unifiedEEEschema\. All three converters produce an aggregateEvaluationLog\. When instance\-level data is available, they also emit a JSONL file referenced bydetailed\_evaluation\_results\. In addition to these core framework converters, the repository includes community\-contributed converters for public leaderboards and benchmark\-specific result sources, summarized in a final subsection\.
### C\.1Inspect AI Converter
##### Input format\.
The Inspect converter accepts evaluation logs with extension\.evalor\.json\. Both formats represent the same logical object and are read through the Inspect log API111[https://inspect\.aisi\.org\.uk/reference/inspectAI\.log\.html](https://inspect.aisi.org.uk/reference/inspectAI.log.html)\. The top\-level object is structurally rich:evalstores task, dataset, model path, package versions, generation settings, and task arguments;planstores solver steps and plan configuration;resultsstores scorer\-level aggregate metrics;statsstores run timestamps and summary counters;samplesstores per\-sample traces; andreductionsstores scorer\-level reduced sample values used for score resolution\. This structure motivates field\-wise extraction rather than direct key renaming\.
Table 8:Top\-level fields of the aggregate evaluation record\.
##### Aggregate mapping toEEE
CoreEEEfields are assembled from several Inspect structures\. The converter derivesevaluation\_timestampfromstats\.started\_atwith fallback toeval\.created, deriveseval\_libraryfromeval\.packages, and derivesevaluation\_resultsfromresults\.scores\. It initializesmodel\_infofromeval\.modeland, when available, refines that identifier with sample\-level output model metadata\.
Inspect model paths vary across providers, so the converter applies provider\-specific normalization to produce a canonicalmodel\_info\.idand to inferinference\_platformand, when possible,inference\_engine\. For example,openai/azure/gpt\-4o\-minimay be refined toopenai/gpt\-4o\-mini\-2024\-07\-18, whileollama/qwen2\.5:0\.5bis normalized toollama/qwen2\.5\-0\.5b\. The converter derives per\-result source data mainly fromeval\.dataset\. In particular,eval\.dataset\.locationis treated as a Hugging Face repository only when it matches the canonicalnamespace/nameform; otherwise the converter preserves the raw Inspect dataset fields inadditional\_details\.
Each scorer metric is converted into anEvaluationResult, except standalonestderr, which is treated as uncertainty metadata rather than a separate metric\. Metric values populatescore\_details\.score, and uncertainty is populated from scorer\-reportedstderrtogether with optionalstdorstddevand sample counts\. When scorer parameters exposegrader\_modelandgrader\_template, the converter also populatesmetric\_config\.llm\_scoringto preserve judge context\. Finally,generation\_configcombines standard generation parameters with Inspect\-specific execution context, including prompt template, available tools, serialized plan, limits, sandbox configuration, retry settings, and the reasoning flag\.
##### Instance\-level mapping
When sample logs are present, each sample is converted into oneInstanceLevelEvaluationLog\. Interaction type is inferred from the message structure: tool\-role messages yieldagentic, multiple assistant turns without tools yieldmulti\_turn, and the remaining cases are treated assingle\_turn\. Input text is serialized from user messages, with references and choices preserved from sample targets and choices\. For single\-turn cases, output text and reasoning traces are stored inoutput; for multi\-turn and agentic cases, the normalized message sequence is stored inmessages, including tool calls\.
Score resolution prioritizesreductionsmatched by sample and scorer, then sample\-level reduced values, and finally direct sample scores\. If no score is available, the converter falls back to reference matching, and correctness follows the resolved score semantics\. The instance\-level record also preserves token usage, latency and generation time, sample hash, stop reason, epoch, answer\-attribution metadata, and full error traces when available\.
### C\.2HELM Converter
##### Input format\.
The HELM converter operates on a HELM run directory\. It requiresrun\_spec\.json,scenario\_state\.json,scenario\.json, andper\_instance\_stats\.json, while the optionalstats\.jsonfile supplies aggregate metric values when present\. Each file contributes a distinct part of the run state\.run\_spec\.jsonprovides adapter, metric, and scenario specification metadata;scenario\_state\.jsonprovides request\-level records, including prompts, references, outputs, and request timestamps;scenario\.jsonprovides scenario naming metadata used during dataset identification; andper\_instance\_stats\.jsonprovides per\-sample metric and token statistics for instance\-level conversion\. When available,stats\.jsonadds aggregate statistics such as mean, sum, count, and standard deviation\. Because HELM distributes these signals across separate artifacts, the converter must coordinate extraction across multiple files rather than perform direct key renaming\.
##### Aggregate mapping toEEE
CoreEEEfields are assembled from multiple HELM artifacts\. The converter derivesmodel\_infofromrun\_spec\.json, preferably through deployment registry entries referenced byadapter\_spec\.model\_deployment; if deployment lookup is unavailable, it falls back to adapter\-level model fields and best\-effort platform inference\. It derivesevaluation\_resultsprimarily fromstats\.json, with candidate metric names anchored inrun\_spec\.metric\_specs\.
The converter derives per\-result source data and timestamp fields fromscenario\_state\.jsonandscenario\.json\. Dataset name comes fromscenario\.namewhen available, with fallback parsing fromrun\_spec\.name; sample counts and sample identifiers come from request\-state instance ids; and scenario class names and arguments are preserved inadditional\_details\. For each matched aggregate statistic, the score is taken frommeanwith fallback tosum/count, while uncertainty recordsstddevand sample\-count metadata\. Ifstats\.jsonis absent, aggregate metric output may be empty\.
generation\_configis derived from request\-level and adapter\-level settings, includingtemperature,top\_p,top\_k,max\_tokens, stop sequences, penalties, completion count, and a reasoning flag inferred from HELM thinking traces\. The converter computesevaluation\_timestampfrom the earliest available request datetime, with fallback to retrieval time, and formsevaluation\_idfrom dataset, model, and timestamp after path\-safe normalization of model identifier separators\.
##### Instance\-level mapping
When request\-state records are available, the converter emits one consolidated JSONL file and references it throughdetailed\_evaluation\_results\. Each row combinesrequest\_stateswithper\_instance\_stats: prompt text, references, and choices come from request\-state records and output mapping metadata, while completions and optional reasoning traces come from model outputs\. Score resolution prefers per\-instanceexact\_matchstatistics and otherwise falls back to reference matching between generated completions and tagged correct references\. The converter also records token usage, generation latency, stable sample identifiers, sample hash, answer\-attribution metadata, and single\-turn interaction typing\.
### C\.3lm\-eval\-harness Converter
##### Input format\.
The lm\-eval converter consumes result files namedresults\_\*\.json\. Optional instance\-level conversion uses files namedsamples\_<task\>\_\*\.jsonland is enabled only when sample logging is available and conversion is executed with\-\-include\_samples\. The aggregate record combines several top\-level maps:configprovides global run and model metadata,configsprovides task\-level dataset and generation settings,resultsprovides task\-level metric values,higher\_is\_betterprovides metric directionality,n\-samplesprovides sample\-count metadata for uncertainty reporting,datemaps toevaluation\_timestamp, andlm\_eval\_versionmaps toeval\_library\.version\. This layout motivates task\-wise field extraction rather than direct key renaming\.
Table 9:Top\-level fields of a singleevaluation\_resultentry withinevaluation\_results\.
##### Aggregate mapping toEEE
Theresultsobject may contain both metric\-bearing tasks and structural placeholders, so the converter first excludes placeholders and tasks without numeric metrics and then emits oneEvaluationLogper retained task\. For each retained task, it derivesevaluation\_timestampfromdate, deriveseval\_library\.versionfromlm\_eval\_version, derivesmodel\_infofromconfig, and derives both per\-result source data andgeneration\_configfrom task\-specific entries inconfigs\.
model\_infois constructed fromconfig\.modeland parsedconfig\.model\_args\. Becausemodel\_argsis often a comma\-delimited string, the converter parses it heuristically and prioritizespretrainedwhen present; inference platform and inference engine are then inferred from model\-type mappings, with an optional command\-line override for engine name and version\. Per\-result source data is derived from task configuration fields such asdataset\_pathand split metadata\. Paths that match Hugging Face repository form are mapped tosource\_type=hf\_dataset, while other paths are mapped tosource\_type=other\.
Metric keys typically follow themetric,filterconvention, such asexact\_match,none, and uncertainty keys follow themetric\_stderr,filterconvention\. The converter decomposes these keys, creates oneEvaluationResultper numeric metric, and maps standard error toscore\_details\.uncertainty\.standard\_error\. Metric directionality is derived fromhigher\_is\_betterand inverted intolower\_is\_better; score bounds are inferred from a known\-metrics table when available and left unset otherwise\. Finally,generation\_configis derived fromgeneration\_kwargs, includingtemperature,top\_p,top\_k, andmax\_gen\_toks, while the remaining generation attributes andnum\_fewshotare preserved inadditional\_details\.
##### Instance\-level mapping
When sample JSONL files are available, each sample row is converted into oneInstanceLevelEvaluationLog\. Prompt and references are extracted fromargumentsandtarget, and for multiple\-choice tasks the answer options are reconstructed fromgen\_args\_\*\. For generation tasks, the converter uses the first response text; for multiple\-choice tasks, it selects the option with the highest log probability fromfiltered\_respsorresps\. Scores and correctness are derived from per\-sample metric fields, with fallback toscore=0\.0andis\_correct=falsewhen no numeric metric value is available\. The converter also records a sample hash, lm\-eval hashesdoc\_hash,prompt\_hash, andtarget\_hash, the applied filter name, and the serialized per\-sample metric payload\.
Table 10:Top\-level fields of the instance\-level evaluation record\.
### C\.4Community\-Contributed Converters
##### AlpacaEval\.
The AlpacaEval converter fetches the public AlpacaEval 1\.0 and 2\.0 leaderboard CSVs\. It preserves pairwise preference metrics against the published baselines, including win rate, length\-controlled win rate, discrete win rate, and average response length for each model\.
##### ARC\-AGI\.
Thearc\_agiadapter reads the ARC Prize evaluations leaderboard JSON fromarcprize\.org\. It records the published ARC score together with cost\-per\-task and total\-cost fields while normalizing the often informal model aliases used on the leaderboard\.
##### Artificial Analysis\.
Theartificial\_analysisadapter ingests the Artificial Analysis LLM API, which combines benchmark scores with pricing and latency measurements for frontier models\. It carries over composite indices such as the Artificial Analysis intelligence, coding, and math indices, benchmark scores such as MMLU\-Pro, GPQA, HLE, LiveCodeBench, SciCode, AIME, and tau2, and token\-pricing and latency summaries\.
##### BFCL\.
Thebfcladapter reads the BFCL leaderboard CSV published by Berkeley Gorilla\. It preserves the leaderboard’s overall rank, overall accuracy, latency and cost fields, and the benchmark’s finer\-grained tool\-calling slices, including non\-live, live, multi\-turn, and web\-search accuracies\.
##### CocoaBench\.
Thecocoabenchadapter reads CocoaBench’s published per\-system CSV of agent performance, time, and cost\. It preserves overall benchmark accuracy together with average runtime per task, average cost per task, and total evaluation cost for each released agent\-model system\.
##### Exgentic\.
Theexgenticadapter consumes Exgentic open\-agent leaderboard aggregates, either from localresults\.jsonfiles or the Hugging Face dataset\. These runs span agent benchmarks such as AppWorld, SWE\-bench, BrowseComp\+, and Tau2, and the adapter preserves benchmark score, session counts, and run\-cost summaries for each agent\-model submission\.
##### Global MMLU Lite\.
Theglobal\-mmlu\-liteadapter fetches the Global MMLU Lite leaderboard from the Kaggle Benchmarks API\. It preserves the reported Global MMLU Lite score for each model together with any confidence\-interval or standard\-deviation information exposed by the leaderboard payload\.
##### Open LLM Leaderboard v2\.
Thehfopenllm\_v2adapter ingests the Hugging Face Open LLM Leaderboard v2 API\. It preserves the benchmark panel used by that leaderboard, including IFEval, BBH, MATH Level 5, GPQA, MUSR, and MMLU\-Pro, together with basic model metadata such as architecture, precision, and parameter count when available\.
##### LLM Stats\.
Thellm\_statsadapter consumes the LLM Stats API’s combined model, benchmark, and score payloads\. It is designed for a broad benchmark catalog rather than a single leaderboard, so it preserves benchmark\-specific provenance URLs, relationship metadata, pricing and context\-window model details, and the score entries attached to each model\.
##### Multi\-SWE\-Bench\.
Themulti\_swe\_benchadapter clones the Multi\-SWE\-Bench experiments repository and reads verified submissions under each language\-specific leaderboard\. It preserves resolved\-instance rates and submission metadata for C, C\+\+, Go, Java, JavaScript, Rust, and TypeScript tracks\.
##### RewardBench\.
Therewardbenchadapter fetches RewardBench v1 leaderboard CSV data and RewardBench v2 JSON results from Hugging Face\. It preserves the v1 overall, chat, chat\-hard, safety, reasoning, and prior\-set scores, as well as the v2 factuality, precise instruction following, math, safety, focus, and tie\-handling metrics\.
##### SciArena\.
Thesciarenaadapter reads the SciArena leaderboard API maintained by Allen AI\. It preserves the published rank, arena rating, and cost\-per\-100\-calls metadata for each model, while keeping the source model aliases close to the leaderboard’s own naming\.
##### SWE\-bench Verified\.
Theswe\_bench\_verifiedadapter reads verified submission directories from the public SWE\-bench experiments repository\. It preserves the standard verified leaderboard signal, namely the fraction of the 500 benchmark instances resolved by each submission, along with submission metadata and agent tooling context\.
##### SWE\-PolyBench\.
Theswe\_polybenchadapter reads submission artifacts for SWE\-PolyBench and SWE\-PolyBench Verified from the public experiments repository\. It preserves resolved\-instance rates separately for each dataset variant and programming language, so one submission may yield distinct records for different language tracks\.
##### Terminal\-Bench 2\.0\.
Theterminal\_bench\_2adapter captures the published Terminal\-Bench 2\.0 leaderboard for agentic coding systems\. It preserves the leaderboard’s accuracy and standard\-error values for each agent\-model pair on the 87\-task benchmark, together with the agent and model organization metadata shown on the leaderboard\.
## Appendix DConservative Estimation of Costs
We explain here our assumptions on how we estimate the cost for running evaluations to reproduce all of our data\. While we note that this is a vast underapproximation of the actual cost of reproduction all this work, we still see it as a sign for the importance of collecting such data\.
### D\.1Dataset and Evaluation Scale
The dataset comprises approximately 230,000 model–benchmark evaluation pairs, where each evaluation represents running a model on a single benchmark\.
Each benchmark is assumed to contain 1,000 examples, with roughly 100 input tokens and 20 output tokens per example\.
Under these assumptions, each evaluation uses about 100,000 input tokens and 20,000 output tokens, for a total of 120,000 tokens before additional overhead\.
### D\.2LLM\-as\-Judge Overhead
Modern evaluation pipelines frequently incorporate an additional language model to automatically grade or compare outputs, commonly referred to as an “LLM\-as\-judge\.” Based on production observations, this introduces an additional 60% token overhead relative to the base evaluation\.
This overhead is modeled as a multiplicative factor applied uniformly to both input and output tokens, such that the adjusted token count is given by1\.61\.6times the base tokens\. Consequently, each evaluation involves approximately 160,000 input tokens and 32,000 output tokens after accounting for this overhead\.
### D\.3Total Token Volume
Aggregating across all 230,000 evaluations, the total token volume is obtained by multiplying the per\-evaluation total of 192,000 tokens by the number of evaluations\. This results in approximately4\.416×10104\.416\\times 10^\{10\}tokens, corresponding to roughly 44 billion tokens processed in total\.
### D\.4Cost Model
The total inference cost is computed as the sum of input and output token costs\. Specifically, the cost is given by the product of input tokens and their per\-million\-token price, plus the product of output tokens and their corresponding price\. The pricing parameters are denoted byCinC\_\{\\text\{in\}\}for input tokens andCoutC\_\{\\text\{out\}\}for output tokens\.
We consider three levels of approximation corresponding to different pricing regimes\.
### D\.5Low\-Cost Estimate \(No Judge\)
As a lower bound, we consider a highly cost\-efficient model with pricing of $0\.10 per million input tokens and $0\.40 per million output tokens\. This estimate excludes any judge overhead and therefore uses the base token counts\.
Under these assumptions, the input cost per evaluation is computed as100,000×0\.10106100\{,\}000\\times\\frac\{0\.10\}\{10^\{6\}\}, which equals $0\.01\. The output cost per evaluation is20,000×0\.4010620\{,\}000\\times\\frac\{0\.40\}\{10^\{6\}\}, which equals $0\.008\. The total cost per evaluation is therefore $0\.018\.
Across all 230,000 evaluations, the total cost is approximately230,000×0\.018230\{,\}000\\times 0\.018, which yields about $4,140\. This corresponds to a total low\-cost estimate of approximately $4\.1K\.
### D\.6Mid\-Cost Estimate \(Sonnet with Judge\)
For a more realistic estimate, we consider a mid\-tier model with pricing of $3 per million input tokens and $15 per million output tokens\. This estimate incorporates the 60% judge overhead\.
With adjusted token counts, the input cost per evaluation is160,000×3106160\{,\}000\\times\\frac\{3\}\{10^\{6\}\}, which equals $0\.48, and the output cost is32,000×1510632\{,\}000\\times\\frac\{15\}\{10^\{6\}\}, which also equals $0\.48\. The total cost per evaluation is therefore $0\.96\.
Across all evaluations, the total cost is approximately230,000×0\.96230\{,\}000\\times 0\.96, which yields about $220,800\. This corresponds to a total mid\-cost estimate of approximately $221K\.
### D\.7High\-Cost Estimate \(Opus with Judge\)
Finally, we consider a higher\-end model with pricing of $5 per million input tokens and $25 per million output tokens, again including the 60% judge overhead\.
Under these conditions, the input cost per evaluation is160,000×5106160\{,\}000\\times\\frac\{5\}\{10^\{6\}\}, which equals $0\.80, and the output cost is32,000×2510632\{,\}000\\times\\frac\{25\}\{10^\{6\}\}, which also equals $0\.80\. The total cost per evaluation is therefore $1\.60\.
Across all evaluations, the total cost is approximately230,000×1\.60230\{,\}000\\times 1\.60, resulting in about $368,000\. This corresponds to a total high\-cost estimate of approximately $368K\.
### D\.8Summary
Under the stated assumptions, the total cost of evaluating 230,000 model–benchmark pairs ranges from a lower bound of approximately $4K, assuming no judge and highly optimized pricing, to approximately $370K when using a high\-end model with judge overhead\. A mid\-tier estimate of roughly $220K is also given\. While the lower bound is likely unrealistic, the others might be closer to actual pricing as most models are not of the smaller kinds and usually top or middle models are evaluated, with or without a judge\.
## Appendix EGovernance Card
Every Eval Everis a community project\. This appendix documents the governance mechanisms currently in place\. We follow the spirit of the Croissant governance process\[[3](https://arxiv.org/html/2606.14516#bib.bib14)\]and adapt it to the specifics ofEvery Eval Ever\. The governance model remains dynamic, and we expect it to evolve as the project progresses\.
### E\.1Decision\-Making and Roles
The project recognizes three key roles\.*Core maintainers*are responsible for repository upkeep, schema releases, converter maintenance, reviewing contributions, and final decisions on contested proposals\.*Contributors*submit data, converters, schema proposals, tooling, or documentation through pull requests and issues or discussions through GitHub or, on occasion, Slack\.*Community reviewers*are volunteer experts who participate in schema discussions and review proposals in their area of expertise\. Roles are not exclusive: maintainers also contribute and become so through community acceptance and after several contributions\.
Routine decisions \(record additions that pass validation, bug fixes, documentation updates, and additive non\-breaking schema fields\) are made by maintainers on a rolling basis\. Substantive decisions \(breaking schema changes, new interaction types, deprecations, deduplication policy changes\) follow the proposal process below\.
### E\.2Schema Change Proposal Process
Substantive schema changes follow a lightweight three\-stage process modeled on the iterative methodology used to produce thevx\.x\.xschema \(Section[3](https://arxiv.org/html/2606.14516#S3)\)\.
1. 1\.Proposal\.A contributor opens an issue in the repository describing the proposed change, or is raised during discussion between maintainers\. The problem it solves and the implications are discussed, and alternatives are weighed\.
2. 2\.Community review\.The proposal is open for discussion until disagreements are resolved\. If necessary, maintainers solicit feedback from relevant community experts based on the area of the proposal\.
3. 3\.Resolution\.Maintainers summarize the discussion and propose a resolution: accept, accept with modification, defer, or decline\. Decisions are made by consensus among maintainers; when consensus cannot be reached, a documented majority decision is recorded, with dissenting positions preserved in the schema change\-log \(Section[3](https://arxiv.org/html/2606.14516#S3)\)\.
### E\.3Conflicting Submissions and Duplicate Records
Because the schema assigns a unique UUID to each evaluation run and defers deduplication to the analysis layer \(Section[3\.1](https://arxiv.org/html/2606.14516#S3.SS1)\), conflicting or near\-duplicate records are expected and, by themselves, are not a governance problem\. The validator flags likely duplicates \(same model, same benchmark, same metric, same evaluator\) at submission time but does not reject them\. When users encounter conflicting records that cannot be reconciled from metadata alone, they are encouraged to open an issue;maintainersmay then request additional metadata from thecontributors, annotate records with a disputed flag in*additional\_details*, or, in cases of clear error, mark records as superseded \(see Section[E\.4](https://arxiv.org/html/2606.14516#A5.SS4)below\)\. We do not arbitrate which of two methodologically valid evaluation runs is "correct\."
### E\.4Corrections, Retractions, and Supersession
While there have not yet been any disputes over data contributions, we present here a proposal for how to address them when they do arise\. We will continue to adapt this process in response to emerging real\-world needs\. Records are immutable once accepted: modifying a record in place would invalidate downstream analyses that reference it\. Three mechanisms handle errors and updates:
1. 1\.Correction\.For minor fixes \(e\.g\., typos in identifiers\), a new record is added that supersedes the original\. The original is retained and annotated with a superseded\_by field pointing to the corrected UUID\. Similarly, a preceded\_by field points to the original UIUD\.
2. 2\.Retraction\.For records that were submitted in error or are based on faulty source data, the record is annotated with a retracted flag and a brief reason\. The record itself is not deleted, so prior analyses remain reproducible\.
3. 3\.Schema migration\.When schema versions advance, records remain valid under the version they were submitted with\. Migration utilities are provided where possible, but historical records are not overwritten\.
### E\.5Code of Conduct and Acknowledgment
The project follows a standard contributor code of conduct\.Contributorsare acknowledged in three ways: through git commit history, through the contributor list maintained in the repository, and, for substantive contributions to a release, through co\-authorship on the associated release paper\. The first such instance is the present submission, organized as a shared task\[[11](https://arxiv.org/html/2606.14516#bib.bib24)\]; subsequent releases will follow the same pattern with criteria documented in the contributor guide\.
#### E\.5\.1Copy of the Contributor Guide
New data can be contributed to the Hugging Face Datastore using the following process:
Leaderboard/evaluation data is split\-up into files by individual model, and data for each model is stored usingeval\.schema\.json\. The repository is structured into folders asdata/benchmark\_name/developer\_name/model\_name/\.
TL;DR How to successfully submit
1. 1\.Data must conform toeval\.schema\.json\(current version:0\.2\.2\)
2. 2\.The validation pipeline will automatically verify the data submitted in the pull request, but can also be manually triggered by typing/eee validate changedin a comment on the HF PR\.
3. 3\.A core maintainer will review and merge your submission
PR Naming Convention
Use these prefixes in your pull request titles:
- •\[Submission\]\- New evaluation data
- •\[Issue \#N\]\- Fix for a specific GitHub issue
- •\[Feature\]\- New functionality not tied to an issue
- •\[Docs\]\- Documentation changes
UUID Naming Convention
Each JSON file is named with aUUID \(Universally Unique Identifier\)in the formatuuid\.json\. The UUID is automatically generated \(using standard UUID v4\) when creating a new evaluation result file\. This ensures that:
- •Multiple evaluationsof the same model can exist without conflicts \(each gets a unique UUID\)
- •Different timestampsare stored as separate files with different UUIDs \(not as separate folders\)
- •A model may have multiple result files, with each file representing different iterations or runs of the leaderboard/evaluation
- •UUID’s can be generated using Python’suuid\.uuid4\(\)function\.
Example: The modelopenai/gpt\-4o\-2024\-11\-20might have multiple files like:
- •e70acf51\-30ef\-4c20\-b7cc\-51704d114d70\.json\(evaluation run \#1\)
- •a1b2c3d4\-5678\-90ab\-cdef\-1234567890ab\.json\(evaluation run \#2\)
Note: Each file can contain multiple individual results related to one model\.
How to add new eval:
1. 1\.Add a new folder underdata/on the Hugging Face datastore with a codename for your eval\.
2. 2\.For each model, use the Hugging Face \(developer\_name/model\_name\) naming convention to create a 2\-tier folder structure\.
3. 3\.Add a JSON file with results for each model and name ituuid\.json\.
4. OptionalInclude autils/folder in your benchmark name folder with any scripts used to generate the data \(e\.g\.,utils/global\-mmlu\-lite/adapter\.py\)\.
5. SubmitTwo ways to submit your evaluation data: - •Option A: Drag & drop via Hugging Face— Go to the datastore→\\rightarrowclick “Files and versions”→\\rightarrow“Contribute”→\\rightarrow“Upload files”→\\rightarrowdrag and drop your data→\\rightarrowselect “Open as a pull request to the main branch”\. - •Option B: Clone & PR— Clone the repo, add your data underdata, and open a pull request
Schema Instructions
1. 1\.model\_info: Use Hugging Face formatting \(developer\_name/model\_name\)\. If a model does not come from Hugging Face, use the exact API reference\. Check examples indata/livecodebenchpro\. Notably, some do have adate included in the model name, but othersdo not\. For example: - •OpenAI:gpt\-4o\-2024\-11\-20,gpt\-5\-2025\-08\-07,o3\-2025\-04\-16 - •Anthropic:claude\-3\-7\-sonnet\-20250219,claude\-3\-sonnet\-20240229 - •Google:gemini\-2\.5\-pro,gemini\-2\.5\-flash - •xAI \(Grok\):grok\-2\-2024\-08\-13,grok\-3\-2025\-01\-15
2. 2\.evaluation\_id: Usebenchmark\_name/model\_id/retrieved\_timestampformat \(e\.g\.livecodebenchpro/qwen3\-235b\-a22b\-thinking\-2507/1760492095\.8105888\)\.
3. 3\.inference\_platformvsinference\_engine: Where possible specify where the evaluation was run using one of these two fields\. - •inference\_platform: Use this field when the evaluation was run through a remote API \(e\.g\.,openai,huggingface,openrouter,anthropic,xai\)\. - •inference\_engine: Use this field when the evaluation was run locally\. This is now an object withnameandversion\(e\.g\."name": "vllm", "version": "0\.6\.0"\)\.
4. 4\.Thesource\_typeonsource\_metadatahas two options:documentationandevaluation\_run\. Usedocumentationwhen results are scraped from a leaderboard or paper\. Useevaluation\_runwhen the evaluation was run locally \(e\.g\. via an eval converter\)\.
5. 5\.source\_datais specified per evaluation result \(insideevaluation\_results\), with three variants: - •source\_type: "url"\- link to a web source \(e\.g\. leaderboard API\) - •source\_type: "hf\_dataset"— reference to a Hugging Face dataset \(e\.g\."hf\_repo": "google/IFEval"\) - •source\_type: "other"— for private or proprietary datasets
6. 6\.The schema is designed to accommodate both numeric and level\-based \(e\.g\. Low, Medium, High\) metrics\. For level\-based metrics, the actual ’value’ should be converted to an integer \(e\.g\. Low = 1, Medium = 2, High = 3\), and thelevel\_namesproperty should be used to specify the mapping of levels to integers\.
7. 7\.Timestamps: The schema has three timestamp fields — use them as follows: - •retrieved\_timestamp\(required\) — when this record was created, in Unix epoch format \(e\.g\.1760492095\.8105888\) - •evaluation\_timestamp\(top\-level, optional\) — when the evaluation was run - •evaluation\_results\[\]\.evaluation\_timestamp\(per\-result, optional\) — when a specific evaluation result was produced, if different results were run at different times
8. 8\.Additional details can be provided in several places in the schema\. They are not required, but can be useful for detailed analysis\. - •model\_info\.additional\_details: Use this field to provide any additional information about the model itself \(e\.g\. number of parameters\) - •evaluation\_results\.generation\_config\.generation\_args: Specify additional arguments used to generate outputs from the model - •evaluation\_results\.generation\_config\.additional\_details: Use this field to provide any additional information about the evaluation process that is not captured elsewhere
Instance\-Level Data
For evaluations that include per\-sample results, the individual results should be stored in a companionuuid\_samples\.jsonlfile in the same folder \(one JSONL per JSON, sharing the same UUID\)\. The aggregate JSON file refers to its JSONL via thedetailed\_evaluation\_resultsfield\. The instance\-level schema \(instance\_level\_eval\.schema\.json\) supports three interaction types:
- •single\_turn: Standard QA, MCQ, classification — usesoutputobject
- •multi\_turn: Conversational evaluations with multiple exchanges — usesmessagesarray
- •agentic: Tool\-using evaluations with function calls and sandbox execution — usesmessagesarray withtool\_calls
Each instance captures:input\(raw question \+ reference answer\),answer\_attribution\(how the answer was extracted\),evaluation\(score, is\_correct\), and optionaltoken\_usageandperformancemetrics\. Instance\-level JSONL files are produced automatically by the eval converters\.
### E\.6Worked Examples
#### E\.6\.1Example 1: Conflicting MMLU Records
To make the governance mechanisms concrete, consider the LLaMA 65B/MMLU example from Section[1](https://arxiv.org/html/2606.14516#S1), where the model scores 63\.7 under HELM and 48\.8 under lm\-eval\-harness\[[29](https://arxiv.org/html/2606.14516#bib.bib40)\]\. UnderEEE, both results are valid records: each receives its own UUID, each carries eval\_library metadata identifying the harness, and each preserves the generation configuration and prompt template available at submission time\. The validator does not flag these as duplicates because the eval\_library field differs\. A downstream user comparing the two records sees the discrepancy in the metadata directly and can decide whether to treat them as comparable, rather than discovering the difference through a blog post months later\. If a third contributor later submits a third MMLU record for LLaMA 65B without specifying the harness, the validator emits a warning, the record is accepted with the missing field recorded as absent \(Section[3\.2](https://arxiv.org/html/2606.14516#S3.SS2.SSS0.Px3)\), and any downstream analysis that requires harness\-level disambiguation can filter it out\. No human governance intervention is needed for this case; the schema and validator handle it\. Governance intervention is reserved for cases where metadata is contested rather than merely missing\.
#### E\.6\.2Example 2: Disputed Agentic Record
A contributor submits records scraped from a public agentic\-evaluation leaderboard\. The records pass validation, but the agent’s developers later contest them, claiming the leaderboard ran a deprecated harness version and that the current version yields lower scores; another contributor argues the original entry should remain as the authoritative public record at the time of reporting\. The schema cannot resolve this because both parties agree on the metadata\. Maintainers handle it by retaining the original record \(records are immutable; Section[E\.4](https://arxiv.org/html/2606.14516#A5.SS4)\), annotating it with a disputed flag under*additional\_details*, pointing to the issue thread, inviting the developers to submit a new record under the current harness, and documenting the resolution in the changelog\.EEEdoes not arbitrate which run is “correct”; it ensures both runs and their dispute are visible\.
Listing 1:Installing and running a converter\.pipinstall‘every\-eval\-ever\[all\]’
every\_eval\_everconverthelm\-\-log\_pathpath/to/helm/logs
every\_eval\_everconvertinspect\-\-log\_pathpath/to/run\.eval
every\_eval\_everconvertlm\_eval\-\-log\_pathpath/to/results\.json\\
\-\-include\_samples’
Listing 2:CLI validation examples\.uvrunpython\-mevery\_eval\_evervalidatepath/to/uuid\.json
uvrunpython\-mevery\_eval\_evervalidatepath/to/uuid\_samples\.jsonl
uvrunpython\-mevery\_eval\_evervalidatedata/mmlu/
## Appendix FCase Studies: Reproducibility and Implementation Details
### F\.1Case 1
We report the aggregate records used for the agentic cost–accuracy analysis in Section[7\.1](https://arxiv.org/html/2606.14516#S7.SS1)\. CocoaBench\[[37](https://arxiv.org/html/2606.14516#bib.bib59)\]is used to illustrate how runtime and cost can change the interpretation of accuracy across scaffold–backbone combinations\. CORE\-Bench Hard results from HAL\[[43](https://arxiv.org/html/2606.14516#bib.bib63)\]are used as a representative within\-benchmark slice showing how both scaffold and backbone choices affect the cost–accuracy tradeoff\.
The corresponding records are available in theEvery Eval Everdatastore under the CocoaBench and HAL benchmark directories, illustrated respectively in Tables[11](https://arxiv.org/html/2606.14516#A6.T11)and[12](https://arxiv.org/html/2606.14516#A6.T12)\.
Table 11:Aggregate CocoaBench results represented inEvery Eval Ever\.Table 12:Representative HAL results on CORE\-Bench Hard represented inEvery Eval Ever\.
### F\.2Case 2
This appendix gives the implementation details for Case Study 2 \(Section[7\.2](https://arxiv.org/html/2606.14516#S7.SS2)\)\. Records for the perplexity comparison in Table[3](https://arxiv.org/html/2606.14516#S7.T3)were obtained from two sources: lm\-eval\-harness logs ingested via the automated lm\_eval converter \(Section[4](https://arxiv.org/html/2606.14516#S4.SS0.SSS0.Px1), App\.[C\.3](https://arxiv.org/html/2606.14516#A3.SS3)\), and GPTQ\-style evaluation scripts contributed as manual records\. The lm\_eval converter preserves metric keys verbatim from harness output —word\_perplexityandbyte\_perplexityare stored as distinctevaluation\_namevalues in theMetricConfigblock \(Section[3\.2](https://arxiv.org/html/2606.14516#S3.SS2.SSS0.Px4)\), ensuring records with different normalization conventions are never silently aggregated \(Section[3\.1](https://arxiv.org/html/2606.14516#S3.SS1)\)\. GPTQ\-style records were contributed withmetric\_nameset to reflect token\-level normalization\. The corresponding records are available in theEvery Eval Everdatastore under the WikiText benchmark directory\.
### F\.3Case 3
### F\.4Case 4
In Case Study 4, we conduct an Item Response Theory \(IRT\) meta\-analysis of instance\-level evaluation data collected and stored inEvery Eval Ever\. IRT models estimate latent parameters for datasetitems\(i\.e\., instances/examples\) andsubjects\(i\.e\., AI models\), and have been used in prior evaluation practices\[[73](https://arxiv.org/html/2606.14516#bib.bib32),[72](https://arxiv.org/html/2606.14516#bib.bib28)\]leaderboards\[[45](https://arxiv.org/html/2606.14516#bib.bib30),[82](https://arxiv.org/html/2606.14516#bib.bib29)\], meta\-evaluations\[[79](https://arxiv.org/html/2606.14516#bib.bib22)\]and applications such as curriculum learning\[[79](https://arxiv.org/html/2606.14516#bib.bib22),[62](https://arxiv.org/html/2606.14516#bib.bib111),[74](https://arxiv.org/html/2606.14516#bib.bib6)\], often those uses were limited by the amount of available data\[[35](https://arxiv.org/html/2606.14516#bib.bib52),[36](https://arxiv.org/html/2606.14516#bib.bib31)\]\. The one\-parameter logistic \(1PL\) IRT model estimates item difficulty and subject ability, while more complex IRT models include more item\-level parameters such as discriminability and feasibility\[[79](https://arxiv.org/html/2606.14516#bib.bib22)\]\. IRT models the probability of subjectjjlabeling itemiicorrectly \(zij=1z\_\{ij\}=1\) as a function of subjectjj’s latent ability and itemii’s latent difficulty\. Parameters are learned via optimization from a dataset of graded \(i\.e\., correct or incorrect\) responses from subjects for a set of items:
p\(zij\\displaystyle p\(z\_\{ij\}=1\|θj,bi\)=11\+e−\(θj−bi\)\\displaystyle=1\|\\theta\_\{j\},b\_\{i\}\)=\\frac\{1\}\{1\+e^\{\-\(\\theta\_\{j\}\-b\_\{i\}\)\}\}\(1\)logℒ\\displaystyle\\log\\mathcal\{L\}=∑j=1J∑i=1Ilogp\(Zij=zij\|θj,bi\)\\displaystyle=\\sum\_\{j=1\}^\{J\}\\sum\_\{i=1\}^\{I\}\\log p\(Z\_\{ij\}=z\_\{ij\}\|\\theta\_\{j\},b\_\{i\}\)\(2\)
Instance\-level data collection is an expensive prerequisite and thus often a bottleneck for IRT research in NLP\. For the case study, we selected three datasets currently available inEvery Eval Everwith instance\-level evaluations\. GPQA Diamond\[[77](https://arxiv.org/html/2606.14516#bib.bib107)\]includes responses for 198 items from 69 subjects; Wordle Arena\[[68](https://arxiv.org/html/2606.14516#bib.bib108)\]includes responses for 63 items from 46 subjects; JudgeBench\[[88](https://arxiv.org/html/2606.14516#bib.bib109)\]includes responses for 350 items from 55 subjects\. We extract theis\_correctvalue for each instance\-level record in a dataset to construct the response matrixZZ\. For example,ZJudgeBenchZ^\{\\text\{JudgeBench\}\}has 55 rows \(subjects\) and 350 columns \(items\)\.
We fit a 1PL model for each dataset using thepy\-irtpackage version 0\.7\.1\[[50](https://arxiv.org/html/2606.14516#bib.bib77)\]\.py\-irtimplements IRT model fitting via variational inference and can scale to large evaluation datasets via GPU\-scaled training\. Specifically, the joint posterior distributionp\(Θ,B\|Z\)p\(\\Theta,B\|Z\)is approximated by a variational distributionq\(Θ,B\)q\(\\Theta,B\), and latent variables are learned by minimizing the KL\-Divergence betweenq\(Θ,B\)q\(\\Theta,B\)andp\(Θ,B\|Z\)p\(\\Theta,B\|Z\)\[[50](https://arxiv.org/html/2606.14516#bib.bib77)\]\.Similar Articles
Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting
This paper introduces EvalCards, an operational framework that standardizes AI evaluation reporting by composing benchmark metadata, evaluation run data, and model metadata into a unified record with interpretive signals for reproducibility, completeness, provenance, risk, and score comparability. The authors deploy a monitoring tool across thousands of models and benchmarks, revealing systematic gaps in current reporting practices.
An Empirical Study of Automating Agent Evaluation
This paper introduces EvalAgent, a system that automates the evaluation of AI agents by encoding domain-specific expertise, addressing the limitations of standard coding assistants in this task. It also presents AgentEvalBench, a benchmark for testing evaluation pipelines, and demonstrates significant improvements in evaluation reliability.
@OpenAI: Let’s talk about evals. We’re always looking for better ways to measure and forecast model progress, especially as benc…
OpenAI discusses the importance of evals (evaluations) for measuring and forecasting model progress, especially as benchmarks become saturated or gamed, featuring insights from Tejal Patwardhan and Andrew Mayne.
@pauliusztin_: Every day, 100+ people ask me, "How can I learn AI evals?" I copy-paste these 11 links (every time): 1. AI evals & obse…
A curated list of 11 links shared daily to help people learn AI evaluation techniques, covering evals, observability, LLM-as-judge, and agent evaluation.
@MaxForAI: You'd be hard-pressed to find a better eval resource library. If you're interested in eval, these are what you should read. Thanks to @xdotli for sharing.
Share a curated AI evaluation (evals) resource library, including high-quality blogs, podcasts, papers, and projects, compiled by Xiangyi Li.