Unsteady Metrics and Benchmarking Cultures of AI Model Builders


Summary

This paper introduces Benchmarking-Cultures-25, a dataset analyzing how AI model builders selectively highlight benchmarks in press releases. It finds a fragmented evaluation landscape with limited cross-model comparability, arguing that benchmarks are used as narrative devices for market positioning rather than standardized scientific measurement.

arXiv:2605.14164v1 Announce Type: new Abstract: The primary way to establish and compare competencies in foundation and generative AI models has shifted from peer-reviewed literature to press releases and company blog posts, where model builders highlight results on selected benchmarks. These artifacts now largely define the state of the art for researchers and the public. Despite their prominence, which benchmarks model builders choose to highlight, and what they communicate through this selection, is underexamined. To investigate, we introduce and open-source Benchmarking-Cultures-25, a dataset of 231 benchmarks highlighted across 139 model releases in 2025 from 11 major AI builders, alongside an interactive tool to explore the data. Our analysis reveals a fragmented evaluation landscape with limited cross-model comparability: 63.2% of highlighted benchmarks are used by a single builder, and 38.5% appear in just one release. Few achieve widespread use (e.g., GPQA Diamond, LiveCodeBench, AIME 2025). Moreover, benchmarks are attributed different competencies by different builders, depending on their narrative. To disentangle these conflicting presentations, we develop a unified taxonomy mapping diverging terminology to a shared framework of measured signals based on what benchmark authors claim to measure. "General knowledge application" is the second most popular, yet vaguely defined, category. Qualitative analysis shows many such benchmarks deemphasize construct validity, instead framing results as indicators of progress toward AGI. Their authors claim to measure knowledge or reasoning broadly, yet mostly evaluate STEM subjects (especially math). We argue that highlighted benchmarks function less as standardized measurement tools and more as flexible narrative devices prioritizing market positioning over scientific evaluation. Data: https://hf.co/datasets/matybohacek/benchmarking-cultures-25; tool: https://bench-cultures.net.

# Unsteady Metrics and Benchmarking Cultures of AI Model Builders
Source: [https://arxiv.org/html/2605.14164](https://arxiv.org/html/2605.14164)
\(2026\)

###### Abstract\.

The primary way to establish and compare model competencies in foundation and generative AI models has largely moved from peer\-reviewed literature to press releases and company blog posts, where model builders highlight results on a selection of benchmarks\. These public\-facing industry artifacts now largely define the state of the art, both for the research community and the broader public\. Despite their prominence, which benchmarks model builders choose to highlight and what they communicate through this selection is underexamined\. To investigate this, we introduce and open\-source *Benchmarking\-Cultures\-25*, a dataset containing 231 benchmarks highlighted across 139 model releases in 2025 from 11 major AI model builders\. Additionally, we publish an interactive tool to visually explore the relationships of the collected data\. Our analysis points to a fragmented evaluation landscape with limited cross\-model comparability: 63\.2% of highlighted benchmarks are used by a single model builder, and 38\.5% appear in just one model release\. Few benchmarks achieve true widespread use \(e\.g\., GPQA Diamond, LiveCodeBench, and AIME 2025\)\. Moreover, benchmarks are attributed different competencies by different model builders, depending on their narrative\. To disentangle these conflicting presentations, we develop a unified taxonomy that maps diverging terminology to a shared framework of measured signals based on what benchmark authors claim to measure\. “General knowledge application” is the second most popular, yet vaguely defined, category of benchmark in our dataset\. A qualitative analysis of these benchmarks revealed that many deemphasize construct validity; instead, they frame their results as indicators of progress toward Artificial General Intelligence \(AGI\)\. This framing is evident both in benchmarks that explicitly cite AGI literature and in those implicitly shaped by its surrounding narratives\.
In addition, authors of “General knowledge application” benchmarks claim to measure knowledge or reasoning capabilities in general, yet mostly evaluate them across STEM subjects \(especially math\)\. Based on these findings, we argue that highlighted benchmarks in model release artifacts currently function less as standardized measurement tools and more as flexible narrative devices that are used to construct a story of progress that prioritizes market positioning over practical scientific evaluation and comparison\. Data is available at [https://hf\.co/datasets/matybohacek/benchmarking\-cultures\-25](https://hf.co/datasets/matybohacek/benchmarking-cultures-25); the interactive tool is available at [https://bench\-cultures\.net](https://bench-cultures.net/)\.

Benchmarks, Model Evaluation, Release Artifacts, Generative AI

The 2026 ACM Conference on Fairness, Accountability, and Transparency \(FAccT ’26\), June 25–28, 2026, Montreal, QC, Canada\. DOI: 10\.1145/3805689\.3812240\. ISBN: 979\-8\-4007\-2596\-8/2026/06\. CCS concepts: General and reference → Evaluation; General and reference → Metrics; Social and professional topics\.

## 1\.Introduction

Recent work has increasingly questioned whether commonly used AI model benchmarks meaningfully reflect real\-world model performance and user experience \(Alzahrani et al\., [2024](https://arxiv.org/html/2605.14164#bib.bib31); Cheng et al\., [2025](https://arxiv.org/html/2605.14164#bib.bib32); Eriksson et al\., [2025](https://arxiv.org/html/2605.14164#bib.bib1); Ethayarajh and Jurafsky, [2020](https://arxiv.org/html/2605.14164#bib.bib28); Bowman and Dahl, [2021](https://arxiv.org/html/2605.14164#bib.bib29); Raji et al\., [2021](https://arxiv.org/html/2605.14164#bib.bib30)\)\. Despite these concerns, model builders continue to highlight benchmark results prominently outside academic venues \(in system cards, press releases, and company blogs\) for each model release \(OpenAI, [2024](https://arxiv.org/html/2605.14164#bib.bib33), [2023a](https://arxiv.org/html/2605.14164#bib.bib34); Anthropic, [2025](https://arxiv.org/html/2605.14164#bib.bib35); OpenAI, [2023b](https://arxiv.org/html/2605.14164#bib.bib36)\)\. The benchmarks highlighted in these public\-facing industry artifacts are unlikely to reflect the full internal evaluation suite used by the respective organizations \(Wan et al\., [2025](https://arxiv.org/html/2605.14164#bib.bib52); Bommasani et al\., [2024](https://arxiv.org/html/2605.14164#bib.bib53); Haimes et al\., [2024](https://arxiv.org/html/2605.14164#bib.bib54)\); rather, they constitute a curated subset presented to external audiences \(including prospective users and developers using the models through an API\), highlighting unique competencies and competitive positioning \(Joaquin et al\., [2025](https://arxiv.org/html/2605.14164#bib.bib55)\)\.

Although there is a substantial body of scholarship studying the quality and coverage of individual benchmarks \(Bean et al\., [2025](https://arxiv.org/html/2605.14164#bib.bib37)\), as well as their usage in the academic literature \(Koch et al\., [2021](https://arxiv.org/html/2605.14164#bib.bib38); Wang et al\., [2024a](https://arxiv.org/html/2605.14164#bib.bib39); Liao et al\., [2021](https://arxiv.org/html/2605.14164#bib.bib40)\), comparatively little attention has been paid to how benchmarks are selectively used by model builders to communicate model competencies in their public\-facing release artifacts\. Analyzing benchmarks in such contexts is an opportunity to evaluate whether they facilitate meaningful cross\-model comparison and to shed light on the narratives that model builders develop through the selection of benchmarks, as this selection encodes implicit priorities, organizational norms, and competitive pressures\.

In this paper, we construct and analyze *Benchmarking\-Cultures\-25*, a dataset of 231 benchmarks highlighted by 11 prominent model builders in 139 model releases throughout 2025\. We open\-source this dataset at [https://hf\.co/datasets/matybohacek/benchmarking\-cultures\-25](https://hf.co/datasets/matybohacek/benchmarking-cultures-25), with an interactive web interface at [https://bench\-cultures\.net](https://bench-cultures.net/)\. To construct this dataset, we devise a unified taxonomy based on what benchmark authors claim to measure\. The taxonomy bridges the diverging terminology used by AI model builders, which lets us quantitatively analyze trends and compare how various types of model providers highlight benchmarks\. Finally, we also conduct a qualitative analysis of the papers introducing the five most popular “General knowledge application” benchmarks\. We address the following research questions:

- \(RQ1\) What is the makeup of benchmark author affiliations \(e\.g\., industry, academia, government\) and how is it changing over time?
- \(RQ2\) Which tested competencies are the most prominent among the benchmarks, and how consistently are these competencies presented?
- \(RQ3\) What are the most popular benchmarks among AI model builders?
- \(RQ4\) How fast and extensively do benchmarks get adopted, and does this allow for cross\-model comparison?

## 2\.Related Work

In addition to serving as artifacts for measuring AI model performance and progress, benchmarks also function as a technology of governance\. They exert social pressure by defining hierarchies of performance, setting priorities, and ultimately compelling model builders to align with these standardized metrics \(in certain cases resulting in institutional isomorphism\) \(Wang et al\., [2024a](https://arxiv.org/html/2605.14164#bib.bib39); Raji et al\., [2021](https://arxiv.org/html/2605.14164#bib.bib30); DiMaggio and Powell, [1983](https://arxiv.org/html/2605.14164#bib.bib65)\)\. Due to their importance, a standalone field, often called “the science of benchmarking,” has emerged, studying their mechanics, quality, and impact \(Laskar et al\., [2024](https://arxiv.org/html/2605.14164#bib.bib57); Chang et al\., [2024](https://arxiv.org/html/2605.14164#bib.bib58); Liang et al\., [2022](https://arxiv.org/html/2605.14164#bib.bib59)\)\. Campolo \([2025](https://arxiv.org/html/2605.14164#bib.bib70)\) situates benchmarking within a broader temporal and cultural logic, arguing that the practice of declaring state\-of\-the\-art results functions not merely as a scientific claim but as a performative act that shapes research agendas and competitive dynamics\. Relatedly, Sculley et al\. \([2018](https://arxiv.org/html/2605.14164#bib.bib71)\) caution that the emphasis on leaderboard rankings and incremental benchmark gains risks a “winner’s curse,” where apparent progress on metrics obscures the absence of deeper scientific understanding\. In this section, we review existing scholarship in this and adjacent fields\.

### 2\.1\.Benchmark Saturation and Goodhart’s Law

AI model builders optimize performance on benchmark metrics: in the less severe case, this occurs through knowledge of what the testing methodologies look like; in the more severe case, through data contamination, i\.e\., by explicitly training on the benchmark contents \(test set\) \(Dominguez\-Olmedo et al\., [2024](https://arxiv.org/html/2605.14164#bib.bib25); Oren et al\., [2023](https://arxiv.org/html/2605.14164#bib.bib26); Ni et al\., [2025](https://arxiv.org/html/2605.14164#bib.bib27)\)\. According to Goodhart’s Law \(Goodhart, [1984](https://arxiv.org/html/2605.14164#bib.bib23); Strathern, [1997](https://arxiv.org/html/2605.14164#bib.bib24)\), such metrics cease to be informative\. As a result of this direct optimization, combined with factors such as the static nature of benchmarks¹ and slow publishing cycles², AI models often quickly saturate on new benchmarks, effectively erasing their discriminatory signal about model performance \(Zhou et al\., [2025a](https://arxiv.org/html/2605.14164#bib.bib8); Srivastava et al\., [2023](https://arxiv.org/html/2605.14164#bib.bib7)\)\. Proposed solutions include unifying evaluation standards \(Bommasani et al\., [2023](https://arxiv.org/html/2605.14164#bib.bib3)\), continuously evaluating the benchmarks themselves \(Carro et al\., [2025](https://arxiv.org/html/2605.14164#bib.bib9)\), or developing fully dynamic benchmarks \(Kiela et al\., [2021](https://arxiv.org/html/2605.14164#bib.bib4)\)\.

¹ Most popular benchmarks are static: they utilize a fixed, publicly\-known test set that never changes after its original publication\. Hybrid benchmarks, on the other hand, update their test sets over time \(Chen et al\., [2025](https://arxiv.org/html/2605.14164#bib.bib60)\), and hence mitigate AI models’ ability to learn directly on this data\. This comes at the cost of increased creation complexity and the need to re\-run evaluations to enable back\-comparability\.

² For prominent AI conferences \(e\.g\., NeurIPS, ICML, and ICLR\), the time from submission deadline to publication is usually 5\-6 months\. On top of this, open\-sourcing of data often involves a delay even when the repository is available at the time of publication \(Semmelrock et al\., [2025](https://arxiv.org/html/2605.14164#bib.bib61)\)\. The popularity of pre\-print servers such as arXiv decreases this delay \(Zhou et al\., [2025b](https://arxiv.org/html/2605.14164#bib.bib62)\)\. Still, there is a gap between the inception of a benchmark and its adoption, which opens the possibility for data contamination and other undesired practices\.

### 2\.2\.Data Contamination and Reliability

Data contamination refers to models having seen the benchmark contents during training, effectively allowing them to memorize the data \(Deng et al\., [2024](https://arxiv.org/html/2605.14164#bib.bib10); Xu et al\., [2024](https://arxiv.org/html/2605.14164#bib.bib5)\)\. To avoid this, strategies utilizing only data from sources published after the AI model’s weights were frozen have been proposed \(Li et al\., [2023](https://arxiv.org/html/2605.14164#bib.bib12)\)\. Overfitting to benchmarks has been demonstrated even in subtle contexts, such as minimal distribution shifts across datasets leading to major performance differences \(Zhang et al\., [2024](https://arxiv.org/html/2605.14164#bib.bib11)\)\.

### 2\.3\.Coverage and Discrepancy Between Aims and Measured Signal

Another known issue with benchmarks is the lack of consistency among different instantiations of benchmarks claiming to test a particular concept, as well as the divergence between the aims of the benchmarks and the actual signal measured\. One example of such a domain is reasoning, which suffers from varying definitions and scopes \(Fodor, [2025](https://arxiv.org/html/2605.14164#bib.bib15); Xie et al\., [2024](https://arxiv.org/html/2605.14164#bib.bib13)\), leading to surprisingly poor performance on seemingly trivial tasks \(Salido et al\., [2025](https://arxiv.org/html/2605.14164#bib.bib14)\)\. Some proposed solutions involve examining coverage through the lens of model activations and interpretability \(Bohacek et al\., [2025](https://arxiv.org/html/2605.14164#bib.bib21)\)\. Another critique related to the lack of construct validity is the tendency in various AI subfields to prioritize a small number of benchmarks that are treated as milestones towards generalizable AI systems \(Raji et al\., [2021](https://arxiv.org/html/2605.14164#bib.bib30)\)\.

### 2\.4\.Benchmarking Culture

Eriksson et al\. \([2025](https://arxiv.org/html/2605.14164#bib.bib1)\) examine what they term a “trust crisis” in AI evaluation, pointing to construct validity failures and the lack of standardization\. Others, including [Blili\-Hamelin et al\.](https://arxiv.org/html/2605.14164#bib.bib19) and Thais \([2024](https://arxiv.org/html/2605.14164#bib.bib20)\), examine the narratives and stated research agendas surrounding these benchmarks; some work has found that these patterns differ by region and community \(Ott et al\., [2022](https://arxiv.org/html/2605.14164#bib.bib22)\)\. Weidinger et al\. \([2025](https://arxiv.org/html/2605.14164#bib.bib2)\) have called for a formal “evaluation science” for generative AI\. Collectively, these unstandardized evaluation practices and their surrounding narratives constitute what Campolo \([2025](https://arxiv.org/html/2605.14164#bib.bib70)\) conceptualizes as a distinct “benchmarking culture\.”

### 2\.5\.AI Benchmarks as Narrative Devices

Research also shows how AI companies shape the public debate around AI\. Nielsen \([2024](https://arxiv.org/html/2605.14164#bib.bib66)\)’s analysis shows that the media coverage of AI “tends to be led by industry sources, and often takes claims about what the technology can and can’t do, and might be able to do in the future, at face value in ways that contributes to the hype cycle\.” Taking a more nuanced view, Magalhães and Smit \([2026](https://arxiv.org/html/2605.14164#bib.bib67)\)’s qualitative textual analysis of AI coverage in The New York Times \(US\), De Volkskrant \(Netherlands\), and Folha de S\.Paulo \(Brazil\) suggests that while journalistic reporting is not necessarily fueling hype, “AI’s impact is seen as inevitable but its exact trajectory remains disputed” \(Magalhães and Smit, [2026](https://arxiv.org/html/2605.14164#bib.bib67)\)\.

Others have explored why AI companies dominate public discourse\. Khanal et al\. \([2025](https://arxiv.org/html/2605.14164#bib.bib68)\) argue that tech monopolies have become “super policy entrepreneurs\.” They act as “problem brokers” by highlighting certain issues as problem areas, as “policy entrepreneurs” by providing technical solutions to policy problems, and as “political entrepreneurs” that use their resources to shape political institutions to further their interests \(Khanal et al\., [2025](https://arxiv.org/html/2605.14164#bib.bib68)\)\. Abdalla and Abdalla \([2021](https://arxiv.org/html/2605.14164#bib.bib69)\) explored how tech monopolies increasingly influence research through funding to shape the academic expertise governmental bodies rely on, in ways similar to the Big Tobacco industry\.

This body of research shows that AI companies shape what counts as state\-of\-the\-art\. Benchmarks they choose to highlight are likely to shape public perception despite questions about their scientific validity raised by the work we discussed\. In the following, we complement existing literature by analyzing what AI model builders present as state\-of\-the\-art through benchmarks\.

## 3\.Data

We collect and open\-source the *Benchmarking\-Cultures\-25* dataset, a structured corpus of 231 unique benchmarks highlighted by 11 prominent model builders across 139 distinct generative AI model releases³ throughout 2025\. The dataset is available at [https://hf\.co/datasets/matybohacek/benchmarking\-cultures\-25](https://hf.co/datasets/matybohacek/benchmarking-cultures-25)\. Alongside the dataset, we also release an interactive tool to introspect individual benchmarks and explore their relationships to model releases and one another at [https://bench\-cultures\.net](https://bench-cultures.net/) \(see Appendix [D](https://arxiv.org/html/2605.14164#A4) for screenshots\)\.

³ For the purposes of this work, we define “generative AI models” as foundation AI models \(Bommasani, [2021](https://arxiv.org/html/2605.14164#bib.bib63)\) capable of generating text, code, image, audio, or video in response to input conditioning \(most commonly, natural language prompts\)\. We treat all major model variations \(e\.g\., Pro, Flash, Instruct\) as distinct releases if separate performance claims were made\.

To ensure a representative sample of the industry’s state of the art, we selected the top 11 model builders based on their performance in the LMSYS Chatbot Arena \(Chiang et al\., [2024](https://arxiv.org/html/2605.14164#bib.bib41)\) and their inclusion in the “Notable Models” section of the Stanford AI Index 2025 \(Maslej et al\., [2025](https://arxiv.org/html/2605.14164#bib.bib42)\)\. This selection captures the dominant organizations in the field while maintaining a geographic balance between Western and Chinese organizations\. The selected model builders include industry labs \(Google, OpenAI, Anthropic, Meta, xAI, Alibaba, Baidu, and DeepSeek\) as well as independent and research\-oriented organizations \(Mistral, Allen Institute for AI, and Z\.ai\)\.

### 3\.1\.Data Collection

For each of the 139 model releases, we manually extracted every benchmark explicitly mentioned in the primary release announcement; 112 of these highlighted at least one benchmark\. Base models explicitly referenced in announcements were also included\. When an announcement covered multiple parameter sizes, we recorded each size as a separate entry\. For the purposes of analysis, however, we treated different parameter sizes of the same model as a single release, since model builders vary considerably in how many size variants they publish per model\.

Our data collection focused on public\-facing industry artifacts \(press releases and company blogs\) rather than technical documentation \(e\.g\., model cards and API docs\) or research papers \(e\.g\., arXiv\)\. To handle variability in how benchmarks are reported, we implemented the following standardization policy:

- Variant Normalization\. Metric variants \(e\.g\., “HumanEval Pass@1” vs\. “HumanEval”\) were mapped to a single canonical Benchmark ID unless the variation reflected fundamentally different test logic\.
- Snapshot Resolution\. Ambiguous references to dynamic benchmarks \(e\.g\., LiveCodeBench without a date\) were resolved using the model’s release date and contextual footnotes\.
- Benchmark Author Affiliations\. Using affiliations listed in arXiv papers, the authors of each benchmark were categorized as Academic, Industry, Non\-profit, Government, or Independent\.
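
The variant normalization policy above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the alias table, the suffix pattern, and the function name `normalize_benchmark` are all hypothetical, since the exact mapping rules are not given here.

```python
import re

# Hypothetical alias table (not from the paper): cleaned name -> canonical ID.
CANONICAL_IDS = {
    "humaneval": "humaneval",
    "gpqa diamond": "gpqa-diamond",
}

# Assumed reporting-variant suffixes; a real policy would be curated by hand
# and would exclude variants that reflect fundamentally different test logic.
METRIC_SUFFIX = re.compile(r"\s*(pass@\d+|\(\d+-shot\)|accuracy)$")


def normalize_benchmark(raw_name: str) -> str:
    """Map a reported benchmark name to a single canonical benchmark ID."""
    name = raw_name.strip().lower()
    previous = None
    # Strip trailing metric suffixes until the name stops changing.
    while previous != name:
        previous = name
        name = METRIC_SUFFIX.sub("", name).strip()
    # Fall back to a slug of the cleaned name for benchmarks not in the table.
    return CANONICAL_IDS.get(name, name.replace(" ", "-"))


print(normalize_benchmark("HumanEval Pass@1"))
print(normalize_benchmark("HumanEval"))
```

Both calls resolve to the same ID, which is the property the policy needs: highlights of the same benchmark reported under different metric labels are counted together.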

To allow for graph analysis of the data, we extended the benchmarks and model releases with a collection of papers, authors, affiliation links, and organizations\. In total, we constructed seven data frames with 44 data fields: Models \(17\), Benchmarks \(6\), Highlights \(4\), Affiliations \(6\), Categories \(3\), Categorizations \(2\), and Knowledge Subjects \(6\)\. The complete data structure specification is provided in Appendix [A](https://arxiv.org/html/2605.14164#A1)\.

### 3\.2\.Taxonomy of Tested Competencies

A core contribution of this study is a unified taxonomy of tested competencies\. We inductively extracted what the authors of each benchmark in our dataset claim to measure in their publications and release artifacts \(e\.g\., arXiv paper or Hugging Face repository\) and clustered these tested competencies into groups\. Through recursive refinement and consensus discussions among the authors, we defined eight meta\-categories of tested competencies\. A similar process led to the development of an additional 22 categories that break the meta\-categories down into more granular capabilities\. The complete taxonomy is presented in Appendix [B](https://arxiv.org/html/2605.14164#A2)\. Once finalized, this taxonomy was used to manually annotate each benchmark recorded in the dataset, unifying the tested competencies\. The annotations provide a standardized baseline for comparing how model builders, who otherwise refer to the same competencies inconsistently, describe benchmarks\. This enables two lines of analysis: first, examining the gaps between how AI model builders frame a benchmark in a release artifact and what the benchmark actually sets out to do; and second, interrogating the construct validity of the benchmarks themselves by comparing their stated aims with what they actually measure\.

### 3\.3\.Limitations

Single\-year Data Coverage \(2025\)\. We limited our data collection to benchmarks highlighted in model release announcements in 2025 by the selected 11 model builders\. This means that our data does not allow us to study broader trends over time, or direct comparisons between publication years\.

Exclusion of model cards\. We acknowledge that model cards are an important, industry\-wide practice to provide more transparency, especially regarding the safety and security of models\. However, our study specifically interrogates how model capabilities are advertised to the general public via primary release announcements\. We consider our approach complementary to existing scholarship on the model card landscape\.

No in\-depth analysis of entire benchmark categories\. Analyzing entire benchmark categories qualitatively was beyond the scope of this study\. However, we conducted a case study of “General knowledge application” benchmarks limited to the most popular benchmarks of this category \(see Section [5](https://arxiv.org/html/2605.14164#S5)\)\. The analysis provided rich results and illustrates the value of a more comprehensive qualitative analysis\.

Annotations for own taxonomy done by a single author only\. Multiple independent annotations with inter\-rater reliability scoring would have strengthened the classification\. To mitigate this limitation, all category assignments were reviewed and discussed among the co\-authors, and ambiguous cases were resolved through deliberation\.

## 4\.Data Analysis: Overall Benchmark Origin, Usage, and Presentation Trends

In this section, we present the overall statistics and trends in the *Benchmarking\-Cultures\-25* dataset by examining the use of benchmarks in 139 models released in 2025 from 11 AI model builders\. Out of these, 35 models are closed\-source, 94 are open\-weight, and 10 are fully open\-source\. Four model builders in our dataset are Chinese \(Alibaba, Baidu, DeepSeek, and Z\.ai\); the remaining seven are US\- or Europe\-based \(Allen Institute for AI, Anthropic, Google, Meta, Mistral, OpenAI, and xAI\)\.

### 4\.1\.Benchmark Origin \(RQ1\)

Increasingly, benchmarks highlighted in model release artifacts are published by industry rather than academia\. 43\.9% of the benchmark authors are affiliated with industry, 39\.0% with academia\. These numbers are more pronounced for Western model builders, where the share of benchmark authors affiliated with industry is 52\.3%\. Authors of benchmarks published in 2025 have an even higher industry affiliation rate\. This trend, too, is more pronounced for Western model builders, where the rate was 64\.5% \(see Table [2](https://arxiv.org/html/2605.14164#S4.T2)\)\.

Table 1\. Affiliation of Benchmark Authors\. Authors’ affiliations are categorized into organization categories\. A breakdown is provided for benchmarks published in 2025 and all benchmarks present in the dataset\.

Table 2\. Benchmark Authors with Multiple Affiliations\. Distribution of affiliation combinations among authors with multiple affiliations\.

| Affiliation Combination | Authors \(%\) |
| --- | --- |
| Academia & Non\-profit | 37\.1 |
| Academia & Industry | 21\.8 |
| Academia & Academia | 19\.8 |
| Industry & Industry | 9\.6 |
| Academia & Government | 5\.1 |
| Industry & Non\-profit | 4\.6 |
| Acad\. & Ind\. & Non\-profit | 1\.5 |

8\.1% \(198\) of benchmark authors have more than one affiliation\. Of these, 37\.1% have an affiliation with a non\-profit as well as with academia\. This derives from the large contribution by authors who are affiliated with the Allen Institute for AI, who usually have an additional academic affiliation\. Almost a third of those 197 authors \(32\.9%\) have a shared affiliation between industry and some other type of organization \(see Table [2](https://arxiv.org/html/2605.14164#S4.T2)\)\.

### 4\.2\.Presentation of Tested Competencies \(RQ2\)

Reported in Table [3](https://arxiv.org/html/2605.14164#S4.T3) are the competencies tested in the top 15 most popular benchmarks\. We see that 41\.7% of them evaluate “Math,” followed by “Reasoning and knowledge” \(i\.e\., reasoning in fields other than math or coding\) with 25\.0%\. Notably, all of the top 15 benchmarks that evaluate “Reasoning and knowledge” also include math as a subject, hence the overlap of benchmarks in Table [3](https://arxiv.org/html/2605.14164#S4.T3) \(in our own taxonomy, each benchmark could be assigned to multiple categories to reflect overlaps such as this one\)\.

Table 3\. Distribution of Evaluated Competencies in the Top 15 Most Popular Benchmarks\. All benchmarks in the “Reasoning and knowledge” category are also used to evaluate “Math” competency\. Hence, they are listed twice\. Listed competencies are based on our own taxonomy\.

Model builders inconsistently label the same benchmarks to represent different competencies across releases, even between model releases by the same organization\. Shown in Figure [1](https://arxiv.org/html/2605.14164#S4.F1) are the types of labels that model builders used to describe tested competencies by benchmarks across model releases, indicating that model builders are inconsistent in how they frame benchmarks\. LiveCodeBench, the third most popular benchmark in our dataset overall, is a good example to illustrate this\. The authors of LiveCodeBench describe it as “a holistic and contamination\-free benchmark for evaluating code capabilities” \(Jain et al\., [2024](https://arxiv.org/html/2605.14164#bib.bib56)\)\. We therefore categorized it as Specialized knowledge application \- Coding in our taxonomy\. However, only 53\.7% of model release artifacts presented LiveCodeBench as a coding\-related benchmark\. Some model builders refer to it as “Reasoning” \(DeepSeek, Mistral, and Z\.ai\) or agent\-related functions \(Z\.ai and DeepSeek\)\. What LiveCodeBench is claimed to evaluate is inconsistent even between model releases by the same model builder\. For example, xAI pivoted from “Coding” to “Cost\-efficient Intelligence,” and Alibaba presents it variously as evaluating instructions, “post\-training,” or simply “text\.” We found similar inconsistencies across all benchmarks\.
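
A share such as the 53.7% reported for LiveCodeBench can be computed by comparing the labels used in release artifacts against the taxonomy category. The sketch below is illustrative only: the function name `label_agreement`, the lowercase label matching, and the toy labels are assumptions and do not reproduce the dataset or its numbers.

```python
from collections import Counter


def label_agreement(builder_labels, matching_labels):
    """Return (share of labels that match the taxonomy category, label counts).

    builder_labels: labels used for one benchmark, one entry per release.
    matching_labels: lowercase labels treated as agreeing with the taxonomy.
    """
    counts = Counter(label.strip().lower() for label in builder_labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0, counts
    agreeing = sum(n for label, n in counts.items() if label in matching_labels)
    return agreeing / total, counts


# Toy labels for a single benchmark (invented for illustration).
toy_labels = ["Coding", "Reasoning", "coding", "Agentic", "Coding", "Text"]
share, counts = label_agreement(toy_labels, {"coding"})
print(f"{share:.1%} of releases use a matching label")
```

In practice, deciding which builder labels count as "coding-related" is itself a manual judgment, which is why the matching set is passed in explicitly rather than inferred.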

![Refer to caption](https://arxiv.org/html/2605.14164v1/fig/prescribed-competencies-coding.png)

Figure 1\. Prescribed Competencies by Model Builders Within the Top 5 “Coding” Benchmarks\. This graph shows the count of competency categories that model publishers prescribe to benchmarks across model releases\.

### 4\.3\.Benchmark Popularity \(RQ3\)

We ranked benchmarks by popularity using the geometric mean $\sqrt{N_{\text{builders}} \cdot N_{\text{highlights}}}$ of the number of model builders and the number of model releases highlighting each benchmark\. A simple highlight count would be skewed by the uneven release volumes across the 11 model builders, risking overrepresentation of benchmarks favored by high\-volume model builders\. We use the geometric mean to normalize benchmark prevalence, yielding rankings that better reflect broad, cross\-industry adoption rather than the idiosyncrasies or release frequency of individual model builders\. The results are shown in Table [4](https://arxiv.org/html/2605.14164#S4.T4)\.
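
The popularity score can be sketched in a few lines. The tuple layout, the function name `popularity_scores`, and the toy records below are assumptions made for illustration; they are not the schema of the released dataset.

```python
import math
from collections import defaultdict


def popularity_scores(highlights):
    """Compute sqrt(N_builders * N_highlights) per benchmark.

    highlights: iterable of (builder, release, benchmark) tuples, one per
    benchmark highlighted in a model release (hypothetical layout).
    """
    builders = defaultdict(set)
    releases = defaultdict(set)
    for builder, release, benchmark in highlights:
        builders[benchmark].add(builder)
        releases[benchmark].add((builder, release))
    return {
        bench: math.sqrt(len(builders[bench]) * len(releases[bench]))
        for bench in builders
    }


# Toy data: BenchX is highlighted four times by one builder,
# BenchY three times by three different builders.
toy = [
    ("BuilderA", "a-1", "BenchX"),
    ("BuilderA", "a-2", "BenchX"),
    ("BuilderA", "a-3", "BenchX"),
    ("BuilderA", "a-4", "BenchX"),
    ("BuilderA", "a-1", "BenchY"),
    ("BuilderB", "b-1", "BenchY"),
    ("BuilderC", "c-1", "BenchY"),
]
scores = popularity_scores(toy)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

In this toy case BenchY scores 3.0 and BenchX scores 2.0, so the benchmark with fewer total highlights but broader adoption ranks first, which is the effect the geometric mean is meant to have.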

Table 4\. Top 15 Most Popular Benchmarks\. Benchmarks are ranked by popularity score \(see Section [4\.3](https://arxiv.org/html/2605.14164#S4.SS3)\)\.

AIME 2025 is the most popular benchmark overall, closely followed by GPQA Diamond and LiveCodeBench\. Notably, GPQA Diamond is more popular with model builders from the West \(ranking 1st\) than from China \(ranking 8th\)\. LiveCodeBench is more popular among open\-weight and open\-source models \(ranking 1st and 12th, respectively\) than proprietary models, where it ranks as the 14th most popular\.

### 4\.4\. Adoption and Cross\-Model Comparability \(RQ4\)

To get a sense of how quickly model builders start highlighting benchmarks after the benchmarks are first released, we calculated the adoption rate as the number of model release announcements that have highlighted a benchmark since its release\.
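One way to picture this adoption measure is as a running count of release announcements that mention a benchmark, month by month. The sketch below assumes a hypothetical list of `(benchmark, month)` pairs and a helper named `cumulative_adoption`; neither the data layout nor the names come from the paper's released materials.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical mentions: (benchmark, month of the model release in 2025).
# Values are illustrative only.
MENTIONS = [
    ("Bench X", 2),
    ("Bench X", 2),
    ("Bench X", 5),
    ("Bench Y", 5),
    ("Bench X", 11),
]


def cumulative_adoption(mentions, benchmark, months=range(1, 13)):
    """Running total of release announcements highlighting `benchmark`."""
    per_month = Counter(month for name, month in mentions if name == benchmark)
    # Counter returns 0 for months without any mention.
    return list(accumulate(per_month[month] for month in months))


curve_x = cumulative_adoption(MENTIONS, "Bench X")
curve_y = cumulative_adoption(MENTIONS, "Bench Y")
print(curve_x)
print(curve_y)
```

Plotting such a list per benchmark gives a monotonically non-decreasing curve, so steeper segments indicate months in which many releases adopted the benchmark.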

71\.9% of benchmarks used in 2025 were published in the last three years\. The cumulative adoption rate of the benchmarks published in 2025 is shown in Figure [2](https://arxiv.org/html/2605.14164#S4.F2)\. The largest share \(31\.6%\) was published in 2024, followed by 28\.1% in 2025\. SWE\-bench Verified was by far the most adopted of all benchmarks published in 2025, followed by Humanity’s Last Exam \(HLE\)\. This makes SWE\-bench Verified the seventh and HLE the ninth most popular benchmark\. For closed models, SWE\-bench Verified and HLE are even more popular and take the fourth and the sixth rank, respectively\.

![Refer to caption](https://arxiv.org/html/2605.14164v1/fig/highlights-2025.png)

Figure 2\. Adoption of Benchmarks Released in 2025\. The top five most adopted benchmarks are highlighted for clarity\.

Table 5\. Publication Years of Benchmarks within Selected Tested Competencies\. Looking at the benchmarks released in 2023, 2024, and 2025, we map the number of benchmarks released per year within a tested competency\. See Table [10](https://arxiv.org/html/2605.14164#A3.T10) in Appendix [C](https://arxiv.org/html/2605.14164#A3) for full data\.

Adoption of new benchmarks shows a trend towards more ”Agentic task execution” benchmarks\. When we look at the release dates of benchmarks highlighted by model builders in 2025, we can identify a few trends \(see Table [5](https://arxiv.org/html/2605.14164#S4.T5)\)\. For some competencies, model builders tend to rely more on older benchmarks\. Models highlighted 6\.7% fewer benchmarks for ”Reasoning and knowledge” released in 2025 than benchmarks released in 2024\. A similar decrease can be observed for benchmarks testing for ”Coding” \(8\.0%\) or ”Math” \(11\.8%\)\. We also see competencies that are novel in 2025 and were quickly adopted by model builders, most importantly agentic competencies such as ”Strategic Planning” and ”Tool orchestration,” or very recently also preference alignment for specific domains like ”Health\.” This dynamic is also reflected in the model releases and the competencies they choose to highlight, as seen in Figure [3](https://arxiv.org/html/2605.14164#S4.F3)\. Again, ”Math”, ”Coding,” and ”Reasoning and knowledge” saw a decline in inclusion in release artifacts, while competencies around agentic capabilities saw a steady increase\.

![Refer to caption](https://arxiv.org/html/2605.14164v1/fig/highlighted-competencies-by-month.png)

Figure 3\. Highlight Frequency of Selected Competencies by Model Builders\. This graph shows the trend of these selected competencies being highlighted in model releases throughout 2025\. See Figure [5](https://arxiv.org/html/2605.14164#A3.F5) in Appendix [C](https://arxiv.org/html/2605.14164#A3) for the full graph\.

Benchmark selection is highly fragmented, limiting cross\-model comparability\. Table [6](https://arxiv.org/html/2605.14164#S4.T6) shows how frequently benchmarks are highlighted across models\. 63\.2% of the benchmarks \(146\) are used only by a single model builder\. There are differences between the West and China, where respectively 70\.3% and 64\.7% of all benchmarks were used by a single model builder\. 89 benchmarks \(38\.5%\) are used by a single model\. 51\.3% of closed models \(39 in total\) reuse benchmarks three or fewer times\. This number is even higher for open\-weight models, where more than 66\.5% of all models reuse the same benchmark three times or less\.
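The fragmentation shares discussed above can be illustrated with a short sketch that counts, for each benchmark, how many distinct builders and releases highlight it. The record layout, the helper `fragmentation`, and the toy records are assumptions made for illustration only. The final two lines simply check that the reported counts (146 and 89) are consistent with the reported percentages, given the 231 benchmarks stated in the abstract.

```python
from collections import defaultdict

# Hypothetical highlight records: (builder, release, benchmark).
RECORDS = [
    ("BuilderA", "a-1", "Bench W"),
    ("BuilderA", "a-1", "Bench X"),
    ("BuilderA", "a-2", "Bench X"),
    ("BuilderA", "a-1", "Bench Y"),
    ("BuilderB", "b-1", "Bench Y"),
    ("BuilderB", "b-1", "Bench Z"),
]


def fragmentation(records):
    """Share of benchmarks used by one builder, and by one release."""
    builders = defaultdict(set)  # benchmark -> distinct builders
    releases = defaultdict(set)  # benchmark -> distinct (builder, release) pairs
    for builder, release, benchmark in records:
        builders[benchmark].add(builder)
        releases[benchmark].add((builder, release))
    total = len(builders)
    single_builder = sum(1 for used_by in builders.values() if len(used_by) == 1)
    single_release = sum(1 for used_in in releases.values() if len(used_in) == 1)
    return single_builder / total, single_release / total


single_builder_share, single_release_share = fragmentation(RECORDS)

# Consistency check on the figures reported in the text (231 benchmarks in total).
reported_single_builder = round(146 / 231 * 100, 1)
reported_single_release = round(89 / 231 * 100, 1)
```

In the toy records, three of four benchmarks are tied to a single builder and two of four to a single release, so the two shares differ; the same distinction separates the 63.2% and 38.5% figures in the text.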

Table 6\. Distribution of Benchmark Adoption\. Percentage of model builders and models that include a given benchmark exactly $N$ times across their release artifacts\.

Looking at specific benchmarks, AIME 2025 was highlighted most frequently \(in 46\.8% of the analyzed model release artifacts\)\. From there, the frequencies of individual benchmarks decrease steeply: MMMLU, the tenth most highlighted benchmark, only appears in 24\.5% of the analyzed model release artifacts, and HMMT 2025, the 15th most highlighted benchmark, only in 16\.0%\.

## 5\. Case Study: General knowledge application

Evaluations of some types of comprehension and reasoning, such as math and coding, can draw on existing real\-world resources \(like annual math competitions\), including prescribed languages and testing procedures, and are hence largely standardized\. Benchmarks measuring ”General knowledge application”, however, are more ambiguous because they evaluate knowledge retrieval, comprehension, or reasoning across a broad spectrum of disciplines, ranging from STEM to the humanities, law, and more\. Despite the ambiguity, ”General knowledge application” represents the second most popular benchmark category in our dataset: 74\.5% of all model release announcements highlighted at least one of the top five ”General knowledge application” benchmarks\. Given this combination of popularity and difficulty of evaluation, we analyzed these top five benchmarks in depth to better understand what they, as the most frequently highlighted benchmarks in our dataset, measure and how consistent they are in their stated goals\. \(Footnote 4: We excluded MMMLU from our analysis despite its being in the top five, since it is a translation of MMLU’s test set, which is already included\.\) The analysis that follows focuses on these five benchmarks specifically, not the category as a whole\.

Table 7\. Stated Goals and Subject Coverage of the Top Five ”General Knowledge Application” Benchmarks\. Despite claiming to measure general knowledge or reasoning broadly, all five benchmarks focus heavily on STEM subjects\.

As illustrated in Table [7](https://arxiv.org/html/2605.14164#S5.T7), the stated goals of these examined benchmarks are broad, as they claim to measure general knowledge or reasoning, despite focusing only on select subjects and not covering various other subjects systematically or equally\. A deeper look into these benchmarks reveals several key implications for the benchmarking cultures of AI model builders\.

Breakdown of tested subjects\. We break down the subjects benchmark authors claim to cover in ”Reasoning and knowledge” benchmarks in Table [8](https://arxiv.org/html/2605.14164#S5.T8)\. ”Science” is by far the most popular field, with almost a third of all questions relating to it, followed by ”Humanities & Social Sciences” with almost half as many questions\. Trailing behind is ”Art & Design”\. A closer look at the science category reveals a strong imbalance among its subfields, with more than a third of all questions related to mathematics\. This comes in addition to the dedicated evaluations for math\.

All top five ”General knowledge application” benchmarks distinguish between knowledge and reasoning, but do not define what the distinction is\. MMLU from 2020 is the oldest benchmark in Table [7](https://arxiv.org/html/2605.14164#S5.T7) and the only one with an emphasis on knowledge\. The authors argued that previous benchmarks in Natural Language Processing \(NLP\) evaluated linguistic skills, whereas MMLU should evaluate information contained in a model’s pretraining data, which the authors refer to as ”knowledge”: ”To bridge the gap between the wide\-ranging knowledge that models see during pretraining and the existing measures of success, we introduce a new benchmark for assessing models across a diverse set of subjects that humans learn\.” \(Hendrycks et al\., [2020](https://arxiv.org/html/2605.14164#bib.bib43)\) Essentially, the benchmark is meant to evaluate not only what information was contained in the pretraining data, but also how well models are able to recall it correctly when prompted\. ”Reasoning” is only mentioned in relation to the subjects covered, which would require various forms of reasoning\. Implicitly, reasoning thus appears to be understood as ”applying” knowledge from pretraining to solve tasks: ”We introduced a new test that measures how well text models can learn and apply knowledge encountered during pretraining\.” \(Hendrycks et al\., [2020](https://arxiv.org/html/2605.14164#bib.bib43)\)\.

![Refer to caption](https://arxiv.org/html/2605.14164v1/fig/prescribed-competencies-reasoning-knowledge.png)

Figure 4\. Prescribed Competencies by Model Builders Within The Top Five ”Reasoning and knowledge” Benchmarks\. This heatmap shows the count of competency categories that model builders prescribe to benchmarks across model releases\. MMMLU is excluded as it is a translation of MMLU’s test set\.

All top five ”General knowledge application” benchmarks but MMLU emphasize reasoning, which they implicitly define as making logical inferences, over knowledge\. MMLU\-Pro is supposed to ”extend the mostly knowledge\-driven MMLU benchmark by integrating more challenging, reasoning\-focused questions\.” \(Wang et al\., [2024b](https://arxiv.org/html/2605.14164#bib.bib46)\) In practice, MMLU\-Pro added six incorrect but plausible options to multiple\-choice questions and increased the number of college\-level exam problems that would require ”deliberate reasoning” \(Wang et al\., [2024b](https://arxiv.org/html/2605.14164#bib.bib46)\), a term that the authors do not further define in their paper\. The clearest indication of what the authors understand as ”reasoning” is their error analysis of GPT\-4o: ”The model frequently encounters difficulties with logical reasoning, even when it recalls the correct information and knowledge” \(Wang et al\., [2024b](https://arxiv.org/html/2605.14164#bib.bib46)\)\. By implication, reasoning is understood to be the making of logical inferences\. The authors of the MMMU benchmark similarly describe ”reasoning errors” as errors ”where the model correctly interprets text and images and recalls relevant knowledge… \[yet\] fails to apply logical and mathematical reasoning skills effectively to derive accurate inferences” \(Yue et al\., [2024](https://arxiv.org/html/2605.14164#bib.bib47)\)\.

To ensure reasoning is required, authors of reasoning\-focused benchmarks claim to develop tasks that are ”non\-searchable”\. GPQA Diamond and HLE are less explicit about their understanding of reasoning but use metaphors similar to those of the authors of MMLU\-Pro and MMMU\. The questions in both GPQA Diamond and HLE should be ”non\-searchable\.” GPQA’s questions were designed to have a ground truth known to experts, but not to ”non\-experts using easily\-found internet resources, since we require that questions be hard and Google\-proof in order to be suitable for scalable oversight experiment” \(Rein et al\., [2023](https://arxiv.org/html/2605.14164#bib.bib45)\)\. For HLE, questions ”should be precise, unambiguous, solvable, and non\-searchable, ensuring models cannot rely on memorization or simple retrieval methods” \(Phan et al\., [2025](https://arxiv.org/html/2605.14164#bib.bib48)\)\. Moreover, the HLE authors put an emphasis on mathematics problems ”aimed at testing deep reasoning skills broadly applicable across multiple academic areas” \(Phan et al\., [2025](https://arxiv.org/html/2605.14164#bib.bib48)\)\.

However, what is missing in the reasoning\-focused benchmarks in Table [7](https://arxiv.org/html/2605.14164#S5.T7) is a reflection about the extent to which models really rely on logical inference rather than anything akin to what authors consider ”knowledge” to solve tasks\. As mentioned in Section [2\.2](https://arxiv.org/html/2605.14164#S2.SS2), data contamination is a well\-known issue in benchmarking, which skews models towards relying on knowledge rather than reasoning\. Likewise, the claim that reasoning tasks are difficult because they are non\-searchable arguably conflates information scarcity with the complexity or difficulty of the task\. There is also the implicit assumption that reasoning happens on a scale: HLE questions should not just test reasoning, but ”deep reasoning” \(Phan et al\., [2025](https://arxiv.org/html/2605.14164#bib.bib48)\), and the authors of MMLU\-Pro make a distinction between ”reasoning\-focused” subjects \(like math or physics\) and ”knowledge\-heavy” ones \(like history or law\) \(Wang et al\., [2024b](https://arxiv.org/html/2605.14164#bib.bib46)\)\. Implicitly, ”more” or ”deeper” reasoning is tied to questions that require more specialist domain expertise, while lower levels of reasoning are associated with common sense questions\. However, these assumptions are not made explicit and are not examined\. In addition, benchmark authors talk about measuring knowledge and reasoning in broad and general terms\. For example, the authors of MMMU argue that they measure progress towards AI systems that equal ”at least 90th percentile of skilled adults in a broad range of tasks” \(Yue et al\., [2024](https://arxiv.org/html/2605.14164#bib.bib47)\)\. The authors of GPQA claim to evaluate on tasks that border ”the frontier of human knowledge” \(Rein et al\., [2023](https://arxiv.org/html/2605.14164#bib.bib45)\)\.

This lack of reflection on construct validity appears to be partly driven by some benchmark authors’ goal of measuring progress towards AGI\. The authors of MMMU and MMLU\-Pro explicitly aim to help measure progress towards AGI following a framework defined by Morris et al\. \([2024](https://arxiv.org/html/2605.14164#bib.bib49)\)\. The framework consists of five ”Levels of AGI” based on the performance and generality of AI systems\. According to Morris et al\. \([2024](https://arxiv.org/html/2605.14164#bib.bib49)\), knowledge and reasoning are essential to progress to higher AGI levels: ”The ability to learn new skills…is essential to generality, since it is infeasible for a system to be optimized for all possible use cases a priori; this necessitates related sub\-skills such as the ability to select appropriate strategies for learning” \(Morris et al\., [2024](https://arxiv.org/html/2605.14164#bib.bib49)\)\. The authors of MMMU and MMLU\-Pro both specifically want to measure progress towards what Morris et al\. \([2024](https://arxiv.org/html/2605.14164#bib.bib49)\) call ”Expert AGI”: an AI system that reaches ”at least 90th percentile of skilled adults” on a ”wide range of non\-physical tasks\.” It is only the third level in their framework, but reaching it, they argue, would likely cause economic disruption as it would enable industries to ”reach the substitution threshold for machine intelligence in lieu of human labor” \(Morris et al\., [2024](https://arxiv.org/html/2605.14164#bib.bib49)\)\. Therefore, the authors of MMMU argue ”it is of both intellectual and societal importance to closely monitor the progress towards Expert AGI\.” \(Yue et al\., [2024](https://arxiv.org/html/2605.14164#bib.bib47)\)

However, the main inspiration of the MMMU and MMLU\-Pro authors \(Morris et al\., [2024](https://arxiv.org/html/2605.14164#bib.bib49)\) remains vague about how progression towards various levels of AGI should be measured\. What constitutes the 90th percentile of ”skilled adults”? And on how many tasks should an AI system reach their performance to cover ”most” tasks these skilled adults can perform? Morris et al\. \([2024](https://arxiv.org/html/2605.14164#bib.bib49)\) broadly suggest that an ”AGI benchmark” should evaluate a model’s ”ability to learn new skills…the ability to know when to ask for help, and… social metacognitive abilities such as those relating to theory of mind\.” Accordingly, the authors of MMMU and MMLU\-Pro emphasize reasoning over knowledge and highlight the broad range of tasks and subjects covered by their benchmarks\. This might be sufficient to claim to help measure progress towards ”Expert AGI” as defined by Morris et al\. \([2024](https://arxiv.org/html/2605.14164#bib.bib49)\), but the questions about construct validity raised above remain\.

We also found that GPQA Diamond and HLE are clearly informed by AGI narratives without explicitly citing AGI frameworks\. The GPQA Diamond authors caution that if ”narrowly superhuman AI systems could help to advance the frontier of human knowledge,” they are likely to produce answers that are difficult to verify even for subject\-matter experts \(Rein et al\., [2023](https://arxiv.org/html/2605.14164#bib.bib45)\)\. Their goal is to support experiments with ”scalable oversight,” a concept introduced by Amodei et al\. \([2016](https://arxiv.org/html/2605.14164#bib.bib44)\)\. The authors of HLE claim to evaluate the ”frontier of human knowledge, designed to be the final closed\-ended academic benchmark of its kind with broad subject coverage” \(Phan et al\., [2025](https://arxiv.org/html/2605.14164#bib.bib48)\)\. In this vein, it is only fitting that its authors originally planned to name their benchmark ”Humanity’s Last Stand\.” \(Roose, [2025](https://arxiv.org/html/2605.14164#bib.bib50)\) Branding a benchmark as ”final” or as evaluating ”frontier knowledge” implies a teleological inevitability about AGI\. The authors also stress that good performance on HLE ”would not alone suggest autonomous research capabilities or ’artificial general intelligence’” \(Phan et al\., [2025](https://arxiv.org/html/2605.14164#bib.bib48)\)\. This mirrors the language of Morris et al\. \([2024](https://arxiv.org/html/2605.14164#bib.bib49)\) about the importance of AGI systems learning new skills to achieve generality\.

Table 8\. Distribution of Subjects covered in Top 5 \(excluding MMMLU\) ”Reasoning and knowledge” Benchmarks by Field\.

Table 9\. Breakdown of Disciplines covered in the Science Field in ”Reasoning and knowledge” Benchmarks\.

## 6\. Discussion

The way model builders highlight benchmark results offers only very limited cross\-model comparison\. Model builders are very inconsistent about the benchmarks they highlight and how they frame them\. Our analysis of the top five benchmarks evaluating ”General knowledge application” illustrates that among the few benchmarks that are used more widely, several put an emphasis on measuring progress towards vaguely defined concepts of AGI over construct validity, which further undermines model comparison\.

Criticism about the quality of a benchmark does not seem to have much impact on its popularity among model builders\. Despite their popularity, several benchmarks in Table [7](https://arxiv.org/html/2605.14164#S5.T7) have been shown to contain incorrect information\. In July 2025, FutureHouse published a review of HLE pointing out ”that 29 ± 3\.7% \(95% CI\) of the text\-only chemistry and biology questions had answers with directly conflicting evidence in peer reviewed literature\.” \(White, [2025](https://arxiv.org/html/2605.14164#bib.bib51)\) However, more than 60% of all mentions of HLE in model release artifacts appeared after FutureHouse’s publication\. Uncertainty about the veracity of some of the contents of HLE did not stop its adoption by AI model builders\. As mentioned in our discussion of related work above, MMLU has also been criticized for containing a substantial number of errors, including wrong ground truths \(Gema et al\., [2025](https://arxiv.org/html/2605.14164#bib.bib16)\)\.

When presenting general\-purpose models, model builders in our dataset frequently use their selection of benchmarks to imply their models’ potential to replace human labor\. When model builders prominently highlight increased performance on benchmarks that explicitly or implicitly aim to track progress towards AGI, they imply that their model is getting closer to AGI and thus has a bigger capacity to replace human labor\. GPQA Diamond is worth pointing out here as one of the most frequently highlighted benchmarks in our data\. Its stated goal is not to evaluate specific model capabilities but to help develop methods to verify the correctness of a model’s response in scenarios where even subject\-matter experts struggle to verify it\. A high score on GPQA Diamond thus suggests that a model is potentially ”dangerous” because its capabilities have outpaced human oversight mechanisms, feeding into the narrative of creating ”superhuman AI systems\.”

We also found a decline in independent benchmarks being highlighted by model builders\. Increasingly, benchmark authors are affiliated with industry rather than academia\. Model builders also increasingly highlight benchmarks they created themselves\. OpenAI in particular highlighted 10 benchmarks it created itself\. A total of 36 benchmarks were fully or partly created by the model builders that evaluated one of their own models against them\. This trend is increasing, with 52\.8% of these benchmarks being published in 2025\.

Model builders focus on performance while leaving safety concerns unaddressed\. In public debate, there are many concerns about the biases, potential harms, and safety issues of generative AI models\. Yet, not a single benchmark in our dataset addresses these issues\. For example, there was no benchmark evaluating robustness against prompt injection, and none evaluating how race and gender tend to be framed by a model\. Those issues are typically reserved for model cards, but those are less public\-facing than public model release announcements\.

Benchmarks serve as narrative devices\. We observed several trends that show a change in the way benchmarks are created and used\. Increasingly, \(1\) benchmarks are produced by authors in industry, \(2\) benchmarks are created by model builders with the purpose of evaluating their own models, and \(3\) we see a shift in tested competencies that align with broader narratives around generative AI models and AGI\. Benchmarks increasingly serve a dual purpose: they are marketing tools as much as they serve a scientific process\. The boundaries between the two are murky and, looking at benchmarks published in 2025, increasingly disappearing\. Benchmarks highlighted by model builders often say less about the real performance of their AI models and more about their aspirations\.

## Author Contributions

SB, CB, and MB jointly developed the methodology, conducted the data analysis, and wrote the paper\. SB and CB led the data collection effort\. CB additionally designed and built the accompanying interactive tool\.

## Generative AI Usage Statement

Generative AI tools were used for literature search, proofreading, LaTeX table and figure formatting, and grammatical corrections\. They were not used during data collection, normalization, or annotation, all of which were conducted manually by the authors\.

## Acknowledgments

CB thanks the Mozilla Foundation for its support during the fellowship over which this work was conducted\.

## Competing Interests

MB was previously employed by Google DeepMind, which is among the model builders whose benchmarking practices are analyzed in this paper\. The analysis, findings, and conclusions are the authors’ own and do not reflect the views of Google DeepMind\. SB and CB declare no competing interests\.

## Ethical Considerations Statement

This research did not involve human subjects, collection of private data, or interventions\. The released dataset consists of openly available metadata with links and attribution, and does not redistribute proprietary content\.

## References

- M\. Abdalla and M\. Abdalla \(2021\)The Grey Hoodie Project: Big Tobacco, Big Tech, and the threat on academic integrity\.InProceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society,pp\. 287–297\.External Links:2009\.13676,[Document](https://dx.doi.org/10.1145/3461702.3462563),[Link](http://arxiv.org/abs/2009.13676)Cited by:[§2\.5](https://arxiv.org/html/2605.14164#S2.SS5.p2.1)\.
- N\. Alzahrani, H\. Alyahya, Y\. Alnumay, S\. Alrashed, S\. Alsubaie, Y\. Almushayqih, F\. Mirza, N\. Alotaibi, N\. Al\-Twairesh, A\. Alowisheq,et al\.\(2024\)When benchmarks are targets: revealing the sensitivity of large language model leaderboards\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 13787–13805\.Cited by:[§1](https://arxiv.org/html/2605.14164#S1.p1.1)\.
- D\. Amodei, C\. Olah, J\. Steinhardt, P\. Christiano, J\. Schulman, and D\. Mané \(2016\)Concrete Problems in AI Safety\.arXiv\.External Links:1606\.06565,[Document](https://dx.doi.org/10.48550/arXiv.1606.06565),[Link](http://arxiv.org/abs/1606.06565)Cited by:[§5](https://arxiv.org/html/2605.14164#S5.p10.1)\.
- Anthropic \(2025\)Claude 3\.7 sonnet system card\.Cited by:[§1](https://arxiv.org/html/2605.14164#S1.p1.1)\.
- A\. M\. Bean, R\. O\. Kearns, A\. Romanou, F\. S\. Hafner, H\. Mayne, J\. Batzner, N\. Foroutan, C\. Schmitz, K\. Korgul, H\. Batra,et al\.\(2025\)Measuring what matters: construct validity in large language model benchmarks\.arXiv preprint arXiv:2511\.04703\.Cited by:[§1](https://arxiv.org/html/2605.14164#S1.p2.1)\.
- \[6\]B\. Blili\-Hamelin, C\. Graziul, L\. Hancox\-Li, H\. Hazan, E\. El\-Mhamdi, A\. Ghosh, K\. A\. Heller, J\. Metcalf, F\. Murai, E\. Salvaggio,et al\.Position: stop treating AGI as the north\-star goal of ai research\.InForty\-second International Conference on Machine Learning Position Paper Track,Cited by:[§2\.4](https://arxiv.org/html/2605.14164#S2.SS4.p1.1)\.
- M\. Bohacek, N\. Scherrer, N\. Dufour, T\. Leung, C\. Bregler, and S\. C\. Chan \(2025\)Uncovering competency gaps in large language models and their benchmarks\.arXiv preprint arXiv:2512\.20638\.Cited by:[§2\.3](https://arxiv.org/html/2605.14164#S2.SS3.p1.1)\.
- R\. Bommasani, K\. Klyman, S\. Kapoor, S\. Longpre, B\. Xiong, N\. Maslej, and P\. Liang \(2024\)The 2024 foundation model transparency index\.arXiv preprint arXiv:2407\.12929\.Cited by:[§1](https://arxiv.org/html/2605.14164#S1.p1.1)\.
- R\. Bommasani, P\. Liang, and T\. Lee \(2023\)Holistic evaluation of language models\.Annals of the New York Academy of Sciences1525\(1\),pp\. 140–146\.Cited by:[§2\.1](https://arxiv.org/html/2605.14164#S2.SS1.p1.1)\.
- R\. Bommasani \(2021\)On the opportunities and risks of foundation models\.arXiv preprint arXiv:2108\.07258\.Cited by:[footnote 3](https://arxiv.org/html/2605.14164#footnote3)\.
- S\. Bowman and G\. Dahl \(2021\)What will it take to fix benchmarking in natural language understanding?\.InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,pp\. 4843–4855\.Cited by:[§1](https://arxiv.org/html/2605.14164#S1.p1.1)\.
- A\. Campolo \(2025\)State\-of\-the\-art: the temporal order of benchmarking culture\.Digital Society4\(2\),pp\. 35\.Cited by:[§2\.4](https://arxiv.org/html/2605.14164#S2.SS4.p1.1),[§2](https://arxiv.org/html/2605.14164#S2.p1.1)\.
- M\. V\. Carro, D\. A\. Mester, F\. G\. Selasco, L\. N\. F\. Gangi, M\. S\. Musa, L\. R\. Pereyra, M\. Leiva, J\. G\. Corvalan, M\. V\. Martinez, and G\. Simari \(2025\)A conceptual framework for ai capability evaluations\.arXiv preprint arXiv:2506\.18213\.Cited by:[§2\.1](https://arxiv.org/html/2605.14164#S2.SS1.p1.1)\.
- Y\. Chang, X\. Wang, J\. Wang, Y\. Wu, L\. Yang, K\. Zhu, H\. Chen, X\. Yi, C\. Wang, Y\. Wang,et al\.\(2024\)A survey on evaluation of large language models\.ACM transactions on intelligent systems and technology15\(3\),pp\. 1–45\.Cited by:[§2](https://arxiv.org/html/2605.14164#S2.p1.1)\.
- S\. Chen, Y\. Chen, Z\. Li, Y\. Jiang, Z\. Wan, Y\. He, D\. Ran, T\. Gu, H\. Li, T\. Xie,et al\.\(2025\)Benchmarking large language models under data contamination: a survey from static to dynamic evaluation\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 10091–10109\.Cited by:[footnote 1](https://arxiv.org/html/2605.14164#footnote1)\.
- Y\. Cheng, Y\. Chang, and Y\. Wu \(2025\)A survey on data contamination for large language models\.arXiv preprint arXiv:2502\.14425\.Cited by:[§1](https://arxiv.org/html/2605.14164#S1.p1.1)\.
- W\. Chiang, L\. Zheng, Y\. Sheng, A\. N\. Angelopoulos, T\. Li, D\. Li, B\. Zhu, H\. Zhang, M\. Jordan, J\. E\. Gonzalez,et al\.\(2024\)Chatbot arena: an open platform for evaluating llms by human preference\.InForty\-first International Conference on Machine Learning,Cited by:[§3](https://arxiv.org/html/2605.14164#S3.p2.1)\.
- C\. Deng, Y\. Zhao, X\. Tang, M\. Gerstein, and A\. Cohan \(2024\)Investigating data contamination in modern benchmarks for large language models\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 8706–8719\.Cited by:[§2\.2](https://arxiv.org/html/2605.14164#S2.SS2.p1.1)\.
- P\. J\. DiMaggio and W\. W\. Powell \(1983\)The Iron Cage Revisited: Institutional Isomorphism and Collective Rationality in Organizational Fields\.48\(2\),pp\. 147–160\.External Links:2095101,ISSN 0003\-1224,[Document](https://dx.doi.org/10.2307/2095101),[Link](https://www.jstor.org/stable/2095101)Cited by:[§2](https://arxiv.org/html/2605.14164#S2.p1.1)\.
- R\. Dominguez\-Olmedo, F\. E\. Dorner, and M\. Hardt \(2024\)Training on the test task confounds evaluation and emergence\.arXiv preprint arXiv:2407\.07890\.Cited by:[§2\.1](https://arxiv.org/html/2605.14164#S2.SS1.p1.1)\.
- M\. Eriksson, E\. Purificato, A\. Noroozian, J\. Vinagre, G\. Chaslot, E\. Gomez, and D\. Fernandez\-Llorca \(2025\)Can we trust AI benchmarks? an interdisciplinary review of current issues in AI evaluation\.arXiv preprint arXiv:2502\.06559\.Cited by:[§1](https://arxiv.org/html/2605.14164#S1.p1.1),[§2\.4](https://arxiv.org/html/2605.14164#S2.SS4.p1.1)\.
- K\. Ethayarajh and D\. Jurafsky \(2020\)Utility is in the eye of the user: a critique of NLP leaderboards\.arXiv preprint arXiv:2009\.13888\.Cited by:[§1](https://arxiv.org/html/2605.14164#S1.p1.1)\.
- J\. Fodor \(2025\)Line goes up? inherent limitations of benchmarks for evaluating large language models\.arXiv preprint arXiv:2502\.14318\.Cited by:[§2\.3](https://arxiv.org/html/2605.14164#S2.SS3.p1.1)\.
- A\. P\. Gema, J\. O\. J\. Leang, G\. Hong, A\. Devoto, A\. C\. M\. Mancino, R\. Saxena, X\. He, Y\. Zhao, X\. Du, M\. R\. G\. Madani, et al\. \(2025\) Are we done with MMLU? In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), pp\. 5069–5096\. Cited by: [§6](https://arxiv.org/html/2605.14164#S6.p2.1)\.
- C\. A\. Goodhart \(1984\) Problems of monetary management: the UK experience\. In Monetary theory and practice: The UK experience, pp\. 91–121\. Cited by: [§2\.1](https://arxiv.org/html/2605.14164#S2.SS1.p1.1)\.
- J\. Haimes, C\. Wenner, K\. Thaman, V\. Tashev, C\. Neo, E\. Kran, and J\. Schreiber \(2024\) Benchmark inflation: revealing LLM performance gaps using retro\-holdouts\. arXiv preprint arXiv:2410\.09247\. Cited by: [§1](https://arxiv.org/html/2605.14164#S1.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2020\) Measuring Massive Multitask Language Understanding\. arXiv\. External Links: 2009\.03300, [Document](https://dx.doi.org/10.48550/arXiv.2009.03300), [Link](http://arxiv.org/abs/2009.03300)\. Cited by: [§5](https://arxiv.org/html/2605.14164#S5.p4.1)\.
- N\. Jain, K\. Han, A\. Gu, W\. Li, F\. Yan, T\. Zhang, S\. Wang, A\. Solar\-Lezama, K\. Sen, and I\. Stoica \(2024\) LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code\. arXiv\. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2403.07974), [Link](https://arxiv.org/abs/2403.07974)\. Cited by: [§4\.2](https://arxiv.org/html/2605.14164#S4.SS2.p2.1)\.
- A\. S\. Joaquin, R\. Gipiškis, L\. Staufer, and A\. Gil \(2025\) Deprecating benchmarks: criteria and framework\. arXiv preprint arXiv:2507\.06434\. Cited by: [§1](https://arxiv.org/html/2605.14164#S1.p1.1)\.
- S\. Khanal, H\. Zhang, and A\. Taeihagh \(2025\) Why and how is the power of Big Tech increasing in the policy process? The case of generative AI\. 44\(1\), pp\. 52–69\. External Links: ISSN 1449\-4035, [Document](https://dx.doi.org/10.1093/polsoc/puae012), [Link](https://dx.doi.org/10.1093/polsoc/puae012)\. Cited by: [§2\.5](https://arxiv.org/html/2605.14164#S2.SS5.p2.1)\.
- D\. Kiela, M\. Bartolo, Y\. Nie, D\. Kaushik, A\. Geiger, Z\. Wu, B\. Vidgen, G\. Prasad, A\. Singh, P\. Ringshia, et al\. \(2021\) Dynabench: rethinking benchmarking in NLP\. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp\. 4110–4124\. Cited by: [§2\.1](https://arxiv.org/html/2605.14164#S2.SS1.p1.1)\.
- B\. Koch, E\. Denton, A\. Hanna, and J\. G\. Foster \(2021\) Reduced, reused and recycled: the life of a dataset in machine learning research\. arXiv preprint arXiv:2112\.01716\. Cited by: [§1](https://arxiv.org/html/2605.14164#S1.p2.1)\.
- M\. T\. R\. Laskar, S\. Alqahtani, M\. S\. Bari, M\. Rahman, M\. A\. M\. Khan, H\. Khan, I\. Jahan, A\. Bhuiyan, C\. W\. Tan, M\. R\. Parvez, et al\. \(2024\) A systematic survey and critical review on evaluating large language models: challenges, limitations, and recommendations\. arXiv preprint arXiv:2407\.04069\. Cited by: [§2](https://arxiv.org/html/2605.14164#S2.p1.1)\.
- Y\. Li, F\. Guerin, and C\. Lin \(2023\) Avoiding data contamination in language model evaluation: dynamic test construction with latest materials\. arXiv preprint arXiv:2312\.12343\. Cited by: [§2\.2](https://arxiv.org/html/2605.14164#S2.SS2.p1.1)\.
- P\. Liang, R\. Bommasani, T\. Lee, D\. Tsipras, D\. Soylu, M\. Yasunaga, Y\. Zhang, D\. Narayanan, Y\. Wu, A\. Kumar, et al\. \(2022\) Holistic evaluation of language models\. arXiv preprint arXiv:2211\.09110\. Cited by: [§2](https://arxiv.org/html/2605.14164#S2.p1.1)\.
- T\. Liao, R\. Taori, I\. D\. Raji, and L\. Schmidt \(2021\) Are we learning yet? A meta review of evaluation failures across machine learning\. In Thirty\-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track \(Round 2\)\. Cited by: [§1](https://arxiv.org/html/2605.14164#S1.p2.1)\.
- J\. C\. Magalhães and R\. Smit \(2026\) Less Hype, More Drama: Open\-Ended Technological Inevitability in Journalistic Discourses About AI in the US, The Netherlands, and Brazil\. 14\(2\), pp\. 323–340\. External Links: ISSN 2167\-0811, [Document](https://dx.doi.org/10.1080/21670811.2025.2522281), [Link](https://doi.org/10.1080/21670811.2025.2522281)\. Cited by: [§2\.5](https://arxiv.org/html/2605.14164#S2.SS5.p1.1)\.
- N\. Maslej, L\. Fattorini, R\. Perrault, Y\. Gil, V\. Parli, N\. Kariuki, E\. Capstick, A\. Reuel, E\. Brynjolfsson, J\. Etchemendy, et al\. \(2025\) Artificial intelligence index report 2025\. arXiv preprint arXiv:2504\.07139\. Cited by: [§3](https://arxiv.org/html/2605.14164#S3.p2.1)\.
- M\. R\. Morris, J\. Sohl\-Dickstein, N\. Fiedel, T\. Warkentin, A\. Dafoe, A\. Faust, C\. Farabet, and S\. Legg \(2024\) Levels of AGI for Operationalizing Progress on the Path to AGI\. arXiv\. External Links: 2311\.02462, [Document](https://dx.doi.org/10.48550/arXiv.2311.02462), [Link](http://arxiv.org/abs/2311.02462)\. Cited by: [§5](https://arxiv.org/html/2605.14164#S5.p10.1), [§5](https://arxiv.org/html/2605.14164#S5.p8.1), [§5](https://arxiv.org/html/2605.14164#S5.p9.1), [§5](https://arxiv.org/html/2605.14164#S5.p9.1.1)\.
- S\. Ni, X\. Kong, C\. Li, X\. Hu, R\. Xu, J\. Zhu, and M\. Yang \(2025\) Training on the benchmark is not all you need\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 39, pp\. 24948–24956\. Cited by: [§2\.1](https://arxiv.org/html/2605.14164#S2.SS1.p1.1)\.
- R\. K\. Nielsen \(2024\) External Links: [Link](http://reutersinstitute.politics.ox.ac.uk/news/how-news-coverage-often-uncritical-helps-build-ai-hype)\. Cited by: [§2\.5](https://arxiv.org/html/2605.14164#S2.SS5.p1.1)\.
- OpenAI \(2023a\) GPT\-4 research preview: capabilities and limitations\. Cited by: [§1](https://arxiv.org/html/2605.14164#S1.p1.1)\.
- OpenAI \(2023b\) GPT\-4 system card\. Cited by: [§1](https://arxiv.org/html/2605.14164#S1.p1.1)\.
- OpenAI \(2024\) OpenAI o1 system card\. Cited by: [§1](https://arxiv.org/html/2605.14164#S1.p1.1)\.
- Y\. Oren, N\. Meister, N\. S\. Chatterji, F\. Ladhak, and T\. Hashimoto \(2023\) Proving test set contamination in black\-box language models\. In The Twelfth International Conference on Learning Representations\. Cited by: [§2\.1](https://arxiv.org/html/2605.14164#S2.SS1.p1.1)\.
- S\. Ott, A\. Barbosa\-Silva, K\. Blagec, J\. Brauner, and M\. Samwald \(2022\) Mapping global dynamics of benchmark creation and saturation in artificial intelligence\. Nature Communications 13\(1\), pp\. 6793\. Cited by: [§2\.4](https://arxiv.org/html/2605.14164#S2.SS4.p1.1)\.
- L\. Phan, A\. Gatti, Z\. Han, N\. Li, J\. Hu, H\. Zhang, C\. B\. C\. Zhang, M\. Shaaban, J\. Ling, S\. Shi, M\. Choi, A\. Agrawal, A\. Chopra, A\. Khoja, R\. Kim, R\. Ren, J\. Hausenloy, O\. Zhang, M\. Mazeika, D\. Dodonov, T\. Nguyen, J\. Lee, D\. Anderson, M\. Doroshenko, A\. C\. Stokes, M\. Mahmood, O\. Pokutnyi, O\. Iskra, J\. P\. Wang, J\. Levin, M\. Kazakov, F\. Feng, S\. Y\. Feng, H\. Zhao, M\. Yu, V\. Gangal, C\. Zou, Z\. Wang, S\. Popov, R\. Gerbicz, G\. Galgon, J\. Schmitt, W\. Yeadon, Y\. Lee, S\. Sauers, A\. Sanchez, F\. Giska, M\. Roth, S\. Riis, S\. Utpala, N\. Burns, G\. M\. Goshu, M\. M\. Naiya, C\. Agu, Z\. Giboney, A\. Cheatom, F\. Fournier\-Facio, S\. Crowson, L\. Finke, Z\. Cheng, J\. Zampese, R\. G\. Hoerr, M\. Nandor, H\. Park, T\. Gehrunger, J\. Cai, B\. McCarty, A\. C\. Garretson, E\. Taylor, D\. Sileo, Q\. Ren, U\. Qazi, L\. Li, J\. Nam, J\. B\. Wydallis, P\. Arkhipov, J\. W\. L\. Shi, A\. Bacho, C\. G\. Willcocks, H\. Cao, S\. Motwani, E\. d\. O\. Santos, J\. Veith, E\. Vendrow, D\. Cojoc, K\. Zenitani, J\. Robinson, L\. Tang, Y\. Li, J\. Vendrow, N\. W\. Fraga, V\. Kuchkin, A\. P\. Maksimov, P\. Marion, D\. Efremov, J\. Lynch, K\. Liang, A\. Mikov, A\. Gritsevskiy, J\. Guillod, G\. Demir, D\. Martinez, B\. Pageler, K\. Zhou, S\. Soori, O\. Press, H\. Tang, P\. Rissone, S\. R\. Green, L\. Brüssel, M\. Twayana, A\. Dieuleveut, J\. M\. Imperial, A\. Prabhu, J\. Yang, N\. Crispino, A\. Rao, D\. Zvonkine, G\. Loiseau, M\. Kalinin, M\. Lukas, C\. Manolescu, N\. Stambaugh, S\. Mishra, T\. Hogg, C\. Bosio, B\. P\. Coppola, J\. Salazar, J\. Jin, R\. Sayous, S\. Ivanov, P\. Schwaller, S\. Senthilkuma, A\. M\. Bran, A\. Algaba, K\. V\. den Houte, L\. V\. D\. Sypt, B\. Verbeken, D\. Noever, A\. Kopylov, B\. Myklebust, B\. Li, L\. Schut, E\. Zheltonozhskii, Q\. Yuan, D\. Lim, R\. Stanley, T\. Yang, J\. Maar, J\. Wykowski, M\. Oller, A\. Sahu, C\. G\. Ardito, Y\. Hu, A\. G\. K\. Kamdoum, A\. Jin, T\. G\. Vilchis, Y\. Zu, M\. Lackner, J\. Koppel, G\. 
Sun, D\. S\. Antonenko, S\. Chern, B\. Zhao, P\. Arsene, J\. M\. Cavanagh, D\. Li, J\. Shen, D\. Crisostomi, W\. Zhang, A\. Dehghan, S\. Ivanov, D\. Perrella, N\. Kaparov, A\. Zang, I\. Sucholutsky, A\. Kharlamova, D\. Orel, V\. Poritski, S\. Ben\-David, Z\. Berger, P\. Whitfill, M\. Foster, D\. Munro, L\. Ho, S\. Sivarajan, D\. B\. Hava, A\. Kuchkin, D\. Holmes, A\. Rodriguez\-Romero, F\. Sommerhage, A\. Zhang, R\. Moat, K\. Schneider, Z\. Kazibwe, D\. Clarke, D\. H\. Kim, F\. M\. Dias, S\. Fish, V\. Elser, T\. Kreiman, V\. E\. G\. Vilchis, I\. Klose, U\. Anantheswaran, A\. Zweiger, K\. Rawal, J\. Li, J\. Nguyen, N\. Daans, H\. Heidinger, M\. Radionov, V\. Rozhoň, V\. Ginis, C\. Stump, N\. Cohen, R\. Poświata, J\. Tkadlec, A\. Goldfarb, C\. Wang, P\. Padlewski, S\. Barzowski, K\. Montgomery, R\. Stendall, J\. Tucker\-Foltz, J\. Stade, T\. R\. Rogers, T\. Goertzen, D\. Grabb, A\. Shukla, A\. Givré, J\. A\. Ambay, A\. Sen, M\. F\. Aziz, M\. H\. Inlow, H\. He, L\. Zhang, Y\. Kaddar, I\. Ängquist, Y\. Chen, H\. K\. Wang, K\. Ramakrishnan, E\. Thornley, A\. Terpin, H\. Schoelkopf, E\. Zheng, A\. Carmi, E\. D\. L\. Brown, K\. Zhu, M\. Bartolo, R\. Wheeler, M\. Stehberger, P\. Bradshaw, J\. P\. Heimonen, K\. Sridhar, I\. Akov, J\. Sandlin, Y\. Makarychev, J\. Tam, H\. Hoang, D\. M\. Cunningham, V\. Goryachev, D\. Patramanis, M\. Krause, A\. Redenti, D\. Aldous, J\. Lai, S\. Coleman, J\. Xu, S\. Lee, I\. Magoulas, S\. Zhao, N\. Tang, M\. K\. Cohen, O\. Paradise, J\. H\. Kirchner, M\. Ovchynnikov, J\. O\. Matos, A\. Shenoy, M\. Wang, Y\. Nie, A\. Sztyber\-Betley, P\. Faraboschi, R\. Riblet, J\. Crozier, S\. Halasyamani, S\. Verma, P\. Joshi, E\. Meril, Z\. Ma, J\. Andréoletti, R\. Singhal, J\. Platnick, V\. Nevirkovets, L\. Basler, A\. Ivanov, S\. Khoury, N\. Gustafsson, M\. Piccardo, H\. Mostaghimi, Q\. Chen, V\. Singh, T\. Q\. Khánh, P\. Rosu, H\. Szlyk, Z\. Brown, H\. Narayan, A\. Menezes, J\. Roberts, W\. Alley, K\. Sun, A\. Patel, M\. Lamparth, A\. Reuel, L\. Xin, H\. 
Xu, J\. Loader, F\. Martin, Z\. Wang, A\. Achilleos, T\. Preu, T\. Korbak, I\. Bosio, F\. Kazemi, Z\. Chen, B\. Bálint, E\. J\. Y\. Lo, J\. Wang, M\. I\. S\. Nunes, J\. Milbauer, M\. S\. Bari, Z\. Wang, B\. Ansarinejad, Y\. Sun, S\. Durand, H\. Elgnainy, G\. Douville, D\. Tordera, G\. Balabanian, H\. Wolff, L\. Kvistad, H\. Milliron, A\. Sakor, M\. Eron, A\. F\. D\. O, S\. Shah, X\. Zhou, F\. Kamalov, S\. Abdoli, T\. Santens, S\. Barkan, A\. Tee, R\. Zhang, A\. Tomasiello, G\. B\. D\. Luca, S\. Looi, V\. Le, N\. Kolt, J\. Pan, E\. Rodman, J\. Drori, C\. J\. Fossum, N\. Muennighoff, M\. Jagota, R\. Pradeep, H\. Fan, J\. Eicher, M\. Chen, K\. Thaman, W\. Merrill, M\. Firsching, C\. Harris, S\. Ciobâcă, J\. Gross, R\. Pandey, I\. Gusev, A\. Jones, S\. Agnihotri, P\. Zhelnov, M\. Mofayezi, A\. Piperski, D\. K\. Zhang, K\. Dobarskyi, R\. Leventov, I\. Soroko, J\. Duersch, V\. Taamazyan, A\. Ho, W\. Ma, W\. Held, R\. Xian, A\. R\. Zebaze, M\. Mohamed, J\. N\. Leser, M\. X\. Yuan, L\. Yacar, J\. Lengler, K\. Olszewska, C\. D\. Fratta, E\. Oliveira, J\. W\. Jackson, A\. Zou, M\. Chidambaram, T\. Manik, H\. Haffenden, D\. Stander, A\. Dasouqi, A\. Shen, B\. Golshani, D\. Stap, E\. Kretov, M\. Uzhou, A\. B\. Zhidkovskaya, N\. Winter, M\. O\. Rodriguez, R\. Lauff, D\. Wehr, C\. Tang, Z\. Hossain, S\. Phillips, F\. Samuele, F\. Ekström, A\. Hammon, O\. Patel, F\. Farhidi, G\. Medley, F\. Mohammadzadeh, M\. Peñaflor, H\. Kassahun, A\. Friedrich, R\. H\. Perez, D\. Pyda, T\. Sakal, O\. Dhamane, A\. K\. Mirabadi, E\. Hallman, K\. Okutsu, M\. Battaglia, M\. Maghsoudimehrabani, A\. Amit, D\. Hulbert, R\. Pereira, S\. Weber, Handoko, A\. Peristyy, S\. Malina, M\. Mehkary, R\. Aly, F\. Reidegeld, A\. Dick, C\. Friday, M\. Singh, H\. Shapourian, W\. Kim, M\. Costa, H\. Gurdogan, H\. Kumar, C\. Ceconello, C\. Zhuang, H\. Park, M\. Carroll, A\. R\. Tawfeek, S\. Steinerberger, D\. Aggarwal, M\. Kirchhof, L\. Dai, E\. Kim, J\. Ferret, J\. Shah, Y\. Wang, M\. Yan, K\. Burdzy, L\. 
Zhang, A\. Franca, D\. T\. Pham, K\. Y\. Loh, J\. Robinson, A\. Jackson, P\. Giordano, P\. Petersen, A\. Cosma, J\. Colino, C\. White, J\. Votava, V\. Vinnikov, E\. Delaney, P\. Spelda, V\. Stritecky, S\. M\. Shahid, J\. Mourrat, L\. Vetoshkin, K\. Sponselee, R\. Bacho, Z\. Yong, F\. de la Rosa, N\. Cho, X\. Li, G\. Malod, O\. Weller, G\. Albani, L\. Lang, J\. Laurendeau, D\. Kazakov, F\. Adesanya, J\. Portier, L\. Hollom, V\. Souza, Y\. A\. Zhou, J\. Degorre, Y\. Yalın, G\. D\. Obikoya, Rai, F\. Bigi, M\. C\. Boscá, O\. Shumar, K\. Bacho, G\. Recchia, M\. Popescu, N\. Shulga, N\. M\. Tanwie, T\. C\. H\. Lux, B\. Rank, C\. Ni, M\. Brooks, A\. Yakimchyk, Huanxu, Liu, S\. Cavalleri, O\. Häggström, E\. Verkama, J\. Newbould, H\. Gundlach, L\. Brito\-Santana, B\. Amaro, V\. Vajipey, R\. Grover, T\. Wang, Y\. Kratish, W\. Li, S\. Gopi, A\. Caciolai, C\. S\. de Witt, P\. Hernández\-Cámara, E\. Rodolà, J\. Robins, D\. Williamson, V\. Cheng, B\. Raynor, H\. Qi, B\. Segev, J\. Fan, S\. Martinson, E\. Y\. Wang, K\. Hausknecht, M\. P\. Brenner, M\. Mao, C\. Demian, P\. Kassani, X\. Zhang, D\. Avagian, E\. J\. Scipio, A\. Ragoler, J\. Tan, B\. Sims, R\. Plecnik, A\. Kirtland, O\. F\. Bodur, D\. P\. Shinde, Y\. C\. L\. Labrador, Z\. Adoul, M\. Zekry, A\. Karakoc, T\. C\. B\. Santos, S\. Shamseldeen, L\. Karim, A\. Liakhovitskaia, N\. Resman, N\. Farina, J\. C\. Gonzalez, G\. Maayan, E\. Anderson, R\. D\. O\. Pena, E\. Kelley, H\. Mariji, R\. Pouriamanesh, W\. Wu, R\. Finocchio, I\. Alarab, J\. Cole, D\. Ferreira, B\. Johnson, M\. Safdari, L\. Dai, S\. Arthornthurasuk, I\. C\. McAlister, A\. J\. Moyano, A\. Pronin, J\. Fan, A\. Ramirez\-Trinidad, Y\. Malysheva, D\. Pottmaier, O\. Taheri, S\. Stepanic, S\. Perry, L\. Askew, R\. A\. H\. Rodríguez, A\. M\. R\. Minissi, R\. Lorena, K\. Iyer, A\. A\. Fasiludeen, R\. Clark, J\. Ducey, M\. Piza, M\. Somrak, E\. Vergo, J\. Qin, B\. Borbás, E\. Chu, J\. Lindsey, A\. Jallon, I\. M\. J\. McInnis, E\. Chen, A\. Semler, L\. Gloor, T\. 
Shah, M\. Carauleanu, P\. Lauer, T\. Đ\. Huy, H\. Shahrtash, E\. Duc, L\. Lewark, A\. Brown, S\. Albanie, B\. Weber, W\. S\. Vaz, P\. Clavier, Y\. Fan, G\. P\. R\. e Silva, Long, Lian, M\. Abramovitch, X\. Jiang, S\. Mendoza, M\. Islam, J\. Gonzalez, V\. Mavroudis, J\. Xu, P\. Kumar, L\. P\. Goswami, D\. Bugas, N\. Heydari, F\. Jeanplong, T\. Jansen, A\. Pinto, A\. Apronti, A\. Galal, N\. Ze\-An, A\. Singh, T\. Jiang, J\. o\. A\. Xavier, K\. P\. Agarwal, M\. Berkani, G\. Zhang, Z\. Du, B\. A\. d\. O\. Junior, D\. Malishev, N\. Remy, T\. D\. Hartman, T\. Tarver, S\. Mensah, G\. A\. Loume, W\. Morak, F\. Habibi, S\. Hoback, W\. Cai, J\. Gimenez, R\. G\. Montecillo, J\. Łucki, R\. Campbell, A\. Sharma, K\. Meer, S\. Gul, D\. E\. Gonzalez, X\. Alapont, A\. Hoover, G\. Chhablani, F\. Vargus, A\. Agarwal, Y\. Jiang, D\. Patil, D\. Outevsky, K\. J\. Scaria, R\. Maheshwari, A\. Dendane, P\. Shukla, A\. Cartwright, S\. Bogdanov, N\. Mündler, S\. Möller, L\. Arnaboldi, K\. Thaman, M\. R\. Siddiqi, P\. Saxena, H\. Gupta, T\. Fruhauff, G\. Sherman, M\. Vincze, S\. Usawasutsakorn, D\. Ler, A\. Radhakrishnan, I\. Enyekwe, S\. M\. Salauddin, J\. Muzhen, A\. Maksapetyan, V\. Rossbach, C\. Harjadi, M\. Bahaloohoreh, C\. Sparrow, J\. Sidhu, S\. Ali, S\. Bian, J\. Lai, E\. Singer, J\. L\. Uro, G\. Bateman, M\. Sayed, A\. Menshawy, D\. Duclosel, D\. Bezzi, Y\. Jain, A\. Aaron, M\. Tiryakioglu, S\. Siddh, K\. Krenek, I\. A\. Shah, J\. Jin, S\. Creighton, D\. Peskoff, Z\. EL\-Wasif, R\. P\. V, M\. Richmond, J\. McGowan, T\. Patwardhan, H\. Sun, T\. Sun, N\. Zubić, S\. Sala, S\. Ebert, J\. Kaddour, M\. Schottdorf, D\. Wang, G\. Petruzella, A\. Meiburg, T\. Medved, A\. ElSheikh, S\. A\. Hebbar, L\. Vaquero, X\. Yang, J\. Poulos, V\. Zouhar, S\. Bogdanik, M\. Zhang, J\. Sanz\-Ros, D\. Anugraha, Y\. Dai, A\. N\. Nhu, X\. Wang, A\. A\. Demircali, Z\. Jia, Y\. Zhou, J\. Wu, M\. He, N\. Chandok, A\. Sinha, G\. Luo, L\. Le, M\. Noyé, M\. Perełkiewicz, I\. Pantidis, T\. Qi, S\. S\. Purohit, L\. 
Parcalabescu, T\. Nguyen, G\. I\. Winata, E\. M\. Ponti, H\. Li, K\. Dhole, J\. Park, D\. Abbondanza, Y\. Wang, A\. Nayak, D\. M\. Caetano, A\. A\. W\. L\. Wong, M\. del Rio\-Chanona, D\. Kondor, P\. Francois, E\. Chalstrey, J\. Zsambok, D\. Hoyer, J\. Reddish, J\. Hauser, F\. Rodrigo\-Ginés, S\. Datta, M\. Shepherd, T\. Kamphuis, Q\. Zhang, H\. Kim, R\. Sun, J\. Yao, F\. Dernoncourt, S\. Krishna, S\. Rismanchian, B\. Pu, F\. Pinto, Y\. Wang, K\. Shridhar, K\. J\. Overholt, G\. Briia, H\. Nguyen, David, S\. Bartomeu, T\. C\. Pang, A\. Wecker, Y\. Xiong, F\. Li, L\. S\. Huber, J\. Jaeger, R\. D\. Maddalena, X\. H\. Lù, Y\. Zhang, C\. Beger, P\. T\. J\. Kon, S\. Li, V\. Sanker, M\. Yin, Y\. Liang, X\. Zhang, A\. Agrawal, L\. S\. Yifei, Z\. Zhang, M\. Cai, Y\. Sonmez, C\. Cozianu, C\. Li, A\. Slen, S\. Yu, H\. K\. Park, G\. Sarti, M\. Briański, A\. Stolfo, T\. A\. Nguyen, M\. Zhang, Y\. Perlitz, J\. Hernandez\-Orallo, R\. Li, A\. Shabani, F\. Juefei\-Xu, S\. Dhingra, O\. Zohar, M\. C\. Nguyen, A\. Pondaven, A\. Yilmaz, X\. Zhao, C\. Jin, M\. Jiang, S\. Todoran, X\. Han, J\. Kreuer, B\. Rabern, A\. Plassart, M\. Maggetti, L\. Yap, R\. Geirhos, J\. Kean, D\. Wang, S\. Mollaei, C\. Sun, Y\. Yin, S\. Wang, R\. Li, Y\. Chang, A\. Wei, A\. Bizeul, X\. Wang, A\. O\. Arrais, K\. Mukherjee, J\. Chamorro\-Padial, J\. Liu, X\. Qu, J\. Guan, A\. Bouyamourn, S\. Wu, M\. Plomecka, J\. Chen, M\. Tang, J\. Deng, S\. Subramanian, H\. Xi, H\. Chen, W\. Zhang, Y\. Ren, H\. Tu, S\. Kim, Y\. Chen, S\. V\. Marjanović, J\. Ha, G\. Luczyna, J\. J\. Ma, Z\. Shen, D\. Song, C\. E\. Zhang, Z\. Wang, G\. Gendron, Y\. Xiao, L\. Smucker, E\. Weng, K\. H\. Lee, Z\. Ye, S\. Ermon, I\. D\. Lopez\-Miguel, T\. Knights, A\. Gitter, N\. Park, B\. Wei, H\. Chen, K\. Pai, A\. Elkhanany, H\. Lin, P\. D\. Siedler, J\. Fang, R\. Mishra, K\. Zsolnai\-Fehér, X\. Jiang, S\. Khan, J\. Yuan, R\. K\. Jain, X\. Lin, M\. Peterson, Z\. Wang, A\. Malusare, M\. Tang, I\. Gupta, I\. Fosin, T\. Kang, B\. Dworakowska, K\. 
Matsumoto, G\. Zheng, G\. Sewuster, J\. P\. Villanueva, I\. Rannev, I\. Chernyavsky, J\. Chen, D\. Banik, B\. Racz, W\. Dong, J\. Wang, L\. Bashmal, D\. V\. Gonçalves, W\. Hu, K\. Bar, O\. Bohdal, A\. S\. Patlan, S\. Dhuliawala, C\. Geirhos, J\. Wist, Y\. Kansal, B\. Chen, K\. Tire, A\. T\. Yücel, B\. Christof, V\. Singla, Z\. Song, S\. Chen, J\. Ge, K\. Ponkshe, I\. Park, T\. Shi, M\. Q\. Ma, J\. Mak, S\. Lai, A\. Moulin, Z\. Cheng, Z\. Zhu, Z\. Zhang, V\. Patil, K\. Jha, Q\. Men, J\. Wu, T\. Zhang, B\. H\. Vieira, A\. F\. Aji, J\. Chung, M\. Mahfoud, H\. T\. Hoang, M\. Sperzel, W\. Hao, K\. Meding, S\. Xu, V\. Kostakos, D\. Manini, Y\. Liu, C\. Toukmaji, J\. Paek, E\. Yu, A\. E\. Demircali, Z\. Sun, I\. Dewerpe, H\. Qin, R\. Pflugfelder, J\. Bailey, J\. Morris, V\. Heilala, S\. Rosset, Z\. Yu, P\. E\. Chen, W\. Yeo, E\. Jain, R\. Yang, S\. Chigurupati, J\. Chernyavsky, S\. P\. Reddy, S\. Venugopalan, H\. Batra, C\. F\. Park, H\. Tran, G\. Maximiano, G\. Zhang, Y\. Liang, H\. Shiyu, R\. Xu, R\. Pan, S\. Suresh, Z\. Liu, S\. Gulati, S\. Zhang, P\. Turchin, C\. W\. Bartlett, C\. R\. Scotese, P\. M\. Cao, A\. Nattanmai, G\. McKellips, A\. Cheraku, A\. Suhail, E\. Luo, M\. Deng, J\. Luo, A\. Zhang, K\. Jindel, J\. Paek, K\. Halevy, A\. Baranov, M\. Liu, A\. Avadhanam, D\. Zhang, V\. Cheng, B\. Ma, E\. Fu, L\. Do, J\. Lass, H\. Yang, S\. Sunkari, V\. Bharath, V\. Ai, J\. Leung, R\. Agrawal, A\. Zhou, K\. Chen, T\. Kalpathi, Z\. Xu, G\. Wang, T\. Xiao, E\. Maung, S\. Lee, R\. Yang, R\. Yue, B\. Zhao, J\. Yoon, S\. Sun, A\. Singh, E\. Luo, C\. Peng, T\. Osbey, T\. Wang, D\. Echeazu, H\. Yang, T\. Wu, S\. Patel, V\. Kulkarni, V\. Sundarapandiyan, A\. Zhang, A\. Le, Z\. Nasim, S\. Yalam, R\. Kasamsetty, S\. Samal, H\. Yang, D\. Sun, N\. Shah, A\. Saha, A\. Zhang, L\. Nguyen, L\. Nagumalli, K\. Wang, A\. Zhou, A\. Wu, J\. Luo, A\. Telluri, S\. Yue, A\. Wang, and D\. 
Hendrycks \(2025\) Humanity’s Last Exam\. arXiv\. External Links: 2501\.14249, [Document](https://dx.doi.org/10.48550/arXiv.2501.14249), [Link](http://arxiv.org/abs/2501.14249)\. Cited by: [§5](https://arxiv.org/html/2605.14164#S5.p10.1), [§5](https://arxiv.org/html/2605.14164#S5.p6.1), [§5](https://arxiv.org/html/2605.14164#S5.p7.1)\.
- I\. D\. Raji, E\. M\. Bender, A\. Paullada, E\. Denton, and A\. Hanna \(2021\) AI and the everything in the whole wide world benchmark\. arXiv preprint arXiv:2111\.15366\. Cited by: [§1](https://arxiv.org/html/2605.14164#S1.p1.1), [§2\.3](https://arxiv.org/html/2605.14164#S2.SS3.p1.1), [§2](https://arxiv.org/html/2605.14164#S2.p1.1)\.
- D\. Rein, B\. L\. Hou, A\. C\. Stickland, J\. Petty, R\. Y\. Pang, J\. Dirani, J\. Michael, and S\. R\. Bowman \(2023\) GPQA: A Graduate\-Level Google\-Proof Q&A Benchmark\. arXiv\. External Links: 2311\.12022, [Document](https://dx.doi.org/10.48550/arXiv.2311.12022), [Link](http://arxiv.org/abs/2311.12022)\. Cited by: [§5](https://arxiv.org/html/2605.14164#S5.p10.1), [§5](https://arxiv.org/html/2605.14164#S5.p6.1), [§5](https://arxiv.org/html/2605.14164#S5.p7.1)\.
- K\. Roose \(2025\) When A\.I\. passes this test, look out\. New York Times\. External Links: [Link](https://www.nytimes.com/2025/01/23/technology/ai-test-humanitys-last-exam.html)\. Cited by: [§5](https://arxiv.org/html/2605.14164#S5.p10.1)\.
- E\. S\. Salido, J\. Gonzalo, and G\. Marco \(2025\) None of the others: a general technique to distinguish reasoning from memorization in multiple\-choice LLM evaluation benchmarks\. arXiv preprint arXiv:2502\.12896\. Cited by: [§2\.3](https://arxiv.org/html/2605.14164#S2.SS3.p1.1)\.
- D\. Sculley, J\. Snoek, A\. Wiltschko, and A\. Rahimi \(2018\) Winner’s curse? On pace, progress, and empirical rigor\. External Links: [Link](https://openreview.net/forum?id=rJWF0Fywf)\. Cited by: [§2](https://arxiv.org/html/2605.14164#S2.p1.1)\.
- H\. Semmelrock, T\. Ross\-Hellauer, S\. Kopeinik, D\. Theiler, A\. Haberl, S\. Thalmann, and D\. Kowald \(2025\) Reproducibility in machine\-learning\-based research: overview, barriers, and drivers\. AI Magazine 46\(2\), pp\. e70002\. Cited by: [footnote 2](https://arxiv.org/html/2605.14164#footnote2)\.
- A\. Srivastava, A\. Rastogi, A\. Rao, A\. A\. M\. Shoeb, A\. Abid, A\. Fisch, A\. R\. Brown, A\. Santoro, A\. Gupta, A\. Garriga\-Alonso, et al\. \(2023\) Beyond the imitation game: quantifying and extrapolating the capabilities of language models\. Transactions on Machine Learning Research\. Cited by: [§2\.1](https://arxiv.org/html/2605.14164#S2.SS1.p1.1)\.
- M\. Strathern \(1997\) ‘Improving ratings’: audit in the British university system\. European Review 5\(3\), pp\. 305–321\. Cited by: [§2\.1](https://arxiv.org/html/2605.14164#S2.SS1.p1.1)\.
- S\. Thais \(2024\) Misrepresented technological solutions in imagined futures: the origins and dangers of AI hype in the research community\. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, Vol\. 7, pp\. 1455–1465\. Cited by: [§2\.4](https://arxiv.org/html/2605.14164#S2.SS4.p1.1)\.
- A\. Wan, K\. Klyman, S\. Kapoor, N\. Maslej, S\. Longpre, B\. Xiong, P\. Liang, and R\. Bommasani \(2025\) The 2025 foundation model transparency index\. arXiv preprint arXiv:2512\.10169\. Cited by: [§1](https://arxiv.org/html/2605.14164#S1.p1.1)\.
- A\. Wang, A\. Hertzmann, and O\. Russakovsky \(2024a\) Benchmark suites instead of leaderboards for evaluating AI fairness\. Patterns 5\(11\)\. Cited by: [§1](https://arxiv.org/html/2605.14164#S1.p2.1), [§2](https://arxiv.org/html/2605.14164#S2.p1.1)\.
- Y\. Wang, X\. Ma, G\. Zhang, Y\. Ni, A\. Chandra, S\. Guo, W\. Ren, A\. Arulraj, X\. He, Z\. Jiang, T\. Li, M\. Ku, K\. Wang, A\. Zhuang, R\. Fan, X\. Yue, and W\. Chen \(2024b\) MMLU\-Pro: A More Robust and Challenging Multi\-Task Language Understanding Benchmark\. arXiv\. External Links: 2406\.01574, [Document](https://dx.doi.org/10.48550/arXiv.2406.01574), [Link](http://arxiv.org/abs/2406.01574)\. Cited by: [§5](https://arxiv.org/html/2605.14164#S5.p5.1), [§5](https://arxiv.org/html/2605.14164#S5.p7.1)\.
- L\. Weidinger, I\. D\. Raji, H\. Wallach, M\. Mitchell, A\. Wang, O\. Salaudeen, R\. Bommasani, D\. Ganguli, S\. Koyejo, and W\. Isaac \(2025\) Toward an evaluation science for generative AI systems\. arXiv preprint arXiv:2503\.05336\. Cited by: [§2\.4](https://arxiv.org/html/2605.14164#S2.SS4.p1.1)\.
- A\. White \(2025\) About 30% of Humanity’s Last Exam chemistry/biology answers are likely wrong\. FutureHouse\. External Links: [Link](https://www.futurehouse.org/research-announcements/hle-exam)\. Cited by: [§6](https://arxiv.org/html/2605.14164#S6.p2.1)\.
- C\. Xie, Y\. Huang, C\. Zhang, D\. Yu, X\. Chen, B\. Y\. Lin, B\. Li, B\. Ghazi, and R\. Kumar \(2024\) On memorization of large language models in logical reasoning\. arXiv preprint arXiv:2410\.23123\. Cited by: [§2\.3](https://arxiv.org/html/2605.14164#S2.SS3.p1.1)\.
- C\. Xu, S\. Guan, D\. Greene, M\. Kechadi, et al\. \(2024\) Benchmark data contamination of large language models: a survey\. arXiv preprint arXiv:2406\.04244\. Cited by: [§2\.2](https://arxiv.org/html/2605.14164#S2.SS2.p1.1)\.
- X\. Yue, Y\. Ni, K\. Zhang, T\. Zheng, R\. Liu, G\. Zhang, S\. Stevens, D\. Jiang, W\. Ren, Y\. Sun, C\. Wei, B\. Yu, R\. Yuan, R\. Sun, M\. Yin, B\. Zheng, Z\. Yang, Y\. Liu, W\. Huang, H\. Sun, Y\. Su, and W\. Chen \(2024\) MMMU: A Massive Multi\-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI\. arXiv\. External Links: 2311\.16502, [Document](https://dx.doi.org/10.48550/arXiv.2311.16502), [Link](http://arxiv.org/abs/2311.16502)\. Cited by: [§5](https://arxiv.org/html/2605.14164#S5.p5.1), [§5](https://arxiv.org/html/2605.14164#S5.p7.1), [§5](https://arxiv.org/html/2605.14164#S5.p8.1)\.
- H\. Zhang, J\. Da, D\. Lee, V\. Robinson, C\. Wu, W\. Song, T\. Zhao, P\. Raja, C\. Zhuang, D\. Slack, et al\. \(2024\) A careful examination of large language model performance on grade school arithmetic\. Advances in Neural Information Processing Systems 37, pp\. 46819–46836\. Cited by: [§2\.2](https://arxiv.org/html/2605.14164#S2.SS2.p1.1)\.
- H\. Zhou, H\. Huang, Z\. Zhao, L\. Han, H\. Wang, K\. Chen, M\. Yang, W\. Bao, J\. Dong, B\. Xu, et al\. \(2025a\) Lost in benchmarks? Rethinking large language model benchmarking with item response theory\. arXiv preprint arXiv:2505\.15055\. Cited by: [§2\.1](https://arxiv.org/html/2605.14164#S2.SS1.p1.1)\.
- K\. Z\. Zhou, J\. E\. Chen, X\. Zheng, Y\. Qian, Y\. Xiao, and K\. Shu \(2025b\) “Everyone else does it”: the rise of preprinting culture in computing disciplines\. arXiv preprint arXiv:2511\.04081\. Cited by: [footnote 2](https://arxiv.org/html/2605.14164#footnote2)\.

## Appendix A *Benchmarking\-Cultures\-25* Data Structure

This section describes the complete core data structure of our *Benchmarking\-Cultures\-25* dataset, which consists of seven data frames with a total of 44 data fields: Models \(17\), Benchmarks \(6\), Highlights \(4\), Affiliations \(6\), Categories \(3\), Categorizations \(2\), and Knowledge Subjects \(6\)\. The dataset also includes the derived data and figures referenced in this paper, together with the code that produces them\. The dataset is available at [https://hf\.co/datasets/matybohacek/benchmarking\-cultures\-25](https://hf.co/datasets/matybohacek/benchmarking-cultures-25)\.

### A\.1\. Models

| Field | Description |
| --- | --- |
| model\_id | Unique identifier for the model \(slug\)\. |
| model\_name | The display name of the model\. |
| model\_family | The name of the model family, e\.g\. Gemini or DeepSeek\. |
| model\_version | The version of the model, e\.g\. 2\.5 or V3\.1\. |
| model\_variant | The variant of the model, e\.g\. Flash or Terminus\. |
| model\_subvariant | A subvariant of the model, e\.g\. Lite\. |
| model\_is\_base | A flag indicating if the model is a base model\. |
| model\_total\_parameters | The number of total parameters of the model\. |
| model\_active\_parameters | The number of active parameters of the model\. |
| model\_href | URL to the model’s press release or blog post\. |
| model\_published\_at | The date the model was released\. |
| model\_access | The access level of the model\. Options: Closed, Open\-Weight or Open\-Source\. |
| model\_has\_highlight | A flag indicating if the model has any benchmark highlights in its release announcement\. |
| organization\_name | The name of the organization releasing this model\. |
| organization\_sector | The sector of the organization\. Options: Industry, Academia or Non\-Profit\. |
| organization\_country | The country of origin of this organization\. |
| organization\_domain | The domain of influence this organization belongs to\. Options: China or West\. |

### A\.2\. Benchmarks

| Field | Description |
| --- | --- |
| benchmark\_id | Unique identifier for the benchmark \(slug\)\. |
| benchmark\_name | The display name of the benchmark\. |
| paper\_id | Unique identifier for the paper announcing the benchmark \(arXiv ID or custom slug\)\. |
| paper\_href | URL to the paper announcing the benchmark\. |
| paper\_published\_at | The date the paper was published\. This was taken as the benchmark release date \(version 1 if more than one was provided\)\. |

### A\.3\. Highlights

| Field | Description |
| --- | --- |
| benchmark\_id | Unique identifier for the benchmark \(slug\)\. |
| model\_id | Unique identifier for the model \(slug\)\. |
| prescribed\_competency | The competency that model builders prescribe to this benchmark for that model release\. This field remains empty if the model builder didn’t assign a competency but highlighted the benchmark anyway\. |
| prescribed\_category | A generalized categorization of the prescribed\_competency\. |
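
The Highlights frame links Models and Benchmarks through model\_id and benchmark\_id. As a minimal sketch of how these keys could be combined, the snippet below joins toy Highlights rows to toy Models rows and counts how many distinct organizations highlight each benchmark. All rows and identifiers (model-a, bench-x, Org A, and so on) are invented for illustration and are not taken from the dataset; only the field names follow the tables above.

```python
from collections import defaultdict

# Toy rows that mimic the Models and Highlights frames (values are invented).
models = [
    {"model_id": "model-a", "organization_name": "Org A"},
    {"model_id": "model-b", "organization_name": "Org A"},
    {"model_id": "model-c", "organization_name": "Org B"},
]
highlights = [
    {"benchmark_id": "bench-x", "model_id": "model-a"},
    {"benchmark_id": "bench-x", "model_id": "model-c"},
    {"benchmark_id": "bench-y", "model_id": "model-a"},
    {"benchmark_id": "bench-y", "model_id": "model-b"},
    {"benchmark_id": "bench-z", "model_id": "model-c"},
]


def builders_per_benchmark(models, highlights):
    """Map each benchmark_id to the set of organizations that highlighted it."""
    org_by_model = {m["model_id"]: m["organization_name"] for m in models}
    builders = defaultdict(set)
    for row in highlights:
        builders[row["benchmark_id"]].add(org_by_model[row["model_id"]])
    return dict(builders)


def single_builder_share(builders):
    """Fraction of benchmarks highlighted by exactly one organization."""
    single = sum(1 for orgs in builders.values() if len(orgs) == 1)
    return single / len(builders)


builders = builders_per_benchmark(models, highlights)
share = single_builder_share(builders)
print(f"{share:.1%} of toy benchmarks are highlighted by a single builder")
```

On the real frames, the same join would presumably underlie statistics such as the share of benchmarks used by a single builder, although the exact procedure used in the paper may differ.
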
### A\.4\. Affiliations

| Field | Description |
| --- | --- |
| paper\_id | Unique identifier for the release paper \(arXiv ID or custom slug\)\. |
| author\_name | The name of an author for this paper\. |
| organization\_name | The name of the organization affiliated with the author\. |
| organization\_sector | The sector of the organization\. Options: Industry, Academia or Non\-Profit\. |
| organization\_country | The country of origin of this organization\. |
| organization\_domain | The domain of influence this organization belongs to\. Options: China or West\. |

### A\.5\. Categories

| Field | Description |
| --- | --- |
| benchmark\_category | Granular functional classification\. Options: Audio\-visual pattern recognition, Audio\-visual understanding, Coding, Commonsense, Embodied spatial understanding, Factuality, Foundational skills, Generic, Health, Instruction following, Instruction retention, Long\-context, Math, Multilingual performance, Multimodal generation, Reasoning and knowledge, Rule adherence, Semantic search, Strategic planning, Tool orchestration, Translation or Writing style\. |
| benchmark\_meta\_category | High\-level classification of the benchmark\. Options: Agentic task execution, Formalized comprehension & reasoning, Information retrieval, Multilingual capabilities, Multimodal processing, Preference\-Alignment, Self\-contained foundational capabilities, Unstructured comprehension & reasoning\. |
| benchmark\_category\_definition | A description of the meaning for the category\. |

### A\.6\. Categorizations

| Field | Description |
| --- | --- |
| benchmark\_id | Unique identifier for the benchmark \(slug\)\. |
| benchmark\_category | Granular functional classification\. Options: Audio\-visual pattern recognition, Audio\-visual understanding, Coding, Commonsense, Embodied spatial understanding, Factuality, Foundational skills, Generic, Health, Instruction following, Instruction retention, Long\-context, Math, Multilingual performance, Multimodal generation, Reasoning and knowledge, Rule adherence, Semantic search, Strategic planning, Tool orchestration, Translation or Writing style\. |

### A\.7\. Knowledge Subjects

| Field | Description |
| --- | --- |
| benchmark\_id | Unique identifier for the benchmark \(slug\)\. |
| subject | The subject as it was named in the benchmark data set\. |
| field | A mapping of the subject to a field\. Options: Art & Design, Business, Health & Medicine, Humanities & Social Sciences, Law, Science, Tech & Engineering or nil\. |
| science\_discipline | A mapping of the science field to a concrete discipline\. |
| n | The number of questions related to this subject in the benchmark\. |
| p | The percentage of questions related to this subject in the benchmark\. |
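
To illustrate how the n and p fields relate, the following sketch derives p from n and aggregates subjects into fields for a single benchmark. It assumes that p is computed relative to the sum of n over all subjects of the same benchmark, which the table above does not state explicitly. The benchmark, subjects, and counts are invented for illustration.

```python
from collections import defaultdict

# Toy Knowledge Subjects rows for one hypothetical benchmark (counts are invented).
rows = [
    {"benchmark_id": "bench-x", "subject": "Algebra", "field": "Science", "n": 50},
    {"benchmark_id": "bench-x", "subject": "Physics", "field": "Science", "n": 30},
    {"benchmark_id": "bench-x", "subject": "Contract law", "field": "Law", "n": 20},
]


def add_percentages(rows):
    """Return copies of rows with p = 100 * n / (total n of the same benchmark).

    Assumption: the denominator is the sum of n over the benchmark's subjects.
    """
    totals = defaultdict(int)
    for row in rows:
        totals[row["benchmark_id"]] += row["n"]
    return [{**row, "p": 100 * row["n"] / totals[row["benchmark_id"]]} for row in rows]


def field_shares(rows_with_p, benchmark_id):
    """Sum subject-level percentages into field-level percentages for one benchmark."""
    shares = defaultdict(float)
    for row in rows_with_p:
        if row["benchmark_id"] == benchmark_id:
            shares[row["field"]] += row["p"]
    return dict(shares)


with_p = add_percentages(rows)
shares = field_shares(with_p, "bench-x")
print(shares)
```

An aggregation of this kind is one way to check how strongly a benchmark leans toward a particular field, for instance when examining the claim that broad knowledge benchmarks mostly evaluate STEM subjects.
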

## Appendix B Unified Benchmark Taxonomy

| Meta\-Category | Category | Definition |
| --- | --- | --- |
| General knowledge application | Reasoning and knowledge | Knowledge retrieval or “reasoning” in the sense of solving complex logical problems that ideally are “non\-searchable\.” |
| General knowledge application | Commonsense | Knowledge and reasoning applied to everyday scenarios rather than specialized domains\. |
| Information retrieval | Factuality | Testing model knowledge on direct, verifiable facts \(e\.g\., “What’s the capital of France?”\) and ability to avoid hallucinations\. |
| Information retrieval | Long\-context | Correctly retrieving information from context \(e\.g\., “Add a paragraph to the poem I asked you to write 10 queries earlier”\)\. |
| Information retrieval | Semantic search | Tests embedding mechanisms \(classifying text based on meaning\)\. Only used when the benchmark explicitly evaluates this\. |
| Specialized knowledge application | Coding | Code generation, Self\-Repair, Code execution\. |
| Specialized knowledge application | Math | Text problems, visual math understanding, result evaluation, process evaluation\. |
| Multimodal processing | Audio\-visual pattern recognition | Simple recognition tasks, such as “recognize the letters in this image” or “count object XYZ\.” |
| Multimodal processing | Audio\-visual understanding | Interpretative questions about an image, audio, or video\. |
| Multimodal processing | Multimodal generation | Producing audio\-visual output \(audio, image, video\) based on a task\. |
| Multimodal processing | Embodied spatial understanding | Three\-dimensional orientation and spatial reasoning\. |
| Preference\-Alignment | Generic | Alignment with LLM\-judge preferences on an unspecific and broad range of subjects\. |
| Preference\-Alignment | Writing style | Model performance in writing style aligns with LLM\-judge preferences\. |
| Preference\-Alignment | Health | Alignment on health\-related questions for accuracy and safety \(e\.g\., symptom checking\)\. |
| Foundational capabilities | Instruction following | Explicit evaluation of whether the model correctly follows specific instructions\. |
| Foundational capabilities | Instruction retention | Ability to maintain state and remember constraints across a multi\-turn conversation\. |
| Foundational capabilities | Base model capabilities | Fundamental aspects of how well the model works as a language model, without targeting a specific downstream application\. |
| Agentic task execution | Tool orchestration | Checks if models use various tools and their outputs to solve tasks\. |
| Agentic task execution | Rule adherence | Checks if the model consistently uses tools in compliance with a rule set\. |
| Agentic task execution | Strategic planning | Tasks requiring the identification and execution of intermediate steps to achieve a goal \(Chain\-of\-thought, decomposition\)\. |
| Multilingual capabilities | Translation | Translating text or multimodal inputs\. |
| Multilingual capabilities | Multilingual performance | Evaluates model performance across languages in various tasks\. |
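
The taxonomy can be read as a lookup from category to meta\-category, which makes it possible to compare what a benchmark measures with what a model builder prescribes to it. The sketch below is a minimal illustration under two assumptions: each benchmark has exactly one measured category, and prescribed categories use the same labels as measured ones. The category pairs are copied from a subset of the taxonomy above, while the benchmark rows are invented.

```python
from collections import Counter

# Category -> meta-category pairs copied from a subset of the taxonomy above.
META_CATEGORY = {
    "Reasoning and knowledge": "General knowledge application",
    "Commonsense": "General knowledge application",
    "Factuality": "Information retrieval",
    "Coding": "Specialized knowledge application",
    "Math": "Specialized knowledge application",
}

# Toy rows (invented): what a benchmark measures vs. what a release prescribes.
# Assumption: one measured category per benchmark, shared label vocabulary.
categorizations = {"bench-x": "Math", "bench-y": "Reasoning and knowledge"}
highlights = [
    {"benchmark_id": "bench-x", "prescribed_category": "Math"},
    {"benchmark_id": "bench-x", "prescribed_category": "Reasoning and knowledge"},
    {"benchmark_id": "bench-y", "prescribed_category": "Reasoning and knowledge"},
]


def cross_tabulate(categorizations, highlights):
    """Count (measured category, prescribed category) pairs across highlights."""
    counts = Counter()
    for row in highlights:
        measured = categorizations[row["benchmark_id"]]
        counts[(measured, row["prescribed_category"])] += 1
    return counts


def mismatch_rate(counts):
    """Share of highlights whose prescribed category differs from the measured one."""
    total = sum(counts.values())
    mismatched = sum(
        c for (measured, prescribed), c in counts.items() if measured != prescribed
    )
    return mismatched / total


counts = cross_tabulate(categorizations, highlights)
rate = mismatch_rate(counts)
print(counts)
print(META_CATEGORY["Math"], rate)
```

The resulting pair counts have the same shape as a measured\-versus\-prescribed heatmap, with one cell per category pair.
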
## Appendix C Full Tables and Figures

Table 10. Publication Years of Benchmarks within Tested Competencies. Looking at the benchmarks released in 2023, 2024, and 2025, we map the number of benchmarks released per year within a tested competency.

![Refer to caption](https://arxiv.org/html/2605.14164v1/fig/highlighted-competencies-by-month-facets.png)

Figure 5. Highlights of Competencies by Model Builders. This graph shows the trend of these selected competencies being highlighted in model releases.
## Appendix D *Bench Cultures* Tool Screenshots

![Refer to caption](https://arxiv.org/html/2605.14164v1/fig/bench_cultures_screenshots/screen_0.png)

Figure 6. Benchmarks View. Ordered by rank, each benchmark record presents its date of publication, assigned categories and models, affiliation distribution, and a paper link.

![Refer to caption](https://arxiv.org/html/2605.14164v1/fig/bench_cultures_screenshots/screen_3.png)

Figure 7. Benchmarks Visualization. Pictured above is a lollipop chart comparing the affiliation of benchmark creators by year, opened from the Benchmarks View.

![Refer to caption](https://arxiv.org/html/2605.14164v1/fig/bench_cultures_screenshots/screen_1.png)

Figure 8. Models View. Pictured above is the Models View filtered by MMLU-Pro usage. Each model record presents its date of publication, publisher, access policy, affiliation sector, model parameters (if available), domain, and the announcement link.

![Refer to caption](https://arxiv.org/html/2605.14164v1/fig/bench_cultures_screenshots/screen_4.png)

Figure 9. Models Visualization. Pictured above is a grouped bar chart of model access and publisher domain statistics filtered by model publisher sector (Industry), opened from the Models View.

![Refer to caption](https://arxiv.org/html/2605.14164v1/fig/bench_cultures_screenshots/screen_2.png)

Figure 10. Competencies View. The list contains all tested competencies within our custom taxonomy. Each taxonomy record presents the connected benchmarks, models, and prescribed categories, as well as the definition.

![Refer to caption](https://arxiv.org/html/2605.14164v1/fig/bench_cultures_screenshots/screen_5.png)

Figure 11. Competencies Visualization. Pictured above is a heatmap chart comparing the competencies that benchmarks are measuring vs. the competencies that model builders prescribe to them, opened from the Competencies View.
