SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models
Summary
SciCustom is a framework for constructing custom scientific benchmarks from large-scale data, enabling fine-grained evaluation of LLMs' scientific capabilities without expert annotation. It uses ontology-grounded knowledge units and voting-based consensus to select relevant benchmarks, demonstrated in chemistry and healthcare.
# SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models
Source: [https://arxiv.org/html/2605.19357](https://arxiv.org/html/2605.19357)
Yiyang Gu1, Junwei Yang1∗, Junyu Luo1, Ye Yuan1, Bin Feng3, Yingce Xia2, Shufang Xie2, Kaili Liu4, Bohan Wu1, Qi Shi5, Haoran Li5, Beier Xiao5, Zhiping Xiao6, Xiao Luo7, Weizhi Zhang8, Philip S. Yu8, Zequn Liu2†, Ming Zhang1†

1 State Key Laboratory for Multimedia Information Processing, School of Computer Science, PKU-Anker LLM Lab, Peking University; 2 Zhongguancun Academy; 3 IDEA; 4 Xidian University; 5 Peking University; 6 University of Washington; 7 University of Wisconsin–Madison; 8 University of Illinois Chicago

{yiyanggu,yjwtheonly,mzhang_cs}@pku.edu.cn, patxiao@uw.edu, liuzequn@bza.edu.cn
###### Abstract
Large language models (LLMs) are increasingly applied to scientific research, yet existing evaluations often fail to reflect the fine-grained capabilities required in practice. Most benchmarks are manually curated or domain-generic, limiting scalability and alignment with real scientific use cases. In this paper, we propose a new framework named SciCustom to address the problem. It enables the custom construction of benchmarks from large-scale scientific data to evaluate application-specific scientific capabilities in LLMs. SciCustom first organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity and trains a tagger to map large-scale data instances into this knowledge space. Given a custom requirement, relevant knowledge units are identified via voting-based multi-model consensus. These units enable relevance-aware benchmark retrieval via binary search, followed by proxy subset selection and data-grounded benchmark generation for efficient evaluation. Experiments in chemistry and healthcare demonstrate that SciCustom reveals fine-grained differences in LLM scientific capabilities that standard benchmarks overlook, while requiring neither expert annotation nor synthetic question generation. This work provides a scalable and application-aware foundation for benchmarking scientific capabilities in LLMs. The source code is available at [https://github.com/yjwtheonly/SciCustom](https://github.com/yjwtheonly/SciCustom).
∗ Equal contribution with order determined by flipping a coin. † Corresponding authors.
## 1 Introduction
Figure 1: Illustrations of SciCustom. (a) Comparison between traditional off-the-shelf benchmarking and our ontology-driven framework. (b, c) Evaluation of 10 LLMs on different benchmarks, where each dot represents a model. Targeting specific capabilities in Technical Chemistry, (b) the general scientific benchmark (GPQA Diamond) aligns poorly with expert ground truth, whereas (c) the benchmark constructed by SciCustom demonstrates strong alignment.

With the rapid advancement of Large Language Models (LLMs), their applications have broadened significantly across the scientific landscape Chang et al. ([2024](https://arxiv.org/html/2605.19357#bib.bib69)); Bommasani et al. ([2021](https://arxiv.org/html/2605.19357#bib.bib54)); Birhane et al. ([2023](https://arxiv.org/html/2605.19357#bib.bib55)); Luo et al. ([2025](https://arxiv.org/html/2605.19357#bib.bib50)); Yuan et al. ([2024](https://arxiv.org/html/2605.19357#bib.bib46)); Liu et al. ([2023](https://arxiv.org/html/2605.19357#bib.bib79)); Xia et al. ([2025](https://arxiv.org/html/2605.19357#bib.bib80)). For example, an LLM-based system successfully proposed a novel mechanism of gene transfer in bacteria, mirroring conclusions that took years of experimental validation Gottweis et al. ([2025](https://arxiv.org/html/2605.19357#bib.bib67)); Penadés et al. ([2025](https://arxiv.org/html/2605.19357#bib.bib68)). However, as scientists increasingly seek to leverage LLMs for specific scientific applications, a critical challenge emerges: the performance of a given LLM within a particular scientific context remains largely uncertain and difficult to evaluate Cai et al. ([2025](https://arxiv.org/html/2605.19357#bib.bib71)); Singhal et al. ([2025](https://arxiv.org/html/2605.19357#bib.bib57)); Miret and Krishnan ([2025](https://arxiv.org/html/2605.19357#bib.bib58)); Bedi et al. ([2025](https://arxiv.org/html/2605.19357#bib.bib65)).
While numerous benchmarks exist to evaluate various aspects of model capability, they frequently fail to reflect the requirements of specialized usage Anjum et al. ([2025](https://arxiv.org/html/2605.19357#bib.bib70)); Liang et al. ([2023](https://arxiv.org/html/2605.19357#bib.bib56)); Bandel et al. ([2024](https://arxiv.org/html/2605.19357#bib.bib66)). Empirical observations indicate that performance on widely used benchmarks often diverges from that in specific scientific tasks (Figure [1](https://arxiv.org/html/2605.19357#S1.F1)b), necessitating the customization of benchmarks tailored to particular scientific applications Singhal et al. ([2025](https://arxiv.org/html/2605.19357#bib.bib57)); Miret and Krishnan ([2025](https://arxiv.org/html/2605.19357#bib.bib58)). Given the vast and ever-expanding range of scientific applications, manual curation of benchmarks for each emerging use case is impractical Huang et al. ([2021](https://arxiv.org/html/2605.19357#bib.bib60)); Wu et al. ([2018](https://arxiv.org/html/2605.19357#bib.bib61)).
A plausible solution is to automate the customized benchmark construction process Farchi et al. ([2024](https://arxiv.org/html/2605.19357#bib.bib72)); Li et al. ([2025b](https://arxiv.org/html/2605.19357#bib.bib32)); Chou et al. ([2025](https://arxiv.org/html/2605.19357#bib.bib8)); Wang et al. ([2025b](https://arxiv.org/html/2605.19357#bib.bib9)). However, developing such an automated framework presents two challenges. First, scientific applications are inherently complex and highly interdisciplinary. A single application often requires knowledge from multiple subfields (e.g., drug discovery intertwines organic chemistry, molecular biology, and pharmacology Lu et al. ([2025](https://arxiv.org/html/2605.19357#bib.bib73))), while distinct applications frequently share underlying knowledge (e.g., both drug discovery and clinical decision-making require knowledge of pharmacology Song et al. ([2025](https://arxiv.org/html/2605.19357#bib.bib7)); Bi et al. ([2025](https://arxiv.org/html/2605.19357#bib.bib6)); Ong et al. ([2025](https://arxiv.org/html/2605.19357#bib.bib74))). Consequently, constructing benchmarks from scratch for each specific scenario results in massive redundant labor and limits the framework's scalability Huang et al. ([2021](https://arxiv.org/html/2605.19357#bib.bib60)); Li et al. ([2025a](https://arxiv.org/html/2605.19357#bib.bib5)). Second, scientific evaluation requires grounded validity. High-quality evaluation data are frequently derived from costly wet-lab experiments or computational simulations Chen et al. ([2024](https://arxiv.org/html/2605.19357#bib.bib4)); Ramos et al. ([2025](https://arxiv.org/html/2605.19357#bib.bib3)), rendering simple general-purpose data synthesis methods infeasible Parker ([2020](https://arxiv.org/html/2605.19357#bib.bib77)); Wei et al. ([2024b](https://arxiv.org/html/2605.19357#bib.bib1)); Chou et al. ([2025](https://arxiv.org/html/2605.19357#bib.bib8)); Xu et al. ([2025](https://arxiv.org/html/2605.19357#bib.bib2)).
To this end, we introduce SciCustom, a framework for custom evaluation of scientific capabilities in LLMs (Figure [1](https://arxiv.org/html/2605.19357#S1.F1)a,c). The key idea is that a complex scientific application can be approximated by a composition of fine-grained knowledge units. To construct these units, we select 642 representative concepts from a comprehensive scientific ontology to serve as unified knowledge units. Leveraging the rich hierarchical information within the ontology, which links specific scientific concepts to knowledge units, we train a tagger to automatically align real-world scientific corpora with the corresponding knowledge units. This offline process performs a once-for-all population of each knowledge unit with grounded scientific data. When a user presents an evaluation requirement, SciCustom identifies the relevant knowledge units and orchestrates a hierarchical retrieval process to construct a tailored benchmark. The combinability of these pre-constructed knowledge units enables dynamic adaptation to new requirements via efficient reuse, ensuring scalability for diverse downstream applications.
Experimental results in chemistry and healthcare domains demonstrate that our automatically constructed benchmarks closely align with expert\-curated benchmarks\. Furthermore, for pericyclic reaction, a novel scientific application lacking prior benchmarks, our framework successfully generates a high\-quality benchmark, providing a scalable and reliable solution for assessing LLMs in the ever\-expanding landscape of scientific research\.
Our contributions can be summarized as follows:
- We propose SciCustom, a framework that models scientific applications as compositions of reusable knowledge units to automate benchmark construction, enabling efficient adaptation to diverse downstream tasks.
- We develop a tagger that leverages ontology structures to map large-scale, real-world scientific corpora into knowledge units, ensuring that evaluations are grounded in verifiable data.
- Experiments in chemistry and healthcare confirm SciCustom's strong alignment with expert-curated benchmarks and its capability to evaluate novel scientific contexts where no prior benchmarks exist.
## 2 Methodology
Figure 2: Framework of SciCustom. It consists of an offline phase where scientific data is indexed into ontology-grounded knowledge units via a trained tagger, and an online phase where user requirements are parsed by multi-model voting to identify relevant tags. These tags guide the binary search-based selection and proxy selection of data. The problem set of the benchmark is generated based on these data.
### 2.1 Overview
We study the problem of customized scientific capability evaluation for LLMs. Given an evaluation requirement $r$ and a large corpus of scientific data $\mathcal{D}$, the goal is to construct a benchmark $\mathcal{B}_r$ that reflects the capabilities required by $r$ without expert annotation.
SciCustom is an automated framework for this problem, operating through offline data indexing and online data-grounded generation phases. In the offline phase, SciCustom represents large-scale scientific data within shared knowledge units using a tagger. This phase organizes scattered scientific content into reusable units. In the online phase, SciCustom first employs a voting-based multi-model consensus to identify knowledge units relevant to the user requirement $r$. Guided by these units, the framework performs a hierarchical data selection process: retrieving candidate data, filtering via binary search, and extracting a representative proxy subset to generate final questions for efficient evaluation. An overview of SciCustom is depicted in Figure [2](https://arxiv.org/html/2605.19357#S2.F2).
### 2.2 Ontology-Grounded Knowledge Units
To enable fine-grained yet reusable modeling of scientific knowledge, SciCustom is built upon a structured scientific ontology that serves as a unified semantic backbone for both knowledge definition and data organization.
Following Liu et al. ([2021](https://arxiv.org/html/2605.19357#bib.bib25)), we integrate multiple authoritative repositories to construct a large-scale scientific ontology, organized into a collection of directed acyclic graphs (DAGs) spanning 227 scientific subdisciplines. Within each DAG, concepts are interconnected via *is-a* relations, where descendant nodes represent finer-grained specializations of their parent concepts (e.g., Organic Chemistry is a descendant node of Chemistry).
We select concepts from this ontology as knowledge units. The knowledge units should be neither overly coarse nor excessively specific. Overly coarse concepts lack the resolution to distinguish fine-grained model capabilities, while excessively specific ones hinder the effective reuse of knowledge units for broader applications. We empirically select concepts at a granularity comparable to "textbook chapter titles". We therefore capture these knowledge units via a depth-first traversal over each ontology DAG $\mathcal{G}_i$. At each node $v$, an LLM evaluates whether the ontology term aligns with the granularity criteria based on the term name, illustrative examples, and its prior knowledge of scientific taxonomy.
Algorithm 1: Ontology-Guided Knowledge Unit Selection
Input: Ontology $\{\mathcal{G}_i\}$. Output: Units $\mathcal{T}$.
1: $\mathcal{T} \leftarrow \emptyset$
2: for each node $v$ visited by DFS in $\{\mathcal{G}_i\}$ do
3:   if $|\mathrm{Desc}(v)| < 10$ then
4:     Backtrack
5:   end if
6:   $label \leftarrow$ LLM classifies $v$ as coarse, moderate, or fine
7:   if $label$ is coarse then
8:     Continue traversal (recurse into children)
9:   else if $label$ is moderate then
10:     $\mathcal{T} \leftarrow \mathcal{T} \cup \{v\}$
11:   else
12:     Backtrack {Prune branch if fine}
13:   end if
14: end for
15: return $\mathcal{T}$
Figure 3: Illustration of the synthetic data construction pipeline for tagger training. We sample knowledge units (green circles) and extract descendant keywords (blue squares) from the ontology. An LLM then generates a natural language query based on these keywords, creating a labeled training instance.

The overall procedure is summarized in Algorithm [1](https://arxiv.org/html/2605.19357#alg1), and the detailed prompting strategy is provided in Appendix [G](https://arxiv.org/html/2605.19357#A7).
Through this procedure, we obtain 641 scientific knowledge units. We additionally introduce a dedicated Non-Scientific unit, resulting in a total of 642 knowledge units. These units define the ontology-grounded scientific knowledge space that underpins all subsequent benchmark construction in SciCustom.
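The traversal in Algorithm 1 can be sketched in a few lines of Python. This is an illustrative reading, not the authors' implementation: the dictionary-based graph, the `classify` callback standing in for the LLM granularity judgment, and the toy ontology are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Set

Graph = Dict[str, List[str]]  # parent concept -> child concepts (is-a edges)


def descendants(graph: Graph, node: str) -> Set[str]:
    """All concepts reachable from `node` through child edges (excluding `node`)."""
    seen: Set[str] = set()
    stack = list(graph.get(node, []))
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(graph.get(cur, []))
    return seen


def select_units(graph: Graph, roots: List[str],
                 classify: Callable[[str], str], min_desc: int = 10) -> List[str]:
    """Depth-first selection of 'moderate' concepts, following Algorithm 1."""
    units: List[str] = []
    visited: Set[str] = set()  # a DAG node may have several parents

    def dfs(v: str) -> None:
        if v in visited:
            return
        visited.add(v)
        if len(descendants(graph, v)) < min_desc:
            return  # too few descendants: backtrack
        label = classify(v)  # hypothetical stand-in for the LLM call
        if label == "coarse":
            for child in graph.get(v, []):
                dfs(child)
        elif label == "moderate":
            units.append(v)
        # label == "fine": prune this branch

    for root in roots:
        dfs(root)
    return units


# Toy ontology; a lookup table plays the role of the LLM classifier.
toy = {
    "Chemistry": ["Organic Chemistry", "Inorganic Chemistry"],
    "Organic Chemistry": ["Alkenes", "Pericyclic Reactions", "Amines"],
    "Inorganic Chemistry": ["Coordination Compounds"],
}
labels = {"Chemistry": "coarse", "Organic Chemistry": "moderate",
          "Inorganic Chemistry": "moderate"}
units = select_units(toy, ["Chemistry"], lambda v: labels.get(v, "fine"), min_desc=2)
print(units)
```

With the threshold lowered to 2 for the toy graph, "Inorganic Chemistry" is skipped because it has a single descendant, which mirrors the role of the $|\mathrm{Desc}(v)| < 10$ check.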
### 2.3 Mapping Data to Knowledge Units
To populate the identified knowledge units with grounded scientific data, we collect a large-scale corpus containing high-quality question-answering (QA) pairs across diverse scientific domains. The corpus comprises a total of $N$ data instances, each defined as $d = \langle q, a \rangle$, where $q$ denotes a natural language question and $a$ the corresponding reference answer. These instances are originally scattered, necessitating the development of a specialized scientific tagger to organize them into knowledge units.
To enable efficient large-scale annotation, we train a small language model as the tagger. The model maps each query $q$ to a subset of knowledge units, $\mathcal{T}_q \subseteq \mathcal{T}$. This design implements a once-for-all mapping between data and knowledge units during the construction phase, allowing the same annotated corpus to be reused across diverse evaluation requirements.
Supervised data for training the tagger is constructed using a two-stage strategy. In the first stage, we generate synthetic scientific queries via controlled compositional sampling over the knowledge units $\mathcal{T}$. Specifically, we randomly sample between 1 and 5 knowledge units to form a target set $\mathcal{T}_q \subset \mathcal{T}$. We fully leverage the rich hierarchical information of the ontology: for each selected unit $v \in \mathcal{T}_q$, we aggregate all its descendant nodes to form a representative keyword set $\mathcal{K}_v$ (e.g., Ventricular Septum is a keyword for the unit Anatomical Entities). We further sample keywords from each $\mathcal{K}_v$ and combine them to form a composite prompt for an LLM to generate a synthetic query $q$ (Figure [3](https://arxiv.org/html/2605.19357#S2.F3)). Each generated query is then paired with its corresponding set of source knowledge units, yielding a labeled training instance of the form $\langle q, \mathcal{T}_q \rangle$. Using this procedure, we construct synthetic scientific training instances covering complex compositional knowledge patterns. In the second stage, we complement the synthetic data with real-world scientific queries sampled from existing instruction-tuning datasets. These queries are annotated with their corresponding knowledge units by an LLM. In addition, we include non-scientific queries to model boundary cases between scientific and non-scientific content.
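The first-stage sampling can be illustrated with a small sketch. It only covers drawing the target units and keywords; the LLM call that turns keywords into a query is omitted. The function name, the `keywords_per_unit` parameter, and all keywords other than "Ventricular Septum" are hypothetical, since the text does not state how many keywords are drawn per unit.

```python
import random
from typing import Dict, List, Tuple


def sample_training_spec(keywords_by_unit: Dict[str, List[str]],
                         rng: random.Random,
                         max_units: int = 5,
                         keywords_per_unit: int = 2) -> Tuple[List[str], List[str]]:
    """Draw 1..max_units target units, then a few descendant keywords per unit."""
    all_units = sorted(keywords_by_unit)
    k = rng.randint(1, min(max_units, len(all_units)))
    target_units = rng.sample(all_units, k)
    keywords: List[str] = []
    for unit in target_units:
        pool = keywords_by_unit[unit]
        # keywords_per_unit is an assumed setting, not taken from the paper
        keywords.extend(rng.sample(pool, min(keywords_per_unit, len(pool))))
    return target_units, keywords


# Hypothetical keyword sets K_v (descendants of each unit in the ontology).
keyword_sets = {
    "Anatomical Entities": ["Ventricular Septum", "Aortic Valve", "Femur"],
    "Organic Chemistry": ["Diels-Alder Reaction", "Aldol Condensation"],
    "Virology": ["Capsid", "Reverse Transcriptase"],
}
target_units, keywords = sample_training_spec(keyword_sets, random.Random(0))
# The pair (generated query, target_units) would form one training instance.
print(target_units, keywords)
```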
By introducing a small and efficient tagger for annotating scientific queries, SciCustom achieves scalable and interpretable alignment between natural language queries and scientific knowledge, forming a key foundation for customized scientific evaluation.
### 2.4 Voting-Based Knowledge Unit Selection
With the large-scale scientific corpus successfully grounded to ontology-defined knowledge units via the tagger, we possess a structured knowledge base ready for dynamic retrieval. To construct a benchmark tailored to a specific user evaluation requirement $r$, the critical first step is to accurately identify the subset of knowledge units $\mathcal{T}_r \subset \mathcal{T}$ that underpins the requirement.
To identify the target knowledge units $\mathcal{T}_r$, we employ a multi-model consensus mechanism where a set of heterogeneous LLMs independently rank candidate units based on their relevance to $r$. We aggregate these rankings by calculating the average rank position for each unit across all models, and select the top-$K_1$ units with the lowest consensus ranks (indicating highest relevance) to form the target set $\mathcal{T}_r$.
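The rank aggregation step can be sketched as follows. The sketch assumes every model returns a full ordering of the same candidate units and breaks ties alphabetically; neither detail is specified in the text, and the unit names are placeholders.

```python
from typing import Dict, List


def average_ranks(rankings: List[List[str]]) -> Dict[str, float]:
    """Mean 1-based rank position of each unit across all model rankings."""
    totals: Dict[str, float] = {}
    for ranking in rankings:
        for position, unit in enumerate(ranking, start=1):
            totals[unit] = totals.get(unit, 0.0) + position
    return {unit: total / len(rankings) for unit, total in totals.items()}


def consensus_top_k(rankings: List[List[str]], k: int) -> List[str]:
    """Units with the k lowest average ranks (lowest rank = most relevant)."""
    avg = average_ranks(rankings)
    # Alphabetical tie-breaking is an assumption made for determinism.
    return sorted(avg, key=lambda unit: (avg[unit], unit))[:k]


# Three hypothetical models ranking three candidate units for a requirement r.
rankings = [
    ["A", "B", "C"],
    ["B", "A", "C"],
    ["A", "C", "B"],
]
print(consensus_top_k(rankings, 2))  # A averages 4/3, B averages 2, C averages 8/3
```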
### 2.5 Hierarchical Benchmark Generation
##### Binary Search\-Based Data Selection\.
With the requirement-relevant knowledge unit set $\mathcal{T}_r$ established, we aim to retrieve a dataset $\mathcal{D}_r$ from the massive corpus $\mathcal{D}$ that best matches $r$. For any data sample $\langle q, a \rangle \in \mathcal{D}$, let $\mathcal{T}_{r,q} = \mathcal{T}_r \cap \mathcal{T}_q$ represent the intersection between the data's inherent knowledge units and the query knowledge units.
Intuitively, we construct a candidate set $\mathcal{D}'_r = \{\langle q, a \rangle \in \mathcal{D} \mid |\mathcal{T}_{r,q}| \geq 1\}$. However, a simple intersection filter is insufficient to guarantee high semantic alignment. While employing LLMs to verify the relevance of each candidate against $r$ would ensure quality, applying this to the entire set is computationally prohibitive. To optimize API usage without compromising retrieval quality, we adopt a binary search-based strategy that relies on a pre-sorted candidate list. Specifically, we define an ordering where data sample $d_i = \langle q_i, a_i \rangle$ is ranked higher than $d_j = \langle q_j, a_j \rangle$ if:
1. $|\mathcal{T}_{r,q_i}| > |\mathcal{T}_{r,q_j}|$; or
2. $|\mathcal{T}_{r,q_i}| = |\mathcal{T}_{r,q_j}|$ and $\bar{\mathcal{R}}_r(d_i) < \bar{\mathcal{R}}_r(d_j)$,

where $\mathcal{T}_{r,q} = \mathcal{T}_r \cap \mathcal{T}_q$ denotes the matching knowledge units, and $\bar{\mathcal{R}}_r(d)$ represents the average rank of the knowledge units in $\mathcal{T}_{r,q}$ acquired in Sec. [2.4](https://arxiv.org/html/2605.19357#S2.SS4).
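The ordering above amounts to a two-part sort key: more matched units first, then lower mean consensus rank. A minimal sketch, assuming each corpus item is stored as a `(question, answer, tags)` tuple and that the consensus ranks are available as a dictionary (both representations are assumptions for illustration):

```python
from typing import Dict, List, Sequence, Tuple

Item = Tuple[str, str, Sequence[str]]  # (question, answer, tagged knowledge units)


def sort_candidates(corpus: List[Item], target_units: Sequence[str],
                    avg_rank: Dict[str, float]) -> List[Tuple[str, str]]:
    """Keep items sharing at least one unit with T_r and sort by the two rules."""
    target = set(target_units)
    scored = []
    for question, answer, tags in corpus:
        matched = target & set(tags)          # T_{r,q}
        if not matched:
            continue                          # not in the candidate set D'_r
        mean_rank = sum(avg_rank[u] for u in matched) / len(matched)
        scored.append((-len(matched), mean_rank, question, answer))
    # Rule 1: larger overlap first; rule 2: lower mean consensus rank first.
    scored.sort(key=lambda s: (s[0], s[1]))
    return [(question, answer) for _, _, question, answer in scored]


# Hypothetical consensus ranks and a four-item corpus.
avg_rank = {"A": 1.0, "B": 2.0, "C": 3.0}
corpus = [
    ("q1", "a1", ["A"]),
    ("q2", "a2", ["B", "C"]),
    ("q3", "a3", ["A", "B"]),
    ("q4", "a4", ["Z"]),
]
ordered = sort_candidates(corpus, ["A", "B", "C"], avg_rank)
print([q for q, _ in ordered])
```

Here `q3` and `q2` both match two units, and `q3` wins the tie because its matched units have the lower mean rank (1.5 versus 2.5); `q4` is dropped because it shares no unit with the target set.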
Algorithm 2: Binary Search for Relevant Benchmark Construction
Input: Sorted list $L_{\mathcal{D}'_r}$, Requirement $r$, Models $\mathcal{M}$. Output: Data $\mathcal{D}_r$.
1: $low \leftarrow 0,\ high \leftarrow |L_{\mathcal{D}'_r}| - 1,\ cutoff \leftarrow 0$
2: while $low \leq high$ do
3:   $mid \leftarrow low + \lfloor (high - low)/2 \rfloor$; $d_{mid} \leftarrow L_{\mathcal{D}'_r}[mid]$
4:   $vote \leftarrow \sum_{M_i \in \mathcal{M}} \mathbb{I}(M_i \text{ judges } d_{mid} \text{ relevant to } r)$
5:   if $vote > |\mathcal{M}|/2$ then
6:     $cutoff \leftarrow mid$; $low \leftarrow mid + 1$
7:   else
8:     $high \leftarrow mid - 1$
9:   end if
10: end while
11: return $\mathcal{D}_r = L_{\mathcal{D}'_r}[0 : cutoff]$
We assume that within the prioritized candidate list, the relevance to the requirement $r$ follows a generally monotonic decreasing trend. Under this assumption, we apply the binary search procedure to efficiently identify the cutoff point, defined as the last position where a majority of the LLM ensemble still judges the sample as relevant to $r$. This approach enables us to determine the maximal relevant data subset without linearly verifying every instance, reducing the number of positions that require oracle judgments to $O(\log |\mathcal{D}'_r|)$. The detailed procedure is outlined in Algorithm [2](https://arxiv.org/html/2605.19357#alg2).
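A runnable sketch of the cutoff search is given below. The judges are plain Python callables standing in for LLM relevance calls, which is an assumption for illustration. The sketch also stores the cutoff as a count (`mid + 1`) so that the item at the last relevant position is kept by the final slice; this is one reading of "last position", and the pseudocode's `[0 : cutoff]` may intend an inclusive range instead.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def majority_relevant(item: T, judges: Sequence[Callable[[T], bool]]) -> bool:
    """True when more than half of the judges vote the item relevant."""
    votes = sum(1 for judge in judges if judge(item))
    return votes > len(judges) / 2


def relevant_prefix(sorted_items: Sequence[T],
                    judges: Sequence[Callable[[T], bool]]) -> List[T]:
    """Binary search for the longest prefix whose last item is still relevant.

    Assumes relevance is (roughly) monotonically decreasing along the list.
    """
    low, high = 0, len(sorted_items) - 1
    keep = 0  # number of leading items to keep
    while low <= high:
        mid = low + (high - low) // 2
        if majority_relevant(sorted_items[mid], judges):
            keep = mid + 1      # position mid is relevant, so keep it too
            low = mid + 1
        else:
            high = mid - 1
    return list(sorted_items[:keep])


# Ten candidates, pre-sorted; three toy judges with different strictness.
items = list(range(10))
judges = [lambda x: x < 6, lambda x: x < 7, lambda x: x < 3]
selected = relevant_prefix(items, judges)
print(selected)
```

In this toy run the majority considers items 0 through 5 relevant, and only four positions (4, 7, 5, 6) are judged rather than all ten.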
##### Efficient Proxy Subset Selection\.
The constructed dataset $\mathcal{D}_r$ may still be excessively large (e.g., $|\mathcal{D}_r| > 200{,}000$ in our experiments). To enable efficient evaluation, we propose a selection strategy to extract a proxy subset $\mathcal{P}_r \subseteq \mathcal{D}_r$ such that $|\mathcal{P}_r| \leq K_2$, while maintaining the evaluative power of the full set.
For each data sample $\langle q, a \rangle \in \mathcal{D}_r$, we compute two commonly used intrinsic metrics Saranathan et al. ([2025](https://arxiv.org/html/2605.19357#bib.bib20)): a hardness score characterizing the difficulty and a quality score indicating the linguistic quality. Our objective is to select $\mathcal{P}_r$ such that the distributions of these scores closely approximate those of $\mathcal{D}_r$. We formulate this as minimizing the cumulative Wasserstein distance between the distributions of the subset and the full set. To solve the sampling of $\mathcal{P}_r$ efficiently, we adopt a cluster-based strategy following SubLIME Saranathan et al. ([2025](https://arxiv.org/html/2605.19357#bib.bib20)). Intuitively, we encode all pairs $d = \langle q, a \rangle$ into an embedding space using RoBERTa Liu et al. ([2019](https://arxiv.org/html/2605.19357#bib.bib21)). We then perform clustering in the embedding space and sample representatives from each cluster. Detailed information is in Appendix [A](https://arxiv.org/html/2605.19357#A1).
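To make the objective concrete, the sketch below computes the 1-D Wasserstein-1 distance between two empirical score distributions and sums it over the hardness and quality scores. It illustrates the objective only; the embedding and clustering steps are not reproduced, and the scores are made-up numbers. Reading "cumulative" as a plain sum over the two scores is an assumption.

```python
from typing import List, Sequence, Tuple


def wasserstein_1d(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Wasserstein-1 distance between two empirical 1-D distributions.

    Computed as the integral of |F_x(t) - F_y(t)| over t, where F is the
    empirical CDF of each sample.
    """
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    dist, i, j = 0.0, 0, 0
    for left, right in zip(points, points[1:]):
        while i < len(xs) and xs[i] <= left:
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        dist += abs(i / len(xs) - j / len(ys)) * (right - left)
    return dist


def cumulative_distance(subset: List[Tuple[float, float]],
                        full: List[Tuple[float, float]]) -> float:
    """Sum of the distances over the (hardness, quality) score pair."""
    return sum(
        wasserstein_1d([s[k] for s in subset], [f[k] for f in full])
        for k in range(2)
    )


# Hypothetical (hardness, quality) scores for a ten-item dataset.
full = [(i / 10, 1 - i / 10) for i in range(10)]
spread = full[::3]   # items 0, 3, 6, 9: covers the whole score range
skewed = full[:4]    # items 0..3: only the easiest items
print(cumulative_distance(spread, full), cumulative_distance(skewed, full))
```

The evenly spread subset yields a much smaller distance than the skewed one, which is the property a proxy subset is meant to have.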
##### Data\-Grounded Benchmark Generation\.
To facilitate standardized and deterministic evaluation, we further transform the raw query-answer pairs in the selected proxy subset $\mathcal{P}_r$ into multiple-choice questions (MCQs). We use an LLM to generate plausible distractors for each instance. The final output is a structured, efficient benchmark $\mathcal{B}_r$ composed of high-quality MCQs, ready for immediate automated model assessment.
## 3 Experiments
We established a comprehensive experimental framework to evaluate the effectiveness of SciCustom. First, we benchmarked our approach against ground-truth expert-curated datasets across 11 specific evaluation requirements. We focused on measuring the alignment of model rankings produced by SciCustom versus those from human experts. Furthermore, to verify the framework's utility in real-world exploration, we applied SciCustom to Pericyclic Reaction, a novel evaluation requirement without existing benchmarks, and qualitatively analyzed the validity of the constructed benchmark.
### 3.1 Experimental Setup
Table 1: Ranking consistency analysis of 10 LLMs on 6 chemistry-specific benchmarks. The table compares SciCustom against baseline benchmarks in preserving model rankings, quantified by Spearman and Kendall correlation coefficients. Higher scores indicate better consistency. Best results are in **bold**.

| Metric | Benchmark | Analytical chemistry | Inorganic chemistry | Material science | Organic chemistry | Physical chemistry | Technical chemistry |
|---|---|---|---|---|---|---|---|
| Spearman correlation | IfBench | -0.32 | -0.34 | -0.43 | 0.12 | -0.61 | -0.54 |
| | SimpleQA | -0.67 | -0.69 | -0.78 | -0.48 | -0.42 | -0.31 |
| | GPQA | 0.61 | 0.52 | 0.21 | 0.72 | 0.21 | 0.03 |
| | MMLU | 0.21 | 0.27 | -0.61 | 0.21 | 0.52 | 0.31 |
| | MedQA | 0.31 | 0.42 | 0.01 | -0.21 | 0.63 | 0.72 |
| | GPT-5 | -0.11 | 0.05 | -0.04 | 0.38 | 0.24 | -0.07 |
| | Embedding | -0.34 | -0.59 | -0.39 | 0.11 | -0.73 | -0.41 |
| | SciCustom | **0.86** | **0.67** | **0.42** | **0.89** | **0.74** | **0.86** |
| Kendall correlation | IfBench | -0.14 | -0.19 | -0.29 | 0.10 | -0.43 | -0.43 |
| | SimpleQA | -0.52 | -0.49 | -0.56 | -0.39 | -0.24 | -0.23 |
| | GPQA | 0.39 | 0.39 | 0.10 | 0.42 | 0.14 | -0.05 |
| | MMLU | 0.14 | 0.18 | -0.49 | 0.10 | 0.43 | 0.24 |
| | MedQA | 0.24 | 0.40 | -0.09 | -0.14 | 0.43 | 0.48 |
| | GPT-5 | -0.15 | 0.00 | -0.06 | 0.35 | 0.16 | 0.00 |
| | Embedding | -0.31 | -0.52 | -0.31 | 0.09 | -0.62 | -0.32 |
| | SciCustom | **0.62** | **0.48** | **0.35** | **0.68** | **0.57** | **0.65** |

Table 2: Ranking consistency analysis of 10 LLMs on 5 healthcare-specific benchmarks. Higher scores indicate better consistency. Best results are in **bold**.

| Metric | Benchmark | Virology | Human aging | Medical genetics | Anatomy | Nutrition |
|---|---|---|---|---|---|---|
| Spearman correlation | IfBench | 0.04 | -0.32 | -0.64 | -0.26 | -0.11 |
| | SimpleQA | 0.35 | -0.50 | 0.00 | -0.37 | 0.11 |
| | GPQA | -0.11 | -0.10 | -0.09 | 0.48 | 0.18 |
| | MMLU | 0.35 | 0.21 | 0.04 | -0.15 | 0.56 |
| | MedQA | 0.44 | **0.62** | 0.35 | -0.19 | 0.45 |
| | GPT-5 | 0.25 | 0.20 | 0.09 | 0.11 | 0.52 |
| | Embedding | 0.18 | 0.21 | -0.21 | -0.32 | 0.27 |
| | SciCustom | **0.55** | 0.49 | **0.42** | **0.62** | **0.78** |
| Kendall correlation | IfBench | -0.05 | -0.14 | -0.43 | -0.21 | -0.00 |
| | SimpleQA | 0.31 | -0.33 | 0.00 | -0.21 | 0.10 |
| | GPQA | -0.05 | -0.10 | -0.11 | 0.32 | 0.10 |
| | MMLU | 0.31 | 0.14 | 0.00 | 0.00 | 0.51 |
| | MedQA | 0.35 | **0.42** | 0.23 | -0.10 | 0.31 |
| | GPT-5 | 0.20 | 0.18 | 0.06 | 0.06 | 0.43 |
| | Embedding | 0.11 | 0.10 | -0.14 | -0.16 | 0.18 |
| | SciCustom | **0.45** | 0.37 | **0.33** | **0.51** | **0.64** |
##### Ground\-Truth Benchmarks\.
We evaluated SciCustom on 6 tasks in chemistry, including analytical chemistry, inorganic chemistry, material science, organic chemistry, physical chemistry and technical chemistry, and 5 tasks in healthcare, including virology, human aging, medical genetics, anatomy and nutrition. The ground-truth benchmarks for chemistry tasks are from ChemBench Mirza et al. ([2024](https://arxiv.org/html/2605.19357#bib.bib27)), while the healthcare tasks are sourced from the health subset of MMLU-Pro Wang et al. ([2024b](https://arxiv.org/html/2605.19357#bib.bib26)); Hendrycks et al. ([2021](https://arxiv.org/html/2605.19357#bib.bib36)).
##### Baselines\.
We compared SciCustom-curated benchmarks against a comprehensive set of baselines spanning general-purpose benchmarks, scientific benchmarks and alternative benchmark construction methods. To assess the correlation between general-purpose benchmarks and target tasks, we utilized IfBench Pyatkin et al. ([2025](https://arxiv.org/html/2605.19357#bib.bib37)) and SimpleQA Wei et al. ([2024a](https://arxiv.org/html/2605.19357#bib.bib38)). GPQA-Diamond Rein et al. ([2024](https://arxiv.org/html/2605.19357#bib.bib28)) was included to represent general scientific benchmarks. For domain-specific comparisons, we selected the chemistry-related subset of MMLU-Pro Wang et al. ([2024b](https://arxiv.org/html/2605.19357#bib.bib26)) for chemistry tasks and MedQA Jin et al. ([2021](https://arxiv.org/html/2605.19357#bib.bib33)) for healthcare tasks. Furthermore, to isolate the contributions of our grounded data and ontology-driven design, we introduced two alternative construction baselines: GPT-5, a fully synthetic benchmark where the model generates multiple-choice questions directly without access to grounded data, and Embedding, an embedding-based baseline that selects data instances from the corpus $\mathcal{D}$ solely via k-nearest neighbor (k-NN) retrieval to the requirement, bypassing the ontology-grounded knowledge units.
##### Evaluation Protocol\.
We first evaluate 10 representative LLMs on the ground\-truth benchmark to obtain the reference ranking\. We then evaluate the same 10 LLMs using benchmarks constructed by each baseline method \(including ours\) and compute the ranking induced by each alternative benchmark\. Finally, we measure the consistency between the baseline\-induced ranking and the ground\-truth ranking using Spearman and Kendall correlation coefficients\. Further implementation details are provided in Appendix[C](https://arxiv.org/html/2605.19357#A3)\.
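The two ranking-consistency metrics can be computed as in the following dependency-free sketch. The model scores are invented for illustration, and the tie handling (average ranks for Spearman, the tau-b variant for Kendall) is an assumption about the exact variants used.

```python
from itertools import combinations
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of the rank-transformed scores."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-b: (C - D) / sqrt((C + D + Ty) * (C + D + Tx))."""
    conc = disc = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            conc += 1
        else:
            disc += 1
    return (conc - disc) / sqrt((conc + disc + only_y_tied) * (conc + disc + only_x_tied))


# Hypothetical accuracies of five models on a reference and a constructed benchmark.
reference = [0.90, 0.80, 0.70, 0.60, 0.50]
candidate = [0.85, 0.90, 0.60, 0.65, 0.40]
rho = spearman(reference, candidate)
tau = kendall_tau_b(reference, candidate)
print(round(rho, 2), round(tau, 2))
```

In this toy case two of the ten model pairs are ordered differently by the two benchmarks, giving a Spearman coefficient of 0.8 and a Kendall coefficient of 0.6.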
Figure 4:Bar plot showing the effectiveness of relevance cutoff and subset selection strategies\.
### 3.2 Alignment Between SciCustom and Expert-Curated Benchmarks
We report the results of SciCustom against various baselines in Table [1](https://arxiv.org/html/2605.19357#S3.T1) and Table [2](https://arxiv.org/html/2605.19357#S3.T2). SciCustom exhibits a robust correlation with expert assessments, achieving the highest Spearman correlation in 10 out of 11 tasks. This strong alignment indicates that the benchmarks generated by SciCustom effectively capture the capabilities required for specific applications, serving as a reliable proxy for expensive expert evaluation. Notably, SciCustom selects the same top-1 model as the ground-truth benchmark on 8 out of 11 tasks, highlighting its high practical utility in assisting users to identify the optimal model for their specific needs.
In contrast, widely adopted benchmarks fail to show consistent alignment with expert rankings. This underscores that neither general instruction-following skills nor broad scientific reasoning capabilities are reliable predictors of proficiency for specialized scientific requirements. Furthermore, established domain-specific benchmarks also struggle to predict sub-domain-level performance due to insufficient granularity. These findings further motivate the SciCustom framework, which dynamically constructs benchmarks at the precise granularity of the target requirement.
Comparing SciCustom with alternative construction baselines reveals the critical role of our methodology design. First, our framework significantly outperforms the fully synthetic GPT-5 baseline, underscoring the necessity of grounded scientific data. Second, SciCustom also surpasses the Embedding baseline, which retrieves grounded data via simple semantic similarity ($k$-NN) rather than knowledge-unit-based mapping. This demonstrates that organizing data into knowledge units achieves better alignment with user intent than unstructured semantic retrieval. Additional comparison results for GPT-5-RAG and Embedding-filtering are provided in Appendix [D](https://arxiv.org/html/2605.19357#A4).
##### Special\-Case Analysis\.
We observed a relatively low Spearman correlation \(0\.42\) in Material Science compared to other tasks\. To investigate this discrepancy, we further evaluated the LLMs using MatSci\-NLP Song et al\. \([2023](https://arxiv.org/html/2605.19357#bib.bib17)\), another large\-scale expert\-curated benchmark in this field \(comprising 9,466 classification questions grounded in experimental material science facts\)\. Interestingly, we found that the alignment between the two expert\-curated benchmarks themselves was also limited \($\rho = 0.31$, $\tau_b = 0.22$\)\. This suggests that Material Science is an exceptionally broad and multifaceted discipline Shoghi et al\. \([2023](https://arxiv.org/html/2605.19357#bib.bib18)\); Wang et al\. \([2025a](https://arxiv.org/html/2605.19357#bib.bib19)\), where different evaluation protocols may focus on orthogonal knowledge areas or distinct reasoning types\. We identify the fine\-grained unification of such complex and heterogeneous domains as a direction for future research\.
##### Human Evaluation\.
We evaluate the quality of the benchmarks generated by SciCustom using a random sample of 50 chemistry\-related questions that were not originally in multiple\-choice format\. Three master's students in AI4Chemistry annotate the samples with binary scores for Correctness \(the validity of the MCQ after transformation\) and Relevance \(the alignment with the benchmark query\)\. The resulting average scores are 0\.92 for Correctness and 0\.7 for Relevance, supporting the reliability of our benchmark generation pipeline\.
Figure 5: Case study on constructing a benchmark for “Pericyclic Reaction”\. The pipeline progresses from identifying relevant knowledge units via voting \(Left\), to retrieving grounded raw data with relevance\-highlighted terms \(Middle\), to generating standardized multiple\-choice questions for evaluation \(Right\)\.
### 3\.3 Component Analysis
##### Effectiveness of Tagger\.
We evaluate the tagger on 1,000 unseen queries generated by compositional sampling over the knowledge units $\mathcal{T}$, where each query is generated from the keywords corresponding to the sampled knowledge units\. We use fuzzy label matching to assess performance: a predicted tag is counted as correct if its normalized Indel similarity with any gold tag exceeds 85%, computed with the rapidfuzz library \([https://rapidfuzz\.github\.io/RapidFuzz/](https://rapidfuzz.github.io/RapidFuzz/)\)\. The Macro F1 and Micro F1 scores are 75\.2% and 78\.6%, respectively, demonstrating the effectiveness of the tagger in mapping queries to relevant scientific knowledge\.
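The matching rule can be sketched as follows\. To stay dependency\-free, the snippet implements normalized Indel similarity directly from the longest common subsequence, which is the quantity that rapidfuzz's `fuzz.ratio` reports on a 0 to 100 scale\. The lowercasing step, the helper names, and the example tags are assumptions made for illustration, and the paper does not state whether Macro F1 is averaged over queries or over labels\.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_similarity(a, b):
    """Normalized Indel similarity on a 0-100 scale."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    # Indel distance counts insertions and deletions only (no substitutions).
    distance = total - 2 * lcs_length(a, b)
    return 100.0 * (1 - distance / total)


def fuzzy_match_counts(predicted, gold, threshold=85.0):
    """Return (true positives, false positives, false negatives) for one query."""
    # Assumption: tags are compared case-insensitively after trimming whitespace.
    predicted = [p.strip().lower() for p in predicted]
    gold = [g.strip().lower() for g in gold]
    tp = sum(any(indel_similarity(p, g) > threshold for g in gold) for p in predicted)
    fp = len(predicted) - tp
    # A gold tag is missed when no predicted tag is close enough to it.
    fn = sum(not any(indel_similarity(p, g) > threshold for p in predicted) for g in gold)
    return tp, fp, fn


def f1(tp, fp, fn):
    """F1 score from raw counts; defined as 0 when there are no true positives."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def micro_f1(per_query_counts):
    """Micro F1: pool the counts of all queries before computing F1."""
    tp, fp, fn = (sum(column) for column in zip(*per_query_counts))
    return f1(tp, fp, fn)


# Hypothetical predictions and gold tags for a single query.
counts = fuzzy_match_counts(
    ["cyclization", "aromatic hydrocarbons"],
    ["Cyclization", "aromatic hydrocarbon", "ring compound"],
)
print(counts, f1(*counts))
```

Here the plural "aromatic hydrocarbons" still matches its singular gold tag because only one character must be deleted, while "ring compound" counts as a miss\.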
##### Effectiveness of Relevance Filtering and Subset Selection Strategies\.
We evaluate the efficacy of our retrieval and selection components by comparing the full SciCustom framework against two ablation variants: w/o cutoff \(removing the binary search relevance filter\) and w/o selection \(replacing the intrinsic score\-based subset selection with random sampling\)\. The full SciCustom consistently outperforms both variants across all 11 sub\-disciplines \(Figure [4](https://arxiv.org/html/2605.19357#S3.F4)\)\. The drop in performance for w/o selection \(random sampling\) is substantial, which shows that the data selection component is necessary\. Furthermore, removing the binary search cutoff \(w/o cutoff\) also leads to a notable degradation in ranking consistency, suggesting that our cutoff strategy keeps the benchmark tightly focused on the target capability\. These results demonstrate that both precise relevance filtering and representative subset selection are indispensable for efficient and accurate evaluation\.
We further report an automatic evaluation of the quality of LLM\-generated distractors in MCQs in Appendix [F](https://arxiv.org/html/2605.19357#A6)\.
### 3\.4 Benchmark for Pericyclic Reaction
To evaluate the capability of SciCustom in handling highly specific and novel scientific requirements, we conducted a case study on Pericyclic Reactions\. Unlike general organic chemistry queries, pericyclic reactions require a nuanced understanding of orbital symmetry, stereochemistry, and specific reaction conditions \(e\.g\., thermal vs\. photochemical\) Woodward and Hoffmann \([1969](https://arxiv.org/html/2605.19357#bib.bib78)\)\. No existing LLM benchmark specifically isolates this distinct class of reactions\.
As illustrated in Figure [5](https://arxiv.org/html/2605.19357#S3.F5), the construction process progresses through three key stages, each showing high alignment with expert judgment\. Upon receiving the abstract requirement "Pericyclic Reaction benchmark," SciCustom first decomposes it into fine\-grained knowledge anchors\. As shown in the left panel, SciCustom successfully prioritizes highly relevant concepts\. The ranking produced by SciCustom correlates highly with expert assessment: the top\-ranked units, such as Cyclization, Aromatic hydrocarbon, and Ring compound, are identified by experts as the most scientifically relevant \(indicated by darker shades\)\. Guided by these concepts, SciCustom proceeds to data retrieval and question transformation \(Figure [5](https://arxiv.org/html/2605.19357#S3.F5), middle and right panels\)\. Both the retrieved raw data and the transformed multiple\-choice questions exhibit high alignment with expert requirements, accurately capturing critical domain\-specific terminology \(highlighted in the figure\) and complex reaction mechanisms\.
## 4 Related Works
Existing scientific evaluation benchmarks primarily fall into two categories\. General scientific benchmarks Hendrycks et al\. \([2021](https://arxiv.org/html/2605.19357#bib.bib36)\); Wang et al\. \([2024b](https://arxiv.org/html/2605.19357#bib.bib26)\); Rein et al\. \([2024](https://arxiv.org/html/2605.19357#bib.bib28)\); Sun et al\. \([2024](https://arxiv.org/html/2605.19357#bib.bib30)\); Wang et al\. \([2024a](https://arxiv.org/html/2605.19357#bib.bib53)\) focus predominantly on evaluating broad reasoning capabilities or generalized scientific common sense, lacking the depth required for specific scientific sub\-domains\. Conversely, domain\-specific benchmarks Mirza et al\. \([2024](https://arxiv.org/html/2605.19357#bib.bib27)\); Jin et al\. \([2021](https://arxiv.org/html/2605.19357#bib.bib33)\); Singhal et al\. \([2023](https://arxiv.org/html/2605.19357#bib.bib43)\); Song et al\. \([2023](https://arxiv.org/html/2605.19357#bib.bib17)\) offer more specialized evaluations but are inherently static and labor\-intensive to maintain due to their reliance on manual expert curation\.
To address the limitations of manual curation, several automated benchmarking frameworks have been proposed in computer vision Zhang et al\. \([2025](https://arxiv.org/html/2605.19357#bib.bib35)\) and general natural language processing Pombal et al\. \([2025](https://arxiv.org/html/2605.19357#bib.bib31)\); Li et al\. \([2025b](https://arxiv.org/html/2605.19357#bib.bib32)\); Guo et al\. \([2025](https://arxiv.org/html/2605.19357#bib.bib34)\)\. However, they are ill\-suited for complex scientific evaluation, as they typically suffer from restricted requirement scopes Zhang et al\. \([2025](https://arxiv.org/html/2605.19357#bib.bib35)\), a lack of grounded data Pombal et al\. \([2025](https://arxiv.org/html/2605.19357#bib.bib31)\), or the inability to reuse compositional knowledge units Guo et al\. \([2025](https://arxiv.org/html/2605.19357#bib.bib34)\); Li et al\. \([2025b](https://arxiv.org/html/2605.19357#bib.bib32)\), thereby failing to meet the critical demands of scientific applications for diversity, intersectionality, and verifiability\.
## 5 Conclusion
In this paper, we introduced SciCustom, an ontology\-driven framework that automates the construction of customized benchmarks for scientific LLM evaluation\. By reorganizing scattered scientific corpora into fine\-grained, reusable knowledge units, our approach overcomes the scalability bottlenecks of manual curation while ensuring evaluations remain grounded in verifiable facts\. Extensive experiments demonstrate that SciCustom not only achieves high alignment with expert\-curated benchmarks in chemistry and healthcare but also reliably assesses model capabilities in novel, unbenchmarked scientific frontiers\. This work establishes a scalable and application\-aware evaluation framework for navigating the complex and evolving landscape of scientific research\.
## Limitations
SciCustom has two primary limitations\. First, the scope of our current evaluation is constrained by the underlying ontologies, which predominantly cover biomedical and chemical domains \(derived from OBO, BioPortal, and OLS\)\. Consequently, some scientific disciplines, such as mathematics and theoretical physics, are not yet included\. Future work can address this by integrating broader scientific taxonomies to expand the knowledge space\. Second, SciCustom\-curated benchmarks depend on the coverage of the source scientific corpus $\mathcal{D}$\. Some knowledge units may currently suffer from data sparsity\. With the tagging model, we can identify these low\-resource knowledge units by monitoring how frequently each knowledge unit occurs\. In future work, this framework will continuously evolve and scale as new scientific datasets become available\.
## Ethics Statement
Our experimental evaluation includes a benchmark subset focused on healthcare\. We emphasize that our framework and the constructed benchmarks are strictly designed for educational and evaluative purposes within a textual Question\-Answering \(QA\) context\. The benchmarks assess an LLM’s ability to recall and reason about established scientific facts found in public academic literature\. They do not involve, facilitate, or encourage any actionable bio\-security threats\. Furthermore, all scientific data used in SciCustom are aggregated from open\-source, publicly available repositories\. No private, classified, or patient\-identifiable data were used in the construction of our benchmarks\. We believe that rigorous evaluation of LLMs in these domains is a prerequisite for their safe and responsible deployment in scientific research\.
## Acknowledgments
The authors Ming Zhang, Yiyang Gu, and Junwei Yang are supported by grants from the National Natural Science Foundation of China \(NSFC Grant Number 62276002\)\. The authors Zequn Liu, Yingce Xia, and Shufang Xie are supported by Zhongguancun Academy \(Grant No\. C20250513 and Grant No\. P190260302\)\.
## References
- J\. Achiam, S\. Adler, S\. Agarwal, L\. Ahmad, I\. Akkaya, F\. L\. Aleman, D\. Almeida, J\. Altenschmidt, S\. Altman, S\. Anadkat, et al\. \(2023\) Gpt\-4 technical report\. arXiv preprint arXiv:2303\.08774\. Cited by: [Appendix C](https://arxiv.org/html/2605.19357#A3.p1.4)\.
- K\. Anjum, M\. A\. Arshad, K\. Hayawi, E\. Polyzos, A\. Tariq, M\. A\. Serhani, L\. Batool, B\. Lund, N\. R\. Mannuru, R\. V\. K\. Bevara, et al\. \(2025\) Domain specific benchmarks for evaluating multimodal large language models\. arXiv preprint arXiv:2506\.12958\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p2.1)\.
- E\. Bandel, Y\. Perlitz, E\. Venezian, R\. Friedman, O\. Arviv, M\. Orbach, S\. Don\-Yehiya, D\. Sheinwald, A\. Gera, L\. Choshen, et al\. \(2024\) Unitxt: flexible, shareable and reusable data preparation and evaluation for generative ai\. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 3: System Demonstrations\), pp\. 207–215\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p2.1)\.
- S\. Bedi, Y\. Liu, L\. Orr\-Ewing, D\. Dash, S\. Koyejo, A\. Callahan, J\. A\. Fries, M\. Wornow, A\. Swaminathan, L\. S\. Lehmann, et al\. \(2025\) Testing and evaluation of health care applications of large language models: a systematic review\. Jama 333 \(4\), pp\. 319–328\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p1.1)\.
- X\. Bi, Y\. Wang, J\. Wang, and C\. Liu \(2025\) Machine learning for multi\-target drug discovery: challenges and opportunities in systems pharmacology\. Pharmaceutics 17 \(9\), pp\. 1186\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p3.1)\.
- A\. Birhane, A\. Kasirzadeh, D\. Leslie, and S\. Wachter \(2023\) Science in the age of large language models\. Nature Reviews Physics 5 \(5\), pp\. 277–280\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p1.1)\.
- R\. Bommasani, D\. A\. Hudson, E\. Adeli, R\. Altman, S\. Arora, S\. von Arx, M\. S\. Bernstein, J\. Bohg, A\. Bosselut, E\. Brunskill, et al\. \(2021\) On the opportunities and risks of foundation models\. arXiv preprint arXiv:2108\.07258\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p1.1)\.
- H\. Cai, X\. Cai, J\. Chang, S\. Li, L\. Yao, et al\. \(2025\) SciAssess: benchmarking llm proficiency in scientific literature analysis\. In Findings of the Association for Computational Linguistics: NAACL 2025, pp\. 2335–2357\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p1.1)\.
- Y\. Chang, X\. Wang, J\. Wang, Y\. Wu, L\. Yang, K\. Zhu, H\. Chen, X\. Yi, C\. Wang, Y\. Wang, et al\. \(2024\) A survey on evaluation of large language models\. ACM transactions on intelligent systems and technology 15 \(3\), pp\. 1–45\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p1.1)\.
- Z\. Chen, Y\. Liu, Y\. G\. Wang, and Y\. Shen \(2024\) Validation of an llm\-based multi\-agent framework for protein engineering in dry lab and wet lab\. In 2024 IEEE International Conference on Bioinformatics and Biomedicine \(BIBM\), pp\. 5364–5370\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p3.1)\.
- J\. Chou, A\. Liu, Y\. Deng, Z\. Zeng, T\. Zhang, H\. Zhu, J\. Cai, Y\. Mao, C\. Zhang, L\. Tan, et al\. \(2025\) Autocodebench: large language models are automatic code benchmark generators\. arXiv preprint arXiv:2508\.09101\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p3.1)\.
- Y\. Fang, X\. Liang, N\. Zhang, K\. Liu, R\. Huang, Z\. Chen, X\. Fan, and H\. Chen \(2024\) Mol\-instructions: a large\-scale biomolecular instruction dataset for large language models\. In The Twelfth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=Tlsdsb6l9n)\. Cited by: [Appendix C](https://arxiv.org/html/2605.19357#A3.p2.1)\.
- E\. Farchi, S\. Froimovich, R\. Katan, and O\. Raz \(2024\) Automatic generation of benchmarks and reliable llm judgment for code tasks\. arXiv preprint arXiv:2410\.21071\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p3.1)\.
- J\. Gottweis, W\. Weng, A\. Daryin, T\. Tu, A\. Palepu, P\. Sirkovic, A\. Myaskovsky, F\. Weissenberger, K\. Rong, R\. Tanno, et al\. \(2025\) Towards an ai co\-scientist\. arXiv preprint arXiv:2502\.18864\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p1.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, et al\. \(2024\) The llama 3 herd of models\. arXiv preprint arXiv:2407\.21783\. Cited by: [Appendix C](https://arxiv.org/html/2605.19357#A3.p1.4)\.
- C\. Guo, H\. Kai, S\. Liang, Y\. Jiang, Y\. Gao, X\. Hua, and W\. Dong \(2025\) SDBench: a survey\-based domain\-specific llm benchmarking and optimization framework\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 13492–13506\. Cited by: [§4](https://arxiv.org/html/2605.19357#S4.p2.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021\) Measuring massive multitask language understanding\. In International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=d7KBjmI3GmQ)\. Cited by: [§3\.1](https://arxiv.org/html/2605.19357#S3.SS1.SSS0.Px1.p1.1), [§4](https://arxiv.org/html/2605.19357#S4.p1.1)\.
- K\. Huang, T\. Fu, W\. Gao, Y\. Zhao, Y\. H\. Roohani, J\. Leskovec, C\. W\. Coley, C\. Xiao, J\. Sun, and M\. Zitnik \(2021\) Therapeutics data commons: machine learning datasets and tasks for drug discovery and development\. In Thirty\-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track \(Round 1\), External Links: [Link](https://openreview.net/forum?id=8nvgnORnoWr)\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p2.1), [§1](https://arxiv.org/html/2605.19357#S1.p3.1)\.
- D\. Jin, E\. Pan, N\. Oufattole, W\. Weng, H\. Fang, and P\. Szolovits \(2021\) What disease does this patient have? a large\-scale open domain question answering dataset from medical exams\. Applied Sciences 11 \(14\), pp\. 6421\. Cited by: [§3\.1](https://arxiv.org/html/2605.19357#S3.SS1.SSS0.Px2.p1.1), [§4](https://arxiv.org/html/2605.19357#S4.p1.1)\.
- W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. Gonzalez, H\. Zhang, and I\. Stoica \(2023\) Efficient memory management for large language model serving with pagedattention\. In Proceedings of the 29th symposium on operating systems principles, pp\. 611–626\. Cited by: [Appendix C](https://arxiv.org/html/2605.19357#A3.p1.4)\.
- J\. Li, B\. Chen, Z\. Zou, rumin zhang, sheng ding, and jinjiang guo \(2025a\) DMPKBench: a multi\-modal benchmark for evaluating LLMs and agents in drug discovery DMPK tasks\. In NeurIPS 2025 AI for Science Workshop, External Links: [Link](https://openreview.net/forum?id=1NSnXVTxNR)\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p3.1)\.
- X\. L\. Li, F\. Kaiyom, E\. Z\. Liu, Y\. Mai, P\. Liang, and T\. Hashimoto \(2025b\) AutoBencher: towards declarative benchmark construction\. In The Thirteenth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=ymt4crbbXh)\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p3.1), [§4](https://arxiv.org/html/2605.19357#S4.p2.1)\.
- P\. Liang, R\. Bommasani, T\. Lee, D\. Tsipras, D\. Soylu, M\. Yasunaga, Y\. Zhang, D\. Narayanan, Y\. Wu, A\. Kumar, B\. Newman, B\. Yuan, B\. Yan, C\. Zhang, C\. Cosgrove, C\. D\. Manning, C\. Re, D\. Acosta\-Navas, D\. A\. Hudson, E\. Zelikman, E\. Durmus, F\. Ladhak, F\. Rong, H\. Ren, H\. Yao, J\. WANG, K\. Santhanam, L\. Orr, L\. Zheng, M\. Yuksekgonul, M\. Suzgun, N\. Kim, N\. Guha, N\. S\. Chatterji, O\. Khattab, P\. Henderson, Q\. Huang, R\. A\. Chi, S\. M\. Xie, S\. Santurkar, S\. Ganguli, T\. Hashimoto, T\. Icard, T\. Zhang, V\. Chaudhary, W\. Wang, X\. Li, Y\. Mai, Y\. Zhang, and Y\. Koreeda \(2023\) Holistic evaluation of language models\. Transactions on Machine Learning Research\. External Links: ISSN 2835\-8856, [Link](https://openreview.net/forum?id=iO4LZibEqW)\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p2.1)\.
- Y\. Liu, M\. Ott, N\. Goyal, J\. Du, M\. Joshi, D\. Chen, O\. Levy, M\. Lewis, L\. Zettlemoyer, and V\. Stoyanov \(2019\) Roberta: a robustly optimized bert pretraining approach\. arXiv preprint arXiv:1907\.11692\. Cited by: [Appendix A](https://arxiv.org/html/2605.19357#A1.p2.5), [§2\.5](https://arxiv.org/html/2605.19357#S2.SS5.SSS0.Px2.p2.5)\.
- Z\. Liu, S\. Wang, Y\. Gu, R\. Zhang, M\. Zhang, and S\. Wang \(2021\) Graphine: a dataset for graph\-aware terminology definition generation\. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp\. 3453–3463\. Cited by: [§2\.2](https://arxiv.org/html/2605.19357#S2.SS2.p2.1)\.
- Z\. Liu, W\. Zhang, Y\. Xia, L\. Wu, S\. Xie, T\. Qin, M\. Zhang, and T\. Liu \(2023\) MolXPT: wrapping molecules with text for generative pre\-training\. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\), A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\), Toronto, Canada, pp\. 1606–1616\. External Links: [Link](https://aclanthology.org/2023.acl-short.138/), [Document](https://dx.doi.org/10.18653/v1/2023.acl-short.138)\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p1.1)\.
- J\. Lu, K\. Choi, M\. Eremeev, J\. Gobburu, S\. Goswami, Q\. Liu, G\. Mo, C\. J\. Musante, and M\. H\. Shahin \(2025\) Large language models and their applications in drug discovery and development: a primer\. Clinical and Translational Science 18 \(4\), pp\. e70205\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p3.1)\.
- J\. Luo, W\. Zhang, Y\. Yuan, Y\. Zhao, J\. Yang, Y\. Gu, B\. Wu, B\. Chen, Z\. Qiao, Q\. Long, et al\. \(2025\) Large language model agent: a survey on methodology, applications and challenges\. arXiv preprint arXiv:2503\.21460\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p1.1)\.
- S\. Miret and N\. A\. Krishnan \(2025\) Enabling large language models for real\-world materials discovery\. Nature Machine Intelligence 7 \(7\), pp\. 991–998\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p1.1), [§1](https://arxiv.org/html/2605.19357#S1.p2.1)\.
- A\. Mirza, N\. Alampara, S\. Kunchapu, M\. Ríos\-García, B\. Emoekabu, A\. Krishnan, T\. Gupta, M\. Schilling\-Wilhelmi, M\. Okereke, A\. Aneesh, et al\. \(2024\) Are large language models superhuman chemists?\. arXiv preprint arXiv:2404\.01475\. Cited by: [§3\.1](https://arxiv.org/html/2605.19357#S3.SS1.SSS0.Px1.p1.1), [§4](https://arxiv.org/html/2605.19357#S4.p1.1)\.
- J\. C\. L\. Ong, L\. Jin, K\. Elangovan, G\. Y\. San Lim, D\. Y\. Z\. Lim, G\. G\. R\. Sng, Y\. H\. Ke, J\. Y\. M\. Tung, R\. J\. Zhong, C\. M\. Y\. Koh, et al\. \(2025\) Large language model as clinical decision support system augments medication safety in 16 clinical specialties\. Cell Reports Medicine 6 \(10\)\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p3.1)\.
- W\. S\. Parker \(2020\) Evaluating data journeys: climategate, synthetic data and the benchmarking of methods for climate data processing\. In Data journeys in the sciences, pp\. 191–206\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p3.1)\.
- J\. R\. Penadés, J\. Gottweis, L\. He, J\. B\. Patkowski, A\. Shurick, W\. Weng, T\. Tu, A\. Palepu, A\. Myaskovsky, A\. Pawlosky, et al\. \(2025\) AI mirrors experimental science to uncover a novel mechanism of gene transfer crucial to bacterial evolution\. bioRxiv, pp\. 2025–02\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p1.1)\.
- J\. Pombal, N\. M\. Guerreiro, R\. Rei, and A\. F\. Martins \(2025\) Zero\-shot benchmarking: a framework for flexible and scalable automatic evaluation of language models\. arXiv preprint arXiv:2504\.01001\. Cited by: [§4](https://arxiv.org/html/2605.19357#S4.p2.1)\.
- V\. Pyatkin, S\. Malik, V\. Graf, H\. Ivison, S\. Huang, P\. Dasigi, N\. Lambert, and H\. Hajishirzi \(2025\) Generalizing verifiable instruction following\. In The Thirty\-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, External Links: [Link](https://openreview.net/forum?id=yfYgwjj5F8)\. Cited by: [Appendix C](https://arxiv.org/html/2605.19357#A3.p2.1), [§3\.1](https://arxiv.org/html/2605.19357#S3.SS1.SSS0.Px2.p1.1)\.
- M\. C\. Ramos, C\. J\. Collison, and A\. D\. White \(2025\) A review of large language models and autonomous agents in chemistry\. Chemical science 16 \(6\), pp\. 2514–2572\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p3.1)\.
- D\. Rein, B\. L\. Hou, A\. C\. Stickland, J\. Petty, R\. Y\. Pang, J\. Dirani, J\. Michael, and S\. R\. Bowman \(2024\) Gpqa: a graduate\-level google\-proof q&a benchmark\. In First conference on language modeling\. Cited by: [Appendix C](https://arxiv.org/html/2605.19357#A3.p2.1), [§3\.1](https://arxiv.org/html/2605.19357#S3.SS1.SSS0.Px2.p1.1), [§4](https://arxiv.org/html/2605.19357#S4.p1.1)\.
- G\. Saranathan, C\. Xu, M\. P\. Alam, T\. Kumar, M\. Foltin, S\. Y\. Wong, and S\. Bhattacharya \(2025\) SubLIME: subset selection via rank correlation prediction for data\-efficient llm evaluation\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 30572–30593\. Cited by: [Appendix A](https://arxiv.org/html/2605.19357#A1.p2.5), [Appendix C](https://arxiv.org/html/2605.19357#A3.p3.3), [§2\.5](https://arxiv.org/html/2605.19357#S2.SS5.SSS0.Px2.p2.5)\.
- N\. Shoghi, A\. Kolluru, J\. R\. Kitchin, Z\. W\. Ulissi, C\. L\. Zitnick, and B\. M\. Wood \(2023\) From molecules to materials: pre\-training large generalizable models for atomic property prediction\. arXiv preprint arXiv:2310\.16802\. Cited by: [§3\.2](https://arxiv.org/html/2605.19357#S3.SS2.SSS0.Px1.p1.2)\.
- K\. Singhal, S\. Azizi, T\. Tu, S\. S\. Mahdavi, J\. Wei, H\. W\. Chung, N\. Scales, A\. Tanwani, H\. Cole\-Lewis, S\. Pfohl, et al\. \(2023\) Large language models encode clinical knowledge\. Nature 620 \(7972\), pp\. 172–180\. Cited by: [Appendix C](https://arxiv.org/html/2605.19357#A3.p2.1), [§4](https://arxiv.org/html/2605.19357#S4.p1.1)\.
- K\. Singhal, T\. Tu, J\. Gottweis, R\. Sayres, E\. Wulczyn, M\. Amin, L\. Hou, K\. Clark, S\. R\. Pfohl, H\. Cole\-Lewis, et al\. \(2025\) Toward expert\-level medical question answering with large language models\. Nature medicine 31 \(3\), pp\. 943–950\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p1.1), [§1](https://arxiv.org/html/2605.19357#S1.p2.1)\.
- K\. Song, A\. Trotter, and J\. Y\. Chen \(2025\) LLM agent swarm for hypothesis\-driven drug discovery\. arXiv preprint arXiv:2504\.17967\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p3.1)\.
- Y\. Song, S\. Miret, and B\. Liu \(2023\) MatSci\-nlp: evaluating scientific language models on materials science language tasks using text\-to\-schema modeling\. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 3621–3639\. Cited by: [§3\.2](https://arxiv.org/html/2605.19357#S3.SS2.SSS0.Px1.p1.2), [§4](https://arxiv.org/html/2605.19357#S4.p1.1)\.
- L\. Sun, Y\. Han, Z\. Zhao, D\. Ma, Z\. Shen, B\. Chen, L\. Chen, and K\. Yu \(2024\) Scieval: a multi\-level large language model evaluation benchmark for scientific research\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 38, pp\. 19053–19061\. Cited by: [Appendix C](https://arxiv.org/html/2605.19357#A3.p2.1), [§4](https://arxiv.org/html/2605.19357#S4.p1.1)\.
- D\. Wadden, K\. Shi, J\. Morrison, A\. Li, A\. Naik, S\. Singh, N\. Barzilay, K\. Lo, T\. Hope, L\. Soldaini, S\. Z\. Shen, D\. Downey, H\. Hajishirzi, and A\. Cohan \(2025\) SciRIFF: a resource to enhance language model instruction\-following over scientific literature\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp\. 6072–6109\. External Links: [Link](https://aclanthology.org/2025.emnlp-main.310/), [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.310)\. Cited by: [Appendix C](https://arxiv.org/html/2605.19357#A3.p2.1)\.
- B\. Wang, Y\. Ouyang, Y\. Li, M\. Pan, Y\. Tang, Y\. Wang, H\. Cui, J\. Zhang, X\. Wang, W\. Ma, et al\. \(2025a\) Moma: a modular deep learning framework for material property prediction\. arXiv preprint arXiv:2502\.15483\. Cited by: [§3\.2](https://arxiv.org/html/2605.19357#S3.SS2.SSS0.Px1.p1.2)\.
- S\. Wang, J\. Tan, Z\. Dou, and J\. Wen \(2025b\) Omnieval: an omnidirectional and automatic rag evaluation benchmark in financial domain\. In Proceedings of the 2025 conference on empirical methods in natural language processing, pp\. 5737–5762\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p3.1)\.
- X\. Wang, Z\. Hu, P\. Lu, Y\. Zhu, J\. Zhang, S\. Subramaniam, A\. R\. Loomba, S\. Zhang, Y\. Sun, and W\. Wang \(2024a\) SciBench: evaluating college\-level scientific problem\-solving abilities of large language models\. In International Conference on Machine Learning, pp\. 50622–50649\. Cited by: [§4](https://arxiv.org/html/2605.19357#S4.p1.1)\.
- Y\. Wang, X\. Ma, G\. Zhang, Y\. Ni, A\. Chandra, S\. Guo, W\. Ren, A\. Arulraj, X\. He, Z\. Jiang, et al\. \(2024b\) Mmlu\-pro: a more robust and challenging multi\-task language understanding benchmark\. Advances in Neural Information Processing Systems 37, pp\. 95266–95290\. Cited by: [Appendix C](https://arxiv.org/html/2605.19357#A3.p2.1), [§3\.1](https://arxiv.org/html/2605.19357#S3.SS1.SSS0.Px1.p1.1), [§3\.1](https://arxiv.org/html/2605.19357#S3.SS1.SSS0.Px2.p1.1), [§4](https://arxiv.org/html/2605.19357#S4.p1.1)\.
- J\. Wei, N\. Karina, H\. W\. Chung, Y\. J\. Jiao, S\. Papay, A\. Glaese, J\. Schulman, and W\. Fedus \(2024a\) Measuring short\-form factuality in large language models\. arXiv preprint arXiv:2411\.04368\. Cited by: [Appendix C](https://arxiv.org/html/2605.19357#A3.p2.1), [§3\.1](https://arxiv.org/html/2605.19357#S3.SS1.SSS0.Px2.p1.1)\.
- Y\. Wei, Z\. Wang, J\. Liu, Y\. Ding, and L\. Zhang \(2024b\) Magicoder: empowering code generation with oss\-instruct\. In International Conference on Machine Learning, pp\. 52632–52657\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p3.1)\.
- R\. B\. Woodward and R\. Hoffmann \(1969\) The conservation of orbital symmetry\. Angewandte Chemie International Edition in English 8 \(11\), pp\. 781–853\. Cited by: [§3\.4](https://arxiv.org/html/2605.19357#S3.SS4.p1.1)\.
- Z\. Wu, B\. Ramsundar, E\. N\. Feinberg, J\. Gomes, C\. Geniesse, A\. S\. Pappu, K\. Leswing, and V\. Pande \(2018\) MoleculeNet: a benchmark for molecular machine learning\. Chemical science 9 \(2\), pp\. 513–530\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p2.1)\.
- Y\. Xia, P\. Jin, S\. Xie, L\. He, C\. Cao, R\. Luo, G\. Liu, Y\. Wang, Z\. Liu, Y\. Chen, Z\. Guo, Y\. Bai, P\. Deng, Y\. Min, Z\. Lu, H\. Hao, H\. Yang, J\. Li, C\. Liu, J\. Zhang, J\. Zhu, R\. Bi, K\. Wu, W\. Zhang, K\. Gao, Q\. Pei, Q\. Wang, X\. Liu, Y\. Li, H\. Zhu, Y\. Lu, M\. Ma, Z\. Wang, T\. Xie, K\. Maziarz, M\. Segler, Z\. Yang, Z\. Chen, Y\. Shi, S\. Zheng, L\. Wu, C\. Hu, P\. Dai, T\. Liu, H\. Liu, and T\. Qin \(2025\) Nature language model: deciphering the language of nature for scientific discovery\. External Links: 2502\.07527, [Link](https://arxiv.org/abs/2502.07527)\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p1.1)\.
- Z\. Xu, Y\. Liu, Y\. Yin, M\. Zhou, and R\. Poovendran \(2025\) Kodcode: a diverse, challenging, and verifiable synthetic dataset for coding\. In Findings of the Association for Computational Linguistics: ACL 2025, pp\. 6980–7008\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p3.1)\.
- Y\. Yuan, K\. Tang, J\. Shen, M\. Zhang, and C\. Wang \(2024\) Measuring social norms of large language models\. In Findings of the Association for Computational Linguistics: NAACL 2024, pp\. 650–699\. Cited by: [§1](https://arxiv.org/html/2605.19357#S1.p1.1)\.
- D\. Zhang, Z\. Hu, S\. Zhoubian, Z\. Du, K\. Yang, Z\. Wang, Y\. Yue, Y\. Dong, and J\. Tang \(2024\) Sciinstruct: a self\-reflective instruction annotated dataset for training scientific language models\. Advances in Neural Information Processing Systems 37, pp\. 1443–1473\. Cited by: [Appendix C](https://arxiv.org/html/2605.19357#A3.p2.1)\.
- J\. Zhang, W\. Huang, Z\. Ma, O\. Michel, D\. He, T\. Gupta, W\. Ma, A\. Farhadi, A\. Kembhavi, and R\. Krishna \(2025\) Task me anything\. External Links: 2406\.11775, [Link](https://arxiv.org/abs/2406.11775)\. Cited by: [§4](https://arxiv.org/html/2605.19357#S4.p2.1)\.
## Appendix A Details of Proxy Subset Selection
For each data sample $\langle q, a\rangle \in \mathcal{D}_r$, we compute two commonly used intrinsic metrics to characterize its difficulty and quality: \(i\) Hardness Score \($H$\): We utilize open\-weights LLMs to calculate the perplexity \(PPL\) of the answer $a_d$ conditioned on the question $q_d$:

$$H(d) = \exp\left(-\frac{1}{|a_d|}\sum_{i=1}^{|a_d|}\log P(t_i \mid q_d, t_{<i})\right) \tag{1}$$

where $t_i$ represents the $i$\-th token of the answer $a_d$\. \(ii\) Quality Score \($Q$\): We assess the linguistic quality using the Flesch Readability Score\.
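Both scores are simple to compute once the model outputs are available\. The sketch below assumes that the per\-token log\-probabilities $\log P(t_i \mid q_d, t_{<i})$ of the answer have already been obtained from an open\-weights LLM, and it interprets the Flesch Readability Score as the standard Flesch Reading Ease formula\. The syllable counter is a crude vowel\-group heuristic chosen for illustration only; the function names are hypothetical\.

```python
import math
import re


def hardness_score(token_logprobs):
    """Answer perplexity (Eq. 1): exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("the answer must contain at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def count_syllables(word):
    """Heuristic syllable count: number of vowel groups, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Flesch Reading Ease: higher values indicate easier text."""
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        raise ValueError("the text must contain at least one word")
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))


# Hypothetical log-probabilities for a two-token answer.
logprobs = [math.log(0.5), math.log(0.25)]
print(hardness_score(logprobs))
print(flesch_reading_ease("The cat sat."))
```

A confident model assigns high probability to each answer token, which drives $H(d)$ toward 1; larger values mark samples the model finds harder\.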
Our objective is to select $\mathcal{P}_q$ such that the distributions of these scores closely approximate those of $\mathcal{D}_q$\. We formulate this as minimizing the cumulative Wasserstein distance \($\mathcal{W}$\) between the distributions of the subset and the full set:

$$\min_{\mathcal{P}_q}\left(\mathcal{W}\left(P^{\mathcal{D}_q}_{H}, P^{\mathcal{P}_q}_{H}\right) + \mathcal{W}\left(P^{\mathcal{D}_q}_{Q}, P^{\mathcal{P}_q}_{Q}\right)\right) \tag{2}$$

To sample $\mathcal{P}_q$ efficiently, we adopt a cluster\-based strategy following SubLIME Saranathan et al\. \([2025](https://arxiv.org/html/2605.19357#bib.bib20)\)\. Intuitively, we encode all pairs $d = \langle q, a\rangle$ into an embedding space using RoBERTa Liu et al\. \([2019](https://arxiv.org/html/2605.19357#bib.bib21)\)\. We then perform clustering in the embedding space and sample representatives from each cluster\.
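The objective in Eq\. \(2\) and the per\-cluster sampling step can be sketched as below\. The snippet assumes the first\-order Wasserstein distance between empirical one\-dimensional distributions, and it takes cluster labels as given rather than computing RoBERTa embeddings; the paper does not specify the clustering algorithm or the number of representatives per cluster, so `sample_per_cluster` and its parameters are hypothetical\.

```python
import random
from bisect import bisect_right
from collections import defaultdict


def wasserstein_1d(u, v):
    """First-order Wasserstein distance between two empirical 1-D distributions."""
    u_sorted, v_sorted = sorted(u), sorted(v)
    points = sorted(u_sorted + v_sorted)
    distance = 0.0
    for left, right in zip(points, points[1:]):
        # Both empirical CDFs are constant on [left, right).
        cdf_u = bisect_right(u_sorted, left) / len(u_sorted)
        cdf_v = bisect_right(v_sorted, left) / len(v_sorted)
        distance += abs(cdf_u - cdf_v) * (right - left)
    return distance


def selection_objective(full_h, full_q, subset_h, subset_q):
    """Value of Eq. 2 for a candidate subset: W on hardness plus W on quality."""
    return wasserstein_1d(full_h, subset_h) + wasserstein_1d(full_q, subset_q)


def sample_per_cluster(cluster_labels, per_cluster, seed=0):
    """Pick up to `per_cluster` random sample indices from every cluster."""
    rng = random.Random(seed)
    members = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        members[label].append(index)
    chosen = []
    for label in sorted(members):
        pool = members[label]
        chosen.extend(rng.sample(pool, min(per_cluster, len(pool))))
    return sorted(chosen)


# Hypothetical scores and cluster assignments for six samples.
hardness = [1.2, 3.4, 2.2, 5.0, 4.1, 1.9]
quality = [60.0, 45.5, 70.1, 30.2, 52.3, 66.8]
labels = [0, 0, 1, 1, 1, 2]
subset = sample_per_cluster(labels, per_cluster=1)
print(subset)
print(selection_objective(hardness, quality,
                          [hardness[i] for i in subset],
                          [quality[i] for i in subset]))
```

A subset whose score distributions match the full set exactly yields an objective of 0, so lower values indicate a more representative proxy\.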
## Appendix B Details of Ground-Truth Benchmarks
Table [3](https://arxiv.org/html/2605.19357#A2.T3) presents the detailed statistics of the expert-curated ground-truth benchmarks used in our evaluation.

Table 3: Statistics of the ground-truth benchmarks used in our experiments.

| Task | Size (# Samples) |
| --- | --- |
| **Chemistry Domain** | |
| Analytical Chemistry | 152 |
| Inorganic Chemistry | 92 |
| Material Science | 84 |
| Organic Chemistry | 429 |
| Physical Chemistry | 165 |
| Technical Chemistry | 40 |
| **Healthcare Domain** | |
| Virology | 46 |
| Human Aging | 86 |
| Medical Genetics | 54 |
| Anatomy | 79 |
| Nutrition | 179 |
## Appendix C Implementation Details
We fine-tuned LLaMa-3-8B Grattafiori et al. ([2024](https://arxiv.org/html/2605.19357#bib.bib39)) to obtain the tagging model. The training set consists of 50,000 synthetic scientific queries and 30,000 real-world scientific queries. The LLM used to construct the dataset is GPT-4o Achiam et al. ([2023](https://arxiv.org/html/2605.19357#bib.bib40)). The model was fine-tuned for 2 epochs with a learning rate of 2e-5. We conducted experiments on 8 NVIDIA A100 GPUs in a standard Linux environment, using the vLLM Kwon et al. ([2023](https://arxiv.org/html/2605.19357#bib.bib29)) framework for efficient large-scale model inference.

We constructed a comprehensive scientific data corpus by aggregating diverse high-quality instruction-tuning datasets and benchmarks, including SciRIFF Wadden et al. ([2025](https://arxiv.org/html/2605.19357#bib.bib41)), SciInstruct Zhang et al. ([2024](https://arxiv.org/html/2605.19357#bib.bib42)), Mol-Instruct Fang et al. ([2024](https://arxiv.org/html/2605.19357#bib.bib44)), MultiMedQA Singhal et al. ([2023](https://arxiv.org/html/2605.19357#bib.bib43)), SciEval Sun et al. ([2024](https://arxiv.org/html/2605.19357#bib.bib30)), MMLU-Pro Wang et al. ([2024b](https://arxiv.org/html/2605.19357#bib.bib26)), GPQA Rein et al. ([2024](https://arxiv.org/html/2605.19357#bib.bib28)), IfBench Pyatkin et al. ([2025](https://arxiv.org/html/2605.19357#bib.bib37)), and SimpleQA Wei et al. ([2024a](https://arxiv.org/html/2605.19357#bib.bib38)). This collection yielded a total of 2,000,367 data instances. To avoid data leakage, we filtered out any instances that overlap with the expert-curated ground-truth benchmarks used in our experiments.

To facilitate multi-model voting as discussed in Section [2.4](https://arxiv.org/html/2605.19357#S2.SS4) and Section [2.5](https://arxiv.org/html/2605.19357#S2.SS5), we used three advanced models: GPT-5, Claude-opus-4.5, and Gemini-3-pro-preview. We set the voting parameter $K_1=10$. For the proxy subset selection, we set $K_2=100$, so that the scale remains consistent with the expert-curated ground-truth benchmarks (Appendix [B](https://arxiv.org/html/2605.19357#A2)). Furthermore, following the SubLIME Saranathan et al. ([2025](https://arxiv.org/html/2605.19357#bib.bib20)) framework, we employed a cluster-based strategy to sample 100 candidate sets and selected the optimal subset based on the criteria defined in Equation [2](https://arxiv.org/html/2605.19357#A1.E2).

We evaluated 10 LLMs on all comparison benchmarks: Gemini-2.5-flash, GPT-4o, GPT-5-chat, Claude-opus-4.5, Qwen3-235b-a22b-instruct-2507, Qwen3-max, Grok-4-fast-non-reasoning, Kimi-k2-preview, DeepSeek-V3.2, and Mistral-large.
Table 4: Comparison between a greedy retrieval strategy and SciCustom on healthcare benchmarks. Scores are Spearman correlations with the corresponding expert-curated ground-truth rankings.

| Method | Virology | Human aging | Medical genetics | Anatomy | Nutrition |
| --- | --- | --- | --- | --- | --- |
| Greedy Search | 0.21 | 0.34 | 0.16 | 0.24 | 0.31 |
| SciCustom | 0.55 | 0.49 | 0.42 | 0.62 | 0.78 |

Table 5: Ranking consistency analysis of 10 LLMs on 6 chemistry-specific benchmarks. The table compares SciCustom against baseline benchmarks in preserving model rankings, quantified by Spearman correlation coefficients. Higher scores indicate better consistency. Best results are in bold.

| Method | Analytical chemistry | Inorganic chemistry | Material science | Organic chemistry | Physical chemistry | Technical chemistry |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-5 | -0.11 | 0.05 | -0.04 | 0.38 | 0.24 | -0.07 |
| GPT-5-RAG | 0.01 | -0.02 | -0.07 | 0.41 | 0.35 | -0.02 |
| Embedding | -0.34 | -0.59 | -0.39 | 0.11 | -0.73 | -0.41 |
| Embedding-filtering | 0.23 | 0.04 | 0.15 | 0.37 | 0.04 | 0.11 |
| SciCustom | **0.86** | **0.67** | **0.42** | **0.89** | **0.74** | **0.86** |

Table 6: Ranking consistency analysis of 10 LLMs on 5 healthcare-specific benchmarks. The table compares SciCustom against baseline benchmarks in preserving model rankings, quantified by Spearman correlation coefficients. Higher scores indicate better consistency. Best results are in bold.

| Method | Virology | Human aging | Medical genetics | Anatomy | Nutrition |
| --- | --- | --- | --- | --- | --- |
| GPT-5 | 0.25 | 0.20 | 0.09 | 0.11 | 0.52 |
| GPT-5-RAG | 0.31 | 0.26 | 0.02 | 0.09 | 0.48 |
| Embedding | 0.18 | 0.21 | -0.21 | -0.32 | 0.27 |
| Embedding-filtering | 0.24 | 0.34 | 0.13 | 0.26 | 0.51 |
| SciCustom | **0.55** | **0.49** | **0.42** | **0.62** | **0.78** |
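
The scores in Tables 4 to 6 are Spearman correlations between the model ranking induced by a constructed benchmark and the ranking induced by the expert-curated benchmark. A minimal pure-Python sketch of that computation follows; the function names and the example accuracies are hypothetical, and ties are handled with average ranks, which is the usual convention but is not specified in the text.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean 1-based position of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical per-model accuracies on a constructed and a ground-truth benchmark.
constructed = [0.62, 0.71, 0.55, 0.80]
ground_truth = [0.58, 0.69, 0.60, 0.77]
rho = spearman(constructed, ground_truth)
```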
##### Discussion on Non\-Monotonicity and Selection Strategy
We acknowledge that the relevance of data samples within the sorted list $L_{\mathcal{D}'_r}$ does not strictly follow a monotonically decreasing trend, which theoretically challenges the standard binary search assumption. A seemingly more rigorous alternative would be a "Greedy Top-$K$ Selection", where we sequentially scan $L_{\mathcal{D}'_r}$ from the beginning and select the first $K_2$ samples judged as relevant by the voting ensemble. However, our experiments show that this greedy strategy ($K_2=100$) yields a substantially lower correlation with ground-truth benchmarks than the proposed binary search method on every healthcare task (Table [4](https://arxiv.org/html/2605.19357#A3.T4)).

We hypothesize that the top-ranked instances in $L_{\mathcal{D}'_r}$, which share the highest semantic and keyword overlap with the requirement, often correspond to "canonical" textbook knowledge or highly frequent scientific facts. These instances are likely exposed during the pre-training of most LLMs (potential data leakage), leading to a performance ceiling effect where most models achieve near-perfect accuracy. Consequently, these "highly relevant" samples lack the discriminative power to differentiate model capabilities. In contrast, our binary search strategy determines a cutoff based on a broader structural trend. This approach inadvertently bypasses the saturation of the top-most samples and captures a more challenging and diverse distribution of data, thereby offering higher discriminative validity and better alignment with expert rankings.
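
The difference between the two strategies can be sketched as follows. This is only an illustration under stated assumptions: the exact retrieval procedure is defined in Section 2.5 and is not reproduced here, `is_relevant` is a hypothetical stand-in for the voting ensemble's judgment, and the binary search treats relevance as if it switched once from relevant to irrelevant along the sorted list.

```python
def greedy_top_k(sorted_samples, is_relevant, k):
    """Scan from the top of the list and keep the first k samples judged relevant."""
    selected = []
    for sample in sorted_samples:
        if is_relevant(sample):
            selected.append(sample)
            if len(selected) == k:
                break
    return selected


def binary_search_cutoff(sorted_samples, is_relevant):
    """Return the length of the prefix to retain.

    Assumes relevance is roughly 'relevant ... relevant, irrelevant ... irrelevant'.
    When that assumption is violated, the result depends on which indices
    happen to be probed.
    """
    lo, hi = 0, len(sorted_samples)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_relevant(sorted_samples[mid]):
            lo = mid + 1  # boundary lies to the right of mid
        else:
            hi = mid      # boundary lies at mid or to the left
    return lo


# Toy example: items 0..5 are relevant, 6..9 are not.
samples = list(range(10))
judge = lambda s: s < 6
cutoff = binary_search_cutoff(samples, judge)
top3 = greedy_top_k(samples, judge, 3)
```

The greedy scan concentrates on the very top of the list, whereas the cutoff retains a whole prefix from which the proxy subset is later drawn; the cutoff search also needs only a logarithmic number of judge calls in the list length.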
## Appendix D More Experiment Results
To further isolate the contribution of our ontology-driven architecture, we report Spearman correlation results for two alternative baselines. GPT-5-RAG augments the synthetic GPT-5 generator with open-book retrieval from the full corpus, while Embedding-filtering applies the same binary-search cutoff and proxy subset selection used in SciCustom on top of the Embedding candidate pool. As shown in Table [5](https://arxiv.org/html/2605.19357#A3.T5) and Table [6](https://arxiv.org/html/2605.19357#A3.T6), adding retrieval to GPT-5 yields only limited gains over the fully synthetic baseline, while Embedding-filtering improves over raw Embedding but remains below SciCustom on every chemistry and healthcare task. These results show that neither retrieval augmentation alone nor post-retrieval filtering alone is sufficient to match the performance of the full ontology-grounded pipeline, which indicates that the fine-grained structural guidance provided by our ontology-grounded knowledge units is needed to capture complex scientific requirements accurately.
## Appendix E Details of Human Evaluation
### E.1 Human Annotation Guidelines
Objective: You will be presented with 50 multiple-choice questions generated for a specific scientific requirement (e.g., Technical Chemistry). For each question, please provide two binary labels (0 or 1) based on the criteria below.
#### 1\. Relevance Label \(relevant\)
Determine whether the question requires specific knowledge of the target requirement\.
- 1 (Relevant): The question is strictly aligned with the requirement.
- 0 (Irrelevant): The question is off-topic, generic (can be answered by a layperson), or belongs to a distinctly different scientific field.
#### 2\. Correctness Label \(correct\)
Evaluate the scientific accuracy of the MCQs\.
- 1 (Correct): The question and options are scientifically accurate.
- 0 (Incorrect): The question is wrong, the selected option is factually wrong or scientifically flawed, or a significantly better or more accurate option is available among the choices.
### E.2 Human Annotators
The annotation was performed by three Master’s students specializing in AI for Science (AI4Chemistry). Participants were recruited from a university department and were compensated at an hourly rate exceeding the local minimum wage. The results are discussed in Section [3.2](https://arxiv.org/html/2605.19357#S3.SS2.SSS0.Px2).

For the case study (Section [3.4](https://arxiv.org/html/2605.19357#S3.SS4)), we engaged a Ph.D. researcher specializing in AI and Chemistry to assess the generated benchmark. The expert was tasked with categorizing the knowledge units and keywords in the MCQs into three distinct levels of granularity and relevance: Strong Relevance, Relevant, and Weak Relevance. The expert was compensated at a commensurate professional hourly rate.
Prior to the task, all annotators were informed of the data usage scope and provided written informed consent\. The study protocol was reviewed and determined exempt by an ethics review board\.
## Appendix F Automatic Evaluation of Distractor Quality
To validate the quality of LLM-generated distractors, we evaluate 500 generated MCQs with Claude-Opus-4.6 as an independent judge. Each MCQ is scored on a 1–10 scale along two dimensions: Heuristic Transparency Score (HTS), which measures whether distractors can be eliminated by superficial linguistic cues, and Scientific Plausibility Score (SPS), which measures how scientifically relevant and challenging the distractors are. The generated MCQs obtain an average HTS of 8.7 and SPS of 7.9, indicating low heuristic leakage and strong near-miss quality. We additionally construct a context-free baseline where the judge sees only the options and a simplified question, without the supporting scientific context; accuracy drops to 22%, close to random guessing. These results support that our MCQ transformation pipeline produces distractors that are both scientifically plausible and difficult to solve through shallow heuristics alone.
## Appendix G Prompt Template Inventory
Granularity classification of scientific terms

System: You are a helpful AI assistant.

User: Determine whether the given term is a suitable scientific subfield and answer with one of the following categories:

- (moderate): The term is appropriately specific for a scientific subfield. It refers to a category that is neither too general nor too specific for its scientific context. Its scale is similar to the scale of chapter names in subject textbooks.
- (too coarse): The term is overly broad or vague for a scientific subfield. It encompasses a wide range of concepts that could be divided into smaller, more specific subfields.
- (too fine): The term is overly specific and pertains to a very narrow aspect of a scientific subfield. It may be too detailed to serve as a broader category within the discipline.

Answer at the beginning, explain later. The examples are as follows:

Example 1:
Input: term: anatomical entity
Output: (moderate); Explanation: Anatomical entity refers to a category that is sufficiently specific for many biological subfields but not too narrow.

Example 2:
Input: term: nuclear structure
Output: (moderate); Explanation: Nuclear structure is a well-defined category in molecular biology, specific but not too narrow.

Example 3:
Input: term: electronic file status
Output: (too fine); Explanation: The term refers to a very specific technical concept that is too detailed to be considered a scientific subfield.

Example 4:
Input: term: b-lymphocyte
Output: (too fine); Explanation: While important, the term refers to a very specific type of cell, not a broad enough category to encompass a subfield.

Example 5:
Input: term: continuant
Output: (too coarse); Explanation: The term is too general and could refer to a wide variety of objects or concepts, making it too broad for a specific subfield.

Example 6:
Input: term: occurrent
Output: (too coarse); Explanation: Occurrent is overly vague and applies to many concepts, making it too broad for a scientific subfield.

Input: term: {term}.
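
Since the prompt asks the model to put the category first and the explanation afterwards, the label can be read from the start of the response. The sketch below is one possible way to do this; `parse_granularity` is a hypothetical helper, and how the authors actually parse responses is not described.

```python
import re

# The three categories named in the prompt above.
_LABEL_PATTERN = re.compile(r"\s*\((moderate|too coarse|too fine)\)")


def parse_granularity(response):
    """Return the leading category label, or None if the response does not start with one."""
    match = _LABEL_PATTERN.match(response)
    return match.group(1) if match else None


label = parse_granularity("(too fine); Explanation: The term refers to a single cell type.")
```

Returning `None` for malformed responses lets the caller decide whether to retry or discard the term rather than guessing a label.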
Query generation

System: You are a helpful AI assistant.

User:

Low Complexity: You are a {persona}. Generate a user query containing the following keywords: {keywords}. Do not introduce other scientific entities or topics. Only return the query.

High Complexity: You are a {persona}. Generate a user query containing the following keywords: {keywords}. Do not introduce other scientific entities or topics. Make the query long and complex. Only return the query.

Scientific Personas:

- Astrophysicist
- Marine Biologist
- AI Researcher
- Molecular Geneticist
- Quantum Physicist
- Environmental Chemist
- Neuroscientist
- Ecologist
- Bioinformatician
- Pharmacologist
- Geologist
- Biomedical Engineer
- Mathematical Modeler
- Virologist
- Behavioral Psychologist
- Data Scientist
- Theoretical Chemist
- Climate Scientist
- Structural Biologist
- Robotics Engineer
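
As an illustration of how the two templates and the persona list might be combined, the sketch below fills a template with a randomly drawn persona. The helper name `build_query_prompt`, the comma-separated keyword formatting, and the uniform persona sampling are assumptions; only a subset of the personas is listed for brevity.

```python
import random

# Subset of the personas listed above.
PERSONAS = ["Astrophysicist", "Marine Biologist", "Virologist", "Theoretical Chemist"]

_BASE = ("You are a {persona}. Generate a user query containing the following "
         "keywords: {keywords}. Do not introduce other scientific entities or topics. ")
TEMPLATES = {
    "low": _BASE + "Only return the query.",
    "high": _BASE + "Make the query long and complex. Only return the query.",
}


def build_query_prompt(keywords, complexity, rng):
    """Fill the low- or high-complexity template with a randomly chosen persona."""
    persona = rng.choice(PERSONAS)
    return TEMPLATES[complexity].format(persona=persona, keywords=", ".join(keywords))


prompt = build_query_prompt(["ligand field", "oxidation state"], "high", random.Random(0))
```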
Benchmark requirements and descriptions

Chemistry benchmarks:

- Analytical chemistry: Generate questions that test knowledge and reasoning in analytical chemistry. The questions should assess understanding of how experimental analytical signals (e.g., NMR, IR, UV–Vis, mass spectra, chromatographic behavior, titration curves) relate to molecular structure, composition, concentration, or purity. Focus on conceptual interpretation and chemical reasoning rather than numerical data processing or instrument-specific operating procedures.
- Inorganic chemistry: Generate questions that test core knowledge and reasoning in inorganic chemistry. The questions should focus on electronic structure, oxidation states, coordination geometry, ligand field effects, symmetry, and periodic trends in inorganic systems. Emphasize conceptual understanding of structure–property relationships rather than memorization of isolated facts.
- Material science: Generate questions that evaluate understanding in materials science. The questions should assess how atomic or microstructural features (e.g., crystal structure, defects, phases, interfaces) determine macroscopic properties such as mechanical strength, electrical conductivity, or thermal behavior. Focus on structure–property reasoning rather than detailed synthesis protocols.
- Organic chemistry: The organic chemistry benchmark assesses a wide range of skills on reasoning about chemical structures and reaction pathways, such as Reaction Mechanism Identification, Product Prediction, NMR Signal Prediction, Number of Isomers, Polymer Chemistry, Nomenclature Conversion and Organic Reactivity.
- Physical chemistry: Generate questions that test conceptual understanding in physical chemistry. The questions should assess reasoning about thermodynamics, kinetics, equilibrium, and molecular-level physical principles. Emphasize qualitative reasoning about trends and relationships rather than explicit numerical calculation.
- Technical chemistry: Generate questions that assess knowledge in technical and industrial chemistry. The questions should focus on chemical processes at scale, such as reactor behavior, process optimization, safety considerations, and material or energy efficiency. Emphasize reasoning about system-level behavior rather than detailed engineering design.

Healthcare benchmarks:

- Virology: Generate questions that test conceptual understanding in virology. The questions should assess knowledge of viral structure, replication cycles, genome organization, and interactions with host cells and immune systems. Avoid clinical treatment guidelines or laboratory diagnostic protocols.
- Human aging: Generate questions that probe understanding of biological mechanisms of human aging. The questions should focus on molecular, cellular, and systemic processes associated with aging, such as genomic stability, cellular senescence, metabolic regulation, and tissue-level decline. Emphasize mechanistic reasoning rather than epidemiological statistics.
- Medical genetics: Generate questions that test reasoning in medical genetics. The questions should assess understanding of inheritance patterns, genotype–phenotype relationships, penetrance, and genetic variation. Focus on conceptual genetic reasoning rather than clinical decision-making.
- Anatomy: Generate questions that evaluate knowledge of human anatomy. The questions should focus on the identification, spatial relationships, and functional roles of anatomical structures. Avoid surgical procedures or pathological conditions.
- Nutrition: Generate questions that assess understanding of nutritional science. The questions should focus on the biological roles of macro- and micronutrients, their involvement in metabolism, and the physiological consequences of deficiency or imbalance. Emphasize mechanistic understanding over dietary recommendations.
Voting-based relevant tag selection

System: You are an expert in {domain}. Your task is to map a benchmark description to the most relevant technical tags

User:

Task: Given a {domain} benchmark description, identify and rank the most relevant tags from a candidate list.

Benchmark Description: {description}

Candidate Tags (sorted by frequency, lowest frequency first; lower frequency usually indicates higher specificity): {Tag list}.

Ranking Principles: Rank tags from highest to lowest relevance to the benchmark, following these rules:

- Relevance First: A tag is relevant if it directly reflects the core concepts, tasks, data modalities, or evaluation focus of the benchmark. Irrelevant or weakly related tags should not be selected.
- Specificity as a Tie-breaker: If multiple tags are similarly relevant, rank the more specific and narrowly scoped tag higher. Prefer concrete technical terms (e.g., “Histone Acetylation Prediction”) over broader categories (e.g., “Epigenetics”).
- Avoid Overly Generic Tags: High-level or generic tags (e.g., “biological process”, “chemical entity”) should only be selected if no more specific alternative applies.
- Frequency Awareness: When relevance and specificity are comparable, prefer lower-frequency tags, as they tend to be more precise.

Output Requirements: Return a single list of tags, sorted from most to least relevant. For efficiency, return only the top 100 tags (or fewer if fewer are relevant). Do not include explanations, scores, or extra text; output the ranked list only.
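
Each of the voting models returns a ranked tag list from this prompt, and the lists must then be merged into a consensus set of $K_1$ tags. The aggregation rule itself is defined in Section 2.4 and is not reproduced here; the sketch below assumes one plausible rule (count how many models returned a tag, then break ties by mean rank position) purely for illustration, and `consensus_tags` is a hypothetical name.

```python
from collections import defaultdict


def consensus_tags(rankings, k1=10):
    """Merge ranked tag lists from several models into the top-k1 consensus tags.

    Assumed rule: more votes first, then lower mean rank, then alphabetical order.
    Each ranking is assumed to contain no duplicate tags.
    """
    votes = defaultdict(int)
    rank_sum = defaultdict(int)
    for ranking in rankings:
        for position, tag in enumerate(ranking):
            votes[tag] += 1
            rank_sum[tag] += position
    ordered = sorted(votes, key=lambda t: (-votes[t], rank_sum[t] / votes[t], t))
    return ordered[:k1]


# Hypothetical outputs of three voting models.
model_rankings = [
    ["ligand field theory", "coordination geometry", "periodic trends"],
    ["coordination geometry", "ligand field theory"],
    ["coordination geometry", "oxidation state"],
]
top = consensus_tags(model_rankings, k1=3)
```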
Benchmark generation

System: You are an expert in {domain} and tasked with constructing a high-quality benchmark to assess the domain-specific knowledge abilities of large language models. Please return the benchmark in a JSON format.

User: Your task is to generate exactly {K} single-choice questions in the domain of {domain}.

Detailed description of this domain: {description}

The questions should:

1. Focus on core concepts, expert-level knowledge, and non-trivial reasoning in this domain.
2. Avoid trivial definitions, purely factual memorization, or overly ambiguous questions.
3. Include a mix of:
   - Conceptual understanding
   - Mechanism or principle-based reasoning
   - Application or scenario-based reasoning
4. Be answerable without external tools, but not solvable by surface-level pattern matching.

Question format:

1. Each question must have 4–5 options.
2. Options should be concise and mutually exclusive.
3. Each question have only one correct answers.

Output format (STRICT): Return only a JSON array of length {K}. Each element must have the following structure:

`{{"query": "<question text with options labeled A, B, C, D (and E if applicable)>","answer": "<correct option label>"}}`
MCQ transformation

System: You are an expert in {domain} and tasked with curating a rigorous benchmark to evaluate the capabilities of Large Language Models. Please return the processed entry in a JSON format.

User: Your task is to convert the following raw problem content into a standardized single-choice question suitable for LLM evaluation.

Raw problem: {input_content}

Conversion Guidelines:

1. Format Adaptation:
   - If the input is already a multiple-choice question: Preserve the original stem and options exactly. Ensure the formatting aligns with the output requirements.
   - If the input is not a multiple-choice question: Convert it into a single-choice question by generating 3–4 incorrect options (distractors).
2. Distractor Engineering:
   - Avoid trivial errors, logical fallacies that are easily filtered, or clearly unrelated concepts.
3. Fidelity & Difficulty:
   - Strict adherence to the factual truth and reasoning logic of the original content is required.
   - Do not simplify the problem complexity. The resulting MCQ must maintain the same discriminative power as the original input.
4. Exclusivity: Ensure there is exactly one indisputably correct option.

Question format:

1. The final output must contain 4–5 options (A, B, C, D, [E]).
2. Options should be concise and mutually exclusive.

Output format (STRICT): Return only a single JSON object. The object must have the following structure:

`{{"query": "<question stem followed by options labeled A, B, C, D (and E if applicable), separated by newlines>","answer": "<correct option label, e.g., ’A’>"}}`
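
Because both generation prompts demand strict JSON output, a downstream check is a natural companion. The sketch below validates a single object in the format requested by the MCQ transformation prompt; `parse_mcq` is a hypothetical helper, and the assumption that each option starts its own line as `A.`, `A)` or `A:` goes beyond what the prompt specifies.

```python
import json
import re

# Assumed option layout: a label A-E at the start of a line, followed by '.', ')' or ':'.
_OPTION_PATTERN = re.compile(r"^[ \t]*([A-E])[.):]", flags=re.MULTILINE)


def parse_mcq(raw):
    """Parse and validate one MCQ object; raise ValueError if it breaks the format."""
    item = json.loads(raw)
    if not isinstance(item, dict) or set(item) != {"query", "answer"}:
        raise ValueError("expected a JSON object with exactly 'query' and 'answer'")
    if not isinstance(item["query"], str) or not isinstance(item["answer"], str):
        raise ValueError("'query' and 'answer' must be strings")
    labels = _OPTION_PATTERN.findall(item["query"])
    if len(labels) not in (4, 5) or labels != list("ABCDE")[:len(labels)]:
        raise ValueError("expected 4-5 options labeled consecutively from A")
    answer = item["answer"].strip().strip("'\"’‘")  # tolerate a quoted label
    if answer not in labels:
        raise ValueError("answer label does not match any option")
    return {"query": item["query"], "answer": answer, "options": labels}


raw = json.dumps({
    "query": "Which gas is most abundant in dry air?\nA. Oxygen\nB. Nitrogen\nC. Argon\nD. Carbon dioxide",
    "answer": "B",
})
mcq = parse_mcq(raw)
```

Rejecting malformed entries outright, rather than repairing them, keeps the check simple; whether the authors retry, repair, or drop such outputs is not stated.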
## Appendix H LLMs Usage
We adhere to the ACL Code of Ethics. We use large language models solely for polishing writing. All scientific contributions remain entirely our own.