What Do Biomedical NER and Entity Linking Benchmarks Measure? A Corpus-Centric Diagnostic Framework
Summary
This paper presents a corpus-centric diagnostic framework for analyzing biomedical NER and EL benchmarks, revealing substantial differences across nine corpora and arguing that standard statistics are insufficient for characterizing evaluation demands.
# What Do Biomedical NER and Entity Linking Benchmarks Measure? A Corpus-Centric Diagnostic Framework

Source: [https://arxiv.org/html/2605.20537](https://arxiv.org/html/2605.20537)

Robert Leaman (Robert.Leaman@nih.gov), Rezarta Islamaj (Rezarta.Islamaj@nih.gov), Zhiyong Lu (Zhiyong.Lu@nih.gov)

National Library of Medicine, Bethesda, MD

Robert Leaman and Rezarta Islamaj contributed equally.

###### Abstract

Biomedical named entity recognition (NER) and entity linking (EL) strongly depend on annotated corpora, but the utility of these resources for benchmarking is often assumed rather than characterized. We present a corpus-centric framework for diagnosing benchmark-relevant properties directly from corpus annotations, concept links, train-test splits, document metadata, and terminology mappings. The framework organizes standardized statistics into five families: (1) scale, density and label distribution, (2) lexical and conceptual structure, (3) train-test overlap, (4) metadata composition, and (5) terminology coverage where applicable. Applying the framework to nine corpora spanning diseases, chemicals, and cell types, we find that corpus properties can differ substantially, even when the corpora address the same apparent task. We find differences in the evaluation signal they provide, the generalization demands they impose, the degree of train-test reuse they permit, and the regions of biomedical literature and concept space they represent. These differences suggest that commonly reported corpus statistics can be insufficient to characterize what biomedical NER and EL benchmarks evaluate. We argue that corpus-centric diagnostics provide a practical framework for analyzing corpora beyond surface descriptors such as corpus size and entity type, for identifying potential transfer risks, and for interpreting the scope of benchmarking conclusions. We release the framework as open-source code (https://github.com/NLM-DIR/CorpusBenchmarking) with an interactive dashboard to support reproducing our analyses and characterizing additional corpora.

## 1 Introduction

Extracting structured information from biomedical literature requires systems to identify entities (e.g., genes, diseases, chemicals) and link them to standardized identifiers. These grounding tasks, named entity recognition (NER) and entity linking (EL), remain critical in the era of large language models (LLMs) to ensure outputs are auditable, comparable, and reusable. Manually annotated corpora serve both as training data and as evaluation benchmarks (Kim et al., [2003](https://arxiv.org/html/2605.20537#bib.bib19); Collier et al., [2004](https://arxiv.org/html/2605.20537#bib.bib5); Morgan et al., [2008](https://arxiv.org/html/2605.20537#bib.bib27); Lu et al., [2011](https://arxiv.org/html/2605.20537#bib.bib24); Wei et al., [2013](https://arxiv.org/html/2605.20537#bib.bib39); Doğan et al., [2014](https://arxiv.org/html/2605.20537#bib.bib8); Krallinger et al., [2015](https://arxiv.org/html/2605.20537#bib.bib20); Li et al., [2016](https://arxiv.org/html/2605.20537#bib.bib21); Islamaj et al., [2021b](https://arxiv.org/html/2605.20537#bib.bib16), [a](https://arxiv.org/html/2605.20537#bib.bib17); Bada et al., [2012](https://arxiv.org/html/2605.20537#bib.bib4); Herrero-Zazo et al., [2013](https://arxiv.org/html/2605.20537#bib.bib12); Wei et al., [2016](https://arxiv.org/html/2605.20537#bib.bib40); Miranda-Escalada et al., [2023](https://arxiv.org/html/2605.20537#bib.bib26)).
When used as benchmarks, these corpora function as measurement instruments: the key question is not only whether annotations are correct, but what capabilities the benchmark tests and whether conclusions transfer to intended use cases. This distinction matters because benchmark utility is task- and domain-dependent. A corpus can be carefully annotated yet too narrow, homogeneous, or leaky across splits to support informative evaluation. Biomedical natural language processing (NLP) values rare, specialized, and emerging concepts, making evaluation sensitive to which entities, subdomains, time periods, and document types are represented. Without characterizing the *corpus domain* and the target *application domain*, it is difficult to distinguish generalization issues from corpus-specific artifacts, leakage, or domain mismatch.

Benchmark corpora have primarily been compared by size, entity type, or reported system performance. These descriptors do not reflect overlap risks, lexical difficulty, domain bias, or concept coverage. To address this gap, we introduce a corpus-centric framework that computes standardized statistics directly from annotations, concept links, corpus splits, metadata, and terminologies (Figure [1](https://arxiv.org/html/2605.20537#S1.F1)). These statistics provide a multidimensional analysis for PubMed- and PMC-based NER and EL corpora: density indicates how much evaluation signal is available; lexical and conceptual variation indicate what kinds of generalization are required; overlap reveals leakage risk; metadata composition characterizes the literature represented; and terminology coverage indicates which parts of the concept space are represented. Applied to nine corpora spanning diseases, chemicals, and cell types, the framework shows that resources similar by task label or size can differ substantially across these signals.
Our contributions are: (1) a corpus-as-instrument framing for biomedical NER and EL benchmarks; (2) a practical framework of corpus-centric diagnostics; (3) an analysis showing how structural differences inform evaluation sensitivity, leakage risk, coverage, and transferability; and (4) open-source code and an interactive dashboard for reproducing our results and for analyzing new corpora.

Figure 1: Corpus diagnostic framework. Entity-annotated corpora are converted into a common representation, enabling computation of statistics over annotations, identifiers, and metadata. These statistics characterize scale and density, lexical and conceptual variation, train-test overlap, metadata composition, and terminology coverage, enabling principled comparison of biomedical NER and EL benchmarks.

## 2 Related Work

Biomedical entity annotation corpora have proliferated over the years, yet they are routinely treated as benchmarks without any systematic analysis of what they actually measure. Early efforts such as GENIA (Kim et al., [2003](https://arxiv.org/html/2605.20537#bib.bib19)) established large-scale manual mention annotation with fine-grained semantic categories, later simplified for shared tasks such as JNLPBA (Collier et al., [2004](https://arxiv.org/html/2605.20537#bib.bib5)). Subsequent corpora introduced entity normalization, typically targeting a single entity type: diseases (NCBI Disease), chemicals (CHEMDNER, BC5CDR), genomic variants (tmVar), and genes (BioCreative challenges) (Doğan et al., [2014](https://arxiv.org/html/2605.20537#bib.bib8); Krallinger et al., [2015](https://arxiv.org/html/2605.20537#bib.bib20); Li et al., [2016](https://arxiv.org/html/2605.20537#bib.bib21); Wei et al., [2013](https://arxiv.org/html/2605.20537#bib.bib39); Morgan et al., [2008](https://arxiv.org/html/2605.20537#bib.bib27)).
Resources such as CRAFT (Bada et al., [2012](https://arxiv.org/html/2605.20537#bib.bib4)) broadened the scope to multi-entity, full-text annotation. More recent corpora, including NLM-Chem, BioRED, and CellLink, have further expanded document and entity coverage and incorporate richer annotation structures such as relations (Islamaj et al., [2021b](https://arxiv.org/html/2605.20537#bib.bib16), [2024](https://arxiv.org/html/2605.20537#bib.bib18); Rotenberg et al., [2026](https://arxiv.org/html/2605.20537#bib.bib31)). Despite being widely used together, these corpora vary considerably in document type, annotation density, normalization support, temporal range, and domain focus. These differences are rarely examined in terms of what each benchmark actually evaluates.

NLP research has shown that dataset properties can distort benchmark interpretation. Work on saturation motivates aggregated benchmarks such as GLUE and SuperGLUE (Wang et al., [2018](https://arxiv.org/html/2605.20537#bib.bib38), [2019](https://arxiv.org/html/2605.20537#bib.bib37)); work on artifacts, leakage, and memorization shows that apparent improvements can reflect shortcuts or overlap rather than the intended capability (Gururangan et al., [2018](https://arxiv.org/html/2605.20537#bib.bib11); Liang et al., [2023](https://arxiv.org/html/2605.20537#bib.bib22); Tutubalina et al., [2020](https://arxiv.org/html/2605.20537#bib.bib34)); and HELM emphasizes multi-metric evaluation (Liang et al., [2023](https://arxiv.org/html/2605.20537#bib.bib22)). Biomedical suites such as BLUE, BLURB, and BigBIO standardize cross-task evaluation (Peng et al., [2019](https://arxiv.org/html/2605.20537#bib.bib29); Gu et al., [2021](https://arxiv.org/html/2605.20537#bib.bib10); Fries et al., [2022](https://arxiv.org/html/2605.20537#bib.bib9)), but generally treat corpora as fixed inputs rather than explaining how corpus properties shape validity or transferability.
Annotation quality measures, particularly inter-annotator agreement (IAA), assess consistency but not benchmark scope. Prior work distinguishes agreement across span boundaries, labels, and concept links (Artstein and Poesio, [2008](https://arxiv.org/html/2605.20537#bib.bib3)), recommends F1 for span-based tasks lacking a well-defined negative class (Hripcsak and Rothschild, [2005](https://arxiv.org/html/2605.20537#bib.bib14)), and notes that disagreement may reflect ambiguity, error, or guideline limitations (Aroyo and Welty, [2015](https://arxiv.org/html/2605.20537#bib.bib2); Uma et al., [2021](https://arxiv.org/html/2605.20537#bib.bib35)). High agreement is necessary but not sufficient: simplifying annotation can increase agreement while removing realistic ambiguity (Hovy and Lavid, [2010](https://arxiv.org/html/2605.20537#bib.bib13)). Corpus papers often report counts and distributions, but these statistics are rarely organized around evaluation claims. Our framework connects such descriptions to overlap, memorization, domain shift, annotation scope, and terminology coverage.

## 3 Methods

### 3.1 Framework Overview and Representation

Our framework characterizes NER and EL corpora through four stages: conversion to a shared representation, filtering, metric computation, and visualization. Corpora are standardized into documents containing text, optional metadata, and annotations (spans, surface forms, labels, and linked concept identifiers). This design supports both NER-only and NER+EL PubMed- or PMC-based corpora. NER-only corpora are evaluated on text-, span-, and mention-level statistics, while EL-supported datasets additionally yield concept-level diagnostics. Metrics are computed over configurable corpus bundles, comparison suites, entity scopes, and train/dev/test partitions to enable interpretable comparisons across heterogeneous datasets.
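
The shared representation described above can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the class and field names (`Annotation`, `Document`, `concept_ids`, `supports_linking`) are hypothetical and are not taken from the released code, whose actual schema may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Annotation:
    """One annotated entity mention (hypothetical schema, for illustration only)."""
    start: int                           # character offset where the span begins
    end: int                             # character offset where the span ends (exclusive)
    surface: str                         # surface form as it appears in the text
    label: str                           # entity type label, e.g. "CellType"
    concept_ids: Tuple[str, ...] = ()    # linked identifiers; empty for NER-only corpora


@dataclass
class Document:
    """A document with text, a split assignment, annotations, and optional metadata."""
    doc_id: str
    text: str
    split: str                           # "train", "dev", or "test"
    annotations: List[Annotation] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)  # e.g. journal, year


def supports_linking(docs: List[Document]) -> bool:
    """Return True if any annotation carries a concept identifier (NER+EL corpus)."""
    return any(a.concept_ids for d in docs for a in d.annotations)


# Toy document; the mention and identifier echo an example quoted in Section 4.2.
doc = Document(
    doc_id="d1",
    text="resting NK cells were sorted",
    split="train",
    annotations=[Annotation(0, 16, "resting NK cells", "CellType", ("CL:0000623",))],
)
```

Keeping `concept_ids` optional is one way to let NER-only and NER+EL corpora share a single structure, so that concept-level diagnostics can be skipped when no identifiers are present.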
### 3.2 Corpora Analyzed

We apply the framework to nine corpora spanning diverse entity types (e.g., diseases, chemicals, cell types) and document scopes (abstracts, captions, full-text): AnatEM (Pyysalo and Ananiadou, [2014](https://arxiv.org/html/2605.20537#bib.bib30)), BC5CDR (Li et al., [2016](https://arxiv.org/html/2605.20537#bib.bib21)), BioID (Arighi et al., [2017](https://arxiv.org/html/2605.20537#bib.bib1)), CHEMDNER (Krallinger et al., [2015](https://arxiv.org/html/2605.20537#bib.bib20)), CRAFT (Bada et al., [2012](https://arxiv.org/html/2605.20537#bib.bib4)), CellLink (Rotenberg et al., [2026](https://arxiv.org/html/2605.20537#bib.bib31)), JNLPBA (Collier et al., [2004](https://arxiv.org/html/2605.20537#bib.bib5)), NCBI-Disease (Doğan et al., [2014](https://arxiv.org/html/2605.20537#bib.bib8)), and NLM-Chem (Islamaj et al., [2021b](https://arxiv.org/html/2605.20537#bib.bib16)). Where public test data were unavailable or original annotation layers were altered, we utilized the closest documented subset and note these limitations alongside the relevant results.

### 3.3 Diagnostic Statistics

The framework computes corpus statistics across five families to diagnose benchmark properties prior to system evaluation:

- Scale, Density, and Label Distribution: We compute total documents, tokens, annotations, and unique mentions/identifiers per document.
- Lexical and Conceptual Structure: For normalized corpora, we measure mention ambiguity (the number of distinct label/link pairs mapped to a single surface form) and identifier variation (the number of distinct surface forms mapped to a label/link pair). These distinguish a benchmark's demand for contextual disambiguation versus its demand for recognizing diverse lexical realizations.
- Train-Test Overlap: To assess leakage and memorization risk, we compute Jaccard overlap between train and test splits at four abstraction levels: general token vocabulary, tokens inside entity mentions, exact mention strings, and concept identifiers.
- Metadata Composition: We profile the represented literature slice via temporal statistics (publication year ranges and distributions) and journal diversity (unique journals and top-journal concentration). We derive broad topic profiles from article Medical Subject Headings (MeSH) (Lipscomb, [2000](https://arxiv.org/html/2605.20537#bib.bib23)) topics where available, falling back to NLM Catalog MeSH journal topics if necessary.
- Terminology-Aware Coverage: For diseases and chemicals, we link identifiers to their respective concepts in MeSH, MONDO (Vasilevsky et al., [2026](https://arxiv.org/html/2605.20537#bib.bib36)), or ChEBI (Malik et al., [2026](https://arxiv.org/html/2605.20537#bib.bib25)); for cell types, we support Cell Ontology (CL) (Tan et al., [2026](https://arxiv.org/html/2605.20537#bib.bib33)). We quantify vocabulary coverage by analyzing the distribution of annotations across high-level branches and compute hierarchy depth as a proxy for concept specificity.

### 3.4 Implementation

The framework is implemented as an open-source, YAML-configurable Python pipeline that outputs structured JSON statistics. It includes acquisition specifications for downloading, extracting, converting, caching, and validating expected corpus files.
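
To make the ambiguity and variation statistics of Section 3.3 concrete, the following Python sketch computes both as means over a list of mentions. It is an illustration rather than the released implementation: the function name `ambiguity_and_variation` and the toy identifiers are hypothetical, and surface forms are compared as exact strings with no case folding, which is an assumption the text does not confirm.

```python
from collections import defaultdict
from statistics import mean


def ambiguity_and_variation(mentions):
    """Compute mean ambiguity and mean variation.

    mentions: iterable of (surface_form, label, concept_id) triples.
    Ambiguity = mean number of distinct label/link pairs per unique surface form.
    Variation = mean number of distinct surface forms per label/link pair.
    Assumption: surface forms are matched exactly (no normalization).
    Raises statistics.StatisticsError if `mentions` is empty.
    """
    pairs_by_surface = defaultdict(set)
    surfaces_by_pair = defaultdict(set)
    for surface, label, concept_id in mentions:
        pair = (label, concept_id)
        pairs_by_surface[surface].add(pair)
        surfaces_by_pair[pair].add(surface)
    ambiguity = mean(len(pairs) for pairs in pairs_by_surface.values())
    variation = mean(len(forms) for forms in surfaces_by_pair.values())
    return ambiguity, variation


# Toy data with made-up identifiers (C1, D1, D2), not drawn from any corpus.
toy_mentions = [
    ("NK cells", "CellType", "C1"),
    ("natural killer cells", "CellType", "C1"),
    ("NK cell", "CellType", "C1"),
    ("MS", "Disease", "D1"),
    ("MS", "Disease", "D2"),
]
ambiguity, variation = ambiguity_and_variation(toy_mentions)
print(f"ambiguity={ambiguity:.2f} variation={variation:.2f}")
```

In the toy data, the string "MS" maps to two label/link pairs, which raises ambiguity above 1.0, while the three spellings of the first concept raise variation; this mirrors the distinction drawn above between contextual disambiguation and recognition of diverse lexical realizations.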
Input support includes BioC XML (Comeau et al., [2013](https://arxiv.org/html/2605.20537#bib.bib6), [2019](https://arxiv.org/html/2605.20537#bib.bib7)), PubTator (Wei et al., [2024](https://arxiv.org/html/2605.20537#bib.bib41)), BRAT/standoff (Stenetorp et al., [2012](https://arxiv.org/html/2605.20537#bib.bib32)), and Knowtator (Ogren, [2006](https://arxiv.org/html/2605.20537#bib.bib28)) annotations, with registry-based extension points for additional loaders and metrics. Ontologies in OBO format are directly supported as terminology sources. The accompanying self-contained HTML/JavaScript dashboard (https://nlm-dir.github.io/CorpusBench-marking/dashboard.html) combines scale, overlap, metadata, terminology, and entity-scope views to reproduce our analyses and evaluate new corpora.

## 4 Results

We use the nine corpora to illustrate how corpus-centric diagnostics clarify the evaluation role of a benchmark. The goal is not to rank corpora, but to demonstrate that datasets with similar task labels often function as different measurement instruments: they expose systems to different volumes of evaluation signal, different forms of lexical and conceptual generalization, different leakage risks, and different regions of the biomedical literature and concept space.

### 4.1 Scale, annotation density, lexical and conceptual variation

Table [1](https://arxiv.org/html/2605.20537#S4.T1) reports statistics for nine heterogeneous biomedical corpora. These measurements show that corpora differ fundamentally in the structural nature of the evaluation signal they provide. Annotation density varies widely across corpora, reflecting differences in document scope, annotation unit, and entity scope. Dense full-text corpora such as NLM-Chem and CRAFT provide many labeled decisions per article, whereas passage-, abstract-, and caption-based corpora distribute fewer decisions across more sampled text units.
This raw density is useful because it affects the number of system decisions contributing to an evaluation estimate. However, annotations from the same full-text article are not necessarily independent: repeated mentions, recurring identifiers, and section-specific language can increase decision volume without proportionally increasing lexical, conceptual, or contextual diversity. Density should therefore be interpreted as signal concentration, not as a direct measure of task difficulty or benchmark quality. In this sense, full-text corpora evaluate behavior over long-document contexts and repeated real-world usage, while shorter-unit corpora may provide broader sampling of independent contexts per annotation.

Concept-level diagnostics further define what the instrument is calibrated to measure. Variation ranges from 1.48 surface forms per label/link pair in BioID to 3.74 in CellLink, indicating differing demands on lexical generalization.

Table 1: Basic statistics for each corpus. E = number of annotated entity types; Men./doc = unique mention strings per document; IDs/doc = unique concept identifiers per document. CellLink values reflect the currently released annotated train and development partitions.

Table 1 notes:

- a) Ambiguity: mean label/link pairs per unique mention string. Values near 1.0 indicate most mentions map to a single label/link pair.
- b) Variation: mean surface forms per label/link pair. Reported only for corpora with concept-level identifiers; AnatEM, CHEMDNER, and JNLPBA are excluded (n/a).
- c) BioID and CRAFT link entity types to many ontology identifier resources.

### 4.2 Train-Test Overlap

Figure [2](https://arxiv.org/html/2605.20537#S4.F2) reports train-test Jaccard overlap at four levels: token vocabulary, mention tokens, exact mention strings, and concept identifiers. These levels distinguish increasingly task-specific forms of reuse.
Across all corpora, token vocabulary overlap is higher than mention-token overlap, which is, in turn, higher than exact mention string overlap. The slope of this drop varies and is itself informative. JNLPBA falls from 35.9% token overlap to 6.6% mention-string overlap, indicating that its test split contains relatively little exact entity-form reuse. CellLink presents a different structural pattern: its mention-token overlap (27.9%) is much higher than its exact mention-string overlap (9.1%), while identifier overlap is higher still (40.6%). This profile suggests compositional novelty, where new cell-population names are constructed from familiar component tokens and mapped to familiar concepts. Examples include "CD4+ T cells" (CL:0000624), "resting CD4 memory T cells" (CL:0000897), and "resting NK cells" (CL:0000623), which combine recognizable modifiers with cell-type heads.

Identifier overlap adds a concept-level view unavailable from strings alone. Among normalized corpora, identifier overlap ranges from 14.2% in BioID to 40.6% in CellLink, a range wider than the corresponding mention string overlaps, and with a different ordering. CellLink's high identifier overlap paired with low string overlap means it tests the ability to map novel text to relatively familiar concepts. Conversely, NCBI-Disease's low identifier overlap requires systems to generalize to genuinely novel conceptual territory.

Figure 2: Train-test overlap across nine biomedical corpora. All values are Jaccard similarity (%) between training and test splits. Token vocab: unique vocabulary tokens shared between splits. Mention tokens: unique tokens within entity mention spans. Mention strings: exact entity surface forms shared between splits. Identifiers: concept identifiers shared between splits.
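
The four-level overlap profile discussed in this section can be sketched in a few lines of Python. This is a hedged illustration, not the framework's code: the helper names (`jaccard`, `level_sets`, `overlap_profile`) are hypothetical, tokenization is assumed to be lowercase whitespace splitting because the text does not name a tokenizer, and the toy test mention and its identifier assignment are invented to show the pattern.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined here as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tokenize(text):
    # Assumption: lowercase whitespace tokenization (the tokenizer is not specified).
    return text.lower().split()


def level_sets(docs):
    """docs: iterable of (text, [(mention_string, concept_id_or_None), ...])."""
    levels = {"token_vocab": set(), "mention_tokens": set(),
              "mention_strings": set(), "identifiers": set()}
    for text, mentions in docs:
        levels["token_vocab"].update(tokenize(text))
        for surface, concept_id in mentions:
            levels["mention_tokens"].update(tokenize(surface))
            levels["mention_strings"].add(surface)
            if concept_id is not None:       # NER-only mentions carry no identifier
                levels["identifiers"].add(concept_id)
    return levels


def overlap_profile(train_docs, test_docs):
    """Return train-test Jaccard overlap (%) at each of the four abstraction levels."""
    train, test = level_sets(train_docs), level_sets(test_docs)
    return {level: 100.0 * jaccard(train[level], test[level]) for level in train}


# Toy splits. The test mention and its identifier assignment are invented for illustration.
train_docs = [
    ("CD4+ T cells were isolated", [("CD4+ T cells", "CL:0000624")]),
    ("resting NK cells were sorted", [("resting NK cells", "CL:0000623")]),
]
test_docs = [
    ("resting CD4+ T cells were isolated", [("resting CD4+ T cells", "CL:0000624")]),
]
profile = overlap_profile(train_docs, test_docs)
print(profile)
```

In this toy case the test mention string never occurs in training, yet all of its tokens and its identifier do, which reproduces in miniature the compositional-novelty profile described for CellLink: high mention-token and identifier overlap alongside low exact-string overlap.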
### 4.3 Metadata Composition

Temporal coverage differs sharply: CellLink is most recent, spanning 2019-2025; CHEMDNER is nearly a single-year snapshot (97.7% from 2013); CRAFT spans 2001-2007; NCBI-Disease is predominantly pre-2000 (90.8%); and BC5CDR is the broadest, spanning 1968-2016. The journal distribution shows similar variation in domain. BC5CDR is by far the most venue-diverse: 703 journals, with the top five journals accounting for only 5.0% of the corpus. BioID represents the opposite extreme, with 18 journals total and 83.2% of corpus documents from the top five journals.

The article-topic mappings in Table [2](https://arxiv.org/html/2605.20537#S4.T2) demonstrate substantial differences in the slice of biomedical literature these corpora represent. NCBI-Disease is strongest in genetics/genomics, CHEMDNER and NLM-Chem emphasize chemistry/materials, BioID is dominated by molecular biology/biochemistry, and CellLink is enriched for general biology and cell/developmental biology.

Table 2: Broad research topic composition of each corpus. Values represent approximate percentage of corpus articles from each topic area. CHEM = CHEMDNER; NCBI = NCBI-Disease; NLM = NLM-Chem. Topics are assigned from article MeSH terms where available; unresolved article terms fall back to NLM Catalog MeSH journal topics and configured journal-name anchors. "Other" aggregates all categories not shown. The largest displayed non-Other entry in each column is shown in bold.

### 4.4 Terminology-based coverage

Figure [3](https://arxiv.org/html/2605.20537#S4.F3) demonstrates that corpora sharing an entity label may still exercise different regions of the concept space. The left panels normalize branch counts within each corpus, showing what share of the corpus's own annotations falls into each high-level terminology branch.
The right panels normalize the same branch counts by the size of the corresponding terminology branch, highlighting how the corpus represents different areas of the terminology. Figure [3](https://arxiv.org/html/2605.20537#S4.F3) focuses on MeSH diseases and chemicals; analogous CL-based statistics for cell-type corpora are available in the dashboard.

For diseases, NCBI-Disease is dominated by Congenital and Hereditary Diseases (C16, 23%) and Neoplasms (C04, 14%), consistent with its genetics origin. BC5CDR instead peaks at Pathological Conditions (C23, 20%) and Cardiovascular Diseases (C14, 17%), consistent with pharmacovigilance as a clinically cross-cutting focus rather than a narrow specialty focus. Chemically-Induced Disorders (C25) account for 5.8% of BC5CDR disease annotations but only 0.05% of NCBI-Disease. Thus, while both corpora evaluate "disease" recognition, they measure fundamentally distinct capabilities.

For chemicals, BC5CDR is concentrated in Organic Chemicals (D02, 37%) and Heterocyclic Compounds (D03, 28%), the categories containing most small-molecule drugs. NLM-Chem is more broadly distributed, with higher representation of Inorganic Chemicals (D01, 18%) and Amino Acids/Proteins (D12, 12%).

Figure 3: Terminology-based coverage analysis. Left panels: distribution normalized within each corpus's own annotations. Right panels: distribution normalized within the MeSH tree vocabulary. Top panels: distribution across MeSH chemical branches (D-branches) for BC5CDR and NLM-Chem. Bottom panels: distribution across MeSH disease branches (C-branches) for BC5CDR and NCBI-Disease.

## 5 Discussion

The central implication of this study is that benchmark results for biomedical NER and EL cannot be interpreted independently of the corpora that produced them.
A benchmark corpus is not simply a labeled sample of text: it is a measurement instrument whose structure determines which system capabilities are exercised and limits how far evaluation conclusions can reasonably transfer. Because benchmark utility is task- and domain-dependent, resources that share an entity type label can still function as fundamentally different instruments, evaluating different capabilities, imposing different generalization demands, and representing different regions of biomedical literature and concept space. No single statistic captures this multidimensionality. Instead, benchmark interpretation depends on the interaction of several diagnostic signals, and the nine corpora examined here illustrate how these signals combine in practice.

*Annotation density and label distribution.* Annotation density determines how concentrated evaluation signal is within each sampled unit. Dense full-text corpora such as CRAFT and NLM-Chem yield many linked decisions from relatively few articles, which can improve evaluation sensitivity and expose systems to full-text sectional variation. At the same time, raw density is partly confounded with document length and annotation scope. A large number of annotations from one article may include repeated mentions of the same entities, and therefore may not provide the same independent evidence as the same number of annotations sampled from many distinct passages or documents. Sparse corpora based on abstracts, passages, or figure captions require broader sampling to achieve stable estimates, but they provide more independent contexts per annotation. Each design serves different evaluation purposes: full-text corpora emphasize document-level realism and repeated-use behavior, whereas passage- or abstract-based corpora can emphasize breadth of contexts under fixed annotation budgets.
*Lexical and conceptual variation.* Variation determines whether a corpus tests recognition of repeated forms, alternative expressions for known concepts, or generalization to genuinely new ones. CellLink's variation of 3.74 surface forms per label/link pair reflects the compositional naming conventions of single-cell biology: novel cell-population names are assembled from familiar modifiers and cell-type heads, so systems must generalize across surface forms while mapping to relatively familiar concepts. BC5CDR's lower variation is consistent with the pharmacovigilance literature, where drug and disease names tend to be standardized.

*Train-test overlap.* Overlap analysis offers a particularly direct diagnostic for interpreting unexpectedly strong or weak results, consistent with prior work on leakage and memorization (Tutubalina et al., [2020](https://arxiv.org/html/2605.20537#bib.bib34); Liang et al., [2023](https://arxiv.org/html/2605.20537#bib.bib22)). The steepness of the drop from token-level to mention-string overlap varies and is in itself informative. CellLink illustrates one diagnostic profile: high identifier overlap (40.6%) combined with much lower mention-string overlap (9.1%) means the benchmark emphasizes compositional surface-form generalization rather than novel concept recognition. NCBI-Disease presents the opposite profile, with low identifier overlap (19.6%) that places genuine demands on concept-level generalization. JNLPBA's sharp drop from 35.9% token overlap to 6.7% mention-string overlap, combined with the absence of normalization, may help explain its relatively low performance ceiling (e.g., Huang et al., [2020](https://arxiv.org/html/2605.20537#bib.bib15)): systems cannot rely heavily on exact entity-form reuse.

*Metadata composition.* Temporal distributions, journal diversity, and topic profiles describe the slice of literature represented by the corpus.
These properties define the domain over which benchmark claims can reasonably apply and determine whether conclusions transfer to corpora drawn from different time periods, venues, or subfields. CHEMDNER's near-total concentration in 2013 chemistry literature and BioID's extreme journal concentration (83.2% from five journals) are structural features that may limit the generalizability of results. BC5CDR's breadth across 703 journals and five decades makes performance claims more domain-general. These metadata properties are rarely reported, yet they bear on whether performance on a benchmark supports claims about a target deployment setting.

*Terminology coverage.* The contrast between NCBI-Disease and BC5CDR illustrates why terminology-aware analysis matters even when two corpora nominally share a task label. Both annotate disease entities, yet NCBI-Disease is dominated by Congenital and Hereditary Diseases (C16) and Neoplasms (C04), reflecting its genetics origin, while BC5CDR peaks at Pathological Conditions (C23) and Cardiovascular Diseases (C14), reflecting its pharmacovigilance focus. Chemically-Induced Disorders (C25) account for 5.8% of BC5CDR disease annotations but only 0.05% of NCBI-Disease. High performance on one therefore supports only limited claims about transfer to the other, despite the shared entity label. The same logic applies to chemical corpora: BC5CDR's concentration in Organic Chemicals and Heterocyclic Compounds reflects a small-molecule drug emphasis, while NLM-Chem's broader distribution across Inorganic Chemicals, Amino Acids, and Lipids reflects the wider biochemical scope of full-text molecular biology literature. Corpora that differ in terminology coverage may exercise different practical capabilities even when their reported task names are identical.

*From diagnostics to corpus decisions.* Corpus diagnostics are informative not only retrospectively but before a benchmark is finalized.
Before releasing a split, developers can compute mention-string and identifier overlap to detect leakage. While extending a corpus, they can sample documents to fill temporal, journal, topic, or terminology-branch gaps. When selecting a benchmark, researchers can choose corpora whose density, overlap profile, and terminology coverage match the intended deployment claim. When combining corpora, they can verify whether a new source adds concept regions not already represented or mainly duplicates familiar strings and identifiers. The goal, then, is to tie benchmark interpretation to corpus characteristics and intended use, rather than assign a single quality score.

*Limitations.* The framework characterizes corpus structure but does not directly predict downstream system rankings or benchmark saturation. Measured ambiguity conflates true polysemy, underspecified guidelines, and annotation errors; the present framework does not distinguish among these sources. Surface-form variation measures observed diversity rather than the full range of synonymy present in the literature. Terminology analyses depend on the version of the reference ontology used. Finally, topic mappings provide practical, lightweight domain diagnostics rather than definitive subject classifications.

## 6 Conclusion

We presented a corpus-centric framework for diagnosing the benchmark utility of biomedical corpora with entity annotations. The framework organizes standardized statistics over annotations, linked identifiers, corpus splits, document metadata, and terminology mappings into five diagnostic families: (1) scale, density and label distribution, (2) lexical and conceptual structure, (3) train-test overlap, (4) metadata composition, and (5) terminology coverage. Applied to nine biomedical NER and EL corpora, these diagnostics show that corpora for the same apparent task can measure substantially different capabilities.
The main implication is that these statistics provide a practical basis for more complete reporting of biomedical NER and EL benchmarks\. Corpus reports should describe annotation density, normalization support, lexical and identifier variation, train\-test overlap at multiple abstraction levels, temporal and journal composition, and terminology coverage where applicable\. Without these measurements, benchmark results are difficult to interpret\. Standardized corpus diagnostics would make evaluation claims more interpretable, clarify what a benchmark measures, and support transparent, reproducible, transfer\-aware evaluation\.

## Acknowledgments

This research was supported by the Intramural Research Program of the National Institutes of Health \(NIH\)\. The NIH author contributions are considered Works of the United States Government\. The findings and conclusions presented are those of the authors and do not necessarily reflect the views of the NIH or the U\.S\. Department of Health and Human Services\.

## References

- C\. Arighi, L\. Hirschman, T\. Lemberger, S\. Bayer, R\. Liechti, D\. Comeau, and C\. Wu \(2017\)Bio\-ID track overview\.InBioCreative VI Challenge Evaluation Workshop,Bethesda, MD,pp\. 14–19\.Cited by:[§3\.2](https://arxiv.org/html/2605.20537#S3.SS2.p1.1)\. - L\. Aroyo and C\. Welty \(2015\)Truth is a lie: crowd truth and the seven myths of human annotation\.AI Magazine36\(1\),pp\. 15–24\.External Links:ISSN 0738\-4602,[Link](https://doi.org/10.1609/aimag.v36i1.2564),[Document](https://dx.doi.org/10.1609/aimag.v36i1.2564)Cited by:[§2](https://arxiv.org/html/2605.20537#S2.p3.1)\. - R\. Artstein and M\. Poesio \(2008\)Survey article: inter\-coder agreement for computational linguistics\.Computational Linguistics34\(4\),pp\. 555–596\.External Links:[Document](https://dx.doi.org/10.1162/coli.07-034-R2),[Link](https://aclanthology.org/J08-4004/)Cited by:[§2](https://arxiv.org/html/2605.20537#S2.p3.1)\. - M\. Bada, M\. Eckert, D\. Evans, K\. Garcia, K\.
Shipley, D\. Sitnikov, W\. A\. B\. Jr, K\. B\. Cohen, K\. Verspoor, J\. A\. Blake, and L\. E\. Hunter \(2012\)Concept annotation in the CRAFT corpus\.\.BMC Bioinformatics13\(\),pp\. 161\.External Links:[Document](https://dx.doi.org/10.1186/1471-2105-13-161)Cited by:[§1](https://arxiv.org/html/2605.20537#S1.p2.1),[§2](https://arxiv.org/html/2605.20537#S2.p1.1),[§3\.2](https://arxiv.org/html/2605.20537#S3.SS2.p1.1)\. - N\. Collier, T\. Ohta, Y\. Tsuruoka, Y\. Tateisi, and J\. Kim \(2004\)Introduction to the bio\-entity recognition task at JNLPBA\.InProceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications \(NLPBA/BioNLP\),Geneva, Switzerland,pp\. 73–78\.External Links:[Link](https://aclanthology.org/W04-1213/)Cited by:[§1](https://arxiv.org/html/2605.20537#S1.p2.1),[§2](https://arxiv.org/html/2605.20537#S2.p1.1),[§3\.2](https://arxiv.org/html/2605.20537#S3.SS2.p1.1)\. - D\. C\. Comeau, C\. Wei, R\. Islamaj Doğan, and Z\. Lu \(2019\)PMC text mining subset in BioC: about three million full\-text articles and growing\.Bioinformatics35\(18\),pp\. 3533–3535\.External Links:ISSN 1367\-4803,[Document](https://dx.doi.org/10.1093/bioinformatics/btz070),[Link](https://doi.org/10.1093/bioinformatics/btz070),https://academic\.oup\.com/bioinformatics/article\-pdf/35/18/3533/48975610/bioinformatics\_35\_18\_3533\.pdfCited by:[§3\.4](https://arxiv.org/html/2605.20537#S3.SS4.p1.1)\. - D\. C\. Comeau, R\. I\. Doğan, P\. Ciccarese, K\. B\. Cohen, M\. Krallinger, F\. Leitner, Z\. Lu, Y\. Peng, F\. Rinaldi, M\. Torii, A\. Valencia, K\. Verspoor, T\. C\. Wiegers, C\. H\. Wu, and W\. J\. Wilbur \(2013\)BioC: a minimalist approach to interoperability for biomedical text processing\.\.Database \(Oxford\)2013,pp\. bat064\.External Links:[Document](https://dx.doi.org/10.1093/database/bat064)Cited by:[§3\.4](https://arxiv.org/html/2605.20537#S3.SS4.p1.1)\. - R\. I\. Doğan, R\. Leaman, and Z\. 
Lu \(2014\)NCBI disease corpus: A resource for disease name recognition and concept normalization\.\.Journal of Biomedical Informatics47\(\),pp\. 1–10\.External Links:[Document](https://dx.doi.org/10.1016/j.jbi.2013.12.006),[Link](https://www.sciencedirect.com/science/article/pii/S1532046413001974)Cited by:[§1](https://arxiv.org/html/2605.20537#S1.p2.1),[§2](https://arxiv.org/html/2605.20537#S2.p1.1),[§3\.2](https://arxiv.org/html/2605.20537#S3.SS2.p1.1)\. - J\. Fries, L\. Weber, N\. Seelam, G\. Altay, D\. Datta, S\. Garda, S\. Kang, R\. Su, W\. Kusa, S\. Cahyawijaya,et al\.\(2022\)BigBio: a framework for data\-centric biomedical natural language processing\.InAdvances in Neural Information Processing Systems,Vol\.35,pp\. 25792–25806\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2022/hash/a583d2197eafc4afdd41f5b8765555c5-Abstract-Datasets_and_Benchmarks.html)Cited by:[§2](https://arxiv.org/html/2605.20537#S2.p2.1)\. - Y\. Gu, R\. Tinn, H\. Cheng, M\. Lucas, N\. Usuyama, X\. Liu, T\. Naumann, J\. Gao, and H\. Poon \(2021\)Domain\-Specific Language Model Pretraining for Biomedical Natural Language Processing\.\.ACM Transactions on Computing for Healthcare \(HEALTH\)3,pp\. 1–23\.External Links:[Document](https://dx.doi.org/10.1145/3458754)Cited by:[§2](https://arxiv.org/html/2605.20537#S2.p2.1)\. - S\. Gururangan, S\. Swayamdipta, O\. Levy, R\. Schwartz, S\. R\. Bowman, and N\. A\. Smith \(2018\)Annotation Artifacts in Natural Language Inference Data, New Orleans, Louisiana, Association for Computational Linguistics\.\.InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 \(Short Papers\),New Orleans, Louisiana,pp\. 107–112\.External Links:[Document](https://dx.doi.org/10.18653/v1/N18-2017)Cited by:[§2](https://arxiv.org/html/2605.20537#S2.p2.1)\. - M\. Herrero\-Zazo, I\. Segura\-Bedmar, P\. Martínez, and T\. 
Declerck \(2013\)The DDI corpus: an annotated corpus with pharmacological substances and drug\-drug interactions\.Journal of Biomedical Informatics46\(5\),pp\. 914–920\.External Links:[Document](https://dx.doi.org/10.1016/j.jbi.2013.07.011)Cited by:[§1](https://arxiv.org/html/2605.20537#S1.p2.1)\. - E\. Hovy and J\. Lavid \(2010\)Towards a ‘science’ of corpus annotation: a new methodological challenge for corpus linguistics\.International Journal of Translation22\(1\),pp\. 13–36\.Cited by:[§2](https://arxiv.org/html/2605.20537#S2.p3.1)\. - G\. Hripcsak and A\. S\. Rothschild \(2005\)Agreement, the f\-measure, and reliability in information retrieval\.Journal of the American Medical Informatics Association12\(3\),pp\. 296–298\.External Links:[Document](https://dx.doi.org/10.1197/jamia.M1733)Cited by:[§2](https://arxiv.org/html/2605.20537#S2.p3.1)\. - M\. Huang, P\. Lai, P\. Lin, Y\. You, R\. T\. Tsai, and W\. Hsu \(2020\)Biomedical named entity recognition and linking datasets: survey and our recent development\.Briefings in Bioinformatics21\(6\),pp\. 2219–2238\.External Links:[Document](https://dx.doi.org/10.1093/bib/bbaa054)Cited by:[§5](https://arxiv.org/html/2605.20537#S5.p4.1)\. - Islamaj, R\., C\. H\. Wei, D\. Cissel, N\. Miliaras, O\. Printseva, O\. Rodionov, K\. Sekiya, J\. Ward, and Z\. Lu \(2021a\)NLM\-Gene, a richly annotated gold standard dataset for gene entities that addresses ambiguity and multi\-species gene recognition\.\.Journal of Biomedical Informatics118\(\),pp\. 103779\.External Links:[Document](https://dx.doi.org/10.1016/j.jbi.2021.103779)Cited by:[§1](https://arxiv.org/html/2605.20537#S1.p2.1)\. - Islamaj, R\., C\. H\. Wei, P\. T\. Lai, L\. Luo, C\. Coss, P\. G\. Kochar, N\. Miliaras, O\. Rodionov, K\. Sekiya, D\. Trinh, D\. Whitman, and Z\. Lu \(2024\)The biomedical relationship corpus of the BioRED track at the BioCreative VIII challenge and workshop\.\.Database \(Oxford\)2024,pp\. 
baae099\.External Links:[Document](https://dx.doi.org/10.1093/database/baae071)Cited by:[§2](https://arxiv.org/html/2605.20537#S2.p1.1)\. - R\. Islamaj, R\. Leaman, S\. Kim, D\. Kwon, C\. Wei, D\. C\. Comeau, Y\. Peng, D\. Cissel, C\. Coss, C\. Fisher, R\. Guzman, P\. G\. Kochar, S\. Koppel, D\. Trinh, K\. Sekiya, J\. Ward, D\. Whitman, S\. Schmidt, and Z\. Lu \(2021b\)NLM\-Chem, a new resource for chemical entity recognition in PubMed full text literature\.\.Scientific Data8\(1\),pp\. 91\.External Links:[Document](https://dx.doi.org/10.1038/s41597-021-00875-1)Cited by:[§1](https://arxiv.org/html/2605.20537#S1.p2.1),[§2](https://arxiv.org/html/2605.20537#S2.p1.1),[§3\.2](https://arxiv.org/html/2605.20537#S3.SS2.p1.1)\. - J\. Kim, T\. Ohta, Y\. Tateisi, and J\. Tsujii \(2003\)GENIA corpus–a semantically annotated corpus for bio\-textmining\.Bioinformatics19,pp\. i180–i182\.External Links:[Document](https://dx.doi.org/10.1093/bioinformatics/btg1023)Cited by:[§1](https://arxiv.org/html/2605.20537#S1.p2.1),[§2](https://arxiv.org/html/2605.20537#S2.p1.1)\. - M\. Krallinger, O\. Rabal, F\. Leitner, M\. Vazquez, D\. Salgado, Z\. Lu, R\. Leaman, Y\. Lu, D\. Ji, D\. M\. Lowe, R\. A\. Sayle, R\. T\. Batista\-Navarro, R\. Rak, T\. Huber, T\. Rocktäschel, S\. Matos, D\. Campos, B\. Tang, H\. Xu, T\. Munkhdalai, K\. H\. Ryu, S\. Ramanan, S\. Nathan, S\. Žitnik, M\. Bajec, L\. Weber, M\. Irmer, S\. A\. Akhondi, J\. A\. Kors, S\. Xu, X\. An, U\. K\. Sikdar, A\. Ekbal, M\. Yoshioka, T\. M\. Dieb, M\. Choi, K\. Verspoor, M\. Khabsa, C\. L\. Giles, H\. Liu, K\. E\. Ravikumar, A\. Lamurias, F\. M\. Couto, H\. Dai, R\. T\. Tsai, C\. Ata, T\. Can, A\. Usié, R\. Alves, I\. Segura\-Bedmar, P\. Martínez, J\. Oyarzabal, and A\. Valencia \(2015\)The CHEMDNER corpus of chemicals and drugs and its annotation principles\.Journal of Cheminformatics7,pp\. 
S2\.External Links:[Document](https://dx.doi.org/10.1186/1758-2946-7-S1-S2)Cited by:[§1](https://arxiv.org/html/2605.20537#S1.p2.1),[§2](https://arxiv.org/html/2605.20537#S2.p1.1),[§3\.2](https://arxiv.org/html/2605.20537#S3.SS2.p1.1)\. - J\. Li, Y\. Sun, R\. J\. Johnson, D\. Sciaky, C\. Wei, R\. Leaman, A\. P\. Davis, C\. J\. Mattingly, T\. C\. Wiegers, and Z\. Lu \(2016\)BioCreative V CDR task corpus: a resource for chemical disease relation extraction\.Database2016,pp\. baw068\.External Links:[Document](https://dx.doi.org/10.1093/database/baw068)Cited by:[§1](https://arxiv.org/html/2605.20537#S1.p2.1),[§2](https://arxiv.org/html/2605.20537#S2.p1.1),[§3\.2](https://arxiv.org/html/2605.20537#S3.SS2.p1.1)\. - P\. Liang, R\. Bommasani, T\. Lee, D\. Tsipras, D\. Soylu, M\. Yasunaga, Y\. Zhang, D\. Narayanan, Y\. Wu, A\. Kumar, B\. Newman, B\. Yuan, B\. Yan, C\. Zhang, C\. Cosgrove, C\. D\. Manning, C\. Re, D\. Acosta\-Navas, D\. A\. Hudson, E\. Zelikman, E\. Durmus, F\. Ladhak, F\. Rong, H\. Ren, H\. Yao, J\. WANG, K\. Santhanam, L\. Orr, L\. Zheng, M\. Yuksekgonul, M\. Suzgun, N\. Kim, N\. Guha, N\. S\. Chatterji, O\. Khattab, P\. Henderson, Q\. Huang, R\. A\. Chi, S\. M\. Xie, S\. Santurkar, S\. Ganguli, T\. Hashimoto, T\. Icard, T\. Zhang, V\. Chaudhary, W\. Wang, X\. Li, Y\. Mai, Y\. Zhang, and Y\. Koreeda \(2023\)Holistic evaluation of language models\.Transactions on Machine Learning Research\.External Links:[Link](https://openreview.net/forum?id=iO4LZibEqW)Cited by:[§2](https://arxiv.org/html/2605.20537#S2.p2.1),[§5](https://arxiv.org/html/2605.20537#S5.p4.1)\. - C\. E\. Lipscomb \(2000\)Medical subject headings \(MeSH\)\.Bulletin of the Medical Library Association88\(3\),pp\. 265–266\.Cited by:[4th item](https://arxiv.org/html/2605.20537#S3.I1.i4.p1.1)\. - Z\. Lu, H\. Kao, C\. Wei, M\. Huang, J\. Liu, C\. Kuo, C\. Hsu, R\. T\. Tsai, H\. Dai, N\. Okazaki, H\. Cho, M\. Gerner, I\. Solt, S\. Agarwal, F\. Liu, D\. Vishnyakova, P\. Ruch, M\. Romacker, F\. 
Rinaldi, S\. Bhattacharya, P\. Srinivasan, H\. Liu, M\. Torii, S\. Matos, D\. Campos, K\. Verspoor, K\. M\. Livingston, and W\. J\. Wilbur \(2011\)The gene normalization task in BioCreative III\.BMC Bioinformatics12,pp\. S2\.External Links:[Document](https://dx.doi.org/10.1186/1471-2105-12-S8-S2)Cited by:[§1](https://arxiv.org/html/2605.20537#S1.p2.1)\. - A\. Malik, M\. Arsalan, C\. Moreno, J\. Mosquera, E\. Félix, T\. Kizilören, V\. Muthukrishnan, B\. Zdrazil, A\. R\. Leach, and N\. M\. O’Boyle \(2026\)ChEBI: re\-engineered for a sustainable future\.Nucleic Acids Research54,pp\. D1768–D1778\.External Links:[Document](https://dx.doi.org/10.1093/nar/gkaf1271)Cited by:[5th item](https://arxiv.org/html/2605.20537#S3.I1.i5.p1.1)\. - Miranda\-Escalada, A\., F\. Mehryary, J\. Luoma, D\. Estrada\-Zavala, L\. Gasco, S\. Pyysalo, A\. Valencia, and M\. Krallinger \(2023\)Overview of DrugProt task at BioCreative VII: data and methods for large\-scale text mining and knowledge graph generation of heterogenous chemical\-protein relations\.\.Database \(Oxford\)2023,pp\. baad080\.External Links:[Document](https://dx.doi.org/10.1093/database/baad080)Cited by:[§1](https://arxiv.org/html/2605.20537#S1.p2.1)\. - A\. A\. Morgan, Z\. Lu, X\. Wang, A\. M\. Cohen, J\. Fluck, P\. Ruch, A\. Divoli, K\. Fundel, R\. Leaman, J\. Hakenberg, C\. Sun, H\. Liu, R\. Torres, M\. Krauthammer, W\. W\. Lau, H\. Liu, C\. Hsu, M\. Schuemie, K\. B\. Cohen, and L\. Hirschman \(2008\)Overview of BioCreative II gene normalization\.Genome Biology9\(Suppl 2\),pp\. S3\.External Links:[Document](https://dx.doi.org/10.1186/gb-2008-9-s2-s3)Cited by:[§1](https://arxiv.org/html/2605.20537#S1.p2.1),[§2](https://arxiv.org/html/2605.20537#S2.p1.1)\. - P\. V\. Ogren \(2006\)Knowtator: a protégé plug\-in for annotated corpus construction\.InProceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Demonstrations,A\. Rudnicky, J\. Dowding, and N\. 
Milic\-Frayling \(Eds\.\),New York City, USA,pp\. 273–275\.External Links:[Link](https://aclanthology.org/N06-4006/)Cited by:[§3\.4](https://arxiv.org/html/2605.20537#S3.SS4.p1.1)\. - Y\. Peng, S\. Yan, and Z\. Lu \(2019\)Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets\.InProceedings of the 18th BioNLP Workshop and Shared Task,Florence, Italy,pp\. 58–65\.External Links:[Document](https://dx.doi.org/10.18653/v1/W19-5006)Cited by:[§2](https://arxiv.org/html/2605.20537#S2.p2.1)\. - S\. Pyysalo and S\. Ananiadou \(2014\)Anatomical entity mention recognition at literature scale\.Bioinformatics30\(6\),pp\. 868–875\.External Links:[Document](https://dx.doi.org/10.1093/bioinformatics/btt580)Cited by:[§3\.2](https://arxiv.org/html/2605.20537#S3.SS2.p1.1)\. - Rotenberg, N\. H\., R\. Leaman, R\. Islamaj, H\. Kuivaniemi, G\. Tromp, B\. Fluharty, S\. Richardson, C\. Eastwood, M\. Diller, B\. Xu, A\. V\. Pankajam, D\. Osumi\-Sutherland, Z\. Lu, and R\. H\. Scheuermann \(2026\)Cell phenotypes in the biomedical literature: a systematic analysis and text mining corpus\.\.bioRxiv\.External Links:[Document](https://dx.doi.org/10.64898/2026.02.11.705457)Cited by:[§2](https://arxiv.org/html/2605.20537#S2.p1.1),[§3\.2](https://arxiv.org/html/2605.20537#S3.SS2.p1.1)\. - P\. Stenetorp, S\. Pyysalo, G\. Topić, T\. Ohta, S\. Ananiadou, and J\. Tsujii \(2012\)Brat: a web\-based tool for nlp\-assisted text annotation\.InProceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics,Avignon, France,pp\. 102–107\.External Links:[Link](https://aclanthology.org/E12-2021/)Cited by:[§3\.4](https://arxiv.org/html/2605.20537#S3.SS4.p1.1)\. - S\. Z\. K\. Tan, A\. Puig\-Barbe, D\. Goutte\-Gattat, C\. Eastwood, B\. Aevermann, A\. Avola, J\. P\. Balhoff, I\. U\. Bayindir, J\. Belfiore, A\. R\. Caron, D\. S\. Fischer, N\. George, B\. M\. Gyori, M\. A\. Haendel, C\. 
T\. Hoyt, H\. Kir, T\. Lubiana, N\. Matentzoglu, J\. A\. Overton, B\. Peng, B\. Peters, E\. M\. Quardokus, P\. L\. Ray, P\. Roncaglia, A\. D\. Rivera, R\. Stefancsik, W\. K\. Teh, S\. Toro, N\. Vasilevsky, C\. Xu, Y\. Zhang, R\. H\. Scheuermann, C\. J\. Mungall, A\. D\. Diehl, and D\. Osumi\-Sutherland \(2026\)The cell ontology in the age of single\-cell omics\.Scientific Data\.External Links:[Document](https://dx.doi.org/10.1038/s41597-026-07173-8)Cited by:[5th item](https://arxiv.org/html/2605.20537#S3.I1.i5.p1.1)\. - E\. Tutubalina, A\. Kadurin, and Z\. Miftahutdinov \(2020\)Fair evaluation in concept normalization: a large\-scale comparative analysis for BERT\-based models\.InProceedings of the 28th International Conference on Computational Linguistics,D\. Scott, N\. Bel, and C\. Zong \(Eds\.\),Barcelona, Spain \(Online\),pp\. 6710–6716\.External Links:[Link](https://aclanthology.org/2020.coling-main.588/),[Document](https://dx.doi.org/10.18653/v1/2020.coling-main.588)Cited by:[§2](https://arxiv.org/html/2605.20537#S2.p2.1),[§5](https://arxiv.org/html/2605.20537#S5.p4.1)\. - A\. Uma, T\. Fornaciari, D\. Hovy, S\. Paun, B\. Plank, and M\. Poesio \(2021\)Learning from disagreement: a survey\.Journal of Artificial Intelligence Research72,pp\. 1385–1470\.External Links:[Document](https://dx.doi.org/10.1613/jair.1.12752)Cited by:[§2](https://arxiv.org/html/2605.20537#S2.p3.1)\. - N\. A\. Vasilevsky, S\. Toro, N\. Matentzoglu, J\. E\. Flack, K\. R\. Mullen, H\. Hegde, S\. Gehrke, P\. L\. Whetzel, Y\. Shwetar, N\. L\. Harris, M\. S\. Ngu, G\. L\. Alyea, M\. S\. Kane, P\. Roncaglia, E\. Sid, C\. L\. Thaxton, V\. Wood, R\. S\. Abraham, M\. I\. Achatz, P\. Ajuyah, J\. S\. Amberger, L\. Babb, J\. Baker, J\. P\. Balhoff, J\. S\. Berg, A\. Bhalla, X\. B\. Ros, I\. R\. Braun, E\. C\. Broeren, B\. K\. Byer, A\. B\. Byrne, T\. J\. Callahan, L\. C\. Carmody, L\. E\. Chan, A\. R\. Clause, J\. S\. Cohen, M\. DeLuca, N\. T\. Deuitch, M\. Flowers, J\. Fraser, T\. Fujiwara, V\. 
Gitau, J\. L\. Goldstein, D\. Gration, T\. Groza, B\. M\. Gyori, W\. Hankey, J\. A\. Hilton, D\. S\. Himmelstein, S\. S\. Hong, C\. T\. Hoyt, R\. Huether, E\. Hurwitz, J\. O\. B\. Jacobsen, A\. Kikuchi, S\. Köhler, D\. R\. Korn, D\. Lagorce, B\. J\. Laraway, J\. Y\. Li, A\. J\. Malheiro, J\. McLaughlin, B\. H\. M\. Meldal, S\. Mohan, S\. A\. T\. Moxon, M\. C\. Munoz\-Torres, T\. H\. Nelson, F\. W\. Nicholas, D\. Ochoa, D\. Olson, T\. I\. Oprea, T\. T\. Oskotsky, D\. Osumi\-Sutherland, K\. Paris, H\. E\. Parkinson, Z\. M\. Pendlington, X\. P\. Peng, A\. Pizzino, S\. E\. Plon, B\. C\. Powell, J\. C\. Ratliff, H\. L\. Rehm, L\. Remennik, E\. R\. Riggs, S\. Roberts, P\. N\. Robinson, J\. E\. Ross, K\. Schaper, B\. M\. Schilder, J\. L\. Schmidt, E\. W\. Sharp, M\. N\. Similuk, D\. Smedley, T\. P\. Sneddon, R\. Sparks, R\. Stefancsik, G\. S\. Stupp, S\. Sundar, T\. Takatsuki, I\. Tammen, K\. C\. Tshering, D\. R\. Unni, E\. Valasek, A\. Vanderver, A\. H\. Wagner, R\. F\. Webb, D\. Welter, D\. Yaya\-Stupp, A\. Zankl, X\. A\. Zhang, J\. A\. McMurry, C\. G\. Chute, A\. Hamosh, C\. J\. Mungall, M\. A\. H\. C\. DICER1, miRNA\-Processing Gene Variant Curation Expert Panel, C\. H\. G\. C\. E\. Panel, C\. M\. C\. G\. C\. E\. Panel, C\. M\. M\. V\. C\. E\. Panel, C\. T\. V\. C\. E\. Panel, and C\. X\. I\. R\. D\. V\. C\. E\. Panel \(2026\)Mondo: integrating disease terminology across communities\.Genetics232\(4\),pp\. iyaf215\.External Links:[Document](https://dx.doi.org/10.1093/genetics/iyaf215)Cited by:[5th item](https://arxiv.org/html/2605.20537#S3.I1.i5.p1.1)\. - A\. Wang, Y\. Pruksachatkun, N\. Nangia, A\. Singh, J\. Michael, F\. Hill, O\. Levy, and S\. 
Bowman \(2019\)SuperGLUE: a stickier benchmark for general\-purpose language understanding systems\.InAdvances in Neural Information Processing Systems,Vol\.32\.External Links:[Link](https://proceedings.neurips.cc/paper/2019/hash/4496bf24afe7fab6f046bf4923da8de6-Abstract.html)Cited by:[§2](https://arxiv.org/html/2605.20537#S2.p2.1)\. - A\. Wang, A\. Singh, J\. Michael, F\. Hill, O\. Levy, and S\. R\. Bowman \(2018\)GLUE: a multi\-task benchmark and analysis platform for natural language understanding\.InProceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP,T\. Linzen, G\. Chrupała, and A\. Alishahi \(Eds\.\),Brussels, Belgium,pp\. 353–355\.External Links:[Link](https://aclanthology.org/W18-5446/),[Document](https://dx.doi.org/10.18653/v1/W18-5446)Cited by:[§2](https://arxiv.org/html/2605.20537#S2.p2.1)\. - C\. Wei, A\. Allot, P\. Lai, R\. Leaman, S\. Tian, L\. Luo, Q\. Jin, Z\. Wang, Q\. Chen, and Z\. Lu \(2024\)PubTator 3\.0: an AI\-powered literature resource for unlocking biomedical knowledge\.Nucleic Acids Research52\(W1\),pp\. W540–W546\.External Links:ISSN 0305\-1048,[Document](https://dx.doi.org/10.1093/nar/gkae235),[Link](https://doi.org/10.1093/nar/gkae235),https://academic\.oup\.com/nar/article\-pdf/52/W1/W540/58436124/gkae235\.pdfCited by:[§3\.4](https://arxiv.org/html/2605.20537#S3.SS4.p1.1)\. - C\. Wei, B\. R\. Harris, H\. Kao, and Z\. Lu \(2013\)tmVar: a text mining approach for extracting sequence variants in biomedical literature\.\.Bioinformatics29\(11\),pp\. 1433–1439\.External Links:[Document](https://dx.doi.org/10.1093/bioinformatics/btt156)Cited by:[§1](https://arxiv.org/html/2605.20537#S1.p2.1),[§2](https://arxiv.org/html/2605.20537#S2.p1.1)\. - C\. Wei, Y\. Peng, R\. Leaman, A\. P\. Davis, C\. J\. Mattingly, J\. Li, T\. C\. Wiegers, and Z\. 
Lu \(2016\)Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical\-disease relation \(CDR\) task\.\.Database \(Oxford\)2016,pp\. baw032\.External Links:[Document](https://dx.doi.org/10.1093/database/baw032)Cited by:[§1](https://arxiv.org/html/2605.20537#S1.p2.1)\.