PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning
Summary
This paper introduces PlantMarkerBench, a multi-species benchmark for evaluating language models' ability to interpret evidence for plant marker genes from scientific literature across four species. It highlights that while frontier models perform well on direct evidence, they struggle with functional and indirect evidence types.
Cached at: 05/13/26, 12:20 AM
Source: https://huggingface.co/papers/2605.10032
Abstract
PlantMarkerBench presents a multi-species benchmark for evaluating literature-based plant marker evidence interpretation, assessing models on identifying valid marker evidence and categorizing evidence types across four plant species.
Cell-type-specific marker genes are fundamental to plant biology, yet existing resources primarily rely on curated databases or high-throughput studies without explicitly modeling the supporting evidence found in scientific literature. We introduce PlantMarkerBench, a multi-species benchmark for evaluating literature-grounded plant marker evidence interpretation from full-text biological papers. PlantMarkerBench is constructed using a modular curation pipeline integrating large-scale literature retrieval, hybrid search, species-aware biological grounding, structured evidence extraction, and targeted human review. The benchmark spans four plant species (Arabidopsis, maize, rice, and tomato) and contains 5,550 sentence-level evidence instances annotated for marker-evidence validity, evidence type, and support strength. We define two benchmark tasks: determining whether a candidate sentence provides valid marker evidence for a gene-cell-type pair, and classifying the evidence into expression, localization, function, indirect, or negative categories. We benchmark diverse open-weight and closed-source language models across species and prompting strategies. Although frontier models achieve relatively strong performance on direct expression evidence, performance drops substantially on functional, indirect, and weak-support evidence, with evidence-type confusion emerging as a dominant failure mode. Open-weight models additionally exhibit elevated false-positive rates under ambiguous biological contexts. PlantMarkerBench provides a challenging and reproducible evaluation framework for literature-grounded biological evidence attribution and supports future research on trustworthy scientific information extraction and AI-assisted plant biology.
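The two tasks described in the abstract can be illustrated with a minimal Python sketch. It is not the benchmark's actual schema or scoring code: the `EvidenceInstance` field names, the helper functions, and the toy labels are all assumptions made for illustration. Only the five evidence categories and the two task definitions come from the abstract. The sketch scores validity predictions (accuracy and false-positive rate) and tallies evidence-type confusions, the two failure signals the abstract highlights.

```python
from collections import Counter
from dataclasses import dataclass

# The five evidence categories named in the abstract.
EVIDENCE_TYPES = ("expression", "localization", "function", "indirect", "negative")


@dataclass(frozen=True)
class EvidenceInstance:
    """Hypothetical record layout; field names are assumptions, not the real schema."""
    species: str
    gene: str
    cell_type: str
    sentence: str
    is_valid: bool       # Task 1 gold label: valid marker evidence or not
    evidence_type: str   # Task 2 gold label: one of EVIDENCE_TYPES


def validity_metrics(gold, pred):
    """Accuracy and false-positive rate for the binary validity task."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g and p)
    tn = sum(1 for g, p in pairs if not g and not p)
    fp = sum(1 for g, p in pairs if not g and p)
    negatives = tn + fp
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Share of invalid sentences that were wrongly accepted as evidence.
        "false_positive_rate": fp / negatives if negatives else 0.0,
    }


def type_confusion(gold, pred):
    """Count (gold, predicted) evidence-type pairs; off-diagonal keys are confusions."""
    return Counter(zip(gold, pred))


# Toy labels invented for illustration only; they are not benchmark data.
gold_valid = [True, False, True, False]
pred_valid = [True, True, True, False]
metrics = validity_metrics(gold_valid, pred_valid)

gold_types = ["expression", "function", "indirect", "negative"]
pred_types = ["expression", "expression", "function", "negative"]
confusion = type_confusion(gold_types, pred_types)
type_accuracy = sum(n for (g, p), n in confusion.items() if g == p) / len(gold_types)
```

Keeping the confusion counts as `(gold, predicted)` pairs rather than a single accuracy number makes it easy to see which categories are mistaken for which, for example functional evidence labeled as expression.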
Similar Articles
BAGEL: Benchmarking Animal Knowledge Expertise in Language Models
BAGEL is a new benchmark for evaluating animal-related knowledge in large language models, constructed from diverse scientific sources and covering taxonomy, morphology, habitat, behavior, and species interactions through closed-book question-answer pairs. The benchmark enables fine-grained analysis across taxonomic groups and knowledge categories, providing insights into model strengths and failure modes for biodiversity applications.
MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept Extraction
MedicalBench is a new benchmark for evaluating large language models on medical concept extraction from electronic health records, focusing on implicit reasoning and evidence grounding. It includes 823 expert-annotated examples and shows that current models perform modestly, highlighting the difficulty of extracting implicitly stated medical concepts.
PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models
This paper introduces PlanBench-V, the first comprehensive benchmark for evaluating Vision-Language Models on spatial planning map interpretation, including an expert-annotated dataset and a four-dimension evaluation framework. Experiments show significant progress but highlight persistent challenges in implementation-oriented tasks.
Partial Evidence Bench: Benchmarking Authorization-Limited Evidence in Agentic Systems
This paper introduces Partial-Evidence-Bench, a deterministic benchmark for measuring 'authorization-limited evidence' failures in agentic AI systems. It evaluates how models handle tasks where access control restricts visibility, assessing their ability to recognize and report incomplete information rather than silently producing answers that appear complete but are not.
WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild
WildTableBench introduces the first question-answering benchmark for real-world table images, revealing that existing multimodal foundation models struggle significantly with structural perception and numerical reasoning, with only one model exceeding 50% accuracy.