PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning

Hugging Face Daily Papers Papers

Summary

This paper introduces PlantMarkerBench, a multi-species benchmark for evaluating language models' ability to interpret evidence for plant marker genes from scientific literature across four species. It highlights that while frontier models perform well on direct evidence, they struggle with functional and indirect evidence types.

Cell-type-specific marker genes are fundamental to plant biology, yet existing resources primarily rely on curated databases or high-throughput studies without explicitly modeling the supporting evidence found in scientific literature. We introduce PlantMarkerBench, a multi-species benchmark for evaluating literature-grounded plant marker evidence interpretation from full-text biological papers. PlantMarkerBench is constructed using a modular curation pipeline integrating large-scale literature retrieval, hybrid search, species-aware biological grounding, structured evidence extraction, and targeted human review. The benchmark spans four plant species -- Arabidopsis, maize, rice, and tomato -- and contains 5,550 sentence-level evidence instances annotated for marker-evidence validity, evidence type, and support strength. We define two benchmark tasks: determining whether a candidate sentence provides valid marker evidence for a gene-cell-type pair, and classifying the evidence into expression, localization, function, indirect, or negative categories. We benchmark diverse open-weight and closed-source language models across species and prompting strategies. Although frontier models achieve relatively strong performance on direct expression evidence, performance drops substantially on functional, indirect, and weak-support evidence, with evidence-type confusion emerging as a dominant failure mode. Open-weight models additionally exhibit elevated false-positive rates under ambiguous biological contexts. PlantMarkerBench provides a challenging and reproducible evaluation framework for literature-grounded biological evidence attribution and supports future research on trustworthy scientific information extraction and AI-assisted plant biology.
Original Article
View Cached Full Text

Cached at: 05/13/26, 12:20 AM

Paper page - PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning

Source: https://huggingface.co/papers/2605.10032

Abstract

PlantMarkerBench presents a multi-species benchmark for evaluating literature-based plant marker evidence interpretation, assessing models on identifying valid marker evidence and categorizing evidence types across four plant species.

Cell-type-specific marker genes are fundamental to plant biology, yet existing resources primarily rely on curated databases or high-throughput studies without explicitly modeling the supporting evidence found in scientific literature. We introduce PlantMarkerBench, a multi-species benchmark for evaluating literature-grounded plant marker evidence interpretation from full-text biological papers. PlantMarkerBench is constructed using a modular curation pipeline integrating large-scale literature retrieval,hybrid search, species-awarebiological grounding,structured evidence extraction, and targeted human review. The benchmark spans four plant species -- Arabidopsis, maize, rice, and tomato -- and contains 5,550sentence-level evidence instancesannotated formarker-evidence validity, evidence type, and support strength. We define two benchmark tasks: determining whether a candidate sentence provides valid marker evidence for a gene-cell-type pair, and classifying the evidence into expression, localization, function, indirect, or negative categories. We benchmark diverse open-weight andclosed-source language modelsacross species andprompting strategies. Although frontier models achieve relatively strong performance on direct expression evidence, performance drops substantially on functional, indirect, and weak-support evidence, with evidence-type confusion emerging as a dominant failure mode.Open-weight modelsadditionally exhibit elevatedfalse-positive ratesunder ambiguous biological contexts. PlantMarkerBench provides a challenging and reproducible evaluation framework forliterature-grounded biological evidenceattribution and supports future research on trustworthy scientific information extraction and AI-assisted plant biology.

View arXiv pageView PDFProject pageAdd to collection

Get this paper in your agent:

hf papers read 2605\.10032

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2605.10032 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2605.10032 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2605.10032 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Similar Articles

BAGEL: Benchmarking Animal Knowledge Expertise in Language Models

arXiv cs.CL

BAGEL is a new benchmark for evaluating animal-related knowledge in large language models, constructed from diverse scientific sources and covering taxonomy, morphology, habitat, behavior, and species interactions through closed-book question-answer pairs. The benchmark enables fine-grained analysis across taxonomic groups and knowledge categories, providing insights into model strengths and failure modes for biodiversity applications.

MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept Extraction

arXiv cs.CL

MedicalBench is a new benchmark for evaluating large language models on medical concept extraction from electronic health records, focusing on implicit reasoning and evidence grounding. It includes 823 expert-annotated examples and shows that current models perform modestly, highlighting the difficulty of extracting implicitly stated medical concepts.

PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models

arXiv cs.CL

This paper introduces PlanBench-V, the first comprehensive benchmark for evaluating Vision-Language Models on spatial planning map interpretation, including an expert-annotated dataset and a four-dimension evaluation framework. Experiments show significant progress but highlight persistent challenges in implementation-oriented tasks.

Partial Evidence Bench: Benchmarking Authorization-Limited Evidence in Agentic Systems

arXiv cs.AI

This paper introduces Partial-Evidence-Bench, a deterministic benchmark for measuring 'authorization-limited evidence' failures in agentic AI systems. It evaluates how models handle tasks where access control restricts visibility, assessing their ability to recognize and report incomplete information rather than silently producing seemingly complete but incomplete answers.