PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning

Hugging Face Daily Papers 05/11/26, 12:00 AM Papers

Summary

This paper introduces PlantMarkerBench, a multi-species benchmark for evaluating language models' ability to interpret evidence for plant marker genes from scientific literature across four species. It highlights that while frontier models perform well on direct evidence, they struggle with functional and indirect evidence types.

Cell-type-specific marker genes are fundamental to plant biology, yet existing resources primarily rely on curated databases or high-throughput studies without explicitly modeling the supporting evidence found in scientific literature. We introduce PlantMarkerBench, a multi-species benchmark for evaluating literature-grounded plant marker evidence interpretation from full-text biological papers. PlantMarkerBench is constructed using a modular curation pipeline integrating large-scale literature retrieval, hybrid search, species-aware biological grounding, structured evidence extraction, and targeted human review. The benchmark spans four plant species -- Arabidopsis, maize, rice, and tomato -- and contains 5,550 sentence-level evidence instances annotated for marker-evidence validity, evidence type, and support strength. We define two benchmark tasks: determining whether a candidate sentence provides valid marker evidence for a gene-cell-type pair, and classifying the evidence into expression, localization, function, indirect, or negative categories. We benchmark diverse open-weight and closed-source language models across species and prompting strategies. Although frontier models achieve relatively strong performance on direct expression evidence, performance drops substantially on functional, indirect, and weak-support evidence, with evidence-type confusion emerging as a dominant failure mode. Open-weight models additionally exhibit elevated false-positive rates under ambiguous biological contexts. PlantMarkerBench provides a challenging and reproducible evaluation framework for literature-grounded biological evidence attribution and supports future research on trustworthy scientific information extraction and AI-assisted plant biology.
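The two tasks described above can be sketched in code as follows. This is a minimal illustration and not the authors' evaluation harness: the record layout (`EvidenceInstance` and its field names), the placeholder genes and sentences, and the choice of metrics are all assumptions made for this example. Only the five evidence-type labels and the four species names come from the abstract.

```python
from collections import Counter
from dataclasses import dataclass

# The five evidence categories named in the abstract.
EVIDENCE_TYPES = ("expression", "localization", "function", "indirect", "negative")


@dataclass(frozen=True)
class EvidenceInstance:
    """Hypothetical record layout; the real dataset schema may differ."""
    species: str
    gene: str
    cell_type: str
    sentence: str
    is_valid: bool       # Task 1 gold label: valid marker evidence or not
    evidence_type: str   # Task 2 gold label: one of EVIDENCE_TYPES


def accuracy(gold, pred):
    """Fraction of positions where the prediction equals the gold label."""
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def false_positive_rate(gold, pred):
    """Fraction of gold-invalid instances that were predicted valid."""
    preds_on_negatives = [p for g, p in zip(gold, pred) if not g]
    if not preds_on_negatives:
        return 0.0
    return sum(preds_on_negatives) / len(preds_on_negatives)


def confusion_counts(gold, pred):
    """Count (gold_type, predicted_type) pairs for misclassified instances."""
    return Counter((g, p) for g, p in zip(gold, pred) if g != p)


# Toy instances with placeholder genes and invented labels (NOT benchmark data).
toy = [
    EvidenceInstance("Arabidopsis", "GENE_A", "root hair",
                     "GENE_A is expressed in root hair cells.", True, "expression"),
    EvidenceInstance("maize", "GENE_B", "bundle sheath",
                     "Loss of GENE_B alters bundle sheath development.", True, "function"),
    EvidenceInstance("rice", "GENE_C", "endodermis",
                     "GENE_C was co-regulated with endodermis genes.", False, "indirect"),
    EvidenceInstance("tomato", "GENE_D", "guard cell",
                     "GENE_D was not detected in guard cells.", False, "negative"),
]

# Hypothetical model outputs for the two tasks.
pred_valid = [True, True, True, False]
pred_types = ["expression", "expression", "function", "negative"]

gold_valid = [x.is_valid for x in toy]
gold_types = [x.evidence_type for x in toy]

validity_accuracy = accuracy(gold_valid, pred_valid)
validity_fpr = false_positive_rate(gold_valid, pred_valid)
type_accuracy = accuracy(gold_types, pred_types)
type_confusions = confusion_counts(gold_types, pred_types)
```

Reporting a false-positive rate and per-pair confusion counts alongside accuracy mirrors the two failure modes the abstract highlights: elevated false positives and evidence-type confusion.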

Original Article

Cached at: 05/13/26, 12:20 AM

Source: https://huggingface.co/papers/2605.10032

Abstract

PlantMarkerBench presents a multi-species benchmark for evaluating literature-based plant marker evidence interpretation, assessing models on identifying valid marker evidence and categorizing evidence types across four plant species.

Cell-type-specific marker genes are fundamental to plant biology, yet existing resources primarily rely on curated databases or high-throughput studies without explicitly modeling the supporting evidence found in scientific literature. We introduce PlantMarkerBench, a multi-species benchmark for evaluating literature-grounded plant marker evidence interpretation from full-text biological papers. PlantMarkerBench is constructed using a modular curation pipeline integrating large-scale literature retrieval, hybrid search, species-aware biological grounding, structured evidence extraction, and targeted human review. The benchmark spans four plant species -- Arabidopsis, maize, rice, and tomato -- and contains 5,550 sentence-level evidence instances annotated for marker-evidence validity, evidence type, and support strength. We define two benchmark tasks: determining whether a candidate sentence provides valid marker evidence for a gene-cell-type pair, and classifying the evidence into expression, localization, function, indirect, or negative categories. We benchmark diverse open-weight and closed-source language models across species and prompting strategies. Although frontier models achieve relatively strong performance on direct expression evidence, performance drops substantially on functional, indirect, and weak-support evidence, with evidence-type confusion emerging as a dominant failure mode. Open-weight models additionally exhibit elevated false-positive rates under ambiguous biological contexts. PlantMarkerBench provides a challenging and reproducible evaluation framework for literature-grounded biological evidence attribution and supports future research on trustworthy scientific information extraction and AI-assisted plant biology.

Similar Articles

BAGEL: Benchmarking Animal Knowledge Expertise in Language Models

MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept Extraction

PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models

Partial Evidence Bench: Benchmarking Authorization-Limited Evidence in Agentic Systems

WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild
