This paper introduces PlantMarkerBench, a benchmark spanning four species that evaluates how well language models interpret evidence for plant marker genes in the scientific literature. It finds that frontier models perform well on direct evidence but struggle with functional and indirect evidence types.
Consensus, a research assistant with 8 million users, has launched Scholar Agent, a multi-agent system built on GPT-5 and OpenAI's Responses API that can synthesize peer-reviewed literature from a corpus of 220 million papers in minutes. The system coordinates Planning, Search, Reading, and Analysis agents to mirror how human researchers work, reducing hallucinations and improving reliability relative to previous approaches.