BAGEL: Benchmarking Animal Knowledge Expertise in Language Models

Summary

BAGEL is a new benchmark for evaluating animal-related knowledge in large language models, constructed from diverse scientific sources and covering taxonomy, morphology, habitat, behavior, and species interactions through closed-book question-answer pairs. The benchmark enables fine-grained analysis across taxonomic groups and knowledge categories, providing insights into model strengths and failure modes for biodiversity applications.


# BAGEL: Benchmarking Animal Knowledge Expertise in Language Models
Source: https://arxiv.org/html/2604.16241


Masato Hagiwara¹†, Milad Alizadeh¹, Ellen Gilsenan-McMahon¹, Marius Miron¹, David Robinson¹, Emmanuel Chemla¹, Sara Keen¹, Gagan Narula¹, Mathieu Laurière³‡, Matthieu Geist¹‡, Olivier Pietquin¹‡

###### Abstract

Large language models (LLMs) have shown strong performance on broad-domain knowledge and reasoning benchmarks, but it remains unclear how well language models handle specialized animal-related knowledge under a unified closed-book evaluation protocol. We introduce BAGEL, a Benchmark for evaluating Animal knowledge Generalization Expertise in Language models. BAGEL is constructed from diverse scientific and reference sources, including bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia, using a combination of curated examples and automatically generated closed-book question-answer pairs. The benchmark covers multiple aspects of animal knowledge, including taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions. By focusing on closed-book evaluation, BAGEL measures animal-related knowledge of models without external retrieval at inference time. BAGEL further supports fine-grained analysis across source domains, taxonomic groups, and knowledge categories, enabling a more precise characterization of model strengths and systematic failure modes. Our benchmark provides a new testbed for studying domain-specific knowledge generalization in language models and for improving their reliability in biodiversity-related applications.

†These authors contributed equally to this work
‡Co-supervising authors

## 1 Introduction

Instruction-tuned LLMs and chat-style assistants have rapidly improved on a wide range of knowledge and reasoning tasks [ouyang2022instructgpt,openai2023gpt4], leading to strong performance on broad-domain evaluations such as MMLU [hendrycks2021mmlu] and science-oriented benchmarks such as ScienceQA [lu2022learn]. These gains have fueled interest in using LLMs as general-purpose interfaces for scientific information access, synthesis, and question answering.

However, strong aggregate performance on broad benchmarks does not by itself establish whether models reliably encode specialized long-tail knowledge about the natural world, especially when answering questions that require species-level facts, ecological relations, or natural-history reasoning.

This gap is increasingly important because language and foundation models are already being explored for biodiversity and animal-related applications. In ecology, recent studies have used LLMs to extract structured ecological information from scientific literature, including host–pathogen records and large-scale species interactions [gougherty2024ecological,keck2025massive], and have begun to test ecological knowledge more directly, finding uneven performance across ecological tasks [dorm2025ecologicalknowledge]. In animal communication and bioacoustics, prior work has introduced several important resources, including BEANS, a benchmark covering a broad range of animal sound tasks [hagiwara2022beans]; ISPA, a text-like transcription scheme for animal sounds [hagiwara2024ispa]; and NatureLM-audio, an audio-language foundation model built upon text-only LLMs (Llama) for bioacoustics [robinson2024naturelmaudio]. More broadly, biodiversity-focused foundation models have also emerged in other modalities, such as BioCLIP for fine-grained recognition across the tree of life [stevens2024bioclip].

Despite this momentum, most prior work emphasizes information extraction, audio understanding, or visual recognition rather than evaluating whether text-only LLMs can answer closed-book questions about animals. As a result, it remains unclear how well current language models generalize across the heterogeneous forms of knowledge that matter for animal expertise, such as taxonomy, morphology, behavior, habitat, vocalization, geographic distribution, and species interactions. To address this gap, we introduce BAGEL, a Benchmark for closed-book evaluation of Animal knowledge Generalization Expertise in Language models.¹

¹Dataset release: https://huggingface.co/datasets/EarthSpeciesProject/BAGEL

BAGEL aggregates 11,852 multiple-choice questions derived from four complementary sources: Wikipedia, Global Biotic Interactions (GloBI) [poelen2014globi], bioRxiv, and Xeno-canto [vellinga2015xenocanto]. Together they target four animal-centered skills—encyclopedic knowledge about animals (Wikipedia), ecological interaction reasoning (GloBI), scientific-literature reasoning about animals (bioRxiv), and text-only bioacoustic-domain knowledge about animal vocalizations (Xeno-canto). By design, BAGEL tests not only overall accuracy but also robustness across source domains and source-specific dimensions, providing a more fine-grained view of what current models do and do not know about animals and natural history.

Table 1: Representative neighboring benchmarks and where BAGEL differs most clearly. Rows are illustrative rather than exhaustive; each entry was checked against the primary paper or official benchmark description.

## 2 Related Work

##### LLMs and foundation models for biodiversity and animal-related applications

The use of language and foundation models in biodiversity-relevant settings is growing rapidly, but the literature is still fragmented across application areas and modalities. In ecology, LLMs have been explored as tools for extracting structured knowledge from text, including ecological variables from disease reports [gougherty2024ecological] and species interactions from large scientific corpora [keck2025massive]. Recent evaluation work has also begun to probe whether general-purpose LLMs possess ecological knowledge directly, reporting a substantial gap between relatively strong factual or taxonomic recall and weaker ecological reasoning or conservation-oriented judgment [dorm2025ecologicalknowledge]. In parallel, animal-centered foundation-model work has expanded in bioacoustics: BEANS established a public benchmark covering multiple animal-sound tasks [hagiwara2022beans]; ISPA proposed a text-based representation for transcribing animal sounds and connecting them to language-model-style methods [hagiwara2024ispa]; and NatureLM-audio introduced an audio-language foundation model tailored to bioacoustics, with strong zero-shot generalization across taxa and tasks [robinson2024naturelmaudio]. Outside text and audio, BioCLIP demonstrates that biodiversity-specific foundation models can substantially improve fine-grained recognition across a wide taxonomic range [stevens2024bioclip]. Adjacent Earth-science domains have likewise moved toward specialized language models, including K2 for geoscience and OceanGPT for ocean science [deng2023k2,bi2023oceangpt]. Together, these studies show clear momentum toward AI systems specialized for nature and environmental data, but they do not directly evaluate closed-book animal knowledge in text-only LLMs.

##### General knowledge and science benchmarks

Large language models are commonly evaluated using broad knowledge benchmarks such as MMLU [hendrycks2021mmlu], which measure multitask performance across many academic subjects. In science-focused evaluation, ScienceQA [lu2022learn] provides a large benchmark of science questions with associated explanations and multimodal context. These benchmarks are valuable for measuring general scientific competence, but they do not specifically target fine-grained knowledge of animals, biodiversity, or natural history.

##### Biomedical and scientific QA benchmarks

A separate line of work studies domain-specific evaluation in biomedicine and scientific question answering. BLURB [gu2021domain] aggregates multiple biomedical NLP tasks into a unified benchmark, while PubMedQA [jin2019pubmedqa] focuses on question answering over biomedical research abstracts. BioASQ [nentidis2023bioasq] has also established a long-running shared task centered on large-scale biomedical semantic indexing and question answering. These resources demonstrate the value of domain-specific evaluation, but they primarily focus on biomedical or clinical knowledge rather than biodiversity and natural history.

##### Environmental and ecological benchmarks

More recently, several benchmarks have moved closer to environmental and ecological applications. EnviroExam [huang2024enviroexam] evaluates environmental science knowledge of large language models using curriculum-based questions, and ELLE [guo2025elle] proposes a QA benchmark for eco-environment applications. Work on ecological knowledge evaluation is also beginning to emerge, with recent evidence that strong general-purpose LLMs retain only partial and task-dependent ecological competence [dorm2025ecologicalknowledge]. These efforts are important adjacent steps, but they emphasize broad environmental science, ecology, or sustainability topics rather than animal-centered expertise. In contrast, BAGEL foregrounds animal-centered evaluation under one protocol: encyclopedic species facts (Wikipedia), ecological interactions among taxa (GloBI), animal-relevant scientific literature reasoning (bioRxiv), and bioacoustic-domain textual knowledge (Xeno-canto), with accuracy reported per source.

##### Our Contributions

BAGEL complements prior work by focusing on a distinct but underexplored evaluation axis: closed-book question answering about animals and natural history grounded in heterogeneous biodiversity-relevant sources. Rather than testing general academic knowledge or biomedical reasoning alone, BAGEL is designed to measure how well language models handle species-level knowledge, ecological relations, and animal-focused factual generalization across multiple source domains. Table 1 makes the contrasts concrete with representative neighbors.

Figure 1: Overview of the BAGEL benchmark curation pipeline: four source-specific preparation tracks feed domain prompts into a shared generator, followed by quality checks, four-option formatting, and option-order shuffling.

## 3 Benchmark Construction

Figure 1 summarizes the end-to-end curation workflow; subsections below describe each source domain. The four corpora are not intended to exhaust real-world natural-history competence; they are public, machine-accessible anchors over complementary animal-centered skills: encyclopedic knowledge about animals, ecological interaction reasoning among taxa, scientific-literature reasoning on animal-focused preprints, and text-only bioacoustic knowledge derived from animal recordings.

### 3.1 Wikipedia

The Wikipedia subset targets *encyclopedic knowledge about animals*: English Wikipedia species articles supply the evidence, and each item probes closed-book recall of taxon-specific facts a reader would normally take from such an article, without access to the article at test time.²

²Portal: https://www.wikipedia.org/; taxa are linked through Wikidata and article plain-text extracts are retrieved via the MediaWiki API.

**Article retrieval.** For each candidate taxon, structured metadata (scientific name and higher-level classification) is used to resolve the English article title through Wikidata (linking the taxon name to an item and reading its English Wikipedia sitelink), after which the article plain-text extract is retrieved via the MediaWiki API. Pages with missing text or extracts shorter than 1,000 characters are excluded so that stubs and minimally informative articles do not dominate the benchmark.
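To make the two-hop resolution concrete, here is a minimal Python sketch of this step, assuming direct HTTP calls to the public Wikidata and MediaWiki APIs; the function names, parameters, and error handling are illustrative, not the paper's released code.

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"
WIKIPEDIA_API = "https://en.wikipedia.org/w/api.php"
MIN_EXTRACT_CHARS = 1_000  # stubs below this length are excluded

def english_article_title(scientific_name: str) -> str | None:
    """Resolve a taxon name to its English Wikipedia article title via Wikidata."""
    # 1) Find a Wikidata item matching the taxon name.
    search = requests.get(WIKIDATA_API, params={
        "action": "wbsearchentities", "search": scientific_name,
        "language": "en", "format": "json",
    }).json()
    if not search.get("search"):
        return None
    qid = search["search"][0]["id"]
    # 2) Read the item's English Wikipedia sitelink.
    entity = requests.get(WIKIDATA_API, params={
        "action": "wbgetentities", "ids": qid,
        "props": "sitelinks", "format": "json",
    }).json()
    link = entity["entities"][qid].get("sitelinks", {}).get("enwiki")
    return link["title"] if link else None

def article_extract(title: str) -> str | None:
    """Fetch the plain-text extract, dropping missing or short pages."""
    resp = requests.get(WIKIPEDIA_API, params={
        "action": "query", "prop": "extracts", "explaintext": 1,
        "titles": title, "format": "json",
    }).json()
    text = next(iter(resp["query"]["pages"].values())).get("extract", "")
    return text if len(text) >= MIN_EXTRACT_CHARS else None
```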

**Text preparation.** Extracts longer than 180,000 characters are truncated before generation, with preference for paragraph- or sentence-boundary cuts so that the retained prefix remains coherent. This cap keeps prompts within practical context-window limits of the generation model. No additional manual editing of article text is performed beyond this cap.
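The exact boundary heuristic is not published; the following sketch assumes a preference order of paragraph break, then sentence break, then a hard character cut.

```python
MAX_CHARS = 180_000  # cap keeps prompts within the generator's context window

def truncate_extract(text: str, max_chars: int = MAX_CHARS) -> str:
    """Truncate long extracts, preferring paragraph- or sentence-boundary cuts."""
    if len(text) <= max_chars:
        return text
    prefix = text[:max_chars]
    for sep in ("\n\n", ". "):           # paragraph boundary first, then sentence
        cut = prefix.rfind(sep)
        if cut > max_chars // 2:         # avoid degenerate cuts near the start
            return prefix[:cut + len(sep)].rstrip()
    return prefix                        # fall back to a hard cut
```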

**Question synthesis.** Questions are generated through the GPT-4o-mini API using the system–user template in Appendix A.1.1. The model may emit up to eight four-option, single-answer items per species, each assigned to one of eight thematic dimensions: *Taxonomy*; *Behavior* (including social behavior where applicable); *Communication*; *Morphology*; *Habitat*; *Cognition*; *Geographic Distribution*; and *Diet*. Every item must be justified solely by explicit statements in the supplied extract; dimensions not supported by the text are skipped. When the article mentions vocalization, geographic range, or feeding, the prompt encourages at least one item in *Communication*, *Geographic Distribution*, or *Diet*, respectively. Parsed outputs are kept only if they pass simple structural checks (valid dimension label, exactly four options, and a designated correct option that matches one option string exactly). At evaluation time, models see only the stem and answer choices; article text and construction metadata are withheld, consistent with the closed-book protocol in Section 4.
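The structural checks named above are simple to state in code. The sketch below assumes a parsed-JSON layout with `dimension`, `options`, and `answer` fields; those field names are assumptions, not the paper's published schema.

```python
DIMENSIONS = {
    "Taxonomy", "Behavior", "Communication", "Morphology",
    "Habitat", "Cognition", "Geographic Distribution", "Diet",
}

def is_valid_item(item: dict) -> bool:
    """Keep a generated item only if it passes the structural checks."""
    options = item.get("options", [])
    return (
        item.get("dimension") in DIMENSIONS   # valid thematic dimension label
        and len(options) == 4                 # exactly four answer options
        and item.get("answer") in options     # correct option matches one option string
    )
```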

### 3.2 Global Biotic Interactions (GloBI)

The GloBI subset targets ecological interaction reasoning among species. Tabular records in the Global Biotic Interactions exchange format describe directed links between a source taxon and a target taxon together with an interaction-type label and optional locality, coordinates, observation time, life-stage or body-part fields, habitat, and bibliographic provenance.³

³GloBI portal and indexed interaction data: https://www.globalbioticinteractions.org/

**Preprocessing.** We read at most 10,000 rows from the record table and harmonize column names across common GloBI export conventions. Rows without both endpoint taxa and an interaction-type label are excluded. Each retained row is converted into a short natural-language summary of the interaction together with optional locality, date, and coordinate metadata, plus dataset- and reference-level provenance.
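A minimal pandas sketch of this step is given below, assuming a tab-separated GloBI interactions export; the column aliases are examples of the export conventions being harmonized, not an exhaustive mapping, and the summary template is illustrative.

```python
import pandas as pd

# Example aliases across common GloBI export conventions (illustrative).
ALIASES = {
    "sourceTaxonName": "source_taxon",
    "targetTaxonName": "target_taxon",
    "interactionTypeName": "interaction_type",
}

def load_interactions(path: str, max_rows: int = 10_000) -> pd.DataFrame:
    df = pd.read_csv(path, sep="\t", nrows=max_rows).rename(columns=ALIASES)
    # Drop rows missing either endpoint taxon or the interaction-type label.
    return df.dropna(subset=["source_taxon", "target_taxon", "interaction_type"])

def summarize(row: pd.Series) -> str:
    """Render one record as a short natural-language interaction summary."""
    s = f"{row['source_taxon']} {row['interaction_type']} {row['target_taxon']}"
    if isinstance(row.get("localityName"), str):
        s += f" (locality: {row['localityName']})"
    return s + "."
```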

**Balanced subsampling.** From the annotated pool we select 3,500 interactions, stratifying by interaction type so that the empirical distribution over relation labels is more uniform than under uniform random sampling of rows. Within each type, rows with richer contextual metadata (locality, coordinates, dates) are preferred when ties arise, using a fixed random seed for reproducibility; the final subset is shuffled before question synthesis.
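As a rough sketch of the stratified selection, the code below allocates a near-uniform quota per interaction type and breaks ties by metadata richness; the quota rule and richness score are assumptions about unpublished details.

```python
import pandas as pd

CONTEXT_COLS = ["localityName", "decimalLatitude", "eventDate"]  # assumed metadata fields

def balanced_subsample(df: pd.DataFrame, n: int = 3_500, seed: int = 0) -> pd.DataFrame:
    groups = df.groupby("interaction_type", sort=False)
    quota = max(1, n // groups.ngroups)   # roughly uniform over relation labels
    picks = []
    for _, g in groups:
        # Prefer rows with richer contextual metadata (locality, coordinates, dates).
        rich = g.reindex(columns=CONTEXT_COLS).notna().sum(axis=1)
        g = g.loc[rich.sort_values(ascending=False, kind="stable").index]
        picks.append(g.head(quota))
    # Shuffle the final subset with a fixed seed before question synthesis.
    return pd.concat(picks).head(n).sample(frac=1.0, random_state=seed)
```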

**Multiple-choice synthesis.** For each selected interaction, we prompt GPT-4o-mini through the OpenAI API with the natural-language interaction summary as the only textual evidence. The instruction format is given in Appendix A.1.1. The model must output exactly one closed-book, four-option item labeled as either *Masked participant identification* or *Masked interaction type inference*, obeying constraints that discourage trivial verb cues and encourage ecologically plausible distractors. Malformed or unparseable outputs are discarded.
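A hedged sketch of the per-interaction generation call follows, using the official `openai` Python client; the JSON response layout and field names are assumptions, and the actual instruction template is the one in Appendix A.1.1.

```python
import json
from openai import OpenAI

client = OpenAI()
ALLOWED_TYPES = {"Masked participant identification", "Masked interaction type inference"}

def synthesize_item(summary: str, instructions: str) -> dict | None:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": summary},  # the only textual evidence
        ],
    )
    try:
        item = json.loads(resp.choices[0].message.content)
    except (json.JSONDecodeError, TypeError):
        return None                                # discard unparseable outputs
    # Discard malformed items: wrong label or not exactly four options.
    if item.get("question_type") not in ALLOWED_TYPES or len(item.get("options", [])) != 4:
        return None
    return item
```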

### 3.3 bioRxiv

The bioRxiv subset targets *scientific-literature reasoning about animals*. We harvest animal-related articles directly from the public bioRxiv website⁴ using its month-indexed advanced search interface, restricting to posts dated from 2023 onward and to four subject areas, including *Animal Behavior and Cognition*, *Ecology*, and *Evolutionary Biology*.

⁴bioRxiv server: https://www.biorxiv.org/

Similar Articles

RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models

arXiv cs.CL

RedBench introduces a universal dataset aggregating 37 benchmark datasets with 29,362 samples across 22 risk categories and 19 domains to enable standardized and comprehensive red teaming evaluation of large language models. The work addresses inconsistencies in existing red teaming datasets and provides baselines, evaluation code, and open-source resources for assessing LLM robustness against adversarial prompts.