BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models

Summary

BioTool introduces a comprehensive biomedical tool-calling dataset with 34 tools and 7,040 human-verified query-API pairs, enabling fine-tuned LLMs to outperform GPT-5.1 on biomedical tool use and significantly enhance answer quality.

arXiv:2605.05758v1 Announce Type: new Abstract: Despite the success of large language models (LLMs) on general-purpose tasks, their performance in highly specialized domains such as biomedicine remains unsatisfactory. A key limitation is the inability of LLMs to effectively leverage biomedical tools, which clinical experts and biomedical researchers rely on extensively in daily workflows. While recent general-domain tool-calling datasets have substantially improved the capabilities of LLM agents, existing efforts in the biomedical domain largely rely on in-context learning and restrict models to a small set of tools. To address this gap, we introduce BioTool, a comprehensive biomedical tool-calling dataset designed for fine-tuning LLMs. BioTool comprises 34 frequently used tools collected from the NCBI, Ensembl, and UniProt databases, along with 7,040 high-quality, human-verified query-API call pairs spanning variation, genomics, proteomics, evolution, and general biology. Fine-tuning a 4-billion-parameter LLM on BioTool yields substantial improvements in biomedical tool-calling performance, outperforming cutting-edge commercial LLMs such as GPT-5.1. Furthermore, human expert evaluations demonstrate that integrating a BioTool-fine-tuned tool caller significantly improves downstream answer quality compared to the same LLM without tool usage, highlighting the effectiveness of BioTool in enhancing the biomedical capabilities of LLMs. The full dataset and evaluation code are available at https://github.com/gxx27/BioTool

# BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models
Source: [https://arxiv.org/html/2605.05758](https://arxiv.org/html/2605.05758)
Xin Gao¹\*, Ruiyi Zhang¹\*, Meixi Du¹, Peijia Qin¹, Pengtao Xie¹,²†

¹UC San Diego, ²MBZUAI

\{xig022, ruz048, p1xie\}@ucsd\.edu

\*Equal contribution\. †Corresponding authors\.

###### Abstract

Despite the success of large language models \(LLMs\) on general\-purpose tasks, their performance in highly specialized domains such as biomedicine remains unsatisfactory\. A key limitation is the inability of LLMs to effectively leverage biomedical tools, which clinical experts and biomedical researchers rely on extensively in daily workflows\. While recent general\-domain tool\-calling datasets have substantially improved the capabilities of LLM agents, existing efforts in the biomedical domain largely rely on in\-context learning and restrict models to a small set of tools\. To address this gap, we introduce BioTool, a comprehensive biomedical tool\-calling dataset designed for fine\-tuning LLMs\. BioTool comprises 34 frequently used tools collected from the NCBI, Ensembl, and UniProt databases, along with 7,040 high\-quality, human\-verified query–API call pairs spanning variation, genomics, proteomics, evolution, and general biology\. Fine\-tuning a 4\-billion\-parameter LLM on BioTool yields substantial improvements in biomedical tool\-calling performance, outperforming cutting\-edge commercial LLMs such as GPT\-5\.1\. Furthermore, human expert evaluations demonstrate that integrating a BioTool\-fine\-tuned tool caller significantly improves downstream answer quality compared to the same LLM without tool usage, highlighting the effectiveness of BioTool in enhancing the biomedical capabilities of LLMs\. The full dataset and evaluation code are available at [https://github\.com/gxx27/BioTool](https://github.com/gxx27/BioTool)\.


## 1 Introduction

The rapid advancement of large language models \(LLMs\) has revolutionized natural language processing, enabling unprecedented performance across a wide range of general\-purpose tasks \(OpenAI, [2023](https://arxiv.org/html/2605.05758#bib.bib21); Bai et al\., [2023](https://arxiv.org/html/2605.05758#bib.bib5)\)\. However, their capabilities in biomedical domains remain limited, which hinders their deployment in high\-stakes, real\-world biomedical applications \(Chen et al\., [2025](https://arxiv.org/html/2605.05758#bib.bib7); Li et al\., [2025a](https://arxiv.org/html/2605.05758#bib.bib16)\)\. A key reason for this limitation is the insufficient ability of LLMs to effectively leverage specialized biomedical tools \(Jin et al\., [2024](https://arxiv.org/html/2605.05758#bib.bib13)\)\. Unlike commonsense questions that can often be answered directly, biomedical problems typically require even expert researchers to consult external tools and databases before drawing reliable conclusions \(NCBI, [2017](https://arxiv.org/html/2605.05758#bib.bib20)\)\. For instance, even for human biologists, the biological function of a raw nucleotide sequence cannot be reliably inferred without the aid of computational tools, such as BLAST or other sequence similarity–based methods \(Altschul et al\., [1990](https://arxiv.org/html/2605.05758#bib.bib2)\)\. As shown in Figure [1](https://arxiv.org/html/2605.05758#S1.F1), LLMs that lack access to or integration with such tools are therefore prone to hallucinations and imprecise generalizations, undermining their reliability for scientific discovery\.

![Refer to caption](https://arxiv.org/html/2605.05758v1/x1.png)

Figure 1: Comparison between answers generated by LLMs without tools and BioTool\-augmented LLMs for biomedical queries\. LLMs without tools often hallucinate or produce imprecise answers \(left\), whereas BioTool\-augmented LLMs \(right\) generate API calls and retrieve critical information from biomedical databases, leading to higher\-quality responses\.

Given these challenges, early attempts have integrated biomedical and chemistry tools into LLMs via in\-context learning \(Jin et al\., [2024](https://arxiv.org/html/2605.05758#bib.bib13); Bran et al\., [2024](https://arxiv.org/html/2605.05758#bib.bib6)\)\. Although these approaches show improvements, they are constrained to a small set of available tools due to limited context length\. Moreover, biomedical research tools often support diverse and complex usage scenarios that cannot be fully captured by a few lines of textual prompts, which hinders LLMs from fully realizing their potential in biomedical tool usage\. Furthermore, these tools require models to map natural\-language questions to highly specialized schemas, identifiers, and parameter conventions to reliably retrieve biologically relevant evidence\. Inspired by the success of instruction\-tuning–based tool\-calling datasets in the general NLP domain \(Liu et al\., [2024](https://arxiv.org/html/2605.05758#bib.bib18); Patil et al\., [2024](https://arxiv.org/html/2605.05758#bib.bib26)\), we address this gap by curating a comprehensive biomedical tool\-calling dataset, BioTool\.

BioTool is an instruction fine\-tuning–style biomedical tool\-calling dataset consisting of 7,040 high\-quality, human\-verified query–API call pairs\. It includes 34 frequently used tools from the NCBI \(NCBI, [2017](https://arxiv.org/html/2605.05758#bib.bib20)\), Ensembl \(Hubbard et al\., [2002](https://arxiv.org/html/2605.05758#bib.bib11)\), and UniProt \(The UniProt Consortium, [2017](https://arxiv.org/html/2605.05758#bib.bib31)\) databases, spanning multiple subdomains such as variation, genomics, proteomics, evolution, and general biology\. To construct the dataset, we first manually select 34 tools from NCBI, Ensembl, and UniProt that are widely used in biomedical research\. We then collect official documentation for these tools from their respective websites and use it to generate diverse combinations of API parameters with the assistance of LLMs\. The synthesized API calls are executed and filtered to remove cases with unavailable or uninformative responses, resulting in 3,829 unique API calls\. Next, we prompt cutting\-edge reasoning models \(OpenAI, [2025](https://arxiv.org/html/2605.05758#bib.bib24)\) with these API calls and their corresponding responses to generate potential user queries\. These queries are subsequently evaluated by an LLM\-based judge to assess whether the API responses meaningfully support answering the queries, followed by a final round of human expert review focusing on biological relevance and correctness\. This process yields 7,040 high\-quality query–API call pairs, which constitute the final BioTool dataset\.

We evaluate the quality and effectiveness of BioTool through two sets of experiments\. First, we fine\-tune several open\-source LLMs with 4B to 8B parameters on the BioTool training split and compare them with cutting\-edge commercial LLMs, including GPT\-5\.1, Gemini 3 Pro, and Claude 4\.5 Sonnet, using in\-context learning\. Results on the test split show that smaller LLMs fine\-tuned with BioTool significantly outperform commercial LLMs with hundreds of times more parameters in terms of tool\-calling quality\. For example, a BioTool\-fine\-tuned Qwen3\-4B model outperforms the best\-performing commercial model, Claude 4\.5 Sonnet, by 15\.0% in overall API\-calling quality\. Second, we conduct human evaluations to assess whether BioTool\-enhanced LLMs produce higher\-quality answers from the perspective of biomedical researchers\. On 1,408 test queries, a GPT\-5\.1 model augmented with oracle BioTool API calls achieves 88\.4% higher normalized answer quality compared to the same model without tool usage, demonstrating the intrinsic quality of the BioTool dataset\. Moreover, a GPT\-5\.1 model augmented with a BioTool\-fine\-tuned API caller achieves 69% higher normalized answer quality compared to the raw GPT\-5\.1 model, highlighting the effectiveness of BioTool in training tool\-using LLMs and enhancing their biomedical capabilities\.

## 2 Related Works

Early general\-purpose tool\-calling models, such as Toolformer \(Schick et al\., [2023](https://arxiv.org/html/2605.05758#bib.bib30)\) and Gorilla \(Patil et al\., [2024](https://arxiv.org/html/2605.05758#bib.bib26)\), established that LLMs can be trained to invoke external APIs, thereby grounding responses in retrieved data to mitigate hallucinations\. Subsequent frameworks like ToolBench \(Qin et al\., [2023](https://arxiv.org/html/2605.05758#bib.bib27)\) and APIGen \(Liu et al\., [2024](https://arxiv.org/html/2605.05758#bib.bib18)\) advanced this capability by introducing scalable pipelines for generating synthetic instruction\-tuning data\. Despite these advancements, generalist models often struggle with specialized scientific domains like biomedicine because they rely on broad datasets that include only a negligible fraction of corresponding tools and frequently fail to adhere to the rigorous schema constraints of scientific databases\. To address these limitations, domain\-specific agents have emerged\. GeneGPT \(Jin et al\., [2024](https://arxiv.org/html/2605.05758#bib.bib13)\) pioneered this shift by utilizing in\-context learning \(Wei et al\., [2023](https://arxiv.org/html/2605.05758#bib.bib33)\) to enable access to NCBI Web APIs\. Similarly, systems such as SciAgent \(Li et al\., [2025b](https://arxiv.org/html/2605.05758#bib.bib17)\) and ChemCrow \(Bran et al\., [2024](https://arxiv.org/html/2605.05758#bib.bib6)\) have successfully integrated tool\-augmented agents for complex reasoning in scientific and chemical research\. While more recent entries like Biomni \(Huang et al\., [2025](https://arxiv.org/html/2605.05758#bib.bib10)\) have introduced general\-purpose agents for biomedical tasks, they primarily focus on a restricted subset of tools\. Consequently, they lack a comprehensive, full\-list interface to the primary authoritative biomedical databases\.

![Refer to caption](https://arxiv.org/html/2605.05758v1/x2.png)

Figure 2: The systematic workflow of BioTool spans from automated dataset construction to downstream application\. Panel \(a\) illustrates the multi\-stage construction pipeline, which includes initial tool selection from primary databases, automated API call generation, and a rigorous filtering process involving execution checks, heuristic validation, and LLM\-based informativeness assessment\. Panel \(b\) depicts the inference\-time application, where specialized API\-calling models fine\-tuned on BioTool enable base LLMs to retrieve grounded observations and generate verifiable biological answers\.

## 3 The BioTool Dataset

This section details the development and composition of BioTool\. We first present an example data entry from BioTool to illustrate the structure of a query–API call pair\. Each entry includes a *user query* field, which contains a realistic clinical or biomedical question expressed in free\-form text\. The *tool information* field provides descriptions of the tools required to answer the query, while the *API arguments* specify the input parameters for the corresponding API endpoint\. Executing the API endpoint with these arguments returns an *observation*, which contains information used to augment the LLM’s response\. We note that the observation is fully determined by the API endpoint and its arguments; it is included in the dataset for completeness and user convenience\.

Next, we describe the sequential construction pipeline used to generate and verify biomedical tool calling pairs in Section[3\.1](https://arxiv.org/html/2605.05758#S3.SS1), illustrated in Figure[2](https://arxiv.org/html/2605.05758#S2.F2)\. We then provide a quantitative analysis of the resulting dataset, highlighting its functional utility and biological diversity in Section[3\.2](https://arxiv.org/html/2605.05758#S3.SS2)\.

Example of BioTool Data Entry

User Query: Could you provide concise definitions for the major severe immunodeficiency disorders?

Tool Information:

API Arguments:

Observations: \(id: “DI\-00171”, definition: “An autosomal recessive immunologic disorder characterized by the loss of expression of MHC class II antigens on antigen\-presenting cells\.\.\.”\), \(id: “DI\-00305”, definition: “A form of chronic granulomatous disease\.\.\.”\), \.\.\.
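To make the entry layout concrete, the following is a minimal Python sketch of how one such record might be represented. The class name, field names, and the placeholder values for the tool description and arguments are assumptions for illustration only; the released dataset may use a different schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class BioToolEntry:
    """One query-API call pair. Field names are illustrative assumptions."""
    user_query: str                      # free-form biomedical question
    tool_information: str                # description of the required tool(s)
    api_arguments: Dict[str, Any]        # inputs for the API endpoint
    observations: List[Dict[str, str]] = field(default_factory=list)


entry = BioToolEntry(
    user_query=("Could you provide concise definitions for the major "
                "severe immunodeficiency disorders?"),
    tool_information="<tool description placeholder>",   # not given in the example
    api_arguments={"query": "<argument placeholder>"},   # hypothetical argument
    observations=[
        {"id": "DI-00171",
         "definition": "An autosomal recessive immunologic disorder "
                       "characterized by the loss of expression of MHC class II "
                       "antigens on antigen-presenting cells..."},
    ],
)
print(entry.observations[0]["id"])
```

Because the observation is fully determined by the endpoint and its arguments, a consumer of the data could in principle drop the `observations` field and re-execute the call instead.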

![Refer to caption](https://arxiv.org/html/2605.05758v1/x3.png)

Figure 3: Distribution analysis of the 7,040 samples within BioTool across four dimensions\. Panel \(a\) shows the distribution across source databases\. Panel \(b\) illustrates the distribution of samples by tool type\. Panel \(c\) presents the distribution across various biological domains\. Panel \(d\) delineates the distribution of user queries across the 34 distinct biological tools\.

### 3\.1 Dataset Construction Pipeline

#### Tool Selection

We select three major online API providers, the National Center for Biotechnology Information \(NCBI\), UniProt, and Ensembl, as the tool sources for BioTool, motivated by their roles as authoritative repositories within the global biomedical research infrastructure \(Sayers, [2010](https://arxiv.org/html/2605.05758#bib.bib29); Ahmad et al\., [2025](https://arxiv.org/html/2605.05758#bib.bib1); Yates et al\., [2014](https://arxiv.org/html/2605.05758#bib.bib35)\)\. These three platforms are widely considered the definitive standard because they offer expansive and highly interoperable data spanning the entire central dogma of biology, from raw genomic sequences to functional protein annotations\.

Across the three databases, we comprehensively review their websites and manually select tools that are critical for answering biomedical and clinical questions\. During this process, we exclude tools with limited biomedical relevance \(e\.g\., APIs that only return service or versioning information\) as well as deprecated or unstable tools\. As a result, we curate a diverse set of 34 tools comprising 124 API endpoints, each of which is frequently used in biomedical research workflows\. The complete list of selected tools is provided in Appendix[F](https://arxiv.org/html/2605.05758#A6)\. In addition, we collect the official documentation for each API endpoint from the corresponding website\. These documents specify API usage, input arguments, constraints, and example calls, and serve as essential resources for subsequent stages of API call synthesis and user query generation\.

#### API Call Synthesis and Verification

Based on the curated tool set and associated documentation, we manually select critical API arguments corresponding to biologically meaningful identifiers for each API endpoint\. These arguments, such as taxon IDs, gene symbols, and UniProt accession numbers, ensure that the synthesized API calls are biologically diverse and scientifically plausible\. Given the selected arguments, we follow prior work \(Liu et al\., [2024](https://arxiv.org/html/2605.05758#bib.bib18)\) to randomly sample a large set of candidate API calls\. These candidates are then executed to filter out cases that result in client errors, timeouts, or empty responses\. To further improve data quality, we design a novel heuristic\-based filtering strategy to remove API calls that are overly similar to existing ones, as well as those whose returned observations lack biological significance\. Details of this heuristic filter are provided in Appendix [A](https://arxiv.org/html/2605.05758#A1)\. After this verification process, we obtain a collection of 6,391 unique API calls\.
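The execution-based part of this verification step can be sketched as follows. This is an illustrative outline, not the authors' implementation: `filter_calls`, `call_key`, and the simulated executor are hypothetical names, and the exact-duplicate check only stands in for the similarity heuristic of Appendix A, which is not reproduced here.

```python
import json
from typing import Callable, Dict, List, Optional, Tuple


def call_key(call: Dict) -> str:
    """Canonical string form of a call, used to detect exact duplicates."""
    return json.dumps(call, sort_keys=True)


def filter_calls(candidates: List[Dict],
                 execute: Callable[[Dict], Optional[str]]) -> List[Tuple[Dict, str]]:
    """Keep calls that are new, execute without error, and return data."""
    kept, seen = [], set()
    for call in candidates:
        key = call_key(call)
        if key in seen:              # stand-in for the similarity heuristic
            continue
        seen.add(key)
        try:
            observation = execute(call)
        except Exception:            # client errors, timeouts, and the like
            continue
        if not observation or not observation.strip():
            continue                 # empty responses are discarded
        kept.append((call, observation))
    return kept


def fake_execute(call: Dict) -> str:
    """Simulated executor so the sketch runs without network access."""
    identifier = call["args"]["id"]
    if identifier == "bad":
        raise TimeoutError("simulated timeout")
    if identifier == "empty":
        return ""
    return f"record for {identifier}"


candidates = [
    {"endpoint": "lookup", "args": {"id": "P04637"}},
    {"endpoint": "lookup", "args": {"id": "P04637"}},  # exact duplicate
    {"endpoint": "lookup", "args": {"id": "bad"}},
    {"endpoint": "lookup", "args": {"id": "empty"}},
]
kept = filter_calls(candidates, fake_execute)
print(len(kept))
```

In a real pipeline, `execute` would issue the HTTP request to the relevant NCBI, Ensembl, or UniProt endpoint; it is injected as a parameter here so the filtering logic can be tested in isolation.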

#### User Query Generation

Given the synthesized API calls, we leverage cutting\-edge LLMs to generate corresponding user queries, following a self\-instruct–style paradigm established in prior work \(Wang et al\., [2022](https://arxiv.org/html/2605.05758#bib.bib32); Patil et al\., [2024](https://arxiv.org/html/2605.05758#bib.bib26); Liu et al\., [2024](https://arxiv.org/html/2605.05758#bib.bib18)\)\. Specifically, LLMs are prompted with an API call, its documentation, and its corresponding observation, together with a small set of human\-crafted in\-context query–API call pairs, to generate realistic user queries\.

To further improve the quality and biological relevance of BioTool, we introduce two novel adaptations to ensure both the *necessity* and *sufficiency* of the API observations\. First, to enforce *necessity*, we apply Chain\-of\-Thought \(CoT\) prompting \(Wei et al\., [2023](https://arxiv.org/html/2605.05758#bib.bib33)\) using a strong reasoning model \(OpenAI o3 \(OpenAI, [2025](https://arxiv.org/html/2605.05758#bib.bib24)\)\) when generating user queries\. The model is first prompted to summarize the technical details of the API observation into a natural\-language description, which is then used to generate the final user query\. This procedure ensures that the observation is required to answer the query, while keeping the query realistic and avoiding explicit references to specific tools or API calls\. The detailed system and user prompts for this process are provided in Appendix [E\.1](https://arxiv.org/html/2605.05758#A5.SS1)\. Second, to ensure *sufficiency*, we employ another cutting\-edge LLM \(Claude Haiku 4\.5 \(Anthropic, [2025](https://arxiv.org/html/2605.05758#bib.bib3)\)\) to perform informativeness\-based filtering, inspired by the LLM\-as\-a\-judge framework \(Zheng et al\., [2023](https://arxiv.org/html/2605.05758#bib.bib36)\)\. The model is prompted to follow a structured rubric and classify a query–API call pair as informative if the observation contains at least one relevant fact or a partial summary that supports the user’s intent\. Pairs in which the observation is unrelated to the query or too vague to support a concrete response are discarded\. The specific judge prompts are provided in Appendix [E\.2](https://arxiv.org/html/2605.05758#A5.SS2)\.

#### Human Refinement

The final stage involves a comprehensive manual review conducted by human evaluators with at least a college\-level background in bioinformatics\. The evaluators first identify and remove low\-quality queries\. For the remaining samples, they refine pedantic or unnatural phrasing and ensure the accuracy of biological terminology and nomenclature\. After this round of filtering and correction, the final BioTool dataset comprises 7,040 high\-quality samples\.

This instruction fine\-tuning–style dataset is primarily used to train open\-source LLMs as API\-calling models, following training paradigms established in general\-domain tool\-calling datasets \(Patil et al\., [2024](https://arxiv.org/html/2605.05758#bib.bib26); Liu et al\., [2024](https://arxiv.org/html/2605.05758#bib.bib18)\)\. A BioTool\-trained LLM can assist state\-of\-the\-art LLMs in generating grounded and scientifically accurate responses, as illustrated in the right panel of Figure [2](https://arxiv.org/html/2605.05758#S2.F2)\.
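The inference-time use of a trained API caller can be sketched roughly as below. All names (`answer_with_tools` and the three callables) are hypothetical, the prompt wording is an assumption, and the stub functions merely simulate a tool caller, an API executor, and an answering LLM.

```python
from typing import Callable, Dict


def answer_with_tools(query: str,
                      call_tool: Callable[[str], Dict],
                      execute: Callable[[Dict], str],
                      generate: Callable[[str], str]) -> str:
    """Predict an API call, execute it, and ground the answer in the result."""
    call = call_tool(query)          # fine-tuned API-calling model
    try:
        observation = execute(call)  # real systems would query the database
    except Exception:
        observation = ""             # fall back to answering without evidence
    if observation:
        prompt = (f"Question: {query}\n"
                  f"Retrieved evidence: {observation}\n"
                  "Answer using the evidence.")
    else:
        prompt = f"Question: {query}"
    return generate(prompt)          # base LLM writes the final answer


# Stubs that simulate the three components.
stub_caller = lambda q: {"endpoint": "lookup", "args": {"symbol": "BRCA1"}}
stub_execute = lambda call: f"summary for {call['args']['symbol']}"
stub_llm = lambda prompt: prompt     # echoes the prompt for inspection

result = answer_with_tools("What does BRCA1 do?", stub_caller, stub_execute, stub_llm)
print(result)
```

Keeping the caller and the answering model as separate callables mirrors the setup described above, in which a small fine-tuned model produces the call and a larger LLM consumes the observation.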

### 3\.2 Data Statistics

The BioTool dataset is derived from 34 distinct biological tools and 124 unique API endpoints, encompassing a wide array of scientific content categorized across several key dimensions\. As shown in Figure [3](https://arxiv.org/html/2605.05758#S3.F3)\(a\), the distribution of tools across databases is well balanced, with comparable proportions from NCBI, UniProt, and Ensembl\. Figure [3](https://arxiv.org/html/2605.05758#S3.F3)\(b\) illustrates the diversity of tool types included in BioTool, ranging from data retrieval \(e\.g\., nucleotide identifier fetching\) and search and discovery \(e\.g\., phenotype\-based gene discovery\) to biological analysis and mapping \(e\.g\., cross\-referencing SNP identifiers\)\. Figure [3](https://arxiv.org/html/2605.05758#S3.F3)\(c\) highlights the dataset’s broad scientific scope, covering domains such as genomics \(e\.g\., gene tree querying\), proteomics \(e\.g\., protein sequence alignment\), variation analysis \(e\.g\., linkage disequilibrium analysis\), and evolutionary biology \(e\.g\., species\-level taxonomy identification\)\. Finally, Figure [3](https://arxiv.org/html/2605.05758#S3.F3)\(d\) shows that BioTool includes both frequently accessed general\-purpose tools and a long tail of specialized tools, all of which are essential for complex scientific discovery across the central dogma\.

## 4 Experimental Results

To evaluate the effectiveness of BioTool, we first compare the API\-calling capabilities of small open\-source LLMs fine\-tuned on BioTool against their vanilla counterparts and cutting\-edge proprietary LLMs using in\-context learning\. We then conduct human expert evaluations to compare the answer quality of baseline LLMs with that of BioTool\-augmented LLMs\.

### 4\.1 Experimental Setup

#### BioTool score

We define a BioTool performance score to automatically evaluate the capability of an LLM as an API caller on the BioTool dataset, especially the alignment of retrieved information with the user’s intent\. Specifically, assume we have the test set $D=\{(q_1,o_1),\ldots,(q_n,o_n)\}$, where $q_i$ is the $i^{\text{th}}$ user query and $o_i$ is the observation obtained from the ground\-truth API call in the dataset\. The BioTool score on this test set, $\mathrm{S}(D)$, for an LLM API caller $f$ is then defined as follows:

$$\mathrm{S}(D)=\sum_{i=1}^{n}\mathrm{Sim}\bigl(f(q_i),o_i\bigr) \tag{1}$$

where $\mathrm{Sim}(\hat{o},o)$ computes the semantic embedding similarity of two text strings: the ground\-truth observation $o$ and the corresponding observation $\hat{o}$ obtained from the API call predicted by the LLM\. In practice, we use a MedCPT model \(Jin et al\., [2023](https://arxiv.org/html/2605.05758#bib.bib12)\) to obtain a sentence embedding for each observation\. API calls may fail due to incorrect model generation, yielding an empty string $\hat{o}=\varepsilon$\. In this case, we set $\mathrm{Sim}(\varepsilon,o)=0$\. Intuitively, this score determines model performance by measuring whether the retrieved biological facts remain semantically similar to the required information, even when the technical implementation of the call differs from the reference\.
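As a concrete reading of Equation (1), the sketch below computes the score from predicted and reference observations. It assumes cosine similarity as the embedding similarity, which the text does not specify, and it uses a toy letter-frequency embedding in place of MedCPT so that it runs without model weights; `toy_embed`, `sim`, and `biotool_score` are hypothetical names.

```python
import math
from typing import Callable, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sim(pred: str, ref: str, embed: Callable[[str], List[float]]) -> float:
    """Sim(o_hat, o): zero for a failed call (empty string), else cosine."""
    if pred == "":
        return 0.0
    return cosine(embed(pred), embed(ref))


def biotool_score(preds: List[str], refs: List[str],
                  embed: Callable[[str], List[float]]) -> float:
    """Equation (1) as written: the sum of per-sample similarities."""
    return sum(sim(p, r, embed) for p, r in zip(preds, refs))


def toy_embed(text: str) -> List[float]:
    """Toy stand-in for a sentence encoder: letter-frequency counts."""
    return [float(text.lower().count(ch)) for ch in "abcdefghijklmnopqrstuvwxyz"]


preds = ["brca1 gene record", ""]        # second call failed
refs = ["brca1 gene record", "tp53 gene record"]
score = biotool_score(preds, refs, toy_embed)
print(score)
```

Dividing the returned value by `len(refs)` would give a per-sample mean, which is easier to compare across test sets of different sizes.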

#### Additional Metrics

Based on the BioTool score, we define two additional metrics to further characterize model performance\. Similar metrics have been widely adopted in existing API\-calling benchmarks \(Patil et al\., [2025](https://arxiv.org/html/2605.05758#bib.bib25)\)\. First, we define the API calling success rate $\mathrm{AS}$ as follows:

$$\mathrm{AS}(D)=\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\!\left[\mathrm{Sim}\bigl(f(q_i),o_i\bigr)>0\right] \tag{2}$$

where $\mathbf{1}\left[\cdot\right]$ is the indicator function\. A zero similarity indicates API calling failure due to incorrect formatting, invalid API names, or improper parameter values\. Conceptually, this metric focuses on the model’s capability to generate API calls that execute correctly and return a valid response containing data\. Second, we define an exact match score $\mathrm{EM}$ as follows:

$$\mathrm{EM}(D)=\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\!\left[\mathrm{Sim}\bigl(f(q_i),o_i\bigr)=1\right] \tag{3}$$

which measures the proportion of predictions whose resulting observations exactly match the ground\-truth reference observation, requiring the model to correctly identify the API endpoint and provide all required parameters with values that exactly match the reference\.
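Given the per-sample similarities, Equations (2) and (3) reduce to simple counts. The sketch below is illustrative; the function names are hypothetical, and the small tolerance in the exact-match test is an assumption added because floating-point similarities rarely equal exactly 1.

```python
from typing import List


def api_success_rate(sims: List[float]) -> float:
    """AS: fraction of samples whose similarity is strictly positive."""
    return sum(1 for s in sims if s > 0) / len(sims)


def exact_match(sims: List[float], tol: float = 1e-6) -> float:
    """EM: fraction of samples whose similarity equals 1 (within tol)."""
    return sum(1 for s in sims if abs(s - 1.0) <= tol) / len(sims)


# Four hypothetical samples: one exact match, two partial, one failed call.
sims = [1.0, 0.8, 0.0, 0.5]
print(api_success_rate(sims), exact_match(sims))
```

By construction every exact match also counts as a success, so EM can never exceed AS on the same test set.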

#### Models

In this study, we use four cutting\-edge proprietary models, GPT\-5\.1, GPT\-5\.1\-Codex, Gemini 3 Pro, and Claude 4\.5 Sonnet \(OpenAI, [2025b](https://arxiv.org/html/2605.05758#bib.bib23), [a](https://arxiv.org/html/2605.05758#bib.bib22); Google, [2025](https://arxiv.org/html/2605.05758#bib.bib8); Anthropic, [2025](https://arxiv.org/html/2605.05758#bib.bib4)\), under an in\-context learning scheme\. We use four open\-source models, Llama3\.1\-8B\-Instruct, Qwen3\-8B, Qwen2\.5\-7B\-Instruct, and Qwen3\-4B\-Instruct \(Grattafiori et al\., [2024](https://arxiv.org/html/2605.05758#bib.bib9); Yang et al\., [2025](https://arxiv.org/html/2605.05758#bib.bib34); Qwen et al\., [2025](https://arxiv.org/html/2605.05758#bib.bib28)\), for both in\-context learning and BioTool\-based fine\-tuning\. We report the average performance across three independent runs\.

### 4\.2 Results on Tool Calling Capability

In this section, we first fine\-tune small open\-source models on the training split of the BioTool dataset, which is randomly split into training and test sets at a four\-to\-one ratio\. We use the cutting\-edge proprietary models and base open\-source models as baselines, and all models are evaluated on the same held\-out test set of 1,408 samples in terms of BioTool score\. As shown in Table [1](https://arxiv.org/html/2605.05758#S4.SS2), there is a clear performance advantage for BioTool\-fine\-tuned models over much larger LLMs under in\-context learning\. The fine\-tuned 4B model achieved the highest overall BioTool score, representing a 15\.0% improvement over the strongest proprietary model, Claude 4\.5 Sonnet, and 68\.9% higher performance than GPT\-5\.1\. This gap suggests that the general\-purpose pre\-training of frontier LLMs together with in\-context learning is insufficient to navigate the specialized technical constraints and precise parameter mappings of biological repositories\. Instead, the high\-density training signals within the BioTool dataset allow significantly smaller models to acquire the necessary domain expertise that remains elusive to even the largest proprietary models\.
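The split sizes follow directly from the four-to-one ratio applied to 7,040 samples. A minimal sketch of such a random split is shown below; the function name and the seed are assumptions, and the authors' actual splitting code may differ.

```python
import random
from typing import List, Sequence, Tuple


def split_four_to_one(samples: Sequence, seed: int = 0) -> Tuple[List, List]:
    """Shuffle and hold out one fifth of the samples as the test set."""
    rng = random.Random(seed)        # fixed seed is an arbitrary choice
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = len(shuffled) // 5
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_four_to_one(range(7040))
print(len(train), len(test))
```

With 7,040 samples this yields 5,632 training samples and 1,408 test samples, matching the held-out test set size stated above.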

Table 1: Comparative evaluation of models on the BioTool dataset, measured by the BioTool score \(higher is better\)\. Scores are reported for each constituent database \(NCBI, UniProt, Ensembl\) and overall\. Model names with the suffix *Ins* denote instruction\-tuned variants\. Bold values indicate the best performance in each column\.

### 4\.3 Human Evaluation of Answer Quality

The ultimate criterion for assessing the usefulness of a tool\-calling dataset is its ability to improve the quality of LLM\-generated answers\. To evaluate this, we use GPT\-5\.1 as the base model and compare its performance under three settings: \(1\) no tool augmentation, \(2\) augmentation with ground\-truth BioTool API calls, and \(3\) augmentation with a BioTool\-fine\-tuned Qwen3\-4B\-Instruct tool\-calling model\. We evaluate these three settings across all test queries using side\-by\-side human judgments by two annotators with college\-level bioinformatics backgrounds\. Annotators compare settings \(1\) vs\. \(2\) and \(1\) vs\. \(3\), selecting the better answer based on informativeness and task fulfillment, while rejecting vague or scientifically incorrect responses\. The normalized win rates for the two comparisons are shown in Figure [4](https://arxiv.org/html/2605.05758#S4.F4)\. The reported win rates are the average of the two annotators’ individual results\. Raw preference results and normalization procedures are detailed in Appendix [D](https://arxiv.org/html/2605.05758#A4)\.

We observe that tool augmentation substantially improves the quality of biomedical answers, demonstrating that grounding LLMs in verifiable data from NCBI, Ensembl, and UniProt effectively mitigates domain\-specific hallucinations and imprecise generalizations\. The oracle configuration achieves a 94\.2% win rate over the base model, highlighting the high quality of the BioTool dataset\. Similarly, the BioTool\-fine\-tuned Qwen3\-4B\-Instruct model attains an 84\.5% win rate, indicating that a small, fine\-tuned model can improve the correctness and helpfulness of large commercial LLMs as judged by human evaluators, further demonstrating the practical utility of BioTool\.
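The win rates here can be related to the normalized answer-quality gains quoted in the Introduction (88.4% and 69%). The arithmetic below assumes the gain is the win rate minus the loss rate with no ties; this is an inference that happens to reproduce the reported figures, not the procedure of Appendix D, which is not reproduced here.

```python
def normalized_gain(win_rate_pct: float) -> float:
    """Win rate minus loss rate, in percentage points (assumes no ties)."""
    loss_rate_pct = 100.0 - win_rate_pct
    return win_rate_pct - loss_rate_pct


oracle_gain = round(normalized_gain(94.2), 1)      # oracle BioTool calls
finetuned_gain = round(normalized_gain(84.5), 1)   # fine-tuned tool caller
print(oracle_gain, finetuned_gain)
```

Under this assumption a 50% win rate corresponds to a gain of zero, meaning the tool-augmented and base settings are indistinguishable.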

![Refer to caption](https://arxiv.org/html/2605.05758v1/x4.png)

Figure 4: Human evaluation results comparing answer quality between BioTool\-augmented LLMs and LLMs without tool usage, using GPT\-5\.1 as the base model\. The augmented settings include GPT\-5\.1 with oracle BioTool data \(left\) and GPT\-5\.1 with a BioTool\-fine\-tuned Qwen3\-4B\-Instruct tool caller \(right\)\.

Table 2: Comparative evaluation of EM and AS metrics \(higher is better\)\. Model names with the suffix *Ins* denote instruction\-tuned variants\. Bold values indicate the best performance in each column\.

### 4\.4 Additional Results

We report results for the additional metrics under the same experimental settings as in Table [1](https://arxiv.org/html/2605.05758#S4.SS2) to provide further insights into model behavior and dataset characteristics\. As shown in Table [2](https://arxiv.org/html/2605.05758#S4.SS3), there is a clear divergence between Exact Match \(EM\) and API Success \(AS\), particularly for proprietary models\. Although models such as Claude 4\.5 Sonnet and Gemini 3 Pro achieve high AS scores, their EM remains extremely low, indicating difficulty in producing parameterizations that exactly match reference specifications\. In contrast, the BioTool\-fine\-tuned Qwen3\-4B\-Instruct achieves an EM nearly six times that of the best proprietary model, highlighting the necessity of fine\-tuning for learning the precise syntax of biological APIs\. The EM–AS gap also reflects the varying complexity of biological repositories\. On the NCBI subset, proprietary models such as GPT\-5\.1 fail to achieve any exact matches and frequently encounter execution errors, likely due to strict identifier formats and nested parameters\. Fine\-tuned models, however, maintain high execution success, demonstrating that BioTool trains functionally robust models that produce valid and biologically meaningful API calls even without exact string matches\.

Table 3: Distribution of parameter\-level error categories among API failures\. Missing, Extra, and Wrong denote the proportion of the total test set attributable to failures involving missing, extra, and wrong\-value parameters, respectively\. Bold values indicate the lowest error rate in each column\.

### 4\.5 Error Analysis

To gain deeper insight into model failure modes, we conduct a systematic error analysis over all API call failures on the whole test set\. We categorize parameter\-level mistakes into three types that are not mutually exclusive\. *Missing parameters* refers to cases where the predicted call omits one or more arguments present in the ground\-truth reference, thereby altering the biological scope of the retrieved result; for instance, omitting the species argument from an Ensembl comparative genomics call causes the endpoint to return homology data for an unintended reference organism\. *Extra parameters* refers to cases where the predicted call includes arguments absent from the reference, potentially redirecting the query’s intent; for example, injecting a canonical flag into a VEP annotation call restricts consequence reporting to canonical transcripts only and suppresses annotations for non\-canonical isoforms that may be biologically or clinically relevant\. *Wrong parameter values* refers to cases where the argument name is correct but the assigned value is semantically incorrect; for example, specifying "blastp" in place of "tblastn" as the BLAST program conflates protein\-against\-protein and protein\-against\-translated\-nucleotide search modes and yields entirely incompatible results\. The distribution of these error types across all evaluated models is reported in Table [4\.4](https://arxiv.org/html/2605.05758#S4.SS4)\.

As shown in the results, proprietary models and non\-fine\-tuned open\-source models exhibit pervasive failures in semantic parameter mapping, including incorrect database or program selections in NCBI Entrez and BLAST calls, misspecified biological identifiers such as taxonomic names, and erroneous traversal targets in Entrez link operations\. All of these cause the retrieved data to be biologically misaligned with the user’s intent, regardless of whether the call itself executes successfully\. BioTool fine\-tuning substantially mitigates these semantic mapping failures, yielding much lower API failure rates across all trained model variants and confirming that correct biomedical API invocation requires domain\-specific schema grounding that neither general\-purpose pretraining nor in\-context learning can reliably provide\.
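The three error categories above can be illustrated with a minimal sketch that compares the argument dictionary of a predicted call against its reference. The function name `classify_param_errors` and the dictionary representation of a call are assumptions made for illustration; the paper's actual evaluation code may differ. The example arguments are taken from the case study in Section 4\.6.

```python
def classify_param_errors(predicted: dict, reference: dict) -> dict:
    """Compare the arguments of one predicted call against its reference.

    The categories are not mutually exclusive: a single call can have
    missing, extra, and wrong-value parameters at the same time.
    """
    missing = sorted(k for k in reference if k not in predicted)
    extra = sorted(k for k in predicted if k not in reference)
    wrong = sorted(
        k for k in reference
        if k in predicted and predicted[k] != reference[k]
    )
    return {"missing": missing, "extra": extra, "wrong": wrong}


# Arguments from the case study (oracle call vs. the ICL-based call).
reference = {
    "species": "capra_hircus",
    "region": "29:707234-757234",
    "population_name": "NextGen:All",
    "r2": 0.5,
}
predicted = {
    "species": "goat",
    "region": "29:707234-757234",
    "population_name": "NextGen",
}

errors = classify_param_errors(predicted, reference)
print(errors)
```

Here the predicted call is counted under both the missing and the wrong\-value categories, which is why the per\-category proportions in an analysis of this kind need not sum to the overall failure rate.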

### 4\.6 Case Study

To illustrate how BioTool enhances LLMs’ biomedical capabilities, we present an example comparing the API\-calling behavior and tool\-augmented answers of a BioTool\-fine\-tuned Qwen3\-4B\-Instruct model and Claude 4\.5 Sonnet on a complex genomic linkage query\. The ICL\-based Claude model exhibits low precision in handling domain\-specific tool usage rules, directly passing literal natural\-language terms such as “goat” and “NextGen” from the user query into the API arguments\. In contrast, the BioTool\-fine\-tuned model correctly maps these concepts to the required API parameters\. This example demonstrates that BioTool provides high\-quality supervision for learning the implicit rules of biomedical tool usage, enabling reliable natural\-language\-to\-tool mapping that state\-of\-the\-art ICL\-based LLMs often fail to achieve\.

Comparison of API Call and Response to an Example User Query

- User Query: Within 29:707234\-757234, which variant pairs show strong linkage in the NextGen goat population?
- BioTool\-fine\-tuned Qwen3\-4B API Call: get\_ld\_region\(species="capra\_hircus", region="29:707234\-757234", population\_name="NextGen:All", d\_prime=0\.8\)
- BioTool\-fine\-tuned Qwen3\-4B Response: \[ \{"variation1": "rs661133063", "variation2": "rs668584442", "d\_prime": 1\.0 \}, \.\.\. \]
- Claude API Call: get\_ld\_region\(species="goat", region="29:707234\-757234", population\_name="NextGen"\)
- Claude Response: \{ "error": "Can not find internal name for species 'goat'" \}
- Oracle API Call: get\_ld\_region\(species="capra\_hircus", region="29:707234\-757234", population\_name="NextGen:All", r2=0\.5\)
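To make the fine\-tuned call more concrete, the sketch below shows how such a call might be turned into a request URL. It assumes that `get_ld_region` wraps the Ensembl REST endpoint `GET /ld/:species/region/:region/:population_name` and that `d_prime` and `r2` are passed as query parameters; this mapping is an assumption for illustration and is not confirmed by the paper. The helper name `build_ld_region_url` is hypothetical, and the code only builds the URL without sending a request.

```python
from urllib.parse import quote, urlencode

ENSEMBL_REST = "https://rest.ensembl.org"  # public Ensembl REST server


def build_ld_region_url(species, region, population_name, **filters):
    """Build a URL for the (assumed) Ensembl LD-by-region endpoint.

    Optional filters such as d_prime or r2 become query parameters;
    filters set to None are dropped.
    """
    path = "/ld/{}/region/{}/{}".format(
        quote(species, safe=""),
        quote(region, safe=":"),
        quote(population_name, safe=":"),
    )
    query = urlencode({k: v for k, v in filters.items() if v is not None})
    return ENSEMBL_REST + path + ("?" + query if query else "")


# Arguments of the fine-tuned model's call from the case study.
url = build_ld_region_url(
    "capra_hircus", "29:707234-757234", "NextGen:All", d_prime=0.8
)
print(url)
```

A real client would also need to request JSON output (for example through a `Content-Type` header) and handle error responses such as the species\-name error shown for the ICL\-based call.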

## 5 Conclusion

In this work, we introduce BioTool, a comprehensive biomedical tool\-calling dataset comprising 7,040 human\-verified query\-API call pairs spanning 34 biomedical tools\. Fine\-tuning a 4\-billion\-parameter LLM on BioTool leads to substantial improvements in API\-calling performance, surpassing cutting\-edge proprietary LLMs\. Furthermore, human evaluations confirm that BioTool\-augmented LLMs generate more helpful, informative, and scientifically accurate answers compared to the same base models without tool usage, shedding light on the development of reliable biomedical agents in the future\.

## Limitations

Despite the performance gains observed with BioTool, several limitations remain\. Our current framework focuses exclusively on one\-hop tool\-calling responses\. This ignores more complex biological problems that cannot be solved with a single API interaction and instead require multi\-hop search results or iterative reasoning across multiple tools\. Furthermore, we did not fine\-tune an independent, specialized biomedical agent\. This architectural choice was necessitated by the extreme context length of raw biological observations, which frequently exceeds our resource limits even after post\-processing and summarization\. Future work should explore long\-context architectures and multi\-step reasoning trajectories to better support the most intricate clinical and research workflows\.

## Acknowledgements

We acknowledge funding support from the National Science Foundation \(NSF\) under grants IIS\-2405974 and IIS\-2339216, and from the National Institutes of Health \(NIH\) under grant R35GM157217\.

## References

- Ahmad et al\. \(2025\) Shadab Ahmad, Leonardo Jose da Costa Gonzales, Emily H Bowler\-Barnett, Daniel L Rice, Minjoon Kim, Supun Wijerathne, Aurélien Luciani, Swaathi Kandasaamy, Jie Luo, Xavier Watkins, Edd Turner, Maria J Martin, and the UniProt Consortium\. 2025\. [The uniprot website api: facilitating programmatic access to protein knowledge](https://doi.org/10.1093/nar/gkaf394)\. *Nucleic Acids Research*, 53\(W1\):W547–W553\.
- Altschul et al\. \(1990\) Stephen F\. Altschul, Warren Gish, Webb Miller, Eugene W\. Myers, and David J\. Lipman\. 1990\. [Basic local alignment search tool](https://doi.org/10.1016/S0022-2836(05)80360-2)\. *Journal of Molecular Biology*, 215\(3\):403–410\.
- Anthropic \(2025\) Anthropic\. 2025\. [System card: Claude haiku 4\.5](https://assets.anthropic.com/m/99128ddd009bdcb/Claude-Haiku-4-5-System-Card.pdf)\. System Card\.
- Anthropic \(2025\) Anthropic\. 2025\. [System card: Claude sonnet 4\.5](https://assets.anthropic.com/m/12f214efcc2f457a/original/Claude-Sonnet-4-5-System-Card.pdf)\.
- Bai et al\. \(2023\) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenhang Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, K\. Lu, and 31 others\. 2023\. [Qwen technical report](https://api.semanticscholar.org/CorpusID:263134555)\. *ArXiv*, abs/2309\.16609\.
- Bran et al\. \(2024\) Andres M\. Bran, Sean Cox, Oliver Schilter, and 1 others\. 2024\. [Augmenting large language models with chemistry tools](https://doi.org/10.1038/s42256-024-00832-8)\. *Nature Machine Intelligence*, 6:525–535\.
- Chen et al\. \(2025\) Qiang Chen, Yifan Hu, Xiaohan Peng, and 1 others\. 2025\. [Benchmarking large language models for biomedical natural language processing applications and recommendations](https://doi.org/10.1038/s41467-025-56989-2)\. *Nature Communications*, 16:3280\.
- Google \(2025\) Google\. 2025\. [Gemini 3 pro model card](https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf)\.
- Grattafiori et al\. \(2024\) Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al\-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others\. 2024\. [The llama 3 herd of models](https://arxiv.org/abs/2407.21783)\. *Preprint*, arXiv:2407\.21783\.
- Huang et al\. \(2025\) Kexin Huang, Serena Zhang, Hanchen Wang, Yuanhao Qu, Yingzhou Lu, Yusuf Roohani, Ryan Li, Lin Qiu, Junze Zhang, Yin Di, and 1 others\. 2025\. Biomni: A general\-purpose biomedical ai agent\. *bioRxiv*, pages 2025–05\.
- Hubbard et al\. \(2002\) Tim Hubbard, David Barker, Ewan Birney, Graham Cameron, Yong Chen, Lucy Clark, Tony Cox, James Cuff, Val Curwen, Thomas Down, Richard Durbin, Eduardo Eyras, James Gilbert, Matthew Hammond, Lukasz Huminiecki, Arek Kasprzyk, Heikki Lehvaslaiho, Peter Lijnzaad, Chris Melsopp, and 16 others\. 2002\. [The ensembl genome database project](https://doi.org/10.1093/nar/30.1.38)\. *Nucleic Acids Research*, 30\(1\):38–41\.
- Jin et al\. \(2023\) Qiao Jin, Won Kim, Qingyu Chen, Donald C Comeau, Lana Yeganova, W John Wilbur, and Zhiyong Lu\. 2023\. Medcpt: Contrastive pre\-trained transformers with large\-scale pubmed search logs for zero\-shot biomedical information retrieval\. *Bioinformatics*, 39\(11\):btad651\.
- Jin et al\. \(2024\) Qiao Jin, Yifan Yang, Qingyu Chen, and Zhiyong Lu\. 2024\. Genegpt: Augmenting large language models with domain tools for improved access to biomedical information\. *Bioinformatics*, 40\(2\):btae075\.
- Krippendorff \(2011\) Klaus Krippendorff\. 2011\. Computing krippendorff’s alpha\-reliability\.
- Landis and Koch \(1977\) J Richard Landis and Gary G Koch\. 1977\. The measurement of observer agreement for categorical data\. *Biometrics*, pages 159–174\.
- Li et al\. \(2025a\) Mingchen Li, Zaifu Zhan, Han Yang, Yongkang Xiao, Huixue Zhou, Jiatan Huang, and Rui Zhang\. 2025a\. [Benchmarking retrieval\-augmented large language models in biomedical nlp: Application, robustness, and self\-awareness](https://doi.org/10.1126/sciadv.adr1443)\. *Science Advances*, 11\(47\):eadr1443\.
- Li et al\. \(2025b\) Xuchen Li, Ruitao Wu, Xuanbo Liu, Xukai Wang, Jinbo Hu, Zhixin Bai, Bohan Zeng, Hao Liang, Leheng Chen, Mingrui Chen, Haitian Zhong, Xuanlin Yang, Xu\-Yao Zhang, Liu Liu, Jia Li, Kaiqi Huang, Jiahao Xu, Haitao Mi, Wentao Zhang, and Bin Dong\. 2025b\. [Sciagent: A unified multi\-agent system for generalistic scientific reasoning](https://arxiv.org/abs/2511.08151)\. *Preprint*, arXiv:2511\.08151\.
- Liu et al\. \(2024\) Zuxin Liu, Thai Hoang, Jianguo Zhang, Ming Zhu, Tian Lan, Shirley Kokane, Juntao Tan, Weiran Yao, Zhiwei Liu, Yihao Feng, and 1 others\. 2024\. Apigen: Automated pipeline for generating verifiable and diverse function\-calling datasets\. *arXiv preprint arXiv:2406\.18518*\.
- McNemar \(1947\) Quinn McNemar\. 1947\. Note on the sampling error of the difference between correlated proportions or percentages\. *Psychometrika*, 12\(2\):153–157\.
- NCBI \(2017\) NCBI\. 2017\. [Database resources of the national center for biotechnology information](https://doi.org/10.1093/nar/gkx1095)\. *Nucleic Acids Research*, 46\(D1\):D8–D13\.
- OpenAI \(2023\) OpenAI\. 2023\. [Gpt\-4 technical report](https://api.semanticscholar.org/CorpusID:257532815)\.
- OpenAI \(2025a\) OpenAI\. 2025a\. [Gpt\-5\.1\-codex\-max system card](https://cdn.openai.com/pdf/2a7d98b1-57e5-4147-8d0e-683894d782ae/5p1_codex_max_card_03.pdf)\.
- OpenAI \(2025b\) OpenAI\. 2025b\. [Gpt\-5\.1 instant and gpt\-5\.1 thinking system card addendum](https://cdn.openai.com/pdf/4173ec8d-1229-47db-96de-06d87147e07e/5_1_system_card.pdf)\.
- OpenAI \(2025\) OpenAI\. 2025\. [Openai o3 and o4\-mini system card](https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf)\. System Card\.
- Patil et al\. \(2025\) Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng\-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E\. Gonzalez\. 2025\. [The berkeley function calling leaderboard \(BFCL\): From tool use to agentic evaluation of large language models](https://openreview.net/forum?id=2GmDdhBdDk)\. In *Forty\-second International Conference on Machine Learning*\.
- Patil et al\. \(2024\) Shishir G\. Patil, Tianjun Zhang, Xin Wang, and Joseph E\. Gonzalez\. 2024\. Gorilla: Large language model connected with massive apis\.
- Qin et al\. \(2023\) Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun\. 2023\. [Toolllm: Facilitating large language models to master 16000\+ real\-world apis](https://arxiv.org/abs/2307.16789)\. *Preprint*, arXiv:2307\.16789\.
- Qwen et al\. \(2025\) Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, and 25 others\. 2025\. [Qwen2\.5 technical report](https://arxiv.org/abs/2412.15115)\. *Preprint*, arXiv:2412\.15115\.
- Sayers \(2010\) Eric Sayers\. 2010\. A general introduction to the e\-utilities\. *Entrez Programming Utilities Help \[Internet\]\. Bethesda \(MD\): National Center for Biotechnology Information \(US\)*\.
- Schick et al\. \(2023\) Timo Schick, Jane Dwivedi\-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom\. 2023\. [Toolformer: Language models can teach themselves to use tools](https://arxiv.org/abs/2302.04761)\. *Preprint*, arXiv:2302\.04761\.
- The UniProt Consortium \(2017\) The UniProt Consortium\. 2017\. [Uniprot: the universal protein knowledgebase](https://doi.org/10.1093/nar/gkw1099)\. *Nucleic Acids Research*, 45\(D1\):D158–D169\.
- Wang et al\. \(2022\) Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A\. Smith, Daniel Khashabi, and Hannaneh Hajishirzi\. 2022\. [Self\-instruct: Aligning language models with self\-generated instructions](https://api.semanticscholar.org/CorpusID:254877310)\. In *Annual Meeting of the Association for Computational Linguistics*\.
- Wei et al\. \(2023\) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou\. 2023\. [Chain\-of\-thought prompting elicits reasoning in large language models](https://arxiv.org/abs/2201.11903)\. *Preprint*, arXiv:2201\.11903\.
- Yang et al\. \(2025\) An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others\. 2025\. [Qwen3 technical report](https://arxiv.org/abs/2505.09388)\. *Preprint*, arXiv:2505\.09388\.
- Yates et al\. \(2014\) Andrew Yates, Kathryn Beal, Stephen Keenan, William McLaren, Miguel Pignatelli, Graham R\. S\. Ritchie, Magali Ruffier, Kieron Taylor, Alessandro Vullo, and Paul Flicek\. 2014\. [The ensembl rest api: Ensembl data for any language](https://doi.org/10.1093/bioinformatics/btu613)\. *Bioinformatics*, 31\(1\):143–145\.
- Zheng et al\. \(2023\) Lianmin Zheng, Wei\-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P\. Xing, Hao Zhang, Joseph E\. Gonzalez, and Ion Stoica\. 2023\. [Judging llm\-as\-a\-judge with mt\-bench and chatbot arena](https://arxiv.org/abs/2306.05685)\. *Preprint*, arXiv:2306\.05685\.

## Appendix A Heuristic Filter Detail

In this section, we provide a more granular explanation of the heuristic filtering strategies employed during the API call synthesis and verification phase\.

The specific filtering logic varies across the three integrated databases to account for differences in their API architectures and the nature of the biological data they provide\.

For UniProt, which primarily provides functional protein annotations and sequence data, we implement a strict deduplication process by filtering out all API calls targeting the same unique identifier, such as a UniRef entry ID or keyword entry ID, within the same tool to prevent the over\-representation of specific proteins\. Furthermore, we validate every execution result by discarding any responses that return empty lists or "null" search results, thereby ensuring that every retained API call contains at least one valid, non\-empty biological observation\.

Ensembl requires a more nuanced dual\-path approach to balance diversity and validity when handling complex genomic coordinates\. For endpoints with a restricted set of valid parameter combinations \(defined as fewer than 20\), where strict ID deduplication would yield insufficient data, we selectively retain entries where the optional parameters, such as species or variants, are not identical, while query IDs are the same\. Conversely, for "rich" APIs with an expansive range of possible inputs, we apply a strategy similar to UniProt by filtering out any samples where the combination of required parameters is identical to an existing entry to prevent the model from over\-fitting to specific genomic regions\.

For NCBI, the strategy is optimized for high\-throughput tools and general metadata retrieval\. We apply specialized heuristics to the BLAST tool, only retaining parameter combinations that involve unique query sequences and return at least one significant alignment hit, while removing matchless queries that cannot support downstream scientific reasoning\. Other NCBI APIs are filtered using a standard heuristic method that removes identical identifier calls and verifies that the retrieved observations remain biologically informative\.
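The UniProt\-style path described above (deduplicate by identifier within a tool, then drop empty results) can be sketched as follows. The record layout with `tool`, `params`, and `observation` fields, the function name `filter_calls`, and the placeholder identifiers are all assumptions made for illustration; the Ensembl dual\-path and BLAST\-specific rules are not covered by this sketch.

```python
def filter_calls(records, key_fields):
    """Keep the first call per (tool, key parameters) and drop empty results.

    Assumes each record is a dict with 'tool', 'params', and 'observation',
    and that the key parameter values are hashable (e.g. strings).
    """
    seen = set()
    kept = []
    for rec in records:
        obs = rec.get("observation")
        # Discard empty lists, empty objects, and "null" search results.
        if obs in (None, [], {}, "", "null"):
            continue
        key = (rec["tool"],) + tuple(rec["params"].get(f) for f in key_fields)
        if key in seen:
            continue  # same identifier already retained for this tool
        seen.add(key)
        kept.append(rec)
    return kept


# Placeholder records; identifiers and observations are made up.
records = [
    {"tool": "UniRef", "params": {"id": "ID_A"}, "observation": [{"x": 1}]},
    {"tool": "UniRef", "params": {"id": "ID_A"}, "observation": [{"x": 1}]},
    {"tool": "UniRef", "params": {"id": "ID_B"}, "observation": []},
    {"tool": "Keywords", "params": {"id": "ID_A"}, "observation": {"x": 2}},
]

kept = filter_calls(records, key_fields=["id"])
print([(r["tool"], r["params"]["id"]) for r in kept])
```

Keying on the tool as well as the identifier mirrors the "within the same tool" wording: the same identifier may legitimately appear under two different tools.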

## Appendix B Dataset Scale Analysis

To examine how performance scales with training data volume, we train Qwen3\-4B\-Instruct \(Yang et al\., [2025](https://arxiv.org/html/2605.05758#bib.bib34)\) on six subsets of the BioTool training split, ranging from 10% to 100%\. Table [4](https://arxiv.org/html/2605.05758#A2.T4) presents the improvement over the untuned Qwen3\-4B\-Instruct baseline, whose overall Exact Match, API Success, and BioTool Score are 3\.6, 73\.3, and 63\.1, respectively\. The results show that even the smallest training subset yields substantial gains across all three metrics, confirming that BioTool provides strong supervision from the early stages of scaling\. As the training set expands, the gain in BioTool Score increases steadily, while Exact Match continues to improve throughout the full range, indicating that parameter\-level precision remains the principal source of additional benefit at larger scales\.

Table 4: Overall performance gains of Qwen3\-4B\-Instruct fine\-tuned on different fractions of the BioTool training set, measured as absolute improvements over the base Qwen3\-4B\-Instruct baseline\. $\Delta$EM, $\Delta$AS, and $\Delta$BioTool denote gains in Exact Match, API Success, and BioTool Score, respectively\. Bold values indicate the largest gain in each column\.

## Appendix C Generalization Capability

To examine whether BioTool\-fine\-tuned models can generalize beyond the APIs observed during training, we construct a stricter evaluation split based on API identity, such that all samples associated with the same API function are assigned to a single partition, and every API in the test set is unseen during training\. We then compare the resulting performance of the fine\-tuned Qwen3\-4B\-Instruct against GPT\-5\.1 and GPT\-5\.1\-Codex\. As shown in Table [5](https://arxiv.org/html/2605.05758#A3.T5), Qwen3\-4B\-Instruct retains a clear advantage in Exact Match and also achieves the strongest BioTool Score, indicating that BioTool fine\-tuning continues to improve the structural fidelity and overall quality of API calling even when evaluation is conducted on previously unseen functions\. Compared with the previously reported results under the standard random split, the performance gap becomes smaller in this setting, indicating that generalization to unseen APIs is substantially more challenging than generalization to new instances of previously observed APIs\. Nevertheless, the continued advantage of Qwen3\-4B\-Instruct shows that BioTool fine\-tuning yields transferable gains that extend beyond memorization of the training API set\.
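A split of this kind can be sketched by grouping samples on their API name and assigning whole groups to one partition. The `api` field, the function name `split_by_api`, the 20% test fraction, and the seed are assumptions for illustration; the paper does not state how many APIs were held out.

```python
import random
from collections import defaultdict


def split_by_api(samples, test_fraction=0.2, seed=0):
    """Assign every sample of a given API to exactly one partition."""
    by_api = defaultdict(list)
    for sample in samples:
        by_api[sample["api"]].append(sample)

    apis = sorted(by_api)                 # sort first for reproducibility
    random.Random(seed).shuffle(apis)
    n_test = max(1, round(len(apis) * test_fraction))

    test = [s for api in apis[:n_test] for s in by_api[api]]
    train = [s for api in apis[n_test:] for s in by_api[api]]
    return train, test


# Toy data: 5 hypothetical APIs with 3 queries each.
samples = [
    {"api": f"api_{i}", "query": f"query_{i}_{j}"}
    for i in range(5)
    for j in range(3)
]
train, test = split_by_api(samples)

train_apis = {s["api"] for s in train}
test_apis = {s["api"] for s in test}
print(len(train), len(test), train_apis & test_apis)
```

Because the shuffle operates on API names rather than on individual samples, no API can leak from the training partition into the test partition, which is the property the stricter split relies on.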

Table 5:Model performance on the unseen\-API evaluation split, where all instances associated with the same API function are assigned to a single partition\. EM, AS, and BioTool Score denote Exact Match, API Success, and BioTool Score, respectively\. Bold values indicate the best result in each column\.

## Appendix D Human Evaluation Details

This section details the manual side\-by\-side assessment process and provides the raw preference data used to derive the win rates reported in Section [4\.3](https://arxiv.org/html/2605.05758#S4.SS3)\. A total of 1,408 samples were evaluated for each comparison setting by researchers with biological backgrounds to compare the performance of tool\-augmented models against the base GPT\-5\.1 generator\. Table [6](https://arxiv.org/html/2605.05758#A4.T6) summarizes the distribution of these outcomes, including cases where both models performed well or both failed to provide a satisfactory answer\.

Table 6: Raw human preference distribution for pairwise model evaluations\. Model A refers to the tool\-augmented configuration \(Qwen3\-4B or Oracle\), and Model B refers to the base GPT\-5\.1 model without tool access\.

To provide a balanced comparison that accounts for samples where neither model showed a distinct advantage, we calculated the adjusted win rate reported in Figure [4](https://arxiv.org/html/2605.05758#S4.F4) based on the logic of McNemar’s test \(McNemar, [1947](https://arxiv.org/html/2605.05758#bib.bib19)\)\. In this framework, “Both Good” and “Both Bad” responses are collectively treated as ties \($n_c = n_{good} + n_{bad}$\) and adjusted by splitting them evenly between the two conditions\. Specifically, given the raw preference counts $n_a$ and $n_b$, the adjusted preference numbers $n'_a$ and $n'_b$ were calculated as $n'_a = n_a + \frac{1}{2} n_c$ and $n'_b = n_b + \frac{1}{2} n_c$\.
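As a worked illustration of the tie adjustment, the sketch below applies the two formulas to made\-up counts. The counts are hypothetical and are not taken from Table 6, and computing the final rate as $n'_a / (n'_a + n'_b)$ is an assumption about how the adjusted numbers are turned into a percentage.

```python
def adjusted_win_rate(n_a, n_b, n_good, n_bad):
    """Split ties evenly between both conditions, then compute A's share."""
    n_c = n_good + n_bad          # "Both Good" and "Both Bad" count as ties
    adj_a = n_a + 0.5 * n_c       # n'_a = n_a + n_c / 2
    adj_b = n_b + 0.5 * n_c       # n'_b = n_b + n_c / 2
    return adj_a / (adj_a + adj_b)


# Hypothetical counts for 100 comparisons (not the paper's data).
rate = adjusted_win_rate(n_a=60, n_b=20, n_good=15, n_bad=5)
print(f"adjusted win rate for Model A: {rate:.1%}")
```

With these counts the 20 ties contribute 10 to each side, so Model A ends at 70 out of 100. Splitting ties evenly pulls the rate toward 50% compared with discarding them, which would give 60 out of 80.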

#### Annotation Reliability\.

To validate the quality of the labeling procedure, we computed inter\-annotator agreement between the two human annotators on the manually labeled samples per task\. Table [7](https://arxiv.org/html/2605.05758#A4.T7) reports Cohen’s $\kappa$ and Krippendorff’s $\alpha$ for each task and for the combined dataset\. Both metrics exceed the commonly accepted threshold of 0\.667 for “acceptable” agreement \(Krippendorff, [2011](https://arxiv.org/html/2605.05758#bib.bib14)\) and fall within the “substantial” range \($\kappa \geq 0.61$\) under the Landis & Koch scale \(Landis and Koch, [1977](https://arxiv.org/html/2605.05758#bib.bib15)\), indicating reliable annotation across all settings\.

Table 7:Inter\-annotator agreement between two human annotators\. Each comparison is between a tool\-augmented model \(Oracle or Qwen3\-4B\) and the base GPT\-5\.1 model without tool access\. Agreement is computed on the four raw preference categories\.
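For readers unfamiliar with the statistic, the following sketch computes Cohen’s $\kappa$ for two annotators over the four raw preference categories. The label strings and the toy annotations are hypothetical; this is the standard textbook formula, not the authors' evaluation script, and Krippendorff’s $\alpha$ is not covered here.

```python
from collections import Counter


def cohens_kappa(labels_1, labels_2):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two annotators."""
    if len(labels_1) != len(labels_2) or not labels_1:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_1)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    # Expected agreement under independence, from each annotator's marginals.
    c1, c2 = Counter(labels_1), Counter(labels_2)
    p_e = sum(c1[k] * c2[k] for k in c1) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label
    return (p_o - p_e) / (1 - p_e)


# Toy annotations over the four categories (made up for illustration).
annotator_1 = ["A", "A", "B", "both_good", "both_bad", "A", "B", "A"]
annotator_2 = ["A", "A", "B", "both_good", "A", "A", "B", "B"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

In this toy example the annotators agree on 6 of 8 items, but chance agreement is already about 0\.36, so $\kappa$ comes out lower than the raw agreement rate.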

## Appendix E Prompts

### E\.1 Prompt for creating user queries

The following prompt is used to generate natural language user queries\. It requires four distinct input streams: \(1\) the source document context, \(2\) API function specifications, \(3\) retrieved biological observations, and \(4\) in\-context few\-shot demonstrations\.

System Prompt

You generate realistic biomedical questions that researchers naturally ask\.

TASK OVERVIEW: TWO\-PHASE REASONING

- Phase 1 – ANALYZE: Map technical parameters to natural language concepts \(Qualitative Mapping\)\.
- Phase 2 – GENERATE: Create TWO questions \(one Broad/Implicit, one Specific/Qualitative\)\.

PHASE 1: PARAMETER MAPPING & ABSTRACTION

Do not simply list parameters\. You must translate Data into Language\.

- VERBATIM \(Keep Exact\): Unique identifiers \(gene symbols, rsIDs, accessions\), Raw sequences \(FASTA format\), and Coordinates \(e\.g\., “chr1:100\-200”\)\.
- QUALITATIVE MAPPING \(Translate Numbers/Codes\):
  - Thresholds: Map high numbers to adjectives like “strong” or “significant” \(e\.g\., d\_prime=0\.8 → “strong linkage”\)\.
  - Complex Codes: Simplify technical strings to common terms \(e\.g\., 1000GENOMES\.\.\. → “1000 Genomes data”\)\.
- IMPLICIT DEFAULTS \(Selectively Omit\): If a parameter just ensures usable results \(e\.g\., format=json\), OMIT it\. The user implies “good results” by asking\.

PHASE 2: QUESTION GENERATION STRATEGY

Your goal is Tool Bias: The question should be specific enough that this tool is the logical choice, without naming it\.

- Question 1: The “Implicit” Question \(Natural & Broad\)\. A question a biologist asks a colleague\. Hide strict parameters; assume the tool’s filters represent the broad intent\.
- Question 2: The “Qualitative” Question \(Specific Demand\)\. A researcher asking for a specific quality\. Use adjectives to reflect parameter values \(e\.g\., “strong LD”\)\.

RULES:

- Ask only one question at a time\.
- Keep the question concise and to the point\.
- Do not use parentheses for supportive information\.

OUTPUT FORMAT

Return JSON only with keys: param\_analysis, observation\_check, and questions\.

User Prompt

GENERATE TWO NATURAL BIOMEDICAL QUESTIONS

API DOCUMENTATION \(Understand the tool’s specific bias and domain:\) \[API\_DOC\_TEXT\]

STEP 1 – Classify parameters

PARAMS: \[PARAMS\_JSON\]

STEP 2 – Check observation \(Distinguish between AVAILABLE data and MISSING/EMPTY data\)

OBSERVATION: \[OBSERVATION\_JSON\]

STEP 3 – Write TWO questions

- Question 1: Broad intent \(Natural tone, implies need for this specific tool\)\.
- Question 2: Specific feature \(Focus on a field that HAS data\)\.
- MUST include these identifiers: \[IDENTIFIERS\_BLOCK\]

REFERENCE EXAMPLES

\[FEW\_SHOTS\_TEXT\]

Output JSON with param\_analysis, observation\_check, and questions\.
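A pipeline using this prompt would typically validate the model's JSON before keeping the generated questions. The sketch below is one possible check, under the assumptions that `questions` is a list of exactly two strings and that every required identifier must appear verbatim in each question; the value types of `param_analysis` and `observation_check` are not specified in the prompt, so they are left unchecked. The function name is hypothetical.

```python
import json

REQUIRED_KEYS = {"param_analysis", "observation_check", "questions"}


def parse_generated_questions(raw, identifiers):
    """Validate the query-generation output and return the two questions."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    questions = data["questions"]
    if not isinstance(questions, list) or len(questions) != 2:
        raise ValueError("expected exactly two questions")
    for question in questions:
        absent = [i for i in identifiers if i not in question]
        if absent:
            raise ValueError(f"question lacks identifiers {absent}")
    return questions


# Example output shaped after the prompt (contents are illustrative).
raw = json.dumps({
    "param_analysis": {"region": "verbatim", "d_prime": "qualitative"},
    "observation_check": "variant pairs available",
    "questions": [
        "Which variants are linked within 29:707234-757234 in goats?",
        "Within 29:707234-757234, which variant pairs show strong linkage "
        "in the NextGen goat population?",
    ],
})

questions = parse_generated_questions(raw, identifiers=["29:707234-757234"])
print(len(questions))
```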

### E\.2 Prompt for informative check

The following prompt is used to evaluate whether an observation is informative enough to answer a specific user query\. It requires \(1\) the natural language user query and \(2\) the JSON representation of the biological observation\.

System Prompt

You are an evaluator for dataset filtering\.

Goal: Decide whether the OBSERVATION is informative \(useful\) for answering the USER QUERY\.

IMPORTANT CONTEXT: Observations are POSTPROCESSED SUMMARIES \(often partial\)\. This is NOT a strict completeness check\.

Be concise and deterministic\.

User Prompt

USER QUERY: \[USER\_QUERY\_TEXT\]

OBSERVATION \(tool output / retrieved data\): \[OBSERVATION\]

RUBRIC

- informative=true if the observation contains at least ONE relevant, non\-trivial fact that can be used to answer part of the query without inventing details\.
  - Examples/partial lists still count\.
  - Counts/aggregates/summaries still count\.
  - If the query asks “which X” but the observation only gives a count or a few examples, that is still informative=true \(partial answer\)\.
- informative=false ONLY when:
  - Observation is an error / empty / placeholder, OR
  - Observation content is clearly unrelated to the query intent, OR
  - Observation is too vague to support even a single concrete statement relevant to the query\.

When writing the reason:

- Focus on what CAN be answered using the observation \(even partially\)\.
- If partial, put the missing parts into limitations, but do NOT flip to false just because it’s incomplete\.
- Be concise and specific \(name the fields/signals you used\)\.

OUTPUT FORMAT \(do NOT output JSON; output exactly these lines\)

INFORMATIVE: true\|false

REASON: <short reason\>

LIMITATIONS: <optional; if none, write "none"\>
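Since this prompt asks for three labelled lines rather than JSON, the filter needs a small parser. The sketch below shows one way to read that format; the function name, the tolerance for surrounding whitespace, and the choice to raise on a missing or malformed verdict are assumptions for illustration.

```python
import re

LINE_RE = re.compile(r"^(INFORMATIVE|REASON|LIMITATIONS)\s*:\s*(.*)$")


def parse_informative_check(text):
    """Parse the three-line INFORMATIVE / REASON / LIMITATIONS output."""
    fields = {}
    for line in text.strip().splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2).strip()

    verdict = fields.get("INFORMATIVE", "").lower()
    if verdict not in ("true", "false"):
        raise ValueError("INFORMATIVE line is missing or not true/false")

    return {
        "informative": verdict == "true",
        "reason": fields.get("REASON", ""),
        "limitations": fields.get("LIMITATIONS") or "none",
    }


# Illustrative model output in the requested format.
sample = (
    "INFORMATIVE: true\n"
    "REASON: lists variant pairs with d_prime values\n"
    "LIMITATIONS: none"
)
parsed = parse_informative_check(sample)
print(parsed)
```

Raising on an unparseable verdict, rather than defaulting to true or false, keeps malformed judge outputs from silently affecting which samples are retained.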

### E\.3 Prompt for generating answers

The following prompts are used to generate the final natural language responses for the human expert evaluation\. The base\-model prompt takes only the original user query, while the tool\-augmented prompt additionally takes the generated API call and the corresponding biological observations as input\.

System Prompt \(Base Model\)

You are a concise, accurate biomedical assistant\. Answer the user question as directly as possible in 2–5 sentences, using your general biomedical knowledge and reasonable domain assumptions\. Do NOT mention tools, APIs, databases, internet access, or that you cannot look things up\. Do NOT tell the user how to get the information\. Answer directly\. If the question asks for record\-level details you cannot know exactly, give the best plausible answer in a natural, helpful way without refusals or meta statements \(avoid phrasing like “I can’t”, “I don’t know”, “without risk of error”\)\.

User Prompt \(Base Model\)

\[USER\_QUERY\]

System Prompt \(Tool\-Augmented\)

You are a concise, accurate biomedical assistant\. You are given a user question plus a tool call and its observation output\. Answer in 2–6 sentences\. Use the observation as primary evidence and your general knowledge\. Write the answer directly as if you already know the facts\. If the observation is insufficient, you may add general biomedical context, but do not invent record\-level facts that should come from the observation\.

User Prompt \(Tool\-Augmented\)

User question: \[USER\_QUERY\]

API call \(for context\): \[API\_CALL\_JSON\]

Observation \(tool output\): \[OBSERVATION\_TEXT\]

## Appendix F Tool and API List

The following part enumerates all tools and their corresponding APIs used in this work, grouped by data source\.

NCBI Tools

- ESearch: esearch
- ELink: elink
- EFetch: efetch
- EInfo: einfo
- BLAST: blast

UniProt Tools

- UniProtKB: get\_uniprotkb\_entry, search\_uniprotkb, stream\_uniprotkb
- UniRef: get\_uniref\_by\_id, get\_uniref\_light, get\_uniref\_members, search\_uniref, stream\_uniref
- UniParc: get\_uniparc\_by\_upi, get\_uniparc\_databases, get\_uniparc\_light, search\_uniparc, stream\_uniparc
- GeneCentric: get\_genecentric\_by\_accession, get\_genecentric\_by\_proteome\_id, search\_genecentric, stream\_genecentric
- Proteomes: get\_proteome\_by\_upid, search\_proteomes, stream\_proteomes
- Literature citations: get\_citation\_by\_id, search\_literature\_citations, stream\_literature\_citations
- Keywords: get\_keyword\_by\_id, search\_keywords, stream\_keywords
- Human diseases: get\_disease\_by\_id, search\_human\_diseases, stream\_human\_diseases
- Subcellular locations: get\_location\_by\_id, search\_subcellular\_locations, stream\_subcellular\_locations
- Cross\-referenced databases: get\_crossref\_database\_by\_id, search\_crossref\_databases, stream\_crossref\_databases
- Taxonomy: get\_taxonomy\_by\_id, search\_taxonomy, stream\_taxonomy
- UniRule: get\_unirule\_by\_id, search\_unirule, stream\_unirule
- ARBA: get\_arba\_by\_id, search\_arba, stream\_arba
- Archive: get\_archive\_id

Ensembl Tools

- Comparative Genomics: get\_alignment\_region, get\_cafe\_genetree\_by\_id, get\_cafe\_genetree\_by\_member\_id, get\_cafe\_genetree\_by\_member\_symbol, get\_genetree\_by\_id, get\_genetree\_member\_by\_id, get\_genetree\_member\_by\_symbol, get\_homology\_by\_id, get\_homology\_by\_symbol
- Cross References: get\_xrefs\_by\_id, get\_xrefs\_by\_symbol, lookup\_xref\_name
- Information: get\_info\_analysis, get\_info\_assembly, get\_info\_assembly\_region, get\_info\_biotypes, get\_info\_biotypes\_groups, get\_info\_biotypes\_name, get\_info\_compara\_methods, get\_info\_compara\_species\_sets, get\_info\_external\_dbs, get\_info\_genomes, get\_info\_genomes\_accession, get\_info\_genomes\_assembly, get\_info\_genomes\_division, get\_info\_genomes\_taxonomy, get\_info\_species, get\_info\_variation\_population\_name, get\_info\_variation\_populations, get\_info\_variation\_sources
- Linkage Disequilibrium: get\_ld\_around\_variant, get\_ld\_pairwise, get\_ld\_region
- Lookup: lookup\_by\_id, lookup\_by\_symbol
- Mapping: map\_assembly, map\_cdna\_to\_genome, map\_cds\_to\_genome, map\_translation\_to\_genome
- Ontologies and Taxonomy: get\_ontology\_ancestors, get\_ontology\_ancestors\_chart, get\_ontology\_descendants, get\_ontology\_id, get\_ontology\_name, get\_taxonomy\_classification, get\_taxonomy\_id, get\_taxonomy\_name
- Overlap: overlap\_by\_id, overlap\_by\_region, overlap\_translation
- Phenotype annotations: get\_phenotype\_by\_accession, get\_phenotype\_by\_gene, get\_phenotype\_by\_region, get\_phenotype\_by\_term
- Regulation: get\_binding\_matrix
- Sequence: get\_sequence\_by\_id, get\_sequence\_by\_region
- Transcript Haplotypes: get\_transcript\_haplotypes
- VEP: vep\_by\_hgvs, vep\_by\_id, vep\_by\_region
- Variation: get\_variation, get\_variation\_by\_pmcid, get\_variation\_by\_pmid, variant\_recoder
- Variation GA4GH: get\_ga4gh\_callsets, get\_ga4gh\_datasets, get\_ga4gh\_features, get\_ga4gh\_featuresets, get\_ga4gh\_references, get\_ga4gh\_referencesets, get\_ga4gh\_variantannotationsets, get\_ga4gh\_variants, get\_query\_beacon
