ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

arXiv cs.AI Papers

Summary

ToolSense is an open-source diagnostic framework that generates three benchmarks (realistic retrieval, MCQ probing, QA probing) to audit LLMs' parametric tool knowledge, revealing a knowledge-retrieval dissociation where strong retrieval performance can coexist with poor factual understanding.

arXiv:2606.12451v1 Announce Type: new

Cached at: 06/12/26, 08:52 AM

# ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs
Source: [https://arxiv.org/html/2606.12451](https://arxiv.org/html/2606.12451)

Ashutosh Hathidara\*, Sai Shruthi Sistla\*, Sebastian Schreiber, Sahil Bansal (SAP Labs; \*equal contribution) {ashutosh.hathidara, sai.shruthi.sistla, sebastian.schreiber, sahil.bansal01}@sap.com

###### Abstract

Large language models deployed as agents over large tool catalogs face a critical tool-retrieval bottleneck. Because embedding-based retrieval approaches rely on compact encoders that may under-capture specialized tool semantics, parametric tool retrieval (Wang et al., [2025](https://arxiv.org/html/2606.12451#bib.bib1)) instead encodes each tool as a *virtual token* appended to the LLM vocabulary and fine-tunes the model in two stages (memorization, then retrieval SFT) so that the LLM itself acts as the retriever, achieving strong performance on standard ToolBench retrieval benchmarks. Yet these benchmarks use verbose, fully specified queries, and their evaluation applies constrained decoding that restricts outputs to valid token paths; neither reveals whether the model actually *understands* its tools. We introduce ToolSense, an open-source LLM-powered diagnostic framework that takes any tool catalog as input and automatically generates three benchmarks: a Realistic Retrieval Benchmark (RRB) with queries at three ambiguity tiers, an MCQ probing benchmark, and a QA probing benchmark. Applying ToolSense to ToolBench (~47k tools) and evaluating five parametric model training configurations reveals a *knowledge-retrieval dissociation*: on RRB queries, several configurations collapse by ~50-64 percentage points compared to fully specified ToolBench benchmarks, falling below the embedding-model baseline. Additionally, despite strong retrieval performance, some models score near-random on factual probes, further evidence of this dissociation. We open-source the ToolSense framework and the ToolBench diagnostic benchmarks at [https://github.com/SAP/toolsense](https://github.com/SAP/toolsense).

## 1 Introduction

Large language models are increasingly deployed as autonomous agents over large software ecosystems, where selecting the right tool from catalogs of thousands of APIs is a core bottleneck (Schick et al., [2023](https://arxiv.org/html/2606.12451#bib.bib2); Yao et al., [2023](https://arxiv.org/html/2606.12451#bib.bib3); Qin et al., [2024a](https://arxiv.org/html/2606.12451#bib.bib4)). The dominant approach encodes tool descriptions as dense vectors and retrieves the top-$k$ candidates via approximate nearest-neighbor search (Karpukhin et al., [2020](https://arxiv.org/html/2606.12451#bib.bib5)). This paradigm has known limitations: small retrieval encoders may under-capture the semantics of complex, overlapping APIs at scale (Wang et al., [2025](https://arxiv.org/html/2606.12451#bib.bib1)); retrieved descriptions must still be injected into the LLM context, compounding overhead across multi-step agent loops; and the retriever and generator are trained with separate objectives, preventing joint optimization toward task success (Tay et al., [2022](https://arxiv.org/html/2606.12451#bib.bib6)).

**Parametric Tool Retrieval.** Wang et al. ([2025](https://arxiv.org/html/2606.12451#bib.bib1)) propose ToolGen, an alternative strategy that encodes the entire tool catalog directly into LLM parameters. Each tool is assigned a unique *virtual token* appended to the model vocabulary as a flat atomic identifier representing the full tool identity (e.g., `<<TIKTOK&&GET_TRENDING_VIDEOS>>`). In Stage 1 (memorization), the model is fine-tuned to associate each tool's metadata with its virtual token. In Stage 2 (retrieval), the model is further fine-tuned on (query → virtual token) pairs, learning to generate the correct token given a natural language query. At inference, a *DisjunctiveTrie* (Cao et al., [2021](https://arxiv.org/html/2606.12451#bib.bib8)) constrains beam search to valid token sequences, guaranteeing that every output maps to a real tool. Applied to ToolBench (Qin et al., [2024b](https://arxiv.org/html/2606.12451#bib.bib7)) (~47k tools), this paradigm achieves recall of roughly 0.90 or above on standard ToolBench benchmarks.
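
To make the role of trie-constrained decoding concrete, the following is a minimal sketch of a prefix trie that exposes which tokens may follow a partial output. It is an illustration under stated assumptions, not ToolGen's DisjunctiveTrie: the class name `TokenTrie`, the helper `constrain`, and every tool name other than the TikTok example are hypothetical, and a real decoder would operate on token ids inside beam search rather than on strings.

```python
from typing import Dict, Iterable, Sequence, Set


class TokenTrie:
    """Prefix trie over tool-identifier token sequences (illustrative only)."""

    def __init__(self, sequences: Iterable[Sequence[str]]) -> None:
        self.root: Dict[str, dict] = {}
        for seq in sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})

    def allowed_next(self, prefix: Sequence[str]) -> Set[str]:
        """Tokens that keep `prefix` on a path to some valid tool identifier."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return set()  # the prefix has already left the trie
            node = node[tok]
        return set(node.keys())


def constrain(scores: Dict[str, float], allowed: Set[str]) -> Dict[str, float]:
    """Drop every candidate token that the trie does not allow."""
    return {tok: s for tok, s in scores.items() if tok in allowed}


# Hypothetical catalog of two-token identifiers (API token, then endpoint token).
trie = TokenTrie([
    ["<<TIKTOK>>", "<<GET_TRENDING_VIDEOS>>"],
    ["<<TIKTOK>>", "<<GET_USER_INFO>>"],
    ["<<WEATHER>>", "<<GET_FORECAST>>"],
])
first_step = trie.allowed_next([])
second_step = trie.allowed_next(["<<TIKTOK>>"])
# The model prefers an endpoint of the wrong API; the mask removes that option.
masked = constrain({"<<GET_FORECAST>>": 0.9, "<<GET_USER_INFO>>": 0.1}, second_step)
```

In this toy case the highest-scoring continuation is invalid and is silently removed, which is the kind of correction that constrained evaluation cannot distinguish from genuine recall.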

**The Diagnostic Gap.** This naturally raises a question: *does parametric training produce a model that understands its tools, or does it simply learn to match query patterns to token identifiers?* A model that has learned only surface-level mappings may appear to work on distribution-matched benchmarks while failing the moment a user phrases their query differently. Yet existing evaluation is not designed to reveal this, for two reasons. First, the queries in ToolBench's standard evaluation splits (G1/G2/G3) are verbose and fully specified, unlike how a real user would phrase them (examples in Appendix [A](https://arxiv.org/html/2606.12451#A1)). Second, constrained decoding means the model need only rank paths through a fixed trie rather than freely recall a tool token, masking whether genuine memorization occurred. This matters practically: downstream agentic fine-tuning typically requires free-form token generation without trie support, making undetected memorization gaps a silent failure risk. Together, verbose queries and constrained decoding make it difficult to distinguish pattern matching from genuine tool knowledge (Petroni et al., [2019](https://arxiv.org/html/2606.12451#bib.bib9)), motivating a more diagnostic evaluation protocol.

We introduce ToolSense, an open-source LLM-powered diagnostic framework that takes any tool catalog as input and automatically generates three diagnostic benchmarks: (1) a *Realistic Retrieval Benchmark* (RRB) with queries in three ambiguity tiers reflecting how human users actually write queries; (2) an *MCQ probing benchmark* testing discriminative factual knowledge about tool capabilities; and (3) a *QA probing benchmark* testing inferential factual knowledge about tool properties. Alongside the benchmarks, we establish a free-form evaluation protocol that reports an *Internalization Score* (IS@k = free@k / constrained@k) as a trie-dependency diagnostic.

Applying ToolSense to ToolBench reveals a striking gap: configurations that appear highly capable on standard benchmarks collapse dramatically on realistic queries, falling below non-parametric baselines. Factual probing further reveals that Stage 2 training (responsible for the retrieval gains) nearly universally destroys the tool knowledge acquired in Stage 1.

To summarize, our contributions are: (i) ToolSense, an open-source diagnostic framework that auto-generates RRB, MCQ, and QA benchmarks from any tool catalog; (ii) the ToolBench diagnostic benchmarks (RRB, MCQ, and QA), released openly; and (iii) an empirical diagnosis of five parametric configurations on ToolBench, revealing the training choices that govern the knowledge-retrieval dissociation.

## 2 Background and Related Work

**Generative and parametric retrieval.** Tay et al. ([2022](https://arxiv.org/html/2606.12451#bib.bib6)) introduced the Differentiable Search Index (DSI), encoding document IDs into transformer parameters and retrieving via constrained beam search. Cao et al. ([2021](https://arxiv.org/html/2606.12451#bib.bib8)) applied autoregressive entity retrieval to knowledge-grounded generation. Wang et al. ([2022](https://arxiv.org/html/2606.12451#bib.bib10)) extended this to dense document corpora. Wang et al. ([2025](https://arxiv.org/html/2606.12451#bib.bib1)) adapt the paradigm to tool catalogs with virtual tokens and two-stage training. All of these systems evaluate exclusively with constrained decoding; our free-form IS protocol introduces a new diagnostic practice for this entire class of systems.

**Tool learning in LLMs.** Qin et al. ([2024b](https://arxiv.org/html/2606.12451#bib.bib7)) build ToolBench, a benchmark covering ~16k real-world APIs (~47k tools). Patil et al. ([2024](https://arxiv.org/html/2606.12451#bib.bib15)) fine-tune LLMs to generate tool calls with retrieval augmentation. Schick et al. ([2023](https://arxiv.org/html/2606.12451#bib.bib2)) train LLMs to self-insert tool calls in context. We focus specifically on the *retrieval* stage of parametric tool systems and ask whether that retrieval mechanism encodes tool semantics, a question orthogonal to the downstream task performance studied by these works.

**Probing and interpretability.** Petroni et al. ([2019](https://arxiv.org/html/2606.12451#bib.bib9)) introduce LAMA, probing pre-trained LMs for factual knowledge via cloze-style queries. Tenney et al. ([2019](https://arxiv.org/html/2606.12451#bib.bib11)) probe BERT's representations for linguistic structure. Our MCQ and QA probes extend this tradition to a novel setting: tokens *learned during fine-tuning* (virtual tokens), not pre-trained representations. We probe whether fine-tuning creates genuinely semantic representations or only task-relevant pointers.

**Knowledge retention in fine-tuning.** McCloskey and Cohen ([1989](https://arxiv.org/html/2606.12451#bib.bib12)) identify catastrophic interference as a fundamental failure mode when neural networks are trained sequentially on new tasks. Hu et al. ([2022](https://arxiv.org/html/2606.12451#bib.bib13)) propose LoRA to mitigate this by freezing backbone weights and learning low-rank updates, preserving pre-trained representations. Biderman et al. ([2024](https://arxiv.org/html/2606.12451#bib.bib14)) empirically show that LoRA retains more prior task performance than full fine-tuning. Our approach extends this line of work to the parametric retrieval setting.

## 3 The ToolSense Framework

The framework takes as input a tool catalog

$$\mathcal{C}=\bigl\{\,\tau_{i}=(\mathrm{name}_{i},\;\mathrm{desc}_{i})\,\bigr\}_{i=1}^{|\mathcal{C}|} \tag{1}$$

and automatically generates three diagnostic benchmark datasets. Figure [1](https://arxiv.org/html/2606.12451#S3.F1) illustrates the pipeline. In the following subsections, we explain the benchmark generation methodology in detail.

![Refer to caption](https://arxiv.org/html/2606.12451v1/x1.png)
Figure 1: The ToolSense diagnostic framework generates $\mathcal{D}_{\mathrm{RRB}}$, $\mathcal{D}_{\mathrm{MCQ}}$, and $\mathcal{D}_{\mathrm{QA}}$ from a tool catalog $\mathcal{C}$.

### 3.1 Realistic Retrieval Benchmark (RRB)

RRB tests whether a retrieval system generalises beyond verbose-style queries by presenting it with short, intent-focused requests at three ambiguity levels. Generation proceeds in four stages: seed sampling, hard-negative pool construction, parallel generation tiers, and dual validation.

**Stratified seed sampling.** Rather than generating queries for every tool in $\mathcal{C}$, we sample $N_{\mathrm{RRB}}$ anchor tools stratified by domain to ensure broad catalog coverage, yielding $\mathcal{C}_{\mathrm{RRB}}\subset\mathcal{C}$ with $|\mathcal{C}_{\mathrm{RRB}}|=N_{\mathrm{RRB}}$.
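
One way to picture domain-stratified seed sampling is the sketch below. It assumes each tool carries a domain label and allocates anchors round-robin across domains; the text does not state the allocation rule (for example, proportional versus equal), so both the `(name, domain)` tuple layout and the round-robin choice are assumptions made for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Tool = Tuple[str, str]  # (name, domain); the domain field is an assumption


def stratified_sample(tools: List[Tool], n: int, seed: int = 0) -> List[Tool]:
    """Draw up to n anchors by cycling over domains so each one is represented."""
    rng = random.Random(seed)
    by_domain: Dict[str, List[Tool]] = defaultdict(list)
    for tool in tools:
        by_domain[tool[1]].append(tool)
    for bucket in by_domain.values():
        rng.shuffle(bucket)  # random order within each domain
    picked: List[Tool] = []
    domains = sorted(by_domain)
    # Take one tool per domain per pass until n anchors are collected.
    while len(picked) < n and any(by_domain[d] for d in domains):
        for d in domains:
            if by_domain[d] and len(picked) < n:
                picked.append(by_domain[d].pop())
    return picked


# Toy catalog with three made-up domains and five tools each.
catalog = [(f"tool_{d}_{i}", d) for d in ("finance", "music", "sports") for i in range(5)]
anchors = stratified_sample(catalog, n=6)
per_domain = {d: sum(1 for _, dom in anchors if dom == d) for d in ("finance", "music", "sports")}
```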

**Hard-negative pool.** For each anchor $\tau\in\mathcal{C}_{\mathrm{RRB}}$, let $\phi:\mathcal{C}\to\mathbb{R}^{d}$ be a sentence encoder. We retrieve the $K$ nearest neighbours by cosine similarity, $\mathcal{H}_{K}(\tau)=\operatorname*{arg\,top\text{-}K}_{\tau'\in\mathcal{C}\setminus\{\tau\}}\langle\phi(\tau),\phi(\tau')\rangle$, and form the candidate pool $\mathcal{P}(\tau)=\{\tau\}\cup\mathcal{H}_{K}(\tau)$ with $|\mathcal{P}(\tau)|=K+1$. The pool defines both the hard-negative context shown to the generator and the labels from which ground-truth answers $A$ are drawn.
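
The pool construction can be sketched in a few lines. The toy 2-d vectors below stand in for the output of a real sentence encoder $\phi$, and the tool names and the function `candidate_pool` are hypothetical; a catalog of ~47k tools would normally use a vectorised or approximate nearest-neighbour search instead of this brute-force sort.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_pool(anchor: str, emb: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return P(tau): the anchor followed by its K most similar other tools."""
    others = [name for name in emb if name != anchor]
    others.sort(key=lambda name: cosine(emb[anchor], emb[name]), reverse=True)
    return [anchor] + others[:k]


# Made-up embeddings; a real pipeline would encode each tool description.
emb = {
    "get_trending_videos": (1.0, 0.1),
    "get_video_comments": (0.9, 0.2),
    "search_videos": (0.8, 0.4),
    "get_forecast": (0.0, 1.0),
}
pool = candidate_pool("get_trending_videos", emb, k=2)
```

With these vectors the two video-related tools are selected as hard negatives and the unrelated forecast tool is left out, so the pool has $K+1=3$ members.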

**Parallel generation tiers.** For each anchor, three sub-pipelines run in parallel, $t\in\{\mathrm{easy},\,\mathrm{medium},\,\mathrm{hard}\}$, each generating a batch of $(q,A)$ pairs from the same candidate pool $\mathcal{P}(\tau)$:

- **Easy:** $|A|=1$; a concise query pointing to exactly one tool.
- **Medium:** $|A|\in\{2,3\}$; a cross-functional request genuinely satisfiable by 2–3 tools.
- **Hard:** $|A|\geq 4$; a high-level, ambiguous goal spanning 4 or more tools.

The generator LLM $\theta_{\mathrm{RRB}}$ receives $(\tau,\,\mathcal{P}(\tau),\,t,\,\mathcal{E})$, where $\mathcal{E}=\{\hat{q}_{i}^{(e)}\}_{i=1}^{n_{e}}$ are few-shot query examples provided to guide the style and tone of intent-focused query generation.

**Dual validation with feedback.** Each generated batch passes two sequential filters. The programmatic filter $\mathcal{V}_{\mathrm{RRB}}(q,A,\tau,\mathcal{P}(\tau))$ rejects entries failing either of two checks: answers must be grounded in the pool ($A\subseteq\mathcal{P}(\tau)$), and no answer tool name may appear verbatim in the query ($\forall\,\tau'\in A:\mathrm{name}(\tau')\notin q$). Entries passing $\mathcal{V}_{\mathrm{RRB}}$ are then scored by an LLM judge $\mathcal{J}_{\mathrm{RRB}}$ for naturalness, tier-compliance, and label correctness. Rejected entries are dropped, yielding the validated dataset $\mathcal{D}_{\mathrm{RRB}}=\{(q_{j},\,A_{j},\,t_{j})\}_{j=1}^{N_{\mathrm{RRB}}}$.
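
The tier definitions and the two programmatic checks can be written down directly. The sketch below covers only the rule-based filter, not the LLM judge; the function names are hypothetical, and matching tool names case-insensitively is an assumption (the text says only "verbatim").

```python
from typing import List, Set


def tier_of(answers: Set[str]) -> str:
    """Map the answer-set size |A| to the ambiguity tier defined above."""
    n = len(answers)
    if n == 1:
        return "easy"
    if n in (2, 3):
        return "medium"
    if n >= 4:
        return "hard"
    raise ValueError("an entry needs at least one answer tool")


def passes_filter(query: str, answers: Set[str], pool: List[str]) -> bool:
    """Check that A is a subset of the pool and no answer name leaks into the query."""
    grounded = answers <= set(pool)
    # Assumption: compare lowercased strings so casing cannot hide a leaked name.
    leaks_name = any(name.lower() in query.lower() for name in answers)
    return grounded and not leaks_name


# Made-up pool and queries for illustration.
pool = ["get_trending_videos", "get_video_comments", "search_videos"]
ok = passes_filter("what's blowing up on tiktok right now", {"get_trending_videos"}, pool)
leaky = passes_filter("call get_trending_videos for me", {"get_trending_videos"}, pool)
ungrounded = passes_filter("show me the weather", {"get_forecast"}, pool)
```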

### 3.2 Probing Benchmarks (MCQ and QA)

Both probing benchmarks follow the same pipeline structure: sample a seed set of tools, generate question–answer pairs via an LLM generator, validate with an LLM judge $\mathcal{J}$, and filter out entries for which no unambiguous question can be formed. In all accepted entries, the question refers to the tool only as "this tool", never by name, so the model must answer given the virtual token $v_{\tau}$.

**MCQ** ($N_{\mathrm{MCQ}}$ tools). For each $\tau$, $\theta_{\mathrm{MCQ}}$ produces a factual question $z$ about the tool's functionality, one correct answer $o_{y^{\star}}$, and three plausible-but-wrong distractors, yielding $\mathcal{D}_{\mathrm{MCQ}}=\{(v_{\tau_{i}},\,z_{i},\,\{o_{k}^{i}\},\,y_{i}^{\star})\}$. At evaluation the model receives $(v_{\tau},\,z,\,\{o_{k}\}_{k=1}^{4})$; constrained decoding restricts output to option letters (A/B/C/D) and the prediction $\hat{y}$ is compared with $y^{\star}\in\{0,1,2,3\}$.

**QA** ($N_{\mathrm{QA}}$ tools). For each $\tau$, $\theta_{\mathrm{QA}}$ receives the description alongside a pre-specified target $b\in\{\mathrm{Yes},\mathrm{No}\}$, alternated across tools to ensure label balance, and produces a binary question $w$ about a specific, verifiable tool property (e.g., supported modality or domain), yielding $\mathcal{D}_{\mathrm{QA}}=\{(v_{\tau_{i}},\,w_{i},\,b_{i})\}$. At evaluation the model receives $(v_{\tau},\,w)$; constrained decoding restricts output to Yes/No and $\hat{b}$ is compared with $b$.
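
As a small illustration of the two mechanical pieces described above, the sketch below alternates the Yes/No target across tools and scores MCQ predictions by exact match between the predicted letter and the gold index. Starting the alternation with "Yes", the function names, and the example predictions are all assumptions for illustration.

```python
from typing import List

LETTERS = "ABCD"


def qa_targets(n_tools: int) -> List[str]:
    """Alternate the pre-specified Yes/No target across tools for label balance."""
    return ["Yes" if i % 2 == 0 else "No" for i in range(n_tools)]


def mcq_accuracy(predicted_letters: List[str], gold_indices: List[int]) -> float:
    """Exact-match accuracy of predicted option letters against gold indices 0..3."""
    hits = sum(LETTERS.index(p) == y for p, y in zip(predicted_letters, gold_indices))
    return hits / len(gold_indices)


targets = qa_targets(5)
# Made-up predictions: three of the four letters match their gold index.
acc = mcq_accuracy(["A", "C", "D", "B"], [0, 2, 1, 1])
```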

### 3.3 Free-form Evaluation and Internalization Score

Let $\hat{A}_{c}(q,k)$ and $\hat{A}_{f}(q,k)$ denote the top-$k$ outputs under *constrained* (trie-guided) and *free-form* (unconstrained) decoding, respectively, using beam search with beam width $B$. Constrained and free-form Recall@$k$ over a query set $\mathcal{Q}$ are defined as

$$R_{c}@k=\frac{1}{|\mathcal{Q}|}\sum_{q\in\mathcal{Q}}\mathbb{1}\bigl[A(q)\cap\hat{A}_{c}(q,k)\neq\emptyset\bigr],\qquad R_{f}@k=\frac{1}{|\mathcal{Q}|}\sum_{q\in\mathcal{Q}}\mathbb{1}\bigl[A(q)\cap\hat{A}_{f}(q,k)\neq\emptyset\bigr].$$

The Internalization Score (IS) is then:

$$\text{IS@}k\;=\;\begin{cases}R_{f}@k\;/\;R_{c}@k & \text{if } R_{c}@k>0\\ 0 & \text{otherwise}\end{cases} \tag{2}$$

IS $\approx 1$ indicates that the model generates correct tokens equally well without the trie (retrieval is internalized); IS $\approx 0$ indicates complete trie-dependence. Because IS is a ratio of two standard measurements, we *always* report $R_{f}@k$ alongside $R_{c}@k$, making trie-dependence transparent (see Appendix [F](https://arxiv.org/html/2606.12451#A6) for the design rationale for IS).
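
The definitions above translate into a short computation. The sketch below follows the hit-based Recall@$k$ and the zero-guarded ratio exactly as stated; the query outputs are made up for illustration and are not results from the paper, and the function names are hypothetical.

```python
from typing import List, Set


def recall_at_k(gold: List[Set[str]], predicted: List[List[str]], k: int) -> float:
    """Fraction of queries whose top-k outputs contain at least one gold tool."""
    hits = sum(bool(a & set(p[:k])) for a, p in zip(gold, predicted))
    return hits / len(gold)


def internalization_score(r_free: float, r_constrained: float) -> float:
    """IS@k = R_f@k / R_c@k, defined as 0 when R_c@k is 0."""
    return r_free / r_constrained if r_constrained > 0 else 0.0


# Made-up outputs for four queries; "??" marks an invalid free-form generation.
gold = [{"t1"}, {"t2", "t3"}, {"t4"}, {"t5"}]
constrained = [["t1", "t9"], ["t3", "t8"], ["t4", "t7"], ["t6", "t0"]]
free_form = [["t1", "??"], ["t8", "??"], ["t4", "t7"], ["??", "t0"]]

r_c = recall_at_k(gold, constrained, k=2)  # 3 of 4 queries hit
r_f = recall_at_k(gold, free_form, k=2)    # 2 of 4 queries hit
is_at_2 = internalization_score(r_f, r_c)
```

Here the model loses one hit once the trie is removed, so IS@2 is 0.5 / 0.75, or about 0.67.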

## 4 Experimental Setup

We instantiate ToolSense on the ToolBench (Qin et al., [2024b](https://arxiv.org/html/2606.12451#bib.bib7)) tool catalog to empirically diagnose parametric tool retrieval across three dimensions: OOD generalization beyond training-distribution queries, dependence on constrained decoding, and correspondence between retrieval performance and semantic tool knowledge. All trained models follow the two-stage ToolGen paradigm (Wang et al., [2025](https://arxiv.org/html/2606.12451#bib.bib1)): Stage 1 (memorization) fine-tunes on (tool metadata → virtual token) pairs; Stage 2 (retrieval) fine-tunes on (query → virtual token) pairs.

#### Training and evaluation data.

Stage 1 uses 46,980 RapidAPI tools provided by ToolBench, each with an API name and description as well as an endpoint name and description. Stage 2 uses ~195k verbose (query, tool) pairs from the ToolBench training split (Qin et al., [2024b](https://arxiv.org/html/2606.12451#bib.bib7)), generated by GPT-4 (OpenAI et al., [2024](https://arxiv.org/html/2606.12451#bib.bib29)). For standard evaluation, we retain the ToolGen evaluation splits: G1 (593 queries), G2 (399), and G3 (100) (Wang et al., [2025](https://arxiv.org/html/2606.12451#bib.bib1)); these match the verbose training distribution and serve as the in-distribution baseline (see Appendix [A](https://arxiv.org/html/2606.12451#A1) for a query style comparison).

#### Diagnostic benchmarks.

Applying the ToolSense generation pipelines (§[3.1](https://arxiv.org/html/2606.12451#S3.SS1)–§[3.2](https://arxiv.org/html/2606.12451#S3.SS2)) to this catalog yields: an RRB of 500 queries across three difficulty tiers (167 easy / 167 medium / 166 hard), an MCQ probe of 496 samples testing discriminative factual knowledge, and a QA probe of 500 items testing inferential knowledge. Together, these evaluate retrieval generalization (RRB) and semantic understanding (MCQ/QA) as orthogonal dimensions of tool knowledge.

### 4.1 Model Configurations

We primarily use the open-source instruction-tuned Gemma3-4B (Team et al., [2025](https://arxiv.org/html/2606.12451#bib.bib16)), with Qwen3.5-4B (Yang et al., [2025](https://arxiv.org/html/2606.12451#bib.bib17)) for cross-architecture validation and Gemma3-12B for scale ablations. For each configuration we train both a full fine-tuned (FFT) and a LoRA (Hu et al., [2022](https://arxiv.org/html/2606.12451#bib.bib13)) variant; training details are in Appendix [D](https://arxiv.org/html/2606.12451#A4). In all LoRA variants, adapters are attached to linear layers only, while the full embedding layer remains trainable.

We define five primary training configurations in Table [1](https://arxiv.org/html/2606.12451#S4.T1), each isolating one variable along three axes: *token format* (flat single-token vs. hierarchical multi-token), *memorization format* (single-format vs. multi-format with reverse mappings and hard negatives), and *training method* (FFT vs. LoRA). Flat tokens follow the original ToolGen format (§[1](https://arxiv.org/html/2606.12451#S1)); hierarchical tokens decompose the identifier into a two-step sequence `<<API>><<Endpoint>>` (e.g., `<<TIKTOK>><<GET_TRENDING_VIDEOS>>`) that the model should generate autoregressively.
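
The difference between the two token formats is easy to see side by side. The sketch below simply builds the identifier strings shown in the text; the helper names are hypothetical, and any normalisation of API or endpoint names (casing, separators) that the real pipeline applies is not modelled here.

```python
from typing import List


def flat_token(api: str, endpoint: str) -> str:
    """One atomic virtual token that stands for the whole tool identity."""
    return f"<<{api}&&{endpoint}>>"


def hierarchical_tokens(api: str, endpoint: str) -> List[str]:
    """Two virtual tokens generated in sequence: the API, then the endpoint."""
    return [f"<<{api}>>", f"<<{endpoint}>>"]


flat = flat_token("TIKTOK", "GET_TRENDING_VIDEOS")
hier = hierarchical_tokens("TIKTOK", "GET_TRENDING_VIDEOS")
```

A flat identifier is produced in a single decoding step, whereas the hierarchical form needs two correct steps in a row, which is where a trie can steer the second step.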

**TG** replicates the original ToolGen setup (flat tokens, desc → tok only, no system prompt) and serves as the baseline. **TG-SP** adds a task-specific system prompt applied consistently across both training stages and inference. **TG-3FM** extends TG-SP with two additional Stage 1 formats (tok → desc and Multi-Choice Tool Selection (MCTS), Appendix [B](https://arxiv.org/html/2606.12451#A2)), testing whether richer memorization objectives produce more robust representations. **TG-H** switches from flat to hierarchical tokens while keeping all else matched to TG-SP, isolating the effect of token structure. **TG-5FM** combines hierarchical tokens with all five memorization formats (Table [1](https://arxiv.org/html/2606.12451#S4.T1)), where the additional formats train each hierarchical generation step independently. Where applicable, we additionally train a **LoRA** variant for the above configurations. Full training details for each configuration are in Appendix [C](https://arxiv.org/html/2606.12451#A3).

Stage 2 training uses identical (query → tok) data across all configurations; the only variation is whether a system prompt is prepended, consistent with how each configuration was trained in Stage 1.

| Config | Tokens | Tool Memo. Formats | SP | Train |
|---|---|---|---|---|
| TG | Flat | 1 (desc → tok) | ✗ | FFT, LoRA |
| TG-SP | Flat | 1 (desc → tok) | ✓ | FFT, LoRA |
| TG-3FM | Flat | 3 (desc → tok, tok → desc, MCTS) | ✓ | FFT, LoRA |
| TG-H | Hier. | 1 (desc → tok) | ✓ | FFT |
| TG-5FM | Hier. | 5 (all∗) | ✓ | FFT, LoRA |

∗ desc → tok, desc → api_tok, desc + api_tok → endpoint_tok, tok → desc, MCTS

Table 1: Diagnostic model configurations. tok denotes the tool token encoding the full tool identity; api_tok and endpoint_tok are the two hierarchical tokens corresponding to API and endpoint, respectively. SP = system prompt.

#### Retrieval baselines.

We compare against BM25 (Robertson and Zaragoza, [2009](https://arxiv.org/html/2606.12451#bib.bib18)) (sparse lexical) and text-embedding-3-large (OpenAI, [2024](https://arxiv.org/html/2606.12451#bib.bib19)) (te3l; OpenAI dense embeddings) on all evaluation splits.

#### Evaluation metrics.

$R_{c}@k$, $R_{f}@k$, and IS@$k$ follow the formal definitions in §[3.3](https://arxiv.org/html/2606.12451#S3.SS3) and are reported at beam width $B=50$ unless otherwise specified. MCQ and QA probing benchmarks are evaluated by exact-match accuracy using constrained decoding: MCQ restricts generation to option letters (A/B/C/D) and QA restricts it to Yes/No; random baselines are 25% and 50%, respectively.

#### Benchmark quality validation.

Three human annotators independently validated a stratified sample of 100 items from each benchmark: for RRB, annotators selected, from 14 candidate tools (ground truths + hard negatives), those that matched each query's intent; for MCQ/QA, annotators answered each item given the full tool description. Figure [2(a)](https://arxiv.org/html/2606.12451#S4.F2.sf1) shows Fleiss' $\kappa$ across all three benchmarks: MCQ achieves perfect agreement ($\kappa=1.000$), QA near-perfect agreement ($\kappa=0.973$), and RRB substantial agreement ($\kappa=0.805$). Figure [2(b)](https://arxiv.org/html/2606.12451#S4.F2.sf2) breaks down RRB agreement by difficulty tier: $\kappa$ decreases gracefully from easy (0.840) to hard (0.751) while remaining above the agreement threshold (0.70), confirming that the tier structure reflects genuine task difficulty rather than benchmark artifacts.
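
For readers unfamiliar with the agreement statistic, the following is a minimal sketch of the standard Fleiss' $\kappa$ formula on a count matrix. The toy ratings are invented, and how the study encodes RRB's multi-tool selections into categories is not specified here, so this shows only the generic computation rather than the authors' exact procedure. It assumes every item is rated by the same number of raters and that chance agreement is below 1.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """counts[i][j] = number of raters who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Observed agreement P_i for each item, then its mean P_bar.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement P_e from the marginal category proportions.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    p_e = sum(p * p for p in p_cat)
    return (p_bar - p_e) / (1 - p_e)


# Toy data: 4 items, 3 raters, 2 categories.
perfect = fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]])
partial = fleiss_kappa([[3, 0], [2, 1], [0, 3], [1, 2]])
```

In the second toy matrix two of the four items have a 2-to-1 split, which pulls $\kappa$ down from 1 to 1/3.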

![Refer to caption](https://arxiv.org/html/2606.12451v1/x2.png)
(a) Inter-annotator agreement (Fleiss' $\kappa$) per benchmark.

![Refer to caption](https://arxiv.org/html/2606.12451v1/x3.png)
(b) RRB breakdown by difficulty tier.

Figure 2: Human annotation study ($N=100$ items $\times$ 3 annotators). RRB agreement decreases gracefully with difficulty while majority-vote accuracy remains high (85–94%).

## 5 Results

We present findings across retrieval generalization, constrained decoding dependence, and semantic tool knowledge.

### 5.1 Generalization Collapse on Realistic Queries

| Config | G1 S1 (%) | G1 S2 (%) | G2 S1 (%) | G2 S2 (%) | G3 S1 (%) | G3 S2 (%) | RRB S1 (%) | RRB S2 (%) | Δpp |
|---|---|---|---|---|---|---|---|---|---|
| TG | 47.0 [43.6, 50.4] | 95.7 [94.2, 97.0] | 37.0 [33.8, 40.4] | 84.5 [81.5, 87.3] | 35.0 [27.9, 42.7] | 84.6 [80.4, 88.5] | 37.8 [34.3, 41.2] | 43.8 [40.3, 47.7] | −52 |
| TG (LoRA) | 38.8 [35.2, 42.4] | 90.4 [88.7, 92.2] | 28.4 [25.3, 31.6] | 82.8 [79.8, 85.5] | 26.8 [20.0, 34.0] | 84.9 [79.9, 89.6] | 31.8 [28.2, 35.2] | 37.0 [33.4, 40.7] | −53 |
| TG-SP | 38.6 [35.0, 42.4] | 94.5 [92.9, 96.0] | 28.5 [25.4, 31.5] | 84.8 [81.5, 87.6] | 27.0 [20.7, 33.7] | 86.6 [81.9, 90.8] | 28.5 [25.2, 32.0] | 43.2 [39.6, 46.7] | −51 |
| TG-3FM | 43.1 [39.3, 46.4] | 95.3 [93.9, 96.6] | 34.6 [31.2, 38.3] | 85.3 [82.4, 88.0] | 31.5 [24.5, 38.9] | 86.2 [82.3, 90.3] | 26.7 [23.4, 30.1] | 44.4 [40.7, 48.3] | −51 |
| TG-3FM (LoRA) | 42.5 [38.8, 46.1] | 93.0 [91.4, 94.6] | 37.4 [33.9, 40.9] | 83.5 [80.6, 86.4] | 21.9 [15.7, 28.4] | 88.4 [84.3, 92.0] | 29.8 [26.5, 33.3] | 43.2 [39.7, 46.6] | −50 |
| TG-H | 23.8 [20.5, 27.1] | 90.8 [88.9, 92.6] | 12.2 [9.9, 14.4] | 78.9 [75.5, 82.0] | 7.9 [4.5, 11.8] | 87.0 [82.9, 90.8] | 18.8 [15.9, 21.9] | 27.1 [24.2, 30.5] | −64 |
| TG-5FM | 37.8 [34.2, 41.6] | 92.3 [90.6, 93.9] | 23.1 [20.3, 25.9] | 79.0 [76.0, 82.1] | 15.8 [10.8, 21.1] | 89.2 [85.3, 92.6] | 28.1 [24.5, 31.5] | 30.9 [27.9, 34.2] | −61 |
| TG-5FM (LoRA) | 48.5 [44.8, 52.6] | 92.5 [90.7, 94.0] | 29.4 [26.0, 32.5] | 80.3 [77.4, 83.4] | 27.2 [20.4, 34.0] | 86.7 [81.4, 91.2] | 41.0 [37.0, 44.8] | 31.9 [28.4, 35.5] | −61 |
| BM25 | – | 27.8 [24.2, 31.4] | – | 23.6 [19.4, 27.8] | – | 23.0 [14.8, 31.2] | – | 32.4 [28.3, 36.5] | – |
| te3l | – | 47.0 [43.0, 51.0] | – | 40.6 [35.8, 45.4] | – | 38.0 [28.5, 47.5] | – | 55.6 [51.2, 60.0] | – |

Table 2: $R_{c}@50$ (in %) for Gemma3-4B across all evaluation splits. S1 = after Stage 1 memorization (no retrieval SFT); S2 = after Stage 2 retrieval SFT (full pipeline). Δpp = G1 → RRB drop at S2. BM25/te3l are retrieval baselines (Hard S2 only). Bold results are best; underlined are second best. Values in brackets are 95% CIs.

Table [2](https://arxiv.org/html/2606.12451#S5.T2) reports $R_{c}@50$ through both training stages and across evaluation splits. Before any retrieval SFT, Stage 1 memorization alone already reaches 38–47% on G1 and 27–41% on RRB for flat configurations, denoting how well the model has absorbed tool semantics during memorization. Stage 2 retrieval training then lifts G1 sharply into a 90–96% band, consistent with Wang et al. ([2025](https://arxiv.org/html/2606.12451#bib.bib1)), yet performance on RRB advances only modestly: most configurations gain 2–18pp over their Stage 1 baselines, and TG-5FM (LoRA) actually *regresses* (41.0% → 31.9%). The absolute G1 → RRB drop at Stage 2 is 50–52pp for flat and 61–64pp for hierarchical configurations, which suggests that parametric tool retrieval trained on synthetic verbose queries does not generalize to queries that differ in phrasing from the training distribution.
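
The Δpp column can be recomputed from the Stage 2 point estimates as a quick sanity check. The sketch below copies the G1 and RRB S2 values from Table 2 and rounds the difference to the nearest percentage point; the rounding convention is an assumption, since the table does not state how Δpp was rounded.

```python
# (G1 S2, RRB S2) R_c@50 point estimates in %, copied from Table 2.
s2_scores = {
    "TG": (95.7, 43.8),
    "TG (LoRA)": (90.4, 37.0),
    "TG-SP": (94.5, 43.2),
    "TG-3FM": (95.3, 44.4),
    "TG-3FM (LoRA)": (93.0, 43.2),
    "TG-H": (90.8, 27.1),
    "TG-5FM": (92.3, 30.9),
    "TG-5FM (LoRA)": (92.5, 31.9),
}

# Delta pp = RRB minus G1 at Stage 2, rounded to whole percentage points.
delta_pp = {name: round(rrb - g1) for name, (g1, rrb) in s2_scores.items()}

for name, drop in delta_pp.items():
    print(f"{name:14s} {drop:+d} pp")
```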

The ranking reverses on RRB: the best parametric model (TG-3FM, 44.4%) falls below the dense retrieval baseline (te3l, 55.6%), and BM25 (32.4%) remains competitive with several parametric configurations. Both retrieval baselines actually *improve* from G1 to RRB (BM25: 27.8% → 32.4%; te3l: 47.0% → 55.6%), suggesting that these retrieval methods work better on short natural-language queries.

The intermediate G2 and G3 splits trace the OOD gradient: both fall between G1 and RRB at Stage 2, confirming that the collapse is gradual rather than a sharp boundary. We examine the structural and training-dynamics reasons behind these findings in §[5.2](https://arxiv.org/html/2606.12451#S5.SS2) and §[5.4](https://arxiv.org/html/2606.12451#S5.SS4).

### 5.2 Trie-Dependency of Hierarchical Tokens

Table 3: Internalization Score (IS@50) after Stage 2 training. 95% CI in brackets.

![Refer to caption](https://arxiv.org/html/2606.12451v1/x4.png)
Figure 3: IS@50 across Stage 2 training steps for Gemma3-4B. Shaded bands are 95% bootstrap CIs. Full results across all splits and models are in Appendix [G](https://arxiv.org/html/2606.12451#A7).

The Internalization Score quantifies whether a model generates valid tool tokens independently or relies on trie guidance for the heavy lifting (IS $\approx$ 1.0: independent; IS $\approx$ 0: trie-dependent).

Table [3](https://arxiv.org/html/2606.12451#S5.T3) reports IS after Stage 2 ($B=50$). The separation between token formats is large: flat configurations achieve RRB IS in the range 0.75–0.85, while hierarchical configurations fall to 0.33–0.79, a gap of Cohen's $d=1.45$ (Cohen, [1988](https://arxiv.org/html/2606.12451#bib.bib20)), indicating the two groups occupy fundamentally different regions of trie-dependency. The gap is *larger* on RRB than on G1 for hierarchical tokens (TG-H: G1 = 0.42, RRB = 0.33), while flat tokens maintain or improve IS on RRB (TG: G1 = 0.75, RRB = 0.85), suggesting that the distributional shift compounds with structural trie-dependency. The flat-over-hierarchical IS gap holds across architectures: Qwen3.5-4B flat configurations reach IS $\geq$ 0.98 on RRB, while TG-H on Qwen3.5-4B drops to 0.28; full IS results across all architectures are in Appendix [G](https://arxiv.org/html/2606.12451#A7).

Figure [3](https://arxiv.org/html/2606.12451#S5.F3) traces IS evolution: TG-H stays below IS = 0.35 throughout Stage 2 training, while TG-3FM (LoRA) declines only modestly (1.0 → 0.89), confirming that LoRA preserves free-generation capacity from Stage 1.

### 5\.3 Knowledge\-Retrieval Dissociation

Table [4](https://arxiv.org/html/2606.12451#S5.T4) reports MCQ and QA probing accuracy for Gemma3\-4B after Stage 1 and Stage 2 training\. The headline finding is a striking dissociation: a model \(TG\) achieving ∼95% $R_c@50$ on G1 scores only 31\.4% on 4\-way MCQ \(random=25%\) and exactly 50\.0% on binary QA \(random=50%\), suggesting that it behaves more like a lookup table than a knowledge base\. At Stage 2, fully fine\-tuned \(FFT\) models score 20–31% on MCQ and 34–50% on QA; LoRA extends the upper end to 41\.7% MCQ and 56\.4% QA\.

The picture differs sharply by token format\. Flat configurations accumulate genuine knowledge during Stage 1 \(TG enters retrieval training at 55\.4% MCQ and TG\-3FM at 32\.9%\), which Stage 2 then systematically erodes \(TG: −24\.0 pp; TG\-3FM: −3\.1 pp\)\. Hierarchical tokens tell a different story: TG\-H and TG\-5FM score at or *below* random chance at Stage 1 itself \(23\.6% \[19\.8, 27\.2\] and 22\.8% \[18\.9, 26\.6\]; both intervals include the 25% baseline\), before any retrieval training has occurred, suggesting that these configurations never established semantic grounding in the first place\. LoRA substantially buffers Stage 2 destruction for configurations where knowledge does take hold: TG\-3FM \(LoRA\) retains 41\.7% MCQ \(CI \[37\.7, 46\.2\]\) vs\. FFT at 29\.8% \(CI \[25\.6, 33\.9\]\) and achieves the highest QA score at Stage 2 \(56\.4% \[52\.2, 60\.6\]\)\.
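The bracketed intervals are 95% CIs\. As a minimal sketch of how such an interval can be obtained for probe accuracy, the snippet below applies a percentile bootstrap to per\-item 0/1 outcomes\. The resampling settings, the function name, and the toy outcome counts are assumptions made for illustration; the exact procedure used for the reported numbers is not described in this section\.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_ci(correct: Sequence[int], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) using a percentile bootstrap."""
    if not correct:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)
    n = len(correct)
    stats: List[float] = []
    for _ in range(n_boot):
        resample = rng.choices(correct, k=n)  # sample items with replacement
        stats.append(sum(resample) / n)
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return sum(correct) / n, lo, hi


# Hypothetical probe results: 118 correct answers out of 500 4-way MCQ items.
outcomes = [1] * 118 + [0] * 382
acc, lo, hi = bootstrap_ci(outcomes)
# True only if the whole interval lies under the 25% chance level.
clearly_below_chance = hi < 0.25
```

Comparing the interval, rather than the point estimate, against the chance level indicates whether a score is distinguishable from random guessing\.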

Across all configurations and model families, Stage 1 MCQ accuracy correlates strongly with Stage 2 RRB $R_c@50$ \(Pearson $r=0.79$, $p<0.001$, $n=14$; Figure [4](https://arxiv.org/html/2606.12451#S5.F4)\): configurations that enter retrieval training with richer semantic representations generalize better to OOD queries, even after partial knowledge destruction \(cross\-architecture results in Appendix [H](https://arxiv.org/html/2606.12451#A8)\)\.
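The correlation statistic can be reproduced from per\-configuration pairs of Stage 1 MCQ accuracy and Stage 2 RRB recall\. The sketch below computes Pearson's $r$ and the associated $t$ statistic with $n-2$ degrees of freedom\. The data points are hypothetical placeholders, not the 14 variants in Table 10\.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (sx * sy)


def t_statistic(r: float, n: int) -> float:
    """t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom."""
    if n < 3 or abs(r) >= 1:
        raise ValueError("need n >= 3 and |r| < 1")
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical (Stage 1 MCQ %, Stage 2 RRB recall %) pairs.
mcq = [23.0, 33.0, 55.0, 62.0, 73.0]
rrb = [27.0, 30.0, 43.0, 40.0, 56.0]
r = pearson_r(mcq, rrb)

# Plugging in the reported r = 0.79 and n = 14 gives t of roughly 4.46.
t_reported = t_statistic(0.79, 14)
```

With 12 degrees of freedom, a $t$ value of this size exceeds the two\-sided 0\.001 critical value, which is consistent with the reported $p<0.001$\.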

Table 4: MCQ and QA probing accuracy \(Gemma3\-4B\)\. Random: MCQ=25%, QA=50%\. 95% CI in brackets\.

![Refer to caption](https://arxiv.org/html/2606.12451v1/x5.png)

Figure 4: Correlation between Stage 1 MCQ accuracy and Stage 2 RRB $R_c@50$ across all model families \($n=14$, $r=0.79$, $p<0.001$\)\. Error bars show 95% CIs\. All 14 model variants are listed in Table [10](https://arxiv.org/html/2606.12451#A8.T10) \(Appendix [H](https://arxiv.org/html/2606.12451#A8)\)\.

### 5\.4 Design Choice Ablations

We isolate which design choices mitigate or exacerbate the three findings above\.

Token format\. Flat tokens \(TG\-SP, Gemma3\-4B\) achieve RRB $R_c@50$=43\.2% \(95% CI \[39\.6, 46\.7\]\) and IS=0\.75, vs\. hierarchical tokens \(TG\-H, Gemma3\-4B\) at RRB $R_c@50$=27\.1% \(CI \[24\.2, 30\.5\]\) and IS=0\.33 \(Cohen’s $d=8.23$ on recall, $d=1.45$ on IS \(Cohen, [1988](https://arxiv.org/html/2606.12451#bib.bib20)\)\)\. This replicates across architectures: Qwen3\.5\-4B flat $R_c@50$=55\.8% vs\. hierarchical $R_c@50$=39\.2% \(\+16\.6 pp\)\. Flat tokens require a single decoding step; hierarchical tokens require multi\-step composition in which errors at the API level propagate to the endpoint level\.
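The error\-propagation argument can be illustrated with a toy two\-level decoder\. The sketch below is an assumption\-laden simplification: tool identifiers are hypothetical \(API token, endpoint token\) pairs, per\-step scores are fixed rather than conditioned on the prefix, and decoding is greedy rather than beam search\. It contrasts free decoding with decoding restricted to valid trie paths\.

```python
from typing import Dict, List, Optional, Tuple

Trie = Dict[str, dict]

# Hypothetical catalog of valid (API token, endpoint token) paths.
VALID_PATHS: List[Tuple[str, str]] = [
    ("api_weather", "ep_forecast"),
    ("api_weather", "ep_alerts"),
    ("api_maps", "ep_route"),
]


def build_trie(paths: List[Tuple[str, ...]]) -> Trie:
    """Nested-dict trie over token paths."""
    root: Trie = {}
    for path in paths:
        node = root
        for tok in path:
            node = node.setdefault(tok, {})
    return root


def greedy_decode(step_scores: List[Dict[str, float]],
                  trie: Optional[Trie] = None) -> List[str]:
    """Pick the best token per step; with a trie, only its children are allowed."""
    out: List[str] = []
    node = trie
    for scores in step_scores:
        allowed = list(scores) if node is None else [t for t in scores if t in node]
        if not allowed:
            break
        best = max(allowed, key=lambda t: scores[t])
        out.append(best)
        if node is not None:
            node = node[best]
    return out


# The gold tool is (api_weather, ep_forecast), but the model slightly prefers
# the wrong API at step 1 while preferring the right endpoint at step 2.
scores = [
    {"api_maps": 0.6, "api_weather": 0.4},
    {"ep_forecast": 0.7, "ep_route": 0.2, "ep_alerts": 0.1},
]
free = greedy_decode(scores)
constrained = greedy_decode(scores, build_trie(VALID_PATHS))
```

In this toy case the free output is not a valid tool at all, while the constrained output is valid but wrong: once the API\-level token is wrong, the trie can only offer endpoints under that API\. A flat token has no such second step\.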

Figure [5](https://arxiv.org/html/2606.12451#S5.F5) shows that the gap persists across all beam widths: TG\-H plateaus below 9% $R_f@k$ regardless of beam budget, while flat configurations scale to 37%\.

![Refer to caption](https://arxiv.org/html/2606.12451v1/x6.png)

Figure 5: RRB $R_c@k$ and $R_f@k$ vs\. beam width $k$ for five Gemma3\-4B Stage 2 configurations\. Shaded bands are 95% CIs\.

Figure [6](https://arxiv.org/html/2606.12451#S5.F6) shows flat configurations declining progressively across RRB difficulty tiers \(Easy 58–64% → RRB 26–30%\), while hierarchical configurations underperform at every level \(Easy: 37–43%, RRB: 21–22%\)\. Token format is the dominant separator: flat leads by 15–25 pp on Easy and Medium, with TG\-3FM \(LoRA\) achieving the highest single\-target retrievability \(Easy: 63\.5%\)\.

![Refer to caption](https://arxiv.org/html/2606.12451v1/x7.png)

Figure 6: $R_c@50$ by RRB difficulty tier for five Gemma3\-4B Stage 2 configurations\. Error bars are 95% bootstrap CIs\.

LoRA as knowledge preservation\. For Gemma3\-4B TG\-3FM, LoRA preserves MCQ at 41\.7% \(95% CI \[37\.7, 46\.2\]\) vs\. FFT at 29\.8% \(CI \[25\.6, 33\.9\]\)\. For TG\-5FM, the same trend holds: 28\.8% \(CI \[25\.0, 32\.9\]\) vs\. 21\.6% \(CI \[17\.7, 25\.2\]\)\. For hierarchical tokens, LoRA also improves IS: TG\-5FM \(LoRA\) achieves RRB IS=0\.79 vs\. FFT at 0\.64\. The effect is most pronounced where FFT causes severe destruction; for Qwen3\.5\-4B, which already preserves knowledge without LoRA \(TG\-3FM: 72\.8%\), adding LoRA yields only a marginal additional gain \(74\.2%\)\.

Virtual tool token embedding drift\. We measure the relative L2 drift of virtual tool tokens vs\. base\-vocabulary tokens across Stage 1 → 2: virtual tool tokens drift 2–3× farther under FFT and 22\.9× farther under LoRA with the embedding layer fully fine\-tuned \(Mann\-Whitney $p<2\times10^{-308}$\), yet within\-cluster cosine similarity barely changes \($|\Delta\mathrm{cosim}|<0.002$\), suggesting that tokens shift as coherent clusters rather than diverging to encode distinct knowledge content\. Hierarchical tokens enter Stage 2 near\-orthogonal to each other \($\mathrm{cosim}_{S1}\approx 0.038$ vs\. 0\.37 for flat tokens\) and require 7–24× less repositioning, yet still fail MCQ probes just as severely, suggesting that token geometry is not the source of the knowledge\-retrieval dissociation \(full analysis in Appendix [I](https://arxiv.org/html/2606.12451#A9)\)\.
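To make the two geometric quantities concrete, the sketch below computes a relative L2 drift and the change in mean pairwise cosine similarity for a toy cluster\. The definition $d_{\mathrm{rel}} = \lVert e_{S2} - e_{S1} \rVert_2 / \lVert e_{S1} \rVert_2$ is an assumption \(the precise formula is given in Appendix I, outside this section\), and the 2\-D vectors are hypothetical\.

```python
import math
from itertools import combinations
from typing import List, Sequence

Vector = Sequence[float]


def norm(v: Vector) -> float:
    return math.sqrt(sum(x * x for x in v))


def relative_l2_drift(before: Vector, after: Vector) -> float:
    """Assumed definition: ||after - before||_2 / ||before||_2."""
    base = norm(before)
    if base == 0.0:
        raise ValueError("reference embedding has zero norm")
    return norm([a - b for a, b in zip(after, before)]) / base


def cosine(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))


def mean_pairwise_cosine(cluster: Sequence[Vector]) -> float:
    pairs = list(combinations(cluster, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def rotate(v: Vector, theta: float) -> List[float]:
    """Rotate a 2-D vector by theta radians."""
    x, y = v
    return [x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta)]


# Toy 2-D "embeddings" (hypothetical): Stage 2 rotates the whole cluster together.
stage1 = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8]]
stage2 = [rotate(v, math.pi / 6) for v in stage1]

drifts = [relative_l2_drift(b, a) for b, a in zip(stage1, stage2)]
delta_cosim = mean_pairwise_cosine(stage2) - mean_pairwise_cosine(stage1)
```

In this toy example every vector moves by a sizeable fraction of its norm, yet the pairwise cosine similarities are unchanged because the cluster moves rigidly\. This is the pattern that the paragraph above describes as tokens shifting as coherent clusters\.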

![Refer to caption](https://arxiv.org/html/2606.12451v1/x8.png)

Figure 7: Relative L2 drift \($d_{\mathrm{rel}}$\) of virtual vs\. randomly sampled vocabulary tokens across Stage 1 → Stage 2 for Gemma3\-4B configurations\.

Multi\-format training\. Multi\-format memorization \(TG\-3FM\) combined with LoRA achieves the highest Stage 2 MCQ accuracy across all architectures: 41\.7% \(Gemma3\-4B\), 74\.2% \(Qwen3\.5\-4B\), 76\.4% \(Gemma3\-12B\)\. Seeing the same tool tokens in multiple contexts during Stage 1 appears to build more robust representations that are harder to overwrite during Stage 2 retrieval fine\-tuning\.

System prompt\. During Stage 2, TG \(no SP\) achieves RRB $R_c@50$=43\.8% \(95% CI \[40\.3, 47\.7\]\) vs\. TG\-SP at 43\.2% \(CI \[39\.6, 46\.7\]\)\. The two intervals overlap almost entirely, suggesting that the system prompt has no measurable effect once retrieval training is complete\.

Training data distribution\. To test whether the knowledge\-retrieval dissociation is a consequence of the Stage 2 query distribution rather than the training objective itself, we trained a TG variant whose Stage 2 data consists entirely of RRB\-style queries generated by the ToolSense RRB generator over the same ∼47k tool catalog \(disjoint from the evaluation set; see Appendix [E](https://arxiv.org/html/2606.12451#A5)\)\. With this retraining, RRB $R_c@50$ rises from 28\.7% to 64\.1% \(\+35\.4 pp\), but G1 $R_c@50$ drops from 95\.7% to 86\.6%, suggesting that the model over\-specializes to the query style that appears in Stage 2 training\. MCQ accuracy drops from 31\.4% to 26\.0%, near\-random for a 4\-choice task\. The knowledge\-retrieval dissociation thus persists regardless of the Stage 2 training data distribution, pointing to the SFT objective itself as the deeper root cause\.

### 5\.5 Cross\-Architecture Validation

Table [5](https://arxiv.org/html/2606.12451#S5.T5) confirms that both findings generalize: the RRB\-recall collapse persists across all model families, and flat tokens consistently outperform hierarchical tokens by 16–18 pp \(IS: 0\.75–0\.98 flat vs\. 0\.28–0\.33 hierarchical\)\. Knowledge retention is not explained by model capacity alone: Qwen3\.5\-4B retains 62–73% MCQ accuracy, whereas the similarly sized Gemma3\-4B retains 29–31%, and the Gemma3\-12B TG\-3FM LoRA variant achieves the highest MCQ retention \(76\.4%\) with competitive RRB recall \(48\.4%\)\.

Table 5: Cross\-architecture results \(Stage 2, beam=50\)\.

## 6 Conclusion

We introduced ToolSense, a diagnostic framework that auto\-generates three benchmarks \(RRB, MCQ, and QA\) from any tool catalog to probe parametric retrieval systems along orthogonal axes: OOD generalization, trie\-dependency, and factual knowledge retention\. Applying ToolSense to ToolBench reveals that high constrained recall on verbose in\-distribution queries is a poor proxy for real\-world capability: performance collapses on realistic queries, hierarchical tokens exhibit deep trie\-dependency, and Stage 2 retrieval fine\-tuning systematically erodes the tool knowledge built during Stage 1, leaving models that behave more like lookup tables than knowledge bases\. LoRA combined with multi\-format Stage 1 memorization best mitigates these failure modes, and Stage 1 MCQ accuracy is a reliable early predictor of downstream OOD generalization\. We release the ToolSense framework and the ToolBench diagnostic benchmarks to encourage further research\.

## Limitations

While ToolSense’s benchmarks are designed to generalize to other parametric retrieval designs, our empirical study focuses on ToolGen’s two\-stage paradigm \(Wang et al\., [2025](https://arxiv.org/html/2606.12451#bib.bib1)\) applied to ToolBench; extending the diagnosis to other catalog\-agnostic parametric designs is a natural next step\. The RRB, MCQ, and QA benchmarks are LLM\-generated and validated by human annotators on 100 samples each \($\kappa\geq 0.805$\); scaling the human study beyond 100 samples per benchmark would further strengthen confidence in the benchmark quality, though doing so at catalog scale involves non\-trivial annotation cost and time\. Our analysis focuses on the retrieval stage in isolation; end\-to\-end agentic evaluation, where retrieval quality interacts with planning and execution, is a promising direction for future work\. The open\-source base models we study \(Gemma3, Qwen\) may have encountered RapidAPI tool documentation during pre\-training, potentially inflating absolute Stage 1 knowledge scores relative to truly novel tools; the relative forgetting patterns across Stage 1 → Stage 2 remain unaffected by this confound, though our findings may not directly transfer to private enterprise tool catalogs whose APIs never appear in any public pre\-training corpus\. The Internalization Score is a ratio metric and can exhibit high variance when $R_c@k$ is low; while we discuss and handle the degenerate $R_c@k=0$ case explicitly and reported confidence intervals already capture this variance, future work should evaluate log\-ratio or calibration\-based formulations to confirm the stability of IS findings across catalog sizes\. Finally, our model experiments cover the 4B–12B parameter range; studying very large models \(≥30B\) may reveal additional nuances in how model scale affects knowledge retention under sequential fine\-tuning\.

## Ethical Considerations

We conducted experiments within the provisions of the ACL Ethics Policy and relevant research\-integrity guidelines\. There are, to the best of our knowledge, no remaining ethical risks that have not been addressed\.

## References

- D\. Biderman, J\. Portes, J\. J\. G\. Ortiz, M\. Paul, P\. Greengard, C\. Jennings, D\. King, S\. Havens, V\. Chiley, J\. Frankle, C\. Blakeney, and J\. P\. Cunningham \(2024\)LoRA learns less and forgets less\.Transactions on Machine Learning Research\.Note:Featured CertificationExternal Links:ISSN 2835\-8856,[Link](https://openreview.net/forum?id=aloEru2qCG)Cited by:[5th item](https://arxiv.org/html/2606.12451#A3.I1.i5.p1.1),[§2](https://arxiv.org/html/2606.12451#S2.p4.1)\.
- N\. D\. Cao, G\. Izacard, S\. Riedel, and F\. Petroni \(2021\)Autoregressive entity retrieval\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=5k8F6UU39V)Cited by:[Appendix I](https://arxiv.org/html/2606.12451#A9.SS0.SSS0.Px3.p1.8),[§1](https://arxiv.org/html/2606.12451#S1.p2.3),[§2](https://arxiv.org/html/2606.12451#S2.p1.1)\.
- J\. Cohen \(1988\)Statistical power analysis for the behavioral sciences\.2nd edition,Lawrence Erlbaum Associates,Hillsdale, NJ\.Cited by:[§5\.2](https://arxiv.org/html/2606.12451#S5.SS2.p2.3),[§5\.4](https://arxiv.org/html/2606.12451#S5.SS4.p2.6)\.
- D\. Dai, L\. Dong, Y\. Hao, Z\. Sui, B\. Chang, and F\. Wei \(2022\)Knowledge neurons in pretrained transformers\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),S\. Muresan, P\. Nakov, and A\. Villavicencio \(Eds\.\),Dublin, Ireland,pp\. 8493–8502\.External Links:[Link](https://aclanthology.org/2022.acl-long.581/),[Document](https://dx.doi.org/10.18653/v1/2022.acl-long.581)Cited by:[Appendix I](https://arxiv.org/html/2606.12451#A9.SS0.SSS0.Px5.p1.1)\.
- K\. Ethayarajh \(2019\)How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT\-2 embeddings\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\),K\. Inui, J\. Jiang, V\. Ng, and X\. Wan \(Eds\.\),Hong Kong, China,pp\. 55–65\.External Links:[Link](https://aclanthology.org/D19-1006/),[Document](https://dx.doi.org/10.18653/v1/D19-1006)Cited by:[Appendix I](https://arxiv.org/html/2606.12451#A9.SS0.SSS0.Px3.p1.8)\.
- J\. Gao, D\. He, X\. Tan, T\. Qin, L\. Wang, and T\. Liu \(2019\)Representation degeneration problem in training natural language generation models\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=SkEYojRqtm)Cited by:[Appendix I](https://arxiv.org/html/2606.12451#A9.SS0.SSS0.Px3.p1.8)\.
- E\. J\. Hu, yelong shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\)LoRA: low\-rank adaptation of large language models\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=nZeVKeeFYf9)Cited by:[Appendix I](https://arxiv.org/html/2606.12451#A9.SS0.SSS0.Px2.p1.2),[§2](https://arxiv.org/html/2606.12451#S2.p4.1),[§4\.1](https://arxiv.org/html/2606.12451#S4.SS1.p1.1)\.
- V\. Karpukhin, B\. Oguz, S\. Min, P\. Lewis, L\. Wu, S\. Edunov, D\. Chen, and W\. Yih \(2020\)Dense passage retrieval for open\-domain question answering\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),B\. Webber, T\. Cohn, Y\. He, and Y\. Liu \(Eds\.\),Online,pp\. 6769–6781\.External Links:[Link](https://aclanthology.org/2020.emnlp-main.550/),[Document](https://dx.doi.org/10.18653/v1/2020.emnlp-main.550)Cited by:[§1](https://arxiv.org/html/2606.12451#S1.p1.1)\.
- B\. Lester, R\. Al\-Rfou, and N\. Constant \(2021\)The power of scale for parameter\-efficient prompt tuning\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,M\. Moens, X\. Huang, L\. Specia, and S\. W\. Yih \(Eds\.\),Online and Punta Cana, Dominican Republic,pp\. 3045–3059\.External Links:[Link](https://aclanthology.org/2021.emnlp-main.243/),[Document](https://dx.doi.org/10.18653/v1/2021.emnlp-main.243)Cited by:[Appendix I](https://arxiv.org/html/2606.12451#A9.SS0.SSS0.Px1.p1.4)\.
- X\. L\. Li and P\. Liang \(2021\)Prefix\-tuning: optimizing continuous prompts for generation\.InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\),C\. Zong, F\. Xia, W\. Li, and R\. Navigli \(Eds\.\),Online,pp\. 4582–4597\.External Links:[Link](https://aclanthology.org/2021.acl-long.353/),[Document](https://dx.doi.org/10.18653/v1/2021.acl-long.353)Cited by:[Appendix I](https://arxiv.org/html/2606.12451#A9.SS0.SSS0.Px1.p1.4)\.
- M\. McCloskey and N\. J\. Cohen \(1989\)Catastrophic interference in connectionist networks: the sequential learning problem\.G\. H\. Bower \(Ed\.\),Psychology of Learning and Motivation, Vol\.24,pp\. 109–165\.External Links:ISSN 0079\-7421,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/S0079-7421%2808%2960536-8),[Link](https://www.sciencedirect.com/science/article/pii/S0079742108605368)Cited by:[§2](https://arxiv.org/html/2606.12451#S2.p4.1)\.
- K\. Meng, D\. Bau, A\. J\. Andonian, and Y\. Belinkov \(2022\)Locating and editing factual associations in GPT\.InAdvances in Neural Information Processing Systems,A\. H\. Oh, A\. Agarwal, D\. Belgrave, and K\. Cho \(Eds\.\),External Links:[Link](https://openreview.net/forum?id=-h6WAS6eE4)Cited by:[Appendix I](https://arxiv.org/html/2606.12451#A9.SS0.SSS0.Px5.p1.1)\.
- OpenAI, J\. Achiam, S\. Adler, S\. Agarwal, L\. Ahmad, I\. Akkaya, F\. L\. Aleman, D\. Almeida, J\. Altenschmidt, S\. Altman, S\. Anadkat, R\. Avila, I\. Babuschkin, S\. Balaji, V\. Balcom, P\. Baltescu, H\. Bao, M\. Bavarian, J\. Belgum, I\. Bello, J\. Berdine, G\. Bernadett\-Shapiro, C\. Berner, L\. Bogdonoff, O\. Boiko, M\. Boyd, A\. Brakman, G\. Brockman, T\. Brooks, M\. Brundage, K\. Button, T\. Cai, R\. Campbell, A\. Cann, B\. Carey, C\. Carlson, R\. Carmichael, B\. Chan, C\. Chang, F\. Chantzis, D\. Chen, S\. Chen, R\. Chen, J\. Chen, M\. Chen, B\. Chess, C\. Cho, C\. Chu, H\. W\. Chung, D\. Cummings, J\. Currier, Y\. Dai, C\. Decareaux, T\. Degry, N\. Deutsch, D\. Deville, A\. Dhar, D\. Dohan, S\. Dowling, S\. Dunning, A\. Ecoffet, A\. Eleti, T\. Eloundou, D\. Farhi, L\. Fedus, N\. Felix, S\. P\. Fishman, J\. Forte, I\. Fulford, L\. Gao, E\. Georges, C\. Gibson, V\. Goel, T\. Gogineni, G\. Goh, R\. Gontijo\-Lopes, J\. Gordon, M\. Grafstein, S\. Gray, R\. Greene, J\. Gross, S\. S\. Gu, Y\. Guo, C\. Hallacy, J\. Han, J\. Harris, Y\. He, M\. Heaton, J\. Heidecke, C\. Hesse, A\. Hickey, W\. Hickey, P\. Hoeschele, B\. Houghton, K\. Hsu, S\. Hu, X\. Hu, J\. Huizinga, S\. Jain, S\. Jain, J\. Jang, A\. Jiang, R\. Jiang, H\. Jin, D\. Jin, S\. Jomoto, B\. Jonn, H\. Jun, T\. Kaftan, Ł\. Kaiser, A\. Kamali, I\. Kanitscheider, N\. S\. Keskar, T\. Khan, L\. Kilpatrick, J\. W\. Kim, C\. Kim, Y\. Kim, J\. H\. Kirchner, J\. Kiros, M\. Knight, D\. Kokotajlo, Ł\. Kondraciuk, A\. Kondrich, A\. Konstantinidis, K\. Kosic, G\. Krueger, V\. Kuo, M\. Lampe, I\. Lan, T\. Lee, J\. Leike, J\. Leung, D\. Levy, C\. M\. Li, R\. Lim, M\. Lin, S\. Lin, M\. Litwin, T\. Lopez, R\. Lowe, P\. Lue, A\. Makanju, K\. Malfacini, S\. Manning, T\. Markov, Y\. Markovski, B\. Martin, K\. Mayer, A\. Mayne, B\. McGrew, S\. M\. McKinney, C\. McLeavey, P\. McMillan, J\. McNeil, D\. Medina, A\. Mehta, J\. Menick, L\. Metz, A\. Mishchenko, P\. Mishkin, V\. Monaco, E\. Morikawa, D\. Mossing, T\. Mu, M\. 
Murati, O\. Murk, D\. Mély, A\. Nair, R\. Nakano, R\. Nayak, A\. Neelakantan, R\. Ngo, H\. Noh, L\. Ouyang, C\. O’Keefe, J\. Pachocki, A\. Paino, J\. Palermo, A\. Pantuliano, G\. Parascandolo, J\. Parish, E\. Parparita, A\. Passos, M\. Pavlov, A\. Peng, A\. Perelman, F\. de Avila Belbute Peres, M\. Petrov, H\. P\. de Oliveira Pinto, Michael, Pokorny, M\. Pokrass, V\. H\. Pong, T\. Powell, A\. Power, B\. Power, E\. Proehl, R\. Puri, A\. Radford, J\. Rae, A\. Ramesh, C\. Raymond, F\. Real, K\. Rimbach, C\. Ross, B\. Rotsted, H\. Roussez, N\. Ryder, M\. Saltarelli, T\. Sanders, S\. Santurkar, G\. Sastry, H\. Schmidt, D\. Schnurr, J\. Schulman, D\. Selsam, K\. Sheppard, T\. Sherbakov, J\. Shieh, S\. Shoker, P\. Shyam, S\. Sidor, E\. Sigler, M\. Simens, J\. Sitkin, K\. Slama, I\. Sohl, B\. Sokolowsky, Y\. Song, N\. Staudacher, F\. P\. Such, N\. Summers, I\. Sutskever, J\. Tang, N\. Tezak, M\. B\. Thompson, P\. Tillet, A\. Tootoonchian, E\. Tseng, P\. Tuggle, N\. Turley, J\. Tworek, J\. F\. C\. Uribe, A\. Vallone, A\. Vijayvergiya, C\. Voss, C\. Wainwright, J\. J\. Wang, A\. Wang, B\. Wang, J\. Ward, J\. Wei, C\. Weinmann, A\. Welihinda, P\. Welinder, J\. Weng, L\. Weng, M\. Wiethoff, D\. Willner, C\. Winter, S\. Wolrich, H\. Wong, L\. Workman, S\. Wu, J\. Wu, M\. Wu, K\. Xiao, T\. Xu, S\. Yoo, K\. Yu, Q\. Yuan, W\. Zaremba, R\. Zellers, C\. Zhang, M\. Zhang, S\. Zhao, T\. Zheng, J\. Zhuang, W\. Zhuk, and B\. Zoph \(2024\)GPT\-4 technical report\.External Links:2303\.08774,[Link](https://arxiv.org/abs/2303.08774)Cited by:[§4](https://arxiv.org/html/2606.12451#S4.SS0.SSS0.Px1.p1.1)\.
- OpenAI \(2024\)New embedding models and API updates\.Note:Accessed: 2026\-05\-23External Links:[Link](https://openai.com/blog/new-embedding-models-and-api-updates)Cited by:[§4\.1](https://arxiv.org/html/2606.12451#S4.SS1.SSS0.Px1.p1.1)\.
- S\. G\. Patil, T\. Zhang, X\. Wang, and J\. E\. Gonzalez \(2024\)Gorilla: large language model connected with massive APIs\.InThe Thirty\-eighth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=tBRNC6YemY)Cited by:[§2](https://arxiv.org/html/2606.12451#S2.p2.2)\.
- F\. Petroni, T\. Rocktäschel, S\. Riedel, P\. Lewis, A\. Bakhtin, Y\. Wu, and A\. Miller \(2019\)Language models as knowledge bases?\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\),K\. Inui, J\. Jiang, V\. Ng, and X\. Wan \(Eds\.\),Hong Kong, China,pp\. 2463–2473\.External Links:[Link](https://aclanthology.org/D19-1250/),[Document](https://dx.doi.org/10.18653/v1/D19-1250)Cited by:[Appendix I](https://arxiv.org/html/2606.12451#A9.SS0.SSS0.Px5.p1.1),[§1](https://arxiv.org/html/2606.12451#S1.p3.1),[§2](https://arxiv.org/html/2606.12451#S2.p3.1)\.
- A\. Petrov, P\. Torr, and A\. Bibi \(2024\)When do prompting and prefix\-tuning work? a theory of capabilities and limitations\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=JewzobRhay)Cited by:[Appendix I](https://arxiv.org/html/2606.12451#A9.SS0.SSS0.Px5.p1.1)\.
- Y\. Qin, S\. Hu, Y\. Lin, W\. Chen, N\. Ding, G\. Cui, Z\. Zeng, X\. Zhou, Y\. Huang, C\. Xiao, C\. Han, Y\. R\. Fung, Y\. Su, H\. Wang, C\. Qian, R\. Tian, K\. Zhu, S\. Liang, X\. Shen, B\. Xu, Z\. Zhang, Y\. Ye, B\. Li, Z\. Tang, J\. Yi, Y\. Zhu, Z\. Dai, L\. Yan, X\. Cong, Y\. Lu, W\. Zhao, Y\. Huang, J\. Yan, X\. Han, X\. Sun, D\. Li, J\. Phang, C\. Yang, T\. Wu, H\. Ji, G\. Li, Z\. Liu, and M\. Sun \(2024a\)Tool learning with foundation models\.ACM Comput\. Surv\.57\(4\)\.External Links:ISSN 0360\-0300,[Link](https://doi.org/10.1145/3704435),[Document](https://dx.doi.org/10.1145/3704435)Cited by:[§1](https://arxiv.org/html/2606.12451#S1.p1.1)\.
- Y\. Qin, S\. Liang, Y\. Ye, K\. Zhu, L\. Yan, Y\. Lu, Y\. Lin, X\. Cong, X\. Tang, B\. Qian, S\. Zhao, L\. Hong, R\. Tian, R\. Xie, J\. Zhou, M\. Gerstein, dahai li, Z\. Liu, and M\. Sun \(2024b\)ToolLLM: facilitating large language models to master 16000\+ real\-world APIs\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=dHng2O0Jjr)Cited by:[Appendix L](https://arxiv.org/html/2606.12451#A12.SS0.SSS0.Px1.p1.1),[§C\.2](https://arxiv.org/html/2606.12451#A3.SS2.p1.1),[§1](https://arxiv.org/html/2606.12451#S1.p2.3),[§2](https://arxiv.org/html/2606.12451#S2.p2.2),[§4](https://arxiv.org/html/2606.12451#S4.SS0.SSS0.Px1.p1.1),[§4](https://arxiv.org/html/2606.12451#S4.p1.2)\.
- S\. Robertson and H\. Zaragoza \(2009\)The probabilistic relevance framework: bm25 and beyond\.Found\. Trends Inf\. Retr\.3\(4\),pp\. 333–389\.External Links:ISSN 1554\-0669,[Link](https://doi.org/10.1561/1500000019),[Document](https://dx.doi.org/10.1561/1500000019)Cited by:[§4\.1](https://arxiv.org/html/2606.12451#S4.SS1.SSS0.Px1.p1.1)\.
- T\. Schick, J\. Dwivedi\-Yu, R\. Dessi, R\. Raileanu, M\. Lomeli, E\. Hambro, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom \(2023\)Toolformer: language models can teach themselves to use tools\.InThirty\-seventh Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=Yacmpz84TH)Cited by:[§1](https://arxiv.org/html/2606.12451#S1.p1.1),[§2](https://arxiv.org/html/2606.12451#S2.p2.2)\.
- Y\. Tay, V\. Q\. Tran, M\. Dehghani, J\. Ni, D\. Bahri, H\. Mehta, Z\. Qin, K\. Hui, Z\. Zhao, J\. Gupta, T\. Schuster, W\. W\. Cohen, and D\. Metzler \(2022\)Transformer memory as a differentiable search index\.InAdvances in Neural Information Processing Systems,A\. H\. Oh, A\. Agarwal, D\. Belgrave, and K\. Cho \(Eds\.\),External Links:[Link](https://openreview.net/forum?id=Vu-B0clPfq)Cited by:[§1](https://arxiv.org/html/2606.12451#S1.p1.1),[§2](https://arxiv.org/html/2606.12451#S2.p1.1)\.
- G\. Team, A\. Kamath, J\. Ferret, S\. Pathak, N\. Vieillard, R\. Merhej, S\. Perrin, T\. Matejovicova, A\. Ramé, M\. Rivière, L\. Rouillard, T\. Mesnard, G\. Cideron, J\. Grill, S\. Ramos, E\. Yvinec, M\. Casbon, E\. Pot, I\. Penchev, G\. Liu, F\. Visin, K\. Kenealy, L\. Beyer, X\. Zhai, A\. Tsitsulin, R\. Busa\-Fekete, A\. Feng, N\. Sachdeva, B\. Coleman, Y\. Gao, B\. Mustafa, I\. Barr, E\. Parisotto, D\. Tian, M\. Eyal, C\. Cherry, J\. Peter, D\. Sinopalnikov, S\. Bhupatiraju, R\. Agarwal, M\. Kazemi, D\. Malkin, R\. Kumar, D\. Vilar, I\. Brusilovsky, J\. Luo, A\. Steiner, A\. Friesen, A\. Sharma, A\. Sharma, A\. M\. Gilady, A\. Goedeckemeyer, A\. Saade, A\. Feng, A\. Kolesnikov, A\. Bendebury, A\. Abdagic, A\. Vadi, A\. György, A\. S\. Pinto, A\. Das, A\. Bapna, A\. Miech, A\. Yang, A\. Paterson, A\. Shenoy, A\. Chakrabarti, B\. Piot, B\. Wu, B\. Shahriari, B\. Petrini, C\. Chen, C\. L\. Lan, C\. A\. Choquette\-Choo, C\. Carey, C\. Brick, D\. Deutsch, D\. Eisenbud, D\. Cattle, D\. Cheng, D\. Paparas, D\. S\. Sreepathihalli, D\. Reid, D\. Tran, D\. Zelle, E\. Noland, E\. Huizenga, E\. Kharitonov, F\. Liu, G\. Amirkhanyan, G\. Cameron, H\. Hashemi, H\. Klimczak\-Plucińska, H\. Singh, H\. Mehta, H\. T\. Lehri, H\. Hazimeh, I\. Ballantyne, I\. Szpektor, I\. Nardini, J\. Pouget\-Abadie, J\. Chan, J\. Stanton, J\. Wieting, J\. Lai, J\. Orbay, J\. Fernandez, J\. Newlan, J\. Ji, J\. Singh, K\. Black, K\. Yu, K\. Hui, K\. Vodrahalli, K\. Greff, L\. Qiu, M\. Valentine, M\. Coelho, M\. Ritter, M\. Hoffman, M\. Watson, M\. Chaturvedi, M\. Moynihan, M\. Ma, N\. Babar, N\. Noy, N\. Byrd, N\. Roy, N\. Momchev, N\. Chauhan, N\. Sachdeva, O\. Bunyan, P\. Botarda, P\. Caron, P\. K\. Rubenstein, P\. Culliton, P\. Schmid, P\. G\. Sessa, P\. Xu, P\. Stanczyk, P\. Tafti, R\. Shivanna, R\. Wu, R\. Pan, R\. Rokni, R\. Willoughby, R\. Vallu, R\. Mullins, S\. Jerome, S\. Smoot, S\. Girgin, S\. Iqbal, S\. Reddy, S\. Sheth, S\. Põder, S\. Bhatnagar, S\. R\. Panyam, S\. Eiger, S\. 
Zhang, T\. Liu, T\. Yacovone, T\. Liechty, U\. Kalra, U\. Evci, V\. Misra, V\. Roseberry, V\. Feinberg, V\. Kolesnikov, W\. Han, W\. Kwon, X\. Chen, Y\. Chow, Y\. Zhu, Z\. Wei, Z\. Egyed, V\. Cotruta, M\. Giang, P\. Kirk, A\. Rao, K\. Black, N\. Babar, J\. Lo, E\. Moreira, L\. G\. Martins, O\. Sanseviero, L\. Gonzalez, Z\. Gleicher, T\. Warkentin, V\. Mirrokni, E\. Senter, E\. Collins, J\. Barral, Z\. Ghahramani, R\. Hadsell, Y\. Matias, D\. Sculley, S\. Petrov, N\. Fiedel, N\. Shazeer, O\. Vinyals, J\. Dean, D\. Hassabis, K\. Kavukcuoglu, C\. Farabet, E\. Buchatskaya, J\. Alayrac, R\. Anil, Dmitry, Lepikhin, S\. Borgeaud, O\. Bachem, A\. Joulin, A\. Andreev, C\. Hardin, R\. Dadashi, and L\. Hussenot \(2025\)Gemma 3 technical report\.External Links:2503\.19786,[Link](https://arxiv.org/abs/2503.19786)Cited by:[Appendix L](https://arxiv.org/html/2606.12451#A12.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.12451#S4.SS1.p1.1)\.
- I\. Tenney, D\. Das, and E\. Pavlick \(2019\)BERT rediscovers the classical NLP pipeline\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics,A\. Korhonen, D\. Traum, and L\. Màrquez \(Eds\.\),Florence, Italy,pp\. 4593–4601\.External Links:[Link](https://aclanthology.org/P19-1452/),[Document](https://dx.doi.org/10.18653/v1/P19-1452)Cited by:[§2](https://arxiv.org/html/2606.12451#S2.p3.1)\.
- W\. Timkey and M\. van Schijndel \(2021\)All bark and no bite: rogue dimensions in transformer language models obscure representational quality\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,M\. Moens, X\. Huang, L\. Specia, and S\. W\. Yih \(Eds\.\),Online and Punta Cana, Dominican Republic,pp\. 4527–4546\.External Links:[Link](https://aclanthology.org/2021.emnlp-main.372/),[Document](https://dx.doi.org/10.18653/v1/2021.emnlp-main.372)Cited by:[Appendix I](https://arxiv.org/html/2606.12451#A9.SS0.SSS0.Px3.p1.8)\.
- R\. Wang, X\. Han, L\. Ji, S\. Wang, T\. Baldwin, and H\. Li \(2025\)ToolGen: unified tool retrieval and calling via generation\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=XLMAMmowdY)Cited by:[Appendix B](https://arxiv.org/html/2606.12451#A2.p1.1),[§C\.1](https://arxiv.org/html/2606.12451#A3.SS1.SSS0.Px1.p1.1),[§C\.2](https://arxiv.org/html/2606.12451#A3.SS2.p1.1),[§1](https://arxiv.org/html/2606.12451#S1.p1.1),[§1](https://arxiv.org/html/2606.12451#S1.p2.3),[§2](https://arxiv.org/html/2606.12451#S2.p1.1),[§4](https://arxiv.org/html/2606.12451#S4.SS0.SSS0.Px1.p1.1),[§4](https://arxiv.org/html/2606.12451#S4.p1.2),[§5\.1](https://arxiv.org/html/2606.12451#S5.SS1.p1.3),[Limitations](https://arxiv.org/html/2606.12451#Sx1.p1.5)\.
- Y\. Wang, Y\. Hou, H\. Wang, Z\. Miao, S\. Wu, H\. Sun, Q\. Chen, Y\. Xia, C\. Chi, G\. Zhao, Z\. Liu, X\. Xie, H\. Sun, W\. Deng, Q\. Zhang, and M\. Yang \(2022\)A neural corpus indexer for document retrieval\.InAdvances in Neural Information Processing Systems,A\. H\. Oh, A\. Agarwal, D\. Belgrave, and K\. Cho \(Eds\.\),External Links:[Link](https://openreview.net/forum?id=fSfcEYQP_qc)Cited by:[§2](https://arxiv.org/html/2606.12451#S2.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu \(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[Appendix L](https://arxiv.org/html/2606.12451#A12.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.12451#S4.SS1.p1.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2023\)ReAct: synergizing reasoning and acting in language models\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§1](https://arxiv.org/html/2606.12451#S1.p1.1)\.

## Appendix A ToolBench Standard Splits vs\. RRB: Query Style Comparison

![Refer to caption](https://arxiv.org/html/2606.12451v1/x9.png)

Figure 8: Side\-by\-side comparison of ToolBench standard evaluation split queries \(G1/G2/G3\) and RRB queries for the same tools\. ToolBench queries are verbose, story\-wrapped, and enumerate specific parameters, whereas RRB queries are short, intent\-focused, and reflect how a real user would naturally phrase a request\.

Figure [8](https://arxiv.org/html/2606.12451#A1.F8) illustrates a fundamental difference in query style between the ToolBench standard evaluation splits \(G1/G2/G3\) and our Realistic Retrieval Benchmark \(RRB\)\. For the same underlying tools, ToolBench queries are verbose narrative prompts that effectively paraphrase the API documentation\. A model that has memorized tool descriptions at Stage 1 can retrieve the correct token by matching surface\-level vocabulary, without genuine semantic understanding\.

RRB queries, by contrast, are generated by $\theta_{\text{RRB}}$ with an explicit instruction to produce *intent\-focused* natural language: a concise statement of what the user wants to accomplish, stripped of API\-specific terminology\. The four examples shown span distinct domains, yet in each case, the RRB query is how a real user would type the request into a chat interface: “Show me pre\-game odds for today’s matches,” “Create a branded short link for our product page,” “Find me streaming options for Stranger Things,” and “What are the different marketplace categories I can browse for items?” This shift from documentation\-paraphrase to user\-intent query is precisely the distribution gap that causes retrieval collapse in our experiments \(Section [5](https://arxiv.org/html/2606.12451#S5)\): a parametric model trained exclusively on Stage 2 \(query $\to$ virtual token\) pairs derived from verbose ToolBench\-style prompts has not learned to bridge the lexical gap to short, colloquial, intent\-focused user utterances\.

## Appendix B Multi\-Choice Tool Selection \(MCTS\) Memorization Dataset

The standard Stage 1 memorization format trains the model on \(tool metadata $\to$ virtual token\) pairs \(Wang et al\., [2025](https://arxiv.org/html/2606.12451#bib.bib1)\)\. While effective at associating a tool with its token, this objective does not explicitly teach the model to *distinguish* semantically similar tools, a critical requirement in large catalogs where many tools share overlapping descriptions\.

#### Format\.

MCTS is a discriminative Stage 1 objective\. For each tool $\tau$, we construct a multiple\-choice instance by pairing its description $\mathrm{desc}(\tau)$ with $K$ hard\-negative tool descriptions $\mathcal{H}_{K}(\tau)=\{\tau_{1}^{-},\ldots,\tau_{K}^{-}\}$ retrieved by embedding\-based similarity\. At training time, the model receives the description of $\tau$ alongside $K{+}1$ tool tokens corresponding to the candidates \(shuffled\) and must generate the virtual token $v_{\tau}$ corresponding to the ground\-truth tool\. The training signal is therefore a next\-token loss conditioned on a *contrast set*, forcing the model to resolve fine\-grained semantic boundaries rather than simply recall a memorized association\.
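The construction of a single MCTS instance can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the toy negative tokens, and the newline layout of the prompt are assumptions; only the bracketed delimiters are taken from the MCTS prompt reported in Appendix K.1.

```python
import random


def build_mcts_instance(description, true_token, negative_tokens, rng):
    """Build one multiple-choice instance: K+1 shuffled tool tokens, one target."""
    options = [true_token, *negative_tokens]
    rng.shuffle(options)  # shuffle so the answer position carries no signal
    prompt = (
        "[START OF TOOL DESCRIPTION]\n"
        f"{description}\n"
        "[END OF TOOL DESCRIPTION]\n"
        "[START OF OPTIONS]\n"
        + "\n".join(options)
        + "\n[END OF OPTIONS]"
    )
    # The training target is the ground-truth virtual token only.
    return {"prompt": prompt, "options": options, "target": true_token}


# Hypothetical tokens for illustration (K = 4 hard negatives).
instance = build_mcts_instance(
    description="Manages and administers search queries...",
    true_token="<<ESH_ADMIN_SRV_0001&&SearchQueries>>",
    negative_tokens=[
        "<<SEARCH_SRV_A&&Queries>>",
        "<<SEARCH_SRV_B&&Lookup>>",
        "<<SEARCH_SRV_C&&Index>>",
        "<<SEARCH_SRV_D&&Suggest>>",
    ],
    rng=random.Random(0),
)
```

Passing an explicit `random.Random` instance keeps the shuffling reproducible across dataset builds.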

#### Hard\-negative mining\.

Candidates are retrieved using a dense embedding model \(text\-embedding\-3\-large\) indexed over the full tool catalog via chromadb\. For each anchor tool, we over\-fetch $K{+}5$ nearest neighbors by cosine similarity and remove the anchor itself, yielding a pool of $K$ hard negatives\. Because these tools are semantically proximate but functionally distinct \(e\.g\., two overlapping enterprise search APIs, or two QR\-code generation services\), they represent the most plausible confusables: exactly the cases where a model relying on shallow lexical overlap would fail\. For our experiments, we set $K=4$\.
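A minimal sketch of this mining step is shown below, using brute-force cosine similarity over a toy in-memory catalog instead of a vector database. The function names and toy embeddings are hypothetical, and truncating the filtered pool to the $K$ closest neighbors is an assumption about how the over-fetched list is reduced.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_hard_negatives(anchor, embeddings, k=4, overfetch=5):
    """Return the k tools closest to `anchor`, excluding the anchor itself."""
    query = embeddings[anchor]
    ranked = sorted(
        embeddings, key=lambda name: cosine(query, embeddings[name]), reverse=True
    )
    neighbours = ranked[: k + overfetch]  # over-fetch K+5 nearest neighbours
    pool = [name for name in neighbours if name != anchor]  # drop the anchor
    return pool[:k]  # assumption: keep the K closest remaining tools


# Toy 2-D embeddings (hypothetical tool names).
toy_catalog = {
    "qr_a": [1.0, 0.0],
    "qr_b": [0.99, 0.1],
    "qr_c": [0.9, 0.3],
    "search_a": [0.5, 0.5],
    "search_b": [0.4, 0.6],
    "weather": [0.0, 1.0],
    "odds": [-1.0, 0.2],
}
negatives = mine_hard_negatives("qr_a", toy_catalog)
```

Over-fetching guards against the anchor (or exact duplicates of it) occupying slots in the nearest-neighbor list.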

#### Effect on Stage 1 representations\.

By requiring the model to correctly identify $v_{\tau}$ against its nearest semantic neighbors, MCTS encourages the embedding of $v_{\tau}$ to be positioned not just *near* its own description, but *farther* from confusable alternatives\. This discriminative geometry is more resistant to the overwriting that occurs during Stage 2 retrieval SFT, as reflected in the higher Stage 2 MCQ scores observed for configurations that include MCTS in their Stage 1 training mix \(TG\-3FM, TG\-5FM; Table [1](https://arxiv.org/html/2606.12451#S4.T1)\)\.

## Appendix C Training Configuration Details

### C\.1 Stage 1: Memorization Formats

Stage 1 fine\-tunes the model to associate each tool’s virtual token with its metadata\. We define five memorization formats, each presenting the task from a different angle\. All formats share the same tool catalog but differ in what is given as input and what the model is trained to generate\. For applicable model configurations, exact system prompts are provided in Appendix [K\.1](https://arxiv.org/html/2606.12451#A11.SS1)\.

#### desc $\to$ tok \(standard\)\.

The model receives the full tool description and is trained to generate the virtual token\. This is the format introduced by Wang et al\. \([2025](https://arxiv.org/html/2606.12451#bib.bib1)\) and used in *all* configurations, regardless of token scheme\. The form of tok differs by scheme: for flat tokens it is a single atomic identifier encoding the full tool identity \(<<API&&ENDPOINT\>\>\); for hierarchical tokens it is the joint two\-token sequence \(<<API\>\><<ENDPOINT\>\>\), generated autoregressively as two successive decode steps\.

> Input: “Manages and administers search queries within the Enterprise Search Administration system…”
>
> Target \(flat\): <<ESH\_ADMIN\_SRV\_0001&&SearchQueries\>\>
>
> Target \(hierarchical\): <<ESH\_ADMIN\_SRV\_0001\>\><<SearchQueries\>\>

#### tok $\to$ desc \(reverse\)\.

The input and target are swapped: the model receives the virtual token and is trained to generate the tool description\. This forces bidirectional association \(the model must encode not only *which description maps to which token* but also *which token maps to which description*\), making the embedding more resistant to overwriting during Stage 2\.

> Input \(flat\): <<ESH\_ADMIN\_SRV\_0001&&SearchQueries\>\>
>
> Input \(hierarchical\): <<ESH\_ADMIN\_SRV\_0001\>\><<SearchQueries\>\>
>
> Target: “Manages and administers search queries…”

#### desc $\to$ api\_tok \(hierarchical only\)\.

The model receives the tool description and is trained to generate only the API\-level token, the first step of the two\-token hierarchical sequence\. This format is specific to the hierarchical token configuration TG\-5FM and provides an additional supervised signal that trains the first generation step in isolation, supplementing the primary desc $\to$ tok format, which already trains both tokens jointly end\-to\-end\.

> Input: “Manages and administers search queries…”
>
> Target: <<ESH\_ADMIN\_SRV\_0001\>\>

#### desc\+api\_tok $\to$ endpoint\_tok \(hierarchical only\)\.

The model receives the tool description with the API\-level token and is trained to generate the endpoint\-level token, the second step of the hierarchical sequence\. Together with desc $\to$ api\_tok, this provides two additional supervised signals that train each hierarchical step independently, complementing the joint desc $\to$ tok objective already present in TG\-5FM\. This decomposition reinforces each level of the hierarchy with a dedicated loss signal\.

> Input: “Manages and administers search queries…<<ESH\_ADMIN\_SRV\_0001\>\>”
>
> Target: <<SearchQueries\>\>

#### MCTS \(discriminative\)\.

See Appendix [B](https://arxiv.org/html/2606.12451#A2) for a full description\. In brief, the model must identify the correct virtual token from a set of $K{+}1$ tool tokens, which includes the ground\-truth tool token as well as $K$ hard\-negative candidates retrieved by embedding similarity, adding a discriminative objective alongside the generative formats above\.
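The four generative formats above can be summarized in a short sketch that builds the (input, target) pairs for one tool under the hierarchical token scheme. The helper names are hypothetical, the token spelling simply follows the examples above, and the prompt wrapping from Appendix K.1 is omitted; how the tokens are registered in the tokenizer is not covered here.

```python
def flat_token(api, endpoint):
    """Single atomic identifier encoding the full tool identity."""
    return f"<<{api}&&{endpoint}>>"


def hierarchical_tokens(api, endpoint):
    """Two-token sequence: API-level token, then endpoint-level token."""
    return f"<<{api}>>", f"<<{endpoint}>>"


def stage1_pairs(description, api, endpoint):
    """Return (input, target) pairs for the four generative Stage 1 formats."""
    api_tok, endpoint_tok = hierarchical_tokens(api, endpoint)
    joint = api_tok + endpoint_tok
    return {
        "desc->tok": (description, joint),
        "tok->desc": (joint, description),
        "desc->api_tok": (description, api_tok),
        # The input here has two parts: the description and the API-level token.
        "desc+api_tok->endpoint_tok": ((description, api_tok), endpoint_tok),
    }


pairs = stage1_pairs(
    "Manages and administers search queries...",
    "ESH_ADMIN_SRV_0001",
    "SearchQueries",
)
```

Under the flat scheme only the first two formats apply, with `flat_token(...)` taking the place of the joint two-token sequence.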

### C\.2 Stage 2: Retrieval Training Data

Stage 2 fine\-tunes the model on \(query $\to$ virtual token\) pairs drawn from the ToolBench training split \(Qin et al\., [2024b](https://arxiv.org/html/2606.12451#bib.bib7)\), following Wang et al\. \([2025](https://arxiv.org/html/2606.12451#bib.bib1)\)\. The Stage 2 data is *identical* across all five configurations; no format\-specific data augmentation is applied at this stage\. The only variation is whether a system prompt is prepended to each training instance: TG omits it entirely, while all other configurations \(TG\-SP, TG\-3FM, TG\-H, TG\-5FM\) include the same task\-specific system prompt used during Stage 1, ensuring consistency between the training stages and inference\. For applicable model configurations, exact system prompts are provided in Appendix [K\.2](https://arxiv.org/html/2606.12451#A11.SS2)\.

### C\.3 Design Rationale per Configuration

The five configurations isolate one variable at a time against a shared baseline:

- TG vs\. TG\-SP: isolates the effect of the system prompt, all else equal\.
- TG\-SP vs\. TG\-3FM: isolates the effect of enriching Stage 1 with reverse \(tok $\to$ desc\) and discriminative \(MCTS\) objectives, holding token format and system prompt constant\.
- TG\-SP vs\. TG\-H: isolates the effect of switching from flat to hierarchical tokens, holding memorization format \(desc $\to$ tok only\) and system prompt constant\.
- TG\-H vs\. TG\-5FM: isolates the effect of adding all five memorization formats to the hierarchical token baseline\.
- FFT vs\. LoRA \(within any configuration\): isolates the effect of constraining parameter update magnitude on knowledge retention across the sequential training stages \(Biderman et al\., [2024](https://arxiv.org/html/2606.12451#bib.bib14)\)\.

## Appendix D Training Hyperparameters

All experiments were run on a single node with 8 NVIDIA H200 GPUs\. Each 4B model was trained on a single GPU and each 12B model on two GPUs, with the AdamW optimizer, bf16 mixed precision, and a maximum sequence length of 1024 tokens\. The effective batch size is 1024 for all runs \(achieved via gradient accumulation: 1 GPU $\times$ 2 samples $\times$ 512 accumulation steps for 4B models; 2 GPUs $\times$ 2 samples $\times$ 256 accumulation steps for 12B models\)\.
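As a quick sanity check of the batch-size arithmetic above, the effective batch size is the product of the GPU count, the per-GPU micro-batch, and the number of gradient accumulation steps. The helper below is purely illustrative; its name and argument names are not taken from any training framework.

```python
def effective_batch_size(num_gpus, per_gpu_batch, grad_accum_steps):
    """Number of samples contributing to each optimizer step."""
    return num_gpus * per_gpu_batch * grad_accum_steps


batch_4b = effective_batch_size(num_gpus=1, per_gpu_batch=2, grad_accum_steps=512)
batch_12b = effective_batch_size(num_gpus=2, per_gpu_batch=2, grad_accum_steps=256)
```

Both settings give the same effective batch size, so the optimizer sees the same number of samples per update regardless of model size.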

Table 6: Training hyperparameters for both stages\.

### LoRA Configuration

Adapters are applied to all linear projection layers \(q\_proj, k\_proj, v\_proj, o\_proj, gate\_proj, up\_proj, down\_proj\) with rank $r{=}64$, $\alpha{=}128$, and zero dropout\. The full embedding layer \(embed\_tokens\) is kept fully trainable in all LoRA runs; weight tying between embed\_tokens and lm\_head is preserved\. LoRA Stage 2 runs initialise from the merged Stage 1 LoRA checkpoint\.

## Appendix E Realistic\-Query Stage 2 Training

To test whether the generalization collapse is caused by query distribution shift or a deeper property of the Stage 2 training objective, we retrained the TG \(Gemma3\-4B\) configuration using RRB\-style queries in place of the standard verbose ToolBench queries\.

#### Training data\.

We used ToolSense to generate realistic queries for all $\sim$47k tools in the ToolBench catalog\. Since RRB queries can have multiple correct tools \(Medium and Hard tiers require 2–3 and 4\+ tools respectively\), each generated sample $(q,[t_{1},\ldots,t_{n}])$ is expanded into $n$ individual training pairs $(q,t_{1}),\ldots,(q,t_{n})$, yielding 284,567 training samples in total\. The 500\-query RRB evaluation set is disjoint from this training corpus\. All other Stage 2 hyperparameters are identical to the standard TG run \(see Appendix [D](https://arxiv.org/html/2606.12451#A4)\)\.
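The expansion of multi-tool samples into single-target training pairs can be sketched as below. The function name and the toy queries and tool names are hypothetical and serve only to illustrate the transformation.

```python
def expand_to_pairs(samples):
    """Turn each (query, [t1, ..., tn]) sample into n (query, tool) pairs."""
    pairs = []
    for query, tools in samples:
        for tool in tools:
            pairs.append((query, tool))
    return pairs


# Hypothetical samples: one easy (1 tool) and one medium (2 tools).
samples = [
    ("Show me pre-game odds for today's matches", ["Odds&&PreGame"]),
    ("Create a short link for our product page", ["Links&&Shorten", "Links&&Brand"]),
]
training_pairs = expand_to_pairs(samples)
```

Each multi-tool query therefore appears several times in the training set, once per correct tool.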

#### Results\.

Table [7](https://arxiv.org/html/2606.12451#A5.T7) compares TG trained on standard vs\. RRB\-style Stage 2 data\. RRB $R_{c}@50$ rises by $+44.0$pp \(43\.8% $\to$ 87\.8%\), demonstrating that the model has sufficient capacity to handle realistic queries when trained on them\. The cost is a 9\.1pp drop on G1 verbose recall \(95\.7% $\to$ 86\.6%\), confirming that the two query styles induce complementary but non\-overlapping retrieval behaviors and that the model specializes to whichever distribution it trains on\. Critically, MCQ accuracy drops from 31\.5% to 26\.0% \(near\-random\), and QA from 50\.0% to 46\.6%, despite the substantial improvement in retrieval\. This demonstrates that Stage 2 SFT erodes parametric tool knowledge regardless of whether training queries are verbose or realistic, implicating the retrieval fine\-tuning objective itself rather than the query distribution\.

Table 7: TG Gemma3\-4B: standard vs\. RRB\-style Stage 2 training \(final checkpoint\)\. 95% bootstrap CIs in brackets\.

## Appendix F Internalization Score: Rationale and Degenerate\-Case Handling

#### What IS measures and what it does not\.

IS is a *conditional* metric: it asks, “of the tool knowledge the model has already acquired \(as measured by $R_{c}@k$\), how much has been internalized into the weights rather than delegated to the trie?” It is *not* a measure of how much the model has learned overall, as that role belongs to $R_{c}@k$ itself\. Consider the case where $R_{c}@50=0.10$ and $R_{f}@50=0.10$: recall scores are low, indicating the model has learned only a small fraction of tools; but whatever it did learn can be generated freely without constraint, giving $\text{IS}=1.0$\. This is not a failure of the metric; the correct reading is: *whatever small amount was learned is fully internalized*\. The learning deficit is plainly visible in $R_{c}$; IS tells us about the quality of that learning, not its quantity\. For this reason we always report $R_{f}$, $R_{c}$, and IS together \(see Tables [8](https://arxiv.org/html/2606.12451#A7.T8) and [9](https://arxiv.org/html/2606.12451#A7.T9)\) so the reader can distinguish low\-recall\-but\-internalized from high\-recall\-but\-trie\-dependent regimes at a glance\.

#### Degenerate case: $R_{c}@k=0$\.

When constrained recall is exactly zero the ratio is undefined; we define $\text{IS}=0$ in this case \(reflected in Eq\. [2](https://arxiv.org/html/2606.12451#S3.E2)\)\. Operationally, $R_{c}=0$ means the model cannot recall any correct tool even with full trie guidance, so there is nothing to internalize\. Assigning $\text{IS}=0$ rather than leaving it undefined prevents spurious high\-IS readings caused by a model that is uniformly broken\.

#### Relationship to a difference\-based formulation\.

An alternative design would use $\Delta@k=R_{f}@k-R_{c}@k$\. This directly measures the absolute gap but conflates two distinct phenomena: a model with $R_{c}=0.90$, $R_{f}=0.80$ \($\text{IS}=0.89$, $\Delta=-0.10$\) and one with $R_{c}=0.10$, $R_{f}=0.00$ \($\text{IS}=0$, $\Delta=-0.10$\) would produce the same difference score despite having very different failure modes\. The ratio, always paired with $R_{c}$, is more interpretable: $\text{IS}=0.89$ at $R_{c}=0.90$ means a high\-recall model is mostly but not fully internalized; $\text{IS}=0$ at $R_{c}=0.10$ means a low\-recall model is entirely trie\-dependent\.
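The contrast between the ratio and the difference can be reproduced with a short sketch. It assumes that IS is the plain ratio of free to constrained recall, as implied by the worked numbers above, with the zero-recall convention of the degenerate case; the exact form of Eq. 2 (for example, any clipping) is not restated in this appendix, so none is modeled here. Function names are illustrative.

```python
def internalization_score(r_free, r_constrained):
    """Ratio of free to constrained recall; defined as 0 when R_c is 0."""
    if r_constrained == 0:
        return 0.0  # degenerate case: nothing to internalize
    return r_free / r_constrained


def recall_gap(r_free, r_constrained):
    """Difference-based alternative: Delta = R_f - R_c."""
    return r_free - r_constrained


# The two models from the paragraph above.
high_recall = {"r_free": 0.80, "r_constrained": 0.90}
low_recall = {"r_free": 0.00, "r_constrained": 0.10}

is_high = internalization_score(**high_recall)
is_low = internalization_score(**low_recall)
gap_high = recall_gap(**high_recall)
gap_low = recall_gap(**low_recall)
```

The two gaps coincide up to floating-point error, while the two ratios differ sharply, which is the reason for reporting the ratio alongside $R_{c}$.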

## Appendix G Internalization Score: Full Results with Confidence Intervals

Tables [8](https://arxiv.org/html/2606.12451#A7.T8) and [9](https://arxiv.org/html/2606.12451#A7.T9) report IS@50 for all evaluated configurations, including 95% bootstrap confidence intervals on all four evaluation splits\. CIs for IS are derived via the delta method from the bootstrap CIs of $R_{f}@50$ and $R_{c}@50$: letting $\hat{\sigma}_{f}$ and $\hat{\sigma}_{c}$ be the half\-widths of the respective 95% bootstrap intervals divided by 1\.96, $\sigma_{\text{IS}}\approx\text{IS}\cdot\sqrt{(\hat{\sigma}_{f}/R_{f})^{2}+(\hat{\sigma}_{c}/R_{c})^{2}}$, and the CI is $[\text{IS}-1.96\,\sigma_{\text{IS}},\;\text{IS}+1.96\,\sigma_{\text{IS}}]$, clipped to $[0,1]$\.
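The interval computation described above can be sketched as follows. The function names and the example numbers are invented for illustration, and returning a zero-width interval when either recall is zero is an assumption made to avoid division by zero, not something specified in the text.

```python
import math

Z95 = 1.96  # normal quantile for a two-sided 95% interval


def sigma_from_ci(ci):
    """Standard error recovered from a (low, high) 95% bootstrap interval."""
    low, high = ci
    return (high - low) / 2 / Z95


def is_with_ci(r_free, ci_free, r_con, ci_con):
    """Return IS and its delta-method 95% CI, clipped to [0, 1]."""
    if r_con == 0 or r_free == 0:
        # Assumption: relative errors are undefined here, so report IS = 0
        # with a zero-width interval.
        return 0.0, (0.0, 0.0)
    score = r_free / r_con
    rel_err = math.sqrt(
        (sigma_from_ci(ci_free) / r_free) ** 2
        + (sigma_from_ci(ci_con) / r_con) ** 2
    )
    sigma_is = score * rel_err
    low = max(0.0, score - Z95 * sigma_is)
    high = min(1.0, score + Z95 * sigma_is)
    return score, (low, high)


# Made-up recalls and intervals, for illustration only.
score, (ci_low, ci_high) = is_with_ci(0.40, (0.36, 0.44), 0.80, (0.76, 0.84))
```

Note that this propagation treats the two recall estimates as independent, which is what the formula above implies.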

Table 8: IS@50 with 95% confidence intervals for training Gemma3\-4B \(Stage 2\)\. Values in brackets are 95% CIs\.

Table 9: IS@50 with 95% confidence intervals for training Qwen3\.5\-4B and Gemma3\-12B \(Stage 2\)\.

The horizontal rule within each table separates flat\-token configurations \(top\) from hierarchical\-token configurations \(bottom\)\. Figures [9](https://arxiv.org/html/2606.12451#A7.F9) and [10](https://arxiv.org/html/2606.12451#A7.F10) show the IS@50 training dynamics for different configurations across all splits, complementing the main\-body Figure [3](https://arxiv.org/html/2606.12451#S5.F3)\.

![Refer to caption](https://arxiv.org/html/2606.12451v1/x10.png)

Figure 9: IS@50 over Stage 2 training for Gemma3\-4B across all four evaluation splits\. Shaded bands are 95% bootstrap CIs\.

![Refer to caption](https://arxiv.org/html/2606.12451v1/x11.png)

Figure 10: IS@50 over Stage 2 training for Qwen3\.5\-4B across all four evaluation splits\. Shaded bands are 95% bootstrap CIs\.

Several patterns are consistent across architectures and sizes: \(i\) flat configurations typically achieve $\text{IS}>0.65$ on both G1 and RRB, with LoRA variants systematically higher \($\geq 0.80$\); \(ii\) hierarchical configurations show a pronounced penalty on RRB: TG\-H drops to $\text{IS}=0.33$ on RRB vs\. 0\.42 on G1 for Gemma3\-4B, and this pattern holds for Qwen3\.5\-4B \(0\.28 on RRB\); \(iii\) G3 IS values are frequently 1\.00 or near\-1\.00 for flat configurations, suggesting that out\-of\-distribution single\-API queries are handled without trie support; \(iv\) CIs are wider on RRB than on the ToolBench evaluation splits \(G1/G2/G3\), reflecting greater instability in the models’ ability to consistently produce the tool tokens in an organic manner\.

## Appendix H Cross\-Architecture Probing Results

Table [11](https://arxiv.org/html/2606.12451#A8.T11) reports MCQ and QA probing accuracy for Qwen3\.5\-4B and Gemma3\-12B, complementing the Gemma3\-4B results in Table [4](https://arxiv.org/html/2606.12451#S5.T4)\. Table [10](https://arxiv.org/html/2606.12451#A8.T10) lists all 14 model variants included in the Stage 1 MCQ vs\. Stage 2 RRB correlation analysis \(Figure [4](https://arxiv.org/html/2606.12451#S5.F4)\)\.

Table 10: All 14 model variants included in the Stage 1 MCQ vs\. Stage 2 RRB $R_{c}@50$ correlation plot \(Figure [4](https://arxiv.org/html/2606.12451#S5.F4)\)\.

The cross\-architecture results reveal a consistent pattern: larger and more capable base models enter Stage 2 with substantially stronger parametric tool knowledge \(Stage 1 MCQ of 73–75% for Qwen3\.5\-4B and Gemma3\-12B, vs\. 23–55% for Gemma3\-4B\) and largely *retain* that knowledge after retrieval fine\-tuning\. In contrast to Gemma3\-4B FFT variants, which suffer severe MCQ drops \(e\.g\., TG drops from 55\.4% to 31\.4%\), the Qwen3\.5\-4B and Gemma3\-12B configurations maintain or even slightly improve MCQ scores, confirming that knowledge destruction is more pronounced in smaller models fine\-tuned with full parameter updates\. Hierarchical tokenization \(TG\-H\) remains the most destructive configuration across architectures: Qwen3\.5\-4B TG\-H drops from 39\.7% to 31\.7% MCQ, reinforcing the earlier finding that collapsing tool tokens into path\-structured representations impedes semantic retention regardless of model scale\. LoRA consistently provides the best knowledge preservation for Qwen3\.5\-4B \(TG\-3FM LoRA retains 74\.2% MCQ through Stage 2\), confirming its role as the safest fine\-tuning strategy when knowledge retention is required\.

Table 11: MCQ and QA probing accuracy for Qwen3\.5\-4B and Gemma3\-12B\. Random baselines: MCQ=25%, QA=50%\. 95% CI in brackets\.

## Appendix I Virtual Token Embedding Drift Analysis

To understand what Stage 2 SFT modifies mechanistically, we measure the *relative L2 drift* of virtual token embeddings between Stage 1 and Stage 2 checkpoints\. For a token index $i$, define:

$$d_{\mathrm{rel}}(i)=\frac{\|\mathbf{E}_{S2}[i]-\mathbf{E}_{S1}[i]\|_{2}}{\|\mathbf{E}_{S1}[i]\|_{2}} \tag{3}$$

where $\mathbf{E}_{S1},\mathbf{E}_{S2}\in\mathbb{R}^{V\times d}$ are the embedding matrices at Stages 1 and 2 \($d{=}3840$ for Gemma\-3\-4B\-IT\)\. We compare virtual tokens against 5,000 randomly sampled base\-vocabulary tokens using a one\-sided Mann\-Whitney U test\. Results across all seven Gemma3\-4B configurations are shown in Figure [7](https://arxiv.org/html/2606.12451#S5.F7)\.
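A minimal sketch of the drift computation is given below, using plain Python lists in place of real embedding matrices. The helper names and toy values are hypothetical, and defining the drift ratio as the mean virtual-token drift divided by the mean base-token drift is an assumption; the aggregation used for the reported ratios is not stated in this appendix. The significance test is not sketched here.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in vec))


def relative_drift(e_s1, e_s2, index):
    """Relative L2 drift of one token embedding between Stage 1 and Stage 2."""
    before, after = e_s1[index], e_s2[index]
    diff = [a - b for a, b in zip(after, before)]
    return l2_norm(diff) / l2_norm(before)


def mean_drift(e_s1, e_s2, indices):
    """Average relative drift over a set of token indices."""
    return sum(relative_drift(e_s1, e_s2, i) for i in indices) / len(indices)


def drift_ratio(e_s1, e_s2, virtual_ids, base_ids):
    """Assumed definition: mean virtual-token drift over mean base-token drift."""
    return mean_drift(e_s1, e_s2, virtual_ids) / mean_drift(e_s1, e_s2, base_ids)


# Toy 2-D embeddings: row 0 plays a virtual token, row 1 a base-vocabulary token.
stage1 = [[3.0, 4.0], [1.0, 0.0]]
stage2 = [[3.0, 4.5], [1.0, 0.01]]
virtual_drift = relative_drift(stage1, stage2, 0)
ratio = drift_ratio(stage1, stage2, virtual_ids=[0], base_ids=[1])
```

With real checkpoints the same computation would normally be vectorized over the full embedding matrices with an array library.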

#### Finding 1: Stage 2 selectively repositions virtual tokens \(all configs, $p<2\times10^{-308}$\)\.

Drift ratios range from $1.9\times$ to $22.9\times$ depending on training regime\. This rules out uniform global drift: if Stage 2 were broadly fine\-tuning all weights, virtual and regular tokens would experience similar relative displacement\. In FFT models \($2$–$3\times$ ratio\), virtual tokens appear in *every* training example as generation targets while regular tokens appear sparsely in user queries; by the law of large numbers over mini\-batches, virtual tokens accumulate proportionally larger gradient updates \(Li and Liang, [2021](https://arxiv.org/html/2606.12451#bib.bib21); Lester et al\., [2021](https://arxiv.org/html/2606.12451#bib.bib22)\)\.

#### Finding 2: LoRA creates extreme isolation \($22.9\times$\)\.

Under LoRA \(Hu et al\., [2022](https://arxiv.org/html/2606.12451#bib.bib13)\), adapters are attached to linear layers while the full embedding layer remains trainable\. Despite this, base vocabulary tokens drift by only $D_{\mathrm{rel}}(\mathrm{RAND}){=}0.0015$ \(effectively numerical noise\), because they appear solely in input queries and their gradients must back\-propagate through the frozen backbone layers before reaching the embedding table, arriving with negligible magnitude\. Virtual tokens, as generation targets in every Stage 2 training example, receive direct strong gradient from the output cross\-entropy loss\. This input/output asymmetry, amplified by backbone freezing, is responsible for the $22.9\times$ drift ratio\.

#### Finding 3: Hierarchical tokens occupy a near\-isotropic geometric regime\.

Flat tokens have mean pairwise cosine similarity $\mathrm{cosim}_{S1}\approx 0.37$, a signature of embedding anisotropy produced when all tool tokens share similar Stage 1 generation objectives, pushing them into a common semantic cone \(Gao et al\., [2019](https://arxiv.org/html/2606.12451#bib.bib23); Ethayarajh, [2019](https://arxiv.org/html/2606.12451#bib.bib24); Timkey and van Schijndel, [2021](https://arxiv.org/html/2606.12451#bib.bib25)\)\. Hierarchical tokens have $\mathrm{cosim}_{S1}\approx 0.038$, approaching the isotropic baseline expected for randomly initialised unit vectors\. For Gemma\-3’s weight\-tied architecture \($\mathbf{W}_{\mathrm{head}}=\mathbf{E}$\), the trie\-routing logit margin is $l_{i}-l_{j}=(\mathbf{E}[i]-\mathbf{E}[j])^{\top}\mathbf{h}$; near\-orthogonal embeddings maximise this margin, so hierarchical tokens are better conditioned for constrained decoding from the start \(Cao et al\., [2021](https://arxiv.org/html/2606.12451#bib.bib8)\)\. Consequently, hierarchical tokens require $7$–$24\times$ less Stage 2 repositioning \(TG\-3FM flat: $d_{\mathrm{rel}}=0.024$; TG\-5FM hier: $d_{\mathrm{rel}}=0.003$\)\.

#### Finding 4: No representation collapse \($|\Delta\mathrm{cosim}|<0.002$\)\.

Despite targeted repositioning, pairwise cosine similarity of virtual tokens is essentially unchanged by Stage 2 across all configurations\. This rules out the hypothesis that Stage 2 collapses virtual tokens into undifferentiated routing codes\. Each token retains its Stage\-1\-acquired geometric identity; what changes is absolute position \(drift\), not relative configuration \(cosim\)\. The trie objective structurally prevents collapse: if all tokens became identical, constrained beam search could not route to specific tools\.

#### Connection to knowledge\-retrieval dissociation\.

Despite the targeted and statistically significant Stage 2 updates, MCQ/QA accuracy on tool knowledge degrades, confirming that Stage 2 optimizes virtual token positions as trie\-routing codes rather than encoding semantic tool knowledge\. Even hierarchical tokens, geometrically superior in every metric and requiring minimal Stage 2 adaptation, fail MCQ probes, closing the confound that poor token geometry explains the dissociation\. This is consistent with causal analyses locating factual recall in MLP layers of mid\-to\-upper transformer blocks \(Meng et al\., [2022](https://arxiv.org/html/2606.12451#bib.bib27); Dai et al\., [2022](https://arxiv.org/html/2606.12451#bib.bib28); Petroni et al\., [2019](https://arxiv.org/html/2606.12451#bib.bib9)\): Stage 2 targets the input embedding layer, leaving factual recall pathways untouched \(Petrov et al\., [2024](https://arxiv.org/html/2606.12451#bib.bib26)\)\.

## Appendix J Benchmark Generation Details

All three ToolSense benchmarks \(RRB, MCQ, QA\) were generated using claude\-4\.5\-sonnet for both generation and judging calls, accessed via a LiteLLM proxy\. Hard\-negative retrieval for the RRB pipeline used OpenAI text\-embedding\-3\-large embeddings stored in a ChromaDB vector database\. Judge calls were made at temperature $=0.0$; generation calls used the model’s default sampling temperature\.

### J\.1 Realistic Retrieval Benchmark \(RRB\) Generation Pipeline

#### Seed selection\.

Anchor tools are sampled from the ToolBench catalog using stratified sampling across service domains \(parsed from the Service&&Method name format\), with proportional allocation and a minimum of one seed per domain\.
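One way to realize such a seed selection is sketched below. It is not the ToolSense implementation: treating the part before `&&` as the domain, rounding the proportional share, and the function names are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def domain_of(tool_name):
    """Assumption: the service part before '&&' identifies the domain."""
    return tool_name.split("&&", 1)[0]


def sample_seeds(tool_names, n_seeds, rng):
    """Stratified sampling with proportional allocation, at least one per domain."""
    by_domain = defaultdict(list)
    for name in tool_names:
        by_domain[domain_of(name)].append(name)

    seeds = []
    for _, tools in sorted(by_domain.items()):
        share = round(n_seeds * len(tools) / len(tool_names))
        quota = min(len(tools), max(1, share))  # floor of one seed per domain
        seeds.extend(rng.sample(tools, quota))
    return seeds


# Hypothetical catalog: one large domain and two singletons.
catalog = [
    "Weather&&Current", "Weather&&Forecast", "Weather&&Alerts", "Weather&&History",
    "Odds&&PreGame",
    "Links&&Shorten",
]
seeds = sample_seeds(catalog, n_seeds=3, rng=random.Random(0))
```

Because of the one-seed minimum and rounding, the number of seeds returned can differ slightly from `n_seeds`.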

#### Hard\-negative pool construction\.

For each anchor tool $\tau$, 13 hard negatives are retrieved from the full catalog via cosine similarity over text\-embedding\-3\-large embeddings of tool descriptions\. The candidate pool is $\mathcal{P}(\tau)=\{\tau\}\cup\{\text{13 hard negatives}\}$, giving 14 total candidates\.

#### Generation and validation pipeline\.

Generation is implemented as a LangGraph StateGraph with three tier subgraphs \(easy, medium, hard\) running in parallel via fan\-out/fan\-in\. Each subgraph follows: *generate* $\to$ *programmatic filter* $\to$ *LLM judge* $\to$ \(accept or retry with feedback\)\. Up to three retries are allowed per sample; on retry, the judge’s rejection reason is injected into the next generation prompt\. The programmatic filter checks that all answer tool names appear verbatim in the candidate pool, preventing hallucinated tool names\.
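The control flow of one tier subgraph can be sketched in plain Python, without any graph framework. The `generate` and `judge` callables stand in for LLM calls and are hypothetical, as are the sample fields; feeding a filter failure back as retry feedback is also an assumption, since the text only mentions the judge's rejection reason.

```python
def passes_programmatic_filter(sample, candidate_pool):
    """All answer tool names must appear verbatim in the candidate pool."""
    return all(name in candidate_pool for name in sample["answer_tools"])


def generate_with_retries(generate, judge, candidate_pool, max_retries=3):
    """generate -> programmatic filter -> judge, retrying with feedback."""
    feedback = None
    for _ in range(1 + max_retries):  # one initial attempt plus up to 3 retries
        sample = generate(candidate_pool, feedback)
        if not passes_programmatic_filter(sample, candidate_pool):
            # Assumption: filter failures are also reported back to the generator.
            feedback = "answer tools must appear verbatim in the candidate pool"
            continue
        accepted, reason = judge(sample)
        if accepted:
            return sample
        feedback = reason  # inject the rejection reason into the next prompt
    return None  # sample is dropped after exhausting retries


# Stubs standing in for the LLM generator and judge.
pool = ["Odds&&PreGame", "Odds&&Live"]
feedback_log = []


def fake_generate(candidates, feedback):
    feedback_log.append(feedback)
    tool = "Odds&&MadeUp" if feedback is None else candidates[0]
    return {"query": "Show me pre-game odds for today's matches", "answer_tools": [tool]}


def fake_judge(sample):
    return True, ""


result = generate_with_retries(fake_generate, fake_judge, pool)
```

In this toy run the first attempt names a tool outside the pool, is rejected by the filter, and the second attempt is accepted.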

#### Tier specifications\.

Easy tier queries must resolve to exactly 1 tool, remain under 20 words, and be phrased in business language\. Medium tier queries must be genuinely ambiguous across 2–3 tools \(overlapping functional domains\), under 25 words\. Hard tier queries describe a high\-level business goal spanning 4\+ tools, in 1–3 sentences\. Default sample targets per seed: 3 \(easy\), 3 \(medium\), 2 \(hard\)\.
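The countable parts of these tier rules (number of answer tools and word limits) lend themselves to a simple programmatic check, sketched below. The rule table and function name are illustrative; qualitative requirements such as business phrasing, genuine ambiguity, and the 1–3 sentence limit for the hard tier are left out because they need a judge rather than a counter.

```python
TIER_RULES = {
    "easy": {"min_tools": 1, "max_tools": 1, "max_words": 20},
    "medium": {"min_tools": 2, "max_tools": 3, "max_words": 25},
    "hard": {"min_tools": 4, "max_tools": None, "max_words": None},
}


def satisfies_tier(query, answer_tools, tier):
    """Check tool-count and word-count constraints for one generated sample."""
    rules = TIER_RULES[tier]
    n_tools = len(answer_tools)
    if n_tools < rules["min_tools"]:
        return False
    if rules["max_tools"] is not None and n_tools > rules["max_tools"]:
        return False
    # "under N words" is read as strictly fewer than N whitespace-separated words.
    if rules["max_words"] is not None and len(query.split()) >= rules["max_words"]:
        return False
    return True


easy_ok = satisfies_tier("Find me streaming options for Stranger Things", ["t1"], "easy")
easy_bad = satisfies_tier("Find me streaming options for Stranger Things", ["t1", "t2"], "easy")
```

A check like this can run before the LLM judge so that obviously malformed samples never consume a judging call.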

### J\.2 MCQ and QA Probing Benchmark Generation

Both benchmarks follow the same two\-step generate\-then\-judge pattern: \(1\) a generation call produces a candidate item; \(2\) a judge call at temperature $=0.0$ accepts or rejects it\. Items failing the judge or flagged by the generator as unanswerable \(skip=true\) are discarded\.

#### MCQ\.

For each tool, the generator produces a 4\-way multiple\-choice question: one correct answer and three plausible\-but\-wrong distractors, all as short phrases \(2–8 words\)\. Questions must use “this tool” as a placeholder and test a specific, verifiable property\. The judge verifies five criteria: specificity, correctness, distractor plausibility, option distinctiveness, and placeholder usage\. Final dataset: 496 items\.

#### QA\.

For each tool, the generator produces one binary yes/no question with a pre\-specified target answer \(alternated across tools for label balance\)\. Questions must use “this tool” as a placeholder and test a verifiable property \(modality, domain, input/output type, etc\.\)\. The judge verifies four criteria: specificity, correctness, placeholder usage, and verifiability\. Final dataset: 500 items\.

## Appendix K System Prompts

We report the exact user\-turn prompts used for experiments conducted in this study\. Placeholders in \{braces\} are filled at runtime with the corresponding value\.

### K\.1 Memorization Format Prompts

The desc $\to$ tok format \(Figure [11](https://arxiv.org/html/2606.12451#A11.F11)\), used in all configurations, asks the model to predict the virtual token from the tool’s JSON description\. For flat tokens the expected output is a single atomic identifier \(<<API&&ENDPOINT\>\>\); for hierarchical tokens it is the joint two\-token sequence \(<<API\>\><<ENDPOINT\>\>\) generated autoregressively\.

The tok $\to$ desc format \(Figure [12](https://arxiv.org/html/2606.12451#A11.F12)\), used in TG\-3FM and TG\-5FM, reverses the mapping: the model receives the virtual token and is asked to generate the plain\-language tool description\. This bidirectional objective makes the association more resistant to overwriting during Stage 2\.

The desc $\to$ api\_tok format \(Figure [13](https://arxiv.org/html/2606.12451#A11.F13)\), used only in TG\-5FM, restricts the target to the API\-level token alone, providing an additional supervised signal that trains the first hierarchical generation step independently\.

The desc\+api\_tok $\to$ endpoint\_tok format \(Figure [14](https://arxiv.org/html/2606.12451#A11.F14)\), also TG\-5FM only, gives the model both the tool description and the API\-level token and asks it to predict the endpoint\-level token, training the second hierarchical step independently\.

The MCTS format \(Figure [15](https://arxiv.org/html/2606.12451#A11.F15)\), used in TG\-3FM and TG\-5FM, presents a discriminative selection task: the model must choose the correct virtual token from $K{+}1$ candidates \(the ground\-truth token plus $K$ hard negatives retrieved by embedding similarity\)\. See Appendix [B](https://arxiv.org/html/2606.12451#A2) for construction details\.

You will be provided with the information about the tool in the JSON format\. Your task is to predict the tool that corresponds to the given tool information\. You only need to predict the tool\. Please don’t include any additional text in your response\. The tool information is provided below: \{tool\_info\}

Figure 11: desc $\to$ tok prompt \(all configurations\)\.

You will be provided with a tool representation tokens\. Your task is to describe what this tool does in plain language\. Please don’t include any additional text in your response other than the description\. The tool representation is provided below: \{tool\_token\}

Figure 12: tok $\to$ desc prompt \(TG\-3FM, TG\-5FM\)\.

You will be provided with the information about the tool in the JSON format\. Your task is to identify the API this tool belongs to and produce its corresponding API token\. Only respond with the API token and nothing else\. The tool information is provided below: \{tool\_info\}

Figure 13: desc $\to$ api\_tok prompt \(TG\-5FM only\)\.

You will be provided with the information about the tool in the JSON format along with its API\. Your task is to identify the Endpoint this tool belongs to within that API and produce its corresponding Endpoint token\. Only respond with the Endpoint token and nothing else\. The tool information and API are provided below: Tool Information: \{tool\_info\} API: \{api\_token\}

Figure 14: desc\+api\_tok $\to$ endpoint\_tok prompt \(TG\-5FM only\)\.

You will be provided with the description of a tool along with line\-separated multiple choice options for the tool tokens\. Your task is to select the correct tool token that corresponds to the given tool description\. Please respond with only the correct tool token without any additional text\. The tool description and options are provided below: \[START OF TOOL DESCRIPTION\] \{tool\_description\} \[END OF TOOL DESCRIPTION\] \[START OF OPTIONS\] \{tool\_options\} \[END OF OPTIONS\] From the above new line\-separated options, select the correct tool\. Only respond with the tool representation \(sequence of tokens that represent tool\) and nothing else\.

Figure 15: MCTS discriminative prompt \(TG\-3FM, TG\-5FM\)\.

### K\.2 Retrieval Training Prompt

Figure [16](https://arxiv.org/html/2606.12451#A11.F16) shows the Stage 2 prompt shared by all configurations that use a system prompt \(TG\-SP, TG\-3FM, TG\-H, TG\-5FM\)\. The model receives a natural\-language user query and must predict the corresponding virtual tool token\. The no\-prompt baseline TG omits this prompt entirely; all other configurations use it consistently across Stage 2 training and inference\.

You will be provided with the user query\. Your task is to predict the tool that can best fulfill the user’s request\. You only need to predict the tool token\. Please don’t include any additional text in your response\. The user query is provided below:
\{query\}

Figure 16: Stage 2 retrieval training and inference prompt\.
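
As a minimal sketch of how the Figure 16 prompt could be instantiated for a single query, the snippet below fills the `{query}` slot with Python's `str.format`. The names `STAGE2_PROMPT` and `build_stage2_prompt` are illustrative assumptions, not identifiers from the ToolSense codebase, and the line break before the query is also assumed. The example query is one of the good-query examples from the RRB generation prompt.

```python
# Stage 2 retrieval prompt text, copied from Figure 16.
# The constant and function names are hypothetical, chosen for illustration only.
# Assumption: the query is placed on its own line after "below:".
STAGE2_PROMPT = (
    "You will be provided with the user query. "
    "Your task is to predict the tool that can best fulfill the user’s request. "
    "You only need to predict the tool token. "
    "Please don’t include any additional text in your response. "
    "The user query is provided below:\n{query}"
)


def build_stage2_prompt(query: str) -> str:
    """Return the Stage 2 prompt with the (whitespace-trimmed) user query substituted in."""
    return STAGE2_PROMPT.format(query=query.strip())


if __name__ == "__main__":
    print(build_stage2_prompt("Show me all open purchase orders"))
```

Because the no-prompt baseline TG omits this prompt, a comparable helper for TG would pass the raw query through unchanged.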
### K\.3 Benchmark Generation Prompts

All benchmark generation prompts were used with claude\-4\.5\-sonnet via a LiteLLM proxy\. Figures [17](https://arxiv.org/html/2606.12451#A11.F17) and [18](https://arxiv.org/html/2606.12451#A11.F18) show the generation and judge prompts for the QA probing benchmark; Figures [19](https://arxiv.org/html/2606.12451#A11.F19) and [20](https://arxiv.org/html/2606.12451#A11.F20) show the corresponding prompts for MCQ\. All judge prompts run at temperature = 0\.0\. Figure [21](https://arxiv.org/html/2606.12451#A11.F21) shows the full RRB easy\-tier generation prompt \(Jinja2 template\); the medium\- and hard\-tier prompts share the same structure with tier\-specific rules substituted in\. Figure [22](https://arxiv.org/html/2606.12451#A11.F22) shows the shared RRB judge prompt, parameterised by the \{\{complexity\}\} variable\.
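
The judge prompts in this appendix ask the model to set `accept` and to state which check failed, with the exact output schema supplied through `{format_instructions}`. The sketch below shows one way such a verdict could be parsed and used to filter generated items. The JSON layout, the `reason` field name, and the helpers `JudgeVerdict`, `parse_verdict`, and `keep_accepted` are assumptions for illustration; the actual ToolSense schema is not shown here, and the standard library is used in place of Pydantic to keep the example self-contained.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeVerdict:
    """Hypothetical judge output: `accept` is named in the prompts, `reason` is assumed."""
    accept: bool
    reason: str = ""


def parse_verdict(raw: str) -> JudgeVerdict:
    """Parse one judge response, rejecting any `accept` value that is not a strict boolean."""
    data = json.loads(raw)
    accept = data.get("accept")
    if not isinstance(accept, bool):
        raise ValueError(f"'accept' must be true or false, got {accept!r}")
    return JudgeVerdict(accept=accept, reason=str(data.get("reason", "")))


def keep_accepted(items: list[dict], raw_verdicts: list[str]) -> list[dict]:
    """Keep only the generated items whose paired verdict has accept=true."""
    if len(items) != len(raw_verdicts):
        raise ValueError("each item needs exactly one verdict")
    return [
        item
        for item, raw in zip(items, raw_verdicts)
        if parse_verdict(raw).accept
    ]


if __name__ == "__main__":
    items = [
        {"question": "Does this tool process image inputs?", "answer": "Yes"},
        {"question": "Does this tool provide an API?", "answer": "Yes"},
    ]
    verdicts = [
        '{"accept": true}',
        '{"accept": false, "reason": "check 1 failed: too generic"}',
    ]
    print(keep_accepted(items, verdicts))
```

Rejecting non-boolean `accept` values, rather than coercing strings such as "yes", keeps a malformed judge response from silently admitting an item.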

You are building a factual probing benchmark for AI tool retrieval research\.
Below is the description of an API or software tool:
\{description\}
Your task: generate one yes/no question that tests knowledge of a specific, verifiable functionality or capability of this tool\. The correct answer MUST be “\{target\_answer\}”\.
Rules:
1\. The answer to your question must be exactly “\{target\_answer\}” based on the description\.
2\. Use “this tool” in the question — never include the actual tool name or service name\. \(The model will see only a token at inference time, so the name must not be a hint\.\)
3\. The question must be specific to THIS tool\. “Does this tool provide an API?” is too generic\.
Good examples for Yes: “Does this tool process image inputs?”, “Is this tool designed for financial data?”
Good examples for No: “Does this tool support voice/audio input?”, “Does this tool return results in XML?”
For No questions: ask about a plausible capability or data type that the description does NOT mention \(e\.g\. a different modality, format, or domain that a user might expect but this tool doesn’t handle\)\.
4\. Set skip=true if you cannot form a specific, unambiguous question with the required answer\.
\{format\_instructions\}

Figure 17: QA probing benchmark: generation prompt\.

You are validating a yes/no Q&A entry for an AI tool probing benchmark\.
Tool description:
\{description\}
Generated entry:
Question : \{question\}
Answer : \{answer\}
Check ALL of the following:
1\. The question is specific to THIS tool — not generically answerable for any API\.
2\. The answer \(“\{answer\}”\) is directly and unambiguously supported by the description\.
3\. The question uses “this tool” as a placeholder — the actual tool name does not appear\.
4\. The question tests a verifiable property \(domain, capability, input/output type, format, etc\.\)\.
Set accept=true only if ALL four checks pass\. Otherwise set accept=false and state which check failed\.
\{format\_instructions\}

Figure 18: QA probing benchmark: judge prompt \(temperature = 0\.0\)\.

You are building a multiple\-choice probing benchmark for AI tool retrieval research\.
Below is the description of an API or software tool:
\{description\}
Your task: generate one specific factual question about this tool’s properties and provide one correct short answer plus three wrong\-but\-plausible alternatives\.
Rules:
1\. Use “this tool” in the question — never include the actual tool name or service name\. \(The model will see only a virtual token at inference time, so the name must not be a hint\.\)
2\. The question must be specific to THIS tool — not answerable for a generic API\.
Good topics: primary output type, domain/industry, key input, core capability, supported format\.
Bad question: “Does this tool provide an API?” \(too generic\)\.
3\. Each answer option must be a short phrase \(2–8 words\), not a full sentence\.
4\. The correct\_answer must be directly supported by the description above\.
5\. The three wrong answers must be plausible alternatives — the kind of answer a user might expect from a tool in the same domain — but clearly incorrect for THIS specific tool\.
6\. All four options \(correct \+ wrong\) must be meaningfully distinct from each other\.
7\. Set skip=true if you cannot form a specific, unambiguous factual question from this description\.
\{format\_instructions\}

Figure 19: MCQ probing benchmark: generation prompt\.

You are validating a multiple\-choice question entry for an AI tool probing benchmark\.
Tool description:
\{description\}
Generated entry:
Question : \{question\}
Correct answer: \{correct\_answer\}
Wrong answers : \{wrong\_answers\}
Check ALL of the following:
1\. The question is specific to THIS tool — not generically answerable for any API\.
2\. The correct answer is directly and unambiguously supported by the description\.
3\. Each of the three wrong answers is plausible for a tool in the same domain but clearly incorrect for THIS tool\.
4\. All four options are meaningfully distinct from each other\.
5\. The question uses “this tool” as a placeholder — the actual tool name does not appear\.
Set accept=true only if ALL five checks pass\. Otherwise set accept=false and state which check failed\.
\{format\_instructions\}

Figure 20: MCQ probing benchmark: judge prompt \(temperature = 0\.0\)\.

You are a data generation expert creating an evaluation benchmark for a tool retrieval model\.
Your task is to generate \{\{n\_samples\}\} evaluation samples, each consisting of a concise enterprise user query that points to EXACTLY ONE specific tool\.
\[START OF TASK DESCRIPTION\]
The goal is to simulate how real enterprise users phrase requests — NOT how developers write API calls\.
Unlike typical benchmark queries \(which are overly specific and technical\), your queries should be:
\- Concise: 1–2 sentences, ideally under 20 words
\- Written in natural business language \(NOT using API names, technical identifiers, or schema terms\)
\- Realistic: something a business analyst, manager, or power user would actually type
Each sample must contain:
1\. A query — concise, business\-language question pointing to the target tool
2\. An answer — a list containing exactly ONE tool name \(the target tool\)
\[END OF TASK DESCRIPTION\]
\[START OF GENERATION RULES\]
1\. Queries must be CONCISE — avoid verbose, over\-specified phrasing
2\. Use BUSINESS LANGUAGE — never include API method names, OData syntax, or system\-specific technical identifiers in the query
3\. The query should naturally lead to the target tool without explicitly naming it
4\. The query should NOT trivially match every other tool in the pool — it must be specific enough to distinguish the target
5\. Do NOT use OData query syntax \($filter, $select, $expand, etc\.\) in queries
6\. Do NOT hallucinate tool names — only use names exactly as they appear in the tool list below
7\. Every query must be distinct — vary phrasing and angle of attack
GOOD query examples \(concise, business language\):
\- “Show me all open purchase orders”
\- “Which customers have overdue invoices?”
\- “I need to track employee time\-off requests”
\- “List products that are low on stock”
BAD query examples \(too technical or verbose\):
\- “Retrieve all PurchaseOrders where Status eq ‘Open’ using the GetPurchaseOrders endpoint”
\- “I need to call the employee leave management API to fetch pending time\-off approval requests for the current fiscal quarter”
\[END OF GENERATION RULES\]
\[START OF TARGET TOOL\]
Tool Name: \{\{anchor\_tool\_name\}\}
Description: \{\{anchor\_tool\_description\}\}
\[END OF TARGET TOOL\]
\[START OF TOOL LIST\]
These are all available tools\. Your answer must contain exactly one tool name from this list:
\{\{candidate\_pool\_str\}\}
\[END OF TOOL LIST\]
\{% if previous\_feedback and previous\_feedback \!= “None” %\}
\[START OF FEEDBACK FROM PREVIOUS ATTEMPT\]
\{\{previous\_feedback\}\}
Address these issues in your new generation\.
\[END OF FEEDBACK FROM PREVIOUS ATTEMPT\]
\{% endif %\}
\[START OF OUTPUT FORMAT\]
\{\{format\_instructions\}\}
\[END OF OUTPUT FORMAT\]
Generate \{\{n\_samples\}\} easy\-tier eval samples\. Each query must be concise \(under 20 words\), in business language, and point unambiguously to exactly one tool\.

Figure 21: RRB easy\-tier generation prompt \(Jinja2 template; \{\{var\}\} = template variable, \{% \.\.\. %\} = control block\)\. The medium\-tier prompt is identical except: queries target 2–3 tools with genuinely overlapping functional domains, length limit is 25 words, and the answer is a list of 2–3 tool names\. The hard\-tier prompt targets 4\+ tools spanning a high\-level business goal, in 1–3 sentences\.

You are a quality validator for an enterprise tool retrieval evaluation benchmark\. Given a batch of generated \(query, answer\) samples, assess whether each sample is suitable as an evaluation example\.
\[START OF TASK DESCRIPTION\]
Validate each sample and return one verdict per sample \(in the same order\)\.
For each sample, determine:
1\. Is the query CONCISE and written in BUSINESS LANGUAGE \(not technical API language\)?
2\. Does the query sound like something a real enterprise user would actually ask?
3\. For \{\{complexity\}\} tier: is the number of answer tools and the ambiguity level appropriate?
\- easy: query should be specific enough that one tool is the clear answer; the phrasing should not be technical
\- medium: query should be genuinely ambiguous between 2–3 tools; ambiguity must arise from overlapping domains, not vague phrasing
\- hard: query should describe a high\-level business goal that legitimately spans 4\+ tools
4\. Are all answer tools plausibly relevant to the query?
\[END OF TASK DESCRIPTION\]
\[START OF VALIDATION RULES\]
A sample PASSES \(validation\_result: true\) if ALL of the following hold:
\- The query is CONCISE \(1–3 sentences, not verbose or over\-specified\)
\- The query uses BUSINESS LANGUAGE — no API method names, technical identifiers, OData operators \($filter, $select, $expand, $orderby, etc\.\)
\- The query reads naturally — it sounds like a real enterprise user asking a question, not a developer writing a specification
\- All answer tools are plausibly relevant to the stated query
\- The ambiguity level matches the complexity tier
A sample FAILS \(validation\_result: false\) if ANY of the following hold:
\- The query is verbose or reads like a technical specification
\- The query contains API names, OData syntax, or schema property names
\- The query is too generic to be meaningful \(e\.g\. “get data”, “manage system entities”\)
\- Answer tools are not justified by the query content
\- Wrong complexity \(e\.g\. an easy query matching 3\+ tools, or a hard query where only 1–2 tools fit\)
\[END OF VALIDATION RULES\]
\[START OF SAMPLES TO VALIDATE\]
\{\{samples\_str\}\}
\[END OF SAMPLES TO VALIDATE\]
\[START OF OUTPUT FORMAT\]
\{\{format\_instructions\}\}
\[END OF OUTPUT FORMAT\]
Validate each sample in order and return one verdict per sample\.

Figure 22: RRB LLM judge prompt, shared across all three tiers \(temperature = 0\.0\)\. The \{\{complexity\}\} variable is instantiated as easy, medium, or hard per tier\.
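
The surface constraints stated in the RRB prompts (word limits, answer counts per tier, no OData operators) are mechanical enough to check before an LLM judge is called. The sketch below is a hypothetical pre-filter and is not described as part of ToolSense. The `TIER_RULES` table encodes only the limits quoted in Figures 21 and 22, whitespace-based word counting is an assumption, and the names `TierRule`, `ODATA_PATTERN`, and `precheck` are invented for this example.

```python
import re
from typing import NamedTuple, Optional


class TierRule(NamedTuple):
    max_words: Optional[int]  # None = no word limit stated in the prompts
    min_tools: int
    max_tools: Optional[int]  # None = no upper bound ("4+")


# Limits quoted in the Figure 21 prompt and caption.
TIER_RULES = {
    "easy": TierRule(max_words=19, min_tools=1, max_tools=1),    # "under 20 words"
    "medium": TierRule(max_words=25, min_tools=2, max_tools=3),  # "length limit is 25 words"
    "hard": TierRule(max_words=None, min_tools=4, max_tools=None),
}

# Operators named in the judge prompt; its "etc." means this list is not exhaustive.
ODATA_PATTERN = re.compile(r"\$(filter|select|expand|orderby)\b", re.IGNORECASE)


def precheck(query: str, answer: list[str], complexity: str) -> list[str]:
    """Return rule violations; an empty list means the sample can go on to the judge."""
    rule = TIER_RULES[complexity]
    problems = []
    n_words = len(query.split())
    if rule.max_words is not None and n_words > rule.max_words:
        problems.append(f"query has {n_words} words, limit is {rule.max_words}")
    if ODATA_PATTERN.search(query):
        problems.append("query contains OData syntax")
    n_tools = len(set(answer))  # duplicate tool names count once
    too_few = n_tools < rule.min_tools
    too_many = rule.max_tools is not None and n_tools > rule.max_tools
    if too_few or too_many:
        problems.append(f"{n_tools} answer tools does not fit the {complexity} tier")
    return problems


if __name__ == "__main__":
    print(precheck("Show me all open purchase orders", ["GetPurchaseOrders"], "easy"))
    print(precheck("List orders using $filter", ["GetPurchaseOrders"], "easy"))
```

A check like this cannot judge whether a query sounds natural or whether the answer tools are relevant; those criteria still need the LLM judge or a human annotator.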

## Appendix L Artifact Licenses and Intended Use

#### Datasets used\.

ToolBench \(Qin et al\., [2024b](https://arxiv.org/html/2606.12451#bib.bib7)\) provides tool documentation metadata crawled from RapidAPI’s publicly accessible API catalog\. The ToolBench repository and dataset are released under the Apache\-2\.0 License\. Our use conforms to these terms: we do not redistribute the raw metadata, instead releasing the generated diagnostic benchmarks \(RRB, MCQ, QA\) derived from it\.

#### Open\-source instruction\-tuned models\.

Gemma 3 \(Team et al\., [2025](https://arxiv.org/html/2606.12451#bib.bib16)\) is released by Google under the Gemma Terms of Use, which permit research and commercial use subject to the stated conditions\. Qwen3\.5 \(Yang et al\., [2025](https://arxiv.org/html/2606.12451#bib.bib17)\) is released by Alibaba under the Apache 2\.0 License\.

#### Artifacts released by this work\.

The ToolSense benchmark generation framework and the ToolBench diagnostic benchmarks \(RRB, MCQ, QA\) are released under the Apache 2\.0 License\. The intended use is research on parametric tool\-retrieval systems, multi\-stage fine\-tuning evaluation, and related NLP benchmarking tasks\. We do not recommend using the benchmarks to make production deployment decisions without independent validation on the target tool catalog\.

## Appendix M Dataset Documentation

#### PII and offensive content\.

The ToolBench tool metadata consists entirely of API endpoint names, descriptions, and parameter schemas scraped from RapidAPI’s public\-facing documentation\. This data contains no personally identifiable information \(PII\), no sensitive categories \(health, finance, biometric, etc\.\), and no offensive content\. The RRB, MCQ, and QA benchmarks are fully synthetic; all items were generated by an LLM conditioned on the above API documentation, with no user data or personal information involved\.

#### Dataset statistics\.

Table [12](https://arxiv.org/html/2606.12451#A13.T12) summarises the benchmarks created in this work\. The ToolBench tool catalog used for training and evaluation contains ~47k tools spanning ~16k API services\.

Table 12: Benchmark item counts\. The RRB evaluation set \(500 items total\) is stratified across three difficulty tiers\. MCQ and QA are single\-split probing benchmarks\.

## Appendix N Computational Resources and Software

#### Hardware\.

All training and evaluation experiments were run on a single node equipped with 8 NVIDIA H200 SXM5 \(141 GB\) GPUs\. All model configurations used a single GPU\.

#### Software versions\.

All training code uses the following package versions: Python 3\.13, PyTorch 2\.10\.0, HuggingFace Transformers 5\.5\.0, PEFT 0\.19\.0, Accelerate 1\.11\.0, TRL 1\.2\.0, Datasets 4\.7\.0, and DeepSpeed 0\.18\.2\. Benchmark generation used LiteLLM ≥ 1\.60, LangGraph ≥ 0\.2\.0, ChromaDB ≥ 0\.5\.0, and Pydantic ≥ 2\.0\.0\.

## Appendix O Human Annotation Study

#### Task instructions\.

Three annotators independently validated a stratified random sample of 100 items per benchmark following the instructions below\.

For RRB: annotators were shown each query and a pool of 14 candidate tools \(the ground\-truth tool\(s\) plus hard\-negative neighbors from the generation pipeline\)\. They were asked: “Select all tools from the candidate list that you believe are a correct and relevant match for this query’s intent\.” Annotators were instructed to rely solely on the tool names and descriptions provided and not to use external search\.

For MCQ: annotators were shown the full tool description and the four answer options and asked to select the single correct option \(A/B/C/D\) based on the description\.

For QA: annotators were shown the full tool description and the yes/no question and asked to answer Yes or No based solely on the description\.
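
One simple way to turn three annotators' MCQ or QA responses into a per-benchmark validity signal is a majority vote compared against the benchmark label. This appendix does not specify the aggregation rule, so the functions below are an illustrative assumption and not the study's actual protocol; `majority_label` and `agreement_rate` are hypothetical names.

```python
from collections import Counter
from typing import Optional


def majority_label(votes: list[str]) -> Optional[str]:
    """Return the label chosen by a strict majority of annotators, or None if there is none."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


def agreement_rate(gold: list[str], votes_per_item: list[list[str]]) -> float:
    """Fraction of items whose annotator majority matches the benchmark label."""
    if not gold or len(gold) != len(votes_per_item):
        raise ValueError("gold and votes_per_item must be non-empty and the same length")
    hits = sum(majority_label(v) == g for g, v in zip(gold, votes_per_item))
    return hits / len(gold)


if __name__ == "__main__":
    gold = ["A", "Yes"]
    votes = [["A", "A", "B"], ["No", "No", "Yes"]]
    print(agreement_rate(gold, votes))
```

The RRB task asks annotators to select a set of tools, so a set-based comparison would be needed there and is not covered by this sketch.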

#### Recruitment and compensation\.

Annotators were full\-time employees of the authoring organization; no additional compensation beyond their regular employment was provided\.

#### Consent and data protection\.

Annotators were informed of the study’s purpose before participating and provided voluntary consent to have their annotations used for research\. No personal data belonging to the annotators or any third parties was collected or stored\.

#### Ethical review\.

This annotation study involves expert evaluation of AI\-generated benchmark items and does not constitute human subjects research involving personal data collection\.

#### Annotator demographics\.

All three annotators are native or proficient English speakers with deep professional expertise in enterprise\-scale tool catalogs and API ecosystems\.

## Appendix P Use of AI Assistants

Claude Code with claude\-4\.5\-opus \(Anthropic\) was used to assist with prototyping parts of the experimental codebase and with refining the writing and presentation of this paper\.
