CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval

arXiv cs.AI Papers

Summary

CoHyDE introduces an iterative co-training procedure for an LLM rewriter and a dense encoder to improve tool retrieval from large API catalogs. It outperforms single-component baselines, especially on vague queries, by training both components together using InfoNCE and DPO.

arXiv:2605.29271v1 Announce Type: new

Cached at: 05/29/26, 09:16 AM

# Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval
Source: [https://arxiv.org/html/2605.29271](https://arxiv.org/html/2605.29271)
Vaishali Senthil\*, Ashutosh Hathidara\*, Sebastian Schreiber (SAP Labs), {vaishali.senthil, ashutosh.hathidara, sebastian.schreiber}@sap.com

\*Equal contribution.

###### Abstract

Tool retrieval over large API catalogs is a core bottleneck for LLM agents: user queries arrive in colloquial, often underspecified language, while the catalog uses technical API vocabulary that no fixed encoder can bridge on its own. The two dominant training approaches, contrastive encoder fine-tuning and HyDE-style query expansion with a frozen LLM, address this problem from opposite ends and fail in complementary directions: the fine-tuned encoder excels when the query's surface form already matches the catalog but collapses when it does not, while zero-shot HyDE is more robust to underspecified queries yet generates catalog-unaware hypothetical descriptions that degrade retrieval when queries are well-formed. We introduce CoHyDE, an iterative procedure that trains the dense encoder and the LLM rewriter as a single co-evolving system: the encoder is retrained with InfoNCE on catalog-style hypothetical descriptions produced by the rewriter, and the rewriter is preference-aligned via DPO against the encoder's retrieval scores, with both sides warm-started on the tool catalog before the loop begins. On a ~10k-tool subset of the ToolBench catalog (Qin et al., [2024](https://arxiv.org/html/2605.29271#bib.bib2)), three rounds of CoHyDE improve over the strongest single-component baseline by +2.5 pp NDCG@5 on standard queries and +6.3 pp on held-out vague queries, with gains as large as +8 pp on the hardest vague tier. Ablations confirm that co-training is the key ingredient: using either component in isolation fails to match CoHyDE on both well-formed and vague queries, with losses of up to 8 pp on vague queries.

## 1 Introduction

Modern language model agents act in the world by calling external tools drawn from catalogs that increasingly number in the tens of thousands (Qin et al., [2024](https://arxiv.org/html/2605.29271#bib.bib2); Patil et al., [2024](https://arxiv.org/html/2605.29271#bib.bib4)). No agent can fit every tool's documentation into its context window, and the quality of an agent's actions is bounded above by an upstream *tool retrieval* step that selects a small candidate set per user query.

The dominant retrieval recipe embeds queries and tools into a shared vector space and returns the top-$k$ most similar tools by nearest-neighbor lookup. Two largely disjoint research directions have grown around this recipe. Direction 1: query expansion with a frozen LLM. HyDE-style methods (Gao et al., [2023](https://arxiv.org/html/2605.29271#bib.bib64); Wang et al., [2023](https://arxiv.org/html/2605.29271#bib.bib6)) prompt a frozen LLM to generate a hypothetical document for the query and search a frozen encoder against its embedding. Direction 2: encoder fine-tuning with no query rewriting. Dense-retrieval methods fine-tune the encoder on (query, tool) pairs with contrastive losses (Karpukhin et al., [2020](https://arxiv.org/html/2605.29271#bib.bib9); Xiao et al., [2024](https://arxiv.org/html/2605.29271#bib.bib13)).

The two directions have complementary failure modes. A trained dense encoder is, in essence, a similarity function shaped by the (anchor, positive) pairs it sees during training. When the query is in-distribution (i.e., shares lexical surface with the catalog), the contrastive signal is sufficient; when surface form drifts, the encoder has no world knowledge or reasoning machinery to bridge the gap and falls back on residual lexical cues (Thakur et al., [2021](https://arxiv.org/html/2605.29271#bib.bib19); Chen et al., [2022](https://arxiv.org/html/2605.29271#bib.bib17)). Query-expansion approaches fail symmetrically: the LLM brings the reasoning needed to handle vague queries (Wei et al., [2022](https://arxiv.org/html/2605.29271#bib.bib23)), but its generated output does not match the catalog's vocabulary, so on well-formed queries it hurts more than it helps (Lei et al., [2024](https://arxiv.org/html/2605.29271#bib.bib24)). This raises a natural question: *can the two training modes be combined into a single procedure that is stronger than either component alone?*

We introduce CoHyDE, an iterative co-training procedure that treats the dense encoder and the LLM rewriter as a single co-evolving system. In each round, the LLM generates catalog-style hypothetical descriptions for each query; the encoder is then retrained via contrastive learning on these descriptions, and the LLM is preference-aligned via DPO using the encoder's own retrieval scores as the reward signal. This alternating update cycle is repeated for multiple iterations, with each component progressively adapting to the other.

We apply CoHyDE to a ~10k-tool subset of the ToolBench catalog (Qin et al., [2024](https://arxiv.org/html/2605.29271#bib.bib2)). After three rounds of co-training, CoHyDE improves over the strongest single-component baseline by +2.5 pp NDCG@5 on standard queries and +6.3 pp on held-out vague queries, with gains as large as +8 pp on the hardest vague tier.

To summarize our contributions: \(i\) We introduce CoHyDE, an iterative co\-training procedure that jointly optimizes a dense encoder and an LLM rewriter for tool retrieval\. \(ii\) We empirically characterize the complementary failure modes of encoder fine\-tuning and zero\-shot HyDE, motivating the need to train both components jointly\.

## 2 Related Work

#### Tool retrieval\.

Dense tool-retrieval methods fine-tune an encoder on (query, tool) pairs with contrastive supervision (Qin et al., [2024](https://arxiv.org/html/2605.29271#bib.bib2); Anantha et al., [2023](https://arxiv.org/html/2605.29271#bib.bib14); Qu et al., [2024](https://arxiv.org/html/2605.29271#bib.bib15); Shi et al., [2025](https://arxiv.org/html/2605.29271#bib.bib16)); a parallel line treats retrieval as a frozen black box via LLM-based expansion or generative indexing (Patil et al., [2024](https://arxiv.org/html/2605.29271#bib.bib4); Chen et al., [2024](https://arxiv.org/html/2605.29271#bib.bib7); Lumer et al., [2025](https://arxiv.org/html/2605.29271#bib.bib8); Wang et al., [2025](https://arxiv.org/html/2605.29271#bib.bib30)). The closest prior work is Shao et al. ([2023](https://arxiv.org/html/2605.29271#bib.bib33)), which iteratively rewrites user instructions and retrains the encoder on (rewritten-instruction, tool) pairs. CoHyDE differs in three ways: the rewriter is preference-aligned via DPO against the encoder it feeds, rewrites target *catalog-description style* rather than query style, and the encoder retraining uses no real (query, tool) pairs. A concurrent line of work (Anonymous, [2026](https://arxiv.org/html/2605.29271#bib.bib1)) audits *parametric* tool retrieval, where tools are embedded as virtual tokens in an LLM's vocabulary (Wang et al., [2025](https://arxiv.org/html/2605.29271#bib.bib30)); this paradigm is orthogonal to CoHyDE, which improves dense encoder retrieval.

#### Query expansion and trained rewriters\.

HyDE (Gao et al., [2023](https://arxiv.org/html/2605.29271#bib.bib64)) searches a frozen index against a hypothetical document embedding; Query2doc (Wang et al., [2023](https://arxiv.org/html/2605.29271#bib.bib6)) concatenates the pseudo-document to the original query. CSQE (Lei et al., [2024](https://arxiv.org/html/2605.29271#bib.bib24)) patches corpus misalignment of LLM expansions at test time by injecting retrieved sentences; we address the same misalignment at training time. Trained query rewriters like Rewrite-Retrieve-Read (Ma et al., [2023](https://arxiv.org/html/2605.29271#bib.bib36)), RaFe (Mao et al., [2024](https://arxiv.org/html/2605.29271#bib.bib37)), and LeReT (Hsu et al., [2025](https://arxiv.org/html/2605.29271#bib.bib38)) use RL or DPO with a *frozen* retriever; a complementary thread (Nogueira et al., [2019](https://arxiv.org/html/2605.29271#bib.bib39); Dai et al., [2023](https://arxiv.org/html/2605.29271#bib.bib40); Bonifacio et al., [2022](https://arxiv.org/html/2605.29271#bib.bib41); Wang et al., [2022](https://arxiv.org/html/2605.29271#bib.bib42)) trains the retriever on LLM-generated synthetic queries with the generator frozen. All these methods freeze at least one component, whereas CoHyDE co-trains both.

#### Dense retriever robustness and joint retriever\-generator training\.

Dense retrievers are brittle off-distribution (Thakur et al., [2021](https://arxiv.org/html/2605.29271#bib.bib19); Sciavolino et al., [2021](https://arxiv.org/html/2605.29271#bib.bib18); Chen et al., [2022](https://arxiv.org/html/2605.29271#bib.bib17); Yu et al., [2022](https://arxiv.org/html/2605.29271#bib.bib20)); domain adaptation via synthetic queries (Wang et al., [2022](https://arxiv.org/html/2605.29271#bib.bib42); Dai et al., [2023](https://arxiv.org/html/2605.29271#bib.bib40); Meng et al., [2024](https://arxiv.org/html/2605.29271#bib.bib51); Lin et al., [2023](https://arxiv.org/html/2605.29271#bib.bib52)) runs the generation loop once with a frozen generator. Joint retriever–generator frameworks like RAG (Lewis et al., [2020](https://arxiv.org/html/2605.29271#bib.bib53)), Atlas (Izacard et al., [2023](https://arxiv.org/html/2605.29271#bib.bib54)), REPLUG (Shi et al., [2024](https://arxiv.org/html/2605.29271#bib.bib57)), RA-DIT (Lin et al., [2024](https://arxiv.org/html/2605.29271#bib.bib58)), and Self-RAG (Asai et al., [2024](https://arxiv.org/html/2605.29271#bib.bib59)) train the generator to produce better *final answers*, not better retrieval inputs. Prior work has therefore never co-trained a generator whose output *is* the retrieval input with the encoder that consumes it; this is the gap CoHyDE fills.

## 3 Methodology

![Refer to caption](https://arxiv.org/html/2605.29271v1/x1.png)

Figure 1: Overview of CoHyDE: a dense encoder and an LLM rewriter are co-trained in an alternating loop, with each component iteratively adapted to the other.

### 3.1 Problem Formulation

Let $\mathcal{T}=\{t_1,\ldots,t_N\}$ denote a tool catalog of size $N$, where each tool $t\in\mathcal{T}$ carries a structured record (API name and description, plus tool title and description). We write $\phi:\mathcal{T}\to\Sigma^{*}$ for a fixed *rendering* function that serialises a tool into a single text string. Given a query $q\in\Sigma^{*}$ and a budget $k\in\mathbb{N}$, the tool-retrieval problem is to return a ranked set $\hat{T}_k(q)\subseteq\mathcal{T}$ with $|\hat{T}_k(q)|=k$ that maximally overlaps the gold tool set $T^{*}_q\subseteq\mathcal{T}$.

We restrict attention to single-vector dense encoder retrieval, the dominant architecture in tool retrieval (Qin et al., [2024](https://arxiv.org/html/2605.29271#bib.bib2); Anantha et al., [2023](https://arxiv.org/html/2605.29271#bib.bib14); Qu et al., [2024](https://arxiv.org/html/2605.29271#bib.bib15); Shi et al., [2025](https://arxiv.org/html/2605.29271#bib.bib16)). A parameterised encoder $f_\theta:\Sigma^{*}\to\mathbb{R}^{d}$ maps any text into a $d$-dimensional unit-norm vector, and retrieval is performed by approximate nearest-neighbour search (Johnson et al., [2021](https://arxiv.org/html/2605.29271#bib.bib3)),

$$\hat{T}_k(q;\theta)=\mathrm{topk}_{t\in\mathcal{T}}\,\bigl\langle f_\theta(q),\;f_\theta\bigl(\phi(t)\bigr)\bigr\rangle \tag{1}$$

We additionally consider a *rewriter-augmented* variant in which a generator $g_\psi:\Sigma^{*}\to\Sigma^{*}$ produces a hypothetical tool description $\tilde{d}=g_\psi(q)$ that is encoded *in place of* the query:

$$\hat{T}_k^{g_\psi}(q;\theta)=\mathrm{topk}_{t\in\mathcal{T}}\,\bigl\langle f_\theta\bigl(g_\psi(q)\bigr),\;f_\theta\bigl(\phi(t)\bigr)\bigr\rangle. \tag{2}$$

The goal of CoHyDE is to find parameters $(\theta^{*},\psi^{*})$ such that the two components reinforce each other, which we achieve through an alternating sequence of encoder and rewriter updates described in §[3.5](https://arxiv.org/html/2605.29271#S3.SS5).
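To make Eqs. (1) and (2) concrete, the following is a minimal sketch of inner-product retrieval with an optional rewriter. It uses exhaustive search rather than the approximate nearest-neighbour index referenced above, and `toy_encode`, `toy_index`, and the function names are hypothetical stand-ins, not the paper's implementation.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 norm, as the encoder output is assumed to be."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query_vec, tool_vecs, k):
    """Indices of the k tool vectors with the largest inner product."""
    scores = [sum(a * b for a, b in zip(query_vec, tv)) for tv in tool_vecs]
    ranked = sorted(range(len(tool_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

def retrieve(query, encode, tool_vecs, k, rewrite=None):
    """Eq. (1) if rewrite is None; Eq. (2) if a rewriter is supplied."""
    text = query if rewrite is None else rewrite(query)
    return top_k(normalize(encode(text)), tool_vecs, k)

# Toy stand-ins (hypothetical): a 2-d "encoder" that counts two keywords.
toy_encode = lambda text: [text.count("weather") + 0.1, text.count("flight") + 0.1]
toy_index = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]
```

The only difference between the two variants is which text reaches the encoder; the catalog index is shared.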

### 3.2 Data

#### Tool catalog.

The ToolBench API pool (Qin et al., [2024](https://arxiv.org/html/2605.29271#bib.bib2)) contains $|\mathcal{T}_{\mathrm{full}}|=46{,}980$ tools, partitioned into three evaluation tiers: single-domain (G1), cross-domain same-category (G2), and cross-domain different-category (G3), with 1,092 official evaluation queries (593 / 399 / 100 over G1/G2/G3). We work with a stratified subset $\mathcal{T}$ of $N=10{,}000$ tools sized for training tractability: the subset retains every tool referenced by the gold sets of the evaluation queries, and stratified-samples the remaining slots to preserve the per-tier proportions of $\mathcal{T}_{\mathrm{full}}$.

#### Training set\.

The training set $\mathcal{D}_{\mathrm{train}}=\{(q_i,T^{*}_{q_i})\}_{i=1}^{M}$ consists of $M=104{,}224$ (query, gold-tool-set) pairs (44,873 / 35,402 / 23,949 over G1/G2/G3); most queries have multiple gold tools ($|T^{*}_q|>1$ for 93–99% of $q$). For contrastive training, we flatten these to individual (query, tool) pairs $\mathcal{D}_{\mathrm{q}}=\{(q,\phi(t)):(q,T^{*}_q)\in\mathcal{D}_{\mathrm{train}},\,t\in T^{*}_q\}$.
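The flattening step can be sketched in a few lines. The data below is a hypothetical toy example, and `render` stands in for whichever rendering function $\phi$ is applied to the tool side.

```python
def flatten_pairs(train, render):
    """Expand (query, gold_tool_set) records into one (query, rendered_tool) pair per gold tool."""
    return [(query, render(tool)) for query, gold_tools in train for tool in gold_tools]

# Hypothetical toy data: tool ids stand in for full catalog records.
toy_train = [
    ("book a trip", ["flights.search", "hotels.search"]),
    ("get forecast", ["weather.now"]),
]
pairs = flatten_pairs(toy_train, render=str.upper)
```

Because most queries have more than one gold tool, the flattened set is larger than the number of queries.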

#### Tool rendering\.

We represent each tool under a family of five rendering conventions $\Phi=\{\phi_1,\ldots,\phi_5\}$ spanning its natural information axes: $\phi_1$ (title only), $\phi_2$ (+API name), $\phi_3$ (+tool description), $\phi_4$ (title, API name, API description), and $\phi_5$ (full record). At training time, $\phi\sim\mathrm{Unif}(\Phi)$ is sampled independently per (query, tool) pair, so each tool is seen under all five surface forms over an epoch. This format mixture encourages the encoder to learn representations invariant to catalog-side surface variation, including the longer multi-sentence $\phi_5$ that most closely matches the rewriter's output style. At inference, the catalog is indexed under $\phi_5$.
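One way to picture the rendering family is as a set of field selections over the tool record. The sketch below is illustrative only: the field names, the ` | ` separator, and the exact composition of $\phi_2$ and $\phi_3$ (read here as cumulative additions to the title) are assumptions, since the text does not spell them out.

```python
import random

# Assumed field selections for phi_1..phi_5; field names are hypothetical.
RENDER_FIELDS = {
    1: ["title"],
    2: ["title", "api_name"],
    3: ["title", "api_name", "tool_description"],
    4: ["title", "api_name", "api_description"],
    5: ["title", "tool_description", "api_name", "api_description"],
}

def render(tool, fmt):
    """Serialise a tool record under rendering convention phi_fmt."""
    return " | ".join(tool[field] for field in RENDER_FIELDS[fmt])

def sample_rendering(tool, rng):
    """Draw phi ~ Unif(Phi) independently for one (query, tool) pair."""
    return render(tool, rng.randint(1, 5))

toy_tool = {
    "title": "WeatherHub",
    "tool_description": "Weather data service.",
    "api_name": "get_forecast",
    "api_description": "Returns a 5-day forecast.",
}
```

At inference only `render(tool, 5)` would be used to build the index, matching the $\phi_5$ convention stated above.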

#### Vague\-query split\.

We adopt the vague-query evaluation protocol of Chen et al. ([2026](https://arxiv.org/html/2605.29271#bib.bib63)) to probe robustness under query-side distribution shift. Each $q\in\mathcal{Q}_{\mathrm{eval}}$ is paraphrased to replace surface tokens with conversational alternatives, while preserving the original gold tool set. We follow their protocol with one change: we substitute claude-4.5-opus for the GPT-4o paraphraser used in the original work. $\mathcal{Q}_{\mathrm{vague}}$ does not enter any training procedure; the two-pass validation (an LLM self-check on every paraphrase plus an author spot-check on 50 samples) is described in Appendix [A](https://arxiv.org/html/2605.29271#A1).

### 3.3 Encoder

$f_\theta$ is initialised from BGE-large-en-v1.5 (Xiao et al., [2024](https://arxiv.org/html/2605.29271#bib.bib13)) ($\approx$335M parameters, $d=1024$). Given $x\in\Sigma^{*}$ we define $f_\theta(x)=h^{\theta}_{\mathrm{CLS}}(x)/\|h^{\theta}_{\mathrm{CLS}}(x)\|_2$, and the same encoder is applied to queries, tool renderings, and rewriter outputs (a *symmetric* bi-encoder). Training minimises the symmetric InfoNCE loss (van den Oord et al., [2019](https://arxiv.org/html/2605.29271#bib.bib60)) with temperature $\tau=0.05$ and in-batch negatives; the full loss expression and optimisation hyperparameters are in Appendix [F](https://arxiv.org/html/2605.29271#A6).

We define two contrastive training datasets, differing only in what serves as the anchor: $\mathcal{D}_{\mathrm{q}}$ pairs user queries with tool renderings, while $\mathcal{D}^{(\psi)}_{\mathrm{d}}$ pairs rewriter-generated hypothetical descriptions $g_\psi(q)$ with tool renderings. In both cases the tool side is rendered under a rendering $\phi$ sampled uniformly from $\Phi$.
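For intuition, here is a small dependency-free sketch of a symmetric InfoNCE loss with in-batch negatives. It assumes the common formulation (the average of the anchor-to-positive and positive-to-anchor cross-entropies over a temperature-scaled similarity matrix); the exact expression used in the paper is in its Appendix F and may differ in detail. The function names are hypothetical.

```python
import math

def mean_diag_nll(sim, tau):
    """Mean negative log-softmax of the diagonal entry in each row of sim / tau."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / tau for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(sim)

def symmetric_info_nce(anchors, positives, tau=0.05):
    """Row i of anchors matches row i of positives; all other rows act as negatives."""
    sim = [[sum(a * b for a, b in zip(u, v)) for v in positives] for u in anchors]
    sim_t = [list(col) for col in zip(*sim)]  # positive-to-anchor direction
    return 0.5 * (mean_diag_nll(sim, tau) + mean_diag_nll(sim_t, tau))
```

With unit-norm embeddings and $\tau=0.05$, a perfectly aligned batch drives the loss toward zero, while a batch whose pairs are swapped is penalised heavily.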

### 3.4 Rewriter

$g_\psi$ is Qwen3.5-4B (Yang et al., [2025](https://arxiv.org/html/2605.29271#bib.bib61)), an instruction-tuned decoder-only transformer. We define a prompt operator $\rho_{\mathrm{HyDE}}$ that wraps a query with an instruction to write out the full description of a tool capable of fulfilling the query's intent, in catalog-style description format (Appendix [C](https://arxiv.org/html/2605.29271#A3)). A deterministic cleaning operator $\mathrm{clean}(\cdot)$ strips reasoning-trace blocks and conversational preambles before encoding (Appendix [B](https://arxiv.org/html/2605.29271#A2)). At inference, the rewriter produces $\tilde{d}=\mathrm{clean}(g_\psi(\rho_{\mathrm{HyDE}}(q)))$ and retrieval proceeds against $\tilde{d}$ alone, replacing the original query entirely.
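As a rough illustration of what a deterministic cleaning operator might look like, the sketch below removes `<think>...</think>` blocks and a leading conversational preamble. Both patterns are assumptions made for the example; the actual rules are specified in the paper's Appendix B and are not reproduced here.

```python
import re

# Assumed patterns (hypothetical): reasoning traces wrapped in <think> tags,
# and a one-line preamble such as "Sure, here is the description:".
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)
PREAMBLE = re.compile(r"^\s*(sure|certainly|here is|here's)[^\n:]*:\s*", flags=re.IGNORECASE)

def clean(text):
    """Strip reasoning-trace blocks and a leading preamble, then trim whitespace."""
    text = THINK_BLOCK.sub("", text)
    text = PREAMBLE.sub("", text)
    return text.strip()
```

Keeping the operator deterministic means the same rewriter output always maps to the same encoder input, which matters when retrieval scores are later reused as a training signal.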

### 3.5 CoHyDE: Iterative Co-training

We index encoder and rewriter checkpoints by training stage: $\theta_i$, $\psi_i$ are the parameters after stage $i$. $\theta_0$ denotes the BGE-large-en-v1.5 pretrained weights; $\psi_0$ denotes the open-source instruction-tuned Qwen3.5-4B. The pipeline has two parallel warmup steps (S1a and S1b) followed by a bootstrap data-generation step (S2) and an alternating training loop (S3, S4) that may be unrolled for any number of rounds $R\geq 1$. Figure [1](https://arxiv.org/html/2605.29271#S3.F1) and Algorithm [1](https://arxiv.org/html/2605.29271#alg1) summarise the procedure.

Algorithm 1: CoHyDE: Iterative Co-Training

1. Input: pretrained encoder $\theta_0$, base rewriter $\psi_0$, training pairs $\mathcal{D}_{\mathrm{train}}$, prompt $\rho_{\mathrm{HyDE}}$, rendering family $\Phi$, rounds $R$
2. Output: co-trained encoder $\theta_{R+1}$ and rewriter $\psi_{R+1}$
3. \[S1a\] Encoder warmup: train $\theta_0$ with InfoNCE on $\{(q,\,\phi_5(t))\}$ from $\mathcal{D}_{\mathrm{train}}$ to obtain $\theta_1$
4. \[S1b\] Rewriter warmup: fine-tune $\psi_0$ on catalog tools under $\Phi$ to obtain $\psi_1$
5. \[S2\] Bootstrap: generate $\mathcal{D}^{(\psi_1)}_{\mathrm{d}}=\{(g_{\psi_1}(\rho_{\mathrm{HyDE}}(q)),\,\phi(t))\}$ for $(q,t)\in\mathcal{D}_{\mathrm{train}}$
6. for $r=1,\ldots,R$ do
7. \[S3r\] Encoder retrain: train $\theta_r$ with InfoNCE on $\mathcal{D}^{(\psi_r)}_{\mathrm{d}}$ to obtain $\theta_{r+1}$
8. \[S4r\] Rewriter alignment:
9. Sample $N$ descriptions $\{\tilde{d}^{(j)}\}\sim g_{\psi_r}(\rho_{\mathrm{HyDE}}(q))$ for each $q\in\mathcal{D}_{\mathrm{train}}$
10. Score each $\tilde{d}^{(j)}$ by NDCG@5 under $\theta_{r+1}$
11. Form preference pair: $\tilde{d}^{+}_q=\arg\max_j\,\mathrm{NDCG@5}(\tilde{d}^{(j)})$, $\tilde{d}^{-}_q=\arg\min_j\,\mathrm{NDCG@5}(\tilde{d}^{(j)})$
12. $\psi_{r+1}\leftarrow\arg\min_\psi\,\mathcal{L}_{\mathrm{DPO}}(\psi;\,\psi_r)$
13. $\mathcal{D}^{(\psi_{r+1})}_{\mathrm{d}}=\{(g_{\psi_{r+1}}(\rho_{\mathrm{HyDE}}(q)),\,\phi(t))\}$
14. end for
15. return $(\theta_{R+1},\,\psi_{R+1})$
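The control flow of Algorithm 1 can be sketched as a short driver function. Every callable argument below (`train_encoder`, `warmup_rewriter`, `generate`, `dpo_align`) is a hypothetical placeholder for the corresponding training or generation stage; the sketch only shows how checkpoints and data are threaded through the loop, not how each stage is implemented.

```python
def cohyde(theta0, psi0, train_pairs, rounds,
           train_encoder, warmup_rewriter, generate, dpo_align):
    """Thread encoder/rewriter checkpoints through S1a, S1b, S2 and R rounds of S3/S4."""
    queries = [q for q, _ in train_pairs]

    theta = train_encoder(theta0, list(train_pairs))            # S1a: real (query, tool) pairs
    psi = warmup_rewriter(psi0)                                 # S1b: catalog-only fine-tuning
    data = [(generate(psi, q), t) for q, t in train_pairs]      # S2: bootstrap descriptions

    for _ in range(rounds):
        theta = train_encoder(theta, data)                      # S3: description-only retrain
        psi = dpo_align(psi, theta, queries)                    # S4: scored by the new encoder
        data = [(generate(psi, q), t) for q, t in train_pairs]  # regenerate for next round

    return theta, psi
```

Note that `dpo_align` receives the encoder that was just updated in the same round, which is what couples the two components.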

#### S1a: Encoder warmup\.

The encoder is trained with InfoNCE on (query, tool) pairs from $\mathcal{D}_{\mathrm{train}}$:

$$\theta_1=\arg\min_\theta\,\mathbb{E}_{(q,t)\sim\mathcal{D}_{\mathrm{train}}}\,\mathcal{L}_{\mathrm{NCE}}\bigl(\theta;(q,\phi_5(t))\bigr) \tag{3}$$

This is the standard contrastive tool-retrieval recipe (Qin et al., [2024](https://arxiv.org/html/2605.29271#bib.bib2); Anantha et al., [2023](https://arxiv.org/html/2605.29271#bib.bib14); Shi et al., [2025](https://arxiv.org/html/2605.29271#bib.bib16)), and it is the strongest encoder-only baseline in our experiments (Table [1](https://arxiv.org/html/2605.29271#S4.T1)). We initialise the loop from $\theta_1$ rather than pretrained BGE so the encoder has a contrastive head start before description-only retraining begins.

#### S1b: Rewriter warmup\.

The rewriter is fine-tuned on the catalog itself, with each tool $t$ shown under all five renderings $\phi_1,\ldots,\phi_5$ from the format family $\Phi$ (defined in §[3.2](https://arxiv.org/html/2605.29271#S3.SS2)):

$$\psi_1=\arg\min_\psi\,-\sum_{t\in\mathcal{T}}\sum_{\phi_i\in\Phi}\log p_\psi\bigl(\phi_i(t)\bigr) \tag{4}$$

This teaches the rewriter the catalog's vocabulary, naming conventions, and the multiple surface forms a tool can take.

#### S2: Bootstrap data generation\.

Using $\psi_1$ and the prompt $\rho_{\mathrm{HyDE}}$, we generate the first round of (description, tool) training data:

$$\mathcal{D}^{(\psi_1)}_{\mathrm{d}}=\bigl\{(g_{\psi_1}(\rho_{\mathrm{HyDE}}(q)),\phi(t)):(q,t)\in\mathcal{D}_{\mathrm{train}}\bigr\} \tag{5}$$

with $\phi\sim\mathrm{Unif}(\Phi)$. The 5-format-trained rewriter produces catalog-style tool descriptions, which serve as the contrastive anchors for the next encoder training.

#### S3r: Encoder retraining\.

For each round $r=1,\ldots,R$, the encoder is trained further on $\mathcal{D}^{(\psi_r)}_{\mathrm{d}}$, continuing from $\theta_r$:

$$\theta_{r+1}=\arg\min_\theta\,\mathbb{E}\,\mathcal{L}_{\mathrm{NCE}}\bigl(\theta;(g_{\psi_r}(\rho_{\mathrm{HyDE}}(q)),\phi(t))\bigr) \tag{6}$$

No real $(q,t)$ pair participates in this stage; the encoder is trained *only* on $(g_{\psi_r}(q),\phi(t))$ pairs.

#### S4r: DPO alignment of the rewriter\.

For each $q\in\mathcal{D}_{\mathrm{train}}$, sample $N$ candidate descriptions $\{\tilde{d}^{(j)}\}\sim g_{\psi_r}(\rho_{\mathrm{HyDE}}(q))$ and score them by NDCG@5 under the just-trained encoder $\theta_{r+1}$. Form a preference pair $(\tilde{d}^{+}_q,\tilde{d}^{-}_q)$ from the argmax and argmin of those scores, and minimise the standard DPO objective (Rafailov et al., [2023](https://arxiv.org/html/2605.29271#bib.bib62)), with $\rho$ abbreviating $\rho_{\mathrm{HyDE}}$:

$$\mathcal{L}_{\mathrm{DPO}}(\psi;\psi_r)=-\mathbb{E}_q\,\log\sigma\Biggl(\beta\log\frac{p_\psi(\tilde{d}^{+}_q\mid\rho(q))}{p_{\psi_r}(\tilde{d}^{+}_q\mid\rho(q))}-\beta\log\frac{p_\psi(\tilde{d}^{-}_q\mid\rho(q))}{p_{\psi_r}(\tilde{d}^{-}_q\mid\rho(q))}\Biggr) \tag{7}$$

$\psi_{r+1}=\arg\min_\psi\,\mathcal{L}_{\mathrm{DPO}}(\psi;\psi_r)$ is then used to regenerate $\mathcal{D}^{(\psi_{r+1})}_{\mathrm{d}}$ for the next round. The encoder trained in round $r$ (that is, $\theta_{r+1}$) supervises the rewriter update, and the resulting rewriter $\psi_{r+1}$ produces the data for the next encoder update, so both sides evolve along a coupled trajectory.
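The pair construction and the per-pair term of Eq. (7) can be sketched as follows. Two details are assumptions rather than statements from the paper: candidates whose scores all tie are skipped (a tied pair carries no preference), and `beta=0.1` is only a placeholder, since the value of $\beta$ is not given in this section.

```python
import math

def preference_pair(candidates, scores):
    """Pick the best- and worst-scoring candidate; return None if all scores tie (assumption)."""
    best = max(range(len(candidates)), key=lambda j: scores[j])
    worst = min(range(len(candidates)), key=lambda j: scores[j])
    if scores[best] == scores[worst]:
        return None
    return candidates[best], candidates[worst]

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Per-pair DPO term: -log sigmoid(beta * (policy log-ratio margin over the reference))."""
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    # -log(sigmoid(m)) = softplus(-m), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

Here the four log-probabilities are sequence-level log-likelihoods of the chosen and rejected descriptions under the trainable rewriter $\psi$ and the frozen reference $\psi_r$.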

#### Iteration\.

The loop $\{\mathrm{S3}_r,\mathrm{S4}_r\}$ may be unrolled for any number of rounds $R$.

### 3.6 Evaluation Protocol

We report hit@$k$, recall@$k$, and NDCG@$k$ for $k\in\{1,5,10,20\}$, averaged over each query split $\mathcal{Q}\in\{\mathcal{Q}_{\mathrm{eval}},\mathcal{Q}_{\mathrm{vague}}\}$ and stratified by tier (G1/G2/G3). Catalog embeddings $\{f_\theta(\phi_5(t))\}_{t\in\mathcal{T}}$ are precomputed once per encoder $\theta$ under $\phi_5$ and reused across query splits; rewriter outputs are regenerated end-to-end for every reported configuration. Metric definitions appear in Appendix [I](https://arxiv.org/html/2605.29271#A9); the full $k$-sweep results are in Appendix [J](https://arxiv.org/html/2605.29271#A10).
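For reference, the three metrics can be computed per query as in the sketch below. It assumes the usual binary-relevance definitions (a tool is relevant if and only if it is in the gold set, with a $\log_2$ rank discount for NDCG); the paper's own definitions are in its Appendix I and may differ in normalisation details.

```python
import math

def hit_at_k(ranked, gold, k):
    """1.0 if any gold tool appears in the top k, else 0.0."""
    return float(any(tool in gold for tool in ranked[:k]))

def recall_at_k(ranked, gold, k):
    """Fraction of gold tools that appear in the top k."""
    return sum(tool in gold for tool in ranked[:k]) / len(gold)

def ndcg_at_k(ranked, gold, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 2) for i, tool in enumerate(ranked[:k]) if tool in gold)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(gold))))
    return dcg / ideal
```

Per-query values are then averaged within each split and tier.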

## 4 Experiments & Results

### 4.1 Experimental Setup

#### Benchmark and evaluation splits\.

All experiments use the ToolBench-derived catalog and query splits described in §[3.2](https://arxiv.org/html/2605.29271#S3.SS2): a 10,000-tool subset $\mathcal{T}$ with 1,092 evaluation queries stratified across three tiers (G1/G2/G3). Each query is evaluated on both the standard split $\mathcal{Q}_{\mathrm{eval}}$ (the original ToolBench queries) and the vague split $\mathcal{Q}_{\mathrm{vague}}$, which contains intent-preserving paraphrases that replace surface tokens with conversational alternatives; both splits share the same gold tool sets.

#### Baselines\.

We compare against seven reference points spanning the space of design choices. BM25 over the $\phi_5$-indexed catalog serves as a sparse lexical floor, requiring no training or LLM. BGE (vanilla) and text-embedding-3-large are frozen dense encoders that embed raw queries directly. Query expansion (LLM + BGE) and HyDE (vanilla LLM + BGE) both pair the same vanilla BGE encoder with the same vanilla Qwen3.5-4B generator, but differ in generation strategy: query expansion paraphrases the user query (the anchor stays on the query side), while HyDE generates a hypothetical catalog-style tool description (the anchor moves to the document side). BGE (trained S1a) is the BGE encoder fine-tuned on (query, tool) pairs at the S1a warmup step described in §[3.5](https://arxiv.org/html/2605.29271#S3.SS5). HyDE (vanilla LLM + trained BGE S1a) pairs the trained encoder with HyDE generation without any rewriter training, testing whether the two components can be composed after independent optimisation. All baselines use the $\phi_5$ catalog index for a fair comparison; all LLM-based baselines use Qwen3.5-4B (Yang et al., [2025](https://arxiv.org/html/2605.29271#bib.bib61)) as the generator.

#### CoHyDE inference\.

At test time, the trained rewriter produces a hypothetical tool description $\tilde{d}=\mathrm{clean}(g_\psi(\rho_{\mathrm{HyDE}}(q)))$ via greedy decoding (temperature = 0, 150-token budget). The trained encoder takes $\tilde{d}$ as its query and retrieves the top-$k$ tools by nearest-neighbour lookup against the catalog indexed under $\phi_5$. Full training hyperparameters and infrastructure details are in Appendix [E](https://arxiv.org/html/2605.29271#A5).

#### Metrics\.

NDCG@5 is the primary metric; Recall@5 is reported as a secondary check that gains reflect more correct tools being retrieved and not merely a reranking of an already-correct candidate set. Both metrics are reported on $\mathcal{Q}_{\mathrm{eval}}$ and $\mathcal{Q}_{\mathrm{vague}}$, stratified by tier (G1 / G2 / G3), giving six (split $\times$ tier) cells per metric and twelve cells per configuration.

### 4.2 CoHyDE Comparison with Baselines

†Vanilla LLM paraphrases the user query into a retrieval\-friendly form \(query\-side expansion\); the rewritten query is encoded by vanilla BGE\.

Table 1: NDCG@5 (N@5) and Recall@5 (R@5) in % on standard and vague query splits, stratified by tier. Bold = best per column.

Table [1](https://arxiv.org/html/2605.29271#S4.T1) compares CoHyDE against seven reference points; reading the rows top to bottom traces the logical sequence that motivates the co-training design.

#### Encoder\-only fine\-tuning is brittle on vague queries\.

The InfoNCE-trained encoder (BGE S1a) dominates every standard evaluation split by a wide margin, lifting G1 NDCG@5 from 56.5 to 84.2 over vanilla BGE. On vague paraphrases of the same queries, however, it collapses: G1 vague falls 39.5 pp below its standard counterpart, and G3 vague reaches only 14.9%, barely above the vanilla baseline. The strong commercial encoder (text-embedding-3-large) follows the same pattern at a lower absolute level: competitive on standard queries, but no more robust on vague ones. The encoder has learned a similarity function calibrated to the surface vocabulary of well-formed queries; any deviation from that vocabulary exposes its brittleness.

#### Description generation bridges vocabulary gaps; query rewriting does not\.

Table [1](https://arxiv.org/html/2605.29271#S4.T1) includes both a query expansion baseline and a HyDE baseline, both using the same vanilla BGE encoder and the same vanilla Qwen3\.5\-4B generator\. On standard queries the two are comparable; the decisive difference is on vague cross\-domain queries\. Query rewriting, which keeps the inference\-time anchor on the query side of the embedding space, reaches G3 vague NDCG@5 of only 6\.2%, below the vanilla BGE baseline of 8\.3%\. HyDE, which generates hypothetical catalog\-style tool descriptions and moves the anchor to the document side, reaches 17\.4% on the same split, a \+11\.2 pp gap\. The pattern is consistent across all tiers: HyDE outperforms query rewriting on every vague cell, often by double\-digit margins\. This establishes the generative direction that CoHyDE adopts: producing a hypothetical tool description rather than reformulating the query\.

#### Combining HyDE with a query\-trained encoder makes things worse\.

A natural next step is to combine the gains of encoder fine\-tuning with HyDE generation\. Table [1](https://arxiv.org/html/2605.29271#S4.T1) shows that this naive combination *backfires*: “HyDE \(vanilla LLM \+ trained BGE S1a\)” drops 10\.8 pp on G1 standard NDCG@5 relative to the trained encoder used alone \(73\.4 vs 84\.2\), and trails on every other split as well\. The trained encoder’s similarity function was calibrated on raw user queries as anchors; at inference it receives hypothetical catalog descriptions whose embedding distribution is shifted away from that calibration manifold, distorting the nearest\-neighbour search\. This is the direct motivation for co\-training: the encoder and rewriter cannot be composed after independent training\. Instead, they should evolve their representation spaces together\.

#### CoHyDE resolves all three failure modes simultaneously\.

CoHyDE at $r=3$ improves over the strongest single\-component baseline \(BGE S1a\) on every split\. Standard\-query gains are modest \(average \+2\.5 pp\), reflecting that co\-training preserves the encoder’s standard\-query precision rather than trading it away\. Vague\-query gains are substantially larger \(average \+6\.3 pp\), addressing the lexical brittleness that neither the trained encoder nor baseline HyDE could resolve on its own\. Crucially, the co\-trained encoder also closes the representation\-mismatch gap: trained exclusively on DPO\-generated hypothetical descriptions with zero raw queries in its training data, it reaches G1 standard NDCG@5 of 86\.8%, matching and slightly exceeding the 84\.2% of the BGE encoder trained on raw queries\. The jointly\-trained space has been shaped so that raw query vectors at inference land in the same neighbourhood as their corresponding catalog descriptions, without ever having seen those queries during training\.

### 4\.3 Ablations

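| Variant | Standard G1 N@5 | Standard G1 R@5 | Standard G2 N@5 | Standard G2 R@5 | Standard G3 N@5 | Standard G3 R@5 | Vague G1 N@5 | Vague G1 R@5 | Vague G2 N@5 | Vague G2 R@5 | Vague G3 N@5 | Vague G3 R@5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CoHyDE \(full\) | **86\.8** | **91\.0** | 73\.6 | **78\.0** | **60\.1** | **60\.4** | **49\.4** | **55\.2** | **38\.7** | **41\.5** | **21\.1** | **26\.2** |
| CoHyDE \(w/o S1b rewriter warmup\) | 81\.3 | 87\.0 | 71\.5 | 75\.6 | 50\.5 | 53\.8 | 47\.0 | 54\.1 | 35\.2 | 36\.9 | 19\.6 | 21\.8 |
| CoHyDE \(trained LLM \+ vanilla encoder\) | 63\.2 | 68\.7 | 38\.1 | 40\.3 | 36\.2 | 37\.0 | 40\.1 | 45\.2 | 17\.6 | 19\.4 | 12\.9 | 15\.3 |
| CoHyDE \(vanilla LLM \+ trained encoder\) | 86\.3 | 79\.5 | **75\.6** | 62\.2 | 53\.7 | 47\.6 | 44\.1 | 47\.4 | 32\.8 | 31\.8 | 15\.8 | 17\.9 |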

Table 2: Ablation study\. Each row removes or replaces one component of CoHyDE\. Bold = best per column\.

![Refer to caption](https://arxiv.org/html/2605.29271v1/x2.png)

Figure 2: Per\-round NDCG@5 trajectory on standard \(left\) and vague \(right\) query splits, stratified by tier\. Performance improves monotonically from S1 through R3 on five of the six splits; standard G2 dips by 0\.6 pp between R2 and R3\.

We isolate four design choices in CoHyDE: \(i\) the rewriter warmup stage S1b, which pre\-trains the LLM on catalog surface forms before the co\-training loop begins; \(ii\) the joint encoder update, asking whether the gains require a co\-trained encoder or can be obtained by pairing the trained rewriter with a vanilla encoder; \(iii\) the symmetric question for the encoder side, asking whether the co\-trained encoder retains its advantage when paired with a vanilla \(untrained\) rewriter; and \(iv\) the number of co\-training rounds $r$, which measures convergence behaviour and whether additional rounds continue to improve retrieval quality\.

Table [2](https://arxiv.org/html/2605.29271#S4.T2) reports results for each ablated variant across all six evaluation splits\.

#### Rewriter warmup is critical for cross\-domain retrieval\.

Removing the rewriter warmup drops standard G3 NDCG@5 by 9\.6 pp \($60.1 \to 50.5$\) and R@5 by 6\.6 pp, while standard G1 and G2 NDCG@5 fall by only 5\.5 pp and 2\.1 pp respectively\. Vague\-query NDCG@5 degradation is generally smaller \($\leq 3.5$ pp across all tiers\)\. The gradient of the drop, steepest on standard G3 and shallowest on vague splits, reflects what the warmup actually provides: the rewriter learns the catalog’s vocabulary and surface forms *before* the co\-training loop begins\. On near\-domain standard G1 queries, the encoder can partially compensate for a cold rewriter; on cross\-domain standard G3 tools, whose descriptions share few surface tokens with user queries, a warmup\-free rewriter fails to generate catalog\-aligned descriptions from the outset and the encoder’s nearest\-neighbour search degrades from round one\.

#### The trained rewriter requires a jointly\-trained encoder\.

Pairing the co\-trained rewriter with a vanilla BGE encoder produces the largest degradation in Table [2](https://arxiv.org/html/2605.29271#S4.T2)\. NDCG@5 collapses on standard splits by 23\.6 pp, 35\.5 pp, and 23\.9 pp on G1, G2, and G3 respectively; vague splits decline by roughly 8–21 pp\. The vanilla encoder was trained on raw user queries, so its representation space is calibrated to natural\-language vectors rather than to the catalog\-style hypothetical descriptions the DPO\-aligned rewriter generates\. Feeding it rewriter outputs at inference therefore distorts, rather than improves, the similarity search\. This result confirms that the rewriter’s gains are not a free add\-on to any encoder: they require an encoder whose representation space has been co\-shaped to match the rewriter’s output distribution\.

#### The encoder is load\-bearing for standard queries; the rewriter differentiates vague ones\.

The symmetric ablation, co\-trained encoder with a vanilla rewriter, reveals the complementary side\. On easy standard queries, the co\-trained encoder is nearly self\-sufficient: G1 standard NDCG@5 falls by only 0\.5 pp \(86\.3 vs 86\.8\), and G2 standard actually edges out the full model by 2\.0 pp \(75\.6 vs 73\.6\)\. The co\-trained encoder has absorbed enough of the catalog distribution that zero\-shot HyDE queries land acceptably close in its embedding space without a fine\-tuned rewriter\. The gap opens on harder settings: NDCG@5 on standard G3 falls by 6\.4 pp \($60.1 \to 53.7$\) and on vague splits by 5\.3–5\.9 pp uniformly across all tiers\. These are precisely the conditions where the rewriter’s DPO alignment matters: bridging a large lexical gap on cross\-domain tools, or reasoning past underspecification on vague queries\.

Together, ablations \(ii\) and \(iii\) confirm the asymmetry established in §[4\.2](https://arxiv.org/html/2605.29271#S4.SS2): the encoder carries precision on near\-vocabulary standard queries; the rewriter provides robustness on hard and vague ones; co\-training is what enables both gains simultaneously\.
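The per\-cell NDCG@5 differences quoted in the ablation paragraphs can be recomputed directly from Table 2\. The snippet below is only an arithmetic check; the NDCG@5 values are copied from the table, while the dictionary keys and the helper name `deltas_vs_full` are hypothetical labels introduced here\.

```python
# NDCG@5 (%) per evaluation cell, copied from Table 2.
CELLS = ["std_G1", "std_G2", "std_G3", "vague_G1", "vague_G2", "vague_G3"]
NDCG5 = {
    "full": [86.8, 73.6, 60.1, 49.4, 38.7, 21.1],
    "no_s1b_warmup": [81.3, 71.5, 50.5, 47.0, 35.2, 19.6],
    "trained_llm_vanilla_encoder": [63.2, 38.1, 36.2, 40.1, 17.6, 12.9],
    "vanilla_llm_trained_encoder": [86.3, 75.6, 53.7, 44.1, 32.8, 15.8],
}


def deltas_vs_full(variant: str) -> dict:
    """Full-minus-variant NDCG@5 in pp; positive means the ablation hurts."""
    return {
        cell: round(full - ablated, 1)
        for cell, full, ablated in zip(CELLS, NDCG5["full"], NDCG5[variant])
    }


if __name__ == "__main__":
    for variant in NDCG5:
        if variant != "full":
            print(variant, deltas_vs_full(variant))
```

A negative value marks a cell where the ablated variant beats the full model, which in Table 2 happens only for standard G2 with the vanilla rewriter\.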

#### Co\-training performance evolution across rounds\.

Figure [2](https://arxiv.org/html/2605.29271#S4.F2) traces NDCG@5 at each stage for all six evaluation splits\. Performance is monotonically non\-decreasing from S1 through R3 on five of six splits; the single exception is standard G2, which retreats by a marginal 0\.6 pp between R2 and R3\. Gains from R1 to R2 are consistently larger than those from R2 to R3 across all tiers and both evaluation split types, indicating the coupled encoder–rewriter system approaches convergence within three rounds\. The diminishing updates and the single non\-monotonic cell motivate our choice to report R3 as the final CoHyDE configuration\.

### 4\.4 Comparison with Closest Prior Methods

CoHyDE is most directly related to two lines of work that also use iterative feedback to improve tool or document retrieval\. Shao et al\. \([2023](https://arxiv.org/html/2605.29271#bib.bib33)\) propose an iterative loop in which the LLM’s *downstream tool\-usage success* is fed back to retrain the retriever; the retriever evolves across rounds but the query representation at inference is the raw user query and no rewriter component is trained\. RaFe \(Mao et al\., [2024](https://arxiv.org/html/2605.29271#bib.bib37)\) trains a query rewriter with RL feedback from an external reranker in a general RAG setting; critically, the rewriter *paraphrases the user query* into a more retrieval\-friendly form and does not generate catalog\-style hypothetical descriptions, so the inference\-time anchor remains on the query side of the embedding space, and the encoder remains frozen throughout\. Both methods train only one side of the retrieval pipeline and use a signal external to the encoder\-rewriter pair rather than closing the loop directly through the retrieval objective\.

![Refer to caption](https://arxiv.org/html/2605.29271v1/x3.png)

Figure 3: NDCG@5 comparison with the two closest prior methods across all six evaluation splits\. Error bars show 95% confidence intervals\.

Figure [3](https://arxiv.org/html/2605.29271#S4.F3) reports NDCG@5 for all three methods across standard and vague splits\. On standard queries CoHyDE leads on all three tiers by a wide margin over Shao et al\. \([2023](https://arxiv.org/html/2605.29271#bib.bib33)\): \+17\.3 pp on G1, \+18\.5 pp on G2, and \+14\.3 pp on G3\. RaFe \(Mao et al\., [2024](https://arxiv.org/html/2605.29271#bib.bib37)\) is a stronger standard\-query competitor than Shao et al\. \([2023](https://arxiv.org/html/2605.29271#bib.bib33)\), closing much of the gap, but still trails CoHyDE by 6\.9 pp on G1, 6\.4 pp on G2, and 6\.4 pp on G3\. The vague splits separate the methods more sharply\. Shao et al\. \([2023](https://arxiv.org/html/2605.29271#bib.bib33)\) holds up on G1 and G2 vague \(within 2\.1 pp of CoHyDE\), but falls behind on G3\. RaFe \(Mao et al\., [2024](https://arxiv.org/html/2605.29271#bib.bib37)\) degrades most severely on G3 vague, dropping to 13\.6 NDCG@5 against CoHyDE’s 21\.1, a 7\.5 pp gap on the hardest cross\-domain vague split, compared to RaFe’s 6\.4 pp deficit on the corresponding standard split\. Confidence intervals for all reported differences follow the paired\-bootstrap protocol \(Appendix [K](https://arxiv.org/html/2605.29271#A11)\); re\-implementation details for Shao et al\. \([2023](https://arxiv.org/html/2605.29271#bib.bib33)\) are in Appendix [M](https://arxiv.org/html/2605.29271#A13)\.
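For readers unfamiliar with paired\-bootstrap intervals, the following is a generic sketch of how a 95% interval on the mean per\-query NDCG@5 difference between two systems can be estimated\. The protocol in Appendix K is not reproduced in this section, so the percentile method, the number of resamples, the fixed seed, and the function name `paired_bootstrap_ci` are all assumptions made for illustration\.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 2000,   # assumed; not taken from the paper
    alpha: float = 0.05,       # 95% interval
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, lower bound, upper bound) for system A minus system B.

    Both score lists must be aligned: entry i in each list is the same query.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired bootstrap needs one score per query for both systems")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    resampled_means = []
    for _ in range(n_resamples):
        # Resample queries with replacement, keeping each (A, B) pair together.
        sample = rng.choices(diffs, k=n)
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return sum(diffs) / n, resampled_means[lo_idx], resampled_means[hi_idx]
```

Resampling the per\-query differences, rather than each system's scores independently, keeps the pairing intact, so query difficulty that affects both systems equally does not inflate the interval\.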

The pattern is consistent with the failure\-mode framing of §[4\.2](https://arxiv.org/html/2605.29271#S4.SS2): an encoder trained on raw queries with a frozen rewriter \(i\.e\., Shao et al\. \([2023](https://arxiv.org/html/2605.29271#bib.bib33)\)\) and a rewriter trained against an external reranker with a frozen encoder \(i\.e\., Mao et al\. \([2024](https://arxiv.org/html/2605.29271#bib.bib37)\)\) both fail to bridge the lexical gap that vague cross\-domain queries expose\. CoHyDE’s joint co\-training, where the encoder’s retrieval metric directly supervises the rewriter and the rewriter’s outputs shape the encoder’s representation space, sustains the advantage across both query distributions\.
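One way the encoder's retrieval metric could supervise the rewriter is by turning scored candidate descriptions into DPO preference pairs\. The sketch below is a hedged illustration of that idea and not the authors' procedure: choosing the best\- and worst\-scoring candidates, the `min_margin` filter, the `prompt` / `chosen` / `rejected` record layout, and the hypothetical `score_fn` \(for example, the NDCG@5 obtained when the encoder retrieves with a given candidate\) are all assumptions\.

```python
from typing import Callable, Dict, List, Optional


def build_preference_pair(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],  # hypothetical: (query, description) -> retrieval score
    min_margin: float = 0.0,                # assumed filter; skip uninformative pairs
) -> Optional[Dict[str, str]]:
    """Pick the best- and worst-scoring candidate descriptions as a preference pair.

    Returns None when there are fewer than two candidates or when the score
    gap does not exceed min_margin, since such pairs carry no preference signal.
    """
    if len(candidates) < 2:
        return None
    scored = [(score_fn(query, cand), cand) for cand in candidates]
    best_score, best = max(scored, key=lambda item: item[0])
    worst_score, worst = min(scored, key=lambda item: item[0])
    if best_score - worst_score <= min_margin:
        return None
    return {"prompt": query, "chosen": best, "rejected": worst}
```

Because the scores come from the current encoder, pairs built this way change from round to round as the encoder is retrained, which is the coupling the paragraph above describes\.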

## 5 Conclusion

Contrastive encoder fine\-tuning and HyDE\-style description generation fail in complementary directions, and their naive composition makes things worse because their representation spaces have been calibrated to different input distributions\. We introduce CoHyDE, an iterative co\-training loop that resolves this by evolving the encoder and rewriter together: the encoder’s NDCG@5 scores supervise the rewriter via DPO, and the rewriter’s catalog\-aligned outputs become the encoder’s training anchors each round\. Three rounds improve over the strongest single\-component baseline on every evaluation cell, with average gains of \+2\.5 pp NDCG@5 on standard queries and \+6\.3 pp on vague ones\. The asymmetric improvement is the direct consequence of the mechanism: the jointly\-trained encoder learns a space where raw query vectors land near their corresponding catalog descriptions at inference, without ever seeing those queries during training\. This suggests that for retrieval over idiosyncratic catalogs with underspecified queries, the encoder and rewriter are better treated as a single co\-evolving system\.

## Limitations

All reported numbers are from a single training seed; the bootstrap confidence intervals in Appendix [P](https://arxiv.org/html/2605.29271#A16) characterise evaluation\-set variance but not training\-side variance, and multi\-seed retrains were not run due to the per\-round compute cost of each co\-training loop \(Appendix [N](https://arxiv.org/html/2605.29271#A14)\)\. Experiments are conducted on a 10K\-tool English subset of ToolBench, which is skewed toward consumer\-facing RapidAPI REST endpoints; it remains to be seen whether the co\-training gains transfer to enterprise catalogs, non\-English queries, or function\-call schemas that lack free\-text descriptions\. The vague\-query split $\mathcal{Q}_{\mathrm{vague}}$ is generated and validated by the LLMs used throughout the pipeline; though spot\-checked by a human on 50 paraphrases \(Appendix [A](https://arxiv.org/html/2605.29271#A1)\), systematic biases shared between the generator and judge may go undetected\. Finally, we benchmark against single\-vector dense retrievers and BM25 but not against cross\-encoder rerankers or sparse–dense hybrids; a comparison with such methods would require matched latency or FLOPs budgets, which we leave to future work\.

## Ethical Considerations

We conducted experiments within the provisions of the ACL Ethics Policy and relevant research\-integrity guidelines\. There are, to the best of our knowledge, no remaining ethical risks that have not been addressed\.

## References

- R\. Anantha, B\. Bandyopadhyay, A\. Kashi, S\. Mahinder, A\. W\. Hill, and S\. Chappidi \(2023\) ProTIP: progressive tool retrieval improves planning\. External Links: 2312\.10332, [Link](https://arxiv.org/abs/2312.10332)\.
- Anonymous \(2026\) ToolSense: A diagnostic framework for auditing parametric tool knowledge in LLMs\. Note: Under review\.
- A\. Asai, Z\. Wu, Y\. Wang, A\. Sil, and H\. Hajishirzi \(2024\) Self\-RAG: learning to retrieve, generate, and critique through self\-reflection\. In The Twelfth International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=hSyW5go0v8)\.
- L\. Bonifacio, H\. Abonizio, M\. Fadaee, and R\. Nogueira \(2022\) InPars: data augmentation for information retrieval using large language models\. External Links: 2202\.05144, [Link](https://arxiv.org/abs/2202.05144)\.
- K\. Chen, L\. Zhuang, F\. Liao, J\. Liu, J\. Wang, and B\. Du \(2026\) Tool retrieval bridge: aligning vague instructions with retriever preferences via bridge model\. External Links: 2604\.07816, [Link](https://arxiv.org/abs/2604.07816)\.
- X\. Chen, K\. Lakhotia, B\. Oguz, A\. Gupta, P\. Lewis, S\. Peshterliev, Y\. Mehdad, S\. Gupta, and W\. Yih \(2022\) Salient phrase aware dense retrieval: can a dense retriever imitate a sparse one? In Findings of the Association for Computational Linguistics: EMNLP 2022, Y\. Goldberg, Z\. Kozareva, and Y\. Zhang \(Eds\.\), Abu Dhabi, United Arab Emirates, pp\. 250–262\. External Links: [Link](https://aclanthology.org/2022.findings-emnlp.19/), [Document](https://dx.doi.org/10.18653/v1/2022.findings-emnlp.19)\.
- Y\. Chen, J\. Yoon, D\. S\. Sachan, Q\. Wang, V\. Cohen\-Addad, M\. Bateni, C\. Lee, and T\. Pfister \(2024\) Re\-invoke: tool invocation rewriting for zero\-shot tool retrieval\. In Findings of the Association for Computational Linguistics: EMNLP 2024, Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\), Miami, Florida, USA, pp\. 4705–4726\. External Links: [Link](https://aclanthology.org/2024.findings-emnlp.270/), [Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.270)\.
- Z\. Dai, V\. Y\. Zhao, J\. Ma, Y\. Luan, J\. Ni, J\. Lu, A\. Bakalov, K\. Guu, K\. Hall, and M\. Chang \(2023\) Promptagator: few\-shot dense retrieval from 8 examples\. In The Eleventh International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=gmL46YMpu2J)\.
- L\. Gao, X\. Ma, J\. Lin, and J\. Callan \(2023\) Precise zero\-shot dense retrieval without relevance labels\. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\), Toronto, Canada, pp\. 1762–1777\. External Links: [Link](https://aclanthology.org/2023.acl-long.99/), [Document](https://dx.doi.org/10.18653/v1/2023.acl-long.99)\.
- S\. Hsu, O\. Khattab, C\. Finn, and A\. Sharma \(2025\) Grounding by trying: LLMs with reinforcement learning\-enhanced retrieval\. In The Thirteenth International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=BPAZ6yW3K7)\.
- G\. Izacard, P\. Lewis, M\. Lomeli, L\. Hosseini, F\. Petroni, T\. Schick, J\. Dwivedi\-Yu, A\. Joulin, S\. Riedel, and E\. Grave \(2023\) Atlas: few\-shot learning with retrieval augmented language models\. J\. Mach\. Learn\. Res\. 24\(1\)\. External Links: ISSN 1532\-4435\.
- J\. Johnson, M\. Douze, and H\. Jegou \(2021\) Billion\-Scale Similarity Search with GPUs\. IEEE Transactions on Big Data 7\(03\), pp\. 535–547\. External Links: ISSN 2332\-7790, [Document](https://dx.doi.org/10.1109/TBDATA.2019.2921572), [Link](https://doi.ieeecomputersociety.org/10.1109/TBDATA.2019.2921572)\.
- V\. Karpukhin, B\. Oguz, S\. Min, P\. Lewis, L\. Wu, S\. Edunov, D\. Chen, and W\. Yih \(2020\) Dense passage retrieval for open\-domain question answering\. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\), B\. Webber, T\. Cohn, Y\. He, and Y\. Liu \(Eds\.\), Online, pp\. 6769–6781\. External Links: [Link](https://aclanthology.org/2020.emnlp-main.550/), [Document](https://dx.doi.org/10.18653/v1/2020.emnlp-main.550)\.
- Y\. Lei, Y\. Cao, T\. Zhou, T\. Shen, and A\. Yates \(2024\) Corpus\-steered query expansion with large language models\. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 2: Short Papers\), Y\. Graham and M\. Purver \(Eds\.\), St\. Julian’s, Malta, pp\. 393–401\. External Links: [Link](https://aclanthology.org/2024.eacl-short.34/), [Document](https://dx.doi.org/10.18653/v1/2024.eacl-short.34)\.
- P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel, S\. Riedel, and D\. Kiela \(2020\) Retrieval\-augmented generation for knowledge\-intensive NLP tasks\. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA\. External Links: ISBN 9781713829546\.
- S\. Lin, A\. Asai, M\. Li, B\. Oguz, J\. Lin, Y\. Mehdad, W\. Yih, and X\. Chen \(2023\) How to train your dragon: diverse augmentation towards generalizable dense retrieval\. In Findings of the Association for Computational Linguistics: EMNLP 2023, H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\), Singapore, pp\. 6385–6400\. External Links: [Link](https://aclanthology.org/2023.findings-emnlp.423/), [Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.423)\.
- X\. V\. Lin, X\. Chen, M\. Chen, W\. Shi, M\. Lomeli, R\. James, P\. Rodriguez, J\. Kahn, G\. Szilvasy, M\. Lewis, L\. Zettlemoyer, and W\. Yih \(2024\) RA\-DIT: retrieval\-augmented dual instruction tuning\. In The Twelfth International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=22OTbutug9)\.
- E\. Lumer, V\. Subbiah, J\. Burke, P\. Basavaraju, and A\. Huber \(2025\) Toolshed: scale tool\-equipped agents with advanced RAG\-tool fusion and tool knowledge bases\. In Proceedings of the 17th International Conference on Agents and Artificial Intelligence \- Volume 3: ICAART, pp\. 1180–1191\. External Links: [Document](https://dx.doi.org/10.5220/0013303000003890), ISBN 978\-989\-758\-737\-5\.
- X\. Ma, Y\. Gong, P\. He, H\. Zhao, and N\. Duan \(2023\) Query rewriting in retrieval\-augmented large language models\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\), Singapore, pp\. 5303–5315\. External Links: [Link](https://aclanthology.org/2023.emnlp-main.322/), [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.322)\.
- S\. Mao, Y\. Jiang, B\. Chen, X\. Li, P\. Wang, X\. Wang, P\. Xie, F\. Huang, H\. Chen, and N\. Zhang \(2024\) RaFe: ranking feedback improves query rewriting for RAG\. In Findings of the Association for Computational Linguistics: EMNLP 2024, Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\), Miami, Florida, USA, pp\. 884–901\. External Links: [Link](https://aclanthology.org/2024.findings-emnlp.49/), [Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.49)\.
- R\. Meng, Y\. Liu, S\. Yavuz, D\. Agarwal, L\. Tu, N\. Yu, J\. Zhang, M\. Bhat, and Y\. Zhou \(2024\) AugTriever: unsupervised dense retrieval and domain adaptation by scalable data augmentation\. External Links: 2212\.08841, [Link](https://arxiv.org/abs/2212.08841)\.
- R\. Nogueira, W\. Yang, J\. Lin, and K\. Cho \(2019\) Document expansion by query prediction\. External Links: 1904\.08375, [Link](https://arxiv.org/abs/1904.08375)\.
- S\. G\. Patil, T\. Zhang, X\. Wang, and J\. E\. Gonzalez \(2024\) Gorilla: large language model connected with massive APIs\. In The Thirty\-eighth Annual Conference on Neural Information Processing Systems\. External Links: [Link](https://openreview.net/forum?id=tBRNC6YemY)\.
- Y\. Qin, S\. Liang, Y\. Ye, K\. Zhu, L\. Yan, Y\. Lu, Y\. Lin, X\. Cong, X\. Tang, B\. Qian, S\. Zhao, L\. Hong, R\. Tian, R\. Xie, J\. Zhou, M\. Gerstein, D\. Li, Z\. Liu, and M\. Sun \(2024\) ToolLLM: facilitating large language models to master 16000\+ real\-world APIs\. In The Twelfth International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=dHng2O0Jjr)\.
- C\. Qu, S\. Dai, X\. Wei, H\. Cai, S\. Wang, D\. Yin, J\. Xu, and J\. Wen \(2024\) Towards completeness\-oriented tool retrieval for large language models\. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, CIKM ’24, New York, NY, USA, pp\. 1930–1940\. External Links: ISBN 9798400704369, [Link](https://doi.org/10.1145/3627673.3679847), [Document](https://dx.doi.org/10.1145/3627673.3679847)\.
- R\. Rafailov, A\. Sharma, E\. Mitchell, C\. D\. Manning, S\. Ermon, and C\. Finn \(2023\) Direct preference optimization: your language model is secretly a reward model\. In Thirty\-seventh Conference on Neural Information Processing Systems\. External Links: [Link](https://openreview.net/forum?id=HPuSIXJaa9)\.
- C\. Sciavolino, Z\. Zhong, J\. Lee, and D\. Chen \(2021\) Simple entity\-centric questions challenge dense retrievers\. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, M\. Moens, X\. Huang, L\. Specia, and S\. W\. Yih \(Eds\.\), Online and Punta Cana, Dominican Republic, pp\. 6138–6148\. External Links: [Link](https://aclanthology.org/2021.emnlp-main.496/), [Document](https://dx.doi.org/10.18653/v1/2021.emnlp-main.496)\.
- Z\. Shao, Y\. Gong, Y\. Shen, M\. Huang, N\. Duan, and W\. Chen \(2023\) Enhancing retrieval\-augmented large language models with iterative retrieval\-generation synergy\. In Findings of the Association for Computational Linguistics: EMNLP 2023, H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\), Singapore, pp\. 9248–9274\. External Links: [Link](https://aclanthology.org/2023.findings-emnlp.620/), [Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.620)\.
- W\. Shi, S\. Min, M\. Yasunaga, M\. Seo, R\. James, M\. Lewis, L\. Zettlemoyer, and W\. Yih \(2024\) REPLUG: retrieval\-augmented black\-box language models\. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), K\. Duh, H\. Gomez, and S\. Bethard \(Eds\.\), Mexico City, Mexico, pp\. 8371–8384\. External Links: [Link](https://aclanthology.org/2024.naacl-long.463/), [Document](https://dx.doi.org/10.18653/v1/2024.naacl-long.463)\.
- Z\. Shi, Y\. Wang, L\. Yan, P\. Ren, S\. Wang, D\. Yin, and Z\. Ren \(2025\) Retrieval models aren’t tool\-savvy: benchmarking tool retrieval for large language models\. In Findings of the Association for Computational Linguistics: ACL 2025, W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), Vienna, Austria, pp\. 24497–24524\. External Links: [Link](https://aclanthology.org/2025.findings-acl.1258/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.1258), ISBN 979\-8\-89176\-256\-5\.
- N\. Thakur, N\. Reimers, A\. Rücklé, A\. Srivastava, and I\. Gurevych \(2021\) BEIR: a heterogeneous benchmark for zero\-shot evaluation of information retrieval models\. In Thirty\-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track \(Round 2\)\. External Links: [Link](https://openreview.net/forum?id=wCu6T5xFjeJ)\.
- A\. van den Oord, Y\. Li, and O\. Vinyals \(2019\) Representation learning with contrastive predictive coding\. External Links: 1807\.03748, [Link](https://arxiv.org/abs/1807.03748)\.
- K\. Wang, N\. Thakur, N\. Reimers, and I\. Gurevych \(2022\) GPL: generative pseudo labeling for unsupervised domain adaptation of dense retrieval\. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, M\. Carpuat, M\. de Marneffe, and I\. V\. Meza Ruiz \(Eds\.\), Seattle, United States, pp\. 2345–2360\. External Links: [Link](https://aclanthology.org/2022.naacl-main.168/), [Document](https://dx.doi.org/10.18653/v1/2022.naacl-main.168)\.
- L\. Wang, N\. Yang, and F\. Wei \(2023\) Query2doc: query expansion with large language models\. In The 2023 Conference on Empirical Methods in Natural Language Processing\. External Links: [Link](https://openreview.net/forum?id=QH4EMvwF8I)\.
- R\. Wang, X\. Han, L\. Ji, S\. Wang, T\. Baldwin, and H\. Li \(2025\) ToolGen: unified tool retrieval and calling via generation\. In The Thirteenth International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=XLMAMmowdY)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. H\. Chi, Q\. V\. Le, and D\. Zhou \(2022\) Chain\-of\-thought prompting elicits reasoning in large language models\. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA\. External Links: ISBN 9781713871088\.
- S\. Xiao, Z\. Liu, P\. Zhang, N\. Muennighoff, D\. Lian, and J\. Nie \(2024\) C\-pack: packed resources for general Chinese embeddings\. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’24, New York, NY, USA, pp\. 641–649\. External Links: ISBN 9798400704314, [Link](https://doi.org/10.1145/3626772.3657878), [Document](https://dx.doi.org/10.1145/3626772.3657878)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu \(2025\) Qwen3 technical report\. External Links: 2505\.09388, [Link](https://arxiv.org/abs/2505.09388)\.
- Y\. Yu, C\. Xiong, S\. Sun, C\. Zhang, and A\. Overwijk \(2022\) COCO\-DR: combating the distribution shift in zero\-shot dense retrieval with contrastive and distributionally robust learning\. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Y\. Goldberg, Z\. Kozareva, and Y\. Zhang \(Eds\.\), Abu Dhabi, United Arab Emirates, pp\. 1462–1479\. External Links: [Link](https://aclanthology.org/2022.emnlp-main.95/), [Document](https://dx.doi.org/10.18653/v1/2022.emnlp-main.95)\.

## Appendix

## Appendix A Vague-Query Construction and Validation

The vague-query split $\mathcal{Q}_{\mathrm{vague}}$ is a held-out paraphrase of the official 1,092-query evaluation set, constructed to probe robustness under query-side distribution shift while preserving the gold tool set. Construction follows the protocol of Chen et al. ([2026](https://arxiv.org/html/2605.29271#bib.bib63)) verbatim, with claude-4.5-opus substituted for their GPT-4o paraphraser.

#### Two\-pass validation\.

The split is validated in two passes\.

1. LLM self-check. Every paraphrase in $\mathcal{Q}_{\mathrm{vague}}$ is re-presented to claude-4.5-opus in a separate session, together with the original query and the gold tool set, and scored on the three binary criteria of Chen et al. ([2026](https://arxiv.org/html/2605.29271#bib.bib63)): (i) intent preservation, (ii) absence of leaked tool names / API verbs / domain keywords, (iii) plausibility as an end-user utterance. An example is retained only if all three criteria are satisfied. Substituting claude-4.5-opus for the GPT-4o validator used by Chen et al. ([2026](https://arxiv.org/html/2605.29271#bib.bib63)) is the only deviation from their validation protocol.
2. Human spot-check. 50 paraphrases were sampled uniformly at random from the LLM-validated split and re-verified by a human annotator against the same three criteria. All 50 passed all three criteria, giving a 6% rule-of-three upper bound on the true failure rate at 95% confidence. The annotator was not blinded to the paraphraser identity; this is a transparency disclosure rather than a methodological strength. The annotator was not compensated separately, apart from their regular wages during the research; no external annotators were used.
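
The 6% figure in the spot-check follows from the rule of three: with zero failures in $n$ independent trials, $3/n$ approximates the 95% upper confidence bound on the failure rate. The short sketch below compares that approximation with the exact zero-failure binomial bound; the function names are illustrative and not part of the authors' code.

```python
def rule_of_three_upper_bound(n: int) -> float:
    """Approximate 95% upper bound on a failure rate after n trials with zero failures."""
    return 3.0 / n


def exact_upper_bound(n: int, confidence: float = 0.95) -> float:
    """Exact bound: the largest p for which (1 - p)^n is still >= 1 - confidence."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


approx = rule_of_three_upper_bound(50)  # rule-of-three value for the 50-item spot-check
exact = exact_upper_bound(50)           # slightly tighter than the approximation
print(f"rule of three: {approx:.3f}, exact: {exact:.4f}")
```

The approximation is slightly conservative relative to the exact bound, which is the usual reason for quoting it.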

#### Ethics review\.

The annotation involves no human subjects beyond the human conducting the spot\-check and qualifies as exempt from formal ethics\-board review under the relevant institutional guidelines\. No consent procedure was required because the annotator is the data producer\.

#### Annotator demographics\.

The annotator is a full\-time employee of the authoring organization and resides in the USA\.

## Appendix B Cleaning Operator

The deterministic cleaning operator $\mathrm{clean}(\cdot)$, applied to every rewriter output before encoding, strips:

1. Reasoning-trace blocks delimited by `<think>...</think>`.
2. Unclosed reasoning traces (a leading `<think>` with no terminator), in which case the entire output is rejected and replaced with the original query.
3. Conversational preambles matching `^(Sure|Okay|Of course|Here is|Here’s)[^.]*\.\s+`.
4. Trailing whitespace and repeated blank lines.

The operator is implemented as a sequence of regular\-expression substitutions and is applied identically at SFT\-target construction, DPO\-candidate scoring, and inference time\.
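
A minimal sketch of such an operator, built from the four rules above, is given below. It is not the authors' implementation: the function signature, the order of the substitutions, and the choice to collapse repeated blank lines into a single blank line are assumptions made for illustration.

```python
import re

# Rule 1: closed reasoning traces (non-greedy, spanning newlines).
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)
# Rule 3: conversational preamble up to and including its first full stop.
PREAMBLE = re.compile(r"^(Sure|Okay|Of course|Here is|Here['’]s)[^.]*\.\s+")


def clean(output: str, query: str) -> str:
    """Clean one rewriter output; fall back to the original query if it is unusable."""
    text = THINK_BLOCK.sub("", output).lstrip()
    # Rule 2: a leading <think> that survives rule 1 has no terminator, so reject.
    if text.startswith("<think>"):
        return query
    text = PREAMBLE.sub("", text)
    # Rule 4: strip trailing whitespace per line, then collapse blank-line runs
    # (collapsing to a single blank line is an assumption).
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


print(clean("<think>plan</think>Sure, here you go. Returns current weather for a city.  ", "q"))
```

Because every step is a pure string transformation, the same function can be reused unchanged at each of the three points named above.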

## Appendix C HyDE-Style Rewriter Prompt

The HyDE-style prompt $\rho_{\mathrm{HyDE}}$ is used at the optional SFT stage (when included; see Appendix [H](https://arxiv.org/html/2605.29271#A8)), at S2 (generating $\mathcal{D}_{\mathrm{d}}^{(\psi_r)}$), at S4 (sampling DPO candidates), and at every inference-time HyDE evaluation reported in §[4](https://arxiv.org/html/2605.29271#S4).

#### System message:

> You are an expert at understanding API tool pipelines\. When given a user query, you describe the sequence of API calls needed to fulfill it\. Each description should focus on what the tool does, what inputs it takes, and what data it returns\. Write each tool’s description as a single concise technical sentence\.

#### User message:

> User query: \{query\} Think about the full pipeline of API calls needed to answer the query\. Describe each API tool in the pipeline in order, explaining what data it provides and how it feeds into the next step\. Be concise and technical\.

## Appendix D Query-Rewriting Prompt

The query-rewriting prompt $\rho_{\mathrm{rewrite}}$ is used *only* in the prompt-style ablation reported in Appendix [L](https://arxiv.org/html/2605.29271#A12); it is not used anywhere in CoHyDE.

#### System message:

> You are a query enhancement expert\. Given a user query and the relevant API tools, rewrite the query to be more specific and detailed\. Include relevant tool names, API capabilities, and technical terms that would help a retrieval system find the right tools\. Keep it as a natural user request, but a more specific version of what the user is asking for\.

#### User message:

> Original query: \{query\} Relevant tools: \{tool\_names\} Rewritten query:

## Appendix E Per-Stage Hyperparameter Summary

Table [3](https://arxiv.org/html/2605.29271#A5.T3) consolidates every training and inference stage in the main pipeline with its load-bearing hyperparameters. Full per-stage detail (objective, optimiser, schedules, ablation context) is in Appendix [F](https://arxiv.org/html/2605.29271#A6) (encoder; S1a, S3$_r$) and Appendix [G](https://arxiv.org/html/2605.29271#A7) (rewriter; S1b, S2, S4$_r$). The optional HyDE-style SFT bridging stage, which is *not* part of the main pipeline, is documented separately in Appendix [H](https://arxiv.org/html/2605.29271#A8). Software versions for every stage are in Appendix [O](https://arxiv.org/html/2605.29271#A15).

Table 3: Per-stage hyperparameters for the main pipeline. “LR” is the optimiser learning rate (AdamW, weight decay $10^{-2}$, bf16 throughout); for S4, “$\beta$” is the DPO regularisation coefficient. “Effective BS” is per-device batch size $\times$ gradient accumulation. All training runs on a single H200 GPU; full per-stage detail is in Appendices [F](https://arxiv.org/html/2605.29271#A6), [G](https://arxiv.org/html/2605.29271#A7).

## Appendix F Encoder Training Hyperparameters

The encoder is trained at two distinct points in the pipeline: once at S1a (warmup on real query–tool pairs) and once per round at S3$_r$ (retrain on rewriter-generated descriptions). Both stages use the same InfoNCE objective; they differ only in the anchor source and in whether they continue from the previous checkpoint or restart from $\theta_0$.

#### InfoNCE loss\.

Let $\mathcal{B}=\{(a_i,p_i)\}_{i=1}^{B}$ be a mini-batch of (anchor, positive) pairs, and write $S^{\theta}_{ij}=\langle f_{\theta}(a_i),f_{\theta}(p_j)\rangle/\tau$. The symmetric InfoNCE loss (van den Oord et al., [2019](https://arxiv.org/html/2605.29271#bib.bib60)) is

$$
\mathcal{L}_{\mathrm{NCE}}(\theta;\mathcal{B})=-\frac{1}{2B}\sum_{i=1}^{B}\Biggl[\log\frac{\exp S^{\theta}_{ii}}{\sum_{j=1}^{B}\exp S^{\theta}_{ij}}+\log\frac{\exp S^{\theta}_{ii}}{\sum_{j=1}^{B}\exp S^{\theta}_{ji}}\Biggr], \tag{8}
$$

with temperature $\tau=0.05$. Negatives are in-batch (no hard-negative mining).
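
To make the two terms of the symmetric InfoNCE loss concrete, the sketch below evaluates it on plain Python lists: the first term normalises over row $i$ of the similarity matrix (anchor $i$ against all positives) and the second over column $i$ (positive $i$ against all anchors). It assumes the embeddings are already L2-normalised; the helper names are illustrative, and a real training script would use a tensor library instead.

```python
import math


def log_softmax_at(values, idx):
    """Return log(exp(values[idx]) / sum_j exp(values[j])), computed stably."""
    m = max(values)
    lse = m + math.log(sum(math.exp(v - m) for v in values))
    return values[idx] - lse


def symmetric_info_nce(anchors, positives, tau=0.05):
    """Symmetric in-batch InfoNCE for paired, L2-normalised embeddings."""
    B = len(anchors)
    # S[i][j] = <f(a_i), f(p_j)> / tau
    S = [[sum(a * p for a, p in zip(anchors[i], positives[j])) / tau
          for j in range(B)] for i in range(B)]
    total = 0.0
    for i in range(B):
        row = S[i]                         # anchor i against every positive
        col = [S[j][i] for j in range(B)]  # positive i against every anchor
        total += log_softmax_at(row, i) + log_softmax_at(col, i)
    return -total / (2 * B)


aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

With $\tau=0.05$ a cosine gap of 1 becomes a logit gap of 20, so the loss is close to zero when each anchor matches its own positive and large when the pairs are swapped.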

#### S1a: Encoder warmup\.

Anchors are real queries $q$; positives are $\phi_5(t)$ for the gold tool. The encoder is initialised from BGE-large-en-v1.5. AdamW with learning rate $\eta_{\theta}=2\times 10^{-5}$, weight decay $10^{-2}$, cosine schedule with 5% warmup, batch size $B=256$, max sequence length 256 tokens, 5 epochs over the 104,224 (G1+G2+G3) training pairs of $\mathcal{D}_{\mathrm{train}}$. Validation NDCG@5 (mean over G1/G2/G3 dev splits) is computed every 200 steps and the checkpoint maximising it is retained; this is step 3600 in our run. All training is on a single H200 GPU with native bf16 mixed precision and no gradient accumulation. Embeddings use CLS-token pooling and are L2-normalised before scoring. No dropout is applied beyond BGE’s defaults.

#### S3r: Per\-round encoder retrain\.

Anchors are the rewriter outputs $\tilde{d}=g_{\psi_r}(\rho_{\mathrm{HyDE}}(q))$ from the regenerated bootstrap set $\mathcal{D}^{(\psi_r)}_{\mathrm{d}}$; positives are $\phi(t)$ for $\phi\sim\mathrm{Unif}(\Phi)$. The encoder is initialised from $\theta_r$ (i.e. continued from the previous round’s checkpoint, not from $\theta_0$). All other hyperparameters (optimiser, learning rate $2\times 10^{-5}$, weight decay, cosine schedule with 5% warmup, batch size 256, max sequence length 256, bf16, validation cadence, single GPU) are identical to S1a. The retrain runs for the same 5-epoch budget over $\mathcal{D}^{(\psi_r)}_{\mathrm{d}}$, with the best validation-NDCG@5 checkpoint retained (around step 3400–4000 across rounds). *No real $(q,t)$ pair is used at S3$_r$;* the encoder is trained purely on (rewritten-description, tool) pairs and tested on real queries at inference time. An ablation that mixes real $q$-anchored and $\tilde{d}$-anchored pairs in the same retrain is reported in Appendix [L](https://arxiv.org/html/2605.29271#A12) (combined-pair encoder retrain).

## Appendix G Rewriter Training and Inference Hyperparameters

The rewriter is trained at S1b (5-format tool-rendering SFT, run once before the loop) and at S4$_r$ (DPO alignment, run once per round), and is sampled from at S2 (bootstrap data generation), at S4$_r$ (DPO candidate sampling), and at inference time. The decoding settings for each of these sampling points are listed below.

#### S1b: 5\-format tool\-rendering SFT\.

The rewriter $\psi_0=$ Qwen3.5-4B is fine-tuned on the catalog $\mathcal{T}$ rendered under all five formats $\phi_1,\ldots,\phi_5$ (defined in §[3.2](https://arxiv.org/html/2605.29271#S3.SS2)). Each tool is presented as a next-token prediction target under each of the five rendering conventions, sampled with equal weight per mini-batch. LoRA with rank $r=16$, $\alpha_{\mathrm{LoRA}}=32$, dropout 0.05, applied to attention $q,k,v,o$ projections. AdamW with learning rate $\eta_{\psi}^{\mathrm{S1b}}=2\times 10^{-5}$, linear schedule with 3% warmup, per-device batch size 2, gradient accumulation 32 (effective batch size 64), max sequence length 1024, 8 epochs over the mixture ($\approx$50K examples per epoch), bf16 mixed precision, gradient checkpointing on (non-reentrant), single H200 GPU. Validation hit@5 on the G1/G2/G3 retrieval dev splits is computed every 100 steps and the best checkpoint is retained.

#### S2: Bootstrap description generation\.

Using $\psi_1$ (the S1b checkpoint), we generate the first round of (description, tool) training data $\mathcal{D}^{(\psi_1)}_{\mathrm{d}}$ over all queries $q\in\mathcal{D}_{\mathrm{train}}$. Decoding is greedy ($T=0$, top-$p=1$, top-$k=1$, no repetition penalty) with a 150-token completion budget, served via vLLM. We use a single completion per query. Outputs pass through $\mathrm{clean}(\cdot)$ (Appendix [B](https://arxiv.org/html/2605.29271#A2)) before being used as encoder anchors. The same generation protocol is re-run at the start of every subsequent round $r$ to produce $\mathcal{D}^{(\psi_r)}_{\mathrm{d}}$ from the current rewriter $\psi_r$.

#### S4r: DPO candidate sampling\.

For each query $q\in\mathcal{D}_{\mathrm{train}}$ we sample $N=4$ candidate descriptions from $\psi_r$ at temperature $T=0.7$, top-$p=0.95$, top-$k=50$, with a 300-token completion budget, served via vLLM. Each candidate $\tilde{d}^{(j)}$ is encoded by the freshly retrained encoder $\theta_{r+1}$ (from S3$_r$); candidates are scored by their NDCG@5 against the gold tool set $T^{*}_q$ under $\theta_{r+1}$. The chosen / rejected pair $(\tilde{d}^{+}_q,\tilde{d}^{-}_q)$ is the (argmax, argmin) of the four scores. Queries whose four candidates yield identical NDCG@5 are dropped from the DPO set.
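
The pair-selection rule can be sketched as a small function over pre-computed scores. This is an illustration only: `select_dpo_pair` is a hypothetical name, the NDCG@5 scores are assumed to have been computed already under the retrained encoder, and breaking score ties by first occurrence is an assumption the text does not specify.

```python
from typing import List, Optional, Tuple


def select_dpo_pair(candidates: List[str],
                    scores: List[float]) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) as the (argmax, argmin) of the scores.

    Returns None when every candidate has the same score, in which case the
    query contributes no preference pair.
    """
    if not candidates or len(candidates) != len(scores):
        raise ValueError("need one score per candidate")
    if max(scores) == min(scores):
        return None
    best = max(range(len(scores)), key=scores.__getitem__)   # first argmax
    worst = min(range(len(scores)), key=scores.__getitem__)  # first argmin
    return candidates[best], candidates[worst]


pair = select_dpo_pair(["d1", "d2", "d3", "d4"], [0.2, 1.0, 0.0, 0.6])
dropped = select_dpo_pair(["d1", "d2", "d3", "d4"], [0.5, 0.5, 0.5, 0.5])
print(pair, dropped)
```

Dropping all-tied queries matters because such a pair would carry no retrieval preference for DPO to learn from.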

#### S4r: DPO training\.

We use TRL’s `DPOTrainer` with the sigmoid loss formulation (Rafailov et al., [2023](https://arxiv.org/html/2605.29271#bib.bib62)):

$$
\mathcal{L}_{\mathrm{DPO}}(\psi;\psi_r)=-\log\sigma\Bigl(\beta\bigl[\Delta_{\psi}(\tilde{d}^{+},q)-\Delta_{\psi}(\tilde{d}^{-},q)\bigr]\Bigr),
$$

with $\Delta_{\psi}(\tilde{d},q)=\log\frac{p_{\psi}(\tilde{d}\mid\rho_{\mathrm{HyDE}}(q))}{p_{\psi_r}(\tilde{d}\mid\rho_{\mathrm{HyDE}}(q))}$ and $\beta=0.1$. LoRA with rank $r'=64$, $\alpha_{\mathrm{LoRA}}=128$, dropout 0.05, applied to attention $q,k,v,o$ projections only; embeddings and the language-model head are *not* tuned at S4 (the new tool tokens are already learned at S1b and held fixed thereafter). AdamW with learning rate $\eta_{\psi}^{\mathrm{S4}}=5\times 10^{-6}$, cosine schedule with 3% warmup, per-device batch size 2, gradient accumulation 4 (effective batch size 8), max prompt length 1024 tokens, max completion length 300 tokens, 1 epoch over the DPO pair set ($\approx$4,371 optimiser steps), bf16 mixed precision, gradient checkpointing on, single H200 GPU. The reference policy $\psi_r$ is the previous round’s rewriter; at $r=1$ this is $\psi_1$ from S1b. The trained adapter is merged back into the base weights at the end of the round before $\psi_{r+1}$ is used at S2 of round $r+1$.
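
For a single preference pair, the sigmoid DPO loss reduces to a softplus of the scaled margin between the two log-ratios. The sketch below evaluates it from four sequence log-probabilities; the argument names are illustrative, and it is a numerical illustration rather than the TRL implementation.

```python
import math


def dpo_loss(logp_policy_chosen: float, logp_ref_chosen: float,
             logp_policy_rejected: float, logp_ref_rejected: float,
             beta: float = 0.1) -> float:
    """Sigmoid DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    delta_chosen = logp_policy_chosen - logp_ref_chosen        # Delta(d+, q)
    delta_rejected = logp_policy_rejected - logp_ref_rejected  # Delta(d-, q)
    margin = beta * (delta_chosen - delta_rejected)
    # -log(sigmoid(m)) = log(1 + exp(-m)), evaluated in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


at_reference = dpo_loss(-10.0, -10.0, -12.0, -12.0)  # policy equals reference
improved = dpo_loss(-8.0, -10.0, -14.0, -12.0)       # policy favours the chosen text
print(at_reference, improved)
```

At initialisation the policy equals the reference, so both log-ratios are zero and the loss starts at $\log 2$; it decreases as the policy shifts probability mass toward the chosen description relative to the rejected one.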

#### Inference\-time decoding\.

At evaluation time the rewriter is decoded greedily ($T=0$) with a 150-token completion budget, single completion per query, served via vLLM. The deterministic cleaning operator $\mathrm{clean}(\cdot)$ (Appendix [B](https://arxiv.org/html/2605.29271#A2)) is applied before the description is passed to the encoder. The same decoding protocol is used for both standard and vague evaluation passes.

#### Reference policy and adapter merging\.

At each round $r$, the DPO reference $\psi_r$ is loaded from the merged checkpoint of round $r-1$ (or from $\psi_1$ at $r=1$). After DPO training, the LoRA adapter is merged into the base weights to produce $\psi_{r+1}$, which serves both as the next round’s S2 generator and as the next round’s DPO reference.
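
The per-round data flow described in this appendix (S2 generation, S3 encoder retrain, S4 candidate scoring and DPO, adapter merge) can be summarised as a small driver loop. Every callable below is a hypothetical placeholder for the corresponding stage, so this is a structural sketch rather than the authors' training code.

```python
def cohyde_rounds(theta, psi, n_rounds, generate, retrain_encoder,
                  build_pairs, dpo_train, merge_adapter):
    """Run n_rounds of co-training from a warm-started encoder and rewriter."""
    for _ in range(n_rounds):
        descriptions = generate(psi)                  # S2: greedy descriptions from psi_r
        theta = retrain_encoder(theta, descriptions)  # S3: theta_r -> theta_{r+1}
        pairs = build_pairs(psi, theta)               # S4: sample from psi_r, score under theta_{r+1}
        adapter = dpo_train(psi, pairs)               # S4: DPO with psi_r as the reference policy
        psi = merge_adapter(psi, adapter)             # merged weights become psi_{r+1}
    return theta, psi


# Toy stand-ins that only track round indices, to show the ordering.
log = []
final_theta, final_psi = cohyde_rounds(
    theta=1, psi=1, n_rounds=3,
    generate=lambda psi: f"descriptions from psi_{psi}",
    retrain_encoder=lambda theta, d: theta + 1,
    build_pairs=lambda psi, theta: log.append((psi, theta)) or "pairs",
    dpo_train=lambda psi, pairs: "adapter",
    merge_adapter=lambda psi, adapter: psi + 1,
)
print(final_theta, final_psi, log)
```

The recorded `(psi, theta)` indices show the asymmetry within a round: DPO candidates come from the current rewriter but are scored under the encoder that was just retrained.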

#### Implementation\.

PyTorch with HuggingFace Transformers; encoder training uses an in-house InfoNCE script; rewriter SFT and DPO use TRL’s `SFTTrainer` and `DPOTrainer` with PEFT for LoRA adapters. vLLM serves the rewriter at S2, S4 candidate sampling, and inference. Exact software versions are listed in Appendix [O](https://arxiv.org/html/2605.29271#A15).

## Appendix H Optional SFT Stage (HyDE-Style Bridging)

This appendix describes an *optional* HyDE-style SFT pass that we ran in early experiments but *do not* use in the main pipeline reported in §[4](https://arxiv.org/html/2605.29271#S4). It is a separate stage from the 5-format tool-rendering SFT (S1b) used in the CoHyDE pipeline.

In early experiments we inserted a brief LoRA-SFT pass between S1 and S2 to align the rewriter’s output style with the catalog rendering. The motivation was that the base Qwen3.5-4B rewriter under $\rho_{\mathrm{HyDE}}$ produced free-form text whose length and style differed visibly from the 5-format catalog rendering; in particular, outputs were often substantially longer than any single $\phi_i$. A short SFT pass on cleaned descriptions taught $\psi$ the catalog-style output vocabulary and stop tokens, narrowing this style gap.

Once we adopted the 5-format encoder warmup (S1) and the 5-format rewriter warmup S1b (Appendix [G](https://arxiv.org/html/2605.29271#A7)), the picture changed. Trained under $\phi\sim\mathrm{Unif}(\Phi)$, with $\phi_5$ in particular being the long, multi-sentence rendering closest in length and style to the rewriter’s output, the encoder learns a representation that is approximately invariant across the style gap this SFT pass was originally designed to close, and S1b then teaches the rewriter the catalog vocabulary directly. In this configuration, S2 can be initialised from $\psi_1$ (the S1b checkpoint) without an intervening HyDE-style SFT pass, and the iterative loop proceeds as in Algorithm [1](https://arxiv.org/html/2605.29271#alg1).

#### Pipeline with optional SFT\.

When included, the optional SFT pass produces an alternative $\psi_1$ as follows. Sampling descriptions from $\psi_0$ under $\rho_{\mathrm{HyDE}}$ for queries in $\mathcal{D}_{\mathrm{train}}$ and applying $\mathrm{clean}(\cdot)$ with a length filter ($|\tilde{d}|>30$ characters) yields a cleaned set $\mathcal{S}_{\mathrm{SFT}}$ of $\approx$2,754 pairs. We then run a *short* LoRA SFT pass, explicitly *not* trained to convergence:

$$
\psi_1^{\mathrm{(opt)}}=\arg\min_{\psi}\,-\!\!\sum_{(q,\tilde{d})\in\mathcal{S}_{\mathrm{SFT}}}\log p_{\psi}\bigl(\tilde{d}\,\big|\,\rho_{\mathrm{HyDE}}(q)\bigr). \tag{9}
$$

The iterative loop then runs from $\psi_1^{\mathrm{(opt)}}$ instead of $\psi_1$ from S1b.

#### Hyperparameters\.

LoRA with rank $r=16$, $\alpha_{\mathrm{LoRA}}=32$, dropout 0.05, applied to all attention projection matrices ($q,k,v,o$); embeddings and the language-model head are not tuned. AdamW with $\eta_{\psi}=2\times 10^{-4}$, linear schedule with 20-step warmup, effective batch size 32 (per-device 4 $\times$ gradient accumulation 8), max sequence length 512, 100 optimisation steps total ($\approx$3,200 examples seen, less than 2 epochs over $\mathcal{S}_{\mathrm{SFT}}$). bf16 mixed precision, single H200 GPU, gradient checkpointing on. Longer schedules (1,000 / 5,000 steps) degraded downstream DPO performance by reducing the diversity of candidates available to the S4 sampler; this ablation is reported in Appendix [L](https://arxiv.org/html/2605.29271#A12).

#### Ablation\.

§[4](https://arxiv.org/html/2605.29271#S4) reports retrieval numbers for the main pipeline (without the optional SFT pass, i.e. S2 initialised from S1b). The variant with the optional SFT pass is reported in Appendix [L](https://arxiv.org/html/2605.29271#A12) and does not improve over the main pipeline.

## Appendix I Evaluation Metrics

For a query $q$ with gold tool set $T^{*}_q$ and retrieved ranking $\hat{T}_k(q)=(\hat{t}_1,\ldots,\hat{t}_k)$:

$$
\begin{aligned}
\mathrm{hit@}k(q) &= \mathbb{1}\bigl[\hat{T}_k(q)\cap T^{*}_q\neq\varnothing\bigr], && (10)\\
\mathrm{recall@}k(q) &= \frac{|\hat{T}_k(q)\cap T^{*}_q|}{|T^{*}_q|}, && (11)\\
\mathrm{NDCG@}k(q) &= \frac{\sum_{j=1}^{k}\frac{\mathbb{1}[\hat{t}_j\in T^{*}_q]}{\log_2(j+1)}}{\sum_{j=1}^{\min(k,|T^{*}_q|)}\frac{1}{\log_2(j+1)}}. && (12)
\end{aligned}
$$

Each metric is averaged over queries in the relevant tier. Definitions match the standard `ir_measures` implementations.
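
The three definitions translate directly into code. The sketch below is a plain-Python rendering of Eqs. (10)–(12) for a single query, assuming a non-empty gold set and a ranking without duplicate tools; it is an illustration and not the authors' scoring code.

```python
import math
from typing import Sequence, Set


def hit_at_k(ranking: Sequence[str], gold: Set[str], k: int) -> float:
    """1 if any of the top-k retrieved tools is a gold tool, else 0."""
    return 1.0 if any(t in gold for t in ranking[:k]) else 0.0


def recall_at_k(ranking: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of gold tools that appear in the top k."""
    return len(set(ranking[:k]) & gold) / len(gold)


def ndcg_at_k(ranking: Sequence[str], gold: Set[str], k: int) -> float:
    """Binary-relevance NDCG with the ideal DCG truncated at min(k, |gold|)."""
    dcg = sum(1.0 / math.log2(j + 1)
              for j, t in enumerate(ranking[:k], start=1) if t in gold)
    ideal = sum(1.0 / math.log2(j + 1)
                for j in range(1, min(k, len(gold)) + 1))
    return dcg / ideal


ranking = ["t3", "t1", "t9", "t2", "t7"]
gold = {"t1", "t2"}
print(hit_at_k(ranking, gold, 5), recall_at_k(ranking, gold, 5), ndcg_at_k(ranking, gold, 5))
```

In this example both gold tools are retrieved within the top five, so hit@5 and recall@5 are 1, while NDCG@5 is lower because the gold tools sit at ranks 2 and 4 rather than 1 and 2. At $k=1$ the ideal DCG is 1, which is why NDCG@1 coincides with hit@1.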

## Appendix J Round-3 $k$-Sweep

Table [4](https://arxiv.org/html/2605.29271#A10.T4) reports hit@$k$, recall@$k$, and NDCG@$k$ for the converged round-3 co-trained system at $k\in\{1,5,10,20\}$, on both standard and vague query splits, stratified by tier. Numbers are sourced from the same evaluation run that supplies the round-3 NDCG@5. NDCG@1 equals hit@1 by construction. Recall@1 is reported in full but, as noted in §[3.6](https://arxiv.org/html/2605.29271#S3.SS6), is bounded above by $1/|T^{*}_q|$ and is therefore lower than the other metrics for every multi-tool query.

Table 4: Full $k$-sweep for the round-3 co-trained system. NDCG@1 = hit@1 by construction. Recall@1 is capped at $1/|T^{*}_q|$ for multi-tool queries.

## Appendix K Bootstrap CI Protocol

The paired-bootstrap 95% confidence intervals reported in §[4](https://arxiv.org/html/2605.29271#S4) and Appendix [P](https://arxiv.org/html/2605.29271#A16) are computed as follows. For each tier $G\in\{G_1,G_2,G_3\}$ and split $\in\{\mathrm{standard},\mathrm{vague}\}$, let $\{x_q\}_{q\in\mathcal{Q}^{(G)}}$ and $\{y_q\}_{q\in\mathcal{Q}^{(G)}}$ be the per-query NDCG@5 scores under the two systems being compared (e.g. Round 3 and the Xu re-implementation), each system having produced its own ranking for the same set of queries in the same evaluation run. We resample query indices with replacement $B=10{,}000$ times, with a fixed random seed. For each resample $b$, with resampled index set $S_b$, we compute $\bar{x}^{(b)}=\mathrm{mean}_{q\in S_b}x_q$ and $\bar{y}^{(b)}=\mathrm{mean}_{q\in S_b}y_q$, and the paired difference $\delta^{(b)}=\bar{x}^{(b)}-\bar{y}^{(b)}$. The 95% CI of the difference is the (2.5%, 97.5%) percentile interval of $\{\delta^{(b)}\}_{b=1}^{B}$; we report this as $[\mathrm{lo},\mathrm{hi}]$. The same protocol with a single-system $x$-only resample yields the per-method CI half-widths quoted in Appendix [P](https://arxiv.org/html/2605.29271#A16) ($\pm 2$ pp / $\pm 3$ pp / $\pm 5$–$6$ pp on G1/G2/G3, dominated by tier size $|\mathcal{Q}^{(G)}|$).
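
A minimal sketch of this protocol is shown below. The function name, the default seed, and the nearest-rank percentile convention are assumptions made for illustration; the key property is that both systems are averaged over the same resampled query indices in every replicate.

```python
import random


def paired_bootstrap_ci(x, y, n_resamples=10_000, seed=0, alpha=0.05):
    """Percentile CI for mean(x) - mean(y) over paired per-query scores."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and aligned per query")
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(x)
    deltas = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample queries with replacement
        mean_x = sum(x[i] for i in idx) / n
        mean_y = sum(y[i] for i in idx) / n         # same indices: the pairing is kept
        deltas.append(mean_x - mean_y)
    deltas.sort()
    lo = deltas[int((alpha / 2) * (n_resamples - 1))]
    hi = deltas[int((1 - alpha / 2) * (n_resamples - 1))]
    return lo, hi


# Toy per-query scores: system x beats system y by 0.2 on every query.
lo, hi = paired_bootstrap_ci([0.5] * 20, [0.3] * 20, n_resamples=200)
print(lo, hi)
```

Pairing removes the variance that both systems share across queries, so the interval on the difference is typically tighter than one built from two independent single-system intervals.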

## Appendix L Design-Choice Ablations: Details

This appendix gives the per\-variant numbers and discussion behind the summary in §[4\.3](https://arxiv.org/html/2605.29271#S4.SS3)\. None of these variants are part of the main pipeline\.

#### Single-format encoder training ($\phi\equiv\phi_i$ for a fixed $i$).

Training under any single rendering matched the 5-format encoder on its matched evaluation rendering but underperformed on the others. Training under $\phi_5$ alone (the rendering closest in length to the rewriter’s output) still produced an encoder less robust to rewriter outputs of varying length than the 5-format encoder. We interpret this as evidence that the format mixture is doing more than augmenting on the longest format: by forcing the encoder to assign similar embeddings to the same tool across five different surface forms, it learns a length- and style-invariant representation that the description-only S3 retrains can then build on.

#### Combined\-pair encoder retrain\.

Replacing the description-only S3$_r$ objective with the mixed batch $\mathcal{D}^{(\psi_r);\,\alpha=0.5}_{\mathrm{q+d}}$ produced no improvement over description-only training and slightly degraded vague-query performance. We attribute this to the following mechanism: mixing $q$-anchored pairs back into S3 partially pulls the encoder toward the on-distribution $q$-anchored fixed point established at S1, the very fixed point whose vague-query failure we are trying to escape. The description-only objective is, in this view, doing distribution shift on purpose.

#### Query-rewrite prompt $\rho_{\mathrm{rewrite}}$.

Replacing the catalog-style $\rho_{\mathrm{HyDE}}$ with the user-style $\rho_{\mathrm{rewrite}}$ at any stage of the loop (sampling SFT targets, generating S3$_r$ pairs, or sampling DPO candidates) lost on every standard metric, with the largest gap on cross-domain G3. The prompt is given the relevant tool names as in-context anchors, so the comparison is not a strawman: with that anchoring, $\rho_{\mathrm{rewrite}}$ produces specific, plausible user queries; they simply do not match the style of the contrastive pairs the encoder sees during S3 retraining.

#### Longer SFT schedules\.

Extending the optional SFT pass (Appendix [H](https://arxiv.org/html/2605.29271#A8)) from 100 steps to 1,000 / 5,000 steps drove the SFT training loss down but produced lower-diversity candidates at S4 and a smaller DPO margin, ultimately reducing the closed-loop gain. The DPO update relies on temperature-0.7 sampling spreading mass across distinguishable candidates; over-fitted rewriters concentrate that mass and degrade the preference signal.

#### HyDE\-concat\.

Concatenating $q$ with $\tilde{d}$ before encoding, in the spirit of Query2doc (Wang et al., [2023](https://arxiv.org/html/2605.29271#bib.bib6)), helped slightly on G1 standard but hurt vague queries, where the original $q$’s lexical surface is precisely the surface we are trying to escape.

## Appendix M Xu et al. 2024 Re-implementation

This appendix documents the hyperparameters and prompts used for the head-to-head against Shao et al. ([2023](https://arxiv.org/html/2605.29271#bib.bib33)) reported in §[4.4](https://arxiv.org/html/2605.29271#S4.SS4).

#### Encoder\.

Shao et al. ([2023](https://arxiv.org/html/2605.29271#bib.bib33))’s pipeline trains the encoder once contrastively and never updates it again at inference time. We instantiate this once-trained encoder with our S1a InfoNCE checkpoint (BGE-large-en-v1.5 fine-tuned with InfoNCE on real $(q,\phi_5(t))$ pairs; full hyperparameters in Appendix [F](https://arxiv.org/html/2605.29271#A6)). This is a strictly stronger starting point than the Sentence-BERT base used in the original paper, and is therefore a charitable substitution: any gap our co-training closes against this baseline cannot be attributed to a weaker re-implemented encoder.

#### LLM refiner\.

Qwen3.5-4B served via vLLM, the same model and serving setup as the main paper’s rewriter (we deliberately use the same LLM as our rewriter to remove model-capacity confounds; Shao et al. ([2023](https://arxiv.org/html/2605.29271#bib.bib33)) use GPT-3.5). Greedy decoding ($T=0$, top-$p=1$, top-$k=1$), max 400 generated tokens per stage. The same temperature is used at all three prompted stages within a round.

#### Iteration schedule\.

$T=3$ refinement rounds (matching the paper’s reported best). Within each round, the LLM sees the current top-$K$ retrieved tools with $K=10$ and runs the three-stage Comprehension / Assessment / Refinement prompts; the refined instruction (or N/A early-stop) becomes the next round’s retrieval input. The final ranking is the last round’s. The top 50 are saved for evaluation at $k\in\{1,5,10,20\}$.
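
The control flow of this schedule can be sketched as follows. All callables are hypothetical placeholders for the retriever and the three prompted LLM stages, and the sketch assumes that retrieval happens at the start of each round and that the returned ranking is the one from the last round executed; the text above does not pin down either detail.

```python
def iterative_refinement(query, retrieve, comprehend, assess, refine,
                         rounds=3, top_k=10, keep=50):
    """Retrieve, let the LLM inspect the top-k tools, then refine the instruction."""
    instruction = query
    ranking = []
    for _ in range(rounds):
        ranking = retrieve(instruction, keep)     # ranking for the current instruction
        shown = ranking[:top_k]                   # tools the LLM gets to see
        summary = comprehend(instruction, shown)  # stage 1: Comprehension
        verdict = assess(summary, shown)          # stage 2: Assessment
        refined = refine(verdict, instruction)    # stage 3: Refinement
        if refined.strip() == "N/A":              # early stop: nothing left to fix
            break
        instruction = refined
    return ranking


# Toy stand-ins: the refiner rewrites once, then signals N/A.
calls = []
answers = iter(["q v2", "N/A"])
result = iterative_refinement(
    "q",
    retrieve=lambda ins, n: calls.append(ins) or [f"{ins}:{i}" for i in range(n)],
    comprehend=lambda ins, tools: "summary",
    assess=lambda summary, tools: "verdict",
    refine=lambda verdict, ins: next(answers),
)
print(calls, result[0], len(result))
```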

#### Three\-stage prompts\.

The paper does not provide verbatim text\. Our reimplementation uses prompts that match the three\-stage description in their §3:

- `P_Comprehension` (system + user): summarise user goals and the functionalities of the top-$K$ retrieved tools, one short sentence per goal and per tool.
- `P_Assessment` (system + user): given the comprehension and the retrieved set, decide which goals are SOLVED vs UNSOLVED and whether the ranking matches importance; output `Solved:` and `Unsolved:` sections.
- `P_Refinement` (system + user): given the assessment, output either `N/A` (if all goals are solved and the ranking matches) or a refined one-paragraph instruction enriched with the missing intent.

This should be read as a faithful reimplementation of the pipeline structure rather than an exact reproduction of Xu’s prompts\.

#### Caveat\.

Beyond the prompt approximation, our reimplementation differs from the original paper in two respects: \(i\) a stronger encoder \(S1a InfoNCE BGE\-large vs Sentence\-BERT base\), and \(ii\) a different LLM \(Qwen3\.5\-4B vs GPT\-3\.5\)\. Both substitutions advantage Xu’s method on this benchmark, making the head\-to\-head charitable to it\.

## Appendix N Compute Budget and Infrastructure

#### Hardware\.

All experiments were run on a single node with 8 H200 GPUs\. The encoder training, rewriter SFT, rewriter DPO, and HyDE inference passes each fit on a single GPU; multi\-GPU parallelism was used only opportunistically and not required for any reported result\.

#### Per\-stage wall\-clock cost \(approximate, single H200\)\.

- S1a (encoder InfoNCE warmup, 5 epochs, batch 256): $\sim$3 hours.
- S1b (rewriter 5-format tool-memorisation SFT, $\approx$50K examples): $\sim$2 hours.
- Per-round S2 (description regeneration over $\mathcal{D}_{\mathrm{train}}$ at $T=0$, 150-token budget, via vLLM): $\sim$2 hours.
- Per-round S3 encoder retrain: $\sim$1.5 hours.
- Per-round S4 DPO data generation ($N=4$ candidates per query at $T=0.7$, scored by the current encoder): $\sim$6 hours.
- Per-round S4 DPO training ($\approx$4,371 steps, LoRA $r=64$): $\sim$4 hours.
- Per-configuration vague-split evaluation (HyDE generation over 1,092 queries via vLLM): $\sim$2.5 hours per pass.

#### Total budget\.

Three rounds of co\-training plus all baselines, ablations, and rejected design\-choice variants \(Appendix[L](https://arxiv.org/html/2605.29271#A12)\) totalled approximately 400–500 GPU\-hours on H200\-class hardware\. Reproducing only the main result \(S1a \+ S1b \+ three rounds \+ a single end\-to\-end vague evaluation\) would take roughly 50 GPU\-hours\.
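
The main-result estimate can be checked against the approximate per-stage costs listed above. The snippet below simply adds them up; the dictionary keys are labels chosen here for readability.

```python
# Approximate single-H200 wall-clock hours, taken from the per-stage list.
one_off = {"S1a": 3.0, "S1b": 2.0}
per_round = {"S2": 2.0, "S3": 1.5, "S4 sampling": 6.0, "S4 DPO": 4.0}
vague_eval = 2.5
rounds = 3

main_result_hours = sum(one_off.values()) + rounds * sum(per_round.values()) + vague_eval
print(main_result_hours)
```

The sum comes to 48 hours, in line with the quoted figure of roughly 50 GPU-hours.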

## Appendix O Software Versions

Encoder training uses an in-house InfoNCE script built on PyTorch 2.4 and HuggingFace Transformers 4.46. Rewriter SFT and DPO use TRL 0.11 (`SFTTrainer`, `DPOTrainer`) with PEFT 0.13 for LoRA adapters. Rewriter inference uses vLLM 0.6. Mixed-precision training uses native PyTorch bf16. Evaluation uses our own retrieval scoring code; metric definitions match the standard `ir_measures` implementations and are given in closed form in Appendix [I](https://arxiv.org/html/2605.29271#A9).

## Appendix P Single-Seed Caveat

All reported numbers are from a single training seed. We did not run multi-seed variance estimates due to the per-round compute cost (Appendix [N](https://arxiv.org/html/2605.29271#A14)); the per-round trajectory in §[4.3](https://arxiv.org/html/2605.29271#S4.SS3) serves as a partial proxy for stability, in that the system’s behaviour across rounds is smooth on tier-averaged metrics and only mildly non-monotonic at the per-cell level. As a separate, finite-sample uncertainty estimate, we computed paired-bootstrap 95% CIs ($B=10{,}000$) of NDCG@5 over the 593 / 399 / 100 queries in G1/G2/G3 (protocol in Appendix [K](https://arxiv.org/html/2605.29271#A11)). The half-width of these CIs, which captures sampling uncertainty over the eval set and *not* training-seed variance, is approximately $\pm 2$ pp on G1 cells, $\pm 3$ pp on G2 cells, and $\pm 5$–$6$ pp on G3 cells (the smallest tier). Cell-level differences should accordingly be read against the bootstrap CI of the difference rather than a flat noise band: round-3 vs. S1 differences on standard tiers are well outside this band on G1/G2 and at the edge on G3; vague-tier differences are well outside on G2 but smaller than the bootstrap CI on G1 and G3. Multi-seed retrains, which would also bound training-side variance, are an open item.

## Appendix Q Ethics, Risks, and Artifacts

### Q\.1 Upstream Artifacts and Licenses

This work builds on the following publicly available artifacts, used in a manner consistent with their stated intended use \(research benchmarks and research\-grade pretrained models\)\.

- ToolBench \(Qin et al\., [2024](https://arxiv.org/html/2605.29271#bib.bib2)\): source of the underlying API pool and the official G1/G2/G3 evaluation queries\. Released under Apache 2\.0\. [https://github\.com/OpenBMB/ToolBench](https://github.com/OpenBMB/ToolBench)\.
- ToolGen \(Wang et al\., [2025](https://arxiv.org/html/2605.29271#bib.bib30)\): source of the 46,980\-tool catalog from which we derive the 10K subset $\mathcal{T}$, and the source of the \(query, gold\-tool\-set\) training pairs $\mathcal{D}_{\mathrm{train}}$\. Released under Apache 2\.0\. [https://github\.com/Reason\-Wang/ToolGen](https://github.com/Reason-Wang/ToolGen)\.

### Q\.2 Data Coverage and Privacy

#### Language and domain\.

The ToolBench/ToolGen catalog is entirely English\-language and is sourced from RapidAPI’s public catalog, skewed toward consumer\-facing REST APIs \(weather, sports, lifestyle, finance, entertainment\)\. No non\-English text appears in queries, tool descriptions, or rewriter outputs\.

#### Personally identifying information\.

Tool records contain API metadata \(titles, endpoints, parameter schemas, free\-text descriptions written by API publishers\)\. They do not contain end\-user PII\. We did not run a dedicated PII scan on the catalog because the source records are already public API documentation; however, the manual review of 100 vague\-paraphrase outputs in Appendix [A](https://arxiv.org/html/2605.29271#A1) did not surface any inadvertent generation of personal information\.

#### Offensive content\.

The catalog includes some adult\-content\-tagged APIs \(a small minority, consistent with RapidAPI’s public listings\)\. We did not filter these out, on the grounds that doing so would change the benchmark composition and make our numbers incomparable with prior work on the same catalog\. No offensive content appears in any reported example or figure\.

#### Split sizes\.

As reported in §[3](https://arxiv.org/html/2605.29271#S3): catalog $|\mathcal{T}| = 10{,}000$; training set $|\mathcal{D}_{\mathrm{train}}| = 104{,}224$ \(G1: 44,873; G2: 35,402; G3: 23,949\); evaluation queries 593 / 399 / 100 over G1/G2/G3 \(1,092 total\), with vague paraphrases of the same evaluation queries forming $\mathcal{Q}_{\mathrm{vague}}$ of equal size\.

### Q\.3 Risks

Tool retrieval is a component of larger tool\-using agent systems; improvements in retrieval can amplify both desirable and undesirable downstream agent behaviour, depending on the tools in the catalog and the agent’s policy over them\. Our experiments are run on the ToolBench\-derived ToolGen catalog, which inherits whatever selection biases that catalog has — consumer\-facing REST APIs over enterprise or safety\-critical tools, English\-language descriptions, no audit of the underlying APIs’ content\. A rewriter aligned to a specific encoder is, in effect, a steering vector over that encoder’s retrieval distribution; the same mechanism that closes catalog\-misalignment gaps could in principle be used to bias retrieval toward a chosen subset of tools, and any practitioner reusing this method should be aware that the rewriter’s behaviour is encoder\-specific\. We see no near\-term dual\-use concern beyond what already applies to any open dense retriever or instruction\-tuned LLM\.

### Q\.4 AI Assistant Use

For prototyping the codebase and experimentation, as well as for writing and editing of this manuscript, Claude Code with the Opus\-4\.5 model was used; all technical content, experimental design, claims, and figures are the authors’ own\.
