Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus

arXiv cs.CL 06/16/26, 04:00 AM Papers
Summary
Introduces XBCP (Cross-lingual BrowseComp-Plus), a benchmark for evaluating deep research agents and retrievers in cross-lingual and multilingual settings. Results show significant performance degradation when evidence is in a different language from the query, highlighting both retrieval failures and agent-side difficulty in integrating language-mismatched evidence.
arXiv:2606.15345v1 Announce Type: new Abstract: Deep research agents are increasingly evaluated on their ability to search for evidence, reason over retrieved sources, and produce grounded answers. Existing browsing benchmarks, however, largely assume that the user's query and the supporting evidence are written in the same language, leaving open whether agentic search systems can operate when relevant evidence appears in another language. We introduce XBCP (Cross-lingual BrowseComp-Plus), a controlled benchmark that preserves the English question-and-answer space of BrowseComp-Plus but varies the languages of the supporting documents. XBCP instantiates two complementary settings: in the cross-lingual setting, each query is paired with evidence in a single assigned language. In the multilingual setting, the full evidence corpus is distributed equally and randomly across 12 languages spanning high-resource and low-resource regimes. We evaluate four deep research agents using sparse and dense multilingual retrievers, measuring answer accuracy, evidence recall, search behavior, calibration, citation fidelity, and oracle retrieval. Results reveal substantial degradation when evidence is translated. Even strong, dense retrievers lose evidence recall, and agents become less calibrated and cite evidence less reliably. Notably, accuracy remains lower even when all gold evidence is supplied directly. These findings suggest that cross-lingual deep research exposes both retrieval failures and an independent, agent-side difficulty in integrating language-mismatched evidence.
Original Article
View Cached Full Text
Cached at: 06/16/26, 11:47 AM
# Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus
Source: [https://arxiv.org/html/2606.15345](https://arxiv.org/html/2606.15345)
Yuheng Lu1, Qingcheng Zeng211footnotemark:1, Heli Qi1,3, Puxuan Yu4,Fuheng Zhao5, Rui Yang6,Hitomi Yanaka7,3,Naoto Yokoya7,3,Weihao Xuan7,3 1Waseda University,2Northwestern University,3RIKEN AIP,4Snowflake Inc\., 5University of Utah,6Duke\-NUS Medical School,7The University of Tokyo

###### Abstract

Deep research agents are increasingly evaluated on their ability to search for evidence, reason over retrieved sources, and produce grounded answers\. Existing browsing benchmarks, however, largely assume that the user’s query and the supporting evidence are written in the same language, leaving open whether agentic search systems can operate when relevant evidence appears in another language\. We introduceXBCP\(Cross\-lingual BrowseComp\-Plus\), a controlled benchmark that preserves the English question\-and\-answer space of BrowseComp\-Plus but varies the languages of the supporting documents\.XBCPinstantiates two complementary settings\. In the cross\-lingual setting, each query is paired with evidence in a single assigned language\. In the multilingual setting, the full evidence corpus is distributed equally and randomly across 12 languages spanning high\-resource and low\-resource regimes\. We evaluate four deep research agents using sparse and dense multilingual retrievers, measuring answer accuracy, evidence recall, search behavior, calibration, citation fidelity, and oracle retrieval\. Results reveal substantial degradation when evidence is translated\. Even strong, dense retrievers lose evidence recall, and agents become less calibrated and cite evidence less reliably\. Notably, accuracy remains lower even when all gold evidence is supplied directly\. These findings suggest that cross\-lingual deep research exposes both retrieval failures and an independent, agent\-side difficulty in integrating language\-mismatched evidence\.

Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross\-Lingual BrowseComp\-Plus

## 1Introduction

Large language model \(LLM\) agents represent a shift from models that answer from parametric knowledge alone to systems that actively acquire, filter, and synthesize external evidence\. Deep research systems are a representative instance of this shift: given a complex information need, an agent must plan searches, inspect retrieved sources, judge whether the evidence is sufficient, and compose a grounded answer\(OpenAI,[2025a](https://arxiv.org/html/2606.15345#bib.bib2)\)\. This broader movement has made browsing\-based evaluation a central test of agentic capability\. BrowseComp\(Weiet al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib3)\)crystallizes the challenge by posing difficult but verifiable questions whose answers require nontrivial web exploration, thereby stressing both search behavior and evidence\-grounded reasoning\. However, evaluations over live web search measure an entire time\-varying system at once, entangling the language model, retrieval method, ranking API, and underlying corpus\. BrowseComp\-Plus\(Chenet al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib1)\)addresses this limitation by grounding BrowseComp\-style questions in a fixed, human\-verified corpus with supporting documents and hard negatives, turning browsing evaluation into a controlled setting where retrievers and LLM agents can be studied both separately and in interaction\.

This controlled view of deep research, however, remains largely confined to monolingual settings\. The limitation matters because multilingual and cross\-lingual retrieval have long been central concerns in information retrieval, and recent multilingual embedding models have greatly expanded the ability to retrieve across languages\(Yuet al\.,[2024](https://arxiv.org/html/2606.15345#bib.bib4); Zhanget al\.,[2024](https://arxiv.org/html/2606.15345#bib.bib5),[2025](https://arxiv.org/html/2606.15345#bib.bib7)\)\. Most evaluations of these models still treat retrieval as a standalone ranking problem: a query is matched against a fixed collection, and success is measured by document\-level relevance\. This abstraction is useful for isolating retrieval quality, but it does not capture what happens when retrieval is part of an agentic search process\. In that setting, the system must issue and refine searches, compare partial evidence, and decide how retrieved information should support an answer\. Recent browsing\-agent benchmarks beyond English, such as BrowseComp\-ZH\(Zhouet al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib8)\), broaden the linguistic scope of agent evaluation but remain primarily monolingual: questions, evidence, and answers all stay within the same language\. They therefore leave open the genuinely cross\-lingual case, where an information need expressed in one language must be answered using evidence written in another\. A cross\-lingual extension of BrowseComp\-Plus is needed to make this setting measurable\. Such a benchmark would test whether multilingual retrievers can surface the right evidence during agentic search and whether LLM agents can integrate language\-mismatched evidence into faithful answers\. To make this setting measurable, we introduce Cross\-lingual BrowseComp\-Plus \(XBCP\)\. To the best of our knowledge, XBCP is the first benchmark to formalize cross\-lingual deep research, extending the controlled evaluation paradigm of BrowseComp\-Plus from monolingual to multilingual retrieval\. XBCP preserves the task structure of BrowseComp\-Plus: questions are posed in English, answers are expected in English, and the evidence is grounded in a fixed corpus\. The key difference is that the supporting evidence is no longer assumed to be written in the same language as the question\. We instantiate this design with two complementary configurations\. In the cross\-lingual setting, all supporting documents for a given query appear in the same language, while the assigned language varies across queries\. This tests whether systems remain robust as otherwise comparable tasks move across languages\. In the multilingual setting, the evidence corpus is randomly but equally assigned to 12 languages spanning high\-resource and low\-resource regimes, enabling controlled evaluation of English queries against language\-specific evidence documents\. Together, these configurations allow XBCP to evaluate both whether multilingual retrievers can surface language\-mismatched evidence during agentic search and whether LLM agents can integrate such evidence into faithful English answers\. Our experiments reveal large drops in accuracy and evidence recall across retrievers, reduced citation reliability, and persistent degradation even under oracle retrieval\. These findings indicate that cross\-lingual deep research stresses both retrieval and agent\-side evidence integration\. Figure[1](https://arxiv.org/html/2606.15345#S1.F1)summarizes the construction and evaluation pipeline\.

![Refer to caption](https://arxiv.org/html/2606.15345v1/x1.png)Figure 1:Overview of theXBCPpipeline\. We translate and reorganize the evidence side of BrowseComp\-Plus into cross\-lingual and multilingual corpora, rebuild retrieval indexes for controlled agent experiments, and evaluate agents and retrievers with end\-to\-end accuracy, evidence recall, calibration, oracle retrieval, and per\-language analysis\.
## 2Related Works

##### Deep Research Systems\.

Deep research systems extend tool\-augmented LLMs from single\-step retrieval to long\-horizon information seeking, where agents must plan searches, interact with external sources, verify intermediate evidence, and synthesize grounded answers\. OpenAI Deep Research\(OpenAI,[2025a](https://arxiv.org/html/2606.15345#bib.bib2)\)exemplifies this paradigm and has motivated a growing line of open research agents that scale the underlying capabilities in different ways: Tongyi DeepResearch\(Teamet al\.,[2026](https://arxiv.org/html/2606.15345#bib.bib9)\)combines agentic mid\-training and post\-training with large\-scale synthetic trajectories, MiroThinker\(MiroMind Teamet al\.,[2026](https://arxiv.org/html/2606.15345#bib.bib12)\)studies model, context, and interaction scaling, and Marco DeepResearch\(Zhuet al\.,[2026](https://arxiv.org/html/2606.15345#bib.bib11)\)emphasizes verification\-centric training and inference to reduce error propagation in long\-horizon search\. Benchmarking has also moved toward more demanding settings, including Chinese web browsing in BrowseComp\-ZH\(Zhouet al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib8)\), expert\-level financial search in FinSearchComp\(Huet al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib13)\), and noisy or conflicting search results in SealQA\(Phamet al\.,[2026](https://arxiv.org/html/2606.15345#bib.bib14)\)\. These efforts have substantially advanced both systems and evaluations, but remain largely monolingual or domain\-specific, leaving cross\-lingual deep research underexplored\.

##### Multilingual and Cross\-lingual Retrieval\.

Multilingual and cross\-lingual retrieval has moved from translation\-mediated CLIR toward shared embedding spaces\.mE5\(Wanget al\.,[2024](https://arxiv.org/html/2606.15345#bib.bib19)\)extends the E5 recipe with billion\-scale multilingual contrastive pre\-training and supervised fine\-tuning, while later systems expand the design space through long\-context encoders inmGTE\(Zhanget al\.,[2024](https://arxiv.org/html/2606.15345#bib.bib5)\), efficiency\- and compression\-aware multilingual embeddings inArctic\-Embed 2\.0\(Yuet al\.,[2024](https://arxiv.org/html/2606.15345#bib.bib4)\), and foundation\-model\-based multilingual training inQwen3 Embedding\(Zhanget al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib7)\)\. This progress is accompanied by a broader recognition that CLIR is not simply monolingual retrieval plus translation: retrieval quality depends on cross\-lingual representation alignment, resource imbalance, domain transfer, and evaluation design\(Goworeket al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib15)\)\. Evaluation has therefore expanded to representative benchmarks such as MMTEB\(Enevoldsenet al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib20)\), MIRACLZhanget al\.\([2023](https://arxiv.org/html/2606.15345#bib.bib21)\), and MLDRChenet al\.\([2024](https://arxiv.org/html/2606.15345#bib.bib22)\), but it remains a fixed\-collection ranking problem\. Large\-scale CLIR experiments show that multilingual bi\-encoders and translation\-based lexical retrieval dominate across different datasets and language regimes\(Zuoet al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib6)\); task\-specific fact\-checking studies further show that multilingual and cross\-lingual retrieval yield different model rankings and gains from supervised adaptation\(Ramponiet al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib18)\)\. These works provide strong retrievers and ranking\-oriented evaluations, but not a view of cross\-lingual retrieval inside the iterative search, evidence selection, and answer synthesis loop of deep research agents\.

## 3BuildingXBCP

### 3\.1Translation\-Based Construction

We buildXBCPby translating the evidence side of BrowseComp\-Plus\(Chenet al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib1)\): questions remain in English, final answers are evaluated in English, and only the evidence documents vary in languages\. We useGPT\-5\.4\(OpenAI,[2026](https://arxiv.org/html/2606.15345#bib.bib32)\)as the translation model with a single language\-conditioned prompt that requests complete translation into the target language, including titles, terminology, proper nouns, and metadata field names, while preserving URLs, email addresses, formulas, and code blocks; the full prompt is shown in Appendix[B](https://arxiv.org/html/2606.15345#A2)\. This prompt is applied to each source document for the non\-English target languages, while English documents are retained unchanged\. The resulting evidence languages are designed to span different resource conditions\. We include relatively high\-resource languages with substantial web and retrieval coverage, namely Chinese, English, French, German, Japanese, Korean, Portuguese, and Spanish, as well as low\-resource African languages, namely Swahili, Wolof, Yoruba, and Zulu\. This language set allowsXBCPto test whether cross\-lingual deep research systems degrade smoothly across resource regimes or fail disproportionately when evidence appears in languages with weaker retrieval and modeling support\.

The translated corpus supports two evaluation configurations\. In the cross\-lingual setting, each query is assigned to one evidence language, so all supporting documents for that query appear in the same language\(English serves as an untranslated reference\)\.Appendix Table[8](https://arxiv.org/html/2606.15345#A1.T8)reports the resulting 830 query assignments and 5,040 evidence\-document assignments\. In the multilingual setting, 5,040 evidence document instances are randomly but equally assigned to 12 languages, making 420 evidence docs per language; Appendix Table[9](https://arxiv.org/html/2606.15345#A1.T9)gives the per\-language document counts\. This construction lets us vary the linguistic form of the evidence while preserving the original task semantics, making retrieval failures and agent\-side synthesis failures comparable across languages\.

### 3\.2Verification and Quality Control

To assess the quality of the translated evidence, we conduct an independent expert verification study following the translation\-evaluation rubric of MMLU\-ProX\(Xuanet al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib33)\)\. The rubrics is in Appendix[C](https://arxiv.org/html/2606.15345#A3)\. We sample 200 translated documents from each of 11 non\-English languages, yielding 2200 translation instances in total\. Expert annotators compare each translation against the original English document and rate it along the same three dimensions in MMLU\-ProX, accuracy, fluency, and completeness on 1\-5 scale, so that the verification focuses on whether the translated documents preserve the evidence needed for retrieval and answer synthesis\. Verification results are in Appendix[D](https://arxiv.org/html/2606.15345#A4)\. All language\-level mean scores exceed 4\.0, suggesting that the translated evidence is generally usable for controlled evaluation, while residual artifacts may remain\.

## 4Experiments and Results

### 4\.1Experimental Setup

Following the evaluation protocol of BrowseComp\-Plus\(Chenet al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib1)\), we evaluateXBCPby pairing search agents with controlled retriever tools over fixed corpora\. We consider four agents:GPT\-OSS\-20B\(OpenAIet al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib24)\),GPT\-OSS\-120B\(OpenAIet al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib24)\),Qwen3\.6\-35B\-A3B\(Qwen Team,[2026](https://arxiv.org/html/2606.15345#bib.bib25)\), andDeepSeek\-V4\-Pro\(DeepSeek\-AI,[2026](https://arxiv.org/html/2606.15345#bib.bib26)\)\. For retrieval, we compare a sparse lexical baseline, BM25\(Robertson and Zaragoza,[2009](https://arxiv.org/html/2606.15345#bib.bib23)\), with four dense multilingual retrievers:Qwen3\-Embedding\-4B,Qwen3\-Embedding\-8B\(Zhanget al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib7)\),Multilingual\-E5\-Large\(Wanget al\.,[2024](https://arxiv.org/html/2606.15345#bib.bib19)\), andArctic\-Embed\-L\-2\.0\(Yuet al\.,[2024](https://arxiv.org/html/2606.15345#bib.bib4)\)\.GPT\-OSS\-20B,GPT\-OSS\-120B, andQwen3\.6\-35B\-A3Bare evaluated with all five retrievers, whileDeepSeek\-V4\-Prois evaluated with BM25 andQwen3\-Embedding\-8B\. Each available agent\-retriever pair is evaluated on three corpus conditions\.

Evaluations are at two complementary levels\. First, end\-to\-end agent performance captures whether an agent can answer correctly while using a retriever as its search tool\. Accuracy scores final answer correctness; evidence recall, computed over the union of documents returned across the agent’s search trajectory, measures retriever\-side coverage of human\-verified evidence independent of downstream agent behavior; average search calls captures exploration cost; and calibration error measures the mismatch between the agent’s stated confidence and its observed correctness\.

Second, we analyze retriever behavior as it appears inside the agent loop\. In this setting, retrieval quality is not only a top\-kkranking property: a useful retriever should surface supporting documents consistently enough for the agent to find them through iterative search, reduce unnecessary follow\-up searches, and provide evidence that can be cited in the final response\. We therefore report citation coverage, average citation count, citation precision, and citation recall to measure whether retrieved evidence is carried through into faithful source attribution\.

Beyond these two levels, we additionally evaluate an oracle retrieval setting that bypasses search and ranking by supplying all supporting evidence directly to the agent, isolating reasoning errors from retrieval errors\. We also report three supplementary analyses: a per\-language decomposition, a reasoning\-based query expansion experiment, and a reasoning\-effort control study\.

Since our benchmarks are set in multilingual and crosslingual settings, the original selected modelsQwen3\-32B\(Yanget al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib34)\)andGPT\-4\.1\(OpenAI,[2025b](https://arxiv.org/html/2606.15345#bib.bib35)\)in LLM\-as\-Judge in BrowseComp\-Plus are not suitable in our experiments\. We therefore adoptGPT\-5\.4\(OpenAI,[2026](https://arxiv.org/html/2606.15345#bib.bib32)\)and change the judge prompt for evaluation\. The new judge prompt is in Appendix[E](https://arxiv.org/html/2606.15345#A5)\.

### 4\.2Main Results

#### 4\.2\.1End\-to\-End Agent Evaluation

Table[1](https://arxiv.org/html/2606.15345#S4.T1)reports end\-to\-end accuracy and evidence recall\. The strongest overall performance is obtained byDeepSeek\-V4\-ProwithQwen3\-Embedding\-8B, reaching 64\.70% accuracy on the original corpus, 48\.80% in the multilingual setting, and 42\.29% in the cross\-lingual setting\. Among the agents evaluated with the full retriever suite,Qwen3\-Embedding\-8Balso gives the strongest original\-corpus performance, consistent with the BrowseComp\-Plus finding that stronger retrievers improve deep\-research agents by surfacing more useful evidence during iterative search\(Chenet al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib1)\)\.

The same table shows that translated evidence introduces a large additional difficulty\. WithQwen3\-Embedding\-8B, accuracy drops by roughly 16–23 pp across agents when moving from the original corpus to the translated settings\. The degradation appears not only with BM25 but also with dense multilingual retrievers\. Meanwhile, multilingual and cross\-lingual results are close across most agent–retriever pairs, suggesting that the primary bottleneck is language mismatch rather than the specific language\-assignment regime\.

Table 1:End\-to\-end agent performance across corpus conditions\. Multi\. denotes the multilingual corpus, Cross\. denotes the cross\-lingual corpus, andΔM\\Delta\_\{M\}andΔC\\Delta\_\{C\}denote changes from the original corpus to the multilingual and cross\-lingual corpora, respectively\.DeepSeek\-V4\-Prois evaluated with BM25 andQwen3\-Embedding\-8B\.The efficiency and calibration trends reinforce this conclusion\. Table[2](https://arxiv.org/html/2606.15345#S4.T2)shows that agents generally issue more searches after evidence is translated, but these additional searches do not recover the lost accuracy\. Calibration error also increases in both translated settings, indicating that cross\-lingual evidence makes agents not only less accurate, but also less reliable in estimating their own correctness\.

Table 2:Search efficiency and calibration error withQwen3\-Embedding\-8B\. Search denotes average search calls per query; calibration error is reported in percentages\.
#### 4\.2\.2Retriever Evaluation

Evidence recall in Table[1](https://arxiv.org/html/2606.15345#S4.T1)makes the retrieval bottleneck visible\.Qwen3\-Embedding\-8Bconsistently retrieves the most supporting evidence, while BM25 drops sharply under translated evidence, confirming that lexical matching is poorly suited to English queries over non\-English documents\. Other dense multilingual retrievers recover part of the loss, but still trail the strongest retriever and remain substantially weaker after translation\. Thus, standard multilingual retrieval ability does not directly translate into robust retrieval for complex agentic search\.

We further examine whether retrieved evidence is used correctly in final answers\. Table[3](https://arxiv.org/html/2606.15345#S4.T3)shows that citation coverage, precision, and recall all decline once evidence is translated\. This indicates that language mismatch affects not only retrieval, but also whether retrieved sources are carried through into faithful attribution\. We provide a citation\-error case study in Appendix[G](https://arxiv.org/html/2606.15345#A7)\.

Table 3:Citation behavior withQwen3\-Embedding\-8B\. Cov\., Prec\., and Rec\. denote citation coverage, citation precision, and citation recall, all in percentages\.
#### 4\.2\.3Oracle Retrieval

The oracle setting provides a diagnostic decomposition of the end\-to\-end results\. Table[4](https://arxiv.org/html/2606.15345#S4.T4)compares the strongest tool\-based condition, usingQwen3\-Embedding\-8B, with an oracle condition in which all supporting evidence is supplied directly\. The retrieval/search gap remains large in every corpus condition: oracle retrieval improves accuracy by over 55 pp on the original corpus and by roughly 65–75 pp after translation\. Thus, the largest absolute headroom still lies in getting the right evidence into the agent’s context during iterative search\.

At the same time, oracle retrieval does not eliminate the cross\-lingual penalty\. Even with all required evidence provided, translated\-evidence oracle accuracy remains below original\-corpus oracle accuracy for all agents\. These gaps reveal an agent\-side bottleneck beyond retrieval: the model must identify relevant facts, align them with the English question, and synthesize an English answer without losing the evidential constraint\. We further decompose this bottleneck using a fully target\-language oracle variant in Appendix[F](https://arxiv.org/html/2606.15345#A6)\.

Table 4:Oracle retrieval and error decomposition\. Tool accuracy usesQwen3\-Embedding\-8B\. Ret\. Gap is oracle accuracy minus tool\-based accuracy under the same corpus condition; Lang\. Gap is the drop from original\-corpus oracle accuracy to translated\-evidence oracle accuracy\.

### 4\.3Supplementary Analyses

#### 4\.3\.1Per\-Language Decomposition

Table[5](https://arxiv.org/html/2606.15345#S4.T5)reports a per\-language decomposition forQwen3\.6\-35B\-A3BwithQwen3\-Embedding\-8B, with English as an untranslated reference and the remaining languages grouped by resource level\. Full results for other agent–retriever pairs appear in Appendix[H](https://arxiv.org/html/2606.15345#A8)\.

Two patterns stand out\. First, resource level is most visible before oracle retrieval\. High\-resource languages average 18\.39% tool accuracy and 28\.48% evidence recall, whereas low\-resource languages average 10\.87% and 18\.00%, respectively\. Yet their oracle accuracies remain relatively close, at 89\.67% and 87\.32%\. This suggests that the low\-resource penalty in this batch is driven primarily by retrieval failure rather than by an intrinsic inability to answer once evidence is provided\. Swahili and Wolof illustrate this most sharply: oracle accuracy stays near 86–90% while tool\-based accuracy collapses to roughly 15%\.

Second, resource level alone does not explain all variations\. Within the high\-resource group, French, German, Portuguese, and Spanish substantially outperform Japanese and Korean, with Japanese also showing one of the lowest oracle accuracies; Zulu exhibits an analogous pattern among low\-resource languages\. Cross\-lingual deep research is therefore shaped by two separable but interacting factors: the retriever’s ability to surface evidence across languages, and the agent’s ability to align language\-specific evidence with an English query\.

Table 5:Per\-language results in the cross\-lingual setting forQwen3\.6\-35B\-A3BwithQwen3\-Embedding\-8B, plus oracle accuracy for the same agent\. All scores are percentages exceptNN\. O–T Gap denotes oracle accuracy minus tool\-based accuracy\. Group averages are weighted by the number of queries and exclude the untranslated English reference\.
#### 4\.3\.2The Impact of Query Expansion

Chenet al\.\([2026](https://arxiv.org/html/2606.15345#bib.bib31)\)argue that deep research agents expose a retrieval signal that conventional retrievers ignore: before issuing a search query, the agent often writes a natural\-language reasoning trace that clarifies the task intent, summarizes prior findings, and identifies unresolved evidence needs\. Their fullAgentIRsystem trains a retriever to jointly embed the reasoning trace and the issued query\. We study a lighter\-weight variant inXBCP: without any retriever training or index changes, we use the agent’s current reasoning trace as query expansion by concatenating it with the issued search query before passing the input toQwen3\-Embedding\-8B\. This isolates whether agent\-side reasoning is already useful as a retrieval signal, and whether the benefit survives when the relevant evidence is written in another language\.

Table 6:AgentIR\-style zero\-training query expansion forGPT\-OSS\-20BwithQwen3\-Embedding\-8B\. \+Reason\. denotes concatenating the agent’s current reasoning trace with the issued query\. Acc\., Ev\. Rec\., and Cal\.Err\. are percentages; Search denotes average search calls per query\.
Table[6](https://arxiv.org/html/2606.15345#S4.T6)shows that reasoning\-based expansion consistently improves performance across all three corpus conditions\. On the original corpus, accuracy increases by 3\.25 pp and evidence recall by 4\.86 pp, while calibration error and search turns both decrease\. The same pattern holds after translation, although with smaller gains\. The improvements therefore do not come from additional exploration, since the expanded runs use slightly fewer search calls on average; rather, the reasoning trace appears to make each search query more informative\.

From the perspective ofXBCP, this result has two implications\. First, cross\-lingual deep research should treat query formulation as part of the retrieval problem: the agent’s reasoning can help disambiguate underspecified sub\-queries and expose more supporting evidence even without retriever fine\-tuning\. Second, the smaller gains under translated evidence show that reasoning\-aware query expansion is not sufficient by itself\. The system still depends on the retriever’s cross\-lingual alignment to bridge the language gap\.

#### 4\.3\.3The Impact of Reasoning Effort

Following BrowseComp\-Plus\(Chenet al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib1)\), we further examine how reasoning effort affects both answer quality and search behavior\. This is a particularly important diagnostic for agentic search: increasing the inference budget may improve final reasoning, but it can also change search iterations and exposed evidence before answering\. We therefore vary the reasoning\-effort mode ofGPT\-OSS\-20Bwhile holding the retriever fixed toQwen3\-Embedding\-8B\. This setup asks whether cross\-lingual failures can be mitigated by deeper deliberation, or whether language mismatch persists regardless of search effort\.

Table 7:Impact of reasoning effort forGPT\-OSS\-20BwithQwen3\-Embedding\-8B\. Acc\., Ev\. Rec\., and Cal\.Err\. are percentages; Search denotes average search calls per query\.Table[7](https://arxiv.org/html/2606.15345#S4.T7)shows that higher reasoning effort consistently improves both accuracy and evidence recall\. From low to high effort, an increase from 15\.18% to 36\.02% is observed for the original corpus, and over 10 pp increases are observed for both translated settings\. Evidence recall follows the same pattern, increasing in all 3 settings\. These gains come with a clear efficiency cost: high effort requires over 26 search calls per query, compared with roughly 2 calls under low effort\. Calibration also improves at high effort, suggesting that more extensive search and deliberation make the agent less overconfident\.

The comparison with the original corpus is more revealing\. High\-effort cross\-lingual and multilingual runs reach only about the accuracy of the low\-effort original run, despite using more than 14 times as many search calls; they remain far below the medium\-effort original run\. Thus, additional reasoning effort improves the agent in every corpus condition, but it does not turn cross\-lingual evidence into a monolingual problem\. In conclusion, the dominant difficulty is the language mismatch between the English information need and translated evidence, rather than the specific corpus assignment regime\.

## 5Discussion

Our experiments identify cross\-linguality as a structural source of difficulty for deep research agents, not merely as a perturbation to first\-stage retrieval\. By varying only the evidence language,XBCPisolates how language mismatch propagates through the evidence\-seeking pipeline\. This design brings together two evaluation traditions that have largely remained separate\. Multilingual and cross\-lingual retrieval benchmarks\(Zhanget al\.,[2023](https://arxiv.org/html/2606.15345#bib.bib21); Enevoldsenet al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib20); Zuoet al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib6)\)isolate whether a system can rank relevant documents across languages in a fixed collection, while deep research benchmarks\(Weiet al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib3); Chenet al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib1)\)evaluate iterative evidence seeking and grounded answer synthesis but typically assume that questions and evidence are linguistically aligned\.XBCPconnects these views by asking whether cross\-lingual retrieval remains effective once it becomes part of an agentic search process\.

This perspective first reveals a retrieval/search bottleneck\. Our results show that dense multilingual retrievers outperform BM25 after translation\. Yet conventional retrieval success does not guarantee that an agent will find the right evidence during iterative search\. This gap is consistent with prior work showing that multilingual and cross\-lingual retrieval can exhibit different behavior across language regimes and retrieval configurations\(Ramponiet al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib18); Zuoet al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib6); Zenget al\.,[2026](https://arxiv.org/html/2606.15345#bib.bib30)\)\. InXBCP, the same issue appears inside the agent loop: translated corpora reduce evidence recall, increase search effort, and lower citation reliability even when the retriever is dense and multilingual\. The implication is that cross\-lingual retrievers should not be evaluated only by whether they rank relevant documents highly in isolation, but also by whether they expose the evidence at the right point in an agent’s search trajectory\.

XBCPalso separates this retrieval/search bottleneck from an evidence\-integration bottleneck\. Recent work on multilingual and cross\-lingual RAG has shown that language\-mismatched evidence can complicate retrieval, consistency, and reasoning over multilingual contexts\(Liuet al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib27); Ranaldiet al\.,[2026](https://arxiv.org/html/2606.15345#bib.bib28); Qiet al\.,[2026](https://arxiv.org/html/2606.15345#bib.bib29)\)\. However, these studies remain focused on relatively short\-chain single\-hop or multi\-hop QA settings, leaving the long\-horizon deep research setting underexamined\. Our oracle results instantiate this distinction: providing all gold evidence substantially raises accuracy, confirming that finding evidence is a major bottleneck, but translated oracle accuracy remains below original one\. Thus, cross\-lingual deep research is decomposable into two linked questions: whether system can find language\-mismatched evidence, and whether it can use evidence faithfully once it is found\. The latter requires the agent to identify relevant facts in non\-English sources, align them with an English question and answer space, and preserve the evidential constraint during synthesis\.

The per\-language results further suggest that low\-resource effects enter the system primarily before evidence reaches the model\. Multilingual retrieval evaluation has long emphasized that language resource level, typology, and annotation coverage shape retrieval behavior\(Zhanget al\.,[2023](https://arxiv.org/html/2606.15345#bib.bib21)\); multilingual LLM research similarly identifies language imbalance and multilingual alignment as central challenges\(Xuet al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib16)\)\. InXBCP, low\-resource languages show substantially lower tool\-based accuracy and evidence recall than high\-resource languages, but their oracle accuracy is comparable\. It indicates that the largest low\-resource penalty appears during retrieval: once strong agents receive the gold documents, they can still extract and integrate the relevant information\. Resource\-level effects enter the system primarily before evidence reaches the model\.

Taken together, these findings point toward language\-aware agentic search rather than simply stronger multilingual retrieval\. Active retrieval work argues that systems should decide dynamically when and what to retrieve during generation\(Jianget al\.,[2023](https://arxiv.org/html/2606.15345#bib.bib17)\), while CLIR research has increasingly moved from translation\-based methods toward LLM\-based alignment, with cross\-lingual representation alignment remaining a central challenge\(Goworeket al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib15)\)\.XBCPextends this view to deep research: agents need to recognize the language of available evidence, formulate queries across languages and entity variants, decide when translation or language\-specific search is needed, and preserve source attribution across the final answer\. Cross\-lingual deep research therefore requires coordination between the retriever, query planner, reader, and citation mechanism, so that language\-mismatched evidence can be found, interpreted, and cited as part of a single grounded reasoning process\.

## Limitations

Our main experiments report a single evaluation run per agent–retriever–corpus configuration\. Running agents over full search trajectories with multiple retrievers across three corpus conditions is computationally expensive, and we did not repeat each configuration over multiple random seeds\. While the gaps between corpus conditions and between retrievers are large and consistent across agents, formal variance estimates and significance tests over multiple runs are left to future work\.

We use a single set of inference hyperparameters per agent, following each model’s recommended generation configuration, without tuning sampling temperature or top\-p\. This keeps comparisons across conditions controlled, but condition\-specific tuning, particularly for low\-resource languages, may partially reduce the observed gaps\. A systematic study of inference configuration is beyond the scope of this work\.

## References

- J\. Chen, S\. Xiao, P\. Zhang, K\. Luo, D\. Lian, and Z\. Liu \(2024\)M3\-embedding: multi\-linguality, multi\-functionality, multi\-granularity text embeddings through self\-knowledge distillation\.InFindings of the Association for Computational Linguistics: ACL 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 2318–2335\.External Links:[Link](https://aclanthology.org/2024.findings-acl.137/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.137)Cited by:[§2](https://arxiv.org/html/2606.15345#S2.SS0.SSS0.Px2.p1.1)\.
- Z\. Chen, X\. Ma, S\. Zhuang, J\. Lin, A\. Asai, and V\. Zhong \(2026\)AgentIR: reasoning\-aware retrieval for deep research agents\.External Links:2603\.04384,[Link](https://arxiv.org/abs/2603.04384)Cited by:[§4\.3\.2](https://arxiv.org/html/2606.15345#S4.SS3.SSS2.p1.1)\.
- Z\. Chen, X\. Ma, S\. Zhuang, P\. Nie, K\. Zou, A\. Liu, J\. Green, K\. Patel, R\. Meng, M\. Su, S\. Sharifymoghaddam, Y\. Li, H\. Hong, X\. Shi, X\. Liu, N\. Thakur, C\. Zhang, L\. Gao, W\. Chen, and J\. Lin \(2025\)BrowseComp\-plus: a more fair and transparent evaluation benchmark of deep\-research agent\.External Links:2508\.06600,[Link](https://arxiv.org/abs/2508.06600)Cited by:[Appendix K](https://arxiv.org/html/2606.15345#A11.SS0.SSS0.Px1.p1.1),[Appendix L](https://arxiv.org/html/2606.15345#A12.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.15345#S1.p1.1),[§3\.1](https://arxiv.org/html/2606.15345#S3.SS1.p1.1),[§4\.1](https://arxiv.org/html/2606.15345#S4.SS1.p1.1),[§4\.2\.1](https://arxiv.org/html/2606.15345#S4.SS2.SSS1.p1.1),[§4\.3\.3](https://arxiv.org/html/2606.15345#S4.SS3.SSS3.p1.1),[§5](https://arxiv.org/html/2606.15345#S5.p1.1)\.
- DeepSeek\-AI \(2026\)DeepSeek\-v4: towards highly efficient million\-token context intelligence\.Cited by:[Appendix K](https://arxiv.org/html/2606.15345#A11.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.15345#S4.SS1.p1.1)\.
- K\. Enevoldsen, I\. Chung, I\. Kerboua, M\. Kardos, A\. Mathur, D\. Stap, J\. Gala, W\. Siblini, D\. Krzemiński, G\. I\. Winata, S\. Sturua, S\. Utpala, M\. Ciancone, M\. Schaeffer, G\. Sequeira, D\. Misra, S\. Dhakal, J\. Rystrøm, R\. Solomatin, Ö\. Çağatan, A\. Kundu, M\. Bernstorff, S\. Xiao, A\. Sukhlecha, B\. Pahwa, R\. Poświata, K\. K\. GV, S\. Ashraf, D\. Auras, B\. Plüster, J\. P\. Harries, L\. Magne, I\. Mohr, M\. Hendriksen, D\. Zhu, H\. Gisserot\-Boukhlef, T\. Aarsen, J\. Kostkan, K\. Wojtasik, T\. Lee, M\. Šuppa, C\. Zhang, R\. Rocca, M\. Hamdy, A\. Michail, J\. Yang, M\. Faysse, A\. Vatolin, N\. Thakur, M\. Dey, D\. Vasani, P\. Chitale, S\. Tedeschi, N\. Tai, A\. Snegirev, M\. Günther, M\. Xia, W\. Shi, X\. H\. Lù, J\. Clive, G\. Krishnakumar, A\. Maksimova, S\. Wehrli, M\. Tikhonova, H\. Panchal, A\. Abramov, M\. Ostendorff, Z\. Liu, S\. Clematide, L\. J\. Miranda, A\. Fenogenova, G\. Song, R\. B\. Safi, W\. Li, A\. Borghini, F\. Cassano, H\. Su, J\. Lin, H\. Yen, L\. Hansen, S\. Hooker, C\. Xiao, V\. Adlakha, O\. Weller, S\. Reddy, and N\. Muennighoff \(2025\)MMTEB: massive multilingual text embedding benchmark\.arXiv preprint arXiv:2502\.13595\.External Links:[Link](https://arxiv.org/abs/2502.13595),[Document](https://dx.doi.org/10.48550/arXiv.2502.13595)Cited by:[§2](https://arxiv.org/html/2606.15345#S2.SS0.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2606.15345#S5.p1.1)\.
- R\. Goworek, O\. Macmillan\-Scott, and E\. B\. Özyiğit \(2025\)Bridging language gaps: advances in cross\-lingual information retrieval with multilingual llms\.External Links:2510\.00908,[Link](https://arxiv.org/abs/2510.00908)Cited by:[§2](https://arxiv.org/html/2606.15345#S2.SS0.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2606.15345#S5.p5.1)\.
- L\. Hu, J\. Jiao, J\. Liu, Y\. Ren, Z\. Wen, K\. Zhang, X\. Zhang, X\. Gao, T\. He, F\. Hu, Y\. Liao, Z\. Wang, C\. Yang, Q\. Yang, M\. Yin, Z\. Zeng, G\. Zhang, X\. Zhang, X\. Zhao, Z\. Zhu, H\. Namkoong, W\. Huang, and Y\. Tang \(2025\)FinSearchComp: towards a realistic, expert\-level evaluation of financial search and reasoning\.External Links:2509\.13160,[Link](https://arxiv.org/abs/2509.13160)Cited by:[§2](https://arxiv.org/html/2606.15345#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Jiang, F\. Xu, L\. Gao, Z\. Sun, Q\. Liu, J\. Dwivedi\-Yu, Y\. Yang, J\. Callan, and G\. Neubig \(2023\)Active retrieval augmented generation\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 7969–7992\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.495/),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.495)Cited by:[§5](https://arxiv.org/html/2606.15345#S5.p5.1)\.
- W\. Liu, S\. Trenous, L\. F\. R\. Ribeiro, B\. Byrne, and F\. Hieber \(2025\)XRAG: cross\-lingual retrieval\-augmented generation\.InFindings of the Association for Computational Linguistics: EMNLP 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 15669–15690\.External Links:[Link](https://aclanthology.org/2025.findings-emnlp.849/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.849),ISBN 979\-8\-89176\-335\-7Cited by:[§5](https://arxiv.org/html/2606.15345#S5.p3.1)\.
- MiroMind Team, S\. Bai, L\. Bing, C\. Chen, G\. Chen, Y\. Chen, Z\. Chen, Z\. Chen, J\. Dai, X\. Dong, W\. Dou, Y\. Deng, Y\. Fu, J\. Ge, C\. Han, T\. Huang, Z\. Huang, J\. Jiao, S\. Jiang, T\. Jiao, X\. Jian, L\. Lei, R\. Li, G\. Luo, T\. Li, X\. Lin, Z\. Liu, Z\. Li, J\. Ni, Q\. Ren, P\. Sun, S\. Su, C\. Tao, B\. Wang, W\. Wang, H\. Wang, J\. Wang, J\. Wang, J\. Wang, L\. Wang, S\. Wang, W\. Wang, Z\. Wang, J\. Xu, S\. Xing, C\. Yang, H\. Ye, J\. Yu, Y\. Yu, M\. Zhong, T\. Zhao, X\. Zhu, Y\. Zhou, Y\. Zhang, and Z\. Zhu \(2026\)MiroThinker: pushing the performance boundaries of open\-source research agents via model, context, and interactive scaling\.External Links:2511\.11793,[Link](https://arxiv.org/abs/2511.11793)Cited by:[§2](https://arxiv.org/html/2606.15345#S2.SS0.SSS0.Px1.p1.1)\.
- OpenAI, :, S\. Agarwal, L\. Ahmad, J\. Ai, S\. Altman, A\. Applebaum, E\. Arbus, R\. K\. Arora, Y\. Bai, B\. Baker, H\. Bao, B\. Barak, A\. Bennett, T\. Bertao, N\. Brett, E\. Brevdo, G\. Brockman, S\. Bubeck, C\. Chang, K\. Chen, M\. Chen, E\. Cheung, A\. Clark, D\. Cook, M\. Dukhan, C\. Dvorak, K\. Fives, V\. Fomenko, T\. Garipov, K\. Georgiev, M\. Glaese, T\. Gogineni, A\. Goucher, L\. Gross, K\. G\. Guzman, J\. Hallman, J\. Hehir, J\. Heidecke, A\. Helyar, H\. Hu, R\. Huet, J\. Huh, S\. Jain, Z\. Johnson, C\. Koch, I\. Kofman, D\. Kundel, J\. Kwon, V\. Kyrylov, E\. Y\. Le, G\. Leclerc, J\. P\. Lennon, S\. Lessans, M\. Lezcano\-Casado, Y\. Li, Z\. Li, J\. Lin, J\. Liss, Lily, Liu, J\. Liu, K\. Lu, C\. Lu, Z\. Martinovic, L\. McCallum, J\. McGrath, S\. McKinney, A\. McLaughlin, S\. Mei, S\. Mostovoy, T\. Mu, G\. Myles, A\. Neitz, A\. Nichol, J\. Pachocki, A\. Paino, D\. Palmie, A\. Pantuliano, G\. Parascandolo, J\. Park, L\. Pathak, C\. Paz, L\. Peran, D\. Pimenov, M\. Pokrass, E\. Proehl, H\. Qiu, G\. Raila, F\. Raso, H\. Ren, K\. Richardson, D\. Robinson, B\. Rotsted, H\. Salman, S\. Sanjeev, M\. Schwarzer, D\. Sculley, H\. Sikchi, K\. Simon, K\. Singhal, Y\. Song, D\. Stuckey, Z\. Sun, P\. Tillet, S\. Toizer, F\. Tsimpourlas, N\. Vyas, E\. Wallace, X\. Wang, M\. Wang, O\. Watkins, K\. Weil, A\. Wendling, K\. Whinnery, C\. Whitney, H\. Wong, L\. Yang, Y\. Yang, M\. Yasunaga, K\. Ying, W\. Zaremba, W\. Zhan, C\. Zhang, B\. Zhang, E\. Zhang, and S\. Zhao \(2025\)Gpt\-oss\-120b & gpt\-oss\-20b model card\.External Links:2508\.10925,[Link](https://arxiv.org/abs/2508.10925)Cited by:[Appendix K](https://arxiv.org/html/2606.15345#A11.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.15345#S4.SS1.p1.1)\.
- OpenAI \(2025a\)Deep Research System Card\.External Links:[Link](https://cdn.openai.com/deep-research-system-card.pdf)Cited by:[§1](https://arxiv.org/html/2606.15345#S1.p1.1),[§2](https://arxiv.org/html/2606.15345#S2.SS0.SSS0.Px1.p1.1)\.
- OpenAI \(2025b\)Introducing gpt\-4\.1 in the api\.Note:[https://openai\.com/index/gpt\-4\-1/](https://openai.com/index/gpt-4-1/)Cited by:[§4\.1](https://arxiv.org/html/2606.15345#S4.SS1.p5.1)\.
- OpenAI \(2026\)GPT\-5\.4 Thinking System Card\.External Links:[Link](https://deploymentsafety.openai.com/gpt-5-4-thinking/gpt-5-4-thinking.pdf)Cited by:[Appendix K](https://arxiv.org/html/2606.15345#A11.SS0.SSS0.Px2.p1.1),[Appendix L](https://arxiv.org/html/2606.15345#A12.SS0.SSS0.Px1.p1.1),[§3\.1](https://arxiv.org/html/2606.15345#S3.SS1.p1.1),[§4\.1](https://arxiv.org/html/2606.15345#S4.SS1.p5.1)\.
- T\. Pham, N\. Nguyen, P\. Zunjare, W\. Chen, Y\. Tseng, and T\. Vu \(2026\)SealQA: raising the bar for reasoning in search\-augmented language models\.External Links:2506\.01062,[Link](https://arxiv.org/abs/2506.01062)Cited by:[§2](https://arxiv.org/html/2606.15345#S2.SS0.SSS0.Px1.p1.1)\.
- R\. Qi, F\. Mo, S\. Lu, Y\. Chen, J\. Nie, and K\. Huang \(2026\)CroSearch\-r1: better leveraging cross\-lingual knowledge for retrieval\-augmented generation\.External Links:2604\.25182,[Link](https://arxiv.org/abs/2604.25182)Cited by:[§5](https://arxiv.org/html/2606.15345#S5.p3.1)\.
- Qwen Team \(2026\)Qwen3\.6\-35B\-A3B: agentic coding power, now open to all\.External Links:[Link](https://qwen.ai/blog?id=qwen3.6-35b-a3b)Cited by:[Appendix K](https://arxiv.org/html/2606.15345#A11.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.15345#S4.SS1.p1.1)\.
- A\. Ramponi, M\. Rovera, R\. Moro, and S\. Tonelli \(2025\)Multilingual vs crosslingual retrieval of fact\-checked claims: a tale of two approaches\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 29057–29076\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.1480/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1480),ISBN 979\-8\-89176\-332\-6Cited by:[§2](https://arxiv.org/html/2606.15345#S2.SS0.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2606.15345#S5.p2.1)\.
- L\. Ranaldi, B\. Haddow, and A\. Birch \(2026\)Multilingual retrieval\-augmented generation for knowledge\-intensive question answering task\.InFindings of the Association for Computational Linguistics: EACL 2026,V\. Demberg, K\. Inui, and L\. Marquez \(Eds\.\),Rabat, Morocco,pp\. 697–716\.External Links:[Link](https://aclanthology.org/2026.findings-eacl.35/),[Document](https://dx.doi.org/10.18653/v1/2026.findings-eacl.35),ISBN 979\-8\-89176\-386\-9Cited by:[§5](https://arxiv.org/html/2606.15345#S5.p3.1)\.
- S\. Robertson and H\. Zaragoza \(2009\)The probabilistic relevance framework: bm25 and beyond\.Found\. Trends Inf\. Retr\.3\(4\),pp\. 333–389\.External Links:ISSN 1554\-0669,[Link](https://doi.org/10.1561/1500000019),[Document](https://dx.doi.org/10.1561/1500000019)Cited by:[§4\.1](https://arxiv.org/html/2606.15345#S4.SS1.p1.1)\.
- T\. D\. Team, B\. Li, B\. Zhang, D\. Zhang, F\. Huang, G\. Li, G\. Chen, H\. Yin, J\. Wu, J\. Zhou, K\. Li, L\. Su, L\. Ou, L\. Zhang, P\. Xie, R\. Ye, W\. Yin, X\. Yu, X\. Wang, X\. Wu, X\. Chen, Y\. Zhao, Z\. Zhang, Z\. Tao, Z\. Zhang, Z\. Qiao, C\. Wang, D\. Yu, G\. Fu, H\. Shen, J\. Yang, J\. Lin, J\. Zhang, K\. Zeng, L\. Yang, H\. Yin, M\. Song, M\. Yan, M\. Liao, P\. Xia, Q\. Xiao, R\. Min, R\. Ding, R\. Fang, S\. Chen, S\. Huang, S\. Wang, S\. Cai, W\. Shen, X\. Wang, X\. Guan, X\. Geng, Y\. Shi, Y\. Wu, Z\. Chen, Z\. Li, and Y\. Jiang \(2026\)Tongyi deepresearch technical report\.External Links:2510\.24701,[Link](https://arxiv.org/abs/2510.24701)Cited by:[§2](https://arxiv.org/html/2606.15345#S2.SS0.SSS0.Px1.p1.1)\.
- Tongyi DeepResearch Team \(2025\)Tongyi deepresearch: a new era of open\-source ai researchers\.Note:[https://github\.com/Alibaba\-NLP/DeepResearch](https://github.com/Alibaba-NLP/DeepResearch)Cited by:[Appendix I](https://arxiv.org/html/2606.15345#A9.p1.1)\.
- L\. Wang, N\. Yang, X\. Huang, L\. Yang, R\. Majumder, and F\. Wei \(2024\)Multilingual e5 text embeddings: a technical report\.External Links:2402\.05672,[Link](https://arxiv.org/abs/2402.05672)Cited by:[Appendix K](https://arxiv.org/html/2606.15345#A11.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.15345#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.15345#S4.SS1.p1.1)\.
- J\. Wei, Z\. Sun, S\. Papay, S\. McKinney, J\. Han, I\. Fulford, H\. W\. Chung, A\. T\. Passos, W\. Fedus, and A\. Glaese \(2025\)BrowseComp: a simple yet challenging benchmark for browsing agents\.External Links:2504\.12516,[Link](https://arxiv.org/abs/2504.12516)Cited by:[§1](https://arxiv.org/html/2606.15345#S1.p1.1),[§5](https://arxiv.org/html/2606.15345#S5.p1.1)\.
- Y\. Xu, L\. Hu, J\. Zhao, Z\. Qiu, K\. Xu, Y\. Ye, and H\. Gu \(2025\)A survey on multilingual large language models: corpora, alignment, and bias\.Frontiers of Computer Science19\(11\)\.External Links:ISSN 2095\-2236,[Link](http://dx.doi.org/10.1007/s11704-024-40579-4),[Document](https://dx.doi.org/10.1007/s11704-024-40579-4)Cited by:[§5](https://arxiv.org/html/2606.15345#S5.p4.1)\.
- W\. Xuan, R\. Yang, H\. Qi, Q\. Zeng, Y\. Xiao, A\. Feng, D\. Liu, Y\. Xing, J\. Wang, F\. Gao, J\. Lu, Y\. Jiang, H\. Li, X\. Li, K\. Yu, R\. Dong, S\. Gu, Y\. Li, X\. Xie, F\. Juefei\-Xu, F\. Khomh, O\. Yoshie, Q\. Chen, D\. Teodoro, N\. Liu, R\. Goebel, L\. Ma, E\. Marrese\-Taylor, S\. Lu, Y\. Iwasawa, Y\. Matsuo, and I\. Li \(2025\)MMLU\-prox: a multilingual benchmark for advanced large language model evaluation\.External Links:2503\.10497,[Link](https://arxiv.org/abs/2503.10497)Cited by:[Appendix C](https://arxiv.org/html/2606.15345#A3.p1.1),[§3\.2](https://arxiv.org/html/2606.15345#S3.SS2.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu \(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[§4\.1](https://arxiv.org/html/2606.15345#S4.SS1.p5.1)\.
- P\. Yu, L\. Merrick, G\. Nuti, and D\. Campos \(2024\)Arctic\-embed 2\.0: multilingual retrieval without compromise\.External Links:2412\.04506,[Link](https://arxiv.org/abs/2412.04506)Cited by:[Appendix K](https://arxiv.org/html/2606.15345#A11.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2606.15345#S1.p2.1),[§2](https://arxiv.org/html/2606.15345#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.15345#S4.SS1.p1.1)\.
- Q\. Zeng, Y\. Lu, Z\. Zhou, H\. Qi, P\. Yu, F\. Zhao, H\. Yanaka, W\. Xuan, and N\. Yokoya \(2026\)Code\-switching information retrieval: benchmarks, analysis, and the limits of current retrievers\.External Links:2604\.17632,[Link](https://arxiv.org/abs/2604.17632)Cited by:[§5](https://arxiv.org/html/2606.15345#S5.p2.1)\.
- X\. Zhang, Y\. Zhang, D\. Long, W\. Xie, Z\. Dai, J\. Tang, H\. Lin, B\. Yang, P\. Xie, F\. Huang, M\. Zhang, W\. Li, and M\. Zhang \(2024\)mGTE: generalized long\-context text representation and reranking models for multilingual text retrieval\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track,F\. Dernoncourt, D\. Preoţiuc\-Pietro, and A\. Shimorina \(Eds\.\),Miami, Florida, US,pp\. 1393–1412\.External Links:[Link](https://aclanthology.org/2024.emnlp-industry.103/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-industry.103)Cited by:[§1](https://arxiv.org/html/2606.15345#S1.p2.1),[§2](https://arxiv.org/html/2606.15345#S2.SS0.SSS0.Px2.p1.1)\.
- X\. Zhang, N\. Thakur, O\. Ogundepo, E\. Kamalloo, D\. Alfonso\-Hermelo, X\. Li, Q\. Liu, M\. Rezagholizadeh, and J\. Lin \(2023\)MIRACL: a multilingual retrieval dataset covering 18 diverse languages\.Transactions of the Association for Computational Linguistics11,pp\. 1114–1131\.External Links:[Link](https://aclanthology.org/2023.tacl-1.63/),[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00595)Cited by:[§2](https://arxiv.org/html/2606.15345#S2.SS0.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2606.15345#S5.p1.1),[§5](https://arxiv.org/html/2606.15345#S5.p4.1)\.
- Y\. Zhang, M\. Li, D\. Long, X\. Zhang, H\. Lin, B\. Yang, P\. Xie, A\. Yang, D\. Liu, J\. Lin, F\. Huang, and J\. Zhou \(2025\)Qwen3 embedding: advancing text embedding and reranking through foundation models\.External Links:2506\.05176,[Link](https://arxiv.org/abs/2506.05176)Cited by:[Appendix K](https://arxiv.org/html/2606.15345#A11.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2606.15345#S1.p2.1),[§2](https://arxiv.org/html/2606.15345#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.15345#S4.SS1.p1.1)\.
- P\. Zhou, B\. Leon, X\. Ying, C\. Zhang, Y\. Shao, Q\. Ye, D\. Chong, Z\. Jin, C\. Xie, M\. Cao, Y\. Gu, S\. Hong, J\. Ren, J\. Chen, C\. Liu, and Y\. Hua \(2025\)BrowseComp\-zh: benchmarking web browsing ability of large language models in chinese\.External Links:2504\.19314,[Link](https://arxiv.org/abs/2504.19314)Cited by:[§1](https://arxiv.org/html/2606.15345#S1.p2.1),[§2](https://arxiv.org/html/2606.15345#S2.SS0.SSS0.Px1.p1.1)\.
- B\. Zhu, Q\. Jia, T\. Lan, J\. Ren, F\. Gu, F\. Jiang, L\. Wang, Z\. Xu, and W\. Luo \(2026\)Marco deepresearch: unlocking efficient deep research agents via verification\-centric design\.External Links:2603\.28376,[Link](https://arxiv.org/abs/2603.28376)Cited by:[§2](https://arxiv.org/html/2606.15345#S2.SS0.SSS0.Px1.p1.1)\.
- L\. Zuo, P\. Hong, O\. Kraus, B\. Plank, and R\. Litschko \(2025\)Evaluating large language models for cross\-lingual retrieval\.InFindings of the Association for Computational Linguistics: EMNLP 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 11415–11429\.External Links:[Link](https://aclanthology.org/2025.findings-emnlp.612/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.612),ISBN 979\-8\-89176\-335\-7Cited by:[§2](https://arxiv.org/html/2606.15345#S2.SS0.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2606.15345#S5.p1.1),[§5](https://arxiv.org/html/2606.15345#S5.p2.1)\.

## Appendix AXBCPConstruction Details

Table 8:Language assignment statistics for the cross\-lingual setting\.
Table 9:Per\-language corpus coverage in the multilingual setting\. The 420 English source documents are retained unchanged, and the remaining 4,620 document instances are produced by translation\.

## Appendix BTranslation Prompt

Prompt used for document translationInstruction\.Translate the following document completely into\{target\_language\}\.Translate everything including proper nouns, titles, terminology, and metadata field names according to\{target\_language\}conventions\. For example,name:should become the equivalent in\{target\_language\},birth\_date:should become the equivalent in\{target\_language\}, etc\.Rules\.1\.Ensure cultural appropriateness for\{target\_language\}speakers\.2\.If works such as books, movies, TV shows, songs, or other literary/entertainment titles have well\-known translations in\{target\_language\}, use those established translations\.3\.Preserve all URLs, email addresses, math formulas, and code blocks unchanged\.4\.Output only the translated document\. Do not add explanations\.

## Appendix CTranslation Verification Rubrics

This translation verification rubrics follows the rubrics conducted by MMLU\-ProX\(Xuanet al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib33)\)\.

Prompt used for expert translation verificationInstruction\.You are an expert bilingual evaluator\. Compare the source document with its machine\-translated version in the target language\. Rate the translation on accuracy, fluency, and completeness using the criteria below\. Provide a score from 1 to 5 for each dimension and a brief justification for any score below 5\.Evaluation Criteria for Expert Rating of Machine Translation Results 1\. Accuracy \(1\-5\):•5 \(Highly Accurate\):–All key terms and concepts are translated correctly with no errors\.–Every technical term corresponds precisely to the original text, with no mistranslations or incorrect word choices\.–The most appropriate and professional terminology in the target language is used\.–Expressions align with commonly used terminology in professional or technical contexts\.•4 \(Accurate\):–Most terms and concepts are translated correctly, with only a few minor errors that do not affect overall comprehension\.–Some terms may be slightly imprecise, but the translation remains generally accurate\.–Uses appropriate terminology in the target language in most cases\.–A few terms may be simplified but remain understandable within the intended domain\.•3 \(Moderately Accurate\):–Key terms and concepts are mostly correct but contain some errors that may cause partial misunderstandings\.–Some critical terms are inaccurately translated, requiring the reader to infer the intended meaning\.–Slight deviations in the use of target\-language terminology\.–Occasionally uses uncommon or outdated terms\.•2 \(Somewhat Inaccurate\):–Many key terms and concepts are mistranslated, significantly affecting comprehension\.–Important concepts are incorrectly translated, leading to potential misunderstandings of the original text\.–Uses incorrect or inappropriate terminology in the target language\.–Terminology is inconsistent, reducing the text’s professionalism\.•1 \(Inaccurate\):–Frequent and severe mistranslations of key terms and concepts, failing to convey the original meaning\.–Most of the content does not match the original text\.–Lacks proper use of target\-language terminology\.–Terminology is chaotic, possibly using irrelevant or incorrect vocabulary entirely\.2\. Fluency \(1–5\):•5 \(Highly Fluent\):–The target\-language expression is natural and smooth, making it effortless to read\.–The language style is refined and appropriate for professional or formal contexts\.–The sentence structure fully adheres to natural conventions in the target language, with no grammatical or lexical errors\.•4 \(Fluent\):–The target\-language expression is generally natural, with only minor linguistic imperfections that do not affect comprehension\.–Some sentences may sound slightly stiff\.–Sentence structures mostly conform to target\-language norms, with very few grammatical errors\.•3 \(Moderately Fluent\):–The target\-language expression is somewhat unnatural, requiring the reader to adjust their understanding slightly\.–Some inappropriate word choices or rigid sentence structures are present\.–Sentence structures are mostly correct, but some grammatical errors exist\.•2 \(Somewhat Unnatural\):–The target\-language expression lacks fluency, making it difficult to read smoothly\.–Sentence transitions are awkward, and logical connections are unclear\.–Many structural issues exist, with frequent grammatical errors\.•1 \(Not Fluent\):–The target\-language expression is highly unnatural or difficult to understand\.–Literal translation is evident, lacking natural phrasing in the target language\.–The sentence structure is disorganized, with severe grammatical mistakes, making the text unreadable\.3\. Completeness \(1–5\):•5 \(Fully Complete\):–The full meaning of the original text is retained with no omissions or additions\.–All details, data, and annotations are accurately conveyed\.–The translation maintains the same length and depth as the original text\.•4 \(Complete\):–The primary meaning of the original text is retained, with only a few minor details omitted or slightly unclear\.–Some less critical information may be left out\.–The translation generally corresponds to the original content\.•3 \(Moderately Complete\):–Most of the original meaning is conveyed, but some information is missing or added\.–Important details may be overlooked\.–The translation differs from the original in certain aspects, requiring readers to infer some content\.•2 \(Somewhat Incomplete\):–The core information from the original text is not fully conveyed, with noticeable omissions or unnecessary additions\.–Potential inclusion of unrelated information\.–The translation does not fully correspond to the original, affecting comprehension\.•1 \(Incomplete\):–Significant omissions or added incorrect information prevent an accurate reflection of the original text\.–Important sections or sentences are missing\.–The translation deviates heavily from the original, making it difficult to understand the intended meaning\.

## Appendix DTranslation Verification Results

Table[10](https://arxiv.org/html/2606.15345#A4.T10)in this appendix report per\-language translation results in our corpora\. For each language in translation, we adopt three dimensions in evaluation: accuracy, fluency and completeness\. Each language evaluation has 200 samples and the results are reported in average value\.

Table 10:Per\-language translation verification results\. All values are on 1–5 scale\.
## Appendix EJudge Prompt

Prompt used for LLM\-as\-JudgeJudge whether the following\[response\]to\[question\]is correct or not based on the precise and unambiguous\[correct\_answer\]below\.\[question\]:\{question\}\[response\]:\{response\}\[correct\_answer\]:\{correct\_answer\}The evidence documents used to answer this question are in another language\. As a result, the extracted answer may be written in another language rather than English\. The\[correct\_answer\]is in English\. You must judge whether the extracted answer and the correct answer refer to the same entity, concept, or value, regardless of language differences\. For example, “ハーバード大学” and Harvard University, “迈克尔·乔丹“ and “Michael Jordan”, or “서울” and ”Seoul” should be considered equivalent\.Your judgement must be in the format and criteria specified below:1\.extracted\_final\_answer: The final exact answer extracted from the\[response\]\. Put the extracted answer as “None” if there is no exact, final answer to extract from the response\.2\.reasoning: Explain why theextracted\_final\_answeris correct or incorrect based on\[correct\_answer\], focusing only on whether they refer to the same entity or value\. If they are in different languages, determine whether they are translations or transliterations of each other\. Do not comment on any background to the problem, do not attempt to solve the problem, do not argue for any answer different than\[correct\_answer\], focus only on whether the answers match\.3\.correct: Answer “yes” ifextracted\_final\_answermatches the\[correct\_answer\]given above, or is a translation/transliteration of it, or is within a small margin of error for numerical problems\. Answer “no” otherwise, i\.e\. if there is any inconsistency, ambiguity, non\-equivalency, or if the extracted answer is incorrect\.4\.confidence: The extracted confidence score between 0% and 100% from\[response\]\. Put 100 if there is no confidence score available\.

## Appendix FDecomposing the Agent Cross\-lingual Bottleneck

Our oracle experiments show that providing gold evidence directly to the agent does not fully recover monolingual performance, revealing an*agent\-side*cross\-lingual bottleneck\. A natural follow\-up question is whether this bottleneck arises because the agent must reason over non\-English evidence, or because it must also switch between an English prompt and non\-English content\. To disentangle these factors, we introduce a fully target\-language oracle variant \(Oracle\-tq\+tp\), in which the system prompt, the query, and the evidence documents are all presented in the target language\. This removes any language switching and tests whether a monolingual non\-English environment helps the agent reason more effectively\.

Table 11:Oracle accuracy \(%\) on the cross\-lingual corpus under three prompt–evidence language configurations\. EN Oracle: English prompt \+ English evidence \(upper bound\)\. Oracle: English prompt \+ target\-language evidence\. Oracle\-tq\+tp: target\-language prompt \+ target\-language evidence\.Table[11](https://arxiv.org/html/2606.15345#A6.T11)shows that, contrary to our expectation, Oracle\-tq\+tp performs*worse*than the standard Oracle with English prompt: GPT\-OSS\-20B drops by 5\.92 pp and GPT\-OSS\-120B by 5\.38 pp\. The agent reasons less effectively when the prompt is also in the target language, even though language switching is eliminated\. This reveals that the agent’s cross\-lingual weakness has two distinct components:

1. 1\.Evidence understanding bottleneck\(EN Oracle→\\toOracle\): the agent loses 12\.77 pp \(20B\) / 9\.42 pp \(120B\) from reading non\-English evidence, even under English instructions\.
2. 2\.Prompt language penalty\(Oracle→\\toOracle\-tq\+tp\): switching the prompt to the target language costs an additional 5\.92 pp \(20B\) / 5\.38 pp \(120B\), indicating that these models follow instructions more reliably in English\.

These results have two implications\. First, the agent bottleneck is*intrinsic*to the model’s multilingual reasoning capability, not a surface\-level language\-switching artifact\. Providing a fully monolingual target\-language environment does not help; it makes things worse\. Second, English serves as the agent’s “native language” for instruction following: even when all content is non\-English, the agent benefits from receiving its task description in English\. This suggests that improving cross\-lingual agent performance requires stronger multilingual pretraining, not prompt translation\.

## Appendix GCitation Precision Error Analysis

GPT\-OSS\-120Bexhibits the steepest citation precision drop among all agents: from 50\.89% on the original corpus to 24\.30% \(multilingual\) and 26\.26% \(cross\-lingual\), a reduction of roughly 50% \(Table[3](https://arxiv.org/html/2606.15345#S4.T3)\)\. To diagnose this degradation, we classify every query whereGPT\-OSS\-120Bmade citations but failed to cite any gold evidence document into two mutually exclusive error types: \(1\) the agent retrieved at least one gold document but cited other documents instead \(*mapping failure*\); \(2\) no gold document was retrieved and the agent cited English negative documents instead \(*no gold retrieved*\)\. We includeGPT\-OSS\-20BandQwen3\.6\-35B\-A3Bas reference points in Table[12](https://arxiv.org/html/2606.15345#A7.T12)\.

Table 12:Citation error classification withQwen3\-Embedding\-8B\. Prec\. is citation precision \(%\) among queries with citations\. Errors is the number of queries that cited zero gold documents\. Map\.Fail: gold was retrieved but agent cited other documents\. No Gold: no gold document was retrieved\. Percentages in parentheses sum to 100% within each row\.ForGPT\-OSS\-120B, the dominant error type is*no gold retrieved*, accounting for 57\.08% of errors on the original corpus and rising to 66\.13% on the multilingual corpus\. In these cases, the retriever never surfaced the gold document during the agent’s search trajectory, so the agent cited English negative documents that appeared topically related but did not contain the correct evidence\. Mapping failures account for the remaining 33\.87–42\.92% of errors and decline as a share after translation, not because the agent improves at citation mapping, but because fewer gold documents are retrieved in the first place\.

Compared withGPT\-OSS\-20BandQwen3\.6\-35B\-A3B,GPT\-OSS\-120Bhas substantially more total errors \(226–272 vs\. 108–172\)\. This is driven by its higher citation coverage \(60\.6% vs\. 50\.4% and 41\.5%\): the 120B model cites documents more frequently, creating more opportunities for incorrect citations\.

## Appendix HAdditional Per\-Language Results

All tables in this appendix report per\-language results in the cross\-lingual setting\. Q3\-4B and Q3\-8B denoteQwen3\-Embedding\-4BandQwen3\-Embedding\-8B, respectively\.

Table 13:Per\-language tool\-based accuracy forGPT\-OSS\-20B\. All values are percentages\.Table 14:Per\-language tool\-based accuracy forGPT\-OSS\-120B\. All values are percentages\.Table 15:Per\-language tool\-based accuracy forQwen3\.6\-35B\-A3B\. All values are percentages\.Table 16:Per\-language evidence recall forGPT\-OSS\-20B\. All values are percentages\.Table 17:Per\-language evidence recall forGPT\-OSS\-120B\. All values are percentages\.Table 18:Per\-language evidence recall forQwen3\.6\-35B\-A3B\. All values are percentages\.Table 19:Per\-language tool\-based performance forDeepSeek\-V4\-Pro\. Acc\. and Rec\. denote accuracy and evidence recall; all values are percentages\.Table 20:Per\-language oracle accuracy in the cross\-lingual setting\. All values are percentages\.
## Appendix ITongyi\-DeepResearch Results

We additionally evaluateTongyi\-DeepResearch\-30B\-A3B\(Tongyi DeepResearch Team,[2025](https://arxiv.org/html/2606.15345#bib.bib10)\), a deep research agent built on a Qwen3\-based MoE architecture\. Unlike the other agents in our study, Tongyi uses an in\-band ReAct\-style tool calling protocol with<tool\_call\>XML tags rather than the OpenAI function\-calling API\. Table[21](https://arxiv.org/html/2606.15345#A9.T21)reports its performance with BM25 andQwen3\-Embedding\-8Bacross all three corpus conditions\. Tongyi’s ReAct\-style output format does not reliably produce per\-query confidence scores despite prompt\-level instructions, making the metric calibration error unreliable\. Therefore, we exclude it from our main results but put it in the appendix for reference\.

Table 21:Tongyi\-DeepResearch\-30B\-A3Bresults\. Acc\. and Ev\.Rec\. are percentages; Search is average search calls per query\. Q3\-8B denotesQwen3\-Embedding\-8B\. Calibration error is omitted because Tongyi’s ReAct\-style output format does not reliably produce per\-query confidence scores despite prompt\-level instructions, making the metric unreliable\.Tongyi achieves 39\.64% accuracy on the original corpus withQwen3\-Embedding\-8B, the highest among all agents at comparable parameter counts\. Its evidence recall \(58\.12%\) also exceedsGPT\-OSS\-20B\(42\.91%\) andQwen3\.6\-35B\-A3B\(43\.14%\)\. After translation, accuracy drops by 13\.50–14\.58 pp withQwen3\-Embedding\-8B, a smaller relative degradation thanGPT\-OSS\-20B\(20\.84–20\.96 pp\)\.

## Appendix JInference Hyperparameters

For each agent we follow the generation configuration recommended by the model release, applied uniformly across all corpus conditions and evidence languages\.GPT\-OSS\-20BandGPT\-OSS\-120Bare served locally with vLLM in temperature1\.01\.0, top\-pp1\.01\.0\.Qwen3\.6\-35B\-A3Bis served locally with vLLM in temperature0\.70\.7, top\-pp0\.80\.8\.DeepSeek\-V4\-Prois accessed through its official API in default settings\(temperature1\.01\.0, top\-pp1\.01\.0\)\. All other generation parameters are left at each model’s default value\.

## Appendix KLicense Statement

##### BrowseComp\-Plus\.

Our benchmark,XBCP, is derived from BrowseComp\-Plus\(Chenet al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib1)\), which is released under the MIT License\. We use BrowseComp\-Plus in accordance with the MIT License terms, retaining the original copyright notice and license text in all derived artifacts\.

##### Models\.

We use the following models under their respective licenses:GPT\-OSS\-20B\(OpenAIet al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib24)\),GPT\-OSS\-120B\(OpenAIet al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib24)\),Qwen3\-Embedding\-4B,Qwen3\-Embedding\-8B\(Zhanget al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib7)\), andArctic\-Embed\-L\-2\.0\(Yuet al\.,[2024](https://arxiv.org/html/2606.15345#bib.bib4)\)are released under the Apache License 2\.0;Multilingual\-E5\-Large\(Wanget al\.,[2024](https://arxiv.org/html/2606.15345#bib.bib19)\)is released under the MIT License\.Qwen3\.6\-35B\-A3B\(Qwen Team,[2026](https://arxiv.org/html/2606.15345#bib.bib25)\)is released under the Apache License 2\.0 and is used locally via vLLM\.DeepSeek\-V4\-Pro\(DeepSeek\-AI,[2026](https://arxiv.org/html/2606.15345#bib.bib26)\), whose model weights are released under the MIT License, is accessed in our experiments through its official API under the DeepSeek Open Platform Terms of Service\.GPT\-5\.4\(OpenAI,[2026](https://arxiv.org/html/2606.15345#bib.bib32)\), a proprietary model accessible only through OpenAI’s API, is used solely to generate translations for theXBCPevidence corpora; its outputs are used in accordance with OpenAI’s Terms of Use, which grant users ownership of model outputs subject to OpenAI’s usage policies\. All use of these models is for non\-commercial academic research\.

##### Release\.

We will releaseXBCPunder the MIT License, consistent with the license of the underlying BrowseComp\-Plus benchmark\. The release will include the translated evidence corpora, query–language assignments, and evaluation scripts, with attribution to BrowseComp\-Plus and to each model whose outputs contributed to the construction of the benchmark\.

## Appendix LGenAI Statement

We disclose the use of generative AI tools in this work in accordance with the ACL Policy on the Use of AI Writing Assistance\.

##### AI use in research artifacts\.

Generative AI played a central role in constructing theXBCPbenchmark\. Specifically, we usedGPT\-5\.4\(OpenAI,[2026](https://arxiv.org/html/2606.15345#bib.bib32)\)as the translation engine to render the English evidence documents of BrowseComp\-Plus\(Chenet al\.,[2025](https://arxiv.org/html/2606.15345#bib.bib1)\)into the eleven non\-English target languages used in our cross\-lingual and multilingual corpora\. The exact prompt is provided in Appendix[B](https://arxiv.org/html/2606.15345#A2)\. Translation quality was assessed through expert human verification on samples of all eleven non\-English using the rubric in Appendix[C](https://arxiv.org/html/2606.15345#A3); we discuss the implications and limitations of automatic translation in the Limitations section\.

##### AI use in experiments\.

The agents and retrievers evaluated in this work are themselves LLM\-based or neural systems \(GPT\-OSS\-20B,GPT\-OSS\-120B,Qwen3\.6\-35B\-A3B,DeepSeek\-V4\-Pro, and four multilingual embedding models\)\. Their use is the subject of study rather than an auxiliary tool, and is fully described in Section 4\.

##### AI use in writing\.

We used AI assistants \(Claude and ChatGPT\) for surface\-level writing support, including grammar correction, sentence\-level rephrasing for clarity and concision, and LaTeX formatting suggestions\. All scientific claims, experimental design choices, analyses, and conclusions are authored and verified by the human authors\. AI assistants were not used to generate citations, statistical results, or any factual content reported in this paper\.

##### Responsibility\.

The authors take full responsibility for the content of this paper, including any text that may have been initially drafted or edited with AI assistance\.

## Appendix MEthics

XBCP is a translation\-based benchmark for evaluating deep research agents\. Translations are produced by GPT\-5\.4, and despite expert verification on a sample, residual translation artifacts may propagate into low\-resource\-language evaluation, potentially under\- or over\-estimating system performance for those languages\. XBCP is derived from BrowseComp\-Plus, which is built from publicly available web documents\. We do not collect new personal data from individuals\. The benchmark therefore inherits the question scope of BrowseComp\-Plus and is intended for research evaluation, not for deployment\-grade safety claims\.

Expert bilingual annotators were recruited through commercial language\-service companies\. They were compensated according to standard professional translation\-evaluation rates\.
Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus

Similar Articles

EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

Code-Switching Information Retrieval: Benchmarks, Analysis, and the Limits of Current Retrievers

Cross-Lingual Exploration for Parametric Knowledge

Cross-Lingual Consensus: Aligning Multilingual Cultural Knowledge via Multilingual Self-Consistency

MMed-Bench-IR: A Heterogeneous Benchmark for Multilingual Medical Information Retrieval

Submit Feedback

Similar Articles

EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge
Code-Switching Information Retrieval: Benchmarks, Analysis, and the Limits of Current Retrievers
Cross-Lingual Exploration for Parametric Knowledge
Cross-Lingual Consensus: Aligning Multilingual Cultural Knowledge via Multilingual Self-Consistency
MMed-Bench-IR: A Heterogeneous Benchmark for Multilingual Medical Information Retrieval