SEATauBench: Adapting Tool-Agent-User Evaluation Into Low-Resource Southeast Asian Languages


Summary

Introduces SEATauBench, the first agent-focused evaluation framework for Southeast Asian languages, adapting τ²-Bench to Mandarin, Vietnamese, Thai, Indonesian, and Filipino, and reveals a significant capability gap when moving from English to localized settings.

arXiv:2606.28715v1

Abstract: While AI development and evaluation for Southeast Asia (SEA) has grown rapidly, agent capabilities in regional languages are still poorly understood despite their importance to sovereign AI. To fill this gap, we introduce SEATauBench, the first agent-focused evaluation framework for SEA sovereign AI. SEATauBench adapts τ²-Bench to five languages -- Mandarin, Vietnamese, Thai, Indonesian, and Filipino -- and evaluates agents across progressively localized settings that vary the language of user-agent interaction, tool specifications, and task domains. Across three recent models, we find that English agent capabilities transfer reasonably well when only the conversation language changes, but quality and robustness degrade sharply as more task contexts are localized, with the largest losses in full domain adaptation. We also show the limits of English-only agent assessment for measuring agent capabilities in SEA languages. More broadly, SEATauBench provides a diagnostic benchmark and reusable adaptation pipeline for building reliable multilingual agents for linguistically diverse regions. Data and code can be accessed at github.com/SEACrowd/SEATauBench.

# Adapting Tool-Agent-User Evaluation Into Low-Resource Southeast Asian Languages
Source: [https://arxiv.org/html/2606.28715](https://arxiv.org/html/2606.28715)
My Chiffon Nguyen¹, Aulia Adila¹, Saksorn Ruangtanusak¹˒², Kittiphat Leesombatwathana¹˒³, Vissuta Gunawan Lim¹, Patomporn Payoungkhamdee¹˒⁴, Samuel Cahyawijaya¹˒⁵

¹SEACrowd, ²SCB DataX, SCBX Group, ³Chulalongkorn University, ⁴VISTEC, ⁵Cohere

{chiffonng136, auliaadila036, vglim3653}@gmail.com, saksorn.ruangtanusak@data-x.ai, 6534404823@student.chula.ac.th, patomporn.p_s21@vistec.ac.th, samuelcahyawijaya@cohere.com

###### Abstract

While AI development and evaluation for Southeast Asia (SEA) has grown rapidly, agent capabilities in regional languages are still poorly understood despite their importance to sovereign AI. To fill this gap, we introduce SEATauBench (pronounced "si-tau-bench", similar to the Filipino word for string beans, "sitaw"), the first agent-focused evaluation framework for SEA sovereign AI. SEATauBench adapts $\tau^{2}$-Bench to five languages (Mandarin, Vietnamese, Thai, Indonesian, and Filipino) and evaluates agents across progressively localized settings that vary the language of user-agent interaction, tool specifications, and task domains. Across three recent models, we find that English agent capabilities transfer reasonably well when only the conversation language changes, but quality and robustness degrade sharply as more task contexts are localized, with the largest losses in full domain adaptation. We also show the limits of English-only agent assessment for measuring agent capabilities in SEA languages. More broadly, SEATauBench provides a diagnostic benchmark and reusable adaptation pipeline for building reliable multilingual agents for linguistically diverse regions. Data and code can be accessed at [github.com/SEACrowd/SEATauBench](https://github.com/SEACrowd/SEATauBench).

![[Uncaptioned image]](https://arxiv.org/html/2606.28715v1/sitaw_logo.png)


## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.28715v1/figs/fig-1-real.jpg)

Figure 1: SEATauBench exposes a critical English-SEA agentic capability gap in existing proprietary and open-source LLMs across progressively localized evaluation scenarios. This evidence exposes the unreliability of existing English-centric benchmarks in reflecting the actual capabilities of LLMs for sovereign AI adoption.

Sovereign artificial intelligence (AI) has become critical for nations seeking to maintain autonomy in their digital futures Chae et al. ([2025](https://arxiv.org/html/2606.28715#bib.bib4)), including those in Southeast Asia. As articulated in Mushkani et al. ([2025](https://arxiv.org/html/2606.28715#bib.bib58)); Barasa et al. ([2026](https://arxiv.org/html/2606.28715#bib.bib59)), sovereign AI encompasses not only technological self-reliance but also cultural and linguistic relevance, a vital dimension for the more than 700 million people whose linguistic diversity is poorly represented by English-centric development and evaluation Bhandari and Modi ([2026](https://arxiv.org/html/2606.28715#bib.bib61)); Putra ([2024](https://arxiv.org/html/2606.28715#bib.bib63)).

SEA-specific language evaluation has expanded rapidly, with SEA-Exam and SEA-Bench Liu et al. ([2025](https://arxiv.org/html/2606.28715#bib.bib49)); Zhang et al. ([2024](https://arxiv.org/html/2606.28715#bib.bib64)); Nguyen et al. ([2024](https://arxiv.org/html/2606.28715#bib.bib65)), SEA-VL Cahyawijaya et al. ([2025a](https://arxiv.org/html/2606.28715#bib.bib52), [2026](https://arxiv.org/html/2606.28715#bib.bib53)), SEACrowd Lovenia et al. ([2024](https://arxiv.org/html/2606.28715#bib.bib47)), SEA-HELM Susanto et al. ([2025](https://arxiv.org/html/2606.28715#bib.bib50)), NusaCrowd Cahyawijaya et al. ([2023a](https://arxiv.org/html/2606.28715#bib.bib51)), NusaWrites Cahyawijaya et al. ([2023b](https://arxiv.org/html/2606.28715#bib.bib44)), and NusaX Winata et al. ([2023](https://arxiv.org/html/2606.28715#bib.bib43)) establishing valuable foundations for measuring regional language understanding, cultural knowledge, reasoning, safety, and multimodal capabilities. Nevertheless, these benchmarks primarily assess static model behaviors. Limited work evaluates whether agents can complete multi-turn, tool-mediated tasks in regional languages, a capability that matters as more AI systems operate in real-world deployments such as customer service, commerce, and travel Budzianowski et al. ([2018](https://arxiv.org/html/2606.28715#bib.bib26)); Eric et al. ([2020](https://arxiv.org/html/2606.28715#bib.bib27)); Zang et al. ([2020](https://arxiv.org/html/2606.28715#bib.bib28)); Han et al. ([2021](https://arxiv.org/html/2606.28715#bib.bib29)); Ye et al. ([2022](https://arxiv.org/html/2606.28715#bib.bib30)).

To address this gap, we introduce SEATauBench, the first agent-focused evaluation framework for SEA. SEATauBench adapts $\tau^{2}$-Bench Yao et al. ([2024](https://arxiv.org/html/2606.28715#bib.bib8)); Barres et al. ([2025](https://arxiv.org/html/2606.28715#bib.bib56)); Shi et al. ([2026](https://arxiv.org/html/2606.28715#bib.bib25)); Ray et al. ([2026](https://arxiv.org/html/2606.28715#bib.bib57)) into five target languages, denoted L2 (Mandarin Chinese, Vietnamese, Thai, Indonesian, Filipino), and evaluates agents across three progressively localized settings: (1) L2 Interaction, isolating linguistic capability in user-agent conversation; (2) L2 Tool, testing the ability to use tools with non-English tool specifications; and (3) L2 Domain, evaluating performance when all task contexts are in L2 ([Section 3.3](https://arxiv.org/html/2606.28715#S3.SS3)). To translate the different interfaces an AI agent interacts with without breaking the execution of $\tau^{2}$-Bench, we develop a structured, non-breaking translation pipeline ([Section 3.2](https://arxiv.org/html/2606.28715#S3.SS2)).

Across three recent agent models, we find that English agent capabilities transfer reasonably well when agents only need to respond in target languages, but quality and robustness degrade sharply once tools, policies, and task contexts are progressively provided in SEA languages \([Section5\.1](https://arxiv.org/html/2606.28715#S5.SS1)\)\. These results expose a gap between the growth of SEA evaluation resources and the readiness of current agents for sovereign AI deployment, establishing SEATauBench as a diagnostic benchmark for building reliable multilingual agents for the region\.

![Refer to caption](https://arxiv.org/html/2606.28715v1/x1.png)

Figure 2: Overview of the automated translation pipeline for generating multilingual $\tau^{2}$-Bench artifacts ([Section 3.2](https://arxiv.org/html/2606.28715#S3.SS2)). We provide more details about the pipeline and the resulting translated artifacts in Appendix [A](https://arxiv.org/html/2606.28715#A1).

## 2 Related Works

### 2.1 Agent Evaluation

Task-oriented dialogue benchmarks such as MultiWOZ Budzianowski et al. ([2018](https://arxiv.org/html/2606.28715#bib.bib26)); Eric et al. ([2020](https://arxiv.org/html/2606.28715#bib.bib27)); Zang et al. ([2020](https://arxiv.org/html/2606.28715#bib.bib28)); Han et al. ([2021](https://arxiv.org/html/2606.28715#bib.bib29)); Ye et al. ([2022](https://arxiv.org/html/2606.28715#bib.bib30)) and MASSIVE FitzGerald et al. ([2023](https://arxiv.org/html/2606.28715#bib.bib31)) established evaluation for goal-directed dialogue and multilingual intent-slot understanding. Recent tool-use benchmarks, including ToolEval Qin et al. ([2024](https://arxiv.org/html/2606.28715#bib.bib32)) and BFCL Patil et al. ([2025](https://arxiv.org/html/2606.28715#bib.bib12)), extend this direction to function calling and API use. However, deployed service agents must further sustain multi-turn interactions, follow domain policies, and update external states. $\tau^{2}$-Bench Yao et al. ([2024](https://arxiv.org/html/2606.28715#bib.bib8)); Barres et al. ([2025](https://arxiv.org/html/2606.28715#bib.bib56)); Shi et al. ([2026](https://arxiv.org/html/2606.28715#bib.bib25)); Ray et al. ([2026](https://arxiv.org/html/2606.28715#bib.bib57)) addresses this through simulated user-agent-tool environments, scored by the task completion metrics $\text{pass}^{1}$ and $\text{pass}^{3}$. Related benchmarks further assess realistic agent behavior in business and professional settings Huang et al. ([2025](https://arxiv.org/html/2606.28715#bib.bib33)); Drouin et al. ([2024](https://arxiv.org/html/2606.28715#bib.bib35)); Boisvert et al. ([2024](https://arxiv.org/html/2606.28715#bib.bib36)); Xu et al. ([2026](https://arxiv.org/html/2606.28715#bib.bib34)); Patwardhan et al. ([2026](https://arxiv.org/html/2606.28715#bib.bib37)).

Despite these advances, existing agent benchmarks remain largely centered on high-resource languages, limiting their applicability to sovereign AI settings. SEATauBench addresses this gap by extending $\tau^{2}$-Bench to SEA languages through localized conversational and tool-use scenarios. To our knowledge, it is the first multilingual benchmark to preserve the task-oriented evaluation framework of $\tau^{2}$-Bench, contributing towards reliable assessment of agent capabilities in multilingual contexts.

### 2.2 Multilingual Evaluation

Multilingual evaluation has mainly targeted understanding, reasoning, translation, and multimodal comprehension through benchmarks such as mMMLU Hendrycks et al. ([2021](https://arxiv.org/html/2606.28715#bib.bib38)); OpenAI ([2024](https://arxiv.org/html/2606.28715#bib.bib55)), GlobalMMLU Singh et al. ([2025](https://arxiv.org/html/2606.28715#bib.bib39)), GlotEval Luo et al. ([2025](https://arxiv.org/html/2606.28715#bib.bib40)), CVQA Romero et al. ([2024](https://arxiv.org/html/2606.28715#bib.bib41)), and AyaVisionBench CohereLabs ([2025](https://arxiv.org/html/2606.28715#bib.bib42)). For Southeast Asian languages, resources such as NusaX Winata et al. ([2023](https://arxiv.org/html/2606.28715#bib.bib43)), NusaWrites Cahyawijaya et al. ([2023b](https://arxiv.org/html/2606.28715#bib.bib44)), NusaDialogue Purwarianti et al. ([2025](https://arxiv.org/html/2606.28715#bib.bib45)), IndoTOD Kautsar et al. ([2023](https://arxiv.org/html/2606.28715#bib.bib46)), SEACrowd Lovenia et al. ([2024](https://arxiv.org/html/2606.28715#bib.bib47)), StingrayBench Cahyawijaya et al. ([2025b](https://arxiv.org/html/2606.28715#bib.bib48)), SEAExam/SEABench Liu et al. ([2025](https://arxiv.org/html/2606.28715#bib.bib49)), SEA-HELM Susanto et al. ([2025](https://arxiv.org/html/2606.28715#bib.bib50)), SEA-VL Cahyawijaya et al. ([2025a](https://arxiv.org/html/2606.28715#bib.bib52)), and SEA-VQA Urailertprasert et al. ([2024](https://arxiv.org/html/2606.28715#bib.bib54)) expand evaluation to local linguistic, cultural, reasoning, safety, and multimodal contexts. However, these benchmarks largely focus on static evaluation rather than interactive tool use or multi-turn task completion.

The closest work is MASSIVE-Agents Kulkarni et al. ([2025](https://arxiv.org/html/2606.28715#bib.bib23)), which evaluates function calling across 52 languages using BFCL, but remains limited to function selection and argument prediction. SEATauBench goes further by evaluating full user-agent-tool interaction, requiring agents to interact over multiple turns, follow instructions, use tools, and complete realistic tasks in SEA languages. It therefore bridges multilingual function calling and interactive agent evaluation, addressing a gap left by existing SEA multilingual and agent evaluation.

## 3 Adapting $\tau^{2}$-Bench to SEATauBench

### 3.1 Background

We construct SEATauBench by extending the English tool-agent-user benchmark $\tau^{2}$-Bench Barres et al. ([2025](https://arxiv.org/html/2606.28715#bib.bib56)) to multilingual settings. This adaptation must account for two coupled agent-facing surfaces: interaction content (task definitions, domain policies and workflows, and structured databases) and executable interfaces (tool schemas and, in telecom, return messages).

At runtime, the agent must read policies, reason about tasks, call tools, inspect outputs, and talk to a simulated user in the target language. The benchmark therefore needs a localized interface whose visible text is translated, while the execution layer still preserves canonical English values. Our pipeline aligns these representations to preserve task semantics and metric comparability.

[Section 3.2](https://arxiv.org/html/2606.28715#S3.SS2) describes how we translate and construct L2 artifacts, and [Section 3.3](https://arxiv.org/html/2606.28715#S3.SS3) uses those artifacts to define controlled evaluation scenarios with increasing levels of L2 adaptation.

### 3.2 SEATauBench L2 Adaptation Pipeline

[Figure 2](https://arxiv.org/html/2606.28715#S1.F2) summarizes our two-phase adaptation pipeline: offline translation, which materializes language-specific assets, and runtime localization, which patches the environment observed by the agent. The full details are provided in Appendix [A](https://arxiv.org/html/2606.28715#A1).

We adapt three domains from $\tau^{2}$-Bench (retail, airline, and telecom) into five L2 languages: Vietnamese (vi), Indonesian (id), Thai (th), Filipino (tl), and Mandarin Chinese (zh). (We do not include other SEA languages such as Malay, Lao, or Cambodian due to limited annotator capacity.)

#### Offline Translation\.

We translate static domain assets to construct L2 artifacts. The pipeline extracts natural-language spans from tasks, policies, databases, tool docstrings, schemas, and tool-return templates, while masking executable tokens such as IDs, status values, tool names, and structural markers. It first translates schema literals to establish a glossary, preventing the same executable value from being rendered inconsistently. The pipeline then writes outputs with format-specific writers and records per-language manifests with model metadata and source-file SHA-256 fingerprints. Appendix [A.5](https://arxiv.org/html/2606.28715#A1.SS5) reports the resulting artifact statistics.
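The masking and fingerprinting steps can be illustrated with a minimal sketch. This is not the authors' implementation: the regular expression, the `<<n>>` placeholder format, and the function names `mask`, `unmask`, and `fingerprint` are all assumptions made for illustration. Only the use of SHA-256 for source-file fingerprints comes from the text.

```python
import hashlib
import re

# Hypothetical pattern for executable tokens: IDs such as "#W1234567"
# and snake_case tool names such as "get_order_details".
EXECUTABLE = re.compile(r"#\w+|\b[a-z]+(?:_[a-z0-9]+)+\b")


def mask(text):
    """Swap executable tokens for numbered placeholders before translation."""
    mapping = {}

    def repl(match):
        key = f"<<{len(mapping)}>>"
        mapping[key] = match.group(0)
        return key

    return EXECUTABLE.sub(repl, text), mapping


def unmask(text, mapping):
    """Restore the original executable tokens after translation."""
    for key, token in mapping.items():
        text = text.replace(key, token)
    return text


def fingerprint(source_bytes):
    """SHA-256 hex digest of a source file, as recorded in a manifest."""
    return hashlib.sha256(source_bytes).hexdigest()


source = "Call get_order_details for order #W1234567 before cancelling."
masked, mapping = mask(source)
# A translator would now see only the masked text; the tool name and ID
# cannot be altered because they are no longer present in the string.
restored = unmask(masked, mapping)
manifest_entry = {"sha256": fingerprint(source.encode("utf-8"))}
```

Masking before translation means executable values are restored byte for byte rather than re-derived from translated output, and the stored digest lets a later run detect that a source file has changed.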

#### Runtime Localization\.

The runtime localization phase handles dynamic content exposed during inference\. First, it localizes the tool schema shown to the agent: descriptions, enum choices, and examples are rendered in L2 so the scenario tests target\-language tool use rather than English schema reading, while the underlying implementation remains unchanged\. Second, it preserves executability by normalizing localized arguments back to canonical English values before tool calls; otherwise, correct L2 arguments could fail only because the original tools expect English literals\. After execution, tool responses are localized back into L2, and final payloads are canonicalized again before scoring, keeping the interaction monolingual for the agent while making metrics comparable across languages and with the English benchmark\. Together, offline translation and runtime localization let us vary which benchmark surfaces are exposed in L2, forming the scenario design\.
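The round trip described above (localized arguments in, canonical values to the tool, localized results out) can be sketched as follows. This is a minimal sketch under stated assumptions, not the benchmark's code: the glossary contents, the Indonesian strings, and the names `localize`, `canonicalize`, and `find_orders` are hypothetical.

```python
# Hypothetical glossary for one enum field; the Indonesian strings are
# illustrative only and are not taken from the benchmark.
GLOSSARY = {"status": {"pending": "tertunda", "delivered": "terkirim"}}


def localize(payload, glossary):
    """Canonical English -> L2, for content shown to the agent."""
    return {k: glossary.get(k, {}).get(v, v) for k, v in payload.items()}


def canonicalize(payload, glossary):
    """L2 -> canonical English, applied to arguments before a tool call."""
    out = {}
    for key, value in payload.items():
        reverse = {l2: en for en, l2 in glossary.get(key, {}).items()}
        out[key] = reverse.get(value, value)
    return out


def find_orders(status, user_id):
    """Stand-in for an unchanged tool that expects English literals."""
    orders = [
        {"order_id": "#W001", "status": "pending"},
        {"order_id": "#W002", "status": "delivered"},
    ]
    return [o for o in orders if o["status"] == status]


# The agent issues a call with a localized enum value.
agent_args = {"status": "tertunda", "user_id": "u_42"}
result = find_orders(**canonicalize(agent_args, GLOSSARY))
# The response is localized again before the agent sees it.
shown_to_agent = [localize(order, GLOSSARY) for order in result]
```

Without the `canonicalize` step, the call above would return no orders even though the agent chose the right value, which is the failure mode the paragraph warns about.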

#### Human Manual Review\.

For each target language, a native speaker (either an author or a recruited reviewer) reviews the machine-translated artifacts. They read every translated prose document in full (domain policy and agent/user instructions) and a sample of 100 segments each from the database and the task contexts. Reviews are carried out in a per-language Excel workbook with one sheet per artifact (see the appendix for details), where annotators add a corrected translation and notes (both optional). If human-corrected values are present, we use them in experiments; otherwise, we consider the machine-translated versions usable. We find that our pipeline ingests 91% of machine-translated artifacts, suggesting that they are usable even on their own. Since we do not employ more than one reviewer per language, we do not calculate inter-annotator agreement scores.

### 3\.3Evaluation Scenarios

![Refer to caption](https://arxiv.org/html/2606.28715v1/x2.png)

Figure 3: Evaluation results on SEATauBench. SEATauBench reveals consistent degradation as the setting becomes less English-centric on all languages in both pass@1 and $\rho^{3}$ metrics, with most severe drops in low-resource languages with non-Latin scripts. All scenario results are reported with three models, except for L2 Tool, where only GPT-5-mini and Qwen3-235B-A22B-Inst are reported.

| Scenario | L2 Tools | L2 Convo | L2 DB & Policy |
| --- | --- | --- | --- |
| (S1) English Only | ✗ | ✗ | ✗ |
| (S2) L2 Interaction | ✗ | ✓ | ✗ |
| (S3) L2 Tool | ✓ | ✗ | ✗ |
| (S4) L2 Domain | ✓ | ✓ | ✓ |

Table 1: SEATauBench evaluation scenarios across English and five SEA languages. S2 and S3 isolate L2 dialogue and L2 tool docstrings respectively, while S4 L2 Domain uses all translated contexts.

SEATauBench has four scenarios with increasing levels of L2 adaptation (summary in Table [1](https://arxiv.org/html/2606.28715#S3.T1)).

(S1) English Only. The original English benchmark, to establish the baseline performance for tested agents.

(S2) L2 Interaction. The simulated user and agent converse in L2, while tools, policies, databases, and task contexts remain in English.

(S3) L2 Tool. Tool schemas are rewritten in L2 while conversation and domain context remain in English. We present translated tool schemas in two formats: single-L2, where each schema uses one target language, and mixed-L2, where schemas combine an increasing number of target languages.

(S4) L2 Domain. Agents operate in a complete L2 setting where dialogue, tool schemas, policies, task descriptions, and the agent-visible database are translated while task semantics remain fixed. This scenario uses all translated artifacts and runtime localization described in [Section 3.2](https://arxiv.org/html/2606.28715#S3.SS2).
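The four scenarios differ only in which surfaces are exposed in L2, so they can be written down as a small configuration. The sketch below encodes the rows of Table 1. The `Scenario` class, its field names, and `localized_surfaces` are hypothetical and are not taken from the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    l2_tools: bool       # tool schemas shown in L2
    l2_convo: bool       # user-agent conversation held in L2
    l2_db_policy: bool   # database and policy shown in L2


# Flags follow Table 1 of the paper.
SCENARIOS = {
    "S1": Scenario("English Only", False, False, False),
    "S2": Scenario("L2 Interaction", False, True, False),
    "S3": Scenario("L2 Tool", True, False, False),
    "S4": Scenario("L2 Domain", True, True, True),
}


def localized_surfaces(scenario):
    """List the surfaces that a scenario exposes in the target language."""
    flags = [
        ("tools", scenario.l2_tools),
        ("conversation", scenario.l2_convo),
        ("db_policy", scenario.l2_db_policy),
    ]
    return [name for name, enabled in flags if enabled]
```

Representing scenarios as independent flags makes it explicit that S2 and S3 each switch on a single surface, while S4 switches on all three.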

## 4Experimental Setup

#### Metrics\.

Following $\tau^{2}$-Bench Barres et al. ([2025](https://arxiv.org/html/2606.28715#bib.bib56)), we report two task metrics. First, pass@1 measures quality as the mean success rate of independent trials reaching the expected final state, i.e., pass@1 $= \text{pass}^{1} = \mathbb{E}[r]$. Building on existing metrics, we also derive a robustness measure, $\rho \in [0,1]$, defined as

$$\rho^{q} = \frac{\text{pass}^{q}}{\text{pass@1}} \tag{1}$$

where $\text{pass}^{q}$ denotes the probability that all $q$ independent trials succeed, averaged across tasks. $\rho^{q}=1$ indicates that the agent performs consistently across multiple runs, while lower $\rho^{q}$ indicates that the agent may solve a task once but fail to do so reliably across repeated trials.
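To make the metric definitions concrete, the sketch below computes pass@1, $\text{pass}^{q}$, and $\rho^{q}$ from per-task trial outcomes. It assumes the combinatorial estimator $\binom{c}{q}/\binom{n}{q}$ for $\text{pass}^{q}$ that is commonly used in the $\tau$-bench line of work. The function names and the zero-division guard are assumptions, not the paper's code.

```python
from math import comb


def pass_hat_q(successes, trials, q):
    """Estimate the probability that q sampled trials of one task all succeed."""
    return comb(successes, q) / comb(trials, q)


def metrics(results, q):
    """Compute (pass@1, pass^q, rho^q) from per-task lists of 0/1 outcomes."""
    n_tasks = len(results)
    pass_at_1 = sum(sum(r) / len(r) for r in results) / n_tasks
    pass_q = sum(pass_hat_q(sum(r), len(r), q) for r in results) / n_tasks
    # Guard (an assumption): define rho as 0 when no trial succeeds at all.
    rho_q = pass_q / pass_at_1 if pass_at_1 > 0 else 0.0
    return pass_at_1, pass_q, rho_q


# Four tasks, three trials each: two always solved, one solved twice,
# one never solved.
results = [[1, 1, 1], [1, 0, 1], [0, 0, 0], [1, 1, 1]]
pass_at_1, pass_3, rho_3 = metrics(results, q=3)
```

In this toy example pass@1 is 2/3, $\text{pass}^{3}$ is 1/2, and $\rho^{3}$ is 3/4: the task solved in only two of three trials raises pass@1 but contributes nothing to $\text{pass}^{3}$, which is exactly the inconsistency $\rho^{q}$ is meant to expose.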

#### Task and Domain\.

For high-quality evaluation, we use the latest versions of the retail and airline domains Cuadron et al. ([2026](https://arxiv.org/html/2606.28715#bib.bib9)), which incorporate several corrections and refinements over prior versions Yao et al. ([2024](https://arxiv.org/html/2606.28715#bib.bib8)); Barres et al. ([2025](https://arxiv.org/html/2606.28715#bib.bib56)). For the telecom domain, we follow the original $\tau^{2}$-Bench, as its tasks were not updated in the most recent release. We leave exploration of the banking domain Shi et al. ([2026](https://arxiv.org/html/2606.28715#bib.bib25)) and the voice modality Ray et al. ([2026](https://arxiv.org/html/2606.28715#bib.bib57)) introduced in $\tau^{3}$-bench to future work.

#### Hyperparameters and Models\.

For all evaluations, we use $q=3$, resulting in the robustness metric $\rho^{3}$, and Qwen3-235B-A22B-Inst ([https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507)) Qwen-Team ([2025](https://arxiv.org/html/2606.28715#bib.bib3)) as the user LLM. For the natural-language assertion judge, we use GPT-4.1, following the implementations provided in $\tau^{2}$-Bench. We evaluate three recent LLM agents spanning three model families: GPT-5-Mini ([https://developers.openai.com/api/docs/models/gpt-5-mini](https://developers.openai.com/api/docs/models/gpt-5-mini)) Singh et al. ([2026](https://arxiv.org/html/2606.28715#bib.bib2)) as a proprietary model, Qwen3-235B-A22B-Instruct-2507 as a representative open-source model, and Kimi-K2.5 Kimi Team et al. ([2026](https://arxiv.org/html/2606.28715#bib.bib1)) as a larger-scale open-source model.

For each model, we use the default hyperparameters defined by each provider (see Appendix [C](https://arxiv.org/html/2606.28715#A3)). We avoid any model-specific hyperparameter tuning to better reflect out-of-the-box tool-agent-user interaction capability when a model is applied to the specified scenario and L2.

![Refer to caption](https://arxiv.org/html/2606.28715v1/x3.png)

![Refer to caption](https://arxiv.org/html/2606.28715v1/x4.png)

Figure 4: Quality and robustness of varying agent models across languages and domains. (left) All agents yield consistently higher pass@1 in English and Chinese, indicating a quality bias toward high-resource languages. (center) A similar trend is observed on $\rho^{3}$, with negative correlation of $\rho^{3}$ between English and L2 languages, especially Thai. (right) The trend of pass@1 holds across most domains and models, with some exceptions as described in §[5.2](https://arxiv.org/html/2606.28715#S5.SS2).

## 5Results

We present two main results. First, we examine how quality and robustness change as more components of the English benchmark are converted to L2 ([Section 5.1](https://arxiv.org/html/2606.28715#S5.SS1)). Second, we test whether these trends are consistent across agent models and task domains ([Section 5.2](https://arxiv.org/html/2606.28715#S5.SS2)). Together, the results show that English-centric evaluations overestimate multilingual agent reliability, and that the size of this gap depends on both the target language and the structure of the task environment.

### 5.1 Performance Degrades in L2 Languages

[Figure 3](https://arxiv.org/html/2606.28715#S3.F3) demonstrates a consistent decline in model performance as the evaluation setting becomes progressively less English-centric. Both pass@1 and $\rho^{3}$ are strongest in English settings, but performance drops when the user-agent conversation is conducted in an L2, drops further when tool interfaces are localized, and typically reaches its lowest levels when the domain context is also translated. This progression indicates that multilingual agentic competence is not only a matter of understanding non-English user utterances. Agents must also use localized tool schemas and entities, and ground their decisions in domain policies expressed in L2.

Although all L2 settings degrade relative to English, the severity and form of degradation differ by language. In L2 Domain, Thai and Filipino show the largest pass@1 declines, falling to roughly the low-0.4 range. Vietnamese and Chinese remain closer to the mid-0.5 range, while Indonesian shows a partial robustness recovery despite lower pass@1. The sharper drop in Thai may reflect challenges associated with a distinct writing system and lower-resource coverage. However, the comparable drop in Filipino shows that localization failures are not limited to non-Latin scripts: even Latin-script languages can expose weaknesses when agents must reason over fully localized dialogue, tools, schemas, and domain context.

#### Quality and robustness diverge\.

[Figure 3](https://arxiv.org/html/2606.28715#S3.F3) also shows that pass@1 and $\rho^{3}$ do not always move together. In Indonesian, Filipino, Chinese, and to a lesser extent Vietnamese, pass@1 drops in L2 Domain while $\rho^{3}$ remains comparatively stable or rebounds relative to L2 Tool. This means that agents may solve fewer tasks overall, but when they do solve a task, their successful behavior is relatively reproducible in some L2 settings. Thai is the clearest exception: both pass@1 and $\rho^{3}$ decline in L2 Domain. This suggests that some L2 conditions weaken not only average task success but also consistency across repeated trials.

### 5.2 Does the Trend Hold across Agents?

Overall, model performance differs across metrics, as shown in Fig. [4](https://arxiv.org/html/2606.28715#S4.F4). For pass@1, Kimi K2.5 performs best in most languages, including English, Vietnamese, Filipino, and Chinese, while GPT-5 Mini leads in Indonesian and Thai, indicating that model advantages do not transfer uniformly across SEA languages. Qwen3-235B-A22B-Inst consistently has the lowest pass@1, suggesting weaker single-run task completion. However, $\rho^{3}$ reveals a different pattern: despite its lower pass@1, Qwen3-235B-A22B-Inst often shows stronger robustness in lower-resource L2 settings. This robustness advantage weakens in higher-resource languages such as English and Chinese, where Kimi K2.5 and GPT-5 Mini remain competitive or stronger.

The domain-level results show that the gap between English and non-English settings varies across domains and models. For GPT-5 Mini and Kimi K2.5, pass@1 generally follows Telecom > Airline > Retail in both English and non-English settings, suggesting that Retail is the most challenging domain due to its policies, item-level constraints, and state updates. Qwen3-235B-A22B-Inst shows a different pattern: Retail performs best in English but becomes weakest in non-English settings, indicating that domain-specific strengths in English may not transfer under localization. The $\rho^{3}$ results further show that localization reduces robustness, with English settings generally achieving higher consistency than non-English settings across domains. However, this degradation is not uniform. For Qwen3-235B-A22B-Inst, agent performance follows the expected non-English decline in the Telecom and Retail domains, while it shows a relative robustness spike for Airline. This suggests that robustness under localization depends on both the task domain and the behavior of the agent model.

## 6 Analysis and Discussion

In this section, we examine error patterns arising from conversational user–agent interactions in L2 Interaction and L2 Domain scenarios \([Section6\.1](https://arxiv.org/html/2606.28715#S6.SS1)\) and the effects of having tools in multiple languages \([Section6\.2](https://arxiv.org/html/2606.28715#S6.SS2)\)\. Furthermore, we show that agent performance has little to no association with language correctness \([Section6\.3](https://arxiv.org/html/2606.28715#S6.SS3)\) and that English performance is an unreliable proxy for L2 performance with SEA languages \([Section6\.4](https://arxiv.org/html/2606.28715#S6.SS4)\)\.

### 6.1 Error Analysis

![Refer to caption](https://arxiv.org/html/2606.28715v1/x5.png)

Figure 5: Error categorization for the simulated user and agent across SEA languages in the L2 Interaction and L2 Domain scenarios. Both agents and users tend to make more critical errors in L2 than in English, while making a similar number of benign errors. The errors are reported by the judge DeepSeek-V4-Flash, for simulated user Qwen3-235B-A22B-Inst and agent GPT-5-mini.

To further examine the downstream performance metrics reported in the previous section, we conduct a qualitative error analysis based on the LLM-as-a-Judge framework proposed in Shi et al. ([2026](https://arxiv.org/html/2606.28715#bib.bib25)). Following Barres et al. ([2025](https://arxiv.org/html/2606.28715#bib.bib56)), we categorize task simulation errors into two types: Critical errors, in which the agent deviates from the user's intent or commits irrecoverable mistakes, and Benign errors, which do not prevent successful task completion. As an initial analysis, we consider only the scenarios that change the behavior of both the simulated user and the agent, i.e., L2 Interaction and L2 Domain, since L2 Tool only translates tool specifications while keeping all tool-agent-user interactions in English. Specifically, we employ DeepSeek-V4-Flash as the oracle LLM on simulations run with GPT-5-mini as the agent and Qwen3-235B-A22B-Inst as the simulated user.

#### L2 Interaction\.

Agent performance degrades when interacting with users in L2. As shown in [Figure 5](https://arxiv.org/html/2606.28715#S6.F5), critical errors appear in around 40% of total simulations across SEA languages, with Thai being the most severe (nearly 50%). Moreover, when the user model must communicate only in SEA languages, its behavior becomes less stable and less consistent with the intended simulation setup, with the most severe degradation in Filipino.

#### L2 Domain.

Agent performance degradation is more pronounced in the L2 Domain setting. Agents make slightly more critical errors here than in L2 Interaction, with the clearest trends in Thai, Filipino, and Indonesian ([Figure 5](https://arxiv.org/html/2606.28715#S6.F5)). This finding suggests that operating in complete L2 environments further impairs the agent's ability to complete tasks effectively.

#### Simulated User.

A similar trend is observed for the simulated user, where critical errors consistently account for around 20% of all errors in both analyzed scenarios. Our main finding, that agent performance drops as L2 adaptation increases across the four scenarios ([Figure 3](https://arxiv.org/html/2606.28715#S3.F3)), may therefore be confounded by the low reliability of the simulated user. This issue was previously reported in Barres et al. ([2025](https://arxiv.org/html/2606.28715#bib.bib56)) for English and Seshadri et al. ([2026](https://arxiv.org/html/2606.28715#bib.bib67)) for dialectal English.

### 6\.2Mixed language tool causes modest drop in task performance

![Refer to caption](https://arxiv.org/html/2606.28715v1/x6.png) Figure 6: Pass@1 and $\rho^{3}$ as the number of languages used for translated tools increases from 1 \(English\) to 5\. Performance for two agents \(GPT\-5\-Mini & Qwen3\-235B\-A22B\-Inst\) drops slightly when a second language is added, then plateaus as more languages are added\.

Beyond adaptation using a single L2, we also investigate how agents handle tools in multiple languages, which is increasingly relevant as tools for agents are developed in different world regions\. We do so by extending the L2 Tool experiment and exploring five different mixed\-language settings, gradually adding more languages to the mix used for tool specifications: Mix\-2 uses English and Thai, Mix\-3 adds Vietnamese, Mix\-4 adds Indonesian, and Mix\-5 adds Chinese \(for more details, see Appendix [B](https://arxiv.org/html/2606.28715#A2)\)\.
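The cumulative Mix\-k settings can be sketched as a simple prefix over an ordered language list\. The language order follows the text above; the round\-robin assignment of languages to tools and the tool names are assumptions made for illustration only, since the actual assignment is described in Appendix B rather than here\.

```python
# Order in which languages join the mix, as listed in the text (ISO 639-1 codes).
MIX_ORDER = ["en", "th", "vi", "id", "zh"]


def mix_setting(k):
    """Languages used in the Mix-k setting (k=1 is English only)."""
    if not 1 <= k <= len(MIX_ORDER):
        raise ValueError(f"k must be between 1 and {len(MIX_ORDER)}")
    return MIX_ORDER[:k]


def assign_tool_languages(tool_names, languages):
    """ASSUMPTION: spread languages over tools round-robin.

    The paper's real assignment scheme is not specified in this section.
    """
    return {name: languages[i % len(languages)] for i, name in enumerate(tool_names)}


# Hypothetical tool names, not taken from the benchmark.
tools = ["get_order_details", "cancel_order", "find_user"]
mix2_assignment = assign_tool_languages(tools, mix_setting(2))
print(mix2_assignment)
```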

[Figure 6](https://arxiv.org/html/2606.28715#S6.F6) shows that mixed\-language tools cause only a modest drop\. Pass@1 drops from $0.68$ for the English\-only average to $0.55$ across all mixed\-language settings \(a $0.13$ decrease\)\. $\rho^{3}$ shows a similar trend of degradation \(a $0.11$ decrease\)\. These results indicate that agents with strong tool\-using performance in English might use tools slightly less effectively when tools are multilingual; however, further increasing the language diversity of tool specifications has little to no impact on overall agentic tool\-using capability\.

### 6\.3 Language use is uncorrelated with performance

![Refer to caption](https://arxiv.org/html/2606.28715v1/x7.png) Figure 7: Only 1\.4% of the variation in the task performance metric pass^3 is explained by the language use capability of the tested agents, suggesting a very weak linear relationship\. Each dot represents a run for one scenario × domain × language setting\.

We also measure whether the simulated user and agent operated across tasks in the correct L2, which we call language correctness\. For a single run, it is calculated as the fraction of eligible turns whose detected language \(by fastText\) matches the expected L2\. We exclude trajectories with system errors and other errors unrelated to task execution\.
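A minimal sketch of the language correctness computation follows\. The language detector is passed in as a callable so that a fastText\-based identifier could be plugged in; here a toy script\-based detector stands in for it\. Treating every non\-empty turn as "eligible" is an assumption, as the eligibility rule is not spelled out in this section\.

```python
from typing import Callable, Iterable, Optional


def language_correctness(
    turns: Iterable[str],
    expected_lang: str,
    detect: Callable[[str], str],
) -> Optional[float]:
    """Fraction of eligible turns whose detected language equals `expected_lang`.

    ASSUMPTION: a turn is eligible if it contains non-whitespace text.
    Returns None when a run has no eligible turns.
    """
    eligible = [t for t in turns if t.strip()]
    if not eligible:
        return None
    matches = sum(1 for t in eligible if detect(t) == expected_lang)
    return matches / len(eligible)


def toy_detect(text: str) -> str:
    """Stand-in detector: 'th' if any character is in the Thai Unicode block.

    A real setup would wrap a language-ID model (e.g. fastText) here instead.
    """
    return "th" if any("\u0e00" <= ch <= "\u0e7f" for ch in text) else "en"


run_turns = ["สวัสดีครับ", "Your order has shipped.", "ขอบคุณค่ะ", "   "]
score = language_correctness(run_turns, expected_lang="th", detect=toy_detect)
print(score)
```

In this toy run, two of the three eligible turns are detected as Thai, so the run\-level score is 2/3\.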

[Figure 7](https://arxiv.org/html/2606.28715#S6.F7) shows that run\-level language correctness for the agent has little association with task performance: a linear fit explains only $R^{2}=0.014$ of the variance in pass^3\. Several runs achieve near\-perfect language correctness while spanning a wide range of pass^3 scores, especially in L2 Tool and L2 Domain\. Conversely, L2 Interaction contains runs with visibly lower language correctness but still moderate or high task success\. This suggests that, at the aggregate run level, task failures are not primarily explained by whether the agent remains in the target language\. For language drift, we also measure how often the agent produces text in the expected L2 \([Figure 9](https://arxiv.org/html/2606.28715#A5.F9)\), where in a conversation off\-target text appears \([Figure 10](https://arxiv.org/html/2606.28715#A5.F10)\), and how much of the off\-target text is specifically English \([Figure 11](https://arxiv.org/html/2606.28715#A5.F11)\)\.
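For readers who want to reproduce this kind of check, the sketch below computes $R^{2}$ for an ordinary least\-squares line fitted to \(language correctness, task score\) pairs\. It assumes a simple one\-variable linear fit; the input values are toy numbers chosen to show the two extremes, not results from the paper\.

```python
def r_squared(x, y):
    """R^2 of a least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_res / syy


# Toy inputs: a perfectly linear relation and one with zero covariance.
perfect = r_squared([0, 1, 2, 3], [1, 3, 5, 7])
unrelated = r_squared([0, 1, 2, 3], [1, 0, 0, 1])
print(perfect, unrelated)
```

An $R^{2}$ near 0, as in the second toy case, means that knowing the x value tells almost nothing about y under a linear model, which is the situation the paragraph above describes\.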

### 6\.4 Is English a reliable proxy for L2 in SEA?

The $\tau^{2}$\-Bench evaluation framework, while effective for measuring agent capabilities, is resource\-intensive and computationally expensive\. Extending it to SEATauBench with an additional language dimension adds to these costs\. To address this challenge, we examine the correlation between quality and robustness metrics across SEATauBench languages to determine whether performance in one language could serve as a proxy for other languages\.

As illustrated in [Figure 8](https://arxiv.org/html/2606.28715#S6.F8), agent performance in English is highly predictive of L2 performance for pass@1, with correlations above 0\.9\. However, there is more variability in $\rho^{3}$, where the highest correlation is with Chinese \(0\.88\) and the lowest with Thai \(0\.49\)\. Therefore, English performance is not a reliable proxy for SEA language performance once we factor in consistency\. Filipino \(TL\), however, shows the strongest correlation with other SEA languages, with the lowest score being 0\.91 for pass@1 and 0\.85 for $\rho^{3}$\. Based on these findings, we recommend using the Filipino subset of SEATauBench as a proxy to estimate performance across the remaining four languages, for a more efficient yet representative evaluation of AI agents in the SEA region\.

![Refer to caption](https://arxiv.org/html/2606.28715v1/x8.png) Figure 8: Correlation of pass@1 \(upper triangle\) and $\rho^{3}$ \(lower triangle\) across English and SEATauBench languages\. Correlation scores are aggregated over four scenarios, three agent models, and three task domains\.
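The proxy\-selection reasoning above can be sketched as follows: compute pairwise correlations between per\-language score vectors, then prefer the language whose weakest correlation with any other language is highest\. Using Pearson correlation and this worst\-case criterion are both assumptions for illustration; the section does not state which correlation coefficient was used\. Language labels and scores are placeholders, not benchmark results\.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length score vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def best_proxy(scores):
    """Language whose minimum correlation with every other language is largest."""
    langs = list(scores)

    def worst_case(lang):
        return min(pearson(scores[lang], scores[o]) for o in langs if o != lang)

    return max(langs, key=worst_case)


# Placeholder scores: one value per (scenario, model, domain) configuration.
toy_scores = {
    "lang_a": [1, 2, 3, 4],
    "lang_b": [1, 3, 2, 4],
    "lang_c": [2, 1, 4, 3],
}

r_ab = pearson(toy_scores["lang_a"], toy_scores["lang_b"])
proxy = best_proxy(toy_scores)
print(r_ab, proxy)
```

The worst\-case criterion matches how the text argues for Filipino, by citing its lowest correlation with the other languages rather than its average\.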

## 7 Conclusion

SEATauBench establishes the first agent\-focused framework for SEA linguistic diversity, revealing that English agentic capabilities transfer effectively to a superficial L2 interaction but degrade significantly as more non\-English contexts are included, due to both agent limitations and reduced user capacity\. By exposing fundamental gaps in using English\-only evaluation as a proxy for agent capability in SEA languages, SEATauBench provides essential diagnostic tools for sovereign AI development and a rigorous foundation for future research on sovereign agentic AI solutions, ultimately supporting autonomous AI across SEA’s diverse linguistic communities\.

## 8 Limitation

This work presents a systematic framework for extending agentic benchmarks to evaluate multilingual conversational capabilities with tool access in Southeast Asian \(SEA\) languages, specifically Vietnamese, Indonesian, Thai, Filipino, and Chinese, across the retail, airline, and telecom domains\. Although the proposed framework can, in principle, be extended to additional languages, the empirical findings and their implications are confined to the SEA language setting explored in this study\.

Another limitation is the benchmark adaptation methodology, which is specifically designed around the $\tau$\-Bench evaluation paradigm Yao et al\. \([2024](https://arxiv.org/html/2606.28715#bib.bib8)\); Barres et al\. \([2025](https://arxiv.org/html/2606.28715#bib.bib56)\)\. While the proposed evaluation scenarios may be applicable more broadly, the underlying design constraints may limit the generalizability of the methodology beyond $\tau$\-Bench frameworks\. In addition, the current analysis is limited to English and five Southeast Asian languages; consequently, the observed findings may not generalize to other SEA languages that are not covered in this work\.

## 9 Ethical Statement

We conducted this research with careful attention to ethical considerations throughout the development process\. In developing the translation pipeline and the evaluation framework, we ensured that no external sensitive information is exposed in the process or in the resulting artifacts\. We recognize our responsibility to ensure these systems are fair, respectful of different cultures, and beneficial to all users\. Our approach focuses on being transparent about what the agents can and cannot do, actively working to identify and address potential biases, and taking steps to prevent misuse\. We handle all research data responsibly, protecting privacy while respecting linguistic diversity\. This work aims to contribute to AI that serves people equitably across languages and cultures, and we are committed to being accountable for how these technologies impact society\.

## Acknowledgment

This work was supported by the ThaiLLM collaboration, funded by the Digital Economy and Society \(DE\) Development Fund of the Ministry of Digital Economy and Society, Thailand\. We thank Kian Kyars for providing OpenAI API access for our experiments, and SCBX R&D for providing the necessary Google Cloud Vertex AI resources\.

## References

- Sovereignty in the Age of AI: Strategic Choices, Structural Dependencies\.Tony Blair Institute for Global Change\.Note:Accessed: 22 May 2026External Links:[Link](https://institute.global/insights/tech-and-digitalisation/sovereignty-in-the-age-of-ai-strategic-choices-structural-dependencies)Cited by:[§1](https://arxiv.org/html/2606.28715#S1.p1.1)\.
- V\. Barres, H\. Dong, S\. Ray, X\. Si, and K\. Narasimhan \(2025\) $\tau^{2}$\-Bench: evaluating conversational agents in a dual\-control environment\. External Links: 2506\.07982, [Link](https://arxiv.org/abs/2506.07982) Cited by: [§1](https://arxiv.org/html/2606.28715#S1.p3.2), [§2\.1](https://arxiv.org/html/2606.28715#S2.SS1.p1.3), [§3\.1](https://arxiv.org/html/2606.28715#S3.SS1.p1.1), [§4](https://arxiv.org/html/2606.28715#S4.SS0.SSS0.Px1.p1.6), [§4](https://arxiv.org/html/2606.28715#S4.SS0.SSS0.Px2.p1.2), [§6\.1](https://arxiv.org/html/2606.28715#S6.SS1.SSS0.Px3.p1.1), [§6\.1](https://arxiv.org/html/2606.28715#S6.SS1.p1.1), [§8](https://arxiv.org/html/2606.28715#S8.p2.2)\.
- M\. Bhandari and G\. Modi \(2026\)Why sovereign artificial intelligence is imperative in southeast asia\.EY\.Note:Accessed: 22 May 2026External Links:[Link](https://www.ey.com/en_sg/insights/ai/why-sovereign-artificial-intelligence-is-imperative-in-southeast-asia)Cited by:[§1](https://arxiv.org/html/2606.28715#S1.p1.1)\.
- L\. Boisvert, M\. Thakkar, M\. Gasse, M\. Caccia, T\. L\. S\. De Chezelles, Q\. Cappart, N\. Chapados, A\. Lacoste, and A\. Drouin \(2024\)WorkArena\+\+: towards compositional planning and reasoning\-based common knowledge work tasks\.InAdvances in Neural Information Processing Systems,A\. Globerson, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. Tomczak, and C\. Zhang \(Eds\.\),Vol\.37,pp\. 5996–6051\.External Links:[Document](https://dx.doi.org/10.52202/079017-0195),[Link](https://proceedings.neurips.cc/paper_files/paper/2024/file/0b82662b6c32e887bb252a74d8cb2d5e-Paper-Datasets_and_Benchmarks_Track.pdf)Cited by:[§2\.1](https://arxiv.org/html/2606.28715#S2.SS1.p1.3)\.
- P\. Budzianowski, T\. Wen, B\. Tseng, I\. Casanueva, S\. Ultes, O\. Ramadan, and M\. Gašić \(2018\)MultiWOZ \- a large\-scale multi\-domain Wizard\-of\-Oz dataset for task\-oriented dialogue modelling\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,E\. Riloff, D\. Chiang, J\. Hockenmaier, and J\. Tsujii \(Eds\.\),Brussels, Belgium,pp\. 5016–5026\.External Links:[Link](https://aclanthology.org/D18-1547/),[Document](https://dx.doi.org/10.18653/v1/D18-1547)Cited by:[§1](https://arxiv.org/html/2606.28715#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.28715#S2.SS1.p1.3)\.
- S\. Cahyawijaya, P\. Limkonchotiwat, T\. H\. Wong, H\. L\. Patel, A\. Agarwal, M\. A\. Rufino, C\. R\. Catalan, M\. R\. Qorib, V\. Feliren, H\. Lovenia, A\. H\. Khine, F\. Hudi, D\. Anugraha, A\. F\. Aji, R\. Chumpu, V\. Pham, M\. Wang, M\. F\. Imam, R\. Zhang, J\. M\. Imperial, K\. Nur'aini, D\. X\. Long, M\. I\. Wijanarko, J\. R\. A\. Moniz, P\. A\. Irawan, H\. M\. Zhafran, I\. Flores, S\. Z\. Pranida, J\. Kevin, J\. J\. Rosal, P\. N\. Monderin, K\. Kerdthaisong, A\. Mustafid, M\. C\. Nguyen, N\. Jongwiriyanurak, S\. Worajitwannakul, H\. Li, A\. X\. W\. Lim, B\. Wang, M\. R\. S\. Habibi, L\. H\. X\. Ng, M\. Bangera, Y\. Bangera, P\. Pattnayak, D\. L\. Chan, S\. C\. Djuniwar, C\. C\. M\. Oo, and H\. M\. Shan \(2026\)Anthropogenic regional adaptation in multimodal vision\-language model\.External Links:2604\.11490,[Link](https://arxiv.org/abs/2604.11490)Cited by:[§1](https://arxiv.org/html/2606.28715#S1.p2.1)\.
- S\. Cahyawijaya, H\. Lovenia, A\. F\. Aji, G\. Winata, B\. Wilie, F\. Koto, R\. Mahendra, C\. Wibisono, A\. Romadhony, K\. Vincentio, J\. Santoso, D\. Moeljadi, C\. Wirawan, F\. Hudi, M\. S\. Wicaksono, I\. Parmonangan, I\. Alfina, I\. F\. Putra, S\. Rahmadani, Y\. Oenang, A\. Septiandri, J\. Jaya, K\. Dhole, A\. Suryani, R\. A\. Putri, D\. Su, K\. Stevens, M\. N\. Nityasya, M\. Adilazuarda, R\. Hadiwijaya, R\. Diandaru, T\. Yu, V\. Ghifari, W\. Dai, Y\. Xu, D\. Damapuspita, H\. Wibowo, C\. Tho, I\. Karo Karo, T\. Fatyanosa, Z\. Ji, G\. Neubig, T\. Baldwin, S\. Ruder, P\. Fung, H\. Sujaini, S\. Sakti, and A\. Purwarianti \(2023a\)NusaCrowd: open source initiative for Indonesian NLP resources\.InFindings of the Association for Computational Linguistics: ACL 2023,A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 13745–13818\.External Links:[Link](https://aclanthology.org/2023.findings-acl.868/),[Document](https://dx.doi.org/10.18653/v1/2023.findings-acl.868)Cited by:[§1](https://arxiv.org/html/2606.28715#S1.p2.1)\.
- S\. Cahyawijaya, H\. Lovenia, F\. Koto, D\. Adhista, E\. Dave, S\. Oktavianti, S\. Akbar, J\. Lee, N\. Shadieq, T\. W\. Cenggoro, H\. Linuwih, B\. Wilie, G\. Muridan, G\. Winata, D\. Moeljadi, A\. F\. Aji, A\. Purwarianti, and P\. Fung \(2023b\)NusaWrites: constructing high\-quality corpora for underrepresented and extremely low\-resource languages\.InProceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia\-Pacific Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\),J\. C\. Park, Y\. Arase, B\. Hu, W\. Lu, D\. Wijaya, A\. Purwarianti, and A\. A\. Krisnadhi \(Eds\.\),Nusa Dua, Bali,pp\. 921–945\.External Links:[Link](https://aclanthology.org/2023.ijcnlp-main.60/),[Document](https://dx.doi.org/10.18653/v1/2023.ijcnlp-main.60)Cited by:[§1](https://arxiv.org/html/2606.28715#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.28715#S2.SS2.p1.1)\.
- S\. Cahyawijaya, H\. Lovenia, J\. R\. A\. Moniz, T\. H\. Wong, M\. R\. Farhansyah, T\. T\. Maung, F\. Hudi, D\. Anugraha, M\. R\. S\. Habibi, M\. R\. Qorib, A\. Agarwal, J\. M\. Imperial, H\. L\. Patel, V\. Feliren, B\. I\. Nasution, M\. A\. Rufino, G\. I\. Winata, R\. A\. Rajagede, C\. R\. Catalan, M\. F\. M\. Imam, P\. Pattnayak, S\. Z\. Pranida, K\. Pratama, Y\. Bangera, A\. Na\-Thalang, P\. N\. Monderin, Y\. Song, C\. Simon, L\. H\. X\. Ng, R\. L\. Sapan, T\. H\. Rafi, B\. Wang, Supryadi, K\. Veerakanjana, P\. Ittichaiwong, M\. T\. Roque, K\. Vincentio, T\. Kreangphet, P\. Artkaew, K\. H\. Palgunadi, Y\. Yu, R\. P\. Hastuti, W\. Nixon, M\. Bangera, A\. X\. W\. Lim, A\. H\. Khine, H\. M\. Zhafran, T\. Ferdinan, A\. A\. Izzani, A\. Singh, E\. Evan, J\. A\. Krito, M\. Anugraha, F\. A\. Ilasariya, H\. Li, J\. A\. Daniswara, F\. A\. Tjiaranata, E\. P\. Yulianrifat, C\. Udomcharoenchaikit, F\. R\. Ansori, M\. K\. Ihsani, G\. Nguyen, A\. M\. Barik, D\. J\. Velasco, R\. A\. Genadi, S\. Saha, C\. Wei, I\. E\. W\. Flores, K\. C\. K\. Han, A\. G\. D\. Santos, W\. S\. Lim, K\. S\. Phyo, T\. Santos, M\. Dwiastuti, J\. Luo, J\. C\. B\. Cruz, M\. S\. Hee, I\. A\. Hanif, M\. A\. Hakim, M\. R\. Sya'ban, K\. Kerdthaisong, L\. J\. V\. Miranda, F\. Koto, T\. N\. Fatyanosa, A\. F\. Aji, J\. J\. Rosal, J\. Kevin, R\. Wijaya, O\. P\. Kampman, R\. Zhang, B\. F\. Karlsson, and P\. Limkonchotiwat \(2025a\)Crowdsource, crawl, or generate? creating SEA\-VL, a multicultural vision\-language dataset for Southeast Asia\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 
18685–18717\.External Links:[Link](https://aclanthology.org/2025.acl-long.916/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.916),ISBN 979\-8\-89176\-251\-0Cited by:[§1](https://arxiv.org/html/2606.28715#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.28715#S2.SS2.p1.1)\.
- S\. Cahyawijaya, R\. Zhang, J\. C\. B\. Cruz, H\. Lovenia, E\. Gilbert, H\. Nomoto, and A\. F\. Aji \(2025b\)Thank you, stingray: multilingual large language models can not \(yet\) disambiguate cross\-lingual word senses\.InFindings of the Association for Computational Linguistics: NAACL 2025,L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico,pp\. 3228–3250\.External Links:[Link](https://aclanthology.org/2025.findings-naacl.178/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-naacl.178),ISBN 979\-8\-89176\-195\-7Cited by:[§2\.2](https://arxiv.org/html/2606.28715#S2.SS2.p1.1)\.
- K\. Chae, G\. Kim, G\. Lee, T\. Kim, J\. Lee, and H\. Kim \(2025\)Assessing socio\-cultural alignment and technical safety of sovereign LLMs\.InFindings of the Association for Computational Linguistics: EMNLP 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 10579–10600\.External Links:[Link](https://aclanthology.org/2025.findings-emnlp.559/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.559),ISBN 979\-8\-89176\-335\-7Cited by:[§1](https://arxiv.org/html/2606.28715#S1.p1.1)\.
- CohereLabs \(2025\)AyaVisionBench\.Note:[https://huggingface\.co/datasets/CohereLabs/AyaVisionBench](https://huggingface.co/datasets/CohereLabs/AyaVisionBench)Hugging Face datasetCited by:[§2\.2](https://arxiv.org/html/2606.28715#S2.SS2.p1.1)\.
- A\. Cuadron, P\. Yu, Y\. Liu, and A\. Gupta \(2026\)SABER: small actions, big errors — safeguarding mutating steps in LLM agents\.InICLR 2026 Workshop on Memory for LLM\-Based Agentic Systems,External Links:[Link](https://openreview.net/forum?id=En2z9dckgP)Cited by:[§4](https://arxiv.org/html/2606.28715#S4.SS0.SSS0.Px2.p1.2)\.
- A\. Drouin, M\. Gasse, M\. Caccia, I\. H\. Laradji, M\. D\. Verme, T\. Marty, L\. Boisvert, M\. Thakkar, Q\. Cappart, D\. Vazquez, N\. Chapados, and A\. Lacoste \(2024\)WorkArena: how capable are web agents at solving common knowledge work tasks?\.External Links:2403\.07718,[Link](https://arxiv.org/abs/2403.07718)Cited by:[§2\.1](https://arxiv.org/html/2606.28715#S2.SS1.p1.3)\.
- M\. Eric, R\. Goel, S\. Paul, A\. Sethi, S\. Agarwal, S\. Gao, A\. Kumar, A\. Goyal, P\. Ku, and D\. Hakkani\-Tur \(2020\)MultiWOZ 2\.1: a consolidated multi\-domain dialogue dataset with state corrections and state tracking baselines\.InProceedings of the Twelfth Language Resources and Evaluation Conference,N\. Calzolari, F\. Béchet, P\. Blache, K\. Choukri, C\. Cieri, T\. Declerck, S\. Goggi, H\. Isahara, B\. Maegaard, J\. Mariani, H\. Mazo, A\. Moreno, J\. Odijk, and S\. Piperidis \(Eds\.\),Marseille, France,pp\. 422–428\(eng\)\.External Links:[Link](https://aclanthology.org/2020.lrec-1.53/),ISBN 979\-10\-95546\-34\-4Cited by:[§1](https://arxiv.org/html/2606.28715#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.28715#S2.SS1.p1.3)\.
- J\. FitzGerald, C\. Hench, C\. Peris, S\. Mackie, K\. Rottmann, A\. Sanchez, A\. Nash, L\. Urbach, V\. Kakarala, R\. Singh, S\. Ranganath, L\. Crist, M\. Britan, W\. Leeuwis, G\. Tur, and P\. Natarajan \(2023\)MASSIVE: a 1M\-example multilingual natural language understanding dataset with 51 typologically\-diverse languages\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 4277–4302\.External Links:[Link](https://aclanthology.org/2023.acl-long.235/),[Document](https://dx.doi.org/10.18653/v1/2023.acl-long.235)Cited by:[§2\.1](https://arxiv.org/html/2606.28715#S2.SS1.p1.3)\.
- T\. Han, X\. Liu, R\. Takanabu, Y\. Lian, C\. Huang, D\. Wan, W\. Peng, and M\. Huang \(2021\)MultiWOZ 2\.3: a multi\-domain task\-oriented dialogue dataset enhanced with annotation corrections and co\-reference annotation\.InNatural Language Processing and Chinese Computing,L\. Wang, Y\. Feng, Y\. Hong, and R\. He \(Eds\.\),Cham,pp\. 206–218\.External Links:ISBN 978\-3\-030\-88483\-3Cited by:[§1](https://arxiv.org/html/2606.28715#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.28715#S2.SS1.p1.3)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021\)Measuring massive multitask language understanding\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=d7KBjmI3GmQ)Cited by:[§2\.2](https://arxiv.org/html/2606.28715#S2.SS2.p1.1)\.
- K\. Huang, A\. Prabhakar, S\. Dhawan, Y\. Mao, H\. Wang, S\. Savarese, C\. Xiong, P\. Laban, and C\. Wu \(2025\)CRMArena: understanding the capacity of LLM agents to perform professional CRM tasks in realistic environments\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico,pp\. 3830–3850\.External Links:[Link](https://aclanthology.org/2025.naacl-long.194/),[Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.194),ISBN 979\-8\-89176\-189\-6Cited by:[§2\.1](https://arxiv.org/html/2606.28715#S2.SS1.p1.3)\.
- M\. Kautsar, R\. Nurdini, S\. Cahyawijaya, G\. Winata, and A\. Purwarianti \(2023\)IndoToD: a multi\-domain Indonesian benchmark for end\-to\-end task\-oriented dialogue systems\.InProceedings of the First Workshop in South East Asian Language Processing,D\. Wijaya, A\. F\. Aji, C\. Vania, G\. I\. Winata, and A\. Purwarianti \(Eds\.\),Nusa Dua, Bali, Indonesia,pp\. 85–99\.External Links:[Link](https://aclanthology.org/2023.sealp-1.7/),[Document](https://dx.doi.org/10.18653/v1/2023.sealp-1.7)Cited by:[§2\.2](https://arxiv.org/html/2606.28715#S2.SS2.p1.1)\.
- Kimi Team, T\. Bai, Y\. Bai, Y\. Bao, S\. H\. Cai, Y\. Cao, Y\. Charles, H\. S\. Che, C\. Chen, G\. Chen, H\. Chen, J\. Chen, J\. Chen, J\. Chen, J\. Chen, K\. Chen, L\. Chen, R\. Chen, X\. Chen, Y\. Chen, Y\. Chen, Y\. Chen, Y\. Chen, Y\. Chen, Y\. Chen, Y\. Chen, Y\. Chen, Z\. Chen, Z\. Chen, D\. Cheng, M\. Chu, J\. Cui, J\. Deng, M\. Diao, H\. Ding, M\. Dong, M\. Dong, Y\. Dong, Y\. Dong, A\. Du, C\. Du, D\. Du, L\. Du, Y\. Du, Y\. Fan, S\. Fang, Q\. Feng, Y\. Feng, G\. Fu, K\. Fu, H\. Gao, T\. Gao, Y\. Ge, S\. Geng, C\. Gong, X\. Gong, Z\. Gongque, Q\. Gu, X\. Gu, Y\. Gu, L\. Guan, Y\. Guo, X\. Hao, W\. He, W\. He, Y\. He, C\. Hong, H\. Hu, J\. Hu, Y\. Hu, Z\. Hu, K\. Huang, R\. Huang, W\. Huang, Z\. Huang, T\. Jiang, Z\. Jiang, X\. Jin, Y\. Jing, G\. Lai, A\. Li, C\. Li, C\. Li, F\. Li, G\. Li, G\. Li, H\. Li, H\. Li, J\. Li, J\. Li, J\. Li, L\. Li, M\. Li, W\. Li, W\. Li, X\. Li, X\. Li, Y\. Li, Y\. Li, Y\. Li, Y\. Li, Z\. Li, Z\. Li, W\. Liao, J\. Lin, X\. Lin, Z\. Lin, Z\. Lin, C\. Liu, C\. Liu, H\. Liu, L\. Liu, S\. Liu, S\. Liu, S\. Liu, T\. Liu, T\. Liu, W\. Liu, X\. Liu, Y\. Liu, Y\. Liu, Y\. Liu, Y\. Liu, Y\. Liu, Z\. Liu, Z\. Liu, E\. Lu, H\. Lu, Z\. Lu, J\. Luo, T\. Luo, Y\. Luo, L\. Ma, Y\. Ma, S\. Mao, Y\. Mei, X\. Men, F\. Meng, Z\. Meng, Y\. Miao, M\. Ni, K\. Ouyang, S\. Pan, B\. Pang, Y\. Qian, R\. Qin, Z\. Qin, J\. Qiu, B\. Qu, Z\. Shang, Y\. Shao, T\. Shen, Z\. Shen, J\. Shi, L\. Shi, S\. Shi, F\. Song, P\. Song, T\. Song, X\. Song, H\. Su, J\. Su, Z\. Su, L\. Sui, J\. Sun, J\. Sun, T\. Sun, F\. Sung, Y\. Tai, C\. Tang, H\. Tang, X\. Tang, Z\. Tang, J\. Tao, S\. Teng, C\. Tian, P\. Tian, A\. Wang, B\. Wang, C\. Wang, C\. Wang, C\. Wang, D\. Wang, D\. Wang, D\. Wang, F\. Wang, H\. Wang, H\. Wang, H\. Wang, H\. Wang, H\. Wang, J\. Wang, J\. Wang, J\. Wang, K\. Wang, L\. Wang, Q\. Wang, S\. Wang, S\. Wang, S\. Wang, W\. Wang, X\. Wang, X\. Wang, Y\. Wang, Y\. Wang, Y\. Wang, Y\. Wang, Y\. Wang, Y\. Wang, Z\. Wang, Z\. Wang, Z\. Wang, Z\. 
Wang, Z\. Wang, Z\. Wang, C\. Wei, M\. Wei, C\. Wen, Z\. Wen, C\. Wu, H\. Wu, J\. Wu, R\. Wu, W\. Wu, Y\. Wu, Y\. Wu, Y\. Wu, Z\. Wu, C\. Xiao, J\. Xie, X\. Xie, Y\. Xie, Y\. Xin, B\. Xing, B\. Xu, J\. Xu, J\. Xu, J\. Xu, L\. H\. Xu, L\. Xu, S\. Xu, W\. Xu, X\. Xu, X\. Xu, Y\. Xu, Y\. Xu, Y\. Xu, Z\. Xu, Z\. Xu, J\. Yan, Y\. Yan, G\. Yang, H\. Yang, J\. Yang, K\. Yang, N\. Yang, R\. Yang, X\. Yang, X\. Yang, Y\. Yang, Y\. Yang, Y\. Yang, Z\. Yang, Z\. Yang, Z\. Yang, H\. Yao, D\. Ye, W\. Ye, Z\. Ye, B\. Yin, C\. Yu, L\. Yu, T\. Yu, T\. Yu, E\. Yuan, M\. Yuan, X\. Yuan, Y\. Yue, W\. Zeng, D\. Zha, H\. Zhan, D\. Zhang, H\. Zhang, J\. Zhang, P\. Zhang, Q\. Zhang, R\. Zhang, X\. Zhang, Y\. Zhang, Y\. Zhang, Y\. Zhang, Y\. Zhang, Y\. Zhang, Y\. Zhang, Y\. Zhang, Y\. Zhang, Y\. Zhang, Y\. Zhang, Z\. Zhang, C\. Zhao, F\. Zhao, J\. Zhao, S\. Zhao, X\. Zhao, Y\. Zhao, Z\. Zhao, H\. Zheng, R\. Zheng, S\. Zheng, T\. Zheng, J\. Zhong, L\. Zhong, W\. Zhong, M\. Zhou, R\. Zhou, X\. Zhou, Z\. Zhou, J\. Zhu, L\. Zhu, X\. Zhu, Y\. Zhu, Z\. Zhu, J\. Zhuang, W\. Zhuang, Y\. Zou, and X\. Zu \(2026\)Kimi k2\.5: visual agentic intelligence\.External Links:2602\.02276,[Link](https://arxiv.org/abs/2602.02276)Cited by:[§4](https://arxiv.org/html/2606.28715#S4.SS0.SSS0.Px3.p1.3)\.
- M\. Kulkarni, V\. Mazzia, J\. Gaspers, C\. Hench, and J\. FitzGerald \(2025\)MASSIVE\-agents: a benchmark for multilingual function\-calling in 52 languages\.InFindings of the Association for Computational Linguistics: EMNLP 2025,pp\. 20193–20215\.External Links:[Link](https://aclanthology.org/2025.findings-emnlp.1099/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.1099)Cited by:[§2\.2](https://arxiv.org/html/2606.28715#S2.SS2.p2.1)\.
- C\. Liu, W\. Zhang, J\. Ying, M\. Aljunied, A\. T\. Luu, and L\. Bing \(2025\)SeaExam and SeaBench: benchmarking LLMs with local multilingual questions in Southeast Asia\.InFindings of the Association for Computational Linguistics: NAACL 2025,L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico,pp\. 6134–6151\.External Links:[Link](https://aclanthology.org/2025.findings-naacl.341/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-naacl.341),ISBN 979\-8\-89176\-195\-7Cited by:[§1](https://arxiv.org/html/2606.28715#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.28715#S2.SS2.p1.1)\.
- H\. Lovenia, R\. Mahendra, S\. M\. Akbar, L\. J\. V\. Miranda, J\. Santoso, E\. Aco, A\. Fadhilah, J\. Mansurov, J\. M\. Imperial, O\. P\. Kampman, J\. R\. A\. Moniz, M\. R\. S\. Habibi, F\. Hudi, R\. Montalan, R\. Ignatius, J\. A\. Lopo, W\. Nixon, B\. F\. Karlsson, J\. Jaya, R\. Diandaru, Y\. Gao, P\. Amadeus, B\. Wang, J\. C\. B\. Cruz, C\. Whitehouse, I\. H\. Parmonangan, M\. Khelli, W\. Zhang, L\. Susanto, R\. A\. Ryanda, S\. L\. Hermawan, D\. J\. Velasco, M\. D\. A\. Kautsar, W\. F\. Hendria, Y\. Moslem, N\. Flynn, M\. F\. Adilazuarda, H\. Li, J\. Lee, R\. Damanhuri, S\. Sun, M\. R\. Qorib, A\. Djanibekov, W\. Q\. Leong, Q\. V\. Do, N\. Muennighoff, T\. Pansuwan, I\. F\. Putra, Y\. Xu, T\. N\. Chia, A\. Purwarianti, S\. Ruder, W\. Tjhi, P\. Limkonchotiwat, A\. F\. Aji, S\. Keh, G\. I\. Winata, R\. Zhang, F\. Koto, Z\. Yong, and S\. Cahyawijaya \(2024\)SEACrowd: a multilingual multimodal data hub and benchmark suite for Southeast Asian languages\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 5155–5203\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.296/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.296)Cited by:[§1](https://arxiv.org/html/2606.28715#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.28715#S2.SS2.p1.1)\.
- H\. Luo, Z\. Li, J\. Attieh, S\. Devkota, O\. de Gibert, X\. Huang, S\. Ji, P\. Lin, B\. S\. P\. V\. Mantina, A\. Sreenidhi, R\. Vázquez, M\. Wang, S\. Yusofi, F\. Yuan, and J\. Tiedemann \(2025\)GlotEval: a test suite for massively multilingual evaluation of large language models\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations,I\. Habernal, P\. Schulam, and J\. Tiedemann \(Eds\.\),Suzhou, China,pp\. 602–614\.External Links:[Link](https://aclanthology.org/2025.emnlp-demos.43/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-demos.43),ISBN 979\-8\-89176\-334\-0Cited by:[§2\.2](https://arxiv.org/html/2606.28715#S2.SS2.p1.1)\.
- R\. Mushkani, H\. Berard, A\. Cohen, and S\. Koseki \(2025\)Position: the right to AI\.InForty\-second International Conference on Machine Learning Position Paper Track,External Links:[Link](https://openreview.net/forum?id=IxCvgUme5S)Cited by:[§1](https://arxiv.org/html/2606.28715#S1.p1.1)\.
- X\. Nguyen, W\. Zhang, X\. Li, M\. Aljunied, Z\. Hu, C\. Shen, Y\. K\. Chia, X\. Li, Q\. T\. Jianyu Wang, L\. Cheng, G\. Chen, Y\. Deng, S\. Yang, C\. Liu, H\. Zhang, and L\. Bing \(2024\)SeaLLMs \- large language models for southeast asia\.External Links:[Link](https://arxiv.org/pdf/2312.00738)Cited by:[§1](https://arxiv.org/html/2606.28715#S1.p2.1)\.
- OpenAI \(2024\)MMMLU: multilingual massive multitask language understanding\.Note:[https://huggingface\.co/datasets/openai/MMMLU](https://huggingface.co/datasets/openai/MMMLU)Hugging Face datasetCited by:[§2\.2](https://arxiv.org/html/2606.28715#S2.SS2.p1.1)\.
- S\. G\. Patil, H\. Mao, C\. C\. Ji, F\. Yan, V\. Suresh, I\. Stoica, and J\. E\. Gonzalez \(2025\)The berkeley function calling leaderboard \(BFCL\): from tool use to agentic evaluation of large language models\.InProceedings of the 42nd International Conference on Machine Learning,External Links:[Link](https://gorilla.cs.berkeley.edu/leaderboard)Cited by:[§2\.1](https://arxiv.org/html/2606.28715#S2.SS1.p1.3)\.
- T\. Patwardhan, R\. Dias, E\. Proehl, G\. Kim, M\. Wang, O\. Watkins, S\. P\. Fishman, M\. Aljubeh, P\. Thacker, L\. Fauconnet, N\. S\. Kim, S\. Miserendino, G\. Chabot, D\. Li, P\. Chao, M\. Sharman, A\. Barr, A\. Glaese, and J\. Tworek \(2026\)GDPval: evaluating AI model performance on real\-world economically valuable tasks\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=hcuEdq6eKD)Cited by:[§2\.1](https://arxiv.org/html/2606.28715#S2.SS1.p1.3)\.
- A\. Purwarianti, D\. Adhista, A\. Baptiso, M\. Mahfuzh, Y\. Sabila, A\. Adila, S\. Cahyawijaya, and A\. F\. Aji \(2025\)NusaDialogue: dialogue summarization and generation for underrepresented and extremely low\-resource languages\.InProceedings of the Second Workshop in South East Asian Language Processing,D\. Wijaya, A\. F\. Aji, C\. Vania, G\. I\. Winata, and A\. Purwarianti \(Eds\.\),Online,pp\. 82–100\.External Links:[Link](https://aclanthology.org/2025.sealp-1.8/)Cited by:[§2\.2](https://arxiv.org/html/2606.28715#S2.SS2.p1.1)\.
- B\. A\. Putra \(2024\)Governing ai in southeast asia: asean’s way forward\.Frontiers in Artificial IntelligenceVolume 7 \- 2024\.External Links:[Link](https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2024.1411838),[Document](https://dx.doi.org/10.3389/frai.2024.1411838),ISSN 2624\-8212Cited by:[§1](https://arxiv.org/html/2606.28715#S1.p1.1)\.
- Y\. Qin, S\. Liang, Y\. Ye, K\. Zhu, L\. Yan, Y\. Lu, Y\. Lin, X\. Cong, X\. Tang, B\. Qian, S\. Zhao, L\. Hong, R\. Tian, R\. Xie, J\. Zhou, M\. Gerstein, d\. li, Z\. Liu, and M\. Sun \(2024\)ToolLLM: facilitating large language models to master 16000\+ real\-world apis\.InInternational Conference on Learning Representations,B\. Kim, Y\. Yue, S\. Chaudhuri, K\. Fragkiadaki, M\. Khan, and Y\. Sun \(Eds\.\),Vol\.2024,pp\. 9695–9717\.External Links:[Link](https://proceedings.iclr.cc/paper_files/paper/2024/file/28e50ee5b72e90b50e7196fde8ea260e-Paper-Conference.pdf)Cited by:[§2\.1](https://arxiv.org/html/2606.28715#S2.SS1.p1.3)\.
- Qwen\-Team \(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[§4](https://arxiv.org/html/2606.28715#S4.SS0.SSS0.Px3.p1.3)\.
- S\. Ray, K\. Dhandhania, V\. Barres, and K\. Narasimhan \(2026\) $\tau$\-Voice: benchmarking full\-duplex voice agents on real\-world domains\. External Links: 2603\.13686, [Link](https://arxiv.org/abs/2603.13686) Cited by: [§1](https://arxiv.org/html/2606.28715#S1.p3.2), [§2\.1](https://arxiv.org/html/2606.28715#S2.SS1.p1.3), [§4](https://arxiv.org/html/2606.28715#S4.SS0.SSS0.Px2.p1.2)\.
- D\. Romero, C\. Lyu, H\. A\. Wibowo, T\. Lynn, I\. Hamed, A\. N\. Kishore, A\. Mandal, A\. Dragonetti, A\. Abzaliev, A\. L\. Tonja, B\. F\. Balcha, C\. Whitehouse, C\. Salamea, D\. J\. Velasco, D\. I\. Adelani, D\. L\. Meur, E\. Villa\-Cueva, F\. Koto, F\. Farooqui, F\. Belcavello, G\. Batnasan, G\. Vallejo, G\. Caulfield, G\. Ivetta, H\. Song, H\. B\. Ademtew, H\. Maina, H\. Lovenia, I\. A\. Azime, J\. C\. B\. Cruz, J\. Gala, J\. Geng, J\. Ortiz\-Barajas, J\. Baek, J\. Dunstan, L\. A\. Alemany, K\. R\. Y\. Nagasinghe, L\. Benotti, L\. F\. D'Haro, M\. Viridiano, M\. Estecha\-Garitagoitia, M\. C\. B\. Cabrera, M\. Rodríguez\-Cantelar, M\. Jouitteau, M\. Mihaylov, M\. F\. M\. Imam, M\. F\. Adilazuarda, M\. Gochoo, M\. Otgonbold, N\. Etori, O\. Niyomugisha, P\. M\. Silva, P\. Chitale, R\. Dabre, R\. Chevi, R\. Zhang, R\. Diandaru, S\. Cahyawijaya, S\. Góngora, S\. Jeong, S\. Purkayastha, T\. Kuribayashi, T\. Jayakumar, T\. T\. Torrent, T\. Ehsan, V\. Araujo, Y\. Kementchedjhieva, Z\. Burzo, Z\. W\. Lim, Z\. X\. Yong, O\. Ignat, J\. Nwatu, R\. Mihalcea, T\. Solorio, and A\. F\. Aji \(2024\)CVQA: culturally\-diverse multilingual visual question answering benchmark\.External Links:2406\.05967Cited by:[§2\.2](https://arxiv.org/html/2606.28715#S2.SS2.p1.1)\.
- P\. Seshadri, S\. Cahyawijaya, A\. Odumakinde, S\. Singh, and S\. Goldfarb\-Tarrant \(2026\)Lost in simulation: llm\-simulated users are unreliable proxies for human users in agentic evaluations\.External Links:2601\.17087,[Link](https://arxiv.org/abs/2601.17087)Cited by:[§6\.1](https://arxiv.org/html/2606.28715#S6.SS1.SSS0.Px3.p1.1)\.
- Q\. Shi, A\. Zytek, P\. Razavi, K\. Narasimhan, and V\. Barres \(2026\)τ\\tau\-Knowledge: evaluating conversational agents over unstructured knowledge\.External Links:2603\.04370,[Link](https://arxiv.org/abs/2603.04370),[Document](https://dx.doi.org/10.48550/arXiv.2603.04370)Cited by:[§1](https://arxiv.org/html/2606.28715#S1.p3.2),[§2\.1](https://arxiv.org/html/2606.28715#S2.SS1.p1.3),[§4](https://arxiv.org/html/2606.28715#S4.SS0.SSS0.Px2.p1.2),[§6\.1](https://arxiv.org/html/2606.28715#S6.SS1.p1.1)\.
- A\. Singh, A\. Fry, A\. Perelman, A\. Tart, A\. Ganesh, A\. El\-Kishky, A\. McLaughlin, A\. Low, A\. Ostrow, A\. Ananthram, A\. Nathan, A\. Luo, A\. Helyar, A\. Madry, A\. Efremov, A\. Spyra, A\. Baker\-Whitcomb, A\. Beutel, A\. Karpenko, A\. Makelov, A\. Neitz, A\. Wei, A\. Barr, A\. Kirchmeyer, A\. Ivanov, A\. Christakis, A\. Gillespie, A\. Tam, A\. Bennett, A\. Wan, A\. Huang, A\. M\. Sandjideh, A\. Yang, A\. Kumar, A\. Saraiva, A\. Vallone, A\. Gheorghe, A\. G\. Garcia, A\. Braunstein, A\. Liu, A\. Schmidt, A\. Mereskin, A\. Mishchenko, A\. Applebaum, A\. Rogerson, A\. Rajan, A\. Wei, A\. Kotha, A\. Srivastava, A\. Agrawal, A\. Vijayvergiya, A\. Tyra, A\. Nair, A\. Nayak, B\. Eggers, B\. Ji, B\. Hoover, B\. Chen, B\. Chen, B\. Barak, B\. Minaiev, B\. Hao, B\. Baker, B\. Lightcap, B\. McKinzie, B\. Wang, B\. Quinn, B\. Fioca, B\. Hsu, B\. Yang, B\. Yu, B\. Zhang, B\. Brenner, C\. R\. Zetino, C\. Raymond, C\. Lugaresi, C\. Paz, C\. Hudson, C\. Whitney, C\. Li, C\. Chen, C\. Cole, C\. Voss, C\. Ding, C\. Shen, C\. Huang, C\. Colby, C\. Hallacy, C\. Koch, C\. Lu, C\. Kaplan, C\. Kim, C\. Minott\-Henriques, C\. Frey, C\. Yu, C\. Czarnecki, C\. Reid, C\. Wei, C\. Decareaux, C\. Scheau, C\. Zhang, C\. Forbes, D\. Tang, D\. Goldberg, D\. Roberts, D\. Palmie, D\. Kappler, D\. Levine, D\. Wright, D\. Leo, D\. Lin, D\. Robinson, D\. Grabb, D\. Chen, D\. Lim, D\. Salama, D\. Bhattacharjee, D\. Tsipras, D\. Li, D\. Yu, D\. Strouse, D\. Williams, D\. Hunn, E\. Bayes, E\. Arbus, E\. Akyurek, E\. Y\. Le, E\. Widmann, E\. Yani, E\. Proehl, E\. Sert, E\. Cheung, E\. Schwartz, E\. Han, E\. Jiang, E\. Mitchell, E\. Sigler, E\. Wallace, E\. Ritter, E\. Kavanaugh, E\. Mays, E\. Nikishin, F\. Li, F\. P\. Such, F\. de Avila Belbute Peres, F\. Raso, F\. Bekerman, F\. Tsimpourlas, F\. Chantzis, F\. Song, F\. Zhang, G\. Raila, G\. McGrath, G\. Briggs, G\. Yang, G\. Parascandolo, G\. Chabot, G\. Kim, G\. Zhao, G\. Valiant, G\. Leclerc, H\. Salman, H\. Wang, H\. Sheng, H\. Jiang, H\. 
Wang, H\. Jin, H\. Sikchi, H\. Schmidt, H\. Aspegren, H\. Chen, H\. Qiu, H\. Lightman, I\. Covert, I\. Kivlichan, I\. Silber, I\. Sohl, I\. Hammoud, I\. Clavera, I\. Lan, I\. Akkaya, I\. Kostrikov, I\. Kofman, I\. Etinger, I\. Singal, J\. Hehir, J\. Huh, J\. Pan, J\. Wilczynski, J\. Pachocki, J\. Lee, J\. Quinn, J\. Kiros, J\. Kalra, J\. Samaroo, J\. Wang, J\. Wolfe, J\. Chen, J\. Wang, J\. Harb, J\. Han, J\. Wang, J\. Zhao, J\. Chen, J\. Yang, J\. Tworek, J\. Chand, J\. Landon, J\. Liang, J\. Lin, J\. Liu, J\. Wang, J\. Tang, J\. Yin, J\. Jang, J\. Morris, J\. Flynn, J\. Ferstad, J\. Heidecke, J\. Fishbein, J\. Hallman, J\. Grant, J\. Chien, J\. Gordon, J\. Park, J\. Liss, J\. Kraaijeveld, J\. Guay, J\. Mo, J\. Lawson, J\. McGrath, J\. Vendrow, J\. Jiao, J\. Lee, J\. Steele, J\. Wang, J\. Mao, K\. Chen, K\. Hayashi, K\. Xiao, K\. Salahi, K\. Wu, K\. Sekhri, K\. Sharma, K\. Singhal, K\. Li, K\. Nguyen, K\. Gu\-Lemberg, K\. King, K\. Liu, K\. Stone, K\. Yu, K\. Ying, K\. Georgiev, K\. Lim, K\. Tirumala, K\. Miller, L\. Ahmad, L\. Lv, L\. Clare, L\. Fauconnet, L\. Itow, L\. Yang, L\. Romaniuk, L\. Anise, L\. Byron, L\. Pathak, L\. Maksin, L\. Lo, L\. Ho, L\. Jing, L\. Wu, L\. Xiong, L\. Mamitsuka, L\. Yang, L\. McCallum, L\. Held, L\. Bourgeois, L\. Engstrom, L\. Kuhn, L\. Feuvrier, L\. Zhang, L\. Switzer, L\. Kondraciuk, L\. Kaiser, M\. Joglekar, M\. Singh, M\. Shah, M\. Stratta, M\. Williams, M\. Chen, M\. Sun, M\. Cayton, M\. Li, M\. Zhang, M\. Aljubeh, M\. Nichols, M\. Haines, M\. Schwarzer, M\. Gupta, M\. Shah, M\. Y\. Guan, M\. Huang, M\. Dong, M\. Wang, M\. Glaese, M\. Carroll, M\. Lampe, M\. Malek, M\. Sharman, M\. Zhang, M\. Wang, M\. Pokrass, M\. Florian, M\. Pavlov, M\. Wang, M\. Chen, M\. Wang, M\. Feng, M\. Bavarian, M\. Lin, M\. Abdool, M\. Rohaninejad, N\. Soto, N\. Staudacher, N\. LaFontaine, N\. Marwell, N\. Liu, N\. Preston, N\. Turley, N\. Ansman, N\. Blades, N\. Pancha, N\. Mikhaylin, N\. Felix, N\. Handa, N\. Rai, N\. Keskar, N\. Brown, O\. 
Nachum, O\. Boiko, O\. Murk, O\. Watkins, O\. Gleeson, P\. Mishkin, P\. Lesiewicz, P\. Baltescu, P\. Belov, P\. Zhokhov, P\. Pronin, P\. Guo, P\. Thacker, Q\. Liu, Q\. Yuan, Q\. Liu, R\. Dias, R\. Puckett, R\. Arora, R\. T\. Mullapudi, R\. Gaon, R\. Miyara, R\. Song, R\. Aggarwal, R\. Marsan, R\. Yemiru, R\. Xiong, R\. Kshirsagar, R\. Nuttall, R\. Tsiupa, R\. Eldan, R\. Wang, R\. James, R\. Ziv, R\. Shu, R\. Nigmatullin, S\. Jain, S\. Talaie, S\. Altman, S\. Arnesen, S\. Toizer, S\. Toyer, S\. Miserendino, S\. Agarwal, S\. Yoo, S\. Heon, S\. Ethersmith, S\. Grove, S\. Taylor, S\. Bubeck, S\. Banesiu, S\. Amdo, S\. Zhao, S\. Wu, S\. Santurkar, S\. Zhao, S\. R\. Chaudhuri, S\. Krishnaswamy, Shuaiqi, Xia, S\. Cheng, S\. Anadkat, S\. P\. Fishman, S\. Tobin, S\. Fu, S\. Jain, S\. Mei, S\. Egoian, S\. Kim, S\. Golden, S\. Mah, S\. Lin, S\. Imm, S\. Sharpe, S\. Yadlowsky, S\. Choudhry, S\. Eum, S\. Sanjeev, T\. Khan, T\. Stramer, T\. Wang, T\. Xin, T\. Gogineni, T\. Christianson, T\. Sanders, T\. Patwardhan, T\. Degry, T\. Shadwell, T\. Fu, T\. Gao, T\. Garipov, T\. Sriskandarajah, T\. Sherbakov, T\. Korbak, T\. Kaftan, T\. Hiratsuka, T\. Wang, T\. Song, T\. Zhao, T\. Peterson, V\. Kharitonov, V\. Chernova, V\. Kosaraju, V\. Kuo, V\. Pong, V\. Verma, V\. Petrov, W\. Jiang, W\. Zhang, W\. Zhou, W\. Xie, W\. Zhan, W\. McCabe, W\. DePue, W\. Ellsworth, W\. Bain, W\. Thompson, X\. Chen, X\. Qi, X\. Xiang, X\. Shi, Y\. Dubois, Y\. Yu, Y\. Khakbaz, Y\. Wu, Y\. Qian, Y\. T\. Lee, Y\. Chen, Y\. Zhang, Y\. Xiong, Y\. Tian, Y\. Cha, Y\. Bai, Y\. Yang, Y\. Yuan, Y\. Li, Y\. Zhang, Y\. Yang, Y\. Jin, Y\. Jiang, Y\. Wang, Y\. Wang, Y\. Liu, Z\. Stubenvoll, Z\. Dou, Z\. Wu, and Z\. Wang \(2026\)OpenAI gpt\-5 system card\.External Links:2601\.03267,[Link](https://arxiv.org/abs/2601.03267)Cited by:[§4](https://arxiv.org/html/2606.28715#S4.SS0.SSS0.Px3.p1.3)\.
- S\. Singh, A\. Romanou, C\. Fourrier, D\. I\. Adelani, J\. G\. Ngui, D\. Vila\-Suero, P\. Limkonchotiwat, K\. Marchisio, W\. Q\. Leong, Y\. Susanto, R\. Ng, S\. Longpre, S\. Ruder, W\. Ko, A\. Bosselut, A\. Oh, A\. Martins, L\. Choshen, D\. Ippolito, E\. Ferrante, M\. Fadaee, B\. Ermis, and S\. Hooker \(2025\)Global MMLU: understanding and addressing cultural and linguistic biases in multilingual evaluation\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 18761–18799\.External Links:[Link](https://aclanthology.org/2025.acl-long.919/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.919),ISBN 979\-8\-89176\-251\-0Cited by:[§2\.2](https://arxiv.org/html/2606.28715#S2.SS2.p1.1)\.
- Y\. Susanto, A\. V\. Hulagadri, J\. R\. Montalan, J\. G\. Ngui, X\. Yong, W\. Q\. Leong, H\. Rengarajan, P\. Limkonchotiwat, Y\. Mai, and W\. C\. Tjhi \(2025\)SEA\-HELM: Southeast Asian holistic evaluation of language models\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 12308–12336\.External Links:[Link](https://aclanthology.org/2025.findings-acl.636/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.636),ISBN 979\-8\-89176\-256\-5Cited by:[§1](https://arxiv.org/html/2606.28715#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.28715#S2.SS2.p1.1)\.
- N\. Urailertprasert, P\. Limkonchotiwat, S\. Suwajanakorn, and S\. Nutanong \(2024\)SEA\-VQA: Southeast Asian cultural context dataset for visual question answering\.InProceedings of the 3rd Workshop on Advances in Language and Vision Research \(ALVR\),J\. Gu, T\. \(\. Fu, D\. Hudson, A\. Celikyilmaz, and W\. Wang \(Eds\.\),Bangkok, Thailand,pp\. 173–185\.External Links:[Link](https://aclanthology.org/2024.alvr-1.15/),[Document](https://dx.doi.org/10.18653/v1/2024.alvr-1.15)Cited by:[§2\.2](https://arxiv.org/html/2606.28715#S2.SS2.p1.1)\.
- G\. I\. Winata, A\. F\. Aji, S\. Cahyawijaya, R\. Mahendra, F\. Koto, A\. Romadhony, K\. Kurniawan, D\. Moeljadi, R\. E\. Prasojo, P\. Fung, T\. Baldwin, J\. H\. Lau, R\. Sennrich, and S\. Ruder \(2023\)NusaX: multilingual parallel sentiment dataset for 10 Indonesian local languages\.InProceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics,A\. Vlachos and I\. Augenstein \(Eds\.\),Dubrovnik, Croatia,pp\. 815–834\.External Links:[Link](https://aclanthology.org/2023.eacl-main.57/),[Document](https://dx.doi.org/10.18653/v1/2023.eacl-main.57)Cited by:[§1](https://arxiv.org/html/2606.28715#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.28715#S2.SS2.p1.1)\.
- F\. F\. Xu, Y\. Song, B\. Li, Y\. Tang, K\. Jain, M\. Bao, Z\. Z\. Wang, X\. Zhou, Z\. Guo, M\. Cao, M\. Yang, H\. Y\. Lu, A\. Martin, Z\. Su, L\. M\. Maben, R\. Mehta, W\. Chi, L\. K\. Jang, Y\. Xie, S\. Zhou, and G\. Neubig \(2026\)TheAgentCompany: benchmarking LLM agents on consequential real world tasks\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track,External Links:[Link](https://openreview.net/forum?id=LZnKNApvhG)Cited by:[§2\.1](https://arxiv.org/html/2606.28715#S2.SS1.p1.3)\.
- S\. Yao, N\. Shinn, P\. Razavi, and K\. Narasimhan \(2024\)τ\\tau\-Bench: a benchmark for tool\-agent\-user interaction in real\-world domains\.External Links:2406\.12045,[Link](https://arxiv.org/abs/2406.12045)Cited by:[§1](https://arxiv.org/html/2606.28715#S1.p3.2),[§2\.1](https://arxiv.org/html/2606.28715#S2.SS1.p1.3),[§4](https://arxiv.org/html/2606.28715#S4.SS0.SSS0.Px2.p1.2),[§8](https://arxiv.org/html/2606.28715#S8.p2.2)\.
- F\. Ye, J\. Manotumruksa, and E\. Yilmaz \(2022\)MultiWOZ 2\.4: a multi\-domain task\-oriented dialogue dataset with essential annotation corrections to improve state tracking evaluation\.InProceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue,O\. Lemon, D\. Hakkani\-Tur, J\. J\. Li, A\. Ashrafzadeh, D\. H\. Garcia, M\. Alikhani, D\. Vandyke, and O\. Dušek \(Eds\.\),Edinburgh, UK,pp\. 351–360\.External Links:[Link](https://aclanthology.org/2022.sigdial-1.34/),[Document](https://dx.doi.org/10.18653/v1/2022.sigdial-1.34)Cited by:[§1](https://arxiv.org/html/2606.28715#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.28715#S2.SS1.p1.3)\.
- X\. Zang, A\. Rastogi, S\. Sunkara, R\. Gupta, J\. Zhang, and J\. Chen \(2020\)MultiWOZ 2\.2 : a dialogue dataset with additional annotation corrections and state tracking baselines\.InProceedings of the 2nd Workshop on Natural Language Processing for Conversational AI,T\. Wen, A\. Celikyilmaz, Z\. Yu, A\. Papangelis, M\. Eric, A\. Kumar, I\. Casanueva, and R\. Shah \(Eds\.\),Online,pp\. 109–117\.External Links:[Link](https://aclanthology.org/2020.nlp4convai-1.13/),[Document](https://dx.doi.org/10.18653/v1/2020.nlp4convai-1.13)Cited by:[§1](https://arxiv.org/html/2606.28715#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.28715#S2.SS1.p1.3)\.
- W\. Zhang, H\. P\. Chan, Y\. Zhao, M\. Aljunied, J\. Wang, C\. Liu, Y\. Deng, Z\. Hu, W\. Xu, Y\. K\. Chia, X\. Li, and L\. Bing \(2024\)SeaLLMs 3: open foundation and chat multilingual large language models for southeast asian languages\.External Links:[Link](https://arxiv.org/abs/2407.19672)Cited by:[§1](https://arxiv.org/html/2606.28715#S1.p2.1)\.

## Appendix A Machine Translation Pipeline Details

### A.1 $\tau^{2}$-Bench Context

Each $\tau^{2}$-Bench domain (airline, retail, telecom) exposes the agent to two coupled surfaces. First, `data/tau2/domains/{domain}/` contains the benchmark content seen during an interaction: task definitions in `tasks.json`, domain policies and workflow documents in markdown files (`*.md`), and structured databases in `db.json`, `db.toml`, or user-side DB files. Second, `src/tau2/domains/{domain}/` contains the executable tool interface: toolkits in `tools.py` and `user_tools.py`, schema definitions in `data_model.py` and `user_data_model.py`, and, in telecom, tool return string templates.

At runtime, the agent must read policy, reason about tasks, call tools, inspect tool outputs, and communicate with a simulated user entirely in the target language\.

### A.2 Translation Principles

Adapting $\tau^{2}$-Bench to non-English languages requires translating the natural-language content seen by the language model while leaving untouched the runtime-canonical tokens consumed by the tool execution layer: enum literals, entity identifiers, status codes, and tool argument names. Conflating these two layers would corrupt evaluation: a translated product name that reaches the execution engine might trigger a lookup failure, while an untranslated persona description breaks the monolingual interaction we intend to test. We therefore build the pipeline around four principles:

Runtime invariance. Canonical tokens remain English at execution time. Only the LLM-visible surface is localized.

Terminological consistency. A canonical value should surface with the same target-language realization everywhere it appears.

Format fidelity. Translation is applied only to natural-language leaves. Structural keys, numeric fields, booleans, and program identifiers are never modified.

Bidirectional transparency. Localized values can be mapped back to canonical English forms before execution and before metric computation.

The implementation has two phases: an offline translation phase that materializes language-specific assets, and a runtime localization phase that patches the environment presented to the agent.

### A.3 Phase 1. Offline Translation

The offline phase consumes a domain's static artifacts, translates them with Gemini 3.1 Flash-Lite, and writes the translated outputs to `data/tau2/domains/{domain}/{lang_id}/`.

| Artifact Class | Source Files | Translatable Content | Runtime Alignment Purpose |
|---|---|---|---|
| Task setup | `tasks.json` | Persona text, instructions, reason-for-call, history | Defines user personas and evaluation-facing assertions. |
| Policy & workflow | `*.md` | Full document prose | Governs the agent's allowed behavior and procedures. |
| Tool descriptions | `tools.py`, `user_tools.py` | Docstrings of tool-decorated methods | Conditions how the agent interprets tool semantics. |
| Schemas & literals | `data_model.py`, `user_data_model.py` | Descriptions, enum values, and `Literal[...]` | Defines choices appearing in dynamic tool schemas. |
| Database content | `db.json`, `db.toml` | Natural-language leaves (e.g., names, notes) | Supplies textual fields surfaced through tool results. |
| Tool return messages | `tool_returns.json` | Exact messages and parameterized templates | Ensures dynamic strings returned by tools remain localized. |

Table 2: Artifact classes processed during offline translation and their role in maintaining runtime evaluation alignment.

#### Step 1. Artifact discovery

For each selected domain, the pipeline discovers files from `data/tau2/domains/{domain}/` and `src/tau2/domains/{domain}/`, then assigns each file a processing kind (`markdown`, `json`, `toml`, or `python`). Refer to Table [2](https://arxiv.org/html/2606.28715#A1.T2) for the list of artifacts and their role in translation.

#### Step 2\. Segment extraction with path\-sensitive rules\.

Each file is reduced to minimal translatable segments together with metadata describing its source path and type\.

- Markdown files yield a single full-document segment.
- Task JSON is walked recursively, but only a curated allowlist of paths is translated, including persona fields, task instructions, reason-for-call, natural-language assertions, and user- or assistant-visible message history.
- Database JSON/TOML is translated only at conservative natural-language leaf keys such as `name`, `title`, `description`, `summary`, and `notes`. Domain-specific additions may extend this set: the airline domain also translates `address1`, `address2`, and `city` (user profile address text that is natural language and safe to localize).
- Tool Python files are parsed with `ast`; only docstrings attached to `@is_tool` or `@is_discoverable_tool` methods are extracted.
- Google-style tool docstrings are decomposed into short description, long description, parameter descriptions, returns text, and raises text so that each part can be translated independently and later reassembled.
- Schema Python files are converted into JSON artifacts whose translatable fields include class descriptions, field descriptions, enum values, and `Literal[...]` alternatives.
- Tool return files expose both exact response strings and template strings as translatable segments.
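As a rough illustration of the tool-docstring rule above, the sketch below uses Python's standard `ast` module to collect docstrings only from functions decorated with `is_tool` or `is_discoverable_tool`. The helper names and the example toolkit source are hypothetical and are not taken from the released pipeline; only the decorator names come from the text.

```python
import ast
from typing import Dict, Optional

# Decorator names mentioned in the text; everything else here is illustrative.
TOOL_DECORATORS = {"is_tool", "is_discoverable_tool"}


def decorator_name(node: ast.expr) -> Optional[str]:
    """Return the bare name of a decorator written as @name, @name(...), or @mod.name."""
    if isinstance(node, ast.Call):
        node = node.func
    if isinstance(node, ast.Attribute):
        return node.attr
    if isinstance(node, ast.Name):
        return node.id
    return None


def extract_tool_docstrings(source: str) -> Dict[str, str]:
    """Map each tool-decorated function name to its cleaned docstring."""
    found: Dict[str, str] = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        names = {decorator_name(d) for d in node.decorator_list}
        if names & TOOL_DECORATORS:
            doc = ast.get_docstring(node)
            if doc:
                found[node.name] = doc
    return found


# Hypothetical toolkit source, for illustration only.
EXAMPLE = '''
class Toolkit:
    @is_tool
    def get_order(self, order_id):
        """Look up an order.

        Args:
            order_id: The order identifier.
        """

    @is_discoverable_tool()
    def reset_line(self):
        """Reset a line."""

    def _helper(self):
        """Internal helper, not a tool."""
'''

docstrings = extract_tool_docstrings(EXAMPLE)
```

Because the source is parsed rather than imported, the undecorated helper is skipped and no tool code is executed during extraction.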

#### Step 3\. Canonical\-token masking\.

Before translation, the pipeline collects and masks strings that must remain canonical: IDs such as `order_*` or `booking_*`, status values, structural task markers, tool names, and docstring section headers such as `Args` or `Returns`. Protection is partly global and partly contextual. For example, airline literals such as `basic_economy` or `round_trip` are protected when they occur in cabin-class or trip-type contexts, but not when similar surface strings appear as ordinary prose. The masking layer replaces protected strings with opaque placeholders and restores them after translation.
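The mask-and-restore idea can be sketched as follows. This is a minimal version that assumes a single global regular expression and a `__PHn__` placeholder format, both of which are illustrative choices; the contextual protection described above is not modeled.

```python
import re
from typing import Dict, Tuple

# Hypothetical protected patterns: IDs plus two canonical airline literals.
PROTECTED = re.compile(r"\b(?:order_\w+|booking_\w+|basic_economy|round_trip)\b")


def mask(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace protected tokens with opaque placeholders."""
    table: Dict[str, str] = {}    # placeholder -> original token
    reverse: Dict[str, str] = {}  # original token -> placeholder

    def _swap(match: "re.Match[str]") -> str:
        token = match.group(0)
        if token not in reverse:
            key = f"__PH{len(table)}__"
            table[key] = token
            reverse[token] = key
        return reverse[token]

    return PROTECTED.sub(_swap, text), table


def unmask(text: str, table: Dict[str, str]) -> str:
    """Restore placeholders; fail loudly if the translator dropped one."""
    for key, token in table.items():
        if key not in text:
            raise ValueError(f"placeholder {key} lost during translation")
        text = text.replace(key, token)
    return text


masked, table = mask("Cancel order_123 booked as round_trip.")
# A stand-in for the translation call: only the prose around placeholders changes.
translated = masked.replace("Cancel", "Batalkan").replace("booked as", "yang dipesan sebagai")
restored = unmask(translated, table)
```

Raising on a missing placeholder mirrors the failure case the pipeline handles by retrying or splitting a batch.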

#### Step 4\. Schema\-first literal translation\.

Schema enum labels and `Literal[...]` values are translated first in a dedicated literal mode. This produces language-specific schema artifacts from which the pipeline builds a domain literal map, including alias forms such as underscore, space-separated, or hyphenated variants of the same canonical value. This step fixes the localized terminology before any longer-form prose is translated.
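One way to picture the literal map with alias forms is shown below. The function names and the Indonesian renderings are hypothetical and serve only to illustrate how underscore, space-separated, and hyphenated variants could all resolve to one localized form.

```python
import re
from typing import Dict, Set


def alias_forms(canonical: str) -> Set[str]:
    """Underscore, space-separated, and hyphenated variants of one canonical value."""
    words = [w for w in re.split(r"[_\-\s]+", canonical.strip()) if w]
    return {sep.join(words) for sep in ("_", " ", "-")}


def build_literal_map(translations: Dict[str, str]) -> Dict[str, str]:
    """Map every alias of a canonical value to its single localized form."""
    literal_map: Dict[str, str] = {}
    for canonical, localized in translations.items():
        for alias in alias_forms(canonical):
            literal_map[alias] = localized
    return literal_map


# Hypothetical Indonesian renderings, for illustration only.
literal_map = build_literal_map(
    {"basic_economy": "ekonomi dasar", "round_trip": "pulang pergi"}
)
```

Passing such a map to the translator as a glossary is what lets later prose reuse the same target-language term for every surface variant.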

#### Step 5\. Standard translation with glossary injection\.

All remaining segments, including task prose, policy text, tool docstrings, database leaves, and tool-return templates, are translated in standard mode. The model receives the schema-derived literal map as a glossary, and segments are pre-masked so that localized forms are restored consistently. Requests are deduplicated when possible, batched as structured JSON, and executed concurrently through LiteLLM. The current implementation requires the exact Vertex route `vertex_ai/gemini-3.1-flash-lite-preview`. Batch failures can be retried, recursively split, or rerun individually if placeholder restoration fails.

#### Step 6\. Format\-aware writing and manifest recording\.

The translated text is written back using format-specific writers: markdown is emitted directly; JSON/TOML files are patched only at extracted addresses; tool docstrings are reconstructed and saved as `tools.json` or `user_tools.json`; schema artifacts are written as `data_model.json` and `user_data_model.json`; and DB files with no translated leaves are still copied through so the translated directory remains complete. Each language directory also receives `translation_manifest.json`, which records the output file, component, source language, target language, model, translation timestamp, and SHA-256 fingerprints of the source files. This manifest enables later staleness checks without forcing retranslation.
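A staleness check based on the recorded fingerprints might look like the sketch below. The manifest layout (a `source_sha256` mapping from relative path to hex digest) is an assumption made for illustration; the actual schema of `translation_manifest.json` is not specified here.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def stale_sources(manifest: Dict, root: Path) -> List[str]:
    """Return source files that are missing or whose hash differs from the manifest."""
    stale = []
    # "source_sha256" is a hypothetical key, not the pipeline's documented schema.
    for rel_path, recorded in manifest["source_sha256"].items():
        src = root / rel_path
        if not src.exists() or sha256_of(src) != recorded:
            stale.append(rel_path)
    return stale


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    policy = root / "policy.md"
    policy.write_text("Refunds within 24 hours.", encoding="utf-8")
    manifest = {"source_sha256": {"policy.md": sha256_of(policy)}}
    fresh = stale_sources(manifest, root)    # nothing changed yet
    policy.write_text("Refunds within 48 hours.", encoding="utf-8")
    changed = stale_sources(manifest, root)  # source edited after translation
```

Only files reported as stale would need retranslation, which is the point of recording fingerprints rather than timestamps alone.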

### A.4 Phase 2. Runtime Localization

The offline artifacts cover static content, but the agent also sees dynamic tool schemas and tool outputs at inference time\. These are handled by a runtime localization layer built from the paired source and translated schema artifacts\.

#### Step 7\. Build runtime localization maps\.

A `SchemaRuntimeLocalizer` constructs four resources from the source and localized schema artifacts: a description map, a canonical-to-localized literal map, a localized-to-canonical inverse map, and optional maps for exact and templated tool-return localizations.

#### Step 8\. Localize the tool schema shown to the agent\.

The environment's `get_tools()` path is wrapped so that the agent receives localized tool schemas: descriptions are translated, enum choices are shown in the target language, and default or example literal values are localized where appropriate. The tool implementation itself is unchanged.

#### Step 9\. Normalize localized arguments and localize tool outputs\.

Before tool execution, localized enum arguments supplied by the agent are mapped back to their canonical English values\. After execution, structured response payloads and tool\-return messages are localized back into the target language so the interaction remains monolingual from the agent's perspective\.
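The round trip in this step can be sketched with a pair of inverse dictionaries. The literal map, the stand-in tool, and the helper names below are hypothetical; the sketch only illustrates canonicalizing arguments, running an unchanged tool, and localizing its output.

```python
from typing import Any, Callable, Dict

# Hypothetical canonical-to-localized literal map and its inverse.
CANONICAL_TO_LOCAL: Dict[str, str] = {"economy": "ekonomi", "business": "bisnis"}
LOCAL_TO_CANONICAL: Dict[str, str] = {v: k for k, v in CANONICAL_TO_LOCAL.items()}


def remap(value: Any, table: Dict[str, str]) -> Any:
    """Recursively replace string leaves found in the table; keys and other types stay as-is."""
    if isinstance(value, dict):
        return {k: remap(v, table) for k, v in value.items()}
    if isinstance(value, list):
        return [remap(v, table) for v in value]
    if isinstance(value, str):
        return table.get(value, value)
    return value


def call_tool_localized(tool: Callable[..., Any], **localized_args: Any) -> Any:
    """Canonicalize arguments, run the unchanged tool, then localize its output."""
    canonical_args = remap(localized_args, LOCAL_TO_CANONICAL)
    result = tool(**canonical_args)
    return remap(result, CANONICAL_TO_LOCAL)


def search_flights(cabin: str, passengers: int) -> dict:
    """Stand-in tool that only accepts canonical English enum values."""
    if cabin not in CANONICAL_TO_LOCAL:
        raise ValueError(f"unknown cabin {cabin!r}")
    return {"cabin": cabin, "passengers": passengers}


reply = call_tool_localized(search_flights, cabin="ekonomi", passengers=2)
canonical_reply = remap(reply, LOCAL_TO_CANONICAL)
```

Applying the inverse map to `reply` recovers the canonical English payload, which is the same operation needed before metric computation.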

#### Step 10\. Canonicalize localized values for evaluation\.

Prior to metric computation, localized payloads are canonicalized back to English, making pass\-rate comparisons directly comparable across languages and against the original English benchmark\.

### A.5 Translated artifact statistics

Tables [3](https://arxiv.org/html/2606.28715#A1.T3), [4](https://arxiv.org/html/2606.28715#A1.T4), [5](https://arxiv.org/html/2606.28715#A1.T5), and [6](https://arxiv.org/html/2606.28715#A1.T6) provide summary statistics for the translated artifacts produced by the offline translation phase (Appendix [A.3](https://arxiv.org/html/2606.28715#A1.SS3)).

| Domain | Files / Language | Total Translated |
|---|---|---|
| Airline | 5 | 25 |
| Retail | 5 | 25 |
| Telecom | 13 | 65 |
| Total | 23 | 115 |

Table 3: No. static artifact files translated per domain. Each domain is translated into 5 target languages.

| Domain | Tasks | Docstrings | Policies | Return Msgs |
|---|---|---|---|---|
| Airline | 250 | 14 | 5 | 0 |
| Retail | 570 | 16 | 5 | 0 |
| Telecom | 570 | 43 | 25 | 60 |

Table 4: No. instances per type translated across all 5 target languages.

| Domain | Total Models | Value Sets | Localized Values |
|---|---|---|---|
| Airline | 23 | 15 | 21 |
| Retail | 15 | 6 | 14 |
| Telecom | 18 | 12 | 51 |

Table 5: Schema artifacts and literal inventories.

| Domain | Collections | Record Breakdown |
|---|---|---|
| Airline | 3 | flights: 300, users: 500, reservations: 2k |
| Retail | 3 | products: 50, users: 500, orders: 1k |
| Telecom | 6 | plans: 5, devices: 29, lines: 9, customers: 4, bills: 6 |

Table 6: Database artifacts. Structure is preserved; only designated leaf fields are translated.

## Appendix B Detail on Mixed Language Tool Adaptation

For the mixed-language tool adaptation, our goal is to investigate how stable agents remain when handling tools specified in mixed languages. The experiment is similar to (S3) L2 Tool Adaptation, where the language of the tool specification is changed to L2; here, instead of converting to a single L2, we convert the tool specifications to a mix of several L2 languages. Specifically, we explore 5 different mixed language settings by gradually increasing the number of languages in the tool specification while maintaining a similar composition across domains: Mix-2 uses English and Thai, Mix-3 adds Vietnamese, Mix-4 adds Indonesian, and Mix-5 adds Chinese. For each run, the language of a tool is fixed for the entire run to reduce noise from random sampling. For instance, if the `get_item` tool is assigned to Thai, it remains in Thai across all examples. For each non-English language added, three tools are localized into that language, while the remaining tools stay in English; thus, the English-tool count decreases as more languages are added.
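The assignment scheme can be sketched as below. How tools are actually selected for each language is not described in the text, so the seeded shuffle, the language codes, and the tool names are all assumptions made for illustration; only the order of added languages and the three-tools-per-language rule come from the description above.

```python
import random
from typing import Dict, List

# Order in which non-English languages are added, per the description above.
ADDED_LANGUAGES = ["th", "vi", "id", "zh"]
TOOLS_PER_LANGUAGE = 3


def assign_tool_languages(tools: List[str], mix_size: int, seed: int = 0) -> Dict[str, str]:
    """Assign one fixed language per tool for a Mix-<mix_size> run (English counts as one)."""
    languages = ADDED_LANGUAGES[: mix_size - 1]
    if len(tools) < TOOLS_PER_LANGUAGE * len(languages):
        raise ValueError("not enough tools for this mix")
    shuffled = sorted(tools)
    random.Random(seed).shuffle(shuffled)  # seeded, so the mapping is stable across examples
    assignment = {tool: "en" for tool in tools}
    for i, lang in enumerate(languages):
        start = i * TOOLS_PER_LANGUAGE
        for tool in shuffled[start : start + TOOLS_PER_LANGUAGE]:
            assignment[tool] = lang
    return assignment


# Hypothetical tool names, for illustration only.
TOOLS = [f"tool_{i:02d}" for i in range(14)]
mix2 = assign_tool_languages(TOOLS, mix_size=2)
mix3 = assign_tool_languages(TOOLS, mix_size=3)
```

In this sketch the tools localized to Thai in Mix-2 stay in Thai in Mix-3; that nesting is a property of the illustration, not something the text states.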

## Appendix C Hyperparameters

| Hyperparameter | Agent LLM | User Sim. LLM |
|---|---|---|
| Temperature | 0.0 | 0.0 |
| Nucleus sampling ($p$) | 1.0 | 1.0 |
| Max. generation tokens | - | - |

Table 7: Inference hyperparameters for the L2 Interaction Setting. Values follow the $\tau^{2}$-Bench defaults; parameters not listed are left at the provider model defaults.

All models in the L2 Interaction Setting are run with the default inference parameters from $\tau^{2}$-Bench, as shown in Table [7](https://arxiv.org/html/2606.28715#A3.T7).

## Appendix D Compute Budget

| Scenario | Agent LLM Cost (USD) |
|---|---|
| English Baseline | \$70.40 |
| L2 Tool Adaptation | \$340.73 |
| L2 Interaction | \$403.27 |
| L2 Domain Adaptation | \$440.43 |
| Total | \$1,254.83 |

Table 8: Approximate agent LLM inference cost (USD) for each experimental scenario, aggregated across three evaluation trials and estimated from token usage with published API pricing.

For agent model inference, we use two models across all experimental settings: GPT-5-mini accessed via the OpenRouter API, and Kimi-K2.5 accessed via Azure AI Foundry. Table [8](https://arxiv.org/html/2606.28715#A4.T8) summarizes the approximate total inference cost per scenario. For user model inference, we self-host Qwen3-235B-A22B-Instruct-2507 on 8 × A100 GPUs (40 GB each), resulting in a total of 4,032 GPU hours.

For the translation pipeline, we use Gemini 3.1 Flash-Lite Preview across three domains and five target languages. Aggregated across the five languages, the pipeline consumes 8.97M input tokens and 7.21M output tokens. Based on the Gemini 3.1 Flash-Lite pricing of \$0.25/M input tokens and \$1.50/M output tokens, the total translation cost is approximately \$13.05.
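The budget figures in this appendix can be re-derived with a few lines of arithmetic. The sketch below uses only the numbers reported above; the variable names are illustrative.

```python
# Translation cost: token counts (in millions) and per-million prices from the text.
INPUT_TOKENS_M = 8.97
OUTPUT_TOKENS_M = 7.21
PRICE_IN_USD = 0.25
PRICE_OUT_USD = 1.50

input_cost = INPUT_TOKENS_M * PRICE_IN_USD
output_cost = OUTPUT_TOKENS_M * PRICE_OUT_USD
translation_cost = input_cost + output_cost

# Agent inference cost: per-scenario values from the cost table.
scenario_costs = {
    "English Baseline": 70.40,
    "L2 Tool Adaptation": 340.73,
    "L2 Interaction": 403.27,
    "L2 Domain Adaptation": 440.43,
}
agent_total = sum(scenario_costs.values())

print(f"translation: about {translation_cost:.2f} USD")
print(f"agent inference: about {agent_total:.2f} USD")
```

With the rounded token counts, the translation cost comes to roughly 13 USD, in line with the reported approximation, and output tokens account for most of it because of the higher per-token price. The scenario costs sum to the reported total.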

## Appendix E Agent language use and drift analysis

We characterize agent language behavior along three axes: how often the agent produces text in the expected target language ([Figure 9](https://arxiv.org/html/2606.28715#A5.F9)), where in a conversation any off-target text appears ([Figure 10](https://arxiv.org/html/2606.28715#A5.F10)), and how much of the off-target text is specifically English ([Figure 11](https://arxiv.org/html/2606.28715#A5.F11)).

![Refer to caption](https://arxiv.org/html/2606.28715v1/x9.png)

Figure 9: Average turn-level agent language correctness by setting and L2. Language correctness is the fraction of eligible turns whose language, as detected by fastText, matches the expected L2. We exclude simulated user language correctness because it is consistently above 0.95 across scenarios.

#### Correctness is high overall and degrades only in crosslingual interaction.

[Figure 9](https://arxiv.org/html/2606.28715#A5.F9) shows that L2 Tool is essentially perfect across all tested languages (mean turn correctness $\approx 1.00$ for Thai, Vietnamese, Indonesian, Chinese, and Filipino), indicating that translating tool schemas does not by itself induce agent language drift: with an English dialogue the agent reliably stays in English. L2 Domain also remains high ($0.92$–$0.98$), with a gentle ordering in which Thai and Vietnamese are strongest ($0.98$) and Filipino is weakest ($0.92$). The clearest degradation is confined to L2 Interaction, where correctness falls to $0.75$–$0.82$ and is again lowest for Filipino ($0.75$) and Chinese ($0.78$). The gap between settings is large relative to the gap between languages: moving from L2 Domain to L2 Interaction costs roughly $0.15$–$0.20$ in correctness for every language, whereas the spread across languages within a setting is at most $\sim 0.07$. Drift is therefore primarily a property of the interaction regime rather than of any single language.
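The turn-level correctness metric can be sketched as follows. The detector here is a toy stand-in for a real language-identification model such as fastText, and treating a turn as eligible only when the detector returns a label is an assumption made for illustration; the exact eligibility rule is not spelled out above.

```python
from typing import Callable, Iterable, Optional


def language_correctness(
    turns: Iterable[str],
    expected: str,
    detect: Callable[[str], Optional[str]],
) -> float:
    """Fraction of eligible turns whose detected language equals `expected`."""
    labels = [detect(turn) for turn in turns]
    eligible = [label for label in labels if label is not None]
    if not eligible:
        return float("nan")
    return sum(label == expected for label in eligible) / len(eligible)


def toy_detect(text: str) -> Optional[str]:
    """Toy detector: Thai script means 'th', other non-empty text means 'en'."""
    if not text.strip():
        return None  # hypothetical eligibility rule: empty turns are skipped
    has_thai = any("\u0e00" <= ch <= "\u0e7f" for ch in text)
    return "th" if has_thai else "en"


# Made-up agent turns for a Thai-language conversation.
turns = ["สวัสดีครับ", "Your order is confirmed.", "", "ขอบคุณครับ"]
score = language_correctness(turns, "th", toy_detect)
```

Averaging this per-conversation score within each setting and language gives the kind of quantity plotted in the figure.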

![Refer to caption](https://arxiv.org/html/2606.28715v1/x10.png)

Figure 10: Proportion of non-L2 turns as a function of agent turn position, separated by setting and by speaker role (agent vs. user).

#### Text of off\-target language accumulates over the course of a conversation\.

[Figure 10](https://arxiv.org/html/2606.28715#A5.F10) decomposes this drift by turn position. In L2 Tool, the non-L2 proportion stays negligible throughout for both roles, consistent with the scenario design, in which the dialogue is in English. In L2 Interaction and L2 Domain, the agent's non-L2 share rises monotonically with turn index (Pearson $r \approx 0.65$ and $r \approx 0.73$ between turn position and off-target proportion, respectively), pointing to an accumulation or self-priming effect rather than a fixed per-turn error rate. The two settings differ in their dynamics: in L2 Interaction both the simulated user and the agent contribute off-target turns and the magnitude is largest (the agent share peaks near $0.12$ in late turns), whereas in L2 Domain the simulated user remains essentially on-target ($\approx 0.015$ throughout) and the modest rise to $\approx 0.07$ is agent-initiated and concentrated in the later turns. This is consistent with [Figure 9](https://arxiv.org/html/2606.28715#A5.F9): L2 Interaction is where the two parties' code-switching is mutually reinforcing.
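For readers who want to reproduce this kind of trend statistic, the sketch below computes a Pearson correlation between turn position and off-target share. The data points are made up for illustration and do not correspond to the values behind the reported correlations.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up off-target proportions by agent turn position, for illustration only.
turn_position = [1, 2, 3, 4, 5, 6]
off_target_share = [0.00, 0.02, 0.01, 0.05, 0.04, 0.08]

r = pearson_r(turn_position, off_target_share)
r_squared = r ** 2  # share of variance explained by a linear fit
```

A positive `r` on such data indicates that off-target text becomes more frequent later in the conversation, which is the pattern the paragraph describes.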

#### The off\-target text is predominantly English, and only for Filipino is it non\-trivial\.

[Figure 11](https://arxiv.org/html/2606.28715#A5.F11) shows that when the agent leaves the target language it overwhelmingly switches to English, but that this is rare for most languages. For Thai, Vietnamese, Indonesian, and Chinese the per-task median English share is $0$ in both settings, with right-skewed distributions whose means sit at only a few percent. Filipino is the clear outlier: its English share is highest in both settings (mean $0.11$, median $0.06$ in L2 Interaction; mean $0.08$, median $0.04$ in L2 Domain), and the per-language mean line rises toward Filipino in both panels. The right-skew indicates that English switching is driven by a minority of tasks rather than being uniform, so even the elevated Filipino mean reflects a tail of high-switching tasks rather than pervasive drift.

![Refer to caption](https://arxiv.org/html/2606.28715v1/x11.png)

Figure 11: Per-task share of agent turns emitted in English, by target language, for the two scenarios in which the dialogue language is L2. Each dot is a task; the overlaid line tracks the per-language mean.

#### Drift is real but is not the dominant explanation for performance\.

Taken together, language drift is genuine and scenario-dependent: strongest in L2 Interaction, concentrated in Filipino, accumulating over turns, and realized mainly as English code-switching. It is nonetheless too localized and too weakly associated with task success to serve as the primary explanation for performance differences. Across all scenarios, agent language correctness is essentially uncorrelated with $\mathrm{pass}^{3}$ (overall Pearson $r = -0.10$, $R^{2} \approx 0.01$, $n = 166$); the only setting with a strong positive correlation is L2 Tool, where correctness is saturated near $1.00$ and thus carries no usable variance, while in L2 Interaction and L2 Domain the scenario-level correlations are weak ($|r| \lesssim 0.25$). Language drift is therefore best read as a measurable secondary symptom of the harder crosslingual regimes (L2 Interaction) rather than a direct cause of the $\mathrm{pass}^{3}$ gaps reported in the main text.

## Appendix F All results

DomainModelENp@1p2p^\{2\}p3p^\{3\}ρ3\\rho^\{3\}AirlineGPT\-5\-mini0\.6930\.5930\.5400\.779Qwen3\-235B0\.5000\.3870\.3400\.680Kimi\-K2\.50\.7070\.6000\.5400\.764RetailGPT\-5\-mini0\.6550\.4970\.4120\.629Qwen3\-235B0\.5790\.4150\.3420\.591Kimi\-K2\.50\.5610\.4420\.3950\.704TelecomGPT\-5\-mini0\.7130\.5880\.5090\.714Qwen3\-235B0\.3570\.2050\.1320\.370Kimi\-K2\.50\.9970\.9530\.9300\.933
Table 9:English Baseline \(S1\) results\. Where EN denotes English, p@1 denotes pass@1,p2p^\{2\}denotes pass2, andp3p^\{3\}denotes pass3\. Bold values indicate the best score within each domain–metric group\. Qwen3\-235B refers toQwen3\-235B\-A22B\-Instruct\-2507\.DomainModelVITHIDZhTLp@1p2p^\{2\}p3p^\{3\}ρ3\\rho^\{3\}p@1p2p^\{2\}p3p^\{3\}ρ3\\rho^\{3\}p@1p2p^\{2\}p3p^\{3\}ρ3\\rho^\{3\}p@1p2p^\{2\}p3p^\{3\}ρ3\\rho^\{3\}p@1p2p^\{2\}p3p^\{3\}ρ3\\rho^\{3\}AirlineGPT\-5\-mini0\.6400\.5000\.4200\.6560\.5930\.4730\.4000\.6740\.6270\.4470\.3600\.5740\.6200\.4600\.3800\.6130\.6200\.5270\.4800\.774Qwen3\-235B0\.4800\.3470\.2800\.5830\.4870\.3730\.3200\.6570\.4930\.3470\.2600\.5270\.4600\.3000\.2200\.4780\.4400\.3130\.2600\.591Kimi\-K2\.50\.7330\.6600\.6000\.8180\.6870\.5800\.5200\.7570\.7400\.6530\.6000\.8110\.7600\.6400\.5800\.7630\.7470\.6530\.6000\.804RetailGPT\-5\-mini0\.6960\.5700\.5000\.7180\.6990\.5380\.4300\.6150\.6840\.5470\.4560\.6670\.6750\.5060\.4120\.6100\.6550\.5180\.4300\.656Qwen3\-235B0\.5560\.4180\.3330\.6000\.5260\.3830\.3250\.6170\.6080\.4620\.3860\.6350\.5910\.4420\.3510\.5940\.5730\.4440\.3860\.674Kimi\-K2\.50\.6870\.5500\.4650\.6770\.5730\.4300\.3600\.6270\.6400\.4830\.3860\.6030\.7220\.5970\.5260\.7290\.6750\.5410\.4740\.701TelecomGPT\-5\-mini0\.5610\.4300\.3680\.6560\.6050\.4360\.3420\.5650\.5820\.4530\.4040\.6930\.7280\.6080\.5180\.7110\.7400\.6170\.5350\.723Qwen3\-235B0\.2600\.1290\.0610\.2360\.2220\.1080\.0610\.2760\.3330\.1700\.1050\.3160\.2280\.1320\.0970\.4230\.1550\.0880\.0610\.396Kimi\-K2\.50\.9120\.8360\.7720\.8460\.8540\.7630\.7020\.8220\.9150\.8390\.7720\.8430\.9300\.8680\.8160\.8770\.8830\.7840\.7020\.795

Table 10: L2 Interaction (S2) results across Vietnamese (VI), Thai (TH), Indonesian (ID), Chinese (ZH), and Filipino (TL). Here, p@1 denotes pass@1, $p^{2}$ denotes pass^2, and $p^{3}$ denotes pass^3. Bold values indicate the best score within each domain–metric group. Qwen3-235B refers to Qwen3-235B-A22B-Instruct-2507.

| Domain | Model | VI p@1 | VI $p^{2}$ | VI $p^{3}$ | VI $\rho^{3}$ | TH p@1 | TH $p^{2}$ | TH $p^{3}$ | TH $\rho^{3}$ | ID p@1 | ID $p^{2}$ | ID $p^{3}$ | ID $\rho^{3}$ | ZH p@1 | ZH $p^{2}$ | ZH $p^{3}$ | ZH $\rho^{3}$ | TL p@1 | TL $p^{2}$ | TL $p^{3}$ | TL $\rho^{3}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Airline | GPT-5-mini | 0.653 | 0.527 | 0.460 | 0.704 | 0.620 | 0.473 | 0.400 | 0.645 | 0.687 | 0.567 | 0.480 | 0.699 | 0.687 | 0.593 | 0.520 | 0.757 | 0.667 | 0.580 | 0.540 | 0.810 |
| Airline | Qwen3-235B | 0.447 | 0.313 | 0.260 | 0.582 | 0.380 | 0.253 | 0.220 | 0.579 | 0.400 | 0.233 | 0.180 | 0.450 | 0.480 | 0.313 | 0.200 | 0.417 | 0.467 | 0.320 | 0.240 | 0.514 |
| Retail | GPT-5-mini | 0.681 | 0.532 | 0.430 | 0.631 | 0.646 | 0.477 | 0.377 | 0.584 | 0.675 | 0.526 | 0.430 | 0.637 | 0.719 | 0.579 | 0.482 | 0.670 | 0.661 | 0.515 | 0.439 | 0.664 |
| Retail | Qwen3-235B | 0.547 | 0.412 | 0.351 | 0.642 | 0.561 | 0.421 | 0.351 | 0.626 | 0.550 | 0.418 | 0.351 | 0.638 | 0.594 | 0.433 | 0.360 | 0.606 | 0.570 | 0.421 | 0.342 | 0.600 |
| Telecom | GPT-5-mini | 0.664 | 0.550 | 0.482 | 0.726 | 0.713 | 0.588 | 0.518 | 0.727 | 0.713 | 0.596 | 0.518 | 0.727 | 0.705 | 0.576 | 0.500 | 0.709 | 0.719 | 0.594 | 0.509 | 0.708 |
| Telecom | Qwen3-235B | 0.254 | 0.135 | 0.088 | 0.346 | 0.339 | 0.208 | 0.140 | 0.413 | 0.269 | 0.155 | 0.105 | 0.390 | 0.222 | 0.135 | 0.096 | 0.432 | 0.301 | 0.167 | 0.105 | 0.349 |

Table 11: Tool Adaptation (S3) results in monolingual settings across Vietnamese (VI), Thai (TH), Indonesian (ID), Chinese (ZH), and Filipino (TL). Here, p@1 denotes pass@1, $p^{2}$ denotes pass^2, and $p^{3}$ denotes pass^3. Bold values indicate the best score within each domain–metric group. Qwen3-235B refers to Qwen3-235B-A22B-Instruct-2507.

| Domain | Model | Bi p@1 | Bi $p^{2}$ | Bi $p^{3}$ | Bi $\rho^{3}$ | Tri p@1 | Tri $p^{2}$ | Tri $p^{3}$ | Tri $\rho^{3}$ | Quad p@1 | Quad $p^{2}$ | Quad $p^{3}$ | Quad $\rho^{3}$ | Multi p@1 | Multi $p^{2}$ | Multi $p^{3}$ | Multi $\rho^{3}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Airline | GPT-5-mini | 0.680 | 0.560 | 0.480 | 0.706 | 0.660 | 0.560 | 0.480 | 0.727 | 0.640 | 0.520 | 0.480 | 0.750 | 0.640 | 0.513 | 0.440 | 0.688 |
| Airline | Qwen3-235B | 0.440 | 0.280 | 0.200 | 0.455 | 0.413 | 0.267 | 0.200 | 0.484 | 0.413 | 0.287 | 0.240 | 0.581 | 0.413 | 0.267 | 0.220 | 0.533 |
| Retail | GPT-5-mini | 0.667 | 0.541 | 0.465 | 0.697 | 0.667 | 0.538 | 0.456 | 0.684 | 0.652 | 0.485 | 0.386 | 0.592 | 0.696 | 0.547 | 0.447 | 0.642 |
| Retail | Qwen3-235B | 0.596 | 0.456 | 0.386 | 0.648 | 0.535 | 0.398 | 0.325 | 0.607 | 0.585 | 0.447 | 0.377 | 0.644 | 0.570 | 0.401 | 0.298 | 0.523 |
| Telecom | GPT-5-mini | 0.664 | 0.541 | 0.482 | 0.726 | 0.696 | 0.558 | 0.491 | 0.705 | 0.670 | 0.550 | 0.491 | 0.733 | 0.652 | 0.520 | 0.465 | 0.713 |
| Telecom | Qwen3-235B | 0.298 | 0.155 | 0.088 | 0.295 | 0.392 | 0.213 | 0.140 | 0.357 | 0.365 | 0.181 | 0.105 | 0.288 | 0.342 | 0.181 | 0.114 | 0.333 |

Table 12: Tool Adaptation (S3) results for multilingual settings, including bilingual (Bi), trilingual (Tri), quadlingual (Quad), and multilingual (Multi) configurations. Here, p@1 denotes pass@1, $p^{2}$ denotes pass^2, and $p^{3}$ denotes pass^3. Bold values indicate the best score within each domain–metric group. Qwen3-235B refers to Qwen3-235B-A22B-Instruct-2507.

| Domain | Model | VI p@1 | VI $p^{2}$ | VI $p^{3}$ | VI $\rho^{3}$ | TH p@1 | TH $p^{2}$ | TH $p^{3}$ | TH $\rho^{3}$ | ID p@1 | ID $p^{2}$ | ID $p^{3}$ | ID $\rho^{3}$ | ZH p@1 | ZH $p^{2}$ | ZH $p^{3}$ | ZH $\rho^{3}$ | TL p@1 | TL $p^{2}$ | TL $p^{3}$ | TL $\rho^{3}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Airline | GPT-5-mini | 0.540 | 0.400 | 0.320 | 0.593 | 0.593 | 0.453 | 0.360 | 0.607 | 0.600 | 0.540 | 0.500 | 0.833 | 0.607 | 0.493 | 0.440 | 0.725 | 0.560 | 0.433 | 0.380 | 0.679 |
| Airline | Qwen3-235B | 0.487 | 0.460 | 0.440 | 0.903 | 0.460 | 0.413 | 0.380 | 0.826 | 0.480 | 0.440 | 0.420 | 0.875 | 0.500 | 0.473 | 0.460 | 0.920 | 0.473 | 0.420 | 0.400 | 0.846 |
| Airline | Kimi-K2.5 | 0.560 | 0.427 | 0.340 | 0.607 | 0.547 | 0.387 | 0.300 | 0.548 | 0.600 | 0.487 | 0.420 | 0.700 | 0.660 | 0.507 | 0.440 | 0.667 | 0.600 | 0.487 | 0.400 | 0.667 |
| Retail | GPT-5-mini | 0.585 | 0.427 | 0.333 | 0.569 | 0.325 | 0.175 | 0.123 | 0.378 | 0.541 | 0.392 | 0.325 | 0.601 | 0.596 | 0.450 | 0.368 | 0.617 | 0.488 | 0.333 | 0.263 | 0.539 |
| Retail | Qwen3-235B | 0.289 | 0.205 | 0.149 | 0.516 | 0.178 | 0.143 | 0.132 | 0.742 | 0.243 | 0.170 | 0.140 | 0.576 | 0.266 | 0.181 | 0.149 | 0.560 | 0.260 | 0.187 | 0.158 | 0.608 |
| Retail | Kimi-K2.5 | 0.567 | 0.392 | 0.268 | 0.473 | 0.327 | 0.190 | 0.114 | 0.349 | 0.433 | 0.287 | 0.202 | 0.467 | 0.607 | 0.450 | 0.366 | 0.603 | 0.444 | 0.292 | 0.219 | 0.493 |
| Telecom | GPT-5-mini | 0.564 | 0.427 | 0.360 | 0.638 | 0.599 | 0.447 | 0.360 | 0.601 | 0.497 | 0.325 | 0.246 | 0.495 | 0.760 | 0.652 | 0.588 | 0.774 | 0.681 | 0.547 | 0.474 | 0.696 |
| Telecom | Qwen3-235B | 0.383 | 0.213 | 0.149 | 0.389 | 0.284 | 0.143 | 0.079 | 0.278 | 0.427 | 0.295 | 0.211 | 0.494 | 0.330 | 0.205 | 0.149 | 0.452 | 0.330 | 0.205 | 0.149 | 0.452 |
| Telecom | Kimi-K2.5 | 0.699 | 0.529 | 0.404 | 0.578 | 0.693 | 0.523 | 0.430 | 0.620 | 0.798 | 0.681 | 0.588 | 0.737 | 0.863 | 0.763 | 0.684 | 0.793 | 0.743 | 0.582 | 0.474 | 0.638 |

Table 13: L2 Domain (S4) results across Vietnamese (VI), Thai (TH), Indonesian (ID), Chinese (ZH), and Filipino (TL). Here, p@1 denotes pass@1, $p^{2}$ denotes pass^2, and $p^{3}$ denotes pass^3. Bold values indicate the best score within each domain–metric group. Qwen3-235B refers to Qwen3-235B-A22B-Instruct-2507.
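
The metrics reported in Tables 9 to 13 can be recomputed from per-task trial outcomes. The sketch below is a minimal illustration, not the SEATauBench implementation. It assumes the usual τ-bench estimator for pass^k (the fraction of k-subsets of a task's n trials in which every trial succeeds, averaged over tasks), and it assumes that $\rho^{3}$ is the ratio pass^3 / pass@1. That ratio is not defined in this appendix, but it matches the tabulated values (for example, 0.540 / 0.693 ≈ 0.779 for GPT-5-mini on the English Airline baseline). The function names and the toy outcomes are hypothetical.

```python
from math import comb


def pass_hat_k(successes_per_task, n_trials, k):
    """Average over tasks of C(c, k) / C(n, k), where c is the task's success count."""
    if not 1 <= k <= n_trials:
        raise ValueError("k must satisfy 1 <= k <= n_trials")
    denom = comb(n_trials, k)
    # math.comb(c, k) is 0 whenever c < k, so such tasks contribute nothing.
    total = sum(comb(c, k) for c in successes_per_task)
    return total / (denom * len(successes_per_task))


def robustness_ratio(successes_per_task, n_trials, k):
    """Assumed definition of rho^k: pass^k divided by pass@1."""
    p1 = pass_hat_k(successes_per_task, n_trials, 1)
    pk = pass_hat_k(successes_per_task, n_trials, k)
    return pk / p1 if p1 > 0 else float("nan")


# Hypothetical data: successes out of 3 trials for five tasks.
toy_successes = [3, 3, 2, 1, 0]
p1 = pass_hat_k(toy_successes, n_trials=3, k=1)
p2 = pass_hat_k(toy_successes, n_trials=3, k=2)
p3 = pass_hat_k(toy_successes, n_trials=3, k=3)
rho3 = robustness_ratio(toy_successes, n_trials=3, k=3)
print(f"pass@1={p1:.3f} pass^2={p2:.3f} pass^3={p3:.3f} rho^3={rho3:.3f}")
```

Under this reading, $\rho^{3}$ close to 1 means that a task solved once is usually solved in all three trials, while a low $\rho^{3}$ signals that single-trial success overstates reliability.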
## Appendix G Detailed Error Analysis Results

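| Setting | Variant | Agent Critical | Agent Minor | Agent Correct | User Critical | User Minor | User Correct |
|---|---|---|---|---|---|---|---|
| English Baseline | English | 81 | 22 | 47 | 38 | 48 | 64 |
| L2 Interaction | Vietnamese | 93 | 25 | 32 | 43 | 48 | 59 |
| L2 Interaction | Thai | 105 | 14 | 31 | 42 | 45 | 63 |
| L2 Interaction | Indonesian | 88 | 18 | 44 | 36 | 49 | 65 |
| L2 Interaction | Chinese | 90 | 18 | 42 | 44 | 51 | 55 |
| L2 Interaction | Filipino | 94 | 18 | 38 | 53 | 36 | 61 |
| Tool Adaptation | VI Tools | 81 | 23 | 46 | 43 | 51 | 56 |
| Tool Adaptation | TH Tools | 88 | 28 | 34 | 39 | 45 | 66 |
| Tool Adaptation | ID Tools | 94 | 20 | 36 | 41 | 41 | 68 |
| Tool Adaptation | ZH Tools | 84 | 23 | 43 | 35 | 44 | 71 |
| Tool Adaptation | FIL Tools | 92 | 17 | 41 | 45 | 45 | 60 |
| Tool Adaptation | Mix-2 (EN+TH) | 90 | 24 | 36 | 52 | 38 | 60 |
| Tool Adaptation | Mix-3 (+VI) | 86 | 17 | 47 | 33 | 54 | 63 |
| Tool Adaptation | Mix-4 (+ID) | 89 | 26 | 35 | 35 | 45 | 70 |
| Tool Adaptation | Mix-5 (+ZH) | 102 | 15 | 33 | 40 | 41 | 69 |
| L2 Domain | Vietnamese | 83 | 27 | 40 | 43 | 41 | 66 |
| L2 Domain | Thai | 97 | 17 | 36 | 55 | 30 | 65 |
| L2 Domain | Indonesian | 92 | 24 | 34 | 42 | 39 | 69 |
| L2 Domain | Chinese | 103 | 15 | 32 | 40 | 48 | 62 |
| L2 Domain | Filipino | 101 | 17 | 32 | 50 | 47 | 53 |

Table 14: Error analysis on the Airline domain (gpt-5-mini agent, qwen3-235b user simulator, DeepSeek-V4-Flash judge, $n=150$ per condition). Counts of simulations by maximum severity level.

| Setting | Variant | Agent Critical | Agent Minor | Agent Correct | User Critical | User Minor | User Correct |
|---|---|---|---|---|---|---|---|
| English Baseline | English | 98 | 70 | 174 | 79 | 135 | 128 |
| L2 Interaction | Vietnamese | 109 | 78 | 155 | 96 | 103 | 143 |
| L2 Interaction | Thai | 125 | 73 | 144 | 94 | 124 | 124 |
| L2 Interaction | Indonesian | 117 | 65 | 160 | 86 | 121 | 135 |
| L2 Interaction | Chinese | 109 | 73 | 160 | 81 | 126 | 135 |
| L2 Interaction | Filipino | 103 | 88 | 151 | 105 | 129 | 108 |
| Tool Adaptation | VI Tools | 97 | 75 | 170 | 75 | 130 | 137 |
| Tool Adaptation | TH Tools | 102 | 79 | 161 | 70 | 123 | 149 |
| Tool Adaptation | ID Tools | 94 | 86 | 162 | 92 | 130 | 120 |
| Tool Adaptation | ZH Tools | 95 | 64 | 183 | 87 | 121 | 134 |
| Tool Adaptation | FIL Tools | 88 | 79 | 175 | 79 | 129 | 134 |
| Tool Adaptation | Mix-2 (EN+TH) | 81 | 81 | 180 | 58 | 132 | 152 |
| Tool Adaptation | Mix-3 (+VI) | 100 | 69 | 173 | 88 | 123 | 131 |
| Tool Adaptation | Mix-4 (+ID) | 100 | 72 | 170 | 88 | 112 | 142 |
| Tool Adaptation | Mix-5 (+ZH) | 87 | 93 | 162 | 82 | 129 | 131 |
| L2 Domain | Vietnamese | 112 | 93 | 137 | 90 | 122 | 130 |
| L2 Domain | Thai | 180 | 60 | 102 | 127 | 91 | 124 |
| L2 Domain | Indonesian | 124 | 76 | 142 | 93 | 122 | 127 |
| L2 Domain | Chinese | 131 | 84 | 127 | 101 | 113 | 128 |
| L2 Domain | Filipino | 131 | 74 | 137 | 123 | 118 | 101 |

Table 15: Error analysis on the Retail domain (gpt-5-mini agent, qwen3-235b user simulator, DeepSeek-V4-Flash judge, $n=342$ per condition). Counts of simulations by maximum severity level.

| Setting | Variant | Agent Critical | Agent Minor | Agent Correct | User Critical | User Minor | User Correct |
|---|---|---|---|---|---|---|---|
| English Baseline | English | 110 | 113 | 119 | 50 | 130 | 162 |
| L2 Interaction | Vietnamese | 101 | 116 | 125 | 52 | 146 | 144 |
| L2 Interaction | Thai | 125 | 90 | 127 | 59 | 133 | 150 |
| L2 Interaction | Indonesian | 122 | 100 | 120 | 60 | 136 | 146 |
| L2 Interaction | Chinese | 95 | 76 | 171 | 60 | 129 | 153 |
| L2 Interaction | Filipino | 91 | 88 | 163 | 62 | 141 | 139 |
| Tool Adaptation | VI Tools | 103 | 117 | 122 | 45 | 156 | 141 |
| Tool Adaptation | TH Tools | 87 | 106 | 149 | 46 | 147 | 149 |
| Tool Adaptation | ID Tools | 89 | 108 | 145 | 48 | 140 | 154 |
| Tool Adaptation | ZH Tools | 109 | 91 | 142 | 55 | 135 | 152 |
| Tool Adaptation | FIL Tools | 75 | 112 | 155 | 45 | 115 | 182 |
| Tool Adaptation | Mix-2 (EN+TH) | 91 | 106 | 145 | 54 | 138 | 150 |
| Tool Adaptation | Mix-3 (+VI) | 95 | 117 | 130 | 55 | 144 | 143 |
| Tool Adaptation | Mix-4 (+ID) | 91 | 102 | 149 | 56 | 128 | 158 |
| Tool Adaptation | Mix-5 (+ZH) | 94 | 101 | 147 | 43 | 148 | 151 |
| L2 Domain | Vietnamese | 125 | 71 | 146 | 54 | 112 | 176 |
| L2 Domain | Thai | 133 | 85 | 124 | 64 | 105 | 173 |
| L2 Domain | Indonesian | 139 | 81 | 122 | 59 | 120 | 163 |
| L2 Domain | Chinese | 103 | 86 | 153 | 45 | 128 | 169 |
| L2 Domain | Filipino | 131 | 89 | 122 | 64 | 130 | 148 |

Table 16: Error analysis on the Telecom domain (gpt-5-mini agent, qwen3-235b user simulator, DeepSeek-V4-Flash judge, $n=342$ per condition). Counts of simulations by maximum severity level.

| Error Tag | Eng. Baseline: EN | L2 Interaction: VI | L2 Interaction: TH | L2 Interaction: ID | L2 Interaction: ZH | L2 Interaction: FIL | Tool Adaptation: VI | Tool Adaptation: TH | Tool Adaptation: ID | Tool Adaptation: ZH | Tool Adaptation: FIL | Tool Adaptation: Mix-2 | Tool Adaptation: Mix-3 | Tool Adaptation: Mix-4 | Tool Adaptation: Mix-5 | L2 Domain: VI | L2 Domain: TH | L2 Domain: ID | L2 Domain: ZH | L2 Domain: FIL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Guideline Violation | C:100 M:20 | C:116 M:38 | C:113 M:23 | C:89 M:31 | C:85 M:27 | C:110 M:28 | C:82 M:19 | C:105 M:30 | C:87 M:32 | C:92 M:26 | C:89 M:21 | C:110 M:22 | C:97 M:16 | C:96 M:24 | C:103 M:26 | C:90 M:36 | C:130 M:28 | C:115 M:18 | C:108 M:21 | C:114 M:24 |
| Hallucination | C:6 M:7 | C:3 M:6 | C:4 M:8 | C:6 M:5 | C:9 M:1 | C:8 M:5 | C:0 M:4 | C:5 M:2 | C:5 M:6 | C:2 M:4 | C:6 M:5 | C:3 M:2 | C:3 M:5 | C:2 M:5 | C:7 M:2 | C:8 M:2 | C:14 M:10 | C:3 M:5 | C:7 M:8 | C:6 M:7 |
| Inconsistent Behavior | C:4 M:8 | C:7 M:11 | C:11 M:9 | C:6 M:15 | C:5 M:7 | C:5 M:3 | C:12 M:5 | C:11 M:3 | C:5 M:6 | C:5 M:10 | C:3 M:5 | C:3 M:5 | C:13 M:4 | C:8 M:9 | C:6 M:5 | C:6 M:13 | C:9 M:10 | C:6 M:8 | C:7 M:10 | C:6 M:5 |
| Incorrect Interpretation | C:33 M:25 | C:67 M:29 | C:56 M:30 | C:47 M:33 | C:46 M:38 | C:46 M:36 | C:25 M:29 | C:46 M:35 | C:33 M:40 | C:42 M:31 | C:39 M:29 | C:34 M:29 | C:33 M:24 | C:36 M:34 | C:46 M:21 | C:54 M:37 | C:71 M:35 | C:43 M:25 | C:47 M:32 | C:58 M:30 |
| Interruption Error | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 |
| Irrelevant Tool Call | C:0 M:0 | C:3 M:4 | C:0 M:4 | C:0 M:4 | C:0 M:2 | C:0 M:0 | C:4 M:0 | C:0 M:1 | C:0 M:0 | C:0 M:2 | C:0 M:1 | C:0 M:1 | C:1 M:1 | C:1 M:3 | C:0 M:2 | C:2 M:0 | C:1 M:4 | C:1 M:0 | C:0 M:1 | C:0 M:1 |
| Missed Required Action | C:71 M:16 | C:82 M:28 | C:98 M:18 | C:74 M:15 | C:85 M:20 | C:90 M:25 | C:79 M:22 | C:75 M:22 | C:78 M:17 | C:75 M:27 | C:85 M:22 | C:78 M:20 | C:78 M:14 | C:87 M:33 | C:84 M:21 | C:77 M:15 | C:86 M:16 | C:96 M:25 | C:94 M:19 | C:110 M:23 |
| Other | C:0 M:1 | C:2 M:4 | C:0 M:0 | C:0 M:1 | C:1 M:3 | C:0 M:0 | C:0 M:2 | C:0 M:1 | C:2 M:1 | C:0 M:0 | C:0 M:0 | C:0 M:1 | C:2 M:0 | C:1 M:1 | C:0 M:1 | C:1 M:0 | C:1 M:0 | C:0 M:1 | C:0 M:1 | C:0 M:4 |
| Premature Termination | C:28 M:1 | C:20 M:1 | C:30 M:2 | C:21 M:1 | C:18 M:0 | C:25 M:1 | C:22 M:1 | C:21 M:4 | C:19 M:1 | C:23 M:3 | C:34 M:1 | C:29 M:1 | C:35 M:2 | C:26 M:1 | C:45 M:2 | C:21 M:9 | C:25 M:2 | C:28 M:0 | C:27 M:2 | C:30 M:1 |
| Revealed Info Early | C:0 M:1 | C:0 M:2 | C:0 M:1 | C:1 M:0 | C:1 M:0 | C:2 M:2 | C:3 M:1 | C:1 M:1 | C:0 M:0 | C:0 M:0 | C:1 M:1 | C:1 M:1 | C:2 M:1 | C:0 M:0 | C:0 M:0 | C:1 M:2 | C:0 M:2 | C:3 M:1 | C:1 M:2 | C:0 M:0 |
| Tool Call Argument Error | C:5 M:4 | C:3 M:1 | C:7 M:6 | C:4 M:6 | C:3 M:7 | C:1 M:4 | C:2 M:6 | C:2 M:6 | C:1 M:4 | C:1 M:0 | C:5 M:0 | C:3 M:3 | C:2 M:5 | C:1 M:4 | C:1 M:2 | C:5 M:3 | C:5 M:3 | C:7 M:8 | C:6 M:4 | C:6 M:2 |
| Tool Call Schema Error | C:0 M:0 | C:1 M:0 | C:2 M:2 | C:0 M:2 | C:0 M:1 | C:0 M:4 | C:0 M:0 | C:0 M:1 | C:0 M:0 | C:2 M:0 | C:0 M:0 | C:0 M:1 | C:1 M:0 | C:0 M:1 | C:0 M:2 | C:1 M:1 | C:0 M:0 | C:0 M:1 | C:0 M:1 | C:0 M:0 |
| Wrong Sequence | C:9 M:3 | C:9 M:7 | C:8 M:3 | C:3 M:4 | C:11 M:4 | C:7 M:3 | C:12 M:6 | C:3 M:3 | C:5 M:3 | C:12 M:6 | C:5 M:2 | C:5 M:1 | C:11 M:1 | C:5 M:1 | C:7 M:1 | C:8 M:3 | C:7 M:3 | C:13 M:5 | C:12 M:2 | C:11 M:5 |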

Table 17: Agent error tag counts, all settings, Airline domain (gpt-5-mini agent, qwen3-235b user simulator, DeepSeek-V4-Flash judge, $n=150$ per condition). C: = critical-severity count; M: = minor-severity count. Each value is the number of simulations in which the tag appeared at that severity level at least once.

| Error Tag | Eng. Baseline: EN | L2 Interaction: VI | L2 Interaction: TH | L2 Interaction: ID | L2 Interaction: ZH | L2 Interaction: FIL | Tool Adaptation: VI | Tool Adaptation: TH | Tool Adaptation: ID | Tool Adaptation: ZH | Tool Adaptation: FIL | Tool Adaptation: Mix-2 | Tool Adaptation: Mix-3 | Tool Adaptation: Mix-4 | Tool Adaptation: Mix-5 | L2 Domain: VI | L2 Domain: TH | L2 Domain: ID | L2 Domain: ZH | L2 Domain: FIL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Guideline Violation | C:13 M:26 | C:18 M:22 | C:19 M:25 | C:15 M:30 | C:28 M:23 | C:26 M:31 | C:19 M:36 | C:9 M:17 | C:16 M:13 | C:14 M:25 | C:22 M:18 | C:23 M:19 | C:16 M:22 | C:19 M:20 | C:18 M:14 | C:16 M:20 | C:29 M:14 | C:20 M:13 | C:18 M:19 | C:21 M:24 |
| Hallucination | C:26 M:27 | C:22 M:24 | C:31 M:24 | C:19 M:37 | C:23 M:31 | C:39 M:23 | C:39 M:28 | C:39 M:20 | C:25 M:21 | C:32 M:28 | C:35 M:29 | C:46 M:19 | C:30 M:26 | C:25 M:21 | C:33 M:20 | C:35 M:22 | C:48 M:13 | C:24 M:13 | C:33 M:18 | C:29 M:21 |
| Inconsistent Behavior | C:15 M:34 | C:22 M:34 | C:15 M:28 | C:17 M:42 | C:25 M:35 | C:25 M:31 | C:15 M:45 | C:19 M:27 | C:13 M:40 | C:14 M:37 | C:24 M:33 | C:21 M:37 | C:19 M:36 | C:15 M:37 | C:25 M:33 | C:16 M:23 | C:26 M:30 | C:15 M:27 | C:18 M:37 | C:19 M:39 |
| Incorrect Interpretation | C:4 M:7 | C:3 M:8 | C:2 M:9 | C:2 M:4 | C:3 M:9 | C:11 M:7 | C:2 M:11 | C:3 M:2 | C:3 M:5 | C:1 M:6 | C:1 M:4 | C:6 M:8 | C:4 M:9 | C:3 M:4 | C:1 M:7 | C:3 M:8 | C:3 M:2 | C:3 M:6 | C:0 M:13 | C:5 M:8 |
| Interruption Error | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 |
| Irrelevant Tool Call | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 |
| Missed Required Action | C:0 M:1 | C:3 M:1 | C:2 M:0 | C:0 M:1 | C:1 M:2 | C:2 M:1 | C:1 M:1 | C:1 M:0 | C:3 M:1 | C:1 M:0 | C:1 M:2 | C:1 M:1 | C:0 M:0 | C:1 M:3 | C:0 M:1 | C:2 M:3 | C:0 M:1 | C:0 M:0 | C:0 M:0 | C:3 M:3 |
| Other | C:0 M:1 | C:0 M:1 | C:0 M:0 | C:0 M:0 | C:0 M:2 | C:1 M:0 | C:0 M:0 | C:0 M:2 | C:0 M:0 | C:1 M:0 | C:0 M:1 | C:0 M:0 | C:0 M:3 | C:1 M:0 | C:0 M:0 | C:0 M:2 | C:0 M:0 | C:0 M:1 | C:0 M:0 | C:2 M:1 |
| Premature Termination | C:9 M:10 | C:6 M:12 | C:12 M:15 | C:11 M:9 | C:11 M:18 | C:8 M:10 | C:4 M:7 | C:10 M:17 | C:10 M:16 | C:7 M:10 | C:9 M:15 | C:9 M:7 | C:7 M:9 | C:6 M:11 | C:3 M:11 | C:8 M:14 | C:10 M:8 | C:9 M:11 | C:9 M:12 | C:11 M:16 |
| Revealed Info Early | C:0 M:2 | C:1 M:0 | C:0 M:2 | C:0 M:0 | C:0 M:1 | C:0 M:2 | C:0 M:2 | C:0 M:0 | C:0 M:0 | C:1 M:1 | C:0 M:1 | C:0 M:1 | C:0 M:0 | C:0 M:0 | C:0 M:1 | C:0 M:3 | C:0 M:0 | C:0 M:0 | C:0 M:3 | C:0 M:1 |
| Tool Call Argument Error | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 |
| Tool Call Schema Error | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 |
| Wrong Sequence | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:1 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:1 | C:0 M:1 | C:0 M:0 | C:1 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:1 | C:0 M:0 |

Table 18: User error tag counts, all settings, Airline domain (gpt-5-mini agent, qwen3-235b user simulator, DeepSeek-V4-Flash judge, $n=150$ per condition). C: = critical-severity count; M: = minor-severity count. Each value is the number of simulations in which the tag appeared at that severity level at least once.

| Error Tag | Eng. Baseline: EN | L2 Interaction: VI | L2 Interaction: TH | L2 Interaction: ID | L2 Interaction: ZH | L2 Interaction: FIL | Tool Adaptation: VI | Tool Adaptation: TH | Tool Adaptation: ID | Tool Adaptation: ZH | Tool Adaptation: FIL | Tool Adaptation: Mix-2 | Tool Adaptation: Mix-3 | Tool Adaptation: Mix-4 | Tool Adaptation: Mix-5 | L2 Domain: VI | L2 Domain: TH | L2 Domain: ID | L2 Domain: ZH | L2 Domain: FIL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Guideline Violation | C:70 M:74 | C:99 M:80 | C:105 M:81 | C:88 M:83 | C:71 M:59 | C:86 M:94 | C:68 M:52 | C:70 M:66 | C:69 M:69 | C:62 M:55 | C:78 M:56 | C:49 M:55 | C:67 M:48 | C:63 M:53 | C:78 M:72 | C:73 M:76 | C:128 M:54 | C:92 M:53 | C:112 M:93 | C:92 M:59 |
| Hallucination | C:7 M:11 | C:9 M:12 | C:15 M:13 | C:17 M:17 | C:13 M:11 | C:10 M:10 | C:9 M:11 | C:6 M:13 | C:3 M:11 | C:16 M:7 | C:15 M:10 | C:11 M:10 | C:11 M:12 | C:17 M:18 | C:12 M:18 | C:18 M:17 | C:30 M:10 | C:18 M:17 | C:9 M:18 | C:14 M:13 |
| Inconsistent Behavior | C:10 M:15 | C:17 M:24 | C:13 M:24 | C:8 M:15 | C:8 M:28 | C:8 M:16 | C:10 M:15 | C:6 M:14 | C:7 M:17 | C:8 M:10 | C:5 M:21 | C:10 M:15 | C:10 M:13 | C:7 M:16 | C:8 M:26 | C:13 M:27 | C:29 M:25 | C:12 M:22 | C:24 M:27 | C:22 M:22 |
| Incorrect Interpretation | C:34 M:39 | C:32 M:54 | C:42 M:59 | C:38 M:54 | C:26 M:44 | C:37 M:39 | C:31 M:53 | C:40 M:47 | C:28 M:39 | C:30 M:40 | C:36 M:34 | C:22 M:33 | C:29 M:32 | C:38 M:43 | C:27 M:65 | C:48 M:88 | C:142 M:62 | C:50 M:56 | C:63 M:63 | C:50 M:55 |
| Interruption Error | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 |
| Irrelevant Tool Call | C:2 M:3 | C:1 M:3 | C:1 M:4 | C:5 M:5 | C:0 M:5 | C:0 M:2 | C:1 M:3 | C:1 M:5 | C:0 M:0 | C:0 M:2 | C:2 M:2 | C:0 M:1 | C:1 M:3 | C:0 M:3 | C:1 M:5 | C:0 M:1 | C:11 M:1 | C:1 M:4 | C:1 M:5 | C:1 M:1 |
| Missed Required Action | C:75 M:53 | C:78 M:62 | C:95 M:51 | C:88 M:55 | C:93 M:53 | C:94 M:63 | C:62 M:59 | C:79 M:66 | C:74 M:64 | C:69 M:63 | C:69 M:69 | C:58 M:56 | C:63 M:71 | C:73 M:49 | C:59 M:67 | C:73 M:66 | C:160 M:62 | C:75 M:65 | C:98 M:83 | C:101 M:81 |
| Other | C:0 M:2 | C:0 M:3 | C:1 M:7 | C:2 M:1 | C:0 M:3 | C:0 M:3 | C:0 M:3 | C:0 M:3 | C:0 M:4 | C:2 M:1 | C:1 M:2 | C:1 M:2 | C:0 M:4 | C:0 M:1 | C:0 M:6 | C:0 M:3 | C:1 M:5 | C:1 M:2 | C:1 M:5 | C:2 M:1 |
| Premature Termination | C:10 M:5 | C:5 M:2 | C:7 M:2 | C:11 M:0 | C:5 M:7 | C:9 M:5 | C:6 M:3 | C:7 M:3 | C:8 M:3 | C:12 M:3 | C:6 M:1 | C:12 M:4 | C:17 M:2 | C:5 M:1 | C:6 M:2 | C:12 M:4 | C:28 M:4 | C:15 M:4 | C:14 M:1 | C:7 M:3 |
| Revealed Info Early | C:0 M:0 | C:1 M:5 | C:4 M:4 | C:4 M:2 | C:1 M:2 | C:2 M:3 | C:2 M:3 | C:1 M:2 | C:1 M:5 | C:0 M:6 | C:0 M:0 | C:0 M:4 | C:3 M:1 | C:0 M:5 | C:1 M:0 | C:1 M:3 | C:2 M:3 | C:2 M:3 | C:0 M:3 | C:2 M:2 |
| Tool Call Argument Error | C:8 M:5 | C:7 M:9 | C:7 M:18 | C:8 M:6 | C:9 M:10 | C:5 M:3 | C:9 M:7 | C:5 M:7 | C:8 M:9 | C:6 M:3 | C:8 M:6 | C:5 M:14 | C:7 M:4 | C:8 M:9 | C:5 M:8 | C:35 M:49 | C:48 M:39 | C:39 M:28 | C:48 M:25 | C:37 M:34 |
| Tool Call Schema Error | C:1 M:2 | C:3 M:5 | C:4 M:6 | C:1 M:5 | C:0 M:4 | C:4 M:9 | C:0 M:6 | C:2 M:6 | C:1 M:8 | C:4 M:2 | C:4 M:5 | C:1 M:2 | C:0 M:9 | C:4 M:5 | C:2 M:7 | C:6 M:1 | C:3 M:9 | C:5 M:9 | C:6 M:5 | C:2 M:4 |
| Wrong Sequence | C:20 M:7 | C:24 M:24 | C:20 M:17 | C:19 M:20 | C:19 M:16 | C:23 M:15 | C:16 M:14 | C:12 M:21 | C:18 M:18 | C:8 M:13 | C:11 M:9 | C:14 M:14 | C:7 M:10 | C:11 M:13 | C:11 M:18 | C:14 M:11 | C:24 M:8 | C:19 M:21 | C:12 M:15 | C:13 M:12 |

Table 19: Agent error tag counts, all settings, Retail domain (gpt-5-mini agent, qwen3-235b user simulator, DeepSeek-V4-Flash judge, $n=342$ per condition). C: = critical-severity count; M: = minor-severity count. Each value is the number of simulations in which the tag appeared at that severity level at least once.

| Error Tag | Eng. Baseline: EN | L2 Interaction: VI | L2 Interaction: TH | L2 Interaction: ID | L2 Interaction: ZH | L2 Interaction: FIL | Tool Adaptation: VI | Tool Adaptation: TH | Tool Adaptation: ID | Tool Adaptation: ZH | Tool Adaptation: FIL | Tool Adaptation: Mix-2 | Tool Adaptation: Mix-3 | Tool Adaptation: Mix-4 | Tool Adaptation: Mix-5 | L2 Domain: VI | L2 Domain: TH | L2 Domain: ID | L2 Domain: ZH | L2 Domain: FIL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Guideline Violation | C:19 M:50 | C:32 M:51 | C:41 M:72 | C:28 M:36 | C:38 M:42 | C:31 M:56 | C:19 M:46 | C:22 M:34 | C:41 M:43 | C:29 M:57 | C:24 M:47 | C:17 M:37 | C:30 M:39 | C:23 M:29 | C:27 M:40 | C:25 M:31 | C:29 M:23 | C:29 M:32 | C:33 M:31 | C:23 M:33 |
| Hallucination | C:69 M:105 | C:87 M:91 | C:71 M:82 | C:59 M:92 | C:55 M:86 | C:95 M:108 | C:71 M:115 | C:55 M:90 | C:61 M:132 | C:74 M:104 | C:54 M:123 | C:53 M:123 | C:82 M:106 | C:69 M:94 | C:64 M:129 | C:75 M:76 | C:111 M:74 | C:76 M:105 | C:119 M:110 | C:115 M:106 |
| Inconsistent Behavior | C:22 M:93 | C:38 M:85 | C:40 M:101 | C:38 M:96 | C:32 M:81 | C:44 M:101 | C:28 M:99 | C:28 M:86 | C:27 M:86 | C:30 M:103 | C:37 M:92 | C:14 M:66 | C:38 M:94 | C:31 M:76 | C:21 M:85 | C:39 M:94 | C:39 M:86 | C:34 M:102 | C:26 M:84 | C:51 M:111 |
| Incorrect Interpretation | C:4 M:21 | C:15 M:8 | C:9 M:10 | C:6 M:8 | C:1 M:13 | C:10 M:7 | C:2 M:10 | C:5 M:16 | C:4 M:8 | C:8 M:16 | C:7 M:15 | C:7 M:12 | C:5 M:12 | C:5 M:12 | C:12 M:14 | C:5 M:20 | C:11 M:16 | C:7 M:12 | C:5 M:17 | C:13 M:15 |
| Interruption Error | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 |
| Irrelevant Tool Call | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 |
| Missed Required Action | C:0 M:2 | C:1 M:1 | C:2 M:4 | C:4 M:2 | C:3 M:1 | C:3 M:3 | C:2 M:1 | C:5 M:4 | C:0 M:4 | C:3 M:4 | C:2 M:3 | C:0 M:0 | C:0 M:6 | C:1 M:2 | C:2 M:2 | C:2 M:0 | C:3 M:2 | C:2 M:5 | C:2 M:3 | C:3 M:2 |
| Other | C:0 M:4 | C:2 M:3 | C:2 M:10 | C:1 M:1 | C:0 M:8 | C:1 M:1 | C:0 M:2 | C:0 M:4 | C:0 M:2 | C:0 M:2 | C:1 M:3 | C:1 M:5 | C:0 M:7 | C:0 M:2 | C:0 M:6 | C:0 M:4 | C:0 M:3 | C:0 M:4 | C:1 M:3 | C:0 M:4 |
| Premature Termination | C:9 M:19 | C:11 M:17 | C:17 M:18 | C:12 M:19 | C:16 M:22 | C:10 M:15 | C:10 M:16 | C:7 M:15 | C:15 M:14 | C:10 M:16 | C:14 M:17 | C:6 M:14 | C:10 M:15 | C:17 M:12 | C:11 M:21 | C:8 M:18 | C:28 M:24 | C:16 M:19 | C:11 M:12 | C:22 M:16 |
| Revealed Info Early | C:0 M:1 | C:0 M:0 | C:2 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:2 M:2 | C:1 M:0 | C:3 M:0 | C:1 M:2 | C:0 M:1 | C:1 M:1 | C:1 M:1 | C:0 M:0 | C:0 M:0 | C:0 M:1 | C:3 M:2 | C:0 M:0 | C:0 M:1 | C:0 M:2 |
| Tool Call Argument Error | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 |
| Tool Call Schema Error | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 |
| Wrong Sequence | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:1 | C:0 M:0 | C:0 M:0 | C:0 M:1 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 |

Table 20: User error tag counts, all settings, Retail domain (gpt-5-mini agent, qwen3-235b user simulator, DeepSeek-V4-Flash judge, $n=342$ per condition). C: = critical-severity count; M: = minor-severity count. Each value is the number of simulations in which the tag appeared at that severity level at least once.

| Error Tag | Eng. Baseline: EN | L2 Interaction: VI | L2 Interaction: TH | L2 Interaction: ID | L2 Interaction: ZH | L2 Interaction: FIL | Tool Adaptation: VI | Tool Adaptation: TH | Tool Adaptation: ID | Tool Adaptation: ZH | Tool Adaptation: FIL | Tool Adaptation: Mix-2 | Tool Adaptation: Mix-3 | Tool Adaptation: Mix-4 | Tool Adaptation: Mix-5 | L2 Domain: VI | L2 Domain: TH | L2 Domain: ID | L2 Domain: ZH | L2 Domain: FIL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Guideline Violation | C:88 M:81 | C:77 M:128 | C:99 M:105 | C:103 M:116 | C:71 M:46 | C:61 M:68 | C:73 M:90 | C:69 M:77 | C:57 M:94 | C:77 M:61 | C:49 M:88 | C:80 M:83 | C:78 M:81 | C:70 M:75 | C:72 M:90 | C:117 M:104 | C:106 M:97 | C:98 M:91 | C:78 M:81 | C:106 M:115 |
| Hallucination | C:4 M:8 | C:6 M:16 | C:7 M:15 | C:6 M:14 | C:3 M:5 | C:4 M:11 | C:4 M:18 | C:4 M:11 | C:6 M:8 | C:2 M:14 | C:2 M:11 | C:4 M:15 | C:3 M:14 | C:4 M:17 | C:3 M:14 | C:7 M:16 | C:5 M:11 | C:5 M:19 | C:5 M:9 | C:3 M:11 |
| Inconsistent Behavior | C:9 M:28 | C:15 M:49 | C:16 M:29 | C:12 M:40 | C:4 M:20 | C:6 M:32 | C:6 M:29 | C:6 M:25 | C:3 M:29 | C:3 M:36 | C:7 M:36 | C:8 M:27 | C:9 M:29 | C:3 M:19 | C:7 M:26 | C:10 M:37 | C:9 M:47 | C:8 M:44 | C:8 M:31 | C:12 M:38 |
| Incorrect Interpretation | C:39 M:94 | C:38 M:84 | C:47 M:100 | C:43 M:100 | C:36 M:53 | C:29 M:60 | C:41 M:80 | C:17 M:91 | C:30 M:80 | C:37 M:74 | C:26 M:66 | C:28 M:71 | C:36 M:98 | C:26 M:96 | C:40 M:80 | C:50 M:69 | C:54 M:95 | C:53 M:77 | C:29 M:71 | C:54 M:85 |
| Interruption Error | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 |
| Irrelevant Tool Call | C:0 M:10 | C:4 M:36 | C:7 M:24 | C:5 M:19 | C:1 M:6 | C:3 M:18 | C:1 M:11 | C:2 M:8 | C:2 M:27 | C:5 M:6 | C:2 M:11 | C:2 M:14 | C:1 M:36 | C:0 M:6 | C:1 M:20 | C:3 M:19 | C:2 M:11 | C:1 M:8 | C:0 M:4 | C:3 M:15 |
| Missed Required Action | C:127 M:144 | C:106 M:132 | C:142 M:118 | C:138 M:132 | C:109 M:107 | C:107 M:97 | C:140 M:154 | C:93 M:124 | C:104 M:133 | C:134 M:142 | C:96 M:147 | C:107 M:153 | C:100 M:189 | C:106 M:135 | C:118 M:118 | C:151 M:96 | C:166 M:136 | C:178 M:108 | C:111 M:119 | C:163 M:111 |
| Other | C:0 M:6 | C:1 M:6 | C:0 M:10 | C:1 M:8 | C:1 M:3 | C:1 M:8 | C:0 M:5 | C:1 M:14 | C:0 M:17 | C:2 M:5 | C:5 M:2 | C:0 M:11 | C:1 M:27 | C:0 M:2 | C:1 M:6 | C:0 M:15 | C:1 M:9 | C:1 M:7 | C:0 M:7 | C:1 M:2 |
| Premature Termination | C:25 M:4 | C:25 M:6 | C:36 M:4 | C:28 M:2 | C:23 M:2 | C:18 M:1 | C:25 M:1 | C:28 M:1 | C:28 M:4 | C:28 M:3 | C:23 M:1 | C:36 M:4 | C:28 M:4 | C:33 M:0 | C:37 M:2 | C:30 M:3 | C:39 M:2 | C:33 M:3 | C:27 M:3 | C:33 M:5 |
| Revealed Info Early | C:0 M:2 | C:2 M:2 | C:2 M:2 | C:0 M:1 | C:1 M:1 | C:0 M:0 | C:1 M:1 | C:1 M:3 | C:1 M:3 | C:1 M:1 | C:1 M:0 | C:2 M:1 | C:0 M:0 | C:0 M:1 | C:1 M:4 | C:3 M:4 | C:1 M:4 | C:0 M:5 | C:0 M:0 | C:2 M:2 |
| Tool Call Argument Error | C:1 M:0 | C:6 M:7 | C:4 M:19 | C:0 M:4 | C:1 M:3 | C:1 M:4 | C:0 M:2 | C:1 M:3 | C:1 M:4 | C:1 M:1 | C:1 M:7 | C:3 M:1 | C:2 M:3 | C:0 M:0 | C:0 M:6 | C:2 M:7 | C:2 M:7 | C:2 M:3 | C:2 M:5 | C:3 M:11 |
| Tool Call Schema Error | C:0 M:1 | C:1 M:3 | C:0 M:6 | C:0 M:1 | C:1 M:0 | C:1 M:2 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:1 | C:2 M:0 | C:1 M:3 | C:0 M:2 | C:0 M:0 | C:0 M:0 | C:2 M:1 | C:0 M:0 | C:1 M:3 | C:0 M:0 | C:5 M:4 |
| Wrong Sequence | C:6 M:12 | C:13 M:24 | C:7 M:27 | C:10 M:16 | C:12 M:7 | C:5 M:8 | C:12 M:15 | C:7 M:13 | C:10 M:14 | C:10 M:8 | C:9 M:9 | C:8 M:13 | C:7 M:16 | C:7 M:9 | C:12 M:18 | C:17 M:13 | C:16 M:12 | C:23 M:21 | C:12 M:19 | C:13 M:27 |

Table 21: Agent error tag counts, all settings, Telecom domain (gpt-5-mini agent, qwen3-235b user simulator, DeepSeek-V4-Flash judge, $n=342$ per condition). C: = critical-severity count; M: = minor-severity count. Each value is the number of simulations in which the tag appeared at that severity level at least once.

| Error Tag | Eng. Baseline: EN | L2 Interaction: VI | L2 Interaction: TH | L2 Interaction: ID | L2 Interaction: ZH | L2 Interaction: FIL | Tool Adaptation: VI | Tool Adaptation: TH | Tool Adaptation: ID | Tool Adaptation: ZH | Tool Adaptation: FIL | Tool Adaptation: Mix-2 | Tool Adaptation: Mix-3 | Tool Adaptation: Mix-4 | Tool Adaptation: Mix-5 | L2 Domain: VI | L2 Domain: TH | L2 Domain: ID | L2 Domain: ZH | L2 Domain: FIL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Guideline Violation | C:25 M:42 | C:10 M:93 | C:30 M:77 | C:30 M:53 | C:30 M:82 | C:54 M:105 | C:33 M:52 | C:36 M:61 | C:16 M:74 | C:19 M:64 | C:35 M:58 | C:15 M:49 | C:13 M:52 | C:15 M:77 | C:12 M:56 | C:13 M:42 | C:17 M:46 | C:11 M:59 | C:23 M:85 | C:22 M:85 |
| Hallucination | C:52 M:85 | C:34 M:96 | C:47 M:67 | C:68 M:74 | C:62 M:85 | C:29 M:68 | C:45 M:91 | C:48 M:88 | C:51 M:78 | C:49 M:55 | C:32 M:60 | C:48 M:93 | C:40 M:95 | C:38 M:69 | C:74 M:83 | C:39 M:48 | C:49 M:56 | C:57 M:54 | C:37 M:61 | C:52 M:47 |
| Inconsistent Behavior | C:24 M:194 | C:19 M:207 | C:26 M:186 | C:33 M:139 | C:40 M:197 | C:52 M:194 | C:17 M:186 | C:47 M:175 | C:15 M:180 | C:22 M:160 | C:23 M:135 | C:28 M:158 | C:32 M:209 | C:23 M:169 | C:14 M:184 | C:18 M:146 | C:16 M:151 | C:20 M:140 | C:10 M:115 | C:20 M:142 |
| Incorrect Interpretation | C:1 M:50 | C:6 M:30 | C:2 M:36 | C:8 M:49 | C:7 M:38 | C:3 M:31 | C:4 M:30 | C:6 M:36 | C:5 M:49 | C:6 M:52 | C:4 M:46 | C:4 M:38 | C:7 M:35 | C:4 M:39 | C:4 M:29 | C:3 M:30 | C:7 M:34 | C:8 M:25 | C:6 M:28 | C:7 M:42 |
| Interruption Error | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 | C:0 M:0 |
| Irrelevant Tool Call | C:0 M:0 | C:0 M:1 | C:0 M:1 | C:0 M:4 | C:0 M:0 | C:0 M:0 | C:0 M:1 | C:0 M:1 | C:0 M:8 | C:0 M:3 | C:0 M:10 | C:0 M:0 | C:0 M:1 | C:0 M:3 | C:0 M:0 | C:0 M:1 | C:0 M:0 | C:0 M:1 | C:0 M:3 | C:0 M:3 |
| Missed Required Action | C:3 M:24 | C:3 M:9 | C:6 M:16 | C:1 M:22 | C:3 M:12 | C:2 M:29 | C:2 M:11 | C:2 M:30 | C:2 M:21 | C:5 M:16 | C:9 M:11 | C:3 M:22 | C:2 M:8 | C:4 M:17 | C:1 M:21 | C:6 M:8 | C:2 M:12 | C:1 M:9 | C:3 M:14 | C:2 M:9 |
| Other | C:1 M:10 | C:1 M:9 | C:0 M:13 | C:1 M:9 | C:0 M:28 | C:0 M:15 | C:0 M:15 | C:1 M:3 | C:0 M:12 | C:0 M:32 | C:0 M:11 | C:1 M:9 | C:0 M:16 | C:1 M:10 | C:1 M:10 | C:1 M:11 | C:0 M:33 | C:0 M:20 | C:1 M:8 | C:0 M:27 |
| Premature Termination | C:13 M:16 | C:9 M:24 | C:12 M:20 | C:10 M:18 | C:14 M:21 | C:17 M:16 | C:5 M:23 | C:3 M:14 | C:12 M:20 | C:8 M:15 | C:11 M:14 | C:11 M:15 | C:12 M:17 | C:15 M:14 | C:5 M:16 | C:11 M:23 | C:17 M:12 | C:14 M:29 | C:9 M:18 | C:11 M:31 |
| Revealed Info Early | C:0 M:0 | C:0 M:0 | C:0 M:2 | C:0 M:1 | C:1 M:3 | C:0 M:3 | C:0 M:1 | C:0 M:1 | C:0 M:0 | C:0 M:4 | C:0 M:15 | C:1 M:1 | C:1 M:1 | C:0 M:0 | C:0 M:0 | C:1 M:0 | C:1 M:1 | C:1 M:2 | C:0 M:2 | C:0 M:1 |
| Tool Call Argument Error | C:0 M:3 | C:2 M:7 | C:0 M:2 | C:2 M:1 | C:1 M:0 | C:1 M:2 | C:0 M:10 | C:0 M:5 | C:1 M:9 | C:0 M:1 | C:0 M:2 | C:1 M:4 | C:4 M:8 | C:0 M:1 | C:0 M:1 | C:0 M:5 | C:0 M:2 | C:0 M:1 | C:0 M:3 | C:0 M:2 |
| Tool Call Schema Error | C:0 M:1 | C:5 M:6 | C:0 M:0 | C:0 M:2 | C:0 M:8 | C:0 M:8 | C:0 M:3 | C:0 M:6 | C:0 M:5 | C:0 M:2 | C:0 M:4 | C:0 M:3 | C:0 M:1 | C:0 M:5 | C:0 M:1 | C:0 M:0 | C:0 M:1 | C:0 M:22 | C:1 M:1 | C:0 M:2 |
| Wrong Sequence | C:0 M:0 | C:0 M:0 | C:0 M:11 | C:4 M:6 | C:2 M:5 | C:0 M:4 | C:0 M:7 | C:1 M:2 | C:0 M:6 | C:2 M:3 | C:0 M:3 | C:0 M:10 | C:0 M:1 | C:0 M:5 | C:0 M:2 | C:0 M:2 | C:1 M:1 | C:0 M:1 | C:0 M:1 | C:0 M:1 |

Table 22: User error tag counts, all settings, Telecom domain (gpt-5-mini agent, qwen3-235b user simulator, DeepSeek-V4-Flash judge, $n=342$ per condition). C: = critical-severity count; M: = minor-severity count. Each value is the number of simulations in which the tag appeared at that severity level at least once.
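
The two counting rules stated in the captions of this appendix (simulations bucketed by maximum severity level in Tables 14 to 16, and tags counted once per simulation and severity level in Tables 17 to 22) can be sketched as follows. This is an illustration under assumptions, not the authors' analysis code: the record layout (one list of `(tag, severity)` pairs per simulation, for a single party such as the agent), the function names, and the reading of "Correct" as "no error flagged" are all hypothetical.

```python
from collections import Counter

# Ordering used to pick the worst severity seen in one simulation.
SEVERITY_RANK = {"correct": 0, "minor": 1, "critical": 2}


def max_severity(errors):
    """Return the worst severity among (tag, severity) pairs, or "correct" if none."""
    return max(
        (severity for _, severity in errors),
        key=SEVERITY_RANK.__getitem__,
        default="correct",
    )


def severity_buckets(simulations):
    """Count simulations by their maximum severity level."""
    return Counter(max_severity(errors) for errors in simulations)


def tag_counts(simulations):
    """Count, per (tag, severity), the simulations in which it appears at least once."""
    counts = Counter()
    for errors in simulations:
        counts.update(set(errors))  # set() so repeats inside one simulation count once
    return counts


# Hypothetical judged simulations for one party.
sims = [
    [("Guideline Violation", "critical"),
     ("Guideline Violation", "critical"),
     ("Hallucination", "minor")],
    [("Hallucination", "minor")],
    [],
]
buckets = severity_buckets(sims)
tags = tag_counts(sims)
print(dict(buckets))
print(dict(tags))
```

Because each simulation falls into exactly one bucket, the three bucket counts for a party sum to the number of simulations per condition, as they do in Tables 14 to 16. Tag counts carry no such constraint, since one simulation can carry several tags.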
