Evaluating Chinese Ambiguity Understanding in Large Language Models
Summary
This paper introduces CHA-Gen, a Chinese ambiguity dataset grounded in Potential Ambiguity theory, and evaluates several LLMs on ambiguity detection, finding that models struggle but benefit from chain-of-thought prompting, and that instruction tuning induces overconfidence.
View Cached Full Text
Cached at: 05/18/26, 06:33 AM
# Evaluating Chinese Ambiguity Understanding in Large Language Models
Source: [https://arxiv.org/html/2605.15635](https://arxiv.org/html/2605.15635)
utokyo\]organization=Graduate School of Information Science and Technology, The University of Tokyo,city=Tokyo, country=Japan
scut\]organization=School of Software Engineering, South China University of Technology,addressline=Guangzhou Higher Education Mega Centre, Panyu District, city=Guangzhou, postcode=510006, state=Guangdong, country=China
\\credit
Conceptualization, Methodology, Software, Data curation, Formal analysis, Writing – original draft, Visualization, Validation
\\credit
Methodology, Software, Data curation, Formal analysis, Writing – original draft, Visualization, Validation
\\credit
Methodology, Software, Data curation, Formal analysis, Writing – original draft, Visualization, Validation
\[orcid=0000\-0002\-2265\-756X\]\\creditConceptualization, Methodology, Formal analysis, Data curation, Supervision, Writing – review and editing, Project administration, Funding acquisition, Resources
\\cormark
\[2\]\\cortext\[2\]Corresponding author\. E\-mail: kexu@scut\.edu\.cn
\\credit
Supervision, Funding acquisition, Writing – review and editing, Resources
Yuanzhi Luseyzluk@mail\.scut\.edu\.cnYifang Xue202230482506@mail\.scut\.edu\.cnKe XuHideki Nakayamanakayama@ci\.i\.u\-tokyo\.ac\.jp
###### Abstract
Linguistic ambiguity is critical to the robustness of Large Language Models \(LLMs\), yet existing research focuses mostly on English, with limited attention devoted to Chinese\. Existing Chinese ambiguity datasets \(e\.g\., CHAmbi\) suffer from poor scalability\. Guided by Potential Ambiguity \(PA\) Theory, we design a semi\-automatic pipeline to construct CHA\-Gen\. It is the first PA Theory\-grounded Chinese ambiguity dataset, which comprises 5,712 sentences \(2,414 ambiguous, 3,298 unambiguous\) across 18 potential ambiguous structures\. Evaluating LLMs \(e\.g\. Gemma 3, Qwen 2\.5/3 series\) via direct querying and machine translation, we find that LLMs struggle with ambiguity detection \(improved by CoT prompting\)\. Analysis of Qwen3\-32B’s CoT rationales reveals three common failure modes: ambiguity blindness, misattribution, and premature resolution\. Uncertainty quantification with semantic entropy metric shows higher uncertainty for ambiguous sentences\. Moreover, instruction tuning induces overconfidence, whereas Base models better capture semantic diversity\. We further observe that models exhibit a bias toward dominant interpretations\. Our work provides a scalable approach for Chinese ambiguity corpus and insights into LLMs’ ambiguity handling, laying a foundation for enhancing Chinese ambiguity research in LLMs\.
###### keywords:
Linguistic Ambiguity\\sepLarge Language Model\\sepMachine Translation\\sepUncertainty Quantification
## 1Introduction
Linguistic ambiguity is a universal phenomenon across all human languages\(Piantadosi et al\.,[2012](https://arxiv.org/html/2605.15635#bib.bib21)\), referring to some sentences that exhibit more than one plausible interpretation\. Recognizing and interpreting such ambiguity is important because it helps us avoid misunderstandings and confusion in daily communication\. In recent years, Large Language Models \(LLMs\) have been developing rapidly and becoming popular in intelligent assistants\(Brown et al\.,[2020](https://arxiv.org/html/2605.15635#bib.bib3); Liu et al\.,[2024](https://arxiv.org/html/2605.15635#bib.bib14); Guo et al\.,[2025](https://arxiv.org/html/2605.15635#bib.bib7); Team et al\.,[2023](https://arxiv.org/html/2605.15635#bib.bib23)\)\. To improve the robustness and trustworthiness of these assistants, the ability to detect ambiguous sentences is essential for preventing losses caused by misunderstandings\(Bajceta et al\.,[2022](https://arxiv.org/html/2605.15635#bib.bib1); Min et al\.,[2020](https://arxiv.org/html/2605.15635#bib.bib19); Bhaskar et al\.,[2023](https://arxiv.org/html/2605.15635#bib.bib2); Wang et al\.,[2023](https://arxiv.org/html/2605.15635#bib.bib24)\)\.
While most existing research on linguistic ambiguity has focused on English\(Ortega\-Martín et al\.,[2023](https://arxiv.org/html/2605.15635#bib.bib20); Liu et al\.,[2023](https://arxiv.org/html/2605.15635#bib.bib15); Mehrabi et al\.,[2023](https://arxiv.org/html/2605.15635#bib.bib17); Kim et al\.,[2024](https://arxiv.org/html/2605.15635#bib.bib10)\), comparatively little attention has been paid to Chinese\. Due to its unique grammatical features, Chinese is generally considered as more context\-dependent than English\(Li,[2021](https://arxiv.org/html/2605.15635#bib.bib12)\)\. Consequently, in the absence of sufficient contextual cues, Chinese sentences or phrases are more prone to multiple plausible interpretations\. A relevant work is CHAmbi\(Zhang et al\.,[2024](https://arxiv.org/html/2605.15635#bib.bib28)\), which proposes a natural language inference \(NLI\) dataset containing Chinese ambiguous sentences\. However, most sentences in this dataset are collected from the Internet, rendering it non\-scalable\. Therefore, exploring methods to systematically generate or expand the corpus of ambiguous sentences is crucial for subsequent research and forms the central focus of this work\.
Potential Ambiguity Theory \(PA Theory\)\(Feng,[1989](https://arxiv.org/html/2605.15635#bib.bib5),[1995](https://arxiv.org/html/2605.15635#bib.bib6)\)posits that certain abstract syntactic configurations, referred to as*potential ambiguous structures*, may give rise to ambiguity upon instantiation when specific conditions are met\. For instance \(see Figure[1](https://arxiv.org/html/2605.15635#S1.F1)\), in English, the structureV \+ NP \+ PPis likely to lead to ambiguity when the postverbal PP can be interpreted either as modifying the NP or as an adjunct of the verb\. Specifically, the phrase “saw the child with a telescope” is ambiguous between two readings: one in which the PP modifies the NP \(the child who has a telescope\) and another in which it functions as an instrumental modifier of the verb \(using a telescope to see the child\)\. By contrast, “saw the child with a bag” is unambiguous, as it fails to trigger the instrumental interpretation\. In Chinese,Feng \([1995](https://arxiv.org/html/2605.15635#bib.bib6)\)summarizes 18 types of potential ambiguous structures along with their corresponding ambiguity\-triggering conditions\.
Figure 1:An example of a potential ambiguous structure in PA theory\.Guided by the PA Theory, we design a semi\-automatic pipeline\. The pipeline collects potential ambiguous structures together with their ambiguity\-triggering conditions, then generates and verifies instances via LLMs, followed by human validation\. The outcome is the high\-quality dataset CHA‑Gen, which contains a total of 5,712 sentences\. Among these, 2,414 are ambiguous, while 3,298 are unambiguous\. These instances span 18 distinct ambiguity structures\.
Another focus of this work is the systematic evaluation of Chinese ambiguity handling capabilities in LLMs\. Based on the proposed CHA\-Gen corpus and the existing CHAmbi dataset, we design and carry out assessment using our tailored evaluation methods: direct querying and machine translation\. The evaluated models include several general\-purpose large language models, including Gemma 3, Qwen 2\.5, and Qwen 3, across a range of model scales\. The main contributions of this work are summarized as follows:
1. 1\.We develop a semi\-automatic pipeline to construct a large corpus of Chinese sentences with structural ambiguity, CHA\-Gen111[https://github\.com/SpaJune/CHA\-Gen](https://github.com/SpaJune/CHA-Gen)\. To the best of our knowledge, CHA\-Gen is the first Chinese ambiguity dataset grounded in PA Theory\. It enriches available resources for research on Chinese linguistic ambiguity\.
2. 2\.We conduct ambiguity identification and comparison tasks to evaluate LLMs’ ability of ambiguity detection and discrimination\. LLMs still face challenges in such tasks, as shown in previous work\(Liu et al\.,[2023](https://arxiv.org/html/2605.15635#bib.bib15); Zhang et al\.,[2024](https://arxiv.org/html/2605.15635#bib.bib28)\), while chain\-of\-thought \(CoT\) prompting can significantly improve the performance in most cases\. We analyze their generated rationale and provide insights for future study\.
3. 3\.We investigate LLMs’ uncertainty in Chinese\-to\-English translation using semantic entropy as an uncertainty quantification metric\. The results show that LLMs have higher generation uncertainty for ambiguous than unambiguous sentences\. Moreover, Base models capture semantic diversity in certain ambiguous cases, whereas instruction tuning reduces uncertainty and causes overconfident predictions\. A further case study investigates how models are biased toward dominant interpretations\. These findings highlight potential limitations of instruction tuning and raise concerns about misunderstandings in cross\-lingual communication\.
The remainder of the article is organized as follows\. We review related work in Section[2](https://arxiv.org/html/2605.15635#S2)and introduce CHA\-Gen Corpus in Section[3](https://arxiv.org/html/2605.15635#S3)\. The details of the evaluation via direct querying and machine translation are presented in Section[4](https://arxiv.org/html/2605.15635#S4)and Section[5](https://arxiv.org/html/2605.15635#S5), respectively\. Section[6](https://arxiv.org/html/2605.15635#S6)summarizes the article\.
## 2Related Work
Research efforts have been devoted to constructing relevant datasets and benchmarks, as well as investigating how LLMs behave when confronted with ambiguous inputs\. We review related work on these topics\.
Datasets and benchmarks of ambiguity\.Previous research has released datasets and benchmarks focusing on visual ambiguity\(Rostamkhani et al\.,[2025](https://arxiv.org/html/2605.15635#bib.bib22); Chen et al\.,[2024](https://arxiv.org/html/2605.15635#bib.bib4)\)and vision\-based context ambiguity\(Ma et al\.,[2024](https://arxiv.org/html/2605.15635#bib.bib16); Wang et al\.,[2025](https://arxiv.org/html/2605.15635#bib.bib25)\)\.Wildenburg et al\. \([2024](https://arxiv.org/html/2605.15635#bib.bib26)\)introduce the DUST dataset, which comprises underspecified sentences, and conduct experiments to assess whether LLMs can detect syntactic ambiguity based on perplexity\. However, although significant progress has been made in disambiguation for English and the multimodal domain, available resources for Chinese remain relatively limited\. In the Chinese context,He et al\. \([2020](https://arxiv.org/html/2605.15635#bib.bib8)\)present a dataset that encompasses both lexical and syntactic ambiguities, using it to evaluate the commonsense reasoning abilities of LLMs\. More recently,Zhang et al\. \([2024](https://arxiv.org/html/2605.15635#bib.bib28)\)constructs CHAmbi, a fine\-grained Chinese benchmark specifically designed for ambiguity detection and resolution\. However, the sentences in CHAmbi are primarily collected from the Internet, which limits the scalability and practical application\. Different from CHAmbi, we develop a novel process to construct ambiguous Chinese sentences/phrases based on potential ambiguous structures\. This way enhances the scalability of corpus construction while maintaining linguistic quality\.
Inspecting LLMs’ Behavior in Handling Ambiguity\.LLMs’ ambiguity\-handling abilities have been evaluated primarily through direct querying and behavior\-based analysis\.
The idea of direct querying is straightforward\. LLM models are explicitly prompted to identify whether a given sentence is ambiguous\(Zhang et al\.,[2024](https://arxiv.org/html/2605.15635#bib.bib28); Ortega\-Martín et al\.,[2023](https://arxiv.org/html/2605.15635#bib.bib20)\)\. In particular,Wu et al\. \([2025](https://arxiv.org/html/2605.15635#bib.bib27)\)conducts a comprehensive study of different prompting strategies\. Another line of direct querying research adopts a comparative method\(Wildenburg et al\.,[2024](https://arxiv.org/html/2605.15635#bib.bib26)\), in which models are given pairs of ambiguous and unambiguous sentences to identify the more underspecified one\.
To comprehensively understand LLMs’ ability to understand ambiguity, examining how they handle ambiguous sentences across different downstream tasks is also important\.Liu et al\. \([2023](https://arxiv.org/html/2605.15635#bib.bib15)\); Zhang et al\. \([2024](https://arxiv.org/html/2605.15635#bib.bib28)\)adopt multilabel NLI tasks in order to see whether LLMs can be aware of the uncertainty of ambiguous inputs\. In addition,Liu et al\. \([2023](https://arxiv.org/html/2605.15635#bib.bib15)\)evaluates ambiguity awareness in language models by comparing continuation distributions conditioned on ambiguous sentences and their disambiguations\. Using sampled textual continuations, they assess whether continuations from valid interpretations are less surprising under the ambiguous context than those from distractor contexts\. However, these experiments require a predefined enumeration of multiple plausible disambiguations\.
Mehrparvar and Pezzelle \([2024](https://arxiv.org/html/2605.15635#bib.bib18)\)employs machine translation to detect sentence ambiguity by measuring discrepancies between back\-translations and the original sentences in representation space\. The underlying intuition is that an ambiguous sentence has multiple interpretations and thus leads to varying plausible translations with different semantics\. Inspired by this insight, we explore LLMs’ translation uncertainty behavior between ambiguous and unambiguous sentences\. This allows us to study ambiguity\-related uncertainty without explicitly enumerating alternative hypotheses, while also revealing which readings LLMs favor when ambiguity is left underspecified\.
In summary, to enable a more comprehensive evaluation, we adopt both the direct querying \(including the comparative method\) and the behavior\-based analysis with machine translation\. It is worth noting that machine translation is employed to probe LLMs’ uncertainty behavior in this work rather than to detect ambiguous sentences\. We also believe that, compared to direct querying, translation tasks facilitate a deeper understanding of LLMs’ implicit interpretations of ambiguous inputs\.
## 3CHA\-Gen Corpus
To systematically investigate the ability of LLMs in Chinese ambiguity identification and understanding, it is essential to construct a corpus that is large\-scale, well\-categorized, and finely annotated\. However, existing resources remain insufficient for this purpose\. To fill this gap, this paper presents the Chinese Ambiguity\-Generation \(CHA\-Gen\) corpus\. The following subsections elaborate on its construction process and corpus composition\.
Figure 2:The Pipeline for CHA\-Gen Corpus Construction### 3\.1Corpus Construction
We devise a semi\-automatic pipeline to construct a dedicated Chinese ambiguity corpus\. The procedure involves three key stages: preliminary generation, automatic verification and manual annotation, as shown in Figure[2](https://arxiv.org/html/2605.15635#S3.F2)\.
Table 1:The number of ambiguous/unambiguous sentences in CHA\-Gen\. Abbreviations: VP \(Verb Phrase\), NP \(Noun Phrase\), V \(Verb\), N \(Noun\), ADJ \(Adjective\), Q \(Quantifier\) and PREP \(Preposition\)\.No\.Potential Ambiguous StructureAmbiguousUnambiguous1VP\+的是\+NP3562342Q\+NP1\+的\+NP22433623N1\+N2\+N32205454ADJ\+N1\+N272505V1\+V2\+NP1551326N\+V49407N1\+的\+N2\+和\+N3471038N\+V\+NP\+AP1632549全部\+VP\+的\+NP30713010N1\+N2252211VP\+ADJ\+的\+N616712N1\+和\+N2\+的\+N36910213VP\+Q\+NP3365314V\+ADJ\+N5013715VP1\+VP2\+的\+N15816316V\+N292117VP\+N1\+的\+N211411318PREP\+N1\+的\+N2263170Total2,4143,298Preliminary Generation\.This stage aims to establish a typology of ambiguous structures and generate an initial, scalable corpus\. Following the PA Theory, we first identify 18 types of potential ambiguous structures\. For each type, a small set of ambiguous sentences is manually constructed by combining brainstorming with the collection of examples from online resources\. Building upon these human\-curated instances, we then leverage the pattern\-generation capabilities of LLMs\(Brown et al\.,[2020](https://arxiv.org/html/2605.15635#bib.bib3)\)to produce a large number of sentences with analogous structures and potential ambiguity\. Specifically, to enhance generation efficiency and diversity, we design two types of prompts:
- •Structure\-based Prompts\.These prompts instruct the LLMs to generate syntactically ambiguous sentences strictly according to potential ambiguous structures\.
- •Syntax\-based Prompts\.These prompts leverage syntactic flexibility and contextual constraints to guide LLMs in constructing ambiguous sentences\.
Note that the former prompts adhere to defined structural rules, whereas the latter prompts rely on syntactic relationships\. More details can be found in Appendix[A\.1](https://arxiv.org/html/2605.15635#A1.SS1)\.
Automatic Verification\.Since automatically generated sentences may deviate from the target ambiguous structures, this stage aims to automatically verify their structural consistency\. We leverage HanLP222https://www\.hanlp\.com/for Chinese word segmentation, Part\-of\-Speech \(POS\) tagging and Dependency Parsing \(DP\), supplemented by DeepSeek\-R1\(Guo et al\.,[2025](https://arxiv.org/html/2605.15635#bib.bib7)\)for double\-checking\. The verification consists of three folds:
- •POS\-based Filtering\.Sentences showing significant discrepancies in word count or POS tag distribution against the predefined ambiguous structure are filtered\.
- •DP\-based Ambiguity Check\.For the remaining sentences, we insert specific words and analyze whether the root node of their dependency tree changes\. Sentences with altered root nodes are classified as ambiguous, others as unambiguous\.
- •LLM‑based Double‑Check\.Filtered sentences, along with their corresponding syntactic structures and explanations of potential ambiguity, are provided to DeepSeek\-R1\. The model is prompted to determine whether ambiguity exists and to provide corresponding reasoning \(see Appendix[A\.2](https://arxiv.org/html/2605.15635#A1.SS2)for details\)\. To improve efficiency in eliciting well\-structured evaluations from the LLM, we adopt the LLM\-Eval evaluation method\(Lin and Chen,[2023](https://arxiv.org/html/2605.15635#bib.bib13)\)\.
Manual Annotation\.In this stage, the instances labeled as Ambiguous by DeepSeek\-R1 are further manually reviewed\. Two annotators independently annotate the filtered sentences and conduct cross\-validation\. To improve annotation quality, the annotators inspect and refine the rationales for ambiguity or non\-ambiguity \(initially provided by the model or in draft form\), ensuring descriptions are accurate and consistent\.
### 3\.2Corpus Composition and Statistics
Through the aforementioned construction pipeline, we obtain the Chinese Ambiguity\-Generation \(CHA\-Gen\) corpus\. This corpus contains 2,414 ambiguous and 3,298 unambiguous sentences, covering 18 structural ambiguity types\. For each ambiguous structure, the corpus includes both ambiguous instances that satisfy the ambiguity\-triggering conditions and unambiguous counterparts that share similar surface forms but admit a single interpretation\. The overall statistics are summarized in Table[1](https://arxiv.org/html/2605.15635#S3.T1)\. Each ambiguous sentence in CHA\-Gen is associated with its specific ambiguous structure and the corresponding linguistic explanation\. More details please refer to Appendix[A\.3](https://arxiv.org/html/2605.15635#A1.SS3)\.
## 4Evaluation of LLMs via Direct Querying
We adopt the direct querying method to evaluate the ability of LLMs for ambiguity detection and discrimination\. Following the experimental settings of related work\(Zhang et al\.,[2024](https://arxiv.org/html/2605.15635#bib.bib28); Wildenburg et al\.,[2024](https://arxiv.org/html/2605.15635#bib.bib26)\), we design two query tasks: ambiguity identification and ambiguity comparison\. To enhance the reliability and generality of evaluation, we also incorporate the CHAmbi dataset for validation\.
Table 2:Prompts for LLMs\. Chinese prompts are used in our experiments, and English versions are provided for reference\.Prompt StrategiesPromptDirect Prompt for Identification Task请判断下面的句子/词组是否具有歧义。只回答“是”或“否”。句子:\{sentence\}Please determine whether the following sentence/phrase is ambiguous\. Answer only "Yes" or "No"\.Sentence:\{sentence\}Direct Prompt for Comparison Task下面有两个句子/词组,请判断哪个更容易引起歧义。只回答“1”或“2”。句子1:\{sent\_1\}句子2:\{sent\_2\}Below are two sentences/phrases\. Please determine which one is more likely to cause ambiguity\. Answer only "1" or "2"\.Sentence 1:\{sent\_1\}Sentence 2:\{sent\_2\}CoT Prompt for Identification Task请判断下面的句子/词组是否具有歧义。请逐步思考,写出你的思考过程,在最后一行单独写出结论“是”或“否”。句子:\{sentence\}Please determine whether the following sentence/phrase is ambiguous\. Please reason step by step and write out your reasoning process\. In the final line, provide the conclusion only: "Yes" or "No"\.Sentence:\{sentence\}CoT Prompt for Comparison Task下面有两个句子/词组,请判断哪个更容易引起歧义。请逐步思考,写出你的思考过程,在最后一行单独写出结论“1”或“2”。句子1:\{sent\_1\}句子2:\{sent\_2\}Below are two sentences/phrases\. Please determine which one is more likely to cause ambiguity\. Please reason step by step and write out your reasoning process\. In the final line, provide the conclusion only: "1" or "2"\.Sentence 1:\{sent\_1\}Sentence 2:\{sent\_2\}
### 4\.1Evaluation Setup
The formal settings and evaluation procedures for the query tasks are specified below\.
Query Task 1: Ambiguity Identification\.Given a single input sentence, LLMs are required to determine whether the sentence is ambiguous\. We employ two prompting strategies \(see Table[2](https://arxiv.org/html/2605.15635#S4.T2)\): a Direct Prompt that instructs the model to output only “Yes” or “No”, and a Chain\-of\-Thought \(CoT\) Prompt that elicits step\-by\-step reasoning before a final “Yes” or “No” decision\. We record the generation probabilities of 是 \(Yes\) and 否 \(No\) at the first token position\. The prediction is the option with the highest probability, and we evaluate the performance using Accuracy and Macro\-F1 metrics\. The statistics of the benchmark datasets are summarized in Table[3\(a\)](https://arxiv.org/html/2605.15635#S4.T3.st1)\.
Query Task 2: Ambiguity Comparison\.Given a pair of sentences, LLMs are required to identify which one is relatively more ambiguous\. We employ two prompting strategies \(see Table[2](https://arxiv.org/html/2605.15635#S4.T2)\): a Direct Prompt that instructs the model to output “1” or “2” to indicate the more ambiguous sentence, and a CoT Prompt that guides it to conduct step\-by\-step reasoning before concluding with “1” or “2”\. We construct the sentence pairs as follows: For each ambiguous sentence in the CHA\-Gen dataset, we select a corresponding unambiguous counterpart based on surface similarity\. For the CHAmbi dataset, we directly use its provided unambiguous counterpart to construct sentence pairs\. In addition, to mitigate order effectsZheng et al\. \([2024](https://arxiv.org/html/2605.15635#bib.bib29)\), each sentence pair is queried twice with their positions swapped\. This ensures that options “1” and “2” appear equally in the comparison task, effectively balancing the number of ambiguous and unambiguous sentences\. Evaluation performance is also evaluated using Accuracy and Macro‑F1\. The statistics of the benchmark datasets are presented in Table[3\(b\)](https://arxiv.org/html/2605.15635#S4.T3.st2)\.
Table 3:Statistics of benchmark datasets\(a\)Ambiguity identification taskCHA\-GenCHAmbi\# Amb\. Sent2,414893\# Unamb\. Sent3,2981,784Total5,7122,677Avg\. Len\.7\.3819\.06
\(b\)Ambiguity comparison taskCHA\-GenCHAmbi\# Pairs2,4141,784\# Unique Amb\. Sent2,414877\# Unique Unamb\. Sent8561,772Avg\. Len\.7\.5518\.42
Table 4:Results on ambiguity identification and ambiguity comparison tasks across two datasets\.ModelAmbiguity Identification TaskAmbiguity Comparison TaskCHA\-GenCHAmbiCHA\-GenCHAmbiAcc↑\\uparrowMacro\-F1↑\\uparrowAcc↑\\uparrowMacro\-F1↑\\uparrowAcc↑\\uparrowMacro\-F1↑\\uparrowAcc↑\\uparrowMacro\-F1↑\\uparrowGemma3\-1B\-IT0\.52850\.52600\.34630\.29760\.50580\.42270\.26180\.2603Gemma3\-4B\-IT0\.42510\.30280\.35260\.32470\.55010\.55010\.42710\.3372Gemma3\-12B\-IT0\.45400\.36590\.34070\.32100\.51640\.41560\.42680\.3834Gemma3\-27B\-IT0\.50110\.45600\.36350\.36320\.53790\.50110\.42850\.4234Qwen2\.5\-3B\-Instruct0\.58260\.39480\.63620\.39650\.54140\.53850\.56050\.5603Qwen2\.5\-7B\-Instruct0\.55320\.50550\.60810\.44540\.51510\.38760\.38140\.3779Qwen2\.5\-14B\-Instruct0\.62660\.59060\.55700\.48020\.52170\.42350\.62440\.6105Qwen2\.5\-32B\-Instruct0\.50040\.45360\.45310\.44850\.53790\.49450\.50480\.4946Qwen3\-4B0\.42650\.30480\.35860\.35760\.55430\.55420\.43670\.3739Qwen3\-8B0\.47720\.43060\.42250\.41400\.54850\.53550\.41200\.3585Qwen3\-14B0\.47930\.41490\.42850\.42210\.55800\.49630\.44900\.4361Qwen3\-32B0\.51580\.48190\.49910\.47460\.54660\.48570\.38760\.3876
\(a\)Accuracy and Macro\-F1 in ambiguity identification task\.
\(b\)Accuracy and Macro\-F1 in ambiguity comparison task\.
Figure 3:Overall performance comparison under various prompts and thinking modes\.
### 4\.2Evaluation Results and Discussion
The evaluation results are presented in Table[4](https://arxiv.org/html/2605.15635#S4.T4)\. In the ambiguity identification task, Qwen2\.5\-14B\-Instruct achieves the highest Macro\-F1 score among all evaluated models across both datasets\. In the ambiguity comparison task on CHA\-Gen, Gemma3\-4B\-IT and Qwen3\-4B achieve the best Macro\-F1, followed by Qwen2\.5\-3B\-Instruct and Qwen3\-8B\. On CHAmbi, Qwen2\.5\-14B\-Instruct again demonstrates higher performance, confirming its robustness across both tasks\.
Notably, the overall performance across all models and datasets is relatively low \(around 0\.5\), with Accuracy and Macro\-F1 scores often dropping even further for most dataset\-model combinations\. This indicates that many models perform at a level comparable to random guessing on the proposed tasks\. Importantly, factors \(larger model sizes or higher version iterations\) typically associated with enhanced general language capabilities, such as Qwen2\.5→\\rightarrowQwen3, do not necessarily translate to better performance, highlighting the unique challenges of ambiguity detection and discrimination\.
Effects of Explicit Reasoning and CoT Prompt\.This part aims to evaluate how explicit reasoning and CoT Prompt shape model behavior and performance\. To implement the explicit reasoning setting, we take advantage of the configurable thinking mode in Qwen3, which can be flexibly enabled or disabled\. The performance changes induced on both tasks are reported separately in Figure[3\(a\)](https://arxiv.org/html/2605.15635#S4.F3.sf1)and Figure[3\(b\)](https://arxiv.org/html/2605.15635#S4.F3.sf2)\. For the ambiguity identification task, most models achieve higher Macro\-F1 under CoT prompt, albeit with a concurrent decrease in overall Accuracy\. This observation suggests that CoT prompt encourages models to produce more balanced predictions\. For the ambiguity comparison task, where ground\-truth labels \(1 and 2\) are balanced, Accuracy and Macro\-F1 exhibit consistent trends under CoT prompting, further validating the effectiveness of explicit reasoning for ambiguity judgments\.
Answer Distributions\.To delve into the observed performance trends, we analyze the answer distribution of each model, which is visualized in Figure[4](https://arxiv.org/html/2605.15635#S4.F4)\. We observe that many models exhibit strong biases towards specific answers \(e\.g\., consistently favoring “Yes” or “1”\) in the absence of explicit reasoning and CoT prompts\. This phenomenon is particularly pronounced in Gemma3\-based models\. Such systematic biases, independent of the actual input content, indicate that many models struggle to perform reliable single\-token judgments\. In contrast, when reasoning is enabled or a CoT prompt is applied, answer distributions become more balanced in most cases, suggesting that LLMs effectively leverage relevant domain knowledge through explicit reasoning\.
Figure 4:The answer distributions of CHAmbi and CHA\-Gen in ambiguity identification and comparison tasks, with the dashed line indicating the optimal balance point\.Overall, forming such judgments via single\-word responses poses inherent difficulties for modern LLMs\. Directly instructing LLMs to provide a one\-word decision for ambiguity assessment, as done in prior work\(Zhang et al\.,[2024](https://arxiv.org/html/2605.15635#bib.bib28)\), may not constitute a reliable evaluation protocol and can underestimate their true capabilities\. Nevertheless, our results show that ambiguity detection and discrimination remain highly challenging even when explicit reasoning is enabled\. This observation motivates us to further analyze the models’ generated rationales to gain deeper insights into their ambiguity detection processes\.
### 4\.3Further Analysis of Rationales in CoT Prompting
We further examine the rationales generated by Qwen3\-32B to gain a deeper understanding\. Several examples from CHA\-Gen are provided in Table[5](https://arxiv.org/html/2605.15635#S4.T5)\. The typical failure cases are summarized as follows:
- •Ambiguity Blindness\.The most typical failure stems from an insufficient sensitivity to potential sources of ambiguity, where the model fails to recognize latent ambiguities in the sentence\. In this case, the model prematurely concludes that the sentence is unambiguous, without sufficiently exploring alternative interpretations \(See sample 5 in Table[5](https://arxiv.org/html/2605.15635#S4.T5)\)\.
- •Ambiguity Misattribution\.Another failure arises from a mismatch between the model’s interpretation of ambiguity and that of humans\. In such cases, the model correctly identifies certain linguistic phenomena \(e\.g\., subject omission or nominalization\) but incorrectly treats them as sufficient evidence of ambiguity\. As a result, it reasons in an inappropriate direction, constructing multiple implausible interpretations and leading to misguided judgments, which frequently happens even when the final prediction turns out to be correct \(See Samples 1–3 in Table[5](https://arxiv.org/html/2605.15635#S4.T5)\)\. This highlights a fundamental discrepancy between human and LLMs in their conceptualization of ambiguity\.
- •Premature Ambiguity Resolution\.In some cases, Qwen3\-32B correctly identifies the presence of ambiguity and generates multiple plausible interpretations\. However, before making the final decision, it erroneously dismisses one interpretation as less natural or less likely in real\-world usage, thereby prematurely collapsing the ambiguity and producing an incorrect, unambiguous judgment \(See Sample 4 in Table[5](https://arxiv.org/html/2605.15635#S4.T5)\)\.
Based on these observations, improving LLM performance in recognizing ambiguous sentences requires enhanced linguistic understanding and more robust reasoning capabilities\. Deeper, more systematic reasoning may help models better perceive and maintain multiple interpretations\. Additionally, addressing Ambiguity Misattribution and Premature Ambiguity Resolution may benefit from a more precise and nuanced definition of “ambiguity,” aligning model reasoning more closely with human judgments\.
Table 5:Examples of incorrect CoT rationales in the ambiguity detection task\. The shown rationales are translated into English\.Sample IDSentencesModel PredictionGenerated Rationale \(Excerpt\)Error1Sent1: 信任的是少数几位朋友。*\(The entity being trusted / trust \(sth\.\) is a few friends\.\)*Sent2: 相信的是科学证据。*\(Belief is placed in scientific evidence\.\)*Qwen3\-32B: Sent2 is more ambiguous\. \(Incorrect\)“Scientific evidence” is a relatively abstract and broad concept, and different individuals may interpret it differently \[…\] prone to varying interpretations among different people, leaving considerable room for misinterpretation\.Ambiguity Misattribution: Lexical2Sent1: 关注的是他们。*\(The ones being focused on/ focus on \(sth\.\) are them\. \)*Sent2: 被期待的是他们。*\(The ones being expected are them\. \)*Qwen3\-32B: Sent2 is more ambiguous \(Incorrect\)The verb “expect” carries emotional connotations, while the nominalized structure “the expected \(one\)” easily prompts questions like “Who expects?” and “What is expected?”, thereby increasing the likelihood of ambiguity\.Ambiguity Misattribution: Pragmatic3Sent1: 喜欢的是小明。*\(The one being liked/ like \(sth\.\) is Xiaoming\. \)*Sent2: 喜欢的电影类型是喜剧。*\(\(My\) Favorite movie genre is comedy\.\)*Qwen3\-32B: Sent1 is more ambiguous\. \(Correct\)There are several possible interpretations, such as: \- I like Xiaoming\. \- He likes Xiaoming\. \- Everyone likes Xiaoming\. \[…\] \*\*Sentence 1\*\* Because the subject is missing, it can easily lead to misunderstanding about “who likes Xiaoming,” thus \*\*making it more prone to ambiguity\*\*\.Ambiguity Misattribution: Referential4Sentence: 全部损坏的机器设备*\(All damaged machinery and equipment\)**\(Completely damaged machinery and equipment\(s\)\)*Qwen3\-32B: The sentence is unambiguous\. \(Incorrect\)\[…\]\- Interpretation 1: “All” modifies “damaged machinery and equipment,” meaning “all machinery and equipment are damaged\.”\- Interpretation 2: “All” modifies “damaged,” indicating “damaged to a very thorough extent,” i\.e\., “the machinery and equipment are completely destroyed\.”\[…\]While two interpretations exist, Chinese grammatical conventions suggest “all” more likely modifies the entire “damaged machinery and equipment,” meaning “all damaged machinery and equipment\.”\[…\]Conclusion: NoPremature Ambiguity Resolution5Sentence: 照顾花园的园丁*\(The gardener who tends the garden\)**\(Take care of the gardener of the garden\)*Qwen3\-32B: The sentence is unambiguous\. \(Incorrect\)\[…\] 1\. Sentence Structure Analysis:\- The subject is “gardener\.”\- “Tends the garden” is a modifying phrase describing the gardener’s duties or characteristics\.\[…\] Here, “tending the garden” modifies “gardener” with no alternative possible interpretation\. \[…\] No alternative interpretations exist\.Ambiguity Blindness
## 5Evaluation of LLMs via Machine Translation
Simply querying LLMs to recognize ambiguity is insufficient to assess whether they really perceive an input as ambiguous\. Instead, it is necessary to examine models’ behavior when confronted with ambiguous inputs, specifically whether they can exhibit uncertainty\. We focus on the Chinese\-to\-English translation task\. Ambiguous source sentences may admit multiple plausible interpretations\. Accordingly, we hypothesize that ambiguous sentences give rise to multiple valid translations with greater semantic diversity\. Our goal is to measure each model’s sensitivity to ambiguity by quantifying the uncertainty reflected in its subsequent outputs\.
### 5\.1Evaluation Setup
For each sentence, we prompt the models to sample 50 translations, forming a set of translationsTT\. The instructions used for sampling translations are provided in Appendix[B](https://arxiv.org/html/2605.15635#A2)\. We assess the semantic uncertainty ofTTto determine whether it can cover diverse meanings using uncertainty quantification \(UQ\)\. Concretely, the semantic entropy proposed by\(Kuhn et al\.,[2023](https://arxiv.org/html/2605.15635#bib.bib11)\)is adopted to measure uncertainty\. Formally, givenTT, we cluster these translations into semantic setsCiC\_\{i\}, each with different meanings\. Based on the frequency of each semantic set, we can compute the semantic entropy ofTT:
Ent\(T\)=\-∑i=0npilogpi,Ent\(T\)=\\text\{\-\}\\sum\_\{i=0\}^\{n\}p\_\{i\}\\log p\_\{i\},\(1\)wherepi=\|Ci\|∑in\|Ci\|p\_\{i\}=\\frac\{\|C\_\{i\}\|\}\{\\sum\_\{i\}^\{n\}\|C\_\{i\}\|\}andnnis the number of semantic sets decided by the clustering algorithm\. Specifically, we adopt the Agglomerative Clustering \(AC\) algorithm\. Its input is derived from the paraphrase identification model, which assesses the semantic equivalence of all translation pairs\. Below, we elaborate on the paraphrase identification model and implementation details\.
Paraphrase Identification Model\.Previous work typically relies on NLI models to judge the equivalence of two sentences\. However, based on our observations, these models don’t work well in our cases\. Different translations often differ only in surface form, with subtle changes that may alter meaning, such as “Fully enclosed space” vs\. “All enclosed space”\. LLMs, like ChatGPT, handle these fine\-grained distinctions more reliably due to their stronger language understanding\. Nevertheless, directly applying LLMs to evaluate large numbers of sentence pairs is prohibitively expensive\. To balance accuracy and efficiency, we adopt a distillation\-based strategy\. In more detail, we sample 200K pairs from all the generated translation sentences and instruct GPT\-4\.1 to judge their semantic equivalence\. These LLM\-generated labels are then used to fine\-tune an NLI model into a paraphrase identification model\. Note that, different from standard NLI setups, we discard entailment relations and frame the task as a binary classification problem with only two labels: equivalent and not equivalent\.
Implementation Details\.For each source sentence, we first aggregate outputs from all translation models to form a unified translation set\. We then apply the fine\-tuned paraphrase identification model to assess semantic equivalence for all sentence pairs within this set\. Using these pairwise judgments, we employ the AC algorithm to group candidate translations into semantic clusters\. Finally, based on these clusters, we compute the semantic entropy of each model’s translation set using Equation[1](https://arxiv.org/html/2605.15635#S5.E1)\.
We evaluate models from the Qwen2\.5 and Qwen3 families, including both Base and Instruct variants\. In addition to semantic entropy \(Ent\), the average of mean Quality Estimation scores333Checkpoint: Unbabel/wmt23\-cometkiwi\-da\-xl\(Avg QE\) across all translations in each set, and the average of the best QE score \(Max QE\) from each translation set are employed as metrics\. Moreover, we use the same ambiguous/unambiguous paired sentences employed for ambiguity comparison in Section[4\.1](https://arxiv.org/html/2605.15635#S4.SS1)\.
### 5\.2Evaluation Results and Discussion
Translation quality and semantic uncertainty are reported in Table[6](https://arxiv.org/html/2605.15635#S5.T6)\. None of the evaluated models exhibits a substantial difference in overall translation quality, ensuring that the subsequent analyses are not biased by extremely low\-quality translations\. Base models tend to achieve higher Max QE scores on the best translation within each set, but exhibit lower Avg QE scores across all translations than their Instruct counterparts\. This phenomenon aligns with expectations: Base models produce less controllable outputs and thus higher diversity, which is also reflected in their higher semantic entropy values\.
Table 6:Translation quality and semantic uncertainty on CHA\-Gen and CHAmbi\. Avg QE denotes the mean quality across all sampled translations; Max QE denotes the best translation per set; Ent denotes semantic entropy \(higher indicates greater semantic diversity\)\.CHA\-GenCHAmbiFamilyModelAvg QEMax QEEntAvg QEMax QEEntQwen2\.57B\-Base0\.65310\.76981\.52780\.72810\.79930\.75847B\-Instruct0\.66650\.73390\.42410\.74250\.78340\.201214B\-Base0\.64930\.76961\.48220\.73020\.79960\.647114B\-Instruct0\.67920\.73380\.34390\.74510\.78080\.156732B\-Base0\.65730\.76821\.16530\.73410\.79940\.557932B\-Instruct0\.68650\.73160\.29590\.74940\.77950\.1393Qwen38B\-Base0\.65760\.76721\.25320\.73580\.79900\.57858B\-Instruct0\.69060\.73990\.38750\.75620\.78360\.155714B\-Base0\.66590\.76921\.14160\.73970\.79980\.547414B\-Instruct0\.69460\.73520\.29260\.76030\.78190\.119832B\-Instruct0\.69060\.74830\.46460\.75940\.78820\.1805Ambiguous vs Unambiguous Sentences\.We compare the average semantic entropy of translations generated for ambiguous and unambiguous sentences, as reported in Table[7](https://arxiv.org/html/2605.15635#S5.T7)\. Across both datasets and all models, translations of ambiguous sentences consistently exhibit higher semantic entropy than those of unambiguous sentences, supporting our hypothesis\. However, the magnitude of this gap varies across models\. For example, Base models generally yield a larger entropy gap between ambiguous and unambiguous sentences than their instruction\-tuned counterparts\. This suggests that instruction tuning reduces generation uncertainty, which in turn narrows the entropy gap and induces translation bias\.
Table 7:Semantic entropy comparison on translations of ambiguous/unambiguous sets\.Aggregatedenotes the semantic entropy on gathering all models’ outputs for each source sentence\.CHA\-GenCHAmbiFamilyModelAmb\. SetUnamb\. SetGapAmb\. SetUnamb\. SetGapQwen2\.57B\-Base1\.60611\.30710\.29900\.80970\.73290\.076814B\-Base1\.54731\.29860\.24870\.68000\.63080\.049232B\-Base1\.22700\.99140\.23560\.62400\.52520\.09887B\-Instruct0\.43980\.37990\.05990\.23120\.18630\.044914B\-Instruct0\.36140\.29480\.06660\.16720\.15150\.015732B\-Instruct0\.31300\.24790\.06510\.16890\.12460\.0443Qwen38B\-Base1\.31781\.07100\.24680\.63880\.54860\.090214B\-Base1\.19770\.98330\.21440\.63280\.50500\.12788B\-Instruct0\.41440\.31140\.10300\.17980\.14370\.036114B\-Instruct0\.31040\.24230\.06810\.15450\.10260\.051932B\-Instruct0\.50140\.36100\.14040\.21730\.16220\.0511Aggregate1\.29231\.01340\.27890\.66160\.53040\.1312Category\-wise Analysis in CHAmbi\.We analyze the fraction of \(ambiguous, unambiguous\) sentence pairs in CHAmbi for which the ambiguous sentence exhibits higher semantic entropy in its translation set than its unambiguous counterpart, as illustrated in Figure[5](https://arxiv.org/html/2605.15635#S5.F5)\. Among all ambiguity categories described in CHAmbi, incompleteness yields the lowest fraction\. This is likely because incompleteness stems from missing information, which induces vagueness rather than supporting multiple well\-defined interpretations\. The second\-lowest fraction is observed for coreference ambiguities\. Such ambiguities can often be translated directly into English without resolving pronouns, thereby preserving the original ambiguity and reducing divergence between ambiguous and unambiguous cases\. Overall, Base models consistently achieve higher fractions than instruction\-tuned variants\. This aligns with the trends observed in the previous experiment\.
Figure 5:Fraction of CHAmbi sentence pairs where ambiguous sentences exhibit higher semantic entropy in their translation sets than their unambiguous counterparts, broken down by ambiguity types\.Structure\-sensitive Analysis in CHA\-Gen\.We apply a similar analysis to CHA\-Gen, with results shown in Figure[6](https://arxiv.org/html/2605.15635#S5.F6)\. Consistent with the CHAmbi results, instruction\-tuned models exhibit smaller differences in semantic entropy between ambiguous and unambiguous sentences compared to Base models\. Several structural patterns show a high fraction of cases where ambiguous sentences have higher semantic entropy, including “N1\+N2,” “VP\+ADJ\+N,” “VP\+N1\+的\+N2,” and “VP\+的是\+NP\.” This suggests that LLMs are relatively more sensitive to ambiguity arising from these syntactic structures\. That is, when inputs involve specific ambiguous syntactic structures, the “confusion” and “hesitation” regarding multiple possible interpretations of the LLM are more strongly reflected as output instability\.
Figure 6:Fraction of CHA\-Gen sentence pairs for which ambiguous sentences exhibit higher semantic entropy in their translation sets than their unambiguous counterparts, broken down by ambiguity structures\.Table 8:Distribution for different interpretations over the model’s generated translation sets of ambiguous Chinese sentences \(50 samples per model\)\. For each source sentence, we show two major translation clusters and include one representative translation per cluster\. Percentages do not sum to 100% because translations belonging to other clusters or judged as noisy or low quality are excluded\.Source sentenceTranslationsQwen3\-8BQwen3\-14BQwen2\.5\-7BQwen2\.5\-14BBaseInstructBaseInstructBaseInstructBaseInstruct保护主人的猫。A: The cat that protects its owner\.36%100%4%0%40%100%4%0%B: Protect the owner’s cat\.46%0%86%100%42%0%86%100%全部密封的包裹。A: Fully enclosed package\.62%100%24%96%18%0%52%100%B: All sealed bagages\.26%0%70%4%60%76%38%0%保护的是弱者。A: It protects the weak\.100%100%100%100%92%98%98%100%B: It’s the weak that protect it\.0%0%0%0%0%0%0%0%帮助的是那个老人。A: It is the old man who is getting help\.34%0%22%0%36%22%68%100%B: The one who helped was the old man\.46%100%62%100%12%74%14%0%帮助的是弱者。A: It is the weaks who are helped\.56%78%62%92%44%70%46%42%B: Those who help are the weak\.14%6%10%8%14%4%4%0%
### 5\.3Case Study
We conduct this case study focusing on two targeted aspects\. The first aspect aims to compare the diversity of translations generated by Base models and their Instruct\-tuned counterparts, exploring how instruction tuning affects the variability of model outputs\. The second aspect focuses on investigating the discrepancies in how different models understand ambiguous linguistic structures, particularly examining the factors that lead models to adopt different interpretive paths for structurally similar or inherently ambiguous sentences\. Specifically, we examine several concrete examples to elaborate on these two aspects\.
Base vs Instruct\.The first two examples in Table[8](https://arxiv.org/html/2605.15635#S5.T8)illustrate differences in interpretation distributions across model variants\. In these cases, Base models exhibit greater translation uncertainty and diversity than their Instruct counterparts\. For instance, Qwen2\.5\-7B\-Base produces a nearly balanced distribution between Translation A and Translation B for “保护主人的猫。”, while Qwen2\.5\-14B\-Base shows a similar balance for “全部密封的包裹。”\. After instruction tuning, however, most models become strongly biased toward a single interpretation\. These observations suggest that the uncertainty inherent in ambiguous sentences is better preserved by Base models pretrained on large\-scale Internet corpora, whereas instruction tuning tends to suppress such uncertainty and promote deterministic interpretations\.
Dominant Interpretation Bias\.The last three example sentences in Table[8](https://arxiv.org/html/2605.15635#S5.T8)share the structure “VP \+ 的是 \+ NP”, such as “保护\(protect\)的是\(is\)弱者\(the weak\)”, “帮助\(help\)的是\(is\)那个老人\(that old man\)”, and “帮助\(help\)的是\(is\)弱者\(the weak\)”\. Such structures may induce ambiguity when the agent and patient roles of the action are unspecified\.
Despite their high syntactic similarity, models exhibit distinct interpretation behaviors across these three sentences\. Specifically, for sentences involving “弱者 \(the weak\)”, “保护的是弱者” and “帮助的是弱者”, models show a strong tendency to parse “弱者” as the patient of the action \(i\.e\., the recipient of protection or help\)\. This preference stems from the cognitive habit in daily language where “弱者” is often the action recipient\. “Protecting the weak” and “helping the weak” are far more frequent in real\-world corpora than alternative interpretations \(e\.g\., “the weak protecting others” or “the weak helping others”\)\.
This leads to two key findings\. First, even with identical syntactic structures, the dominant status of specific interpretations in common usage reduces a sentence’s perceived ambiguity\. From a formal semantic perspective, however, such sentences still retain theoretically multiple interpretive possibilities in specific contexts\. Second, the observed model bias indicates that LLMs tend to choose frequent or fixed interpretive paths rather than fully capturing and expressing the inherent ambiguous space of sentences\.
Furthermore, this behavioral pattern explains why not all theoretically ambiguous sentences in our dataset \(even in Base models\) exhibit obvious translation uncertainty\. In particular, sentences differ in degree of interpretive ambiguity, which stems from the frequency distribution of their possible interpretations in natural language\. This variation in interpretive ambiguity directly modulates models’ ability to recognize and characterize ambiguity\. Such over\-reliance on dominant interpretations undermines the robustness of LLMs in handling truly ambiguous sentences\.
This finding aligns with previous work\(Itzhak et al\.,[2024](https://arxiv.org/html/2605.15635#bib.bib9)\)showing that instruction tuning can introduce systematic cognitive biases in model behavior\. While instruction tuning substantially improves model usability and alignment with user expectations, it may unintentionally amplify preferences for dominant or high\-frequency interpretations, thereby suppressing alternative plausible readings\. In real\-world cross\-lingual communication, contexts are considerably more complex\. Therefore, ambiguities are often implicit and considerably harder to detect than those in our controlled experimental settings\. Consequently, such biases may result in misunderstandings or misrepresentations\.
On the other hand, our observation that Base models exhibit fewer systematic biases and produce more diverse outputs suggests a possible direction for mitigation\. Beyond instruction tuning alone, incorporating the diversity of Base model generations could serve as an auxiliary signal to retain interpretive multiplicity\.
## 6Conclusion
In this paper, we present CHA\-Gen, a novel corpus designed to enrich existing resources for Chinese ambiguity research\. Built on PA theory, it is systematically constructed via a pipeline integrating automated data curation and rigorous manual verification\. This design effectively achieves a balance between scalability and annotation accuracy\. The corpus comprises 2,414 ambiguous and 3,298 unambiguous sentences, covering 18 distinct ambiguous structures to support comprehensive LLM ambiguity evaluation\.
Evaluations are conducted through complementary approaches: direct querying and machine translation\. The former assesses LLMs’ ability to detect and discriminate ambiguity, while the latter probes their interpretive uncertainty\. Overall results demonstrate that Chinese ambiguity remains highly challenging for current LLMs\. In more detail, 1\) direct querying evaluation shows that better general LLM capabilities \(larger scale or higher versions\) do not guarantee improved ambiguity detection\. While CoT prompting balances predictions and reduces bias, rationales reveal three failure patterns: ambiguity blindness, misattribution, and premature resolution\. This study indicates a divergence from human judgment and provides insights for future improvement\. 2\) Machine translation evaluation demonstrates that specific ambiguity categories \(e\.g\., incompleteness, coreference\) and certain syntactic structures \(e\.g\., N1\+N2, VP\+ADJ\+N\) induce higher semantic entropy\. Instruction tuning tends to suppress output uncertainty, and a dominant interpretation bias persists across models\. This further highlights their limited ability to capture the full linguistic ambiguity and the concerns of misunderstanding in cross\-lingual communication\.
In conclusion, this study reveals the limitations of current LLMs in handling Chinese ambiguity, emphasizing the necessity of improving evaluation strategies and conducting targeted model optimizations\. We hope that the proposed dataset and estimation methods will support and accelerate progress in this research area\.
## 7Limitations
Our constructed corpus currently focuses on syntactic ambiguity, which allows for more controlled and consistent data generation\. While this enables focused analysis, a more comprehensive evaluation would benefit from incorporating additional types of ambiguity, such as polysemy, word segmentation, and other linguistic phenomena\. Another potential limitation concerns sentence length\. Our corpus primarily consists of relatively short utterances and phrases\. Although such examples support analysis across different levels of linguistic complexity, real\-world scenarios frequently involve longer and more syntactically complex sentences\. In future work, we plan to explore \(semi\-\)automatic strategies to generate and expand the corpus with more diverse and complex ambiguous instances toward a more holistic evaluation\.
## Appendix ADetails of Datasets
### A\.1Structure/syntax\-based Prompts for Corpus Construction
Structure\-based prompts and syntax\-based prompts for CHA\-Gen corpus construction described in Section[3\.1](https://arxiv.org/html/2605.15635#S3.SS1)are given in Table[9](https://arxiv.org/html/2605.15635#A1.T9)and Table[10](https://arxiv.org/html/2605.15635#A1.T10), respectively\.
Table 9:Examples of syntax\-based prompts for CHA\-Gen corpus construction\.Potential Ambiguous StructurePromptVP1\+VP2\+的\+N生成20句VP\+的\+N结构的中文句子,并在每个句子前分别加一个动词,使句子符合逻辑, 例如:看打球的学生。 词汇尽量丰富,生成的句子尽量不重复,生成的句子间用"。"结尾且进行编号如1\.\(Generate 20 sentences with the structure "VP \+ 的 \+ N," where a verb is added before each sentence to make it logically complete\. An example sentence is: "看打球的学生\(Students who watch ball games\\Watch students playing ball games\)"\. Each sentence is numbered and ends with a period\.\)N\+V\+NP\+AP生成20个的N1\+V\+N2的短语,并在后面加上一个简单形容词来又能形容N2和形容短语,即形成N1\+V\+N2\+ADJ的句子,使句子符合逻辑, 例如:张三笑李四很笨。 词汇尽量丰富,生成的句子尽量不重复,生成的句子间用"。"结尾且进行编号如1\.\(Generate 20 sentences with the structure "N1 \+ V \+ N2 \+ ADJ," where an adjective is added to describe both N2 and the entire phrase\. An example sentence is: "张三笑李四很笨\(Zhang San laughs at Li Si for being very foolish\\It is foolish that Zhang San laughs at Li Si\)"\. Each sentence is numbered and ends with a period\.\)
### A\.2Validation Prompt for Corpus Construction
Table 10:Examples of structure\-based prompts for CHA\-Gen corpus construction\.Potential Ambiguous StructurePromptN1\+N2\+N3生成10句N1\+N2\+N3结构的句子, 要求N1一定是名词,N3既能被N1修饰也能被N2修饰,整句话只有3个名词, 例子:进口飞机引擎。 生成的句子间用"。"结尾且进行编号如1\.\(Generate 10 sentences with the structure N1 \+ N2 \+ N3, where N1 is a noun, N3 can be modified by either N1 or N2, and each sentence contains exactly three nouns\. An example sentence is: "进口飞机引擎\(Imported aircraft engine\\Engine of imported aircraft\)"\. Each generated sentence is numbered and ends with a period\.\)VP\+的是\+NP生成10句VP\+的\+是\+NP结构的句子, 要求VP既能作主动,也能作被动,NP是有生命的物体, 例子:反对的是少数人。 生成的句子间用"。"结尾且进行编号如1\.\(Generate 10 sentences with the structure "VP \+ 的 \+ 是 \+ NP," where the VP \(verb phrase\) can function both actively and passively, and the NP \(noun phrase\) must be a living entity\. An example sentence is: "反对的是少数人\(Those who oppose are a minority\\A minority is being opposed\)"\. Each generated sentence is numbered and ends with a period\.\)
Table[11](https://arxiv.org/html/2605.15635#A1.T11)gives the validation prompt and schema used in CHA\-Gen corpus construction\.
Table 11:Prompt Template for Ambiguity Validation请你根据输入的输入的歧义结构“ambType”和歧义理由“ambReason”判断“source”这个句子是否具有歧义,你的输出需要包含以下信息:ambiguity:判断句子是否存在歧义,如果存在歧义,将其设置为true,否则设置为false。reasons:解释句子产生歧义的原因。example:根据歧义类型和理由,生成多个可能的上下文,展示句子的不同含义。每个上下文必须完整包含原句。输出时仅保留括号内的内容\.\(Please judge whether the sentence "source" is ambiguous according to the Potential Ambiguous Structure "ambType" and linguistic explanation "ambReason" of the input\. Your output needs to include the following information:ambiguity: judge whether the sentence is ambiguous\. If there is ambiguity, set it to true, otherwise set it to false\. Reasons: explain the reasons why sentences are ambiguous\. Example: generate multiplepossible contexts to show the different meanings of sentences according to the types and reasons of ambiguity\. Each context must contain the original sentence completely\. Only the contents in brackets \{\} are retained when outputting\)输出应该以符合JSON模式的JSON实例进行格式化输出:\(The output should be formatted as a JSON instance that conforms to the JSON schema below:\)\{"properties" : \{"foo": \{"title": "Foo", "description": "list of integer", "type": "array", "items": \{"type": "integer"\}\}\}, "required": \["foo"\]\}\. The object \{"foo": \[0, 1\]\} is a well\-formatted instance of the schema\. The object \{"properties": \{"foo": \[0, 1\]\}\} is not well\-formatted\.这是我们任务输出的格式:\(This is the schema of our task output:\)\{"properties": \{"ambiguity": \{"title": "Ambiguity", "description": "bool","type": "bool"\}, "reasons": \{"title": "Reasons", "description": "text", "type": "string"\}, "examples": \{"title": "Examples", "description": "array of ambiguity examples", "type": "array", "items": \{"type": "string"\}\}\}, "required": \["ambiguity", "reasons", "examples"\]\}接下来请回答正式输入:\(Next, please answer the formal input:\)
### A\.3CHA\-Gen Corpus
TableLABEL:tab:ambi\-structurepresents the Potential Ambiguous Structures and the corresponding examples of CHA\-Gen\. The notations "VP, NP, N, Q, ADJ, V, PREP" stand for "Verb Phrase, Noun Phrase, Noun, Quantifier, Adjective, Verb, Preposition", respectively\.
Table 12:Potential Ambiguous Structures and examples of the CHA\-Gen corpus\.Potential Ambiguous StructureExampleLinguistic explanationVP\+的是\+NP反对的是少数人 \(Those who oppose are a minority \\A minority is being opposed\)句子可以理解为少数人作为动作主体进行反对,也可理解为被反对的对象是少数人 \(The sentence can be understood as a minority acting as the subject of opposition, or as the object of opposition being the minority\)N1\+N2\+N3进口飞机引擎 \(Imported aircraft engine \\ Engine of imported aircraft\)句子可以理解为进口的飞机引擎,也可以理解为进口飞机的引擎 \(The sentence can be understood as an imported aircraft engine, or as an engine of the imported aircraft\)VP\+Q\+NP分析了十分钟数据 \(Data were analyzed for ten minutes \\ A ten\-minute data was analyzed\)句子中的“十分钟”既可以理解为分析动作持续的时间(分析用了十分钟),也可以理解为被分析的数据内容(十分钟长度的数据) \(The phrase ’ten minutes’ in the sentence can be understood as both the duration of the analysis action \(which took ten minutes\) and the content of the analyzed data \(which is ten minutes long\)\)N1\+N2牛奶面包 \(Milk\-flavored bread \\ Milk and bread\)句子可以理解为并列关系的两种食物(牛奶和面包),也可理解为用牛奶制作的面包(偏正结构) \(The Sentence can be understood as two parallel foods \(milk and bread\), or as bread made from milk \(with a skewed structure\)\)Q\+NP1\+的\+NP2三个学校校长 \(principal from three schools \\ Three principals from a school\)句子可指3所学校的校长(每校1人)或3名担任校长职务的人(可能属于同一学校) \(The sentence can refer to the principals of three schools \(one person per school\) or three individuals holding the position of principal \(who may belong to the same school\)\)ADJ\+N1\+N2小型宠物商店 \(Small pet store \\ Store for small pets\)句子可以理解为小型的(宠物商店),也可以理解为(小型宠物)的商店 \(The sentence can be understood as a small \(pet store\) or a store for \(small pets\)\)VP1\+VP2\+的\+N看打球的学生 \(Students who watch ball games \\ Watch students playing ball games\)句子既可以理解为学生在看打球(的比赛),也可以理解为他人看这些打球的学生。 \(The sentence can be understood as students watching basketball games, or as watching students playing games\.\)V1\+V2\+NP评估改善工作流程 \(Evaluate and improve workflow \\ Evaluate improved workflow\)短语结构存在双重解析可能:1\. “评估改善后的工作流程”(评估对象为已被改善的流程) 2\. “评估并改善工作流程”(并列动作) \(There is a possibility of dual parsing in phrase structures: 1 Evaluate the improved workflow \(evaluating the improved process\) 2 Evaluate and improve workflow \(parallel action\)\)N1\+和\+N2\+的\+N3桌子和椅子的腿 \(Legs of both tables and chairs \\ Chairs and legs of tables\)句子可以理解为桌子的腿和椅子腿,也可以理解为桌子椅子的共有腿 \(The sentence can be understood as a leg of a table and a leg of a chair, or as the common leg of a table and a chair\)V\+N翻译文件 \(Translate documents \\ Translated documents\)句子既可表示翻译的动作行为,也可指用于翻译的特定文件 \(The sentence can represent both the action behavior of translation and specific translated documents\)N1\+的\+N2\+和\+N3手机的屏幕和外壳 \(The screen of a mobile phone and a casing \\ The screen and casing of a mobile phone\)“手机”可以同时修饰“屏幕”和“外壳”\(联合修饰\),也可以仅修饰“屏幕”而让“外壳”成为独立并列项\(单边修饰\)。 \(The ’phone’ can modify both the ’screen’ and the ’casing’ \(joint modification\), or it can modify only the ’screen’ and make the ’casing’ a separate item \(unilateral modification\)\)N\+V软件开发 \(Software development \\ Develop a software\)句子既可以作为整体名词概念指代软件工程领域,也可以理解为名词\+动词结构表示对软件进行开发的动作过程(对软件进行开发) \(The sentence can be used as a holistic noun concept to refer to the field of software engineering, or it can be understood as a noun verb structure to represent the action process of developing software \(developing software\)\)PREP\+N1\+的\+N2关于老师的意见 \(Opinions about the teacher \\ About teacher’s opinions\)可以理解为关于(老师所做的意见),也可以理解为这个意见是关于老师的 \(The sentence can be understood as about \(the teacher’s opinion\), or it can be understood as an opinion about the teacher\)VP\+ADJ\+的\+N照顾周到的妈妈 \(Thoughtful mother \\ Taking care of thoughtful mother\)句子可以表示照顾(周到的妈妈),“照顾”修饰“周到的妈妈”形成动宾结构; 也可以表示(照顾周到的)妈妈,理解为妈妈是照顾周到的 \(The sentence can express care \(thoughtful mother\), and "care" modifies "thoughtful mother" to form an object verb structure; It can also mean a thoughtful mother, understood as a mother who is thoughtful\)VP\+N1\+的\+N2拥抱父亲的儿子 \(The son who embrace his father \\ Embracing father’s son\)既可以理解为(拥抱父亲)的儿子,也可以理解为拥抱(父亲的儿子) \(The sentence can be understood as the son who embrace his father, or as embracing the son of the father\)V\+ADJ\+N晒干毛巾 \(Sun dried towels \\ Sunbathing dry towels\)句子可理解为:“晒/干毛巾”(将干毛巾拿去晾晒)或“晒干/毛巾”(通过晾晒使毛巾变干) \(The sentence can be understood as: ’Drying/drying towels’ \(taking dry towels to air dry\) or’ sun drying/drying towels’ \(drying towels by air drying\)\)N\+V\+NP\+AP同学评价老师很严格 \(The classmate evaluated the teacher for being very strict \\ Classmates evaluate teachers as very strict\)句子可以理解为同学对“老师很严格”进行评价或者同学评价老师的行为本身很严格 \(The sentence can be understood as a classmate’s evaluation of ’the teacher is very strict’ or a classmate’s evaluation of the teacher’s behavior as being very strict\)全部 \\ 部分\+VP\+的\+NP全部打开的窗户 \(A completely opened window \\ All of the opened windows\)句子可以理解为所有被打开了的窗户,也可以理解为窗户被完全打开 \(The sentence can be understood as all the windows that have been opened, or as windows that have been fully opened\)
## Appendix BPrompts for Translation Sampling
The prompts used for sampling English translations are presented in Table[13](https://arxiv.org/html/2605.15635#A2.T13)\. To better control the model outputs, we enclose the expected translations in double quotation marks\. For Instruct models, after applying the corresponding chat templates, we additionally prepend “English:"” to further constrain the generated outputs\.
Table 13:Prompt templates used for sampling English translations\.\(a\)Template for Base modelsTranslate this from Chinese into English:
Chinese: "\{src\}"
English: "\(b\)Template for Instruct models with the corresponding chat template \(Qwen3 as an example\)<\|im\_start\|\>user
Translate this from Chinese into English:
Chinese: "\{src\}"<\|im\_end\|\>
<\|im\_start\|\>assistant
<think\></think\>
English: "
## Appendix CDeclaration of generative AI
During the preparation of this work, the authors used ChatGPT and DeepSeek as writing\-support tools to improve clarity and refine the language, and used Gemini Nano Banana to enhance sketches and generate small illustrations\. All content was carefully reviewed and revised by the authors, who take full responsibility for the final version\.
## References
- Bajceta et al\. \(2022\)Bajceta, A\., Leon, M\., Afzal, W\., Lindberg, P\., Bohlin, M\., 2022\.Using NLP tools to detect ambiguities in system requirements \- A comparison study, in: Fischbach, J\., Condori\-Fernández, N\., Dörr, J\., Ruiz, M\., Steghöfer, J\., Pasquale, L\., Zisman, A\., Guizzardi, R\.S\.S\., Horkoff, J\., Perini, A\., Susi, A\., Daneva, M\., Herrmann, A\., Schneider, K\., Mennig, P\., Dalpiaz, F\., Dell’Anna, D\., Kopczynska, S\., Montgomery, L\., Darby, A\.G\., Sawyer, P\. \(Eds\.\), Joint Proceedings of REFSQ\-2022 Workshops, Doctoral Symposium, and Posters & Tools Track co\-located with the 28th International Conference on Requirements Engineering: Foundation for Software Quality \(REFSQ 2022\), Aston, Birmingham, UK, March 21, 2022, CEUR\-WS\.org\.URL:[https://ceur\-ws\.org/Vol\-3122/NLP4RE\-paper\-3\.pdf](https://ceur-ws.org/Vol-3122/NLP4RE-paper-3.pdf)\.
- Bhaskar et al\. \(2023\)Bhaskar, A\., Tomar, T\., Sathe, A\., Sarawagi, S\., 2023\.Benchmarking and improving text\-to\-SQL generation under ambiguity, in: Bouamor, H\., Pino, J\., Bali, K\. \(Eds\.\), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore\. pp\. 7053–7074\.URL:[https://aclanthology\.org/2023\.emnlp\-main\.436/](https://aclanthology.org/2023.emnlp-main.436/), doi:[10\.18653/v1/2023\.emnlp\-main\.436](https://arxiv.org/doi.org/10.18653/v1/2023.emnlp-main.436)\.
- Brown et al\. \(2020\)Brown, T\.B\., Mann, B\., Ryder, N\., Subbiah, M\., Kaplan, J\., Dhariwal, P\., Neelakantan, A\., Shyam, P\., Sastry, G\., Askell, A\., Agarwal, S\., Herbert\-Voss, A\., Krueger, G\., Henighan, T\., Child, R\., Ramesh, A\., Ziegler, D\.M\., Wu, J\., Winter, C\., Hesse, C\., Chen, M\., Sigler, E\., Litwin, M\., Gray, S\., Chess, B\., Clark, J\., Berner, C\., McCandlish, S\., Radford, A\., Sutskever, I\., Amodei, D\., 2020\.Language models are few\-shot learners, in: Larochelle, H\., Ranzato, M\., Hadsell, R\., Balcan, M\., Lin, H\. \(Eds\.\), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6\-12, 2020, virtual\.URL:[https://proceedings\.neurips\.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a\-Abstract\.html](https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html)\.
- Chen et al\. \(2024\)Chen, X\., Wang, C\., Xue, Y\., Zhang, N\., Yang, X\., Li, Q\., Shen, Y\., Liang, L\., Gu, J\., Chen, H\., 2024\.Unified hallucination detection for multimodal large language models, in: Ku, L\.W\., Martins, A\., Srikumar, V\. \(Eds\.\), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), Association for Computational Linguistics, Bangkok, Thailand\. pp\. 3235–3252\.URL:[https://aclanthology\.org/2024\.acl\-long\.178/](https://aclanthology.org/2024.acl-long.178/), doi:[10\.18653/v1/2024\.acl\-long\.178](https://arxiv.org/doi.org/10.18653/v1/2024.acl-long.178)\.
- Feng \(1989\)Feng, Z\., 1989\.Structural description of chinese scientific terms and potential ambiguity\.Journal of Chinese Information Processing , 1–16\.
- Feng \(1995\)Feng, Z\., 1995\.On potential nature of ambiguous construction\.Journal of Chinese Information Processing , 14–24\.
- Guo et al\. \(2025\)Guo, D\., Yang, D\., Zhang, H\., Song, J\., Zhang, R\., Xu, R\., Zhu, Q\., Ma, S\., Wang, P\., Bi, X\., et al\., 2025\.Deepseek\-r1: Incentivizing reasoning capability in llms via reinforcement learning\.arXiv preprint arXiv:2501\.12948 \.
- He et al\. \(2020\)He, J\., Wang, T\., Xiong, D\., Liu, Q\., 2020\.The box is in the pen: Evaluating commonsense reasoning in neural machine translation, in: Cohn, T\., He, Y\., Liu, Y\. \(Eds\.\), Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online\. pp\. 3662–3672\.URL:[https://aclanthology\.org/2020\.findings\-emnlp\.327/](https://aclanthology.org/2020.findings-emnlp.327/), doi:[10\.18653/v1/2020\.findings\-emnlp\.327](https://arxiv.org/doi.org/10.18653/v1/2020.findings-emnlp.327)\.
- Itzhak et al\. \(2024\)Itzhak, I\., Stanovsky, G\., Rosenfeld, N\., Belinkov, Y\., 2024\.Instructed to bias: Instruction\-tuned language models exhibit emergent cognitive bias\.Transactions of the Association for Computational Linguistics 12, 771–785\.URL:[https://aclanthology\.org/2024\.tacl\-1\.43/](https://aclanthology.org/2024.tacl-1.43/), doi:[10\.1162/tacl\_a\_00673](https://arxiv.org/doi.org/10.1162/tacl_a_00673)\.
- Kim et al\. \(2024\)Kim, H\.J\., Kim, Y\., Park, C\., Kim, J\., Park, C\., Yoo, K\.M\., Lee, S\.g\., Kim, T\., 2024\.Aligning language models to explicitly handle ambiguity, in: Al\-Onaizan, Y\., Bansal, M\., Chen, Y\.N\. \(Eds\.\), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA\. pp\. 1989–2007\.URL:[https://aclanthology\.org/2024\.emnlp\-main\.119/](https://aclanthology.org/2024.emnlp-main.119/), doi:[10\.18653/v1/2024\.emnlp\-main\.119](https://arxiv.org/doi.org/10.18653/v1/2024.emnlp-main.119)\.
- Kuhn et al\. \(2023\)Kuhn, L\., Gal, Y\., Farquhar, S\., 2023\.Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation, in: The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1\-5, 2023, OpenReview\.net\.URL:[https://openreview\.net/forum?id=VD\-AYtP0dve](https://openreview.net/forum?id=VD-AYtP0dve)\.
- Li \(2021\)Li, F\., 2021\.Study on chinese semantic content based on syntactic differences between chinese and english, in: 7th International Conference on Social Science and Higher Education \(ICSSHE 2021\), Atlantis Press\. pp\. 542–545\.
- Lin and Chen \(2023\)Lin, Y\.T\., Chen, Y\.N\., 2023\.LLM\-eval: Unified multi\-dimensional automatic evaluation for open\-domain conversations with large language models, in: Chen, Y\.N\., Rastogi, A\. \(Eds\.\), Proceedings of the 5th Workshop on NLP for Conversational AI \(NLP4ConvAI 2023\), Association for Computational Linguistics, Toronto, Canada\. pp\. 47–58\.URL:[https://aclanthology\.org/2023\.nlp4convai\-1\.5/](https://aclanthology.org/2023.nlp4convai-1.5/), doi:[10\.18653/v1/2023\.nlp4convai\-1\.5](https://arxiv.org/doi.org/10.18653/v1/2023.nlp4convai-1.5)\.
- Liu et al\. \(2024\)Liu, A\., Feng, B\., Xue, B\., Wang, B\., Wu, B\., Lu, C\., Zhao, C\., Deng, C\., Zhang, C\., Ruan, C\., et al\., 2024\.Deepseek\-v3 technical report\.arXiv preprint arXiv:2412\.19437 \.
- Liu et al\. \(2023\)Liu, A\., Wu, Z\., Michael, J\., Suhr, A\., West, P\., Koller, A\., Swayamdipta, S\., Smith, N\., Choi, Y\., 2023\.We’re afraid language models aren’t modeling ambiguity, in: Bouamor, H\., Pino, J\., Bali, K\. \(Eds\.\), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore\. pp\. 790–807\.URL:[https://aclanthology\.org/2023\.emnlp\-main\.51/](https://aclanthology.org/2023.emnlp-main.51/), doi:[10\.18653/v1/2023\.emnlp\-main\.51](https://arxiv.org/doi.org/10.18653/v1/2023.emnlp-main.51)\.
- Ma et al\. \(2024\)Ma, X\., Liu, X\., Wong, D\.F\., Rao, J\., Li, B\., Ding, L\., Chao, L\.S\., Tao, D\., Zhang, M\., 2024\.3AM: An ambiguity\-aware multi\-modal machine translation dataset, in: Calzolari, N\., Kan, M\.Y\., Hoste, V\., Lenci, A\., Sakti, S\., Xue, N\. \(Eds\.\), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\), ELRA and ICCL, Torino, Italia\. pp\. 1–13\.URL:[https://aclanthology\.org/2024\.lrec\-main\.1/](https://aclanthology.org/2024.lrec-main.1/)\.
- Mehrabi et al\. \(2023\)Mehrabi, N\., Goyal, P\., Verma, A\., Dhamala, J\., Kumar, V\., Hu, Q\., Chang, K\.W\., Zemel, R\., Galstyan, A\., Gupta, R\., 2023\.Resolving ambiguities in text\-to\-image generative models, in: Rogers, A\., Boyd\-Graber, J\., Okazaki, N\. \(Eds\.\), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), Association for Computational Linguistics, Toronto, Canada\. pp\. 14367–14388\.URL:[https://aclanthology\.org/2023\.acl\-long\.804/](https://aclanthology.org/2023.acl-long.804/), doi:[10\.18653/v1/2023\.acl\-long\.804](https://arxiv.org/doi.org/10.18653/v1/2023.acl-long.804)\.
- Mehrparvar and Pezzelle \(2024\)Mehrparvar, B\., Pezzelle, S\., 2024\.Detecting and translating language ambiguity with multilingual LLMs, in: Sälevä, J\., Owodunni, A\. \(Eds\.\), Proceedings of the Fourth Workshop on Multilingual Representation Learning \(MRL 2024\), Association for Computational Linguistics, Miami, Florida, USA\. pp\. 310–323\.URL:[https://aclanthology\.org/2024\.mrl\-1\.26/](https://aclanthology.org/2024.mrl-1.26/), doi:[10\.18653/v1/2024\.mrl\-1\.26](https://arxiv.org/doi.org/10.18653/v1/2024.mrl-1.26)\.
- Min et al\. \(2020\)Min, S\., Michael, J\., Hajishirzi, H\., Zettlemoyer, L\., 2020\.AmbigQA: Answering ambiguous open\-domain questions, in: Webber, B\., Cohn, T\., He, Y\., Liu, Y\. \(Eds\.\), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\), Association for Computational Linguistics, Online\. pp\. 5783–5797\.URL:[https://aclanthology\.org/2020\.emnlp\-main\.466/](https://aclanthology.org/2020.emnlp-main.466/), doi:[10\.18653/v1/2020\.emnlp\-main\.466](https://arxiv.org/doi.org/10.18653/v1/2020.emnlp-main.466)\.
- Ortega\-Martín et al\. \(2023\)Ortega\-Martín, M\., García\-Sierra, Ó\., Ardoiz, A\., Álvarez, J\., Armenteros, J\.C\., Alonso, A\., 2023\.Linguistic ambiguity analysis in chatgpt\.arXiv preprint arXiv:2302\.06426 \.
- Piantadosi et al\. \(2012\)Piantadosi, S\.T\., Tily, H\., Gibson, E\., 2012\.The communicative function of ambiguity in language\.Cognition 122, 280–291\.
- Rostamkhani et al\. \(2025\)Rostamkhani, M\., Ansari, B\., Sabzevari, H\., Rahmani, F\., Eetemadi, S\., 2025\.Illusory VQA: benchmarking and enhancing multimodal models on visual illusions, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2025, Nashville, TN, USA, June 11\-15, 2025, Computer Vision Foundation / IEEE\. pp\. 2995–3004\.URL:[https://openaccess\.thecvf\.com/content/CVPR2025W/MAR/html/Rostamkhani\_Illusory\_VQA\_Benchmarking\_and\_Enhancing\_Multimodal\_Models\_on\_Visual\_Illusions\_CVPRW\_2025\_paper\.html](https://openaccess.thecvf.com/content/CVPR2025W/MAR/html/Rostamkhani_Illusory_VQA_Benchmarking_and_Enhancing_Multimodal_Models_on_Visual_Illusions_CVPRW_2025_paper.html)\.
- Team et al\. \(2023\)Team, G\., Anil, R\., Borgeaud, S\., Alayrac, J\.B\., Yu, J\., Soricut, R\., Schalkwyk, J\., Dai, A\.M\., Hauth, A\., Millican, K\., et al\., 2023\.Gemini: a family of highly capable multimodal models\.arXiv preprint arXiv:2312\.11805 \.
- Wang et al\. \(2023\)Wang, B\., Gao, Y\., Li, Z\., Lou, J\.G\., 2023\.Know what I don’t know: Handling ambiguous and unknown questions for text\-to\-SQL, in: Rogers, A\., Boyd\-Graber, J\., Okazaki, N\. \(Eds\.\), Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada\. pp\. 5701–5714\.URL:[https://aclanthology\.org/2023\.findings\-acl\.352/](https://aclanthology.org/2023.findings-acl.352/), doi:[10\.18653/v1/2023\.findings\-acl\.352](https://arxiv.org/doi.org/10.18653/v1/2023.findings-acl.352)\.
- Wang et al\. \(2025\)Wang, X\., Kang, Z\., Zhai, W\., Lou, X\., Lai, Y\., Wang, Z\., Wang, Y\., Huang, K\., Wang, Y\., Li, P\., Liu, Y\., 2025\.MUCAR: Benchmarking multilingual cross\-modal ambiguity resolution for multimodal large language models, in: Christodoulopoulos, C\., Chakraborty, T\., Rose, C\., Peng, V\. \(Eds\.\), Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Suzhou, China\. pp\. 15026–15048\.URL:[https://aclanthology\.org/2025\.emnlp\-main\.760/](https://aclanthology.org/2025.emnlp-main.760/), doi:[10\.18653/v1/2025\.emnlp\-main\.760](https://arxiv.org/doi.org/10.18653/v1/2025.emnlp-main.760)\.
- Wildenburg et al\. \(2024\)Wildenburg, F\., Hanna, M\., Pezzelle, S\., 2024\.Do pre\-trained language models detect and understand semantic underspecification? ask the DUST\!, in: Ku, L\.W\., Martins, A\., Srikumar, V\. \(Eds\.\), Findings of the Association for Computational Linguistics: ACL 2024, Association for Computational Linguistics, Bangkok, Thailand\. pp\. 9598–9613\.URL:[https://aclanthology\.org/2024\.findings\-acl\.572/](https://aclanthology.org/2024.findings-acl.572/), doi:[10\.18653/v1/2024\.findings\-acl\.572](https://arxiv.org/doi.org/10.18653/v1/2024.findings-acl.572)\.
- Wu et al\. \(2025\)Wu, X\., Li, H\., Liu, H\., Ji, X\., Li, R\., Chen, Y\., Zhang, Y\., 2025\.Uncovering the fragility of trustworthy llms through chinese textual ambiguity\.CoRR abs/2507\.23121\.URL:[https://doi\.org/10\.48550/arXiv\.2507\.23121](https://doi.org/10.48550/arXiv.2507.23121), doi:[10\.48550/ARXIV\.2507\.23121](https://arxiv.org/doi.org/10.48550/ARXIV.2507.23121),[arXiv:2507\.23121](http://arxiv.org/abs/2507.23121)\.
- Zhang et al\. \(2024\)Zhang, Q\., Cai, S\., Zhao, J\., Pechenizkiy, M\., Fang, M\., 2024\.CHAmbi: A new benchmark on Chinese ambiguity challenges for large language models, in: Al\-Onaizan, Y\., Bansal, M\., Chen, Y\.N\. \(Eds\.\), Findings of the Association for Computational Linguistics: EMNLP 2024, Association for Computational Linguistics, Miami, Florida, USA\. pp\. 14883–14898\.URL:[https://aclanthology\.org/2024\.findings\-emnlp\.875/](https://aclanthology.org/2024.findings-emnlp.875/), doi:[10\.18653/v1/2024\.findings\-emnlp\.875](https://arxiv.org/doi.org/10.18653/v1/2024.findings-emnlp.875)\.
- Zheng et al\. \(2024\)Zheng, C\., Zhou, H\., Meng, F\., Zhou, J\., Huang, M\., 2024\.Large language models are not robust multiple choice selectors, in: The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024, OpenReview\.net\.URL:[https://openreview\.net/forum?id=shr9PXz7T0](https://openreview.net/forum?id=shr9PXz7T0)\.Similar Articles
Exploring the Capability Boundaries of LLMs in Mastering Chinese Chouxiang Language
This paper introduces Mouse, a specialized benchmark for evaluating LLMs on Chinese Chouxiang Language tasks across six NLP domains, revealing that current state-of-the-art models have significant limitations with this subcultural internet language despite performing well on contextual understanding tasks.
Human-Alignment, Calibration, and Activation Patterns in Large Language Model Uncertainty
This paper investigates how similar large language model uncertainty is to human uncertainty, exploring alignment, calibration, and activation patterns in LLMs across multiple datasets and the impact of instruction fine-tuning.
Localizing Prompt Ambiguity in Large Language Models with Probe-Targeted Attribution
Introduces PRIG, a gradient attribution method that localizes prompt ambiguity in large language models by training a linear probe to distinguish clear from ambiguous prompts and attributing the probe score to token representations in the residual stream, achieving strong performance on synthetic and human-written benchmarks.
Can LLMs Take Retrieved Information with a Grain of Salt?
This paper investigates how large language models adapt to the certainty of retrieved information, identifying systematic limitations in handling uncertainty. It proposes an interaction strategy that reduces obedience errors by 25% without modifying model weights.
LLMs for automatic annotation of Mandarin narrative transcripts
This paper evaluates LLMs for automatically annotating narrative macrostructure in spoken Mandarin, finding that the best model achieves near-human reliability while reducing annotation time by 65%, though performance degrades on semantically complex or lexically diverse narratives.