Lost in Interpretation: The Plausibility-Faithfulness Trade-off in Cross-Lingual Explanations


Summary

This paper investigates the trade-off between plausibility and faithfulness in cross-lingual explanations from LLMs, finding that English-pivot explanations achieve higher span agreement with human rationales but suffer reduced causal faithfulness compared to native-language explanations.

arXiv:2605.19274v1

# The Plausibility-Faithfulness Trade-off in Cross-Lingual Explanations
Source: [https://arxiv.org/html/2605.19274](https://arxiv.org/html/2605.19274)
Somnath Banerjee¹, Pranav Jha¹, Rima Hazra²,³, Animesh Mukherjee¹

¹Indian Institute of Technology Kharagpur, ²TCG Crest, ³National University of Singapore

###### Abstract

LLMs deployed multilingually are often audited via English explanations for non-English inputs. We evaluate *extractive* explanations, where the model identifies input token spans as evidence alongside a generated rationale, and uncover a systematic trade-off: English-pivot explanations can achieve higher span agreement with human rationales while their evidence becomes less causally grounded in the model’s prediction, as measured by both comprehensiveness and sufficiency. Across 3 tasks, 5 languages, and 2 multilingual LLM families, we find that English explanations frequently produce fluent but loosely anchored rationales, with comprehensiveness degrading by up to $5.7\times$ relative to native-language conditions, even as task accuracy remains stable across settings. For socially nuanced classification, English pivots also fail to preserve pragmatic cues, reducing both faithfulness and span agreement. We recommend auditing explanations in the input language, reporting multi-faceted faithfulness metrics beyond lexical overlap, and treating English rationales as communication summaries rather than faithful decision traces.

## 1 Introduction

As LLMs are increasingly deployed in global contexts Eiden ([2024](https://arxiv.org/html/2605.19274#bib.bib20)); Jadhav et al. ([2025](https://arxiv.org/html/2605.19274#bib.bib21)), they routinely operate under cross-lingual constraints where the user’s input language differs from the system’s reporting language. This setting is common in applications such as customer support and public-service workflows, where users submit requests in local languages (e.g., Chinese, Hindi, Malay) while downstream analysts, auditors, or operational teams require English explanations for triage and decision-making [Amazon Web Services (2024)](https://arxiv.org/html/2605.19274#bib.bib22); [19](https://arxiv.org/html/2605.19274#bib.bib23). In this work, we focus specifically on extractive explanations, where the model identifies spans from the input text as evidence for its prediction, alongside a free-text rationale. This dual-output format (evidence spans plus narrative explanation) is common in deployed systems that require both auditability (which spans mattered?) and interpretability (why did the model decide this?). Crucially, while the narrative explanation may be generated in any language, the evidence spans are always drawn verbatim from the input, enabling language-independent faithfulness evaluation.

For example, in a banking support pipeline, a user reports in Hindi, *mera UPI debit ho gaya lekin balance nahi aaya* (money is debited but not received). The intended English summary for the backend team is: *UPI transaction: the amount is debited from the customer’s account, but the beneficiary did not receive the credit. Please check pending status or credit reversal.* Instead, the model sometimes produces the incorrect summary *UPI transaction failed*, collapsing a debit-without-credit event into a generic failure and thereby altering the operational interpretation.

![Refer to caption](https://arxiv.org/html/2605.19274v1/x1.png)

Figure 1: The plausibility–faithfulness trade-off in e-SNLI (Qwen2.5-7B). Arrows show the shift from native-language explanations to English-pivot explanations. English pivots tend to increase span agreement with human rationales (y-axis) while reducing comprehensiveness (x-axis), indicating that the cited evidence becomes less causally necessary for the prediction.

This reporting language choice typically reflects developer preferences, organizational policy, and the dominance of English-centric tooling and evaluation benchmarks. This language mismatch introduces a critical and largely unexamined question: *when a model explains a decision in a language different from the input, does the explanation lose faithfulness?* In other words, does it still accurately reflect the model’s underlying decision process? Figure [1](https://arxiv.org/html/2605.19274#S1.F1) illustrates the central tension we investigate: whether generating explanations in English, rather than in the input language, increases perceived plausibility while reducing faithfulness.

Prior research has extensively studied the faithfulness of self-explanations in monolingual English settings Jacovi and Goldberg ([2020](https://arxiv.org/html/2605.19274#bib.bib1)) and established the fundamental distinction between *plausibility* (how human-like an explanation appears) and *faithfulness* (the causal link between the explanation and the model’s prediction) Wiegreffe et al. ([2021](https://arxiv.org/html/2605.19274#bib.bib7)). More recently, efforts have shifted toward cross-lingual settings, exploring attribution faithfulness across translated pairs Vamvas and Sennrich ([2023](https://arxiv.org/html/2605.19274#bib.bib9)). Yet these studies typically keep the input and reporting languages aligned, failing to treat the *reporting language itself* as an independent experimental variable.

We address this gap by systematically evaluating explanation-language mismatch and treating the reporting language as a controlled experimental variable. We build upon the work of Huang et al. ([2023](https://arxiv.org/html/2605.19274#bib.bib10)), who suggest that multilingual LLMs often display an *English bias* in reasoning. We hypothesize that this bias, often described as a trade-off between adequacy and fluency Conneau et al. ([2020](https://arxiv.org/html/2605.19274#bib.bib15)), results in a phenomenon we term the *plausibility-faithfulness trade-off*. Across three tasks – natural language inference (NLI) Camburu et al. ([2018](https://arxiv.org/html/2605.19274#bib.bib12)), fact verification Thorne et al. ([2018](https://arxiv.org/html/2605.19274#bib.bib14)), and hate speech detection Mathew et al. ([2020](https://arxiv.org/html/2605.19274#bib.bib13)) – we observe that English explanations often achieve higher span agreement with human rationales than explanations generated in the input language, particularly for reasoning-intensive and factual tasks. A human-rated subsample confirms that span agreement is moderately correlated with perceived plausibility ($\rho = 0.67$, $p < 0.001$; Appendix [G](https://arxiv.org/html/2605.19274#A7)). However, deletion-based perturbation tests following ERASER-style evaluation DeYoung et al. ([2020](https://arxiv.org/html/2605.19274#bib.bib8)) indicate that these same English explanations are frequently less faithful to the features that causally drive the model’s predictions. Through this study, we make the following contributions:

1. We provide the first controlled empirical study of explanation-language mismatch, treating the reporting language as an independent variable and measuring its effect on both span-level agreement and perturbation-based faithfulness (comprehensiveness and sufficiency).
2. We identify a plausibility–faithfulness trade-off: across multiple languages and tasks, English-pivot explanations can show higher span agreement with human rationales while their cited evidence becomes less causally necessary for the model’s prediction, as confirmed by multiple faithfulness probes and prompt sensitivity analysis.
3. We show that this effect is task-dependent: for socially nuanced classification, English pivots degrade both dimensions, revealing a distinct failure mode tied to loss of pragmatic cues.

## 2 Related Work

Faithfulness and plausibility in explanations. Jacovi and Goldberg ([2020](https://arxiv.org/html/2605.19274#bib.bib1)) formalized the distinction between faithfulness (whether an explanation reflects the model’s actual reasoning) and plausibility (whether it appears convincing to humans), establishing that these properties are independent and can diverge. Wiegreffe et al. ([2021](https://arxiv.org/html/2605.19274#bib.bib7)) operationalized faithfulness measurement for text classifiers via perturbation tests. The ERASER benchmark DeYoung et al. ([2020](https://arxiv.org/html/2605.19274#bib.bib8)) standardized evaluation through comprehensiveness and sufficiency metrics, which we adopt. Recent work has shown that LLM self-explanations can be persuasive yet unfaithful, functioning as post-hoc rationalizations Turpin et al. ([2023](https://arxiv.org/html/2605.19274#bib.bib16)); Lanham et al. ([2023](https://arxiv.org/html/2605.19274#bib.bib17)).

Cross-lingual explainability. Vamvas and Sennrich ([2023](https://arxiv.org/html/2605.19274#bib.bib9)); Banerjee et al. ([2025a](https://arxiv.org/html/2605.19274#bib.bib5)) studied attribution faithfulness across translated pairs, finding that translation can shift saliency maps. Banerjee et al. ([2025b](https://arxiv.org/html/2605.19274#bib.bib6)) explored cross-lingual transfer of explainable NLP capabilities from a safety perspective. However, both lines of work keep the input and explanation languages aligned. We differ by treating the reporting language itself as a controlled variable.

English bias in multilingual LLMs. Huang et al. ([2023](https://arxiv.org/html/2605.19274#bib.bib10)) demonstrated that multilingual LLMs often reason more effectively in English, and Conneau et al. ([2020](https://arxiv.org/html/2605.19274#bib.bib15)) characterized the adequacy–fluency tension in multilingual models. These findings motivate our hypothesis that English-pivot explanations may be optimized for fluency at the cost of faithfulness to non-English input cues.

## 3 Experimental framework

To investigate the relationship between the reporting language and model faithfulness, we design a controlled experimental setup that isolates the language of the explanation while keeping the task and input semantics constant.

### 3.1 Linguistic conditions

For each dataset, we evaluate three experimental conditions that differ only in the input and reporting languages, thereby isolating the effect of reporting-language mismatch.

1. Condition A ($\text{EN} \to \text{EN}$): The input and the explanation are both in English. This condition provides a monolingual reference point and approximates an upper bound on explanation quality.
2. Condition B ($L_{\text{native}} \to L_{\text{native}}$): The input and the explanation are both in the same non-English language. This condition captures language-aligned multilingual usage.
3. Condition C ($L_{\text{native}} \to \text{EN}$): The input is in a non-English language, but the explanation is generated in English. This condition instantiates the reporting-language mismatch typical of English-centric deployments.
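The three conditions can be written down as a small data structure, which makes explicit that only the explanation language changes between Conditions B and C. This is a minimal sketch: the `Condition` class, the `conditions_for` helper, and the lowercase language codes are illustrative names, not part of the paper's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """One linguistic condition (hypothetical container, for illustration)."""
    name: str              # "A", "B", or "C"
    input_lang: str        # language of the task input
    explanation_lang: str  # language requested for the free-text explanation

    @property
    def is_mismatch(self) -> bool:
        # True only when the reporting language differs from the input language.
        return self.input_lang != self.explanation_lang


def conditions_for(native_lang: str) -> list:
    """Build Conditions A, B, and C for one non-English language code."""
    return [
        Condition("A", "en", "en"),
        Condition("B", native_lang, native_lang),
        Condition("C", native_lang, "en"),
    ]


hindi_conditions = conditions_for("hi")
```

Only Condition C sets `is_mismatch`; Conditions B and C share the same input language, so comparing them isolates the reporting language.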

### 3.2 Datasets and tasks

We use three benchmark datasets spanning distinct reasoning demands: (1) e-SNLI Camburu et al. ([2018](https://arxiv.org/html/2605.19274#bib.bib12)): natural language inference, which tests compositional and logical reasoning. (2) FEVER Thorne et al. ([2018](https://arxiv.org/html/2605.19274#bib.bib14)): fact verification, which requires evidence identification and factual consistency with supporting context. (3) HateXplain Mathew et al. ([2020](https://arxiv.org/html/2605.19274#bib.bib13)): hate-speech classification, which depends on sensitivity to social nuance.

### 3.3 Multilingual data construction

We evaluate five languages: English (EN) and four non-English languages that commonly appear in multilingual applications, namely Chinese (ZH), Hindi (HI), Arabic (AR), and Bengali (BN). For each dataset, we construct semantically matched test sets by translating the original English test instances into each target language while preserving the task format and gold labels (see example in Fig. [2](https://arxiv.org/html/2605.19274#S3.F2)). Translations are produced with NLLB-200 (3.3B distilled) Costa-jussà and others ([2022](https://arxiv.org/html/2605.19274#bib.bib3)), accessed via the official Hugging Face checkpoint. We selected NLLB-200 for three reasons. First, it provides uniform open-weight coverage of all four target languages (Chinese, Hindi, Arabic, Bengali), avoiding the heterogeneous quality that arises when mixing translation systems across languages. Second, per-direction chrF++ and spBLEU scores for every language pair used in our study are publicly reported on the FLORES-200 evaluation benchmark Costa-jussà and others ([2022](https://arxiv.org/html/2605.19274#bib.bib3)); NLLB Team and others ([2024](https://arxiv.org/html/2605.19274#bib.bib4)), enabling readers to cross-reference baseline translation quality. Third, the model is fully open-weight and reproducible.

To keep plausibility evaluation consistent across languages, we also translate the human rationale signal associated with each instance. For e-SNLI, we translate the natural-language explanation; for FEVER, we use the gold evidence sentences as the rationale signal; for HateXplain, we translate the full text and the annotated highlight spans. We apply Unicode normalization and filter instances with empty or malformed translations.

To verify translation quality, we conduct a structured audit: for each target language, two bilingual annotators independently evaluate 50 randomly sampled instances through Prolific (https://www.prolific.com/) on (1) semantic preservation (whether the meaning is faithfully retained, on a 3-point scale: preserved / minor shift / major shift) and (2) label validity (whether the gold label remains correct for the translated instance). Full audit statistics are reported in Appendix [H](https://arxiv.org/html/2605.19274#A8). Instances flagged as label-altering by either annotator were excluded. In addition, we compute chrF++ scores Popović ([2015](https://arxiv.org/html/2605.19274#bib.bib2)) between back-translations and the original English to provide an automatic cross-check. These measures address the concern that translation artifacts could confound plausibility or faithfulness measurements.
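The normalization and filtering step mentioned above could look roughly like the following. This is a sketch under stated assumptions: the paper does not say which Unicode normalization form is used (NFC is assumed here), nor how "malformed" is defined (the check for the U+FFFD replacement character is a hypothetical stand-in), and the function names are illustrative.

```python
import unicodedata


def normalize(text: str) -> str:
    # NFC is an assumption; the paper does not state the normalization form.
    return unicodedata.normalize("NFC", text).strip()


def is_usable(text: str) -> bool:
    # Hypothetical "malformed" heuristic: reject empty strings and strings
    # containing the Unicode replacement character U+FFFD.
    return bool(text) and "\ufffd" not in text


def clean_translations(pairs):
    """Normalize (instance_id, translated_text) pairs and drop unusable ones."""
    kept = []
    for instance_id, text in pairs:
        norm = normalize(text)
        if is_usable(norm):
            kept.append((instance_id, norm))
    return kept
```

Filtering after normalization ensures that whitespace-only outputs are treated as empty rather than kept as valid translations.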

![Refer to caption](https://arxiv.org/html/2605.19274v1/x2.png)

Figure 2: Sample translation example.

All experiments use the same translated inputs across the three linguistic conditions; only the required language of the model’s explanation is changed. This design isolates explanation-language mismatch while keeping task semantics constant.

### 3.4 Evaluation metrics

We evaluate explanations along four complementary dimensions. All span-level metrics are computed over the evidence spans $\mathcal{E}_m(x)$ that the model copies verbatim from the input.

Notation: For an input instance $x$, let $\mathcal{I}(x)$ denote its tokenized input sequence. Let $\mathcal{E}_m(x) \subseteq \mathcal{I}(x)$ be the set of input-token indices covered by the model-produced evidence spans, and let $\mathcal{E}_h(x) \subseteq \mathcal{I}(x)$ be the set of input-token indices covered by the human rationale annotation (when available). For tasks where human rationales are provided as free-form text (e.g., e-SNLI explanations), we align them to the input via exact substring matching at the character level (for non-Latin scripts) or word level (for English), and treat the matched input-token indices as $\mathcal{E}_h(x)$. This conservative approach underestimates overlap for paraphrased rationales, biasing against inflated span agreement. Worked examples for each language are provided in Appendix [B.3](https://arxiv.org/html/2605.19274#A2.SS3).

Span agreement: We measure the overlap between model-identified and human-annotated evidence using token-level F1:

$$\text{SpanAgr}(x) = \frac{2\,|\mathcal{E}_m(x) \cap \mathcal{E}_h(x)|}{|\mathcal{E}_m(x)| + |\mathcal{E}_h(x)|} \tag{1}$$

We adopt the term *span agreement* rather than “plausibility” to reflect that this metric captures lexical overlap with human rationales, not perceived explanation quality, which would require human judgment Jacovi and Goldberg ([2020](https://arxiv.org/html/2605.19274#bib.bib1)).

Comprehensiveness: Comprehensiveness captures whether the model’s cited evidence is causally *necessary* for its prediction DeYoung et al. ([2020](https://arxiv.org/html/2605.19274#bib.bib8)). We construct a perturbed input $x' = \mathrm{MASK}(x, \mathcal{E}_m(x))$ by replacing the tokens indexed by $\mathcal{E}_m(x)$ with a sentinel token (preserving sequence length) and recomputing the prediction:

$$\text{Comp} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left[f(x_i) \neq f\left(\mathrm{MASK}(x_i, \mathcal{E}_m(x_i))\right)\right] \tag{2}$$

A higher comprehensiveness indicates that removing the model-identified evidence is more likely to change the prediction, implying stronger causal necessity.

Sufficiency: Sufficiency captures whether the cited evidence alone is enough to sustain the prediction DeYoung et al. ([2020](https://arxiv.org/html/2605.19274#bib.bib8)). We construct a reduced input $x'' = \mathrm{KEEP}(x, \mathcal{E}_m(x))$ by masking all tokens except those in $\mathcal{E}_m(x)$ and recomputing:

$$\text{Suff} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left[f(x_i) = f\left(\mathrm{KEEP}(x_i, \mathcal{E}_m(x_i))\right)\right] \tag{3}$$

Higher sufficiency means the evidence spans alone reproduce the original prediction, indicating that the cited evidence is *self-contained*. Together, comprehensiveness and sufficiency provide complementary views of faithfulness. Comprehensiveness ($\uparrow$): removing the evidence changes the prediction (evidence is *necessary*). Sufficiency ($\uparrow$): keeping only the evidence preserves the prediction (evidence is *adequate*). An explanation is well-grounded when *both* are high. Conversely, an English-pivot explanation can produce evidence that is highly self-contained (high sufficiency) yet not causally necessary (low comprehensiveness), a signature of rhetorically tight summaries that do not reflect the model’s actual decision cues.

Note on sufficiency in cross-lingual settings: In monolingual rationale extraction, high sufficiency is typically considered desirable: the cited evidence alone reproduces the prediction, indicating the explanation captures what the model needs. *In our cross-lingual setting, sufficiency must be interpreted jointly with comprehensiveness.* When sufficiency rises while comprehensiveness falls, the cited evidence is self-contained as a standalone justification but is no longer causally necessary: the model could have predicted the same label from other input cues. This combination is the diagnostic signature of post-hoc rationalization that motivates our analysis. For consistency, all tables in this paper report sufficiency with the conventional $\uparrow$ (“higher is better in isolation”) arrow, but under $L_{\text{native}} \to \text{EN}$, rising sufficiency paired with falling comprehensiveness indicates a shift toward summary-like (rather than causally grounded) evidence selection.

Task accuracy: To disentangle the explanation-language effect from base language competence, we report label accuracy for every experimental condition:

$$\text{Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left[f(x_i) = y_i\right] \tag{4}$$

where $y_i$ is the gold label. If accuracy is comparable across $L_{\text{native}} \to L_{\text{native}}$ and $L_{\text{native}} \to \text{EN}$, observed faithfulness differences can be attributed to the explanation-language pivot rather than degraded input comprehension.
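The four metrics in Equations (1)–(4) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' evaluation code: the `"[MASK]"` sentinel string, the helper names, and `toy_classifier` (a stand-in for the LLM's label prediction $f$) are all assumptions made for the example.

```python
from typing import Callable, List, Set

Tokens = List[str]
Classifier = Callable[[Tokens], str]

MASK = "[MASK]"  # hypothetical sentinel token; the paper does not name one


def span_agreement(model_idx: Set[int], human_idx: Set[int]) -> float:
    """Token-level F1 between two sets of input-token indices (Eq. 1)."""
    denom = len(model_idx) + len(human_idx)
    if denom == 0:
        return 0.0  # assumed convention when both sets are empty
    return 2 * len(model_idx & human_idx) / denom


def mask_evidence(tokens: Tokens, evidence: Set[int]) -> Tokens:
    """MASK(x, E_m): replace evidence tokens, preserving sequence length."""
    return [MASK if i in evidence else t for i, t in enumerate(tokens)]


def keep_evidence(tokens: Tokens, evidence: Set[int]) -> Tokens:
    """KEEP(x, E_m): mask every token except the evidence tokens."""
    return [t if i in evidence else MASK for i, t in enumerate(tokens)]


def comprehensiveness(f: Classifier, inputs, evidence) -> float:
    """Fraction of instances whose label flips when evidence is masked (Eq. 2)."""
    flips = sum(f(x) != f(mask_evidence(x, e)) for x, e in zip(inputs, evidence))
    return flips / len(inputs)


def sufficiency(f: Classifier, inputs, evidence) -> float:
    """Fraction of instances whose label survives on evidence alone (Eq. 3)."""
    same = sum(f(x) == f(keep_evidence(x, e)) for x, e in zip(inputs, evidence))
    return same / len(inputs)


def accuracy(f: Classifier, inputs, gold) -> float:
    """Fraction of instances whose prediction matches the gold label (Eq. 4)."""
    return sum(f(x) == y for x, y in zip(inputs, gold)) / len(inputs)


def toy_classifier(tokens: Tokens) -> str:
    # Stand-in for the LLM: predicts "neg" whenever the token "not" is present.
    return "neg" if "not" in tokens else "pos"


inputs = [["it", "is", "not", "good"], ["it", "is", "good"]]
evidence = [{2}, {2}]
comp = comprehensiveness(toy_classifier, inputs, evidence)
suff = sufficiency(toy_classifier, inputs, evidence)
```

In this toy example, masking "not" flips the first prediction but masking "good" leaves the second unchanged, so comprehensiveness is 0.5. Both predictions survive on the evidence alone, so sufficiency is 1.0. The second instance shows the pattern the section describes: evidence that is sufficient without being necessary.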

### 3.5 Model selection

We perform our experiments using two state-of-the-art multilingual LLMs, namely Qwen2.5-7B Yang et al. ([2025](https://arxiv.org/html/2605.19274#bib.bib18)) and Llama3.1-8B Grattafiori et al. ([2024](https://arxiv.org/html/2605.19274#bib.bib19)). These models are open-weight with strong multilingual capability, but they differ in model family and training data composition. This contrast allows us to test whether the plausibility–faithfulness trade-off holds across open-weight LLMs rather than arising from a single model.

### 3.6 Prompt design and generation order

Output format. For all conditions and tasks, we use a structured output format that requests three fields in sequence: (1) a predicted label, (2) evidence spans copied verbatim from the input, and (3) a free-text explanation in the required language (see templates in Appendix [B](https://arxiv.org/html/2605.19274#A2)). The evidence spans are always retained in the input language, even when the explanation is generated in English (Condition C), ensuring that comprehensiveness and sufficiency are computed over identical input-language tokens across conditions.

Autoregressive coupling. A natural concern is why the *explanation language* should affect evidence span selection, given that evidence appears before the explanation in the output format. We note that the model generates all three fields in a single autoregressive pass. The instruction specifying the explanation language (e.g., “Write a brief explanation in English” vs. “Write a brief explanation in Hindi”) is present in the prompt *before* any generation begins, and thus conditions the model’s entire output distribution, including its selection of evidence spans. In other words, the language instruction acts as a global prior over the generation, not a local constraint applied only to the explanation field. To verify this empirically, we conduct an ordering ablation: we create an alternative prompt variant in which the explanation is requested *before* the evidence (i.e., label $\rightarrow$ explanation $\rightarrow$ evidence). To keep this robustness check computationally bounded, we run the ablation for Qwen2.5-7B across all three datasets and all non-English languages. Table [10](https://arxiv.org/html/2605.19274#A5.T10) in Appendix [E](https://arxiv.org/html/2605.19274#A5) reports the results. The comprehensiveness and span agreement patterns remain stable under the reversed ordering, with only small absolute changes across task–language cells. This suggests that the explanation-language instruction, not the output field order, drives the observed differences.

Prompt sensitivity. To ensure that our findings are not artifacts of specific prompt phrasing, we create five paraphrased variants of each prompt template. Each variant preserves the task semantics and output format while varying surface wording (e.g., “Predict the correct label” $\to$ “Determine the appropriate category”; “Write a brief explanation” $\to$ “Provide a short justification”). All five variants are listed in Appendix [B](https://arxiv.org/html/2605.19274#A2). Per-condition results are reported in Section [D.2](https://arxiv.org/html/2605.19274#A4.SS2).
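To make the prompt structure concrete, the sketch below assembles a three-field prompt in which the explanation-language instruction and the field order are the only things that vary. The actual templates are in Appendix B and are not reproduced here; apart from the two phrases quoted in this section ("Predict the correct label", "Write a brief explanation in ..."), the wording, the `FIELDS` table, and `build_prompt` are hypothetical.

```python
# Hypothetical field instructions; only the label and explanation phrases
# echo wording quoted in the text. The evidence wording is an assumption.
FIELDS = {
    "label": "Predict the correct label.",
    "evidence": "Copy the evidence spans verbatim from the input.",
    "explanation": "Write a brief explanation in {lang}.",
}


def build_prompt(text: str, explanation_lang: str, reversed_order: bool = False) -> str:
    """Assemble a prompt; reversed_order swaps evidence and explanation."""
    if reversed_order:
        order = ["label", "explanation", "evidence"]  # ordering ablation
    else:
        order = ["label", "evidence", "explanation"]  # default ordering
    steps = [
        f"{i}. {FIELDS[name].format(lang=explanation_lang)}"
        for i, name in enumerate(order, start=1)
    ]
    return "Input: " + text + "\n" + "\n".join(steps)


default_prompt = build_prompt("mera UPI debit ho gaya", "English")
ablation_prompt = build_prompt("mera UPI debit ho gaya", "English", reversed_order=True)
```

In both orderings the language instruction is part of the prompt before any token is generated, which is the point of the autoregressive-coupling argument: the instruction can influence evidence selection regardless of where the explanation field sits.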

## 4 Results

Before examining explanation quality, we verify that the model’s task performance is comparable across linguistic conditions. Tables [1](https://arxiv.org/html/2605.19274#S4.T1)–[3](https://arxiv.org/html/2605.19274#S4.T3) include task accuracy for every setting. Across all three datasets, accuracy differences between $L_{\text{native}} \to L_{\text{native}}$ and $L_{\text{native}} \to \text{EN}$ are consistently small: the mean absolute accuracy gap is 0.005 on e-SNLI, 0.006 on FEVER, and 0.004 on HateXplain, with no gap exceeding 0.01 in any individual (language, model) cell. This indicates that the models comprehend non-English inputs comparably well regardless of the explanation language, allowing us to attribute observed faithfulness differences to the explanation-language pivot rather than degraded input understanding. We note that accuracy under $L_{\text{native}} \to L_{\text{native}}$ is itself slightly lower than EN $\to$ EN (e.g., 0.811 vs. 0.847 for Hindi on e-SNLI with Qwen2.5-7B), reflecting the expected base multilingual performance gap; critically, however, this gap does not widen further under $L_{\text{native}} \to \text{EN}$, confirming that requiring English explanations does not additionally impair task comprehension.

Next, we compare the three settings, namely condition A ($\text{EN} \to \text{EN}$), condition B ($L_{\text{native}} \to L_{\text{native}}$), and condition C ($L_{\text{native}} \to \text{EN}$), along the plausibility-faithfulness axis. On e-SNLI, Table [1](https://arxiv.org/html/2605.19274#S4.T1) shows that Qwen2.5-7B exhibits the clearest plausibility–faithfulness trade-off. When generating explanations in English instead of the input language, span agreement often increases (e.g., Bengali: 0.375 $\to$ 0.507 for Qwen2.5-7B), while comprehensiveness drops substantially (Hindi: 0.719 $\to$ 0.461 for Qwen2.5-7B) and sufficiency *rises* (Hindi: 0.253 $\to$ 0.387). The combined pattern is diagnostic: the cited evidence becomes more *self-contained as a standalone summary* (higher sufficiency) yet less *causally necessary* for the prediction (lower comprehensiveness). In other words, English-pivot evidence reads like a clean justification but is no longer the span the model actually relies on. Crucially, task accuracy remains stable across conditions ($\Delta < 0.01$), ruling out degraded input understanding as an explanation.

Table 1: The plausibility–faithfulness trade-off on e-SNLI. All metrics use the higher-is-better convention ($\uparrow$). Blue: higher comprehensiveness (evidence is causally necessary). Green: higher sufficiency (evidence is self-contained as a standalone summary). Yellow: higher span agreement (human-aligned). The trade-off is visible in rows where, under $L_{\text{native}} \to \text{EN}$, yellow and green co-occur with a *loss* of blue, i.e., the cited spans look like clean summaries that humans agree with, but they are no longer the spans the model causally relies on.

On FEVER (Table [2](https://arxiv.org/html/2605.19274#S4.T2)), the $L \to \text{EN}$ condition shows the same pattern: span agreement often increases, comprehensiveness drops, and sufficiency rises. This pattern is pronounced for Hindi and Arabic in both models. For Chinese, both span agreement and comprehensiveness decrease under $L \to \text{EN}$, suggesting that the pivot degrades explanation quality along multiple dimensions for this language–task pair. Across all languages, the convergent signal (comprehensiveness falling while sufficiency rises) strengthens the interpretation that English pivots produce evidence spans that look like good summaries but are not the spans causally driving the model’s fact-verification reasoning.

Table 2: Results on FEVER (fact verification). Notation and color coding follow Table [1](https://arxiv.org/html/2605.19274#S4.T1). All metrics use the higher-is-better convention ($\uparrow$).

Table [3](https://arxiv.org/html/2605.19274#S4.T3) reveals a distinct failure mode on HateXplain. Under $L \to \text{EN}$, comprehensiveness decreases consistently across all four languages for both models (e.g., Chinese: Qwen2.5-7B 0.714 $\to$ 0.520, Llama3.1-8B 0.884 $\to$ 0.676), while sufficiency increases as before. However, unlike e-SNLI and FEVER, span agreement does *not* reliably increase; it often decreases (e.g., Bengali, Qwen2.5-7B: 0.359 $\to$ 0.280). This indicates that for socially nuanced classification, English pivots fail to produce either faithful or human-aligned explanations, likely because pragmatic cues such as slang, code-switching, and culturally grounded expressions do not survive the implicit translation step.

Table 3: Results on HateXplain. Notation and color coding follow Table [1](https://arxiv.org/html/2605.19274#S4.T1). Unlike e-SNLI and FEVER, span agreement does *not* reliably increase under $L_{\text{native}} \to \text{EN}$ (yellow cells appear in the $L \to L$ rows for several languages), revealing a distinct failure mode where English pivots lose pragmatic and culturally grounded cues.

Statistical validation. We test the significance of $L \to L$ vs. $L \to \text{EN}$ differences using paired permutation tests (10,000 permutations). Comprehensiveness differences are significant ($p < 0.05$) in 19/24 language-task-model cells; the non-significant cases are concentrated in Bengali, the lowest-resource language in our set.

Semantic similarity check. To verify that our span agreement findings are not artifacts of surface-level tokenization differences across scripts, we compute BERTScore F1 using bert-base-multilingual-cased (Table [4](https://arxiv.org/html/2605.19274#S4.T4)). The two metrics agree directionally in 42/48 cells ($r = 0.88$), confirming that the patterns in Tables [1](https://arxiv.org/html/2605.19274#S4.T1)–[3](https://arxiv.org/html/2605.19274#S4.T3) are genuine. Full per-task results are in Appendix [F](https://arxiv.org/html/2605.19274#A6).
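The paired permutation test used for statistical validation can be sketched as follows. The paper specifies only "paired permutation tests (10,000 permutations)"; the sign-flip scheme, the mean-difference test statistic, the two-sided form, and the add-one p-value correction used here are common choices assumed for illustration, and the function name is hypothetical.

```python
import random


def paired_permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sided sign-flip permutation test on the mean paired difference.

    a, b: per-instance scores under the two conditions, in the same order
    (for example, 0/1 label-flip indicators under L->L and L->EN).
    """
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and of equal length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        # Under the null, each paired difference is equally likely to have
        # either sign, so flip signs at random and recompute the statistic.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total / n) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


p_same = paired_permutation_test([1, 0, 1, 1], [1, 0, 1, 1])
p_diff = paired_permutation_test([1] * 20, [0] * 20)
```

Pairing matters here because both conditions are evaluated on the same translated inputs, so per-instance differences remove variance due to instance difficulty.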

Table 4: Correspondence between span agreement (lexical) and BERTScore F1 (semantic). Directional agreement counts cells where both metrics shift the same way under the English pivot.

Table 5: Shift in metrics when switching from $L_{\text{native}} \to L_{\text{native}}$ to $L_{\text{native}} \to \text{EN}$ ($\Delta = \text{pivot} - \text{native}$). All metrics follow the higher-is-better convention. Red: degradation on that axis. Blue: improvement on that axis. *Headline pattern:* $\Delta$Comp. is negative in 24/24 cells (evidence becomes less causally necessary), while $\Delta$Suff. is positive in 24/24 cells (evidence becomes more self-contained). This dual signature (evidence that is more summary-like but less causal) defines the “deceptive zone” of Figure [1](https://arxiv.org/html/2605.19274#S1.F1). $\Delta$Span agr. is mixed, reflecting the task-dependent nature of the surface-level agreement shift.

Human plausibility validation (subsample). To verify that span agreement tracks human-perceived plausibility, we recruit three bilingual annotators per language via Prolific to rate 100 randomly sampled $(\text{instance}, \text{explanation})$ pairs per condition on a 5-point Likert scale: “How convincing is this explanation as a justification of the predicted label?” Annotators see only the input, predicted label, and explanation, not the gold label or evidence spans. We compute the Spearman correlation between mean human ratings and span agreement scores, finding $\rho = 0.67$ across all conditions ($p < 0.001$), supporting the use of span agreement as a plausibility proxy.

Prompt-paraphrase robustness. To rule out the possibility that the trade-off is an artifact of specific prompt phrasing, we re-ran our experiments using five paraphrased prompt variants per condition (variants in Appendix [B](https://arxiv.org/html/2605.19274#A2)). Table [8](https://arxiv.org/html/2605.19274#A4.T8) reports results for e-SNLI on Qwen2.5-7B; the corresponding table for Llama3.1-8B is Table [9](https://arxiv.org/html/2605.19274#A4.T9) in Appendix [D.2](https://arxiv.org/html/2605.19274#A4.SS2).

The trade-off pattern (lower comprehensiveness and higher sufficiency under $L_{\text{native}} \to \text{EN}$ compared to $L_{\text{native}} \to L_{\text{native}}$) holds consistently across all five prompt formulations. The between-condition gaps are roughly an order of magnitude larger than the within-condition prompt variance. For Hindi e-SNLI on Qwen2.5-7B, the $L \to L$ vs. $L \to \text{EN}$ comprehensiveness gap is $0.719 - 0.461 = 0.258$, while the standard deviation across the five prompt paraphrases within each condition is approximately $0.03$, a gap-to-noise ratio of $\sim 8\times$. This effect-size to noise-floor ratio confirms that the observed shift is far larger than what could plausibly be explained by prompt-phrasing variance, supporting the claim that the explanation-language instruction, not surface prompt wording, drives the observed effects.

Output-ordering robustness. A natural concern is whether requiring evidence *before* explanation in the output format prevents the explanation language from influencing evidence selection. We test this by reversing the field order (label $\to$ explanation $\to$ evidence) and re-running on all three datasets. Results are reported in Tables [10](https://arxiv.org/html/2605.19274#A5.T10)–[11](https://arxiv.org/html/2605.19274#A5.T11) (Appendix [E](https://arxiv.org/html/2605.19274#A5)). Across all 12 (dataset $\times$ model $\times$ language) cells we tested, no pairwise difference between orderings is statistically significant (paired permutation test, $10{,}000$ permutations, $p > 0.3$ for all cells). Mean Jaccard similarity between evidence span sets across orderings is $0.78 \pm 0.06$. This is consistent with the explanation-language instruction acting as a global prior over the autoregressive output distribution rather than a local constraint applied only to the explanation field.
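The two robustness quantities above reduce to short computations. The sketch below reproduces the gap-to-noise arithmetic from the reported Hindi e-SNLI numbers and shows one way to compute Jaccard similarity between evidence selections. Representing an evidence selection as a set of input-token positions is an assumption of this sketch, and the function names are hypothetical.

```python
def gap_to_noise(native_mean, pivot_mean, within_condition_std):
    """Between-condition gap divided by the within-condition prompt noise."""
    return (native_mean - pivot_mean) / within_condition_std


def jaccard(spans_a, spans_b):
    """Jaccard similarity between two evidence selections.

    Each argument is an iterable of input-token positions (an assumption here).
    Two empty selections are treated as identical.
    """
    a, b = set(spans_a), set(spans_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Reported values: comprehensiveness 0.719 (native) vs. 0.461 (English pivot),
# with a standard deviation of about 0.03 across prompt paraphrases.
ratio = gap_to_noise(0.719, 0.461, 0.03)  # approximately 8.6

# Toy evidence selections under the two field orderings.
similarity = jaccard({1, 2, 3}, {2, 3, 4})  # 2 shared positions out of 4
```

Because the reported standard deviation is itself approximate, the computed ratio of about 8.6 is in line with the rounded figure quoted in the text.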

## 5 Error analysis

To move beyond aggregate metrics, we examine representative instances from each quadrant of the span agreement×\\timescomprehensiveness space\. We identify four patterns that illustrate the mechanisms underlying our aggregate findings\.

Case 1: High span agreement, low comprehensiveness (“Deceptive”)

- Task: e-SNLI; Lang: Hindi; Model: Qwen2.5-7B; Gold: Entailment
- Premise: “barish ho rahi thi, isliye usne chhata liya aur ghar se nikli.” (It was raining, so she took an umbrella and left the house.)
- Hypothesis: “usne barish ki wajah se chhata istemal kiya.” (She used an umbrella because of the rain.)
- Condition B ($L_{\text{HI}} \rightarrow L_{\text{HI}}$)
  - Evidence: “barish ho rahi thi”; “chhata liya”
  - Explanation: “Premise mein barish aur chhata lene ka zikr hai, isliye hypothesis samarthit hai.”
  - Perturbation: Masking both spans → prediction flips to neutral.
  - Comp: 1.0; SpanAgr: 0.67
- Condition C ($L_{\text{HI}} \rightarrow \text{EN}$)
  - Evidence: “barish ho rahi thi” only (“chhata liya” dropped)
  - Explanation: “The premise states that it was raining, which directly supports the hypothesis that she used an umbrella due to rain.”
  - Perturbation: Masking span → prediction remains entailment.
  - Comp: 0.0; SpanAgr: 0.50
- Takeaway: The English pivot narrows evidence to one span matching the human rationale but *not causally necessary*: the prediction holds without it. The uncited “chhata liya” likely drives the decision, while the English explanation constructs a fluent narrative around only the cited span.

Case 2: Low span agreement, high comprehensiveness (“Honest but misaligned”)

- Task: FEVER; Lang: Arabic; Model: Qwen2.5-7B; Gold: Supports
- Claim: “al-qahira hiya asimatu misr.” (Cairo is the capital of Egypt.)
- Context: “misr dawla fi shamal ifriqya. asimatuha al-qahira wa hiya akbar muduniha min haythu al-sukkan.” (Egypt is a country in North Africa. Its capital is Cairo and it is its largest city by population.)
- Condition B ($L_{\text{AR}} \rightarrow L_{\text{AR}}$)
  - Evidence: “asimatuha al-qahira”
  - Explanation: “al-siyaq yadhkur sarihatan anna asimata misr hiya al-qahira, mimma yad’am al-iddi’a’.” (The context explicitly states that the capital of Egypt is Cairo, which supports the claim.)
  - Perturbation: Masking span → prediction flips to not enough info.
  - Comp: 1.0; SpanAgr: 0.33
- Human rationale: “misr dawla fi shamal ifriqya. asimatuha al-qahira” (longer span including geographic context).
- Takeaway: The model cites a tighter, genuinely *causal* span than the human annotator. Low span agreement here does *not* indicate poor explanation; it reflects a more precise causal trigger than the longer human rationale.

Case 3: Both degrade (“Social signal loss”)

- Task: HateXplain; Lang: Chinese; Model: Llama3.1-8B; Gold: Hate speech
- Input: “na xie [slur-ZH] dou shi laji, yinggai gun chu women de chengshi.” (Those [slur] are all trash, they should get out of our city.)
- Condition B ($L_{\text{ZH}} \rightarrow L_{\text{ZH}}$)
  - Evidence: “[slur-ZH] dou shi laji”
  - Explanation: (in Chinese) “The text contains a derogatory term targeting [group] and calls them trash, which constitutes hate speech.”
  - Perturbation: Masking span → prediction flips to offensive (downgraded from hate speech).
  - Comp: 1.0; SpanAgr: 0.80
- Condition C ($L_{\text{ZH}} \rightarrow \text{EN}$)
  - Evidence: “dou shi laji, yinggai gun chu women de chengshi” (broader span; *slur omitted*)
  - Explanation: “The text expresses hostility toward a group by calling them trash and demanding their removal from the city, which constitutes hate speech.”
  - Perturbation: Masking span → prediction remains hate speech (no flip).
  - Comp: 0.0; SpanAgr: 0.40
- Takeaway: English pivot loses the culturally specific slur (the actual causal cue) and substitutes generic hostility language. Both faithfulness and human-alignment degrade. Signature failure mode of English pivots on socially nuanced tasks.

Case 4: Both improve (“Rare ideal”)

- Task: FEVER; Lang: Bengali; Model: Qwen2.5-7B; Gold: Supports
- Claim: “rabindranath thakur 1913 sale nobel puroshkar peyechhilen.” (Rabindranath Tagore received the Nobel Prize in 1913.)
- Context: “rabindranath thakur chhilen ekjon bangali kabi. tini 1913 sale sahitye nobel puroshkar peyechhilen, ja kono bharotiyo prothom bar peye thaken.” (Rabindranath Tagore was a Bengali poet. He received the Nobel Prize in Literature in 1913, which was the first time for any Indian.)
- Condition B ($L_{\text{BN}} \rightarrow L_{\text{BN}}$)
  - Evidence: “1913 sale nobel puroshkar peyechhilen”
  - Explanation: (in Bengali) “The context confirms that Tagore received the Nobel Prize in 1913, matching the claim.”
  - Perturbation: Masking span → prediction flips to not enough info.
  - Comp: 1.0; SpanAgr: 0.71
- Condition C ($L_{\text{BN}} \rightarrow \text{EN}$)
  - Evidence: “1913 sale sahitye nobel puroshkar peyechhilen” (slightly longer; adds “sahitye” = “literature”)
  - Explanation: “The context explicitly states that Tagore received the Nobel Prize in Literature in 1913, which directly supports the claim.”
  - Perturbation: Masking span → prediction flips to not enough info.
  - Comp: 1.0; SpanAgr: 0.83
- Takeaway: Decisive evidence is language-independent (proper noun + date) and survives the pivot intact. Rare (<8% of instances), concentrated in factual claims with unambiguous named entities.

These four patterns clarify that the aggregate trade\-off arises primarily from Cases 1 and 3: the English pivot tends to select evidence that is rhetorically coherent but causally peripheral \(Case 1\), and for socially nuanced tasks, it additionally loses culturally grounded signals \(Case 3\)\. The existence of Case 4 suggests that the severity of the trade\-off depends on the linguistic specificity of the decisive cues in each instance\.

Table 6: Approximate distribution of error patterns across instances (Qwen2.5-7B, averaged over languages). Case 1 dominates in e-SNLI (reasoning task), while Case 3 dominates in HateXplain (social nuance task), consistent with the aggregate findings in Tables [1](https://arxiv.org/html/2605.19274#S4.T1)–[3](https://arxiv.org/html/2605.19274#S4.T3).

The distribution in Table [6](https://arxiv.org/html/2605.19274#S5.T6) confirms two key findings from our aggregate analysis. First, the “deceptive” pattern (Case 1: span agreement up, comprehensiveness down) is the dominant mode in reasoning-intensive tasks like e-SNLI, accounting for the aggregate plausibility–faithfulness trade-off. Second, the “social signal loss” pattern (Case 3: both degrade) dominates in HateXplain, explaining why English pivots fail to improve even surface-level agreement on socially nuanced tasks. The rarity of Case 4 (<8% across tasks) underscores that successful English-pivot explanations are the exception rather than the rule.

## 6 Conclusion

Our results show that explanation language is not a neutral reporting choice. Across three tasks, five languages, and two model families, switching from native-language to English explanations consistently reduces comprehensiveness and increases sufficiency of the cited evidence spans, even when task accuracy remains stable. This trade-off is most pronounced for reasoning-intensive tasks (NLI, fact verification), where English pivots often produce fluent narratives loosely anchored to the model’s actual decision cues. For socially nuanced classification, English pivots degrade both faithfulness and span agreement, reflecting the loss of culturally grounded signals. We recommend three practices: (1) audit explanation faithfulness in the input language, (2) report both comprehensiveness and sufficiency alongside any overlap-based metric, and (3) treat English rationales as communication summaries rather than faithful decision traces (Appendix [C](https://arxiv.org/html/2605.19274#A3)).

## References

- Amazon translate: machine translation service\.External Links:[Link](https://aws.amazon.com/translate/)Cited by:[§1](https://arxiv.org/html/2605.19274#S1.p1.1)\.
- S\. Banerjee, P\. Chatterjee, S\. Kumar, S\. Layek, P\. Agrawal, R\. Hazra, and A\. Mukherjee \(2025a\)Attributional safety failures in large language models under code\-mixed perturbations\.External Links:2505\.14469,[Link](https://arxiv.org/abs/2505.14469)Cited by:[§2](https://arxiv.org/html/2605.19274#S2.p1.1)\.
- S\. Banerjee, S\. Layek, P\. Chatterjee, A\. Mukherjee, and R\. Hazra \(2025b\)Soteria: language\-specific functional parameter steering for multilingual safety alignment\.InFindings of the Association for Computational Linguistics: EMNLP 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 9347–9364\.External Links:[Link](https://aclanthology.org/2025.findings-emnlp.497/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.497),ISBN 979\-8\-89176\-335\-7Cited by:[§2](https://arxiv.org/html/2605.19274#S2.p1.1)\.
- O\. Camburu, T\. Rocktäschel,et al\.\(2018\)E\-snli: natural language inference with natural language explanations\.InNeurIPS,External Links:[Link](https://arxiv.org/abs/1812.01193)Cited by:[§1](https://arxiv.org/html/2605.19274#S1.p5.2),[§3\.2](https://arxiv.org/html/2605.19274#S3.SS2.p1.1)\.
- A\. Conneau, K\. Khandelwal, N\. Goyal, V\. Chaudhary, G\. Wenzek, F\. Guzmán, E\. Grave, M\. Ott, L\. Zettlemoyer, and V\. Stoyanov \(2020\)Unsupervised cross\-lingual representation learning at scale\.External Links:1911\.02116,[Link](https://arxiv.org/abs/1911.02116)Cited by:[§F\.5](https://arxiv.org/html/2605.19274#A6.SS5.p3.1),[§1](https://arxiv.org/html/2605.19274#S1.p5.2),[§2](https://arxiv.org/html/2605.19274#S2.p1.1)\.
- M\. R\. Costa\-jussàet al\.\(2022\)No language left behind: scaling human\-centered machine translation\.arXiv preprint arXiv:2207\.04672\.Cited by:[§3\.3](https://arxiv.org/html/2605.19274#S3.SS3.p1.5)\.
- J\. DeYoung, S\. Jain, N\. F\. Rajani, E\. Lehman, C\. Teh,et al\.\(2020\)ERASER: a benchmark to evaluate rationalized nlp models\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,External Links:[Link](https://arxiv.org/abs/1911.03429)Cited by:[§1](https://arxiv.org/html/2605.19274#S1.p5.2),[§2](https://arxiv.org/html/2605.19274#S2.p1.1),[§3\.4](https://arxiv.org/html/2605.19274#S3.SS4.p1.7),[§3\.4](https://arxiv.org/html/2605.19274#S3.SS4.p1.9)\.
- J\. Eiden \(2024\)Twilio\.Note:Twilio BlogExternal Links:[Link](https://www.twilio.com/en-us/blog/live-translation-contact-center-openai-realtime-api)Cited by:[§1](https://arxiv.org/html/2605.19274#S1.p1.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, A\. Yang, A\. Fan, A\. Goyal, A\. Hartshorn, A\. Yang, A\. Mitra, A\. Sravankumar, A\. Korenev, A\. Hinsvark, A\. Rao, A\. Zhang, A\. Rodriguez, A\. Gregerson, A\. Spataru, B\. Roziere, B\. Biron, B\. Tang, B\. Chern, C\. Caucheteux, C\. Nayak, C\. Bi, C\. Marra, C\. McConnell, C\. Keller, C\. Touret, C\. Wu, C\. Wong, C\. C\. Ferrer, C\. Nikolaidis, D\. Allonsius, D\. Song, D\. Pintz, D\. Livshits, D\. Wyatt, D\. Esiobu, D\. Choudhary, D\. Mahajan, D\. Garcia\-Olano, D\. Perino, D\. Hupkes, E\. Lakomkin, E\. AlBadawy, E\. Lobanova, E\. Dinan, E\. M\. Smith, F\. Radenovic, F\. Guzmán, F\. Zhang, G\. Synnaeve, G\. Lee, G\. L\. Anderson, G\. Thattai, G\. Nail, G\. Mialon, G\. Pang, G\. Cucurell, H\. Nguyen, H\. Korevaar, H\. Xu, H\. Touvron, I\. Zarov, I\. A\. Ibarra, I\. Kloumann, I\. Misra, I\. Evtimov, J\. Zhang, J\. Copet, J\. Lee, J\. Geffert, J\. Vranes, J\. Park, J\. Mahadeokar, J\. Shah, J\. van der Linde, J\. Billock, J\. Hong, J\. Lee, J\. Fu, J\. Chi, J\. Huang, J\. Liu, J\. Wang, J\. Yu, J\. Bitton, J\. Spisak, J\. Park, J\. Rocca, J\. Johnstun, J\. Saxe, J\. Jia, K\. V\. Alwala, K\. Prasad, K\. Upasani, K\. Plawiak, K\. Li, K\. Heafield, K\. Stone, K\. El\-Arini, K\. Iyer, K\. Malik, K\. Chiu, K\. Bhalla, K\. Lakhotia, L\. Rantala\-Yeary, L\. van der Maaten, L\. Chen, L\. Tan, L\. Jenkins, L\. Martin, L\. Madaan, L\. Malo, L\. Blecher, L\. Landzaat, L\. de Oliveira, M\. Muzzi, M\. Pasupuleti, M\. Singh, M\. Paluri, M\. Kardas, M\. Tsimpoukelli, M\. Oldham, M\. Rita, M\. Pavlova, M\. Kambadur, M\. Lewis, M\. Si, M\. K\. Singh, M\. Hassan, N\. Goyal, N\. Torabi, N\. Bashlykov, N\. Bogoychev, N\. Chatterji, N\. Zhang, O\. Duchenne, O\. Çelebi, P\. Alrassy, P\. Zhang, P\. Li, P\. Vasic, P\. Weng, P\. Bhargava, P\. Dubal, P\. Krishnan, P\. S\. Koura, P\. Xu, Q\. He, Q\. Dong, R\. Srinivasan, R\. Ganapathy, R\. Calderer, R\. S\. 
Cabral, R\. Stojnic, R\. Raileanu, R\. Maheswari, R\. Girdhar, R\. Patel, R\. Sauvestre, R\. Polidoro, R\. Sumbaly, R\. Taylor, R\. Silva, R\. Hou, R\. Wang, S\. Hosseini, S\. Chennabasappa, S\. Singh, S\. Bell, S\. S\. Kim, S\. Edunov, S\. Nie, S\. Narang, S\. Raparthy, S\. Shen, S\. Wan, S\. Bhosale, S\. Zhang, S\. Vandenhende, S\. Batra, S\. Whitman, S\. Sootla, S\. Collot, S\. Gururangan, S\. Borodinsky, T\. Herman, T\. Fowler, T\. Sheasha, T\. Georgiou, T\. Scialom, T\. Speckbacher, T\. Mihaylov, T\. Xiao, U\. Karn, V\. Goswami, V\. Gupta, V\. Ramanathan, V\. Kerkez, V\. Gonguet, V\. Do, V\. Vogeti, V\. Albiero, V\. Petrovic, W\. Chu, W\. Xiong, W\. Fu, W\. Meers, X\. Martinet, X\. Wang, X\. Wang, X\. E\. Tan, X\. Xia, X\. Xie, X\. Jia, X\. Wang, Y\. Goldschlag, Y\. Gaur, Y\. Babaei, Y\. Wen, Y\. Song, Y\. Zhang, Y\. Li, Y\. Mao, Z\. D\. Coudert, Z\. Yan, Z\. Chen, Z\. Papakipos, A\. Singh, A\. Srivastava, A\. Jain, A\. Kelsey, A\. Shajnfeld, A\. Gangidi, A\. Victoria, A\. Goldstand, A\. Menon, A\. Sharma, A\. Boesenberg, A\. Baevski, A\. Feinstein, A\. Kallet, A\. Sangani, A\. Teo, A\. Yunus, A\. Lupu, A\. Alvarado, A\. Caples, A\. Gu, A\. Ho, A\. Poulton, A\. Ryan, A\. Ramchandani, A\. Dong, A\. Franco, A\. Goyal, A\. Saraf, A\. Chowdhury, A\. Gabriel, A\. Bharambe, A\. Eisenman, A\. Yazdan, B\. James, B\. Maurer, B\. Leonhardi, B\. Huang, B\. Loyd, B\. D\. Paola, B\. Paranjape, B\. Liu, B\. Wu, B\. Ni, B\. Hancock, B\. Wasti, B\. Spence, B\. Stojkovic, B\. Gamido, B\. Montalvo, C\. Parker, C\. Burton, C\. Mejia, C\. Liu, C\. Wang, C\. Kim, C\. Zhou, C\. Hu, C\. Chu, C\. Cai, C\. Tindal, C\. Feichtenhofer, C\. Gao, D\. Civin, D\. Beaty, D\. Kreymer, D\. Li, D\. Adkins, D\. Xu, D\. Testuggine, D\. David, D\. Parikh, D\. Liskovich, D\. Foss, D\. Wang, D\. Le, D\. Holland, E\. Dowling, E\. Jamil, E\. Montgomery, E\. Presani, E\. Hahn, E\. Wood, E\. Le, E\. Brinkman, E\. Arcaute, E\. Dunbar, E\. Smothers, F\. Sun, F\. Kreuk, F\. Tian, F\. Kokkinos, F\. 
Ozgenel, F\. Caggioni, F\. Kanayet, F\. Seide, G\. M\. Florez, G\. Schwarz, G\. Badeer, G\. Swee, G\. Halpern, G\. Herman, G\. Sizov, Guangyi, Zhang, G\. Lakshminarayanan, H\. Inan, H\. Shojanazeri, H\. Zou, H\. Wang, H\. Zha, H\. Habeeb, H\. Rudolph, H\. Suk, H\. Aspegren, H\. Goldman, H\. Zhan, I\. Damlaj, I\. Molybog, I\. Tufanov, I\. Leontiadis, I\. Veliche, I\. Gat, J\. Weissman, J\. Geboski, J\. Kohli, J\. Lam, J\. Asher, J\. Gaya, J\. Marcus, J\. Tang, J\. Chan, J\. Zhen, J\. Reizenstein, J\. Teboul, J\. Zhong, J\. Jin, J\. Yang, J\. Cummings, J\. Carvill, J\. Shepard, J\. McPhie, J\. Torres, J\. Ginsburg, J\. Wang, K\. Wu, K\. H\. U, K\. Saxena, K\. Khandelwal, K\. Zand, K\. Matosich, K\. Veeraraghavan, K\. Michelena, K\. Li, K\. Jagadeesh, K\. Huang, K\. Chawla, K\. Huang, L\. Chen, L\. Garg, L\. A, L\. Silva, L\. Bell, L\. Zhang, L\. Guo, L\. Yu, L\. Moshkovich, L\. Wehrstedt, M\. Khabsa, M\. Avalani, M\. Bhatt, M\. Mankus, M\. Hasson, M\. Lennie, M\. Reso, M\. Groshev, M\. Naumov, M\. Lathi, M\. Keneally, M\. Liu, M\. L\. Seltzer, M\. Valko, M\. Restrepo, M\. Patel, M\. Vyatskov, M\. Samvelyan, M\. Clark, M\. Macey, M\. Wang, M\. J\. Hermoso, M\. Metanat, M\. Rastegari, M\. Bansal, N\. Santhanam, N\. Parks, N\. White, N\. Bawa, N\. Singhal, N\. Egebo, N\. Usunier, N\. Mehta, N\. P\. Laptev, N\. Dong, N\. Cheng, O\. Chernoguz, O\. Hart, O\. Salpekar, O\. Kalinli, P\. Kent, P\. Parekh, P\. Saab, P\. Balaji, P\. Rittner, P\. Bontrager, P\. Roux, P\. Dollar, P\. Zvyagina, P\. Ratanchandani, P\. Yuvraj, Q\. Liang, R\. Alao, R\. Rodriguez, R\. Ayub, R\. Murthy, R\. Nayani, R\. Mitra, R\. Parthasarathy, R\. Li, R\. Hogan, R\. Battey, R\. Wang, R\. Howes, R\. Rinott, S\. Mehta, S\. Siby, S\. J\. Bondu, S\. Datta, S\. Chugh, S\. Hunt, S\. Dhillon, S\. Sidorov, S\. Pan, S\. Mahajan, S\. Verma, S\. Yamamoto, S\. Ramaswamy, S\. Lindsay, S\. Lindsay, S\. Feng, S\. Lin, S\. C\. Zha, S\. Patil, S\. Shankar, S\. Zhang, S\. Zhang, S\. Wang, S\. Agarwal, S\. 
Sajuyigbe, S\. Chintala, S\. Max, S\. Chen, S\. Kehoe, S\. Satterfield, S\. Govindaprasad, S\. Gupta, S\. Deng, S\. Cho, S\. Virk, S\. Subramanian, S\. Choudhury, S\. Goldman, T\. Remez, T\. Glaser, T\. Best, T\. Koehler, T\. Robinson, T\. Li, T\. Zhang, T\. Matthews, T\. Chou, T\. Shaked, V\. Vontimitta, V\. Ajayi, V\. Montanez, V\. Mohan, V\. S\. Kumar, V\. Mangla, V\. Ionescu, V\. Poenaru, V\. T\. Mihailescu, V\. Ivanov, W\. Li, W\. Wang, W\. Jiang, W\. Bouaziz, W\. Constable, X\. Tang, X\. Wu, X\. Wang, X\. Wu, X\. Gao, Y\. Kleinman, Y\. Chen, Y\. Hu, Y\. Jia, Y\. Qi, Y\. Li, Y\. Zhang, Y\. Zhang, Y\. Adi, Y\. Nam, Yu, Wang, Y\. Zhao, Y\. Hao, Y\. Qian, Y\. Li, Y\. He, Z\. Rait, Z\. DeVito, Z\. Rosnbrick, Z\. Wen, Z\. Yang, Z\. Zhao, and Z\. Ma \(2024\)The llama 3 herd of models\.External Links:2407\.21783,[Link](https://arxiv.org/abs/2407.21783)Cited by:[§3\.5](https://arxiv.org/html/2605.19274#S3.SS5.p1.1)\.
- H\. Huang, T\. Tang, D\. Zhang, X\. Zhao, T\. Song, Y\. Xia, and F\. Wei \(2023\)Not all languages are created equal in LLMs: improving multilingual capability by cross\-lingual\-thought prompting\.InFindings of the Association for Computational Linguistics: EMNLP 2023,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 12365–12394\.External Links:[Link](https://aclanthology.org/2023.findings-emnlp.826/),[Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.826)Cited by:[§1](https://arxiv.org/html/2605.19274#S1.p5.2),[§2](https://arxiv.org/html/2605.19274#S2.p1.1)\.
- A\. Jacovi and Y\. Goldberg \(2020\)Towards faithfully interpretable nlp systems: how should we define and evaluate faithfulness?\.Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics\.External Links:[Link](https://arxiv.org/abs/2004.14502)Cited by:[§1](https://arxiv.org/html/2605.19274#S1.p4.1),[§2](https://arxiv.org/html/2605.19274#S2.p1.1),[§3\.4](https://arxiv.org/html/2605.19274#S3.SS4.p1.7)\.
- R\. Jadhav, V\. Meshram, A\. Bhosle, K\. Patil, S\. Dash, and S\. Jadhav \(2025\)Explainable multilingual and multimodal fake\-news detection: toward robust and trustworthy ai for combating misinformation\.Frontiers in Artificial Intelligence8,pp\. 1690616\.External Links:[Document](https://dx.doi.org/10.3389/frai.2025.1690616),[Link](https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2025.1690616/full)Cited by:[§1](https://arxiv.org/html/2605.19274#S1.p1.1)\.
- T\. Lanham, A\. Chen, A\. Radhakrishnan, B\. Steiner, C\. Denison, D\. Hernandez, D\. Li, E\. Durmus, E\. Hubinger, J\. Kernion, K\. Lukošiūtė, K\. Nguyen, N\. Cheng, N\. Joseph, N\. Schiefer, O\. Rausch, R\. Larson, S\. McCandlish, S\. Kundu, S\. Kadavath, S\. Yang, T\. Henighan, T\. Maxwell, T\. Telleen\-Lawton, T\. Hume, Z\. Hatfield\-Dodds, J\. Kaplan, J\. Brauner, S\. R\. Bowman, and E\. Perez \(2023\)Measuring faithfulness in chain\-of\-thought reasoning\.External Links:2307\.13702,[Link](https://arxiv.org/abs/2307.13702)Cited by:[§2](https://arxiv.org/html/2605.19274#S2.p1.1)\.
- B\. Mathew, P\. Saha,et al\.\(2020\)HateXplain: a benchmark dataset for explainable hate speech detection\.InAAAI,External Links:[Link](https://arxiv.org/abs/2012.10289)Cited by:[§1](https://arxiv.org/html/2605.19274#S1.p5.2),[§3\.2](https://arxiv.org/html/2605.19274#S3.SS2.p1.1)\.
- NLLB Teamet al\.\(2024\)Scaling neural machine translation to 200 languages\.Nature630,pp\. 841–846\.Cited by:[§3\.3](https://arxiv.org/html/2605.19274#S3.SS3.p1.5)\.
- M\. Popović \(2015\)ChrF: character n\-gram F\-score for automatic MT evaluation\.InProceedings of the Tenth Workshop on Statistical Machine Translation,Lisbon, Portugal,pp\. 392–395\.External Links:[Link](https://aclanthology.org/)Cited by:[§3\.3](https://arxiv.org/html/2605.19274#S3.SS3.p1.5)\.
- J\. Thorne, A\. Vlachos,et al\.\(2018\)FEVER: a large\-scale dataset for fact extraction and verification\.InNAACL\-HLT,External Links:[Link](https://arxiv.org/abs/1803.05355)Cited by:[§1](https://arxiv.org/html/2605.19274#S1.p5.2),[§3\.2](https://arxiv.org/html/2605.19274#S3.SS2.p1.1)\.
- M\. Turpin, J\. Michael, E\. Perez, and S\. R\. Bowman \(2023\)Language models don’t always say what they think: unfaithful explanations in chain\-of\-thought prompting\.External Links:2305\.04388,[Link](https://arxiv.org/abs/2305.04388)Cited by:[§2](https://arxiv.org/html/2605.19274#S2.p1.1)\.
- \[19\]\(2025\-04\-25\)Use real\-time translation of conversations for service representatives and customers\(Website\)Microsoft Learn\.Note:States the feature is intended to help customer service managers or supervisors enhance team performanceExternal Links:[Link](https://learn.microsoft.com/en-us/dynamics365/customer-service/use/oc-real-time-translation)Cited by:[§1](https://arxiv.org/html/2605.19274#S1.p1.1)\.
- J\. Vamvas and R\. Sennrich \(2023\)Towards unsupervised recognition of token\-level semantic differences in related documents\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 13543–13552\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.835/),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.835)Cited by:[§1](https://arxiv.org/html/2605.19274#S1.p4.1),[§2](https://arxiv.org/html/2605.19274#S2.p1.1)\.
- S\. Wiegreffe, A\. Marasović, and N\. A\. Smith \(2021\)Measuring association between labels and free\-text rationales\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,M\. Moens, X\. Huang, L\. Specia, and S\. W\. Yih \(Eds\.\),Online and Punta Cana, Dominican Republic,pp\. 10266–10284\.External Links:[Link](https://aclanthology.org/2021.emnlp-main.804/),[Document](https://dx.doi.org/10.18653/v1/2021.emnlp-main.804)Cited by:[§1](https://arxiv.org/html/2605.19274#S1.p4.1),[§2](https://arxiv.org/html/2605.19274#S2.p1.1)\.
- A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Tang, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2025\)Qwen2\.5 technical report\.External Links:2412\.15115,[Link](https://arxiv.org/abs/2412.15115)Cited by:[§3\.5](https://arxiv.org/html/2605.19274#S3.SS5.p1.1)\.
- T\. Zhang, V\. Kishore, F\. Wu, K\. Q\. Weinberger, and Y\. Artzi \(2020\)BERTScore: evaluating text generation with bert\.InInternational Conference on Learning Representations,Cited by:[Appendix F](https://arxiv.org/html/2605.19274#A6.p1.1)\.


## Appendix B Prompt templates and qualitative example

### B.1 Prompt templates

We use a structured output format so that (i) the prediction is explicit and (ii) the evidence spans are *copied verbatim* from the input, making the token set $T(x)$ well-defined for comprehensiveness computation even when the narrative explanation is in English.

#### Universal output format \(all tasks, all conditions\)\.

> Label: <one label from the label set\>
>
> Evidence: <1–3 spans copied exactly from the input text\>
>
> Explanation: <1–3 sentences in the required explanation language\>

#### Condition A: EN $\rightarrow$ EN (English input, English explanation).

> You are given a task input in English.
>
> 1) Predict the correct label from: {<LABELS\>}.
> 2) Copy 1–3 short evidence spans *verbatim* from the input text.
> 3) Write a brief explanation in English.
>
> Important: Evidence must be exact substrings of the input (do not paraphrase).
>
> Input: <INPUT\>

#### Condition B: $L_{\text{native}} \rightarrow L_{\text{native}}$ (native input, native explanation).

> You are given a task input in <LANG\>.
>
> 1) Predict the correct label from: {<LABELS\>}.
> 2) Copy 1–3 short evidence spans *verbatim* from the input text.
> 3) Write a brief explanation in <LANG\>.
>
> Important: Evidence must be exact substrings of the input (do not paraphrase).
>
> Input: <INPUT\>

#### Condition C: $L_{\text{native}} \rightarrow \text{EN}$ (native input, English explanation; evidence stays native).

> You are given a task input in <LANG\>.
>
> 1) Predict the correct label from: {<LABELS\>}.
> 2) Copy 1–3 short evidence spans *verbatim* from the input text (keep them in <LANG\>).
> 3) Write a brief explanation in English.
>
> Important: Evidence must be exact substrings of the input (do not translate Evidence).
>
> Input: <INPUT\>
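Responses in the universal output format can be checked mechanically before any metric is computed. The following is a minimal sketch of such a check, not the parser used in our pipeline: it assumes the three fields appear in the order Label, Evidence, Explanation, and that multiple evidence spans are separated by semicolons, as in the qualitative example of Appendix B.2. The function and field names are hypothetical.

```python
import re

# Assumes the field order Label -> Evidence -> Explanation.
FIELD_PATTERN = re.compile(
    r"Label:\s*(?P<label>.*?)\s*"
    r"Evidence:\s*(?P<evidence>.*?)\s*"
    r"Explanation:\s*(?P<explanation>.*)",
    re.DOTALL,
)


def parse_response(response, input_text):
    """Split a model response into its fields and check the verbatim rule."""
    match = FIELD_PATTERN.search(response)
    if match is None:
        raise ValueError("response does not follow the Label/Evidence/Explanation format")
    # Assumption: evidence spans are separated by semicolons.
    spans = [s.strip() for s in match.group("evidence").split(";") if s.strip()]
    return {
        "label": match.group("label").strip(),
        "evidence": spans,
        # True only if every span is an exact substring of the input.
        "verbatim": all(span in input_text for span in spans),
        "explanation": match.group("explanation").strip(),
    }


example_input = "barish ho rahi thi isliye usne chhata liya."
example_response = (
    "Label: entailment\n"
    "Evidence: barish ho rahi thi; chhata liya\n"
    "Explanation: The premise mentions rain and taking an umbrella."
)
parsed = parse_response(example_response, example_input)
```

A response whose `verbatim` flag is false contains paraphrased or translated evidence, so its spans cannot be mapped back to input tokens without further alignment.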

### B.2 Mini qualitative example

Task: e-SNLI (NLI). Labels: {entailment, neutral, contradiction}. Premise (HI; romanized for pdfLaTeX compatibility):

> barish ho rahi thi isliye usne chhata liya\.

Hypothesis \(HI; romanized\):

> usne barish ki wajah se chhata istemal kiya\.

#### Condition B ($L_{\text{native}} \rightarrow L_{\text{native}}$).

> Label: entailment
>
> Evidence: barish ho rahi thi; chhata liya
>
> Explanation: Premise me barish aur chhata lene ka zikr hai, isliye hypothesis samarthit hai.

#### Condition C ($L_{\text{native}} \rightarrow \text{EN}$).

> Label: entailment
>
> Evidence: barish ho rahi thi; chhata liya
>
> Explanation: The premise mentions rain and taking an umbrella, which supports the hypothesis.

#### How this supports comprehensiveness computation\.

Because Evidence: is copied verbatim from the input, we define $T(x)$ as the set of input tokens covered by the evidence spans and then form $x' = \mathrm{mask}(x, T(x))$ by masking those tokens. If the label changes between $f(x)$ and $f(x')$, the instance contributes 1 to comprehensiveness. This remains well-defined even when the Explanation: is in English (Condition C).
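The flip-based definition above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not our evaluation code: masking is done by replacing each evidence span with a placeholder string rather than at the tokenizer level, the `[MASK]` placeholder is a hypothetical choice, and `toy_predict` is a stand-in for the model $f$.

```python
def mask_spans(text, spans, mask_token="[MASK]"):
    """Form x' by replacing every occurrence of each evidence span in x."""
    masked = text
    for span in spans:
        masked = masked.replace(span, mask_token)
    return masked


def comprehensiveness(instances, predict):
    """Fraction of instances whose predicted label changes after masking.

    `instances` is a list of (input_text, evidence_spans) pairs and `predict`
    maps an input string to a label.
    """
    flips = 0
    for text, spans in instances:
        if predict(text) != predict(mask_spans(text, spans)):
            flips += 1
    return flips / len(instances)


def toy_predict(text):
    # Hypothetical classifier whose decision depends only on the umbrella cue.
    return "entailment" if "chhata" in text else "neutral"


premise = "barish ho rahi thi isliye usne chhata liya."
instances = [
    (premise, ["barish ho rahi thi", "chhata liya"]),  # cites the decisive cue
    (premise, ["barish ho rahi thi"]),                 # omits the decisive cue
]
score = comprehensiveness(instances, toy_predict)
```

In this toy setup only the first evidence selection removes the cue the classifier relies on, so only that instance counts as a flip. This mirrors the contrast between Conditions B and C in Case 1 of the error analysis.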

### B.3 Rationale alignment details

Our span agreement metric (Eq. [1](https://arxiv.org/html/2605.19274#S3.E1)) requires computing the overlap between model-produced evidence spans $\mathcal{E}_m(x)$ and human rationale spans $\mathcal{E}_h(x)$, both defined over the tokenized input $\mathcal{I}(x)$. Since human rationales vary in format across datasets, we describe our alignment procedure and provide worked examples for each language. All non-English text is shown in romanized form for pdfLaTeX compatibility; original-script versions are available in our released data.

#### Alignment procedure\.

The core challenge is that human rationales are not always provided as exact input substrings\. We handle each dataset as follows:

1. e-SNLI: Human rationales are free-form English sentences (e.g., “The premise states rain and taking an umbrella, which supports the hypothesis.”). These are not substrings of the input. We extract $\mathcal{E}_h(x)$ by:
   - (a) Tokenizing both the input and the rationale into word-level tokens (for English) or character-level tokens (for Chinese, Hindi, Arabic, Bengali).
   - (b) Identifying all maximal contiguous token sequences from the rationale that appear as *exact substrings* in the input.
   - (c) Taking the union of matched input-token positions as $\mathcal{E}_h(x)$.

   For translated rationales, we apply the same procedure in the target language using the translated rationale and translated input.
2. FEVER: Human rationales are gold evidence sentences drawn from Wikipedia. Since these sentences may not appear verbatim in the claim, we perform the same substring matching procedure as for e-SNLI, operating over the concatenation of the claim and the provided context.
3. HateXplain: Human rationales are provided as annotated token-level highlight spans over the input text. These directly define $\mathcal{E}_h(x)$ with no alignment needed. For translated instances, we project the original span boundaries onto the translated text using word-level positional correspondence from the translation alignment.

#### Matching details\.

We enforce*exact*substring matching with the following normalization steps applied to both the input and the rationale before matching:

1. Unicode NFC normalization (to handle equivalent representations of composed characters, particularly important for Hindi and Bengali).
2. Whitespace collapsing (multiple spaces, tabs, and newlines reduced to single spaces).
3. Case-insensitive matching for Latin-script languages (English).
4. No stemming or lemmatization is applied; matching is surface-level by design.

We set a minimum match length of 2 tokens to avoid spurious single\-token overlaps \(e\.g\., matching common stop words or punctuation marks\)\.
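The normalization and matching steps can be sketched as follows. This is a simplified illustration rather than our released implementation: it assumes whitespace-delimited word tokens with punctuation already removed, applies case folding to all text (a no-op for scripts without letter case), and uses a quadratic scan for contiguous matches. The function names and the `min_len` parameter are hypothetical.

```python
import re
import unicodedata


def normalize(text):
    """Apply NFC normalization, whitespace collapsing, and case folding."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.casefold()


def matched_positions(input_tokens, rationale_tokens, min_len=2):
    """Input-token positions covered by contiguous runs shared with the rationale.

    A run counts only if it spans at least `min_len` tokens; no stemming or
    lemmatization is applied, so matching is purely surface-level.
    """
    positions = set()
    n, m = len(input_tokens), len(rationale_tokens)
    for i in range(n):
        for j in range(m):
            k = 0
            # Extend the run while both sequences keep agreeing token by token.
            while (i + k < n and j + k < m
                   and input_tokens[i + k] == rationale_tokens[j + k]):
                k += 1
            if k >= min_len:
                positions.update(range(i, i + k))
    return positions


input_tokens = normalize("She  took an Umbrella today").split()
rationale_tokens = normalize("he took an umbrella").split()
human_positions = matched_positions(input_tokens, rationale_tokens)
```

Here the three-token run "took an umbrella" is matched, while an isolated shared word would be discarded under the two-token minimum.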

#### Worked examples\.

We provide one alignment example per language from the e-SNLI dataset. In each case, the input consists of the concatenated premise and hypothesis, and the rationale is the (translated) human explanation. All non-English examples are shown in romanized form, consistent with Appendix [B](https://arxiv.org/html/2605.19274#A2).

Example 1: English \(EN\)

Premise: *“It was raining, so she took an umbrella.”*
Hypothesis: *“She used an umbrella because it was raining.”*
Label: Entailment
Rationale: *“The premise states rain and taking an umbrella, which supports the hypothesis.”*

Alignment: Tokenized rationale words matched against the concatenated input:

1. *“raining”* → matches input position 3 (from premise)
2. *“umbrella”* → matches input positions 8 (premise) and 14 (hypothesis)
3. *“taking an umbrella”* → does not match (input has *“took an umbrella”*; no lemmatization)

$\mathcal{E}_h(x) = \{3, 8, 14\}$ (3 matched tokens)

Example 2: Hindi (HI) *(romanized)*

Premise: *“barish ho rahi thi, isliye usne chhata liya.”*
Hypothesis: *“usne barish ki wajah se chhata istemal kiya.”*
Label: Entailment
Rationale: *“vaakya mein barish aur chhata lene ki baat hai, isliye kathan sahi hai.”*

Alignment: Word-level tokenization on romanized text, then exact substring matching:

- *“barish”* (rain) → matches premise position 0 and hypothesis position 1
- *“chhata”* (umbrella) → matches premise position 6 and hypothesis position 5
- *“isliye”* (therefore) → matches premise position 4
- *“lene”* (taking) → does not match (input has *“liya”*; no lemmatization)

$\mathcal{E}_h(x) = \{0, 1, 4, 5, 6\}$ (5 matched tokens)

Example 3: Chinese (ZH) *(romanized via Pinyin)*

Premise: *“xia yu le, suoyi ta dai le yi ba san.”*
Hypothesis: *“yinwei xia yu, ta shiyong le yusan.”*
Label: Entailment
Rationale: *“qianti tidao xia yu he dai san, yinci zhichi jiashe.”*

Alignment: Character-level tokenization on original script (shown here in Pinyin for readability):

- *“xia yu”* (rain) → matches premise position 0 and hypothesis position 1
- *“san”* (umbrella) → matches premise position 8
- *“dai”* (carry) → matches premise position 5

$\mathcal{E}_h(x) = \{0, 1, 5, 8\}$ (4 matched tokens)

Note: Actual matching is performed on the original Chinese characters, not on Pinyin romanization. Pinyin is shown here only for typographic convenience.

Example 4: Arabic (AR) *(romanized)*

Premise: *“kaanat tumtir, lidhalika akhadhat midhalla.”*
Hypothesis: *“istakhdamat midhalla li’annaha kaanat tumtir.”*
Label: Entailment
Rationale: *“al-muqaddima tadhkur al-matar wa akhdh al-midhalla, mimma yad‘am al-faradiyya.”*

Alignment: Character-level tokenization on original Arabic script (romanized here):

- *“al-matar”* (the rain) → does not match exactly (input has *“tumtir”*, a verb form; no lemmatization applied)
- *“al-midhalla”* (the umbrella) → does not match exactly (input has *“midhalla”*, without the definite article)
- *“midhalla”* (umbrella, as substring of the rationale token) → matches premise position 5 and hypothesis position 1

$\mathcal{E}_h(x) = \{1, 5\}$ (2 matched tokens)

Note: Arabic morphology (prefixed definite articles, verb conjugation patterns) substantially reduces exact-match recall compared to more isolating languages. This is a known conservative bias of our approach: it under-counts genuine semantic overlap, working *against* reporting high span agreement rather than inflating it.

Example 5: Bengali (BN) *(romanized)*

Premise: *“brishti hochchhilo, tai se chhata niyechhilo.”*
Hypothesis: *“brishtir karone se chhata byabohar korechhilo.”*
Label: Entailment
Rationale: *“baakye brishti ebong chhata neoar kotha achhe, tai ukti sothik.”*

Alignment: Word-level tokenization on romanized text:

- *“brishti”* (rain) → matches premise position 0
- *“chhata”* (umbrella) → matches premise position 4 and hypothesis position 3
- *“tai”* (therefore) → matches premise position 2
- *“neoar”* (of taking) → does not match (input has *“niyechhilo”*; no lemmatization)

$\mathcal{E}_h(x) = \{0, 2, 3, 4\}$ (4 matched tokens)

#### Coverage statistics\.

Table [7](https://arxiv.org/html/2605.19274#A2.T7) reports the mean proportion of rationale tokens that find at least one exact match in the input, averaged across all test instances per language. Lower coverage in morphologically rich languages (Arabic, Bengali) confirms the conservative nature of our matching: span agreement scores for these languages should be interpreted as lower bounds.

Table 7: Mean rationale token coverage (proportion of rationale tokens matched to the input via exact substring matching). HateXplain uses direct span annotations and does not require alignment. Lower coverage in Arabic and Bengali reflects morphological complexity, not translation failure.
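As a rough illustration of the coverage statistic, the sketch below computes the share of rationale tokens that occur in the input and averages it over instances. It is a simplification under stated assumptions: the helper names are hypothetical, tokens are whitespace-split words with ASCII punctuation stripped, and single-token membership is used in place of the full substring matching with a 2-token minimum.

```python
import string


def word_tokens(text: str) -> list[str]:
    """Lowercased whitespace tokens with ASCII punctuation stripped (assumed)."""
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return [w for w in words if w]


def token_coverage(rationale: str, input_text: str) -> float:
    """Proportion of rationale tokens with at least one exact match in the input."""
    rat = word_tokens(rationale)
    if not rat:
        return 0.0
    inp = set(word_tokens(input_text))
    return sum(tok in inp for tok in rat) / len(rat)


def mean_coverage(pairs) -> float:
    """Average coverage over (rationale, input_text) pairs for one language."""
    scores = [token_coverage(r, x) for r, x in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Romanized Bengali strings from Example 5 (premise and hypothesis concatenated).
bn_input = ("brishti hochchhilo, tai se chhata niyechhilo. "
            "brishtir karone se chhata byabohar korechhilo.")
bn_rationale = "baakye brishti ebong chhata neoar kotha achhe, tai ukti sothik."
print(token_coverage(bn_rationale, bn_input))
```

Here three of the ten rationale tokens (*brishti*, *chhata*, *tai*) appear in the input, while the inflected *neoar* does not, which is the under-counting that the coverage figures are meant to expose.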

#### Failure modes and limitations.

Our exact-match approach has two systematic failure modes:

1. Morphological mismatch: Inflected forms in the rationale may differ from the input surface form (e.g., Arabic definite article prefixing, Hindi verb conjugation), reducing matched coverage. As seen in Example 4, Arabic *“al-matar”* fails to match input *“tumtir”* despite referring to the same concept.
2. Paraphrase: When the human rationale uses a synonym or rephrasing rather than the exact input term, no match is found. As seen in Example 2, Hindi *“lene”* (to take) fails to match *“liya”* (took).

Both failure modes *under-count* genuine overlap, meaning our span agreement scores are conservative lower bounds. This bias works against our hypothesis: if the true semantic overlap is higher than what we measure, then the span agreement differences between $L_{\text{native}} \to L_{\text{native}}$ and $L_{\text{native}} \to \text{EN}$ may be even smaller than reported, making our finding that these conditions *do* diverge more robust, not less. As a complementary check, we report BERTScore-based semantic similarity in Appendix [F](https://arxiv.org/html/2605.19274#A6), which is less sensitive to surface-level variation.

## Appendix C Practical recommendations

Based on the identification of the plausibility-faithfulness trade-off, we offer the following recommendations for researchers and developers:

1. Avoid English pivots for auditing: In high-stakes settings (e.g., legal or medical AI), system faithfulness should always be audited in the native language of the input. English explanations should be treated as summaries for convenience rather than faithful traces of reasoning.
2. Standardize cross-lingual faithfulness metrics: Evaluation benchmarks should move beyond simple span agreement and incorporate faithfulness metrics, such as comprehensiveness and sufficiency, specifically designed for mismatched language conditions.
3. Prioritize cultural context over fluency: For social tasks like hate speech detection, developers must prioritize native-language explanation capabilities, as English pivots fail to capture the pragmatic nuances necessary for both plausibility and trust.

## Appendix D Prompt paraphrases and sensitivity analysis

To verify that our findings are robust to surface-level prompt variation, we create five paraphrased versions of each prompt template. All variants preserve the task semantics, output structure (label → evidence → explanation), and key constraints (evidence must be exact input substrings; explanation language matches the condition). Only the instructional wording is varied.

We show variants for Condition C ($L_{\text{native}} \to \text{EN}$) below; Conditions A and B follow identical paraphrase patterns with the explanation language adjusted accordingly.

### D.1 Prompt variants

#### Variant 1 \(Original\)\.

> You are given a task input in <LANG\>.
> 1) Predict the correct label from: {<LABELS\>}.
> 2) Copy 1–3 short evidence spans verbatim from the input text (keep them in <LANG\>).
> 3) Write a brief explanation in English.
> Important: Evidence must be exact substrings of the input (do not translate Evidence).
> Input: <INPUT\>

#### Variant 2\.

> Below is a task input written in <LANG\>.
> 1) Determine the appropriate category from: {<LABELS\>}.
> 2) Extract 1–3 short text segments directly from the input as supporting evidence (keep them in the original language).
> 3) Provide a short justification in English.
> Important: Extracted evidence must be copied exactly from the input without translation.
> Input: <INPUT\>

#### Variant 3\.

> You will analyze a task input in <LANG\>.
> 1) Choose the best label from: {<LABELS\>}.
> 2) Identify 1–3 key phrases from the input text and copy them exactly (retain the original <LANG\>).
> 3) Briefly explain your reasoning in English.
> Important: Key phrases must be exact substrings of the input. Do not paraphrase or translate them.
> Input: <INPUT\>

#### Variant 4\.

> The following is a task input in <LANG\>.
> 1) Select the correct label from: {<LABELS\>}.
> 2) Highlight 1–3 relevant spans from the input by copying them exactly as they appear (in <LANG\>).
> 3) Write a concise explanation in English.
> Important: Highlighted spans must be exact copies from the input, not translations.
> Input: <INPUT\>

#### Variant 5\.

> Given a task input in <LANG\>, perform the following:
> 1) Assign one label from: {<LABELS\>}.
> 2) Quote 1–3 short supporting passages from the input verbatim (keep them in <LANG\>).
> 3) Justify your answer briefly in English.
> Important: Quoted passages must be exact substrings of the input without any translation.
> Input: <INPUT\>

### D.2 Sensitivity results

Tables [8](https://arxiv.org/html/2605.19274#A4.T8) and [9](https://arxiv.org/html/2605.19274#A4.T9) report mean ± standard deviation across the five prompt variants on e-SNLI. The trade-off pattern (lower comprehensiveness and higher sufficiency under $L_{\text{native}} \to \text{EN}$ compared to $L_{\text{native}} \to L_{\text{native}}$) holds consistently across all prompt formulations. Standard deviations are substantially smaller than the between-condition gaps, confirming that the observed effects are not artifacts of specific prompt wording.

Table 8: Prompt sensitivity on e-SNLI (Qwen2.5-7B). Mean ± std across 5 prompt paraphrases. The comprehensiveness gap between $L_{\text{native}} \to L_{\text{native}}$ and $L_{\text{native}} \to \text{EN}$ (e.g., Hindi: $0.258$) consistently exceeds the within-condition standard deviation ($\sigma \approx 0.03$), confirming robustness to prompt phrasing.

Table 9: Prompt sensitivity on e-SNLI (Llama3.1-8B). Mean ± std across 5 prompt paraphrases. The same trade-off pattern holds: comprehensiveness drops and sufficiency rises under $L_{\text{native}} \to \text{EN}$, with between-condition gaps exceeding within-condition variance.
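The comparison behind these tables (a between-condition gap set against the spread across prompt variants) can be sketched as follows. The scores below are hypothetical placeholders rather than values from Tables 8 or 9, the function names are illustrative, and the use of the sample standard deviation is an assumption, since the estimator is not specified.

```python
from statistics import mean, stdev


def summarize(scores):
    """Mean and sample standard deviation across prompt variants."""
    return mean(scores), stdev(scores)


def gap_vs_spread(native_scores, pivot_scores):
    """Return the gap between condition means, the larger within-condition
    standard deviation, and whether the gap exceeds that spread."""
    m_nat, s_nat = summarize(native_scores)
    m_piv, s_piv = summarize(pivot_scores)
    gap = abs(m_nat - m_piv)
    spread = max(s_nat, s_piv)
    return gap, spread, gap > spread


# Hypothetical comprehensiveness scores, one per prompt variant.
native = [0.40, 0.43, 0.38, 0.41, 0.39]   # L_native -> L_native
pivot = [0.15, 0.17, 0.13, 0.16, 0.14]    # L_native -> EN
gap, spread, robust = gap_vs_spread(native, pivot)
print(f"gap={gap:.3f}, spread={spread:.3f}, gap exceeds spread: {robust}")
```

A gap several times larger than the spread, as in this toy case, is the pattern the robustness claim relies on; a formal test would be needed to attach a significance level to it.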

## Appendix E Ordering ablation

This ablation is intended as a prompt-structure sanity check rather than a full re-run of the main experiment; therefore, we report it for one representative model, Qwen2.5-7B, across all tasks and languages.

### E.1 Reversed prompt template

The reversed-order prompt for Condition C ($L_{\text{native}} \to \text{EN}$) is:

> You are given a task input in <LANG\>.
> 1) Predict the correct label from: {<LABELS\>}.
> 2) Write a brief explanation in English.
> 3) Copy 1–3 short evidence spans verbatim from the input text (keep them in <LANG\>).
> Important: Evidence must be exact substrings of the input (do not translate Evidence).
> Input: <INPUT\>

The corresponding reversed output format is:

> Label: <one label from the label set\>
>
> Explanation: <1–3 sentences in the required explanation language\>
>
> Evidence: <1–3 spans copied exactly from the input text\>

Conditions A and B are reversed analogously\.
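To make the output format concrete, the following sketch parses a response with `Label`, `Evidence`, and `Explanation` fields in either order and checks the exact-substring constraint on the evidence. It is illustrative only: the function names are hypothetical, and the semicolon used to separate multiple evidence spans is an assumption, since the separator is not specified here.

```python
import re

FIELD = re.compile(r"^(Label|Evidence|Explanation):\s*(.*)$")


def parse_response(text: str) -> dict:
    """Collect the three fields regardless of the order they appear in."""
    fields, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        m = FIELD.match(line)
        if m:
            current = m.group(1).lower()
            fields[current] = m.group(2)
        elif current and line:  # continuation of a multi-line field
            fields[current] = (fields[current] + " " + line).strip()
    return fields


def evidence_spans(fields: dict, sep: str = ";") -> list[str]:
    """Split the evidence field into spans (separator is an assumption)."""
    return [s.strip() for s in fields.get("evidence", "").split(sep) if s.strip()]


def all_verbatim(spans: list[str], input_text: str) -> bool:
    """True only if every span is an exact substring of the input."""
    return bool(spans) and all(s in input_text for s in spans)


input_text = "barish ho rahi thi, isliye usne chhata liya."
evidence_first = ("Label: entailment\n"
                  "Evidence: barish ho rahi thi; chhata liya\n"
                  "Explanation: The premise mentions rain and an umbrella.")
explanation_first = ("Label: entailment\n"
                     "Explanation: The premise mentions rain and an umbrella.\n"
                     "Evidence: barish ho rahi thi; chhata liya")

for response in (evidence_first, explanation_first):
    spans = evidence_spans(parse_response(response))
    print(spans, all_verbatim(spans, input_text))
```

Keeping the parser order-agnostic means the same downstream metrics can be computed for the default and the reversed template without separate handling.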

### E.2 Results

Table [10](https://arxiv.org/html/2605.19274#A5.T10) compares the default (evidence-first) and reversed (explanation-first) orderings on e-SNLI for Qwen2.5-7B across all languages. We report comprehensiveness, sufficiency, and span agreement for both orderings.

Table 10: Ordering ablation for Qwen2.5-7B under the $L_{\text{native}} \to \text{EN}$ condition. The original prompt requests label → evidence → explanation, while the reversed prompt requests label → explanation → evidence. Values are reported for all three datasets and all non-English languages. Across tasks and languages, reversing the output-field order produces only small changes in comprehensiveness and span agreement, suggesting that the explanation-language instruction acts as a global conditioning signal rather than a local constraint imposed only after evidence generation.

Table [11](https://arxiv.org/html/2605.19274#A5.T11) reports the same ablation for Llama3.1-8B.

Table 11: Ordering ablation on e-SNLI (Llama3.1-8B). Same setup as Table [10](https://arxiv.org/html/2605.19274#A5.T10). Results confirm that the trade-off is order-independent for Llama as well (paired permutation test, $p > 0.3$ for all cells).
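For readers unfamiliar with paired permutation tests, the sketch below shows one common variant, an exact sign-flip test on per-instance score differences between the two orderings. The caption does not state which variant or test statistic was used, so this is an assumption; the function name and the toy scores are hypothetical, and exhaustive enumeration is only practical for small samples.

```python
from itertools import product


def paired_permutation_pvalue(scores_a, scores_b):
    """Two-sided exact sign-flip test on the mean paired difference.

    Under the null hypothesis the sign of each paired difference is
    exchangeable, so every sign assignment is enumerated (2**n of them).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=n):
        stat = abs(sum(s * d for s, d in zip(signs, diffs))) / n
        if stat >= observed - 1e-12:  # tolerance for floating-point ties
            extreme += 1
        total += 1
    return extreme / total


# Hypothetical per-instance scores under the two orderings.
default_order = [0.5, 0.6, 0.7, 0.8]
reversed_order = [0.4, 0.5, 0.6, 0.7]
print(paired_permutation_pvalue(default_order, reversed_order))  # 0.125
```

With larger samples the enumeration would normally be replaced by random sampling of sign assignments; a high p-value, as reported in the caption, indicates no detectable difference between orderings.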

### E.3 Analysis

The close agreement between the two orderings across all language–model combinations supports the autoregressive coupling argument presented in Section [3.6](https://arxiv.org/html/2605.19274#S3.SS6): because the language instruction appears in the prompt before generation begins, it conditions the model’s entire output distribution, including evidence span selection, regardless of where in the output sequence the evidence field appears.

We additionally compute the Jaccard similarity between the evidence span sets produced under the two orderings. Across all conditions, the mean Jaccard index is $0.87 \pm 0.06$, indicating that the model selects largely the same evidence spans regardless of whether it generates the explanation first or last. The small residual variation is consistent with the stochastic nature of autoregressive sampling and does not correlate with explanation language.
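The Jaccard index between two evidence sets $A$ and $B$ is $|A \cap B| / |A \cup B|$. A minimal sketch follows, assuming each evidence span is represented as a string and that two empty sets count as identical; the function names and the example spans are illustrative.

```python
from statistics import mean, stdev


def jaccard(spans_a, spans_b) -> float:
    """|A ∩ B| / |A ∪ B| over two collections of evidence spans."""
    a, b = set(spans_a), set(spans_b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


def jaccard_summary(pairs):
    """Mean and sample standard deviation of Jaccard over instances."""
    scores = [jaccard(a, b) for a, b in pairs]
    return mean(scores), (stdev(scores) if len(scores) > 1 else 0.0)


# Hypothetical evidence sets for the same inputs under the two orderings.
pairs = [
    (["barish", "chhata liya"], ["barish", "chhata liya"]),
    (["barish", "chhata liya"], ["barish", "chhata liya", "isliye"]),
]
print(jaccard_summary(pairs))
```

In this toy case the first instance has identical evidence (Jaccard 1.0) and the second shares two of three spans (Jaccard 2/3).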

These results rule out the concern that the evidence\-first ordering shields evidence selection from the explanation language instruction\. The language instruction functions as a global conditioning signal, not a local directive tied to a specific output field\.

## Appendix F Semantic similarity check

Our primary span agreement metric (Eq. [1](https://arxiv.org/html/2605.19274#S3.E1)) relies on exact token-level overlap, which can over-penalize semantically correct but lexically different evidence selections, particularly in morphologically rich languages like Arabic and Bengali (see coverage statistics in Table [7](https://arxiv.org/html/2605.19274#A2.T7)). To verify that our findings are not artifacts of this surface-level metric, we complement span agreement with BERTScore (Zhang et al., [2020](https://arxiv.org/html/2605.19274#bib.bib25)), a semantic similarity metric based on contextual embeddings that is more robust to paraphrase, morphological variation, and word-order differences.

### F.1 Setup

For each instance, we compute BERTScore F1 between the model-produced evidence spans and the human rationale annotation. We use `bert-base-multilingual-cased` as the underlying model, which provides consistent cross-lingual representations across all five languages in our study. For each condition, we report the corpus-level mean BERTScore F1 across all instances.

We emphasize that BERTScore measures *semantic* similarity between evidence and rationale, whereas our span agreement metric (Eq. [1](https://arxiv.org/html/2605.19274#S3.E1)) measures *lexical* overlap. If the two metrics agree in their directional trends across conditions, this strengthens confidence that the observed patterns are genuine and not artifacts of tokenization or morphological mismatch.

### F.2 Results: e-SNLI

Table 12: e-SNLI: Span agreement (lexical) vs. BERTScore F1 (semantic) side by side. Directional trends are consistent: where span agreement increases under the English pivot, BERTScore also increases, and vice versa. BERTScore values are uniformly higher than span agreement, reflecting its ability to capture semantic matches missed by exact token overlap. Bold: better value in the $L \to L$ vs. $L \to \text{EN}$ comparison. Highlighted: pivot condition.

Table [12](https://arxiv.org/html/2605.19274#A6.T12) shows that BERTScore and span agreement exhibit consistent directional trends on e-SNLI. For languages where span agreement increases under the English pivot (Hindi Qwen2.5-7B: $0.417 \to 0.467$; Arabic both models; Bengali both models), BERTScore F1 also increases. For Chinese Qwen2.5-7B, where span agreement decreases ($0.597 \to 0.516$), BERTScore also decreases ($0.738 \to 0.689$). This convergence indicates that the span agreement patterns reported in Table [1](https://arxiv.org/html/2605.19274#S4.T1) are not artifacts of tokenization differences across scripts.

BERTScore values are uniformly higher than span agreement (mean gap: $+0.17$), which is expected: BERTScore captures semantic matches that exact token overlap misses (e.g., inflected forms, synonym substitutions). Crucially, however, the *relative* ordering across conditions is preserved, confirming that our primary span agreement metric provides a valid, if conservative, signal.

### F.3 Results: FEVER

Table 13: FEVER: Span agreement vs. BERTScore F1. Directional trends are again consistent between the two metrics. For Chinese, where both metrics decrease under the pivot, the agreement between lexical and semantic measures is particularly informative: the degradation is genuine, not a tokenization artifact. Notation follows Table [12](https://arxiv.org/html/2605.19274#A6.T12).

On FEVER (Table [13](https://arxiv.org/html/2605.19274#A6.T13)), BERTScore again tracks span agreement directionally. For Chinese, where both span agreement and comprehensiveness decrease under the English pivot, BERTScore confirms the degradation ($0.489 \to 0.441$ for Qwen2.5-7B; $0.471 \to 0.423$ for Llama3.1-8B). This is important because one might hypothesize that the Chinese span agreement drop is merely a tokenization artifact (Chinese characters vs. English words); BERTScore, operating on contextual embeddings, rules out this alternative explanation.

### F.4 Results: HateXplain

Table 14: HateXplain: Span agreement vs. BERTScore F1. Unlike e-SNLI, the English pivot does not consistently improve either metric: span agreement and BERTScore both show mixed or negative shifts, confirming that the loss of socially nuanced cues is a genuine semantic phenomenon, not a surface-level tokenization effect. Notation follows Table [12](https://arxiv.org/html/2605.19274#A6.T12).

HateXplain (Table [14](https://arxiv.org/html/2605.19274#A6.T14)) reveals the most informative pattern. If the span agreement decreases observed in Table [3](https://arxiv.org/html/2605.19274#S4.T3) were merely tokenization artifacts (e.g., morphologically rich forms being penalized by exact match), we would expect BERTScore to recover the “true” semantic similarity and show improvement under the English pivot. Instead, BERTScore shows the same mixed-to-negative pattern as span agreement: for Hindi ($0.561 \to 0.512$ for Qwen2.5-7B) and Bengali ($0.562 \to 0.499$ for Qwen2.5-7B; $0.502 \to 0.478$ for Llama3.1-8B), the semantic similarity *also* decreases under the pivot. This confirms that English explanations for hate speech genuinely lose socially relevant semantic content, rather than merely failing to match surface tokens.

### F.5 Summary and implications

Table 15: Agreement between span agreement and BERTScore F1 across conditions. “Directional agreement” counts the number of language–model cells (out of 16: 4 languages × 2 models × 2 directions) where both metrics move in the same direction under the English pivot. Pearson’s $r$ is computed across all condition-level mean scores within each task.

Table 16: Human translation audit over 50 randomly sampled instances per target language. Two bilingual annotators judge semantic preservation and label validity. Instances marked label-invalid by either annotator are excluded from all experiments.

Table [15](https://arxiv.org/html/2605.19274#A6.T15) summarizes the overall agreement between the two metrics. Across all three tasks, span agreement and BERTScore F1 agree directionally in 42 out of 48 cells (87.5%), with a Pearson correlation of $r = 0.88$. This high concordance supports three conclusions:

1. Span agreement is a valid proxy. Despite its known limitations with morphological variation and paraphrase, span agreement captures the same directional trends as the semantically richer BERTScore metric. The cross-lingual patterns reported in our main tables are not artifacts of the metric choice.
2. Morphological bias is conservative, not misleading. The gap between BERTScore and span agreement is largest for Arabic (mean gap: $+0.21$) and Bengali (mean gap: $+0.19$), consistent with these languages’ richer morphology reducing exact-match recall. However, this bias *under-counts* overlap uniformly across conditions, preserving the relative ordering.
3. The HateXplain pattern is genuine. The failure of English pivots to improve semantic similarity on hate speech (confirmed by both metrics) rules out the hypothesis that surface-level tokenization effects mask underlying semantic improvement. The loss of social and pragmatic cues under English pivoting is a substantive semantic phenomenon.
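The two summary statistics in Table 15 can be sketched as below: a count of cells where both metrics move in the same direction under the pivot, and a Pearson correlation over condition-level means. The first cell uses the Chinese Qwen2.5-7B e-SNLI values quoted in Appendix F.2; the other two cells are hypothetical, as are the function names.

```python
from math import sqrt


def direction(native: float, pivot: float) -> int:
    """+1 if the score rises under the pivot, -1 if it falls, 0 if unchanged."""
    delta = pivot - native
    return (delta > 0) - (delta < 0)


def directional_agreement(span_cells, bert_cells):
    """Count cells where span agreement and BERTScore move the same way."""
    agree = sum(direction(*s) == direction(*b)
                for s, b in zip(span_cells, bert_cells))
    return agree, len(span_cells)


def pearson_r(xs, ys) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


# Each cell is (native-condition score, English-pivot score).
span_cells = [(0.597, 0.516), (0.40, 0.45), (0.50, 0.52)]  # last two hypothetical
bert_cells = [(0.738, 0.689), (0.60, 0.66), (0.70, 0.69)]  # last two hypothetical
print(directional_agreement(span_cells, bert_cells))  # (2, 3)

span_means = [s for cell in span_cells for s in cell]
bert_means = [b for cell in bert_cells for b in cell]
print(round(pearson_r(span_means, bert_means), 3))
```

Applied to the reported counts, 42 agreeing cells out of 48 gives the stated 87.5%.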

We note one limitation: BERTScore relies on `bert-base-multilingual-cased`, which itself exhibits English-centric biases (Conneau et al., [2020](https://arxiv.org/html/2605.19274#bib.bib15)). As a result, BERTScore may slightly overestimate similarity for English-pivot conditions relative to native-language conditions. If anything, this bias works *against* our findings (making English pivots look better than they are), further strengthening the robustness of the observed trade-off.

## Appendix G Human plausibility validation

Across all conditions, span agreement is strongly correlated with human plausibility ratings ($\rho = 0.67$, $p < 0.001$). The correlation is strongest for e-SNLI ($\rho = 0.71$), followed by FEVER ($\rho = 0.66$) and HateXplain ($\rho = 0.60$). This supports the use of span agreement as a proxy for perceived plausibility, while also confirming that it should not be treated as a complete substitute for human evaluation.

Table 17: Correlation between human plausibility ratings and span agreement on the rated subsample.

## Appendix H Translation quality audit

Because all non\-English evaluation sets are constructed by translating English benchmark instances, translation artifacts could confound both span agreement and perturbation\-based faithfulness\. We therefore conduct a bilingual translation audit for each target language\. For Chinese, Hindi, Arabic, and Bengali, we randomly sample 50 translated instances and ask two bilingual annotators to evaluate two properties: \(i\) semantic preservation, and \(ii\) label validity\.

For semantic preservation, annotators assign one of three labels: *preserved*, *minor shift*, or *major shift*. A translation is marked *preserved* if the meaning of the source instance is retained without a task-relevant change. It is marked *minor shift* if the translation introduces small wording or fluency changes that do not affect the gold label. It is marked *major shift* if the translation changes, removes, or adds information that could affect task interpretation. For label validity, annotators judge whether the original gold label remains correct after translation. Instances marked label-invalid by either annotator are excluded from all experiments.
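The exclusion rule can be expressed as a simple filter. The sketch below assumes a hypothetical record layout in which each audited instance carries one label-validity judgment per annotator; the field names and example records are illustrative and not taken from the audit data.

```python
def keep_instance(label_valid_judgments) -> bool:
    """Keep an instance only if no annotator marked it label-invalid."""
    return all(label_valid_judgments)


def exclude_label_invalid(instances):
    """Drop every instance flagged label-invalid by either annotator."""
    return [x for x in instances if keep_instance(x["label_valid"])]


# Hypothetical audit records: (annotator 1, annotator 2) validity judgments.
audited = [
    {"id": "hi-001", "label_valid": (True, True)},
    {"id": "hi-002", "label_valid": (True, False)},   # flagged by annotator 2
    {"id": "hi-003", "label_valid": (False, False)},  # flagged by both
]
kept = exclude_label_invalid(audited)
print([x["id"] for x in kept])  # ['hi-001']
```

Using `all` rather than a majority vote mirrors the stated rule that a single label-invalid judgment is enough to exclude an instance.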

Table [16](https://arxiv.org/html/2605.19274#A6.T16) summarizes the audit results. Across languages, 86–92% of translations preserve the source semantics without meaningful change. Label-invalid cases are rare, affecting 2–4% of audited instances. Bengali shows the highest rate of minor or major semantic shifts, which is consistent with its lower-resource status and with the weaker significance patterns observed for Bengali in some faithfulness tests. These results suggest that translation artifacts are present but limited, and are unlikely to explain the systematic $L_{\text{native}} \to \text{EN}$ faithfulness drop observed across tasks and models.
