SEA-NLI: Natural Language Inference as a Lens into Southeast Asian Cultural Understanding

arXiv cs.CL Papers

Summary

Introduces SEA-NLI, a culturally grounded NLI benchmark covering eight Southeast Asian countries in English and native regional languages. Across 17 encoder and decoder models, performance is low, especially in knowledge-intensive categories such as Languages and Science and Technology. SEA-adapted models and culture-aware prompting improve results, while chain-of-thought (CoT) prompting offers limited gains.

arXiv:2606.03284v1

# SEA-NLI: Natural Language Inference as a Lens into Southeast Asian Cultural Understanding
Source: [https://arxiv.org/abs/2606.03284](https://arxiv.org/abs/2606.03284)
[View PDF](https://arxiv.org/pdf/2606.03284)

> Abstract: Frontier LLMs perform well in Western contexts, but remain poorly tested on underrepresented cultures such as those in Southeast Asia (SEA). Existing NLI benchmarks are largely Western-centric, translation-derived, or monolingual, limiting their ability to measure culturally grounded reasoning. We introduce SEA-NLI, a native, culturally grounded NLI benchmark covering eight SEA countries in English and native regional languages, verified by native speakers. Across 17 encoder and decoder models, we observe a low performance from all models, especially for knowledge-intensive categories such as Languages and Science and Technology. Our analysis shows that failure cases mainly stem from missing SEA cultural knowledge: SEA-adapted models and culture-aware prompting improve performance, while CoT prompting offers limited gains.

## Submission history

From: Peerat Limkonchotiwat [[view email](https://arxiv.org/show-email/55010c4d/2606.03284)]

**[v1]** Tue, 2 Jun 2026 07:49:50 UTC (1,588 KB)

Similar Articles

When English Rewrites Local Knowledge: Global Narrative Dominance in Large Language Models

arXiv cs.CL

This paper introduces CulturalNB, a dataset of Bengali cultural question-answer pairs, and evaluates nine LLMs for cross-lingual cultural bias. Findings show that English prompting increases global narrative substitution and reduces local perspectives, revealing that cultural failures in LLMs are grounding and prioritization issues, not just missing knowledge.

Sample-Size Scaling of the African Languages NLI Evaluation

arXiv cs.CL

This paper examines the effect of labeled data size on natural language inference performance for 16 African languages using the AfriXNLI benchmark. The results show that scaling behavior is language-sensitive and often non-monotonic, challenging the common assumption of monotonic improvement, and emphasizing the need for language-specific dataset creation and stronger multilingual strategies.

Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs

arXiv cs.CL

This paper proposes a semantic verification framework using Natural Language Inference (NLI) to evaluate the sensitivity of clinical LLMs to meaning-preserving prompt variations, introducing metrics such as MVS, ΔC, and WCI. Results show that domain specialization does not consistently improve robustness, with both domain-specific and general-purpose models showing mixed performance.