SEA-NLI: Natural Language Inference as a Lens into Southeast Asian Cultural Understanding
Summary
Introduces SEA-NLI, a culturally grounded NLI benchmark covering eight Southeast Asian countries in English and native regional languages. Across 17 encoder and decoder models, performance is low, especially on knowledge-intensive categories such as Languages and Science and Technology. SEA-adapted models and culture-aware prompting improve performance, while chain-of-thought prompting offers limited gains.
# SEA-NLI: Natural Language Inference as a Lens into Southeast Asian Cultural Understanding

Source: [https://arxiv.org/abs/2606.03284](https://arxiv.org/abs/2606.03284)

[View PDF](https://arxiv.org/pdf/2606.03284)

> Abstract: Frontier LLMs perform well in Western contexts, but remain poorly tested on underrepresented cultures such as those in Southeast Asia (SEA). Existing NLI benchmarks are largely Western-centric, translation-derived, or monolingual, limiting their ability to measure culturally grounded reasoning. We introduce SEA-NLI, a native, culturally grounded NLI benchmark covering eight SEA countries in English and native regional languages, verified by native speakers. Across 17 encoder and decoder models, we observe a low performance from all models, especially for knowledge-intensive categories such as Languages and Science and Technology. Our analysis shows that failure cases mainly stem from missing SEA cultural knowledge: SEA-adapted models and culture-aware prompting improve performance, while CoT prompting offers limited gains.

## Submission history

From: Peerat Limkonchotiwat

**[v1]** Tue, 2 Jun 2026 07:49:50 UTC (1,588 KB)
Similar Articles
When English Rewrites Local Knowledge: Global Narrative Dominance in Large Language Models
This paper introduces CulturalNB, a dataset of Bengali cultural question-answer pairs, and evaluates nine LLMs for cross-lingual cultural bias. Findings show that English prompting increases global narrative substitution and reduces local perspectives, revealing that cultural failures in LLMs are grounding and prioritization issues, not just missing knowledge.
Sample-Size Scaling of the African Languages NLI Evaluation
This paper examines the effect of labeled data size on natural language inference performance for 16 African languages using the AfriXNLI benchmark. The results show that scaling behavior is language-sensitive and often non-monotonic, challenging the common assumption of monotonic improvement, and emphasizing the need for language-specific dataset creation and stronger multilingual strategies.
SEA-Embedding: Open and Reproducible Text Embeddings for Southeast Asia
SEA-Embedding presents a fully open and reproducible text embedding pipeline for Southeast Asian languages, trained solely on public data, achieving state-of-the-art results on the SEA-BED benchmark.
CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks
CulturALL introduces a 2,610-sample benchmark across 14 languages and 51 regions to evaluate LLMs on real-world, culturally grounded tasks; the top model scores only 44.48%, highlighting large room for improvement.
Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs
This paper proposes a semantic verification framework using Natural Language Inference (NLI) to evaluate the sensitivity of clinical LLMs to meaning-preserving prompt variations, introducing metrics such as MVS, ΔC, and WCI. Results show that domain specialization does not consistently improve robustness, with both domain-specific and general-purpose models showing mixed performance.