Tag
Introduces SEA-NLI, a culturally grounded NLI benchmark covering eight Southeast Asian countries, and finds that LLMs perform poorly on culturally specific knowledge, especially in languages and science/technology. Shows that culture-aware prompting helps, while chain-of-thought prompting offers only limited gains.
This paper demonstrates that LLMs are heavily biased toward English and shows that continual pre-training offers no cost advantage over training from scratch when adapting models to other languages, especially for cultural understanding.
OpenAI introduced IndQA, a new benchmark with 2,278 questions across 12 Indian languages and 10 cultural domains, designed to evaluate AI models on culturally nuanced, reasoning-heavy tasks that existing benchmarks fail to capture. Created with 261 domain experts, IndQA addresses the saturation of existing multilingual benchmarks such as MMMLU and focuses on real-world cultural comprehension rather than translation or multiple-choice tasks.