Introduces ROK-FORTRESS, a bilingual benchmark for measuring how language and geopolitical context jointly affect LLM safety behavior, using English-Korean and US-ROK axes as a case study. Findings show language and context interact in ways that translation-only evaluations miss.
CulturALL introduces a 2,610-sample benchmark spanning 14 languages and 51 regions to evaluate LLMs on real-world, culturally grounded tasks. The top model scores only 44.48%, leaving substantial room for improvement.
Researchers introduce MORPHOGEN, a multilingual benchmark covering French, Arabic, and Hindi that tests LLMs’ ability to rewrite first-person sentences in the opposite gender while preserving meaning.