Tag
CulturALL introduces a 2,610-sample benchmark across 14 languages and 51 regions to evaluate LLMs on real-world, culturally grounded tasks; top model scores only 44.48%, highlighting large room for improvement.
Researchers introduce MORPHOGEN, a multilingual benchmark testing LLMs’ ability to rewrite first-person sentences in the opposite gender while preserving meaning across French, Arabic, and Hindi.