CulturALL introduces a 2,610-sample benchmark spanning 14 languages and 51 regions to evaluate LLMs on real-world, culturally grounded tasks; the top model scores only 44.48%, leaving substantial room for improvement.
Researchers introduce x1, a family of reasoning models that adaptively select optimal languages for reasoning on a per-instance basis, demonstrating that language choice impacts reasoning quality in multilingual and cultural tasks.