Tag
UA-Legal-Bench introduces a five-task benchmark for evaluating large language models on Ukrainian legal reasoning, built from the Unified State Register of Court Decisions. An evaluation of 11 LLMs shows that few-shot effects depend on the task and that accuracy is a misleading metric on imbalanced legal tasks.
GaoYao introduces a 182k-sample benchmark across 26 languages and 51 regions to systematically evaluate LLMs’ multilingual and multicultural capabilities, revealing large geographical performance gaps.
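The UA-Legal-Bench summary's point about accuracy on imbalanced tasks can be illustrated with a minimal sketch. The labels (`rejected`, `granted`), the 90/10 class split, and the always-predict-the-majority baseline are hypothetical and chosen only for illustration; they are not taken from the benchmark. Macro-F1 is used here as one common imbalance-aware metric, not necessarily the one the paper reports.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so each class counts equally."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced task: 90 "rejected" cases, 10 "granted" cases.
y_true = ["rejected"] * 90 + ["granted"] * 10
# A trivial baseline that always predicts the majority class.
y_pred = ["rejected"] * 100

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy = {acc:.3f}, macro-F1 = {f1:.3f}")
```

In this toy setup the baseline never identifies a single minority-class case, yet its accuracy stays high because the majority class dominates the count. Macro-F1 averages over classes instead of examples, so the missed class pulls the score down sharply.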