This paper introduces SemCog Bench, a curated benchmark of 1,858 Arabic-Hebrew word pairs with sentence-level annotations, to evaluate LLMs' ability to distinguish true cognates from false friends and loanwords. Results show high accuracy on true cognates but sharp drops on false friends, highlighting a key limitation in cross-lingual semantic reasoning.
This paper proposes a safety-oriented, consequence-aware evaluation framework for large language models in air traffic control. Its evaluation reveals that high aggregate accuracy masks significant reliability issues in handling high-risk semantic errors.