Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models

arXiv cs.CL Papers

Summary

Researchers present SemanticQA, a benchmark for evaluating language models on semantic phrase processing tasks covering idioms, noun compounds, and verbal constructions. Evaluations across diverse model architectures and scales reveal substantial performance variation, particularly on tasks that require semantic reasoning.

arXiv:2604.16593v1 Announce Type: new Abstract: We present SemanticQA, an evaluation suite designed to assess language models (LMs) in semantic phrase processing tasks. The benchmark consolidates existing multiword expression (MwE) resources and reorganizes them into a unified testbed. It covers both general lexical phenomena, such as lexical collocations, and three fine-grained categories: idiomatic expressions, noun compounds, and verbal constructions. Through SemanticQA, we assess LMs of diverse architectures and scales in extraction, classification, and interpretation tasks, as well as sequential task compositions. We reveal substantial performance variation, particularly on tasks requiring semantic reasoning, highlighting differences in reasoning efficacy and semantic understanding of LMs, providing insights for pushing LMs with stronger comprehension on non-trivial semantic phrases. The evaluation harness and data of SemanticQA are available at https://github.com/jacklanda/SemanticQA.
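
The abstract names the task types (extraction, classification, interpretation, and their sequential compositions) and the MWE categories, but not the data or harness format. Purely as a minimal sketch, the loop below assumes a hypothetical JSONL layout and a stand-in model call for a classification task; the actual field names, prompts, and scoring are defined in the linked SemanticQA repository.

```python
# Hypothetical sketch of a SemanticQA-style evaluation loop.
# Field names, file layout, and the prompt format are assumptions made for
# illustration; the real harness lives at github.com/jacklanda/SemanticQA.
import json
from collections import defaultdict
from typing import Callable, Dict


def dummy_model(prompt: str) -> str:
    """Stand-in for an LM call (replace with a local model or an API client)."""
    return "idiomatic"  # constant prediction, for demonstration only


def evaluate(path: str, model: Callable[[str], str] = dummy_model) -> Dict[str, float]:
    # Assumed JSONL layout, one item per line, e.g.:
    # {"task": "classification", "category": "idiom",
    #  "sentence": "He kicked the bucket last year.",
    #  "phrase": "kicked the bucket", "gold": "idiomatic"}
    per_category = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            prompt = (
                f"Sentence: {item['sentence']}\n"
                f"Is the phrase '{item['phrase']}' used idiomatically or "
                "literally here? Answer with one word."
            )
            prediction = model(prompt).strip().lower()
            correct, total = per_category[item["category"]]
            per_category[item["category"]] = [
                correct + int(prediction == item["gold"].lower()),
                total + 1,
            ]
    # Report accuracy per MWE category (idiom, noun compound, verbal construction, ...).
    return {cat: c / t for cat, (c, t) in per_category.items()}


if __name__ == "__main__":
    # Hypothetical file name used here only to show the calling convention.
    print(evaluate("semanticqa_classification.jsonl"))
```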

Cached at: 04/21/26, 07:03 AM

# Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models
Source: [https://arxiv.org/abs/2604.16593](https://arxiv.org/abs/2604.16593)
[View PDF](https://arxiv.org/pdf/2604.16593)

> Abstract: We present SemanticQA, an evaluation suite designed to assess language models (LMs) in semantic phrase processing tasks. The benchmark consolidates existing multiword expression (MwE) resources and reorganizes them into a unified testbed. It covers both general lexical phenomena, such as lexical collocations, and three fine-grained categories: idiomatic expressions, noun compounds, and verbal constructions. Through SemanticQA, we assess LMs of diverse architectures and scales in extraction, classification, and interpretation tasks, as well as sequential task compositions. We reveal substantial performance variation, particularly on tasks requiring semantic reasoning, highlighting differences in reasoning efficacy and semantic understanding of LMs, providing insights for pushing LMs with stronger comprehension on non-trivial semantic phrases. The evaluation harness and data of SemanticQA are available at [this https URL](https://github.com/jacklanda/SemanticQA).

## Submission history

From: Yang Liu [[view email](https://arxiv.org/show-email/ed5e06c8/2604.16593)] **[v1]** Fri, 17 Apr 2026 17:56:21 UTC (574 KB)

Similar Articles

VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models

arXiv cs.CL

VLegal-Bench is a cognitively grounded benchmark for evaluating large language models on Vietnamese legal reasoning tasks, containing 10,450 expert-annotated samples designed to address the gap in legal benchmarks for civil law systems. The benchmark assesses multiple levels of legal understanding through question answering, multi-step reasoning, and scenario-based problem solving, providing a replicable framework for evaluating LLMs in non-English, codified legal contexts.

Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap

arXiv cs.CL

This paper introduces CrossMath, a controlled multimodal reasoning benchmark that reveals a critical limitation in current vision-language models: they perform reasoning primarily in textual space rather than genuine vision-grounded reasoning, with visual input often degrading performance compared to text-only baselines. The authors propose fine-tuning approaches to mitigate this modality gap and improve multimodal reasoning capabilities.

Introducing SimpleQA

OpenAI Blog

OpenAI introduces SimpleQA, a new factuality benchmark of 4,326 short, fact-seeking questions designed to evaluate frontier language models on their ability to answer accurately without hallucinating. The dataset maintains high quality through dual independent annotation and rigorous inclusion criteria, yielding an estimated error rate of only ~3%, while GPT-4o scores less than 40%.