Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models

arXiv cs.CL Papers

Summary

Researchers present SemanticQA, a benchmark for evaluating language models on semantic phrase processing tasks covering idioms, noun compounds, and verbal constructions. Evaluations across diverse model architectures and scales reveal substantial performance variation, particularly on tasks that require semantic reasoning.

arXiv:2604.16593v1 Announce Type: new Abstract: We present SemanticQA, an evaluation suite designed to assess language models (LMs) in semantic phrase processing tasks. The benchmark consolidates existing multiword expression (MwE) resources and reorganizes them into a unified testbed. It covers both general lexical phenomena, such as lexical collocations, and three fine-grained categories: idiomatic expressions, noun compounds, and verbal constructions. Through SemanticQA, we assess LMs of diverse architectures and scales in extraction, classification, and interpretation tasks, as well as sequential task compositions. We reveal substantial performance variation, particularly on tasks requiring semantic reasoning, highlighting differences in reasoning efficacy and semantic understanding of LMs, providing insights for pushing LMs with stronger comprehension on non-trivial semantic phrases. The evaluation harness and data of SemanticQA are available at https://github.com/jacklanda/SemanticQA.
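
The abstract names the task types (extraction, classification, interpretation, and their sequential compositions) and the MWE categories, but not the data or harness format. Purely as a minimal sketch, the loop below assumes a hypothetical JSONL layout and a stand-in model call for a classification task; the actual field names, prompts, and scoring are defined in the linked SemanticQA repository.

```python
# Hypothetical sketch of a SemanticQA-style evaluation loop.
# Field names, file layout, and the prompt format are assumptions made for
# illustration; the real harness lives at github.com/jacklanda/SemanticQA.
import json
from collections import defaultdict
from typing import Callable, Dict


def dummy_model(prompt: str) -> str:
    """Stand-in for an LM call (replace with a local model or an API client)."""
    return "idiomatic"  # constant prediction, for demonstration only


def evaluate(path: str, model: Callable[[str], str] = dummy_model) -> Dict[str, float]:
    # Assumed JSONL layout, one item per line, e.g.:
    # {"task": "classification", "category": "idiom",
    #  "sentence": "He kicked the bucket last year.",
    #  "phrase": "kicked the bucket", "gold": "idiomatic"}
    per_category = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            prompt = (
                f"Sentence: {item['sentence']}\n"
                f"Is the phrase '{item['phrase']}' used idiomatically or "
                "literally here? Answer with one word."
            )
            prediction = model(prompt).strip().lower()
            correct, total = per_category[item["category"]]
            per_category[item["category"]] = [
                correct + int(prediction == item["gold"].lower()),
                total + 1,
            ]
    # Report accuracy per MWE category (idiom, noun compound, verbal construction, ...).
    return {cat: c / t for cat, (c, t) in per_category.items()}


if __name__ == "__main__":
    # Hypothetical file name used here only to show the calling convention.
    print(evaluate("semanticqa_classification.jsonl"))
```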

Cached at: 04/21/26, 07:03 AM

# Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models
Source: [https://arxiv.org/abs/2604.16593](https://arxiv.org/abs/2604.16593)
[View PDF](https://arxiv.org/pdf/2604.16593)

> Abstract: We present SemanticQA, an evaluation suite designed to assess language models (LMs) in semantic phrase processing tasks. The benchmark consolidates existing multiword expression (MwE) resources and reorganizes them into a unified testbed. It covers both general lexical phenomena, such as lexical collocations, and three fine-grained categories: idiomatic expressions, noun compounds, and verbal constructions. Through SemanticQA, we assess LMs of diverse architectures and scales in extraction, classification, and interpretation tasks, as well as sequential task compositions. We reveal substantial performance variation, particularly on tasks requiring semantic reasoning, highlighting differences in reasoning efficacy and semantic understanding of LMs, providing insights for pushing LMs with stronger comprehension on non-trivial semantic phrases. The evaluation harness and data of SemanticQA are available at [this https URL](https://github.com/jacklanda/SemanticQA).

## Submission history

From: Yang Liu [[view email](https://arxiv.org/show-email/ed5e06c8/2604.16593)] **[v1]** Fri, 17 Apr 2026 17:56:21 UTC (574 KB)

Similar Articles

VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models

arXiv cs.CL

VLegal-Bench is a cognitively grounded benchmark for evaluating large language models on Vietnamese legal reasoning tasks, containing 10,450 expert-annotated samples designed to address the gap in legal benchmarks for civil law systems. The benchmark assesses multiple levels of legal understanding through question answering, multi-step reasoning, and scenario-based problem solving, providing a replicable framework for evaluating LLMs in non-English, codified legal contexts.

Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap

arXiv cs.CL

This paper introduces CrossMath, a controlled multimodal reasoning benchmark that reveals a critical limitation in current vision-language models: they perform reasoning primarily in textual space rather than genuine vision-grounded reasoning, with visual input often degrading performance compared to text-only baselines. The authors propose fine-tuning approaches to mitigate this modality gap and improve multimodal reasoning capabilities.

Introducing SimpleQA

OpenAI Blog

OpenAI introduces SimpleQA, a new factuality benchmark of 4,326 short, fact-seeking questions designed to evaluate frontier language models on their ability to answer accurately without hallucinating. The dataset maintains high quality through dual independent annotation and rigorous inclusion criteria, yielding an estimated error rate of only ~3%, while GPT-4o scores less than 40%.