Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models
Summary
Researchers present SemanticQA, a benchmark that evaluates language models on semantic phrase processing tasks, including idioms, noun compounds, and verbal constructions, and reveals significant performance variation across model architectures and scales, particularly on tasks requiring semantic reasoning.
Source: [https://arxiv.org/abs/2604.16593](https://arxiv.org/abs/2604.16593) · [View PDF](https://arxiv.org/pdf/2604.16593)

> Abstract: We present SemanticQA, an evaluation suite designed to assess language models (LMs) on semantic phrase processing tasks. The benchmark consolidates existing multiword expression (MWE) resources and reorganizes them into a unified testbed. It covers both general lexical phenomena, such as lexical collocations, and three fine-grained categories: idiomatic expressions, noun compounds, and verbal constructions. Through SemanticQA, we assess LMs of diverse architectures and scales on extraction, classification, and interpretation tasks, as well as sequential task compositions. We reveal substantial performance variation, particularly on tasks requiring semantic reasoning, highlighting differences in the reasoning efficacy and semantic understanding of LMs and offering insights for building LMs with stronger comprehension of non-trivial semantic phrases. The evaluation harness and data of SemanticQA are available at [this https URL](https://github.com/jacklanda/SemanticQA).

## Submission history

From: Yang Liu [[view email](https://arxiv.org/show-email/ed5e06c8/2604.16593)]

**[v1]** Fri, 17 Apr 2026 17:56:21 UTC (574 KB)
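The abstract describes per-category tasks (extraction, classification, interpretation) scored over a unified MWE testbed, but does not spell out the harness's API. As a minimal, self-contained sketch of what a classification-style evaluation loop might look like, assuming hypothetical names throughout (`MWEExample`, `classify_phrase`, and a trivial keyword heuristic standing in for a real LM call, none of which are from the SemanticQA codebase):

```python
from dataclasses import dataclass

@dataclass
class MWEExample:
    """One benchmark item: a sentence, the target phrase, and its gold category."""
    sentence: str
    phrase: str
    category: str  # "idiom", "noun_compound", or "verbal_construction"

def classify_phrase(sentence: str, phrase: str) -> str:
    """Stand-in for an LM call: a toy heuristic, purely for illustration."""
    idiom_cues = {"pain in the neck", "kick the bucket", "spill the beans"}
    if phrase.lower() in idiom_cues:
        return "idiom"
    # Multiword, all-lowercase phrases default to noun compounds here.
    if " " in phrase and all(w.islower() for w in phrase.split()):
        return "noun_compound"
    return "verbal_construction"

def evaluate(examples: list[MWEExample]) -> float:
    """Accuracy of the classifier over the gold categories."""
    correct = sum(
        classify_phrase(e.sentence, e.phrase) == e.category for e in examples
    )
    return correct / len(examples)
```

A real harness would replace `classify_phrase` with a prompted model call and report accuracy per fine-grained category rather than a single aggregate, which is what lets a benchmark like this surface the per-task variation the abstract reports.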
Similar Articles
VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models
VLegal-Bench is a cognitively grounded benchmark for evaluating large language models on Vietnamese legal reasoning tasks, containing 10,450 expert-annotated samples designed to address the gap in legal benchmarks for civil law systems. The benchmark assesses multiple levels of legal understanding through question answering, multi-step reasoning, and scenario-based problem solving, providing a replicable framework for evaluating LLMs in non-English, codified legal contexts.
@dbreunig: Reasoning models are great at understanding nuance and natural language. This nuance hasn't trickled down to retrieval …
A tweet highlights that while reasoning models excel at nuance and natural language understanding, this capability has not yet carried over to retrieval systems, pointing to a key bottleneck in AI pipelines.
Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap
This paper introduces CrossMath, a controlled multimodal reasoning benchmark that reveals a critical limitation in current vision-language models: they perform reasoning primarily in textual space rather than genuine vision-grounded reasoning, with visual input often degrading performance compared to text-only baselines. The authors propose fine-tuning approaches to mitigate this modality gap and improve multimodal reasoning capabilities.
Introducing SimpleQA
OpenAI introduces SimpleQA, a new factuality benchmark dataset with 4,326 short fact-seeking questions designed to evaluate frontier language models on their ability to provide accurate answers without hallucination. The dataset achieves high quality through dual independent annotation and rigorous inclusion criteria, with an estimated error rate of only ~3%; GPT-4o scores less than 40%.
From Benchmarking to Reasoning: A Dual-Aspect, Large-Scale Evaluation of LLMs on Vietnamese Legal Text
This paper presents a comprehensive dual-aspect evaluation framework for large language models on Vietnamese legal text simplification, combining quantitative benchmarking (Accuracy, Readability, Consistency) with qualitative error analysis across GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1.