Choosing features for classifying multiword expressions

arXiv cs.CL 05/13/26, 04:00 AM Papers

Summary

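This paper examines how to choose features for classifying multiword expressions (MWEs). It notes that features differ in how reliably they assign MWEs to classes, and it outlines an enhanced classification intended to suit many languages.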

arXiv:2605.11779v1 Announce Type: new

Abstract: Multiword expressions (MWEs) are a heterogeneous set with a glaring need for classifications. Designing a satisfactory classification involves choosing features. In the case of MWEs, many features are a priori available. Not all features are equal in terms of how reliably MWEs can be assigned to classes. Accordingly, resulting classifications may be more or less fruitful for computational use. I outline an enhanced classification. In order to increase its suitability for many languages, I use previous works taking into account various languages.

Cached at: 05/13/26, 06:18 AM

# Choosing features for classifying multiword expressions
Source: [https://arxiv.org/abs/2605.11779](https://arxiv.org/abs/2605.11779)
