Choosing features for classifying multiword expressions

arXiv cs.CL 05/13/26, 04:00 AM Papers

Summary

This paper examines how the choice of features affects how reliably multiword expressions (MWEs) can be assigned to classes, and outlines an enhanced classification intended to suit many languages.

arXiv:2605.11779v1 (announce type: new)

Abstract: Multiword expressions (MWEs) are a heterogeneous set with a glaring need for classifications. Designing a satisfactory classification involves choosing features. In the case of MWEs, many features are a priori available. Not all features are equal in terms of how reliably MWEs can be assigned to classes. Accordingly, resulting classifications may be more or less fruitful for computational use. I outline an enhanced classification. In order to increase its suitability for many languages, I use previous works taking into account various languages.

Cached at: 05/13/26, 06:18 AM

# Choosing features for classifying multiword expressions
Source: [https://arxiv.org/abs/2605.11779](https://arxiv.org/abs/2605.11779)

Similar Articles

EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions

Hugging Face Daily Papers

This paper introduces the EDU-CIRCUIT-HW dataset for evaluating multimodal large language models on real-world university-level STEM handwritten solutions, revealing significant recognition limitations and proposing a hybrid approach that combines automated recognition with minimal human oversight to enhance grading robustness.

XPERT: Expert Knowledge Transfer for Effective Training of Language Models

arXiv cs.CL

The paper introduces XPERT, a framework that extracts and reuses expert knowledge from pre-trained Mixture-of-Experts (MoE) language models to improve training efficiency and performance in downstream models.

Parameter-Efficient Multi-View Proficiency Estimation: From Discriminative Classification to Generative Feedback

Hugging Face Daily Papers

This paper introduces three parameter-efficient methods for multi-view proficiency estimation on the Ego-Exo4D dataset, shifting from discriminative classification to generative feedback. The proposed models achieve state-of-the-art accuracy with significantly fewer parameters and training epochs than video-transformer baselines.

Data Mixing for Large Language Models Pretraining: A Survey and Outlook

arXiv cs.CL

This paper presents a comprehensive survey of data mixing methods for LLM pretraining, formalizing the problem as bilevel optimization and introducing a taxonomy that distinguishes static (rule-based and learning-based) from dynamic (adaptive and externally guided) mixing approaches. The authors analyze trade-offs, identify cross-cutting challenges, and outline future research directions including finer-grained domain partitioning and pipeline-aware designs.

MEDSYN: Benchmarking Multi-Evidence Synthesis in Complex Clinical Cases for Multimodal Large Language Models

arXiv cs.CL

MEDSYN is a multilingual multimodal benchmark for evaluating MLLMs on complex clinical cases with up to 7 distinct visual evidence types per case. The study reveals that while frontier models match human experts on differential diagnosis generation, all MLLMs show significant gaps in final diagnosis selection due to poor synthesis of heterogeneous clinical evidence.
