Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding

Hugging Face Daily Papers 04/14/26, 12:00 AM Papers

Summary

Proposes Slipform, a training framework that uses lexical concreteness to select harder negatives and a margin-based Cement loss, boosting compositional reasoning in vision-language models.

Vision-Language Models demonstrate remarkable capabilities but often struggle with compositional reasoning, exhibiting vulnerabilities regarding word order and attribute binding. This limitation arises from a scarcity of informative samples needed to differentiate subtle semantic variations during contrastive pretraining. Although hard negative mining offers a promising remedy, existing methods lack explicit mechanisms to dictate which linguistic elements undergo modification. Instead of engineering generative architectures, this study establishes lexical concreteness as a fundamental determinant of negative sample efficacy. Modifying highly concrete terms generates more pronounced structural and visual discrepancies, providing a substantially stronger learning signal. Leveraging this principle, ConcretePlant is proposed to systematically isolate and manipulate perceptually grounded concepts. Analysis of the InfoNCE loss further reveals a severe gradient imbalance, where easily distinguishable pairs disproportionately overwhelm the optimization process and restrict the bandwidth available for nuanced learning. To resolve this degradation, the Cement loss is formulated using a margin-based approach. By correlating psycholinguistic scores with sample difficulty, this objective dynamically calibrates the penalization applied to individual training pairs. Comprehensive evaluations substantiate these theoretical claims. The integrated framework, designated as Slipform, achieves state-of-the-art accuracy across diverse compositional evaluation benchmarks, general cross-modal retrieval, and single- and multi-label linear probing.
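To make the idea of concreteness-guided negative mining more tangible, the following is a minimal Python sketch, not the ConcretePlant procedure itself, which the abstract does not specify. It picks the most concrete swappable word in a caption and replaces it to form a hard negative. The lexicon `CONCRETENESS`, the swap table `SWAPS`, the 1-to-5 rating scale, and both function names are assumptions made purely for illustration.

```python
# Hypothetical toy lexicon: higher score = more concrete (assumed 1-to-5 scale).
# The values are invented for illustration, not taken from any published norms.
CONCRETENESS = {
    "dog": 4.9, "ball": 5.0, "red": 4.2,
    "chases": 3.5, "happy": 2.6, "idea": 1.6,
}

# Hypothetical replacement table used to build the negative caption.
SWAPS = {"dog": "cat", "ball": "frisbee", "red": "blue", "happy": "sad"}


def most_concrete_swappable(tokens):
    """Return (index, score) of the most concrete token that has a swap, or None."""
    candidates = [
        (CONCRETENESS[tok], i)
        for i, tok in enumerate(tokens)
        if tok in CONCRETENESS and tok in SWAPS
    ]
    if not candidates:
        return None
    score, index = max(candidates)
    return index, score


def make_hard_negative(caption):
    """Swap the most concrete word; return (negative, swapped_word, score) or None."""
    tokens = caption.lower().split()
    picked = most_concrete_swappable(tokens)
    if picked is None:
        return None
    index, score = picked
    negative = list(tokens)
    negative[index] = SWAPS[tokens[index]]
    return " ".join(negative), tokens[index], score


if __name__ == "__main__":
    print(make_hard_negative("A happy dog chases a red ball"))
```

In this toy setup the swap lands on a perceptually grounded noun rather than on an abstract word such as "happy", which mirrors the abstract's claim that editing concrete terms yields more visible discrepancies. The returned score could then be passed on to a loss that weights pairs by concreteness.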

Original Article


Cached at: 04/21/26, 11:27 AM


Source: https://huggingface.co/papers/2604.13313

Abstract

Vision-language models face challenges in compositional reasoning due to insufficient samples for distinguishing subtle semantics, which are addressed through lexical concreteness-based negative sample selection and a novel margin-based loss function.

Vision-Language Models demonstrate remarkable capabilities but often struggle with compositional reasoning, exhibiting vulnerabilities regarding word order and attribute binding. This limitation arises from a scarcity of informative samples needed to differentiate subtle semantic variations during contrastive pretraining. Although hard negative mining offers a promising remedy, existing methods lack explicit mechanisms to dictate which linguistic elements undergo modification. Instead of engineering generative architectures, this study establishes lexical concreteness as a fundamental determinant of negative sample efficacy. Modifying highly concrete terms generates more pronounced structural and visual discrepancies, providing a substantially stronger learning signal. Leveraging this principle, ConcretePlant is proposed to systematically isolate and manipulate perceptually grounded concepts. Analysis of the InfoNCE loss further reveals a severe gradient imbalance, where easily distinguishable pairs disproportionately overwhelm the optimization process and restrict the bandwidth available for nuanced learning. To resolve this degradation, the Cement loss is formulated using a margin-based approach. By correlating psycholinguistic scores with sample difficulty, this objective dynamically calibrates the penalization applied to individual training pairs. Comprehensive evaluations substantiate these theoretical claims. The integrated framework, designated as Slipform, achieves state-of-the-art accuracy across diverse compositional evaluation benchmarks, general cross-modal retrieval, and single- and multi-label linear probing.
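The gradient imbalance and the margin-based remedy described in the abstract can be illustrated with a small numeric sketch. This is not the paper's Cement loss, whose exact form is not given here. It is a standard InfoNCE computation for a single anchor, plus an assumed additive margin on negative similarities that grows linearly with a concreteness score. The function names, the margin mapping, the 1-to-5 scale, and all similarity values are hypothetical.

```python
import math


def infonce_with_margin(pos_sim, neg_sims, margins=None, temperature=0.07):
    """InfoNCE for one anchor, with an optional additive margin per negative.

    Returns (loss, neg_probs). neg_probs[j] is the softmax probability of
    negative j, which equals the gradient of the loss with respect to that
    negative's logit.
    """
    if margins is None:
        margins = [0.0] * len(neg_sims)
    logits = [pos_sim / temperature]
    logits += [(s + m) / temperature for s, m in zip(neg_sims, margins)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[0]), probs[1:]


def concreteness_margin(score, max_margin=0.2, lo=1.0, hi=5.0):
    """Assumed mapping: linearly scale a concreteness score in [lo, hi] to a margin."""
    return max_margin * (score - lo) / (hi - lo)


def hard_share(neg_probs):
    """Fraction of the negatives' gradient mass carried by the last (hard) negative."""
    return neg_probs[-1] / sum(neg_probs)


# Invented similarities: 63 easy in-batch negatives and one mined hard negative.
pos_sim = 0.30
neg_sims = [0.05] * 63 + [0.27]

loss_plain, probs_plain = infonce_with_margin(pos_sim, neg_sims)

# Apply a margin only to the hard negative, scaled by its concreteness score.
margins = [0.0] * 63 + [concreteness_margin(4.9)]
loss_margin, probs_margin = infonce_with_margin(pos_sim, neg_sims, margins)

share_plain = hard_share(probs_plain)
share_margin = hard_share(probs_margin)

if __name__ == "__main__":
    print(f"hard-negative gradient share without margin: {share_plain:.3f}")
    print(f"hard-negative gradient share with margin:    {share_margin:.3f}")
```

With these invented numbers the easy negatives jointly hold most of the gradient mass in plain InfoNCE, and the margin shifts that mass toward the hard negative. Whether the actual objective raises or lowers the penalty for more concrete edits is left open by the abstract, so the direction of the mapping here is an assumption.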


Get this paper in your agent:

hf papers read 2604.13313

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

HNC: Leveraging Hard Negative Captions towards Models with Fine-Grained Visual-Linguistic Comprehension Capabilities

Adversarial Concept Search: Predicting Compositional Errors From Feature Geometry

CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

COMPASS: Grounding Composition-Intent Guidance in Unified Multimodal Models

HELP WITH RESEARCH: Observation - Semantically Dense Context Produces Strong Late-Layer Divergence Without Jailbreak Prompts [D]
