HNC: Leveraging Hard Negative Captions towards Models with Fine-Grained Visual-Linguistic Comprehension Capabilities

arXiv cs.CL Papers

Summary

The paper introduces Hard Negative Captions (HNC), a dataset and method for training vision-language models to achieve fine-grained comprehension by addressing weak associations in web-collected image-text pairs.

arXiv:2605.06157v1 Announce Type: new

Abstract: Image-Text-Matching (ITM) is one of the de facto methods of learning generalized representations from a large corpus in Vision and Language (VL). However, due to the weak association between web-collected image-text pairs, models fail to show a fine-grained understanding of the combined semantics of these modalities. To address this issue, we propose Hard Negative Captions (HNC): an automatically created dataset containing foiled hard negative captions for ITM training towards achieving fine-grained cross-modal comprehension in VL. Additionally, we provide a challenging manually created test set for benchmarking models on a fine-grained cross-modal mismatch task with varying levels of compositional complexity. Our results show the effectiveness of training on HNC by improving the models' zero-shot capabilities in detecting mismatches on diagnostic tasks and performing robustly under noisy visual input scenarios. Also, we demonstrate that HNC models yield a comparable or better initialization for fine-tuning.
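To make the training setup concrete: ITM with hard negatives typically treats each (image, caption) pair as a binary classification problem, where the matched caption should be scored as a match and each foiled hard-negative caption as a mismatch. The sketch below is a minimal, hypothetical version of such a loss in plain Python; the scoring head, logit values, and averaging scheme are illustrative assumptions, not the paper's actual implementation.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw logit to a match probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def itm_loss(pos_logit: float, neg_logits: list[float]) -> float:
    """Binary ITM loss over one image.

    pos_logit:  score for the matched (image, caption) pair
                from a hypothetical cross-modal scoring head.
    neg_logits: scores for foiled hard-negative captions of
                the same image (label = mismatch).
    Returns the mean binary cross-entropy over all pairs.
    """
    # -log p(match) for the positive caption
    loss = -math.log(sigmoid(pos_logit))
    # -log p(mismatch) for each hard-negative caption
    for logit in neg_logits:
        loss += -math.log(1.0 - sigmoid(logit))
    return loss / (1 + len(neg_logits))
```

Because hard negatives differ from the positive caption in only a small detail (a single foiled word or relation), the model cannot drive their logits down without attending to fine-grained cross-modal semantics, which is the intended training signal.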

Source: [https://arxiv.org/abs/2605.06157](https://arxiv.org/abs/2605.06157)
[View PDF](https://arxiv.org/pdf/2605.06157)


## Submission history

From: Esra Dönmez [[view email](https://arxiv.org/show-email/26fa9ba5/2605.06157)] **[v1]** Wed, 6 May 2026 14:01:47 UTC (21,966 KB)
