HNC: Leveraging Hard Negative Captions towards Models with Fine-Grained Visual-Linguistic Comprehension Capabilities
Summary
The paper introduces Hard Negative Captions (HNC), an automatically created dataset of foiled captions for Image-Text-Matching training, aimed at giving vision-language models fine-grained comprehension despite the weak association between web-collected image-text pairs.
# HNC: Leveraging Hard Negative Captions towards Models with Fine-Grained Visual-Linguistic Comprehension Capabilities

Source: [https://arxiv.org/abs/2605.06157](https://arxiv.org/abs/2605.06157) · [View PDF](https://arxiv.org/pdf/2605.06157)

> Abstract: Image-Text-Matching (ITM) is one of the de facto methods of learning generalized representations from a large corpus in Vision and Language (VL). However, due to the weak association between web-collected image-text pairs, models fail to show a fine-grained understanding of the combined semantics of these modalities. To address this issue, we propose Hard Negative Captions (HNC): an automatically created dataset containing foiled hard negative captions for ITM training towards achieving fine-grained cross-modal comprehension in VL. Additionally, we provide a challenging manually created test set for benchmarking models on a fine-grained cross-modal mismatch task with varying levels of compositional complexity. Our results show the effectiveness of training on HNC by improving the models' zero-shot capabilities in detecting mismatches on diagnostic tasks and performing robustly under noisy visual input scenarios. Also, we demonstrate that HNC models yield a comparable or better initialization for fine-tuning.

## Submission history

From: Esra Dönmez [[view email](https://arxiv.org/show-email/26fa9ba5/2605.06157)]

**[v1]** Wed, 6 May 2026 14:01:47 UTC (21,966 KB)
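The abstract describes ITM training against foiled hard negative captions. As a minimal sketch of what such an objective could look like, the snippet below implements a margin-based triplet loss over embeddings of an image, its true caption, and a foiled caption. Everything here (the function name `itm_hnc_loss`, the margin value, the choice of cosine similarity) is an illustrative assumption, not the paper's actual training recipe.

```python
# Sketch of ITM training with hard negative captions (HNC-style).
# Assumption: a dual encoder has already produced L2-normalizable
# embeddings for the image, the true caption, and the foiled caption.
import torch
import torch.nn.functional as F

def itm_hnc_loss(image_emb, pos_caption_emb, neg_caption_emb, margin=0.2):
    """Triplet-style ITM loss: the image should score higher with its
    true caption than with a minimally edited (foiled) negative."""
    # Cosine similarity between the image and each caption.
    pos_sim = F.cosine_similarity(image_emb, pos_caption_emb, dim=-1)
    neg_sim = F.cosine_similarity(image_emb, neg_caption_emb, dim=-1)
    # Penalize batches where the foiled caption is not separated from
    # the true caption by at least `margin`.
    return F.relu(margin - pos_sim + neg_sim).mean()

# Toy usage with random vectors standing in for encoder outputs.
if __name__ == "__main__":
    batch, dim = 8, 512
    img = F.normalize(torch.randn(batch, dim), dim=-1)
    pos = F.normalize(torch.randn(batch, dim), dim=-1)
    neg = F.normalize(torch.randn(batch, dim), dim=-1)
    print(itm_hnc_loss(img, pos, neg).item())
```

Because the foiled captions differ from the true ones by only a small edit, the negative term is much harder than a randomly sampled caption would be, which is the property the dataset is built to exploit.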
Similar Articles
Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding
Proposes Slipform, a training framework that uses lexical concreteness to select harder negatives and a margin-based Cement loss, boosting compositional reasoning in vision-language models.
Disparities In Negation Understanding Across Languages In Vision-Language Models
MIT researchers release the first multilingual negation benchmark covering seven languages and show VLMs like CLIP struggle with non-Latin scripts, while MultiCLIP and SpaceVLM offer uneven improvements across languages.
HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory
This paper introduces HyperLens, a high-resolution probe to quantify cognitive effort in LLMs by tracing fine-grained confidence trajectories across layers. It reveals that complex tasks require higher cognitive effort and demonstrates how Supervised Fine-Tuning can reduce this effort, potentially degrading performance.
HyperGVL: Benchmarking and Improving Large Vision-Language Models in Hypergraph Understanding and Reasoning
HyperGVL introduces the first benchmark for evaluating Large Vision-Language Models on hypergraph understanding and reasoning, featuring 84,000 QA samples across 12 tasks and real-world applications. The paper also proposes WiseHyGR, a generalizable router that enhances LVLM performance through adaptive hypergraph representations.
MNAFT: modality neuron-aware fine-tuning of multimodal large language models for image translation
MNAFT (Modality Neuron-Aware Fine-Tuning) is a novel approach that selectively updates language-specific and language-agnostic neurons in multimodal large language models to improve image translation while preserving pre-trained knowledge. The method outperforms state-of-the-art image translation techniques including cascaded models and standard fine-tuning approaches.