Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents
Summary
The paper challenges the assumption that cosine alignment between supervised latents and visual targets improves accuracy in vision-language models, finding instead a strong negative correlation between alignment and accuracy (r=-0.94) across five latent visual reasoning variants. It introduces PRISM, a pair of inference-time diagnostics, which shows that answers are decoded downstream of the latents rather than at them, and that the auxiliary loss reshapes the language model via shared parameters.
Source: https://huggingface.co/papers/2606.05753
Abstract
Research challenges the conventional wisdom in latent visual reasoning by demonstrating that cosine alignment between supervised latents and visual targets negatively correlates with model accuracy, while revealing that answers are decoded downstream from latents rather than within them.
Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vision-language models (VLMs). The field uses alignment between these latents and their visual targets, i.e., cosine similarity or mean squared error (MSE), as both the training loss and the quality metric, assuming that better alignment yields a better answer. We test this with a designed matrix of five LVR variants and find the assumption inverted: cosine alignment is negatively correlated with accuracy across all five (r=-0.94). To explain this, we introduce PRISM, a pair of inference-time diagnostics: a linear probe that asks where the answer is decodable, and a corruption test that asks whether the latent is load-bearing. The supervised latents are largely bypassed. Corrupting them shifts accuracy by at most four points. The answer is decodable downstream of the latent but not at it, and the size of this decodability gap predicts how much each variant relies on its latent under perturbation. Consistent with an Information Bottleneck reading of the loss, the auxiliary objective reshapes the language model via shared parameters rather than via the latent variable it nominally optimizes.
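The quantities named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' PRISM implementation: the function names, the ridge-regularised least-squares probe, the Gaussian-noise corruption scheme, and the `answer_fn` callable (a stand-in for running the model from a given set of latents to a predicted answer) are all assumptions made for illustration.

```python
import numpy as np

def cosine_alignment(latents, targets):
    """Mean cosine similarity between row-paired latent and target vectors."""
    latents = np.asarray(latents, dtype=float)
    targets = np.asarray(targets, dtype=float)
    num = np.sum(latents * targets, axis=1)
    den = np.linalg.norm(latents, axis=1) * np.linalg.norm(targets, axis=1)
    return float(np.mean(num / np.maximum(den, 1e-12)))

def pearson_r(x, y):
    """Pearson correlation, e.g. between per-variant alignment and accuracy."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

def probe_accuracy(train_x, train_y, test_x, test_y, ridge=1e-3):
    """Fit a linear probe (ridge regression on one-hot labels); return test accuracy.

    Assumption: a closed-form least-squares probe stands in for whatever
    linear probe the paper actually trains.
    """
    train_x = np.asarray(train_x, dtype=float)
    test_x = np.asarray(test_x, dtype=float)
    train_y = np.asarray(train_y).astype(int)
    test_y = np.asarray(test_y).astype(int)
    n_classes = int(max(train_y.max(), test_y.max())) + 1
    onehot = np.eye(n_classes)[train_y]
    # Append a constant column so the probe has a bias term.
    a = np.hstack([train_x, np.ones((len(train_x), 1))])
    w = np.linalg.solve(a.T @ a + ridge * np.eye(a.shape[1]), a.T @ onehot)
    b = np.hstack([test_x, np.ones((len(test_x), 1))])
    return float(np.mean(np.argmax(b @ w, axis=1) == test_y))

def corruption_shift(answer_fn, latents, labels, rng):
    """Accuracy drop when latents are replaced by scale-matched Gaussian noise.

    Assumption: answer_fn is a hypothetical callable mapping a batch of
    latents to predicted answer ids; the noise scheme is illustrative only.
    """
    latents = np.asarray(latents, dtype=float)
    labels = np.asarray(labels)
    noise = rng.normal(size=latents.shape) * latents.std(axis=0, keepdims=True)
    clean_acc = np.mean(answer_fn(latents) == labels)
    corrupt_acc = np.mean(answer_fn(noise) == labels)
    return float(clean_acc - corrupt_acc)
```

Under these assumptions, a decodability gap could be estimated as `probe_accuracy` on hidden states taken downstream of the latent minus `probe_accuracy` on the latent itself, and a `corruption_shift` near zero would indicate that the latent is bypassed rather than load-bearing.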
Similar Articles
Large Vision-Language Models Get Lost in Attention
This research paper analyzes the internal mechanics of Large Vision-Language Models (LVLMs) using information theory, revealing that attention mechanisms may be redundant while Feed-Forward Networks drive semantic innovation. The authors demonstrate that replacing learned attention weights with random values can yield comparable performance, suggesting current models 'get lost in attention'.
Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity
This paper demonstrates that cosine similarity is a poor proxy for assessing layer importance in LLMs, and proposes using the actual accuracy drop from layer removal as a more robust metric.
Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap
This paper introduces CrossMath, a controlled multimodal reasoning benchmark that reveals a critical limitation in current vision-language models: they perform reasoning primarily in textual space rather than genuine vision-grounded reasoning, with visual input often degrading performance compared to text-only baselines. The authors propose fine-tuning approaches to mitigate this modality gap and improve multimodal reasoning capabilities.
Why Far Looks Up: Probing Spatial Representation in Vision-Language Models
Investigates spatial representation in vision-language models, revealing a consistent bias where models conflate vertical image position with distance, and introduces SpatialTunnel synthetic benchmark to expose this shortcut; finds that better disentangled spatial representations improve robustness.
Beyond English: Uncovering the Multilingual Gap in Vision-Language-Action Models
This paper presents the first systematic study of multilingual instruction following in Vision-Language-Action (VLA) models, revealing significant performance degradation when models trained on English are evaluated on other languages. The authors propose Multilingual Principal Component Alignment (MPCA) to reduce the multilingual performance gap.