Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents

Hugging Face Daily Papers 06/04/26, 12:00 AM Papers

Summary

The paper challenges the assumption that closer cosine alignment between supervised latents and their visual targets improves accuracy in vision-language models, finding instead a strong negative correlation (r = -0.94) across five latent visual reasoning variants. It introduces PRISM, a pair of inference-time diagnostics, which show that the answer is decodable downstream of the latents rather than at them, and that the auxiliary loss reshapes the language model through shared parameters.
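To make the quantities in this summary concrete, here is a minimal sketch of the two alignment measures (cosine similarity and MSE) and of the Pearson correlation used to relate alignment to accuracy across variants. The function names and the toy numbers are illustrative assumptions, not the paper's code or results.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical placeholder values, one pair per model variant.
# These are NOT numbers from the paper; they only show the calculation.
toy_alignment = [0.20, 0.40, 0.50, 0.70, 0.90]
toy_accuracy = [0.70, 0.66, 0.65, 0.60, 0.58]

r = pearson_r(toy_alignment, toy_accuracy)
print(f"toy correlation between alignment and accuracy: r = {r:.2f}")
```

A negative r on real per-variant measurements would mean that variants with better-aligned latents tend to answer less accurately, which is the pattern the paper reports.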

Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vision-language models (VLMs). The field uses alignment between these latents and their visual targets, i.e., cosine similarity or mean squared error (MSE), as both the training loss and the quality metric, assuming that better alignment yields a better answer. We test this with a designed matrix of five LVR variants and find the assumption inverted: cosine alignment is negatively correlated with accuracy across all five (r=-0.94). To explain this, we introduce PRISM, a pair of inference-time diagnostics: a linear probe that asks where the answer is decodable, and a corruption test that asks whether the latent is load-bearing. The supervised latents are largely bypassed. Corrupting them shifts accuracy by at most four points. The answer is decodable downstream of the latent but not at it, and the size of this decodability gap predicts how much each variant relies on its latent under perturbation. Consistent with an Information Bottleneck reading of the loss, the auxiliary objective reshapes the language model via shared parameters rather than via the latent variable it nominally optimizes.
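The two PRISM diagnostics can be illustrated on synthetic data. This page does not specify the probe architecture or the corruption scheme, so the sketch below makes its own assumptions: a mean-difference linear classifier as the probe, Gaussian-noise replacement as the corruption, and a hypothetical toy "model" whose latent carries no answer information while a downstream state does. It shows the shape of the measurements, not the authors' implementation.

```python
import random


def make_toy_example(rng):
    """One synthetic example: (latent, downstream state, binary answer)."""
    y = rng.randint(0, 1)
    latent = [rng.gauss(0.0, 1.0) for _ in range(4)]  # independent of y
    downstream = [rng.gauss(2.0 * y - 1.0, 0.5) for _ in range(4)]  # encodes y
    return latent, downstream, y


def toy_answer_head(latent, downstream):
    """Hypothetical answer head that leans mostly on the downstream state."""
    score = sum(downstream) + 0.1 * sum(latent)
    return 1 if score > 0 else 0


def fit_linear_probe(feats, labels):
    """Linear probe: w = mean(class 1) - mean(class 0), boundary at the midpoint."""
    dim = len(feats[0])
    sums = {0: [0.0] * dim, 1: [0.0] * dim}
    counts = {0: 0, 1: 0}
    for f, y in zip(feats, labels):
        counts[y] += 1
        for i, value in enumerate(f):
            sums[y][i] += value
    mu0 = [s / counts[0] for s in sums[0]]
    mu1 = [s / counts[1] for s in sums[1]]
    w = [a - b for a, b in zip(mu1, mu0)]
    b = -sum(wi * (a + c) / 2 for wi, a, c in zip(w, mu1, mu0))
    return w, b


def probe_accuracy(probe, feats, labels):
    """Fraction of examples the linear probe classifies correctly."""
    w, b = probe
    hits = 0
    for f, y in zip(feats, labels):
        pred = 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0
        hits += pred == y
    return hits / len(labels)


def run_toy_prism(seed=0, n=1000):
    rng = random.Random(seed)
    train = [make_toy_example(rng) for _ in range(n)]
    test = [make_toy_example(rng) for _ in range(n)]
    y_train = [ex[2] for ex in train]
    y_test = [ex[2] for ex in test]
    results = {}
    # Diagnostic 1: where is the answer linearly decodable?
    for name, idx in (("latent", 0), ("downstream", 1)):
        probe = fit_linear_probe([ex[idx] for ex in train], y_train)
        results["probe_" + name] = probe_accuracy(
            probe, [ex[idx] for ex in test], y_test
        )
    results["decodability_gap"] = (
        results["probe_downstream"] - results["probe_latent"]
    )
    # Diagnostic 2: is the latent load-bearing? Replace it with fresh noise.
    results["clean_acc"] = sum(
        toy_answer_head(z, h) == y for z, h, y in test
    ) / n
    results["corrupted_acc"] = sum(
        toy_answer_head([rng.gauss(0.0, 1.0) for _ in z], h) == y
        for z, h, y in test
    ) / n
    return results


if __name__ == "__main__":
    for key, value in run_toy_prism().items():
        print(f"{key}: {value:.3f}")
```

In this toy setup the probe on the latent stays near chance, the probe on the downstream state is near perfect, and corrupting the latent barely moves accuracy. That is the bypass pattern the abstract describes, here produced by construction rather than measured on a real model.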


Cached at: 06/09/26, 08:41 AM


Source: https://huggingface.co/papers/2606.05753

Abstract

Research challenges the conventional wisdom in latent visual reasoning by demonstrating that cosine alignment between supervised latents and visual targets negatively correlates with model accuracy, while revealing that answers are decoded downstream from latents rather than within them.



Similar Articles

Large Vision-Language Models Get Lost in Attention

Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity

Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap

Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

Beyond English: Uncovering the Multilingual Gap in Vision-Language-Action Models
