Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings

Hugging Face Daily Papers 06/13/26, 12:00 AM Papers

Summary

SAGA framework uses frozen multimodal large language models to provide attribute-aware supervision for vision encoders via Group Relative Policy Optimization, improving zero-shot image retrieval by 3–6 points on fine-grained benchmarks.

Vision encoders for retrieval are typically trained with class-label supervision: each training pair reduces to a scalar that uniformly pushes the embedding apart or pulls it together, as if every visual attribute either differed or matched. A multimodal large language model (MLLM), shown the same pair, can articulate those attributes and use them to predict whether the images share a class. We propose SAGA, a framework that turns this language-grounded, attribute-aware perception into a training signal for the encoder itself. Specifically, we use Group Relative Policy Optimization (GRPO) to reward the MLLM for correct predictions on the vision encoder's tokens. Since correct predictions require those tokens to expose the specific attributes that differ or match between the pair, the gradient pushes the encoder to encode them, replacing the uniform pair-level scalar with attribute-resolved supervision. An auxiliary attention-distillation loss anchors the encoder's embedding to tokens the MLLM attended to, and a standard metric-learning loss shapes the embedding geometry for nearest-neighbour retrieval. The MLLM is frozen throughout and discarded at inference, matching the deployment cost of a metric-learning baseline. SAGA improves Recall@1 by 3 to 6 points over state-of-the-art baselines on CUB-200-2011, Cars-196, FGVC-Aircraft, and iNaturalist Aves on zero-shot image retrieval.

Original Article

View Cached Full Text

Cached at: 06/17/26, 07:52 PM

Paper page - Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings

Source: https://huggingface.co/papers/2606.15134

Abstract

SAGA framework uses multimodal large language models to provide attribute-aware supervision for vision encoders through Group Relative Policy Optimization, improving zero-shot image retrieval performance.

Vision encodersfor retrieval are typically trained withclass-label supervision: each training pair reduces to a scalar that uniformly pushes the embedding apart or pulls it together, as if every visual attribute either differed or matched. Amultimodal large language model(MLLM), shown the same pair, can articulate those attributes and use them to predict whether the images share a class. We propose SAGA, a framework that turns this language-grounded,attribute-aware perceptioninto a training signal for the encoder itself. Specifically, we useGroup Relative Policy Optimization(GRPO) to reward the MLLM for correct predictions on the vision encoder’s tokens. Since correct predictions require those tokens to expose the specific attributes that differ or match between the pair, the gradient pushes the encoder to encode them, replacing the uniform pair-level scalar with attribute-resolved supervision. An auxiliaryattention-distillation lossanchors the encoder’s embedding to tokens the MLLM attended to, and a standardmetric-learning lossshapes the embedding geometry for nearest-neighbour retrieval. The MLLM is frozen throughout and discarded at inference, matching the deployment cost of a metric-learning baseline. SAGA improvesRecall@1by 3 to 6 points over state-of-the-art baselines on CUB-200-2011, Cars-196, FGVC-Aircraft, and iNaturalist Aves onzero-shot image retrieval.

View arXiv page View PDF Project page Add to collection

Get this paper in your agent:

hf papers read 2606\.15134

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2606.15134 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2606.15134 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2606.15134 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings

Paper page - Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings

Abstract

Models citing this paper0

Datasets citing this paper0

Spaces citing this paper0

Collections including this paper0

Similar Articles

Stateful Visual Encoders for Vision-Language Models

Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

Submit Feedback

Similar Articles

Stateful Visual Encoders for Vision-Language Models

Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation