Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings
Summary
SAGA framework uses frozen multimodal large language models to provide attribute-aware supervision for vision encoders via Group Relative Policy Optimization, improving zero-shot image retrieval by 3–6 points on fine-grained benchmarks.
View Cached Full Text
Cached at: 06/17/26, 07:52 PM
Paper page - Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings
Source: https://huggingface.co/papers/2606.15134
Abstract
SAGA framework uses multimodal large language models to provide attribute-aware supervision for vision encoders through Group Relative Policy Optimization, improving zero-shot image retrieval performance.
Vision encodersfor retrieval are typically trained withclass-label supervision: each training pair reduces to a scalar that uniformly pushes the embedding apart or pulls it together, as if every visual attribute either differed or matched. Amultimodal large language model(MLLM), shown the same pair, can articulate those attributes and use them to predict whether the images share a class. We propose SAGA, a framework that turns this language-grounded,attribute-aware perceptioninto a training signal for the encoder itself. Specifically, we useGroup Relative Policy Optimization(GRPO) to reward the MLLM for correct predictions on the vision encoder’s tokens. Since correct predictions require those tokens to expose the specific attributes that differ or match between the pair, the gradient pushes the encoder to encode them, replacing the uniform pair-level scalar with attribute-resolved supervision. An auxiliaryattention-distillation lossanchors the encoder’s embedding to tokens the MLLM attended to, and a standardmetric-learning lossshapes the embedding geometry for nearest-neighbour retrieval. The MLLM is frozen throughout and discarded at inference, matching the deployment cost of a metric-learning baseline. SAGA improvesRecall@1by 3 to 6 points over state-of-the-art baselines on CUB-200-2011, Cars-196, FGVC-Aircraft, and iNaturalist Aves onzero-shot image retrieval.
View arXiv pageView PDFProject pageAdd to collection
Get this paper in your agent:
hf papers read 2606\.15134
Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash
Models citing this paper0
No model linking this paper
Cite arxiv.org/abs/2606.15134 in a model README.md to link it from this page.
Datasets citing this paper0
No dataset linking this paper
Cite arxiv.org/abs/2606.15134 in a dataset README.md to link it from this page.
Spaces citing this paper0
No Space linking this paper
Cite arxiv.org/abs/2606.15134 in a Space README.md to link it from this page.
Collections including this paper0
No Collection including this paper
Add this paper to acollectionto link it from this page.
Similar Articles
Stateful Visual Encoders for Vision-Language Models
This paper introduces a stateful visual encoder for vision-language models that conditions visual representations on prior features, enabling better visual comparison in multi-image and agentic settings. The method shows consistent improvements across tasks such as cross-image spatial aggregation and longitudinal radiology.
Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation
This paper proposes a novel approach that conditions diffusion models on Multimodal Large Language Models (MLLMs) for subject-driven image generation, using VAE-based identity conditioning and a Dual Layer Aggregation module to improve both semantic understanding and identity preservation while mitigating copy-paste artifacts.
Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning
This paper proposes GASP, a framework that injects geometric priors into vision-language models via deep supervision with contrastive and depth consistency losses, achieving significant improvements on 3D spatial reasoning benchmarks without using 3D VQA data.
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
This paper introduces LLaVA-UHD v4, which improves visual encoding efficiency in multimodal large language models by using slice-based encoding and intra-ViT early compression. It reduces computational costs by over 55% while maintaining or improving performance on high-resolution image tasks.
Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation
This paper proposes DPVR-LF, a modality-asymmetric routing framework for MLLMs that routes vision tokens at their saturation point into a lightweight side branch and performs late fusion, reducing visual computation while maintaining competitive performance.