# AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce
Source: [https://arxiv.org/html/2604.20135](https://arxiv.org/html/2604.20135)
Biao Zhang, Lixin Chen, Bin Zhang (equal contribution); Zongwei Wang (project leader); Tong Liu (corresponding author); Bo Zheng
Taobao & Tmall Group of Alibaba
###### Abstract
Multimodal representation is crucial for E-commerce tasks such as identical product retrieval. Large representation models (e.g., VLM2Vec) demonstrate strong multimodal understanding capabilities, yet they struggle with fine-grained semantic comprehension, which is essential for distinguishing highly similar items. To address this, we propose Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning (AFMRL), which defines fine-grained product understanding as an attribute generation task. It leverages the generative power of Multimodal Large Language Models (MLLMs) to extract key attributes from product images and text, and enhances representation learning through a two-stage training framework: 1) Attribute-Guided Contrastive Learning (AGCL), where the key attributes generated by the MLLM are used during image-text contrastive learning to identify hard samples and filter out noisy false negatives; 2) Retrieval-aware Attribute Reinforcement (RAR), where the improved retrieval performance of the representation model after attribute integration serves as a reward signal to enhance the MLLM's attribute generation during multimodal fine-tuning. Extensive experiments on large-scale E-commerce datasets demonstrate that our method achieves state-of-the-art performance on multiple downstream retrieval tasks, validating the effectiveness of harnessing generative models to advance fine-grained representation learning.
## 1 Introduction
Figure 1: Comparison of multimodal information in the general and E-commerce domains. In general domains, text typically provides a global description of the image, with multiple instances corresponding between them. In contrast, in E-commerce, product titles often describe only specific instances within the product image (called the main subject). Moreover, the same product may have various images (e.g., different shooting angles and backgrounds) and titles (e.g., varying descriptions, marketing phrases).

The field of multimodal representation learning is undergoing a paradigm shift, moving beyond discriminative matching frameworks towards generative models capable of sophisticated understanding and reasoning (Jiang et al. [2025](https://arxiv.org/html/2604.20135#bib.bib1); Zhang et al. [2025](https://arxiv.org/html/2604.20135#bib.bib44), [2024a](https://arxiv.org/html/2604.20135#bib.bib47); Zhou et al. [2025](https://arxiv.org/html/2604.20135#bib.bib45)). This evolution is particularly critical in domains like E-commerce, where the ability to distinguish between visually similar products hinges on a deep comprehension of fine-grained attributes. Accurately retrieving a product requires not just matching a query like "red dress" to an image, but understanding nuanced details such as "a V-neck A-line dress in crimson silk with cap sleeves", which necessitates a model that can parse compositional structure and subtle visual cues. This paper explores the transition from traditional representation models to Multimodal Large Language Models (MLLMs) to achieve this leap in capability.
Traditional representation models, such as CLIP (Radford et al. [2021](https://arxiv.org/html/2604.20135#bib.bib4)), are built on discriminative dual-encoder architectures that learn an aligned metric space for matching. This approach, while effective for broad semantic retrieval, often functions as a "bag-of-words" system (Yuksekgonul et al. [2022](https://arxiv.org/html/2604.20135#bib.bib37)), struggling with compositional reasoning. For instance, it can fail to robustly distinguish between "a white t-shirt with a blue logo" and "a blue t-shirt with a white logo." In stark contrast, MLLMs, built on auto-regressive principles, are compelled to understand such structural relationships in order to produce coherent descriptions sequentially. This inherent design not only embeds compositional understanding but also unlocks transformative advantages, including instruction-driven flexibility to create task-aware representations and emergent commonsense reasoning to infer higher-level abstract concepts (Li et al. [2025](https://arxiv.org/html/2604.20135#bib.bib43)).
Despite its potential, adapting MLLMs to fine-grained representation learning still faces a fundamental challenge. Constrained by the causal attention mechanism, prevailing large representation models typically derive embeddings via global average pooling (e.g., LLM2Vec, BehnamGhader et al. [2024](https://arxiv.org/html/2604.20135#bib.bib27)) or last-token hidden states (e.g., VLM2Vec, Jiang et al. [2025](https://arxiv.org/html/2604.20135#bib.bib1)), which is incompatible with established fine-grained alignment techniques such as Region-of-Interest (RoI)-based methods (Xie et al. [2025](https://arxiv.org/html/2604.20135#bib.bib8)). This limitation inspires our core research question: how can we harness an MLLM's advanced understanding and instruction-following abilities to overcome its architectural constraints for fine-grained E-commerce tasks?
Our solution is to define fine-grained product understanding as a key attribute generation task. Continuing with the "red dress" example: if an MLLM can infer key attributes such as "dark red", "silk", "V-neck", "A-line", and "cap sleeves" from the product's image and text, we consider it to have strong fine-grained understanding capability. The next question is: how can we integrate the generated key product attributes into multimodal representation learning? To this end, we propose a two-stage training framework: 1) Attribute-Guided Contrastive Learning (AGCL): we first employ an MLLM to generate a list of critical attributes for each product (e.g., "suede material," "lace-up closure," "rubber sole"). These attributes then serve as an explicit supervisory signal within the contrastive learning process, helping identify hard negatives and filter out noisy false negatives, thereby sharpening the model's discriminative power. 2) Retrieval-aware Attribute Reinforcement (RAR): to resolve the potential misalignment between the generated textual attributes and the learned visual representation, we introduce a reinforcement learning stage. The representation model itself acts as a reward function, providing feedback to fine-tune the attribute generation model. This creates a self-improving loop, ensuring that the generated attributes are maximally aligned with and beneficial for the final discriminative representation.
In summary, this work makes the following contributions:
- To the best of our knowledge, we are the first to systematically investigate and validate the application of Multimodal Large Language Models (MLLMs) for fine-grained representation learning within the E-commerce domain.
- We propose a novel framework that integrates Attribute-Guided Contrastive Learning (AGCL) and Retrieval-aware Attribute Reinforcement (RAR), significantly improving the model's fine-grained discriminative capability while ensuring alignment between attribute generation and representation learning.
- Extensive experiments on challenging E-commerce datasets demonstrate that our method achieves state-of-the-art performance across multiple downstream retrieval tasks, validating both its effectiveness and superiority.
## 2 Related Work
### 2.1 Fine-grained Multimodal Understanding
The foundation of modern multimodal representation learning was laid by models like CLIP, which pioneered large-scale contrastive learning between images and text. Building on this, a range of approaches have been proposed to enhance fine-grained vision-language understanding. For example, FG-CLIP (Xie et al. [2025](https://arxiv.org/html/2604.20135#bib.bib8)) improves cross-modal alignment by leveraging hard negative samples and region-level annotations; ECLIP (Jin et al. [2023](https://arxiv.org/html/2604.20135#bib.bib9)) introduces instance-centric pretraining for E-commerce applications, while $E^2$ (Qi et al. [2024](https://arxiv.org/html/2604.20135#bib.bib10)) employs regional contrastive learning for fashion retrieval. These methods typically enhance model discriminability by incorporating more granular signals.
In recent years, Multimodal Large Language Models (MLLMs) have emerged as powerful tools for multimodal understanding. Models such as LLaVA (Liu et al. [2023](https://arxiv.org/html/2604.20135#bib.bib11)), CogVLM (Wang et al. [2024b](https://arxiv.org/html/2604.20135#bib.bib14)), DeepSeek-VL (Lu et al. [2024](https://arxiv.org/html/2604.20135#bib.bib50)), and Qwen2.5-VL (Bai et al. [2025](https://arxiv.org/html/2604.20135#bib.bib51)) leverage the world knowledge and reasoning capabilities of large language models to effectively extract detailed information from both images and text, achieving significantly stronger performance on fine-grained multimodal tasks.
### 2.2 Generative Models for Representation Learning
The remarkable success of Large Language Models (LLMs) has motivated a recent surge in research aimed at repurposing decoder-only architectures for dense representation learning (BehnamGhader et al. [2024](https://arxiv.org/html/2604.20135#bib.bib27); Lee et al. [2024](https://arxiv.org/html/2604.20135#bib.bib29); Ma et al. [2024](https://arxiv.org/html/2604.20135#bib.bib30); Shin et al. [2025](https://arxiv.org/html/2604.20135#bib.bib31)). Early efforts, such as LLM-Embedder (Zhang et al. [2023](https://arxiv.org/html/2604.20135#bib.bib32)), adapted prompt-based techniques to extract representations. More recent approaches have introduced architectural modifications; for instance, LLM2Vec (BehnamGhader et al. [2024](https://arxiv.org/html/2604.20135#bib.bib27)) enables bidirectional attention and employs a masked next-token prediction objective to convert pre-trained decoders into powerful text encoders. Similarly, NV-Embed (Lee et al. [2024](https://arxiv.org/html/2604.20135#bib.bib29)) improves embedding efficiency by incorporating a latent attention mechanism and removing the causal mask during contrastive training.
This trend has naturally extended into the multimodal domain, leveraging the superior architectural and reasoning capabilities of MLLMs. Models such as MagicLens (Zhang et al. [2024b](https://arxiv.org/html/2604.20135#bib.bib6)) have demonstrated strong performance in instruction-guided retrieval, while frameworks like E5-V (Jiang et al. [2024](https://arxiv.org/html/2604.20135#bib.bib7)) and VLM2Vec (Jiang et al. [2025](https://arxiv.org/html/2604.20135#bib.bib1)) have set new benchmarks for encoding long, complex multimodal inputs into universal embeddings.
Figure 2: An overview of our proposed framework. The model is trained in two stages. (a) Stage 1: Attribute-Guided Contrastive Learning: a representation model is trained using an enhanced contrastive loss with false-negative masking and hard-negative reweighting. (b) Stage 2: Retrieval-aware Attribute Reinforcement: an attribute generator is fine-tuned with reinforcement learning (GRPO), using the frozen representation model to provide a direct, retrieval-based reward. (c) Inference: the optimized generator extracts attributes to enrich the query input, which is then encoded by the representation model for retrieval.
## 3 Methodology
In this section, we present AFMRL (Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning), a novel framework for learning fine-grained multimodal representations. AFMRL is designed around a principle of decoupled responsibilities. We employ two specialized models: a highly efficient Representation Model optimized for generating discriminative embeddings, and a powerful Attribute Generator dedicated to high-level reasoning and extracting key local features. This separation allows each component to excel at its specific task. The central premise of our work is that explicitly incorporating these attributes can resolve critical ambiguities in fine-grained retrieval, particularly with hard negatives: items that are visually similar but semantically distinct.
Figure 3: Demonstration of our proposed Attribute-Guided Learning process and the attributes' influence on retrieval.

### 3.1 Attribute-Guided Contrastive Learning
Figure 4: The overall framework of our proposed Retrieval-aware Attribute Reinforcement training pipeline. G denotes the Attribute Generator, and R denotes the Representation Model (the frozen encoder used for reward calculation).

As illustrated in Figure [3](https://arxiv.org/html/2604.20135#S3.F3), AGCL uses VLM2Vec (Jiang et al. [2025](https://arxiv.org/html/2604.20135#bib.bib1)) as the base embedding model. Following CLIP, VLM2Vec is trained with contrastive learning using the InfoNCE loss. This leads to two key issues: (1) it fails to leverage complementary matching signals beyond the dense embeddings, and (2) it penalizes the model for matching with false negatives (semantically similar items in the batch).
To address these challenges, AGCL introduces an MLLM-based attribute generator (distilled from Qwen2.5-VL-72B-Instruct, Bai et al. [2025](https://arxiv.org/html/2604.20135#bib.bib51); details in Appendix [D](https://arxiv.org/html/2604.20135#A4)) to produce key attributes for each training sample, and enhances the standard InfoNCE loss. Specifically, we utilize the generated key attributes to guide the selection of hard negatives. We compute BM25 scores $B_{ij}$ between the key attributes of query $q_i$ and each candidate $p_j$ to quantify their lexical relevance; BM25 is a robust lexical ranking function from information retrieval. High BM25 scores identify samples that are lexically similar to the query, making them challenging negatives that warrant greater attention during training. To integrate these scores, we transform them into normalized importance weights using a bounded activation function:
$$w_{ij}=e^{1+\tanh(B_{ij})}.\qquad(1)$$

This formulation ensures the weights are bounded, providing stable and targeted emphasis on lexically hard negatives. In addition, for each query $q_i$, we compute cosine similarities $s_{ij}=\cos(q_i,p_j)$ with all candidate samples $p_j$ in the batch, and then define a binary masking function:
$$\mathcal{M}_{ij}=\mathbb{I}\big[(j\neq i)\land(s_{ij}>s_{ii}+\delta)\big],\qquad(2)$$

where $\mathbb{I}[\cdot]$ is the indicator function. If a candidate negative's similarity to the query exceeds that of the positive pair by the predefined margin $\delta$, we remove it from the in-batch pool. This creates a refined set of valid negatives $\mathcal{N}_i=\{j:\mathcal{M}_{ij}=0\}$ for each query, ensuring the model is not penalized for correctly identifying semantically similar samples.
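For concreteness, the BM25 scores $B_{ij}$ over generated attribute strings can be computed with a small pure-Python Okapi BM25. This is a sketch: the attribute strings and the parameters `k1` and `b` below are illustrative defaults, not the paper's exact configuration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of one query against each tokenized attribute list."""
    n = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n
    # document frequency of each query term across the candidate pool
    df = {t: sum(1 for d in docs_tokens if t in d) for t in set(query_tokens)}
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for t in query_tokens:
            if tf[t] == 0:
                continue
            idf = math.log(1.0 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores

# Hypothetical attribute strings for one query and three candidate products.
query = "suede lace-up rubber-sole".split()
candidates = [
    "suede lace-up leather-sole".split(),  # lexically close: a hard negative
    "canvas slip-on rubber-sole".split(),
    "cotton crew-neck short-sleeve t-shirt".split(),
]
scores = bm25_scores(query, candidates)
```

The first candidate shares the most attribute tokens with the query, so it receives the highest score and, via Eq. (1), the largest weight as a hard negative.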
The final loss function holistically integrates the two strategies above. For each query $q_i$, with similarity scores scaled by temperature $\tau$, the AGCL loss is formulated as:

$$\mathcal{L}_{\text{AGCL}}=-\log\frac{w_{ii}\cdot e^{s_{ii}/\tau}}{w_{ii}\cdot e^{s_{ii}/\tau}+\sum_{j\in\mathcal{N}_i}w_{ij}\cdot e^{s_{ij}/\tau}}.\qquad(3)$$

Here, the summation in the denominator is restricted to the refined negative set $\mathcal{N}_i$, thereby excluding false negatives. The remaining valid negatives are re-weighted by their importance scores $w_{ij}$. This integrated objective enables the model to learn more robust and discriminative representations by focusing on true, informative negative samples.
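Equations (1)-(3) can be sketched together in NumPy. This is a minimal sketch assuming precomputed pairwise BM25 scores and L2-normalized embeddings; the function name and defaults mirror the paper's reported hyperparameters ($\tau=0.02$, $\delta=0.4$) but are otherwise illustrative.

```python
import numpy as np

def agcl_loss(q, p, bm25, tau=0.02, delta=0.4):
    """AGCL loss (Eq. 3) for one in-batch contrastive step.

    q, p : L2-normalized query / candidate embeddings, shape (B, D);
           p[i] is the positive for q[i].
    bm25 : pairwise BM25 scores between query and candidate attributes, (B, B).
    """
    s = q @ p.T                       # cosine similarities s_ij
    w = np.exp(1.0 + np.tanh(bm25))   # importance weights, Eq. (1)
    idx = np.arange(s.shape[0])
    # false-negative mask, Eq. (2): drop off-diagonal candidates whose
    # similarity exceeds the positive's by more than the margin delta
    mask = s > (s[idx, idx][:, None] + delta)
    mask[idx, idx] = False
    terms = w * np.exp(s / tau)
    terms[mask] = 0.0                 # keep only the refined negative set N_i
    pos = terms[idx, idx]
    return float(np.mean(-np.log(pos / terms.sum(axis=1))))
```

With orthonormal toy embeddings and zero BM25 scores the positives dominate the denominator, so the loss is close to zero, as expected from Eq. (3).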
### 3.2 Retrieval-aware Attribute Reinforcement
Section [3.1](https://arxiv.org/html/2604.20135#S3.SS1) yields a strong foundation: a Key Attribute Generator ($\pi_{\text{SFT}}$) proficient at producing plausible attributes, and a powerful embedding model trained with AGCL. However, the generator's distillation objective is disconnected from the final retrieval task. To bridge this optimization gap, we introduce a reinforcement learning (RL) stage that directly aligns the attribute generation policy with downstream retrieval performance. This RL stage fine-tunes the generator policy, $\pi_\theta$, to produce attributes that maximize the efficacy of the fixed representation model.
Direct-Feedback Reward Mechanism. The design of the reward function is paramount to the effectiveness of reinforcement learning. Moving beyond simplistic proxy signals, we introduce a reward function that is directly coupled with our final task metric. We accomplish this by leveraging our pre-trained representation model as an integral component of the reward evaluation environment. In this setup, the policy $\pi_\theta$ generates a set of attributes for a given query. These attributes are used to augment the multimodal input, which is subsequently passed to the representation model to conduct a retrieval search across a candidate pool. The reward is then defined precisely as the Recall@k of this search.
Table 1: Performance on coarse-grained and cross-modal retrieval tasks. The AGCL component of our AFMRL framework provides consistent improvements over strong baselines, demonstrating its broad effectiveness for general retrieval.

Table 2: Fine-grained instance retrieval results. Our full model, AFMRL, demonstrates state-of-the-art performance, with each component providing a clear incremental benefit.

This direct-feedback loop ensures the policy is optimized precisely for enhancing top-$k$ retrieval. The reward $\mathcal{R}(x,a)$ for a query $x$ and generated attributes $a$ is:
$$\mathcal{R}(x,a)=\frac{|\mathcal{P}_x\cap\mathrm{top}_k(x,a)|}{|\mathcal{P}_x|},\qquad(4)$$

where $\mathcal{P}_x$ is the set of ground-truth positives, and $\mathrm{top}_k(x,a)$ denotes the set of top-$k$ retrieved items when using generated attributes $a$ for query $x$. To maintain the linguistic coherence learned during SFT, we penalize malformed generations. The final reward is:

$$\mathcal{R}(x,a,\phi)=\mathcal{R}(x,a)\cdot\mathbb{I}[\phi=1]+\eta\cdot\mathbb{I}[\phi=0],\qquad(5)$$

where $\phi$ is a validity flag and $\eta=-0.1$ is a penalty for invalid outputs.
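A minimal sketch of this reward (Eqs. 4-5), assuming pre-normalized embeddings and a toy candidate pool; the function name and the example data are illustrative, not the paper's implementation.

```python
import numpy as np

def retrieval_reward(query_emb, pool_embs, positives, k=50, valid=True, eta=-0.1):
    """Recall@k reward of Eqs. (4)-(5): the attribute-augmented query is
    embedded by the frozen representation model, retrieval runs over the
    candidate pool, and recall of the ground-truth positives is returned.
    Malformed generations (valid=False) receive the fixed penalty eta."""
    if not valid:
        return eta
    sims = pool_embs @ query_emb                 # cosine similarity to each candidate
    topk = set(np.argsort(-sims)[:k].tolist())   # indices of the top-k retrieved items
    return len(topk & set(positives)) / len(positives)

# Toy pool of five L2-normalized candidate embeddings; items 0 and 1 are
# the ground-truth positives for the query.
pool = np.array([[1.0, 0.0], [0.9, 0.436], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.866]])
query = np.array([1.0, 0.0])
```

Shrinking `k` makes the reward sparser (here Recall@1 only credits half the positives), which is exactly the trade-off discussed for the choice of $k$ in Section 5.2.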
Policy Optimization with GRPO. Equipped with the SFT-initialized policy $\pi_{\text{SFT}}$ and our direct-feedback reward $\mathcal{R}$, we employ GRPO (Shao et al. [2024](https://arxiv.org/html/2604.20135#bib.bib5)) for policy optimization. GRPO is well-suited for this task due to its stability and sample efficiency, mitigating risks associated with potentially sparse rewards. Moreover, it is an efficient algorithm that eliminates the need for explicit reward and value models. The RL process fine-tunes the policy $\pi_\theta$ (initialized from $\pi_{\text{SFT}}$) to maximize the expected retrieval reward. The GRPO objective is defined as follows:
$$\mathcal{J}_{\text{GRPO}}(\theta)=\mathbb{E}_{\tau\sim\pi_{\theta_{\text{old}}}}\Big[\min\big(r_t(\theta)\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),1-\epsilon,1+\epsilon\big)\hat{A}_t\big)-\beta D_{\text{KL}}(\pi_\theta\,\|\,\pi_{\text{SFT}})\Big],\qquad(6)$$

where $r_t(\theta)$ is the probability ratio, and the advantage estimate $\hat{A}_t$ is computed by normalizing batch rewards: $\hat{A}_i=\frac{\mathcal{R}_i-\text{mean}(\mathcal{R})}{\text{std}(\mathcal{R})}$. The KL-divergence term, regulated by $\beta$, acts as a regularization mechanism, anchoring the policy to the robust linguistic foundation established during SFT while encouraging exploration towards higher retrieval performance. $\epsilon$ denotes the clipping parameter, used to prevent excessive policy updates.
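The group-normalized advantage and the clipped surrogate inside Eq. (6) are straightforward to compute. A minimal sketch (function names are illustrative; the KL penalty toward $\pi_{\text{SFT}}$ would be subtracted separately in a full training loop):

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Standardize a group of rollout rewards into advantages (the A_hat of
    GRPO): subtract the group mean and divide by the group std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped term min(r*A, clip(r, 1-eps, 1+eps)*A) from Eq. (6)."""
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage)
```

Because advantages are normalized within each rollout group, GRPO needs no learned value model: a rollout is reinforced only insofar as its Recall@k reward beats its group's mean.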
## 4 Experiments
We conduct a series of experiments to evaluate AFMRL on large-scale E-commerce benchmarks and to ablate the contributions of its key components.
### 4.1 Experimental Setup
Dataset. We evaluate our method on the large-scale M5Product E-commerce dataset (Dong et al. [2022](https://arxiv.org/html/2604.20135#bib.bib40)). To ensure data quality, we filter out items with missing or invalid images, resulting in a clean set of 5,760,482 products spanning over 6,000 categories. Additionally, to cover more diverse real-world scenarios, we collected a large-scale multimodal product dataset from a popular E-commerce platform, named EIPM (E-commerce Identical Product Matching dataset). It contains about 2 million groups of identical products and over 10 million product items, covering more than 10,000 sub-categories such as clothes, cosmetics, and toys.
Table 3: Performance on classification and clustering tasks.

Implementation Details. Our AFMRL framework consists of two core components. 1) VLM2Vec model: we initialize the model from Qwen2-VL-2B-Instruct (Wang et al. [2024a](https://arxiv.org/html/2604.20135#bib.bib16)) and fine-tune it using LoRA. The learning rate is 5e-5, the InfoNCE temperature is 0.02, and the batch size is 2048. We use a LoRA rank of 8 and train for 2,000 steps. 2) Attribute Generator: we initialize the generator from Qwen2.5-VL-3B-Instruct (Bai et al. [2025](https://arxiv.org/html/2604.20135#bib.bib51)) and perform full-parameter fine-tuning. The distillation stage uses a learning rate of 5e-5 for 2,000 steps. The margin $\delta$ for the false-negative masking strategy is set to 0.4. The RL stage uses GRPO with a learning rate of 1e-6 for 350 steps; the hyperparameters are $\beta=0.01$ and $\epsilon=0.2$, and the number of rollouts is 8. Details of the RL training process are provided in Appendix [B](https://arxiv.org/html/2604.20135#A2).
Baselines. We compare AFMRL against two categories of models: established BERT-based models (CLIP, Radford et al. [2021](https://arxiv.org/html/2604.20135#bib.bib4); ALBEF, Li et al. [2021](https://arxiv.org/html/2604.20135#bib.bib38); etc.) and LLM-based approaches (LLM2CLIP, Huang et al. [2024](https://arxiv.org/html/2604.20135#bib.bib34); VLM2Vec, Jiang et al. [2025](https://arxiv.org/html/2604.20135#bib.bib1)). This enables a comprehensive evaluation against both classic and state-of-the-art architectures.
### 4.2 Main Results
Depending on the retrieval granularity, product retrieval can be divided into coarse-grained and fine-grained tasks. In coarse-grained retrieval, products belonging to the same category are treated as positive samples, whereas fine-grained retrieval imposes stricter requirements: items are considered positive only when attributes (e.g., style and color) match exactly.
#### 4.2.1 Coarse-Grained and Cross-Modal Retrieval
As shown in Table [1](https://arxiv.org/html/2604.20135#S3.T1), representation models based on MLLMs exhibit significantly better retrieval performance than traditional discriminative models. Furthermore, incorporating AGCL improves performance further. This indicates that AGCL serves as a strong foundation for downstream retrieval tasks: even without a dedicated attribute generator, it improves the quality of representations for broad semantic matching, thereby providing a more powerful base model for various retrieval tasks.
#### 4.2.2 Fine-Grained Instance Retrieval
Fine-grained instance retrieval imposes high demands on discriminative capability. Here, we deploy the full AFMRL framework, including the attribute generator.
As shown in Table [2](https://arxiv.org/html/2604.20135#S3.T2), models such as Region-CLIP and FG-CLIP, which incorporate local features, perform much better here relative to their coarse-grained results, confirming that fine-grained features are crucial for E-commerce tasks. In contrast, the inability to fully exploit local features nearly erases the representational advantage brought by the strong reasoning capabilities of MLLMs.
Notably, building on AGCL, the full AFMRL framework achieves state-of-the-art performance: 1) the distilled generator significantly boosts Recall@1 to 52.42% by adding high-quality descriptive attributes, indicating that key attributes effectively enhance the MLLM's understanding of fine-grained details; 2) building on this, we further introduce a retrieval-aware policy, culminating in a final Recall@1 of 54.28%. This validates our core hypothesis: for challenging fine-grained tasks, explicitly generating and optimizing descriptive attributes based on the end-task retrieval objective is maximally effective.
### 4.3 Evaluation on Downstream Tasks
To assess the generalizability of the learned embeddings beyond retrieval, we evaluated them on the downstream tasks of product classification and clustering. On a dataset of 849,207 items across 5,146 classes, we trained a linear probe on frozen embeddings for classification (Accuracy) and applied k-Means for clustering. The results in Table [3](https://arxiv.org/html/2604.20135#S4.T3) show that the model equipped with both AGCL and the RAR strategy achieves the best performance on classification and clustering, indicating that AFMRL exhibits strong generalization capability in E-commerce scenarios.
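As a simplified, dependency-free stand-in for the linear-probe protocol above, frozen embeddings can be classified by nearest class centroid; this is a weaker proxy than a trained probe (the function and toy data below are illustrative, not the paper's evaluation code), but it is a quick way to compare embedding spaces.

```python
import numpy as np

def centroid_probe_accuracy(train_x, train_y, test_x, test_y):
    """Nearest-class-centroid accuracy on frozen embeddings: a cheap proxy
    for training a linear probe on top of a frozen encoder."""
    classes = np.unique(train_y)
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    preds = classes[np.argmax(test_x @ centroids.T, axis=1)]
    return float((preds == test_y).mean())

# Toy 2-D embeddings for two well-separated classes.
train_x = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
train_y = np.array([0, 0, 1, 1])
test_x = np.array([[1.0, 0.05], [0.05, 1.0]])
test_y = np.array([0, 1])
```

A higher score under this proxy indicates a more linearly separable embedding space, which is the property the paper's linear-probe accuracy measures.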
## 5 Discussions
In this section, we delve deeper into the mechanisms behind our experimental results. We analyze the efficacy of our method's key components, provide insights into intriguing phenomena observed during reinforcement learning, and candidly discuss the limitations of our current approach.
Figure 5: Showcase of key entities.

### 5.1 Efficacy of Generated Attributes and the Representation Model
Our method's core contribution is the use of reinforcement learning to optimize the generation of key attributes that enrich multimodal inputs for retrieval. The effectiveness of this design is twofold: the direct value of the generated attributes, and the robust representation model that serves as their foundation.
Generated attributes provide more discriminative power. We argue that the performance gains are primarily attributable to the policy network, $\pi_\theta$, generating highly discriminative key attributes. This is intuitively demonstrated by the case study in Figure [5](https://arxiv.org/html/2604.20135#S5.F5). In the original embedding space, a query can be equidistant to its positive (pos+) and negative (neg-) samples due to high visual and semantic similarity (e.g., products from the same brand). After augmenting the input with key attributes generated by our model (e.g., "size": "<>", "series": "<>"), the distance to the negative sample is significantly increased ($\Delta$Distance grows) while similarity to the positive sample is maintained. This indicates that the generated attributes guide the model to focus on subtle yet critical differences, effectively "pushing away" incorrect candidates and making the feature representations more discriminative.
Figure 6: R@1 over training steps, with and without AGCL.

AGCL provides a robust foundation for attribute generation. The generation of high-quality attributes relies on a powerful underlying representation model. As shown in Figure [6](https://arxiv.org/html/2604.20135#S5.F6), our proposed Attribute-Guided Contrastive Learning (AGCL) strategy plays a pivotal role. Compared to the baseline VLM2Vec, which quickly plateaus and gets stuck in a local optimum, AGCL exhibits steady and continuous improvement on both R@1 and R@5, eventually surpassing the baseline. We posit that AGCL encourages the model to learn a more robust and generalized representation space by preventing it from overfitting to simple negatives early in training. This high-quality representation space provides a solid foundation upon which the RL policy can learn to generate precise and effective attributes.
Figure 7: Distribution of reward scores for different $k$.
### 5.2 In-depth Analysis of the Reinforcement Learning Process
Having established the effectiveness of our core design, we now analyze the RL training process itself to understand its internal dynamics and emergent properties.
Motivation for the choice of $k$. We use recall as the reward function for RAR. In the Recall@k reward, the hyperparameter $k$ controls the trade-off between the precision and the density of the learning signal: if $k$ is too small, the reward becomes overly sparse; if $k$ is too large, the signal saturates and the gradients weaken. Since GRPO relies on discriminative inter-group reward differences, a well-balanced reward distribution is crucial. As shown in Fig. [7](https://arxiv.org/html/2604.20135#S5.F7), $k=50$ strikes the best balance, avoiding the sparsity of $k=10$ and the saturation of $k=100$, thereby providing the richest learning signal. In addition, we compare how different values of $k$ affect model convergence under different reward functions; details are provided in Appendix [B](https://arxiv.org/html/2604.20135#A2).
Figure 8: Evolution of think and answer lengths during RL.

Emergent behavior: generation conciseness. As illustrated in Figure [8](https://arxiv.org/html/2604.20135#S5.F8), the average length of generated attributes consistently decreases. Unlike in complex reasoning tasks (e.g., math, Shao et al. [2024](https://arxiv.org/html/2604.20135#bib.bib5)), where models benefit from longer reasoning chains, our retrieval objective does not require lengthy explanations. The agent learns that redundant or irrelevant attributes act as noise and hurt Recall@k, so it is implicitly encouraged to produce the shortest set of attributes that is still sufficient for good retrieval, leading to concise yet effective generations.
Circular Iterative Training. Our AFMRL framework decouples attribute generation from representation learning, offering an inherent advantage: once the attribute generator is refined via reinforcement learning, it can feed back into the representation learning process to enhance it. We refer to this mechanism as Circular Iterative Training (CIT). As shown in Table [4](https://arxiv.org/html/2604.20135#S5.T4), when we reused the RL-trained attribute generator for AGCL training (using only 30% of the training samples), the model's performance on downstream tasks improved significantly.
Table 4: Performance comparison of different training strategies on downstream tasks.
## 6 Conclusions
This paper focuses on a key challenge in fine-grained E-commerce retrieval: representation models based on MLLMs lack sufficient understanding of fine-grained information. To address this, we propose the AFMRL framework, which first uses MLLM-generated attributes to mine hard negative samples, and then reinforces the MLLM's attribute generation with retrieval performance as the reward. Extensive experiments on large-scale E-commerce datasets demonstrate that our method achieves state-of-the-art performance, validating our central thesis that guiding discriminative learning with generative reasoning is an effective approach to fine-grained multimodal retrieval.
## 7 Limitations
Despite the strong performance on our primary retrieval task, we acknowledge the method's limitations. Generalization on downstream tasks. As shown in Table [3](https://arxiv.org/html/2604.20135#S4.T3), when evaluated on downstream classification and clustering tasks, our RL-trained policy (π_RL) performs slightly worse than the SFT-trained policy (π_SFT). We attribute this to a phenomenon known as the "alignment tax": because the π_RL policy is optimized for the highly specific Recall@k retrieval metric, its representations become specialized for this task, potentially at the cost of their generality. SFT, being a more general-purpose tuning paradigm, appears to better preserve the universal features required for broader tasks such as classification and clustering.
## References
- S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin (2025) Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923.
- P. BehnamGhader, V. Adlakha, M. Mosbach, D. Bahdanau, N. Chapados, and S. Reddy (2024) LLM2Vec: large language models are secretly powerful text encoders. arXiv preprint arXiv:2404.05961.
- X. Dong, X. Zhan, Y. Wu, Y. Wei, M. C. Kampffmeyer, X. Wei, M. Lu, Y. Wang, and X. Liang (2022) M5Product: self-harmonized contrastive learning for e-commercial multi-modal pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21252–21262.
- L. Gao, Y. Zhang, J. Han, and J. Callan (2021) Scaling deep contrastive learning batch size under memory limited setup. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), pp. 316–321.
- R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), Vol. 2, pp. 1735–1742.
- W. Huang, A. Wu, Y. Yang, X. Luo, Y. Yang, L. Hu, Q. Dai, C. Wang, X. Dai, D. Chen, et al. (2024) LLM2CLIP: powerful language model unlocks richer visual representation. arXiv preprint arXiv:2411.04997.
- T. Jiang, M. Song, Z. Zhang, H. Huang, W. Deng, F. Sun, Q. Zhang, D. Wang, and F. Zhuang (2024) E5-V: universal embeddings with multimodal large language models. arXiv preprint arXiv:2407.12580.
- Z. Jiang, R. Meng, X. Yang, S. Yavuz, Y. Zhou, and W. Chen (2025) VLM2Vec: training vision-language models for massive multimodal embedding tasks. In ICLR.
- Y. Jin, Y. Li, Z. Yuan, and Y. Mu (2023) Learning instance-level representation for large-scale multi-modal pretraining in e-commerce. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11060–11069.
- C. Lee, R. Roy, M. Xu, J. Raiman, M. Shoeybi, B. Catanzaro, and W. Ping (2024) NV-Embed: improved techniques for training LLMs as generalist embedding models. arXiv preprint arXiv:2405.17428.
- J. Li, J. Ma, X. Zhang, Y. Li, and J. Shi (2024) GiVE: guiding visual encoder to perceive overlooked information. arXiv preprint arXiv:2410.20109.
- J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi (2021) Align before fuse: vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems 34, pp. 9694–9705.
- S. Li, X. Gao, and S. S. Du (2025) Highlighting what matters: promptable embeddings for attribute-focused image retrieval. arXiv preprint arXiv:2505.15877.
- H. Liu, C. Li, Q. Wu, and Y. J. Lee (2023) Visual instruction tuning. Advances in Neural Information Processing Systems 36, pp. 34892–34916.
- H. Lu, W. Liu, B. Zhang, B. Wang, K. Dong, B. Liu, J. Sun, T. Ren, Z. Li, H. Yang, et al. (2024) DeepSeek-VL: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525.
- X. Ma, L. Wang, N. Yang, F. Wei, and J. Lin (2024) Fine-tuning LLaMA for multi-stage text retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2421–2425.
- D. Qi, H. Zhao, and S. Li (2024) Easy regional contrastive learning of expressive fashion representations. Advances in Neural Information Processing Systems 37, pp. 20480–20509.
- A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763.
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024) DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- J. Shin, B. Kim, and E. Kim (2025) Generative modeling of class probability for multi-modal representation learning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 20737–20746.
- P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, et al. (2024a) Qwen2-VL: enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191.
- W. Wang, Q. Lv, W. Yu, W. Hong, J. Qi, Y. Wang, J. Ji, Z. Yang, L. Zhao, S. XiXuan, et al. (2024b) CogVLM: visual expert for pretrained language models. Advances in Neural Information Processing Systems 37, pp. 121475–121499.
- R. Xiao, S. Kim, M. Georgescu, Z. Akata, and S. Alaniz (2025) FLAIR: VLM with fine-grained language-informed image representations. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 24884–24894.
- C. Xie, B. Wang, F. Kong, J. Li, D. Liang, G. Zhang, D. Leng, and Y. Yin (2025) FG-CLIP: fine-grained visual and textual alignment. arXiv preprint arXiv:2505.05071.
- M. Yuksekgonul, F. Bianchi, P. Kalluri, D. Jurafsky, and J. Zou (2022) When and why vision-language models behave like bags-of-words, and what to do about it? arXiv preprint arXiv:2210.01936.
- C. Zhang, S. Wu, H. Zhang, T. Xu, Y. Gao, Y. Hu, and E. Chen (2024a) NoteLLM: a retrievable large language model for note recommendation. In Companion Proceedings of the ACM Web Conference 2024, pp. 170–179.
- K. Zhang, Y. Luan, H. Hu, K. Lee, S. Qiao, W. Chen, Y. Su, and M. Chang (2024b) MagicLens: self-supervised image retrieval with open-ended instructions. In Proceedings of the 41st International Conference on Machine Learning, pp. 59403–59420.
- P. Zhang, S. Xiao, Z. Liu, Z. Dou, and J. Nie (2023) Retrieve anything to augment large language models. arXiv preprint arXiv:2310.07554.
- X. Zhang, Y. Zhang, W. Xie, M. Li, Z. Dai, D. Long, P. Xie, M. Zhang, W. Li, and M. Zhang (2025) Bridging modalities: improving universal multimodal retrieval by multimodal large language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 9274–9285.
- J. Zhou, Y. Xiong, Z. Liu, Z. Liu, S. Xiao, Y. Wang, B. Zhao, C. J. Zhang, and D. Lian (2025) MegaPairs: massive data synthesis for universal multimodal retrieval. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, pp. 19076–19095. [Link](https://aclanthology.org/2025.acl-long.935/).
## Appendix A Evaluation on EIPM dataset
To further validate the robustness and generalizability of our proposed method, we conducted additional experiments on a large\-scale real\-world E\-commerce dataset\.
Setup. The EIPM dataset contains 10 million products for training and a test set of 200,000 products. The evaluation tasks, metrics, and baseline models used here are identical to the experimental setup described in the main body of the paper.
Results. The results of this supplementary experiment are presented in Table [5](https://arxiv.org/html/2604.20135#A1.T5). The key takeaway is that the performance trends observed on this large-scale dataset are consistent with our findings from the primary experiments.
Most importantly, our proposed AGCL method remains highly effective. As shown in Table [5](https://arxiv.org/html/2604.20135#A1.T5), applying AGCL to both the LLM2CLIP and the stronger VLM2Vec backbones results in clear and consistent performance gains across all three retrieval tasks (I2T, T2I, and coarse-grained). For example, VLM2Vec_AGCL outperforms the vanilla VLM2Vec by a significant margin, achieving the best results on all metrics.
This experiment confirms that our approach is not only effective on standard academic benchmarks but also robust and scalable enough to deliver significant improvements on challenging, large\-scale data from a real\-world production environment\.
Table 5: Performance of Coarse-Grained and Cross-Modal Retrieval Tasks on the EIPM Dataset.
## Appendix B Details in training
We use Recall, Precision, and NDCG, three commonly used evaluation metrics for retrieval performance, as rewards, respectively, and compare model convergence under different top-k settings. As shown in Figures [9](https://arxiv.org/html/2604.20135#A2.F9)-[11](https://arxiv.org/html/2604.20135#A2.F11), different metrics respond differently to the value of k. NDCG achieves optimal convergence at k=10; further increasing k weakens the ranking signals it relies on. In contrast, Recall and Precision peak at k=50, where the candidate pool provides dense learning signals without exhibiting saturation.
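As a reference point for the NDCG-based reward, the metric over a ranked relevance list can be sketched as follows. This is a hedged numpy sketch: the function `ndcg_at_k` and the binary-relevance assumption are ours for illustration, not the paper's exact reward code.

```python
import numpy as np

def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k over a relevance list in ranked order: DCG of the top-k
    positions, normalized by the DCG of the ideal (sorted) ranking."""
    rels = np.asarray(ranked_relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rels.size + 2))   # 1/log2(rank+1)
    dcg = float((rels * discounts).sum())
    ideal = np.sort(np.asarray(ranked_relevances, dtype=float))[::-1][:k]
    idcg = float((ideal * discounts[:ideal.size]).sum())
    return dcg / idcg if idcg > 0 else 0.0
```

Because the log-discount concentrates credit on the very top ranks, NDCG-based rewards depend mostly on the first few positions, which is consistent with its sensitivity to small k noted above.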
Figure 9: Training curves under different top-k settings when using NDCG as the RL reward function.
Figure 10: Training curves under different top-k settings when using Precision as the RL reward function (training fails to converge when top-k is set to 100).
Figure 11: Training curves under different top-k settings when using Recall as the RL reward function.
## Appendix C Qualitative results
As illustrated in Figure[12](https://arxiv.org/html/2604.20135#A3.F12), our reinforcement learning \(RL\) framework significantly improves the quality of generated attributes\. The baseline model often extracts noisy or overly general terms from verbose product titles\. After optimization with our retrieval\-aware RL policy, the model produces attributes that are substantially more succinct and accurate\. The RL reward signal actively discourages the generation of terms that do not enhance discriminative power, compelling the model to focus only on the most essential and factually correct product features that are crucial for the fine\-grained retrieval task\.
Figure 12: Cases of product items and their generated attributes with and without RL.
## Appendix D Cold start of the attribute generator
We used a powerful model, Qwen-2.5-VL-72B-Instruct Bai et al. ([2025](https://arxiv.org/html/2604.20135#bib.bib51)), as an "attribute oracle" to generate precise attributes for a query, its positive target, and a hard negative. To enhance the quality and transparency of the generation, we prompted the oracle to first produce a Chain-of-Thought (CoT) reasoning process enclosed in <think> tags, followed by the final structured attributes enclosed in <answer> tags. This process forces the model to first "think" about the distinguishing features before extracting them.
We then measured the change in similarity scores produced by a base representation model, with and without the final oracle attributes from the <answer> block. As Figure [3](https://arxiv.org/html/2604.20135#S3.F3)(b) illustrates, in the baseline condition, the hard negative achieves a similarity score perilously close to that of the positive target. However, when the model is augmented with the explicit key attributes, the representational space is reshaped, significantly widening the discriminative margin. This confirms our hypothesis: access to accurate, fine-grained attributes is critical for robustly distinguishing between similar items.
We argue that relying on such a massive oracle model is impractical for real-world deployment, and that extracting key product attributes is a relatively easy text generation task. Therefore, we adopt a knowledge distillation approach, transferring the oracle's capability into a more inference-friendly model, Qwen-2.5-VL-3B-Instruct Bai et al. ([2025](https://arxiv.org/html/2604.20135#bib.bib51)). Specifically, we first perform Supervised Fine-tuning (SFT) on a dataset of 10,000 CoT examples generated by the oracle. This teaches our smaller generator model to mimic the oracle's reasoning process, generating both the <think> process and the final <answer> attributes.
## Appendix E Preliminaries
### E.1 Generative Model for Embedding Tasks
Following the framework established by VLM2Vec Jiang et al. ([2025](https://arxiv.org/html/2604.20135#bib.bib1)), we adapt a generative model to the task of multimodal embedding. The goal is to train a model that can embed diverse data types into a shared semantic space. A training instance is represented as a query-positive target pair (q, t^+), where both q and t^+ can be an image, text, or an interleaved combination of both. To accommodate various downstream tasks, the query q is conditioned with a specific task instruction (e.g., "Retrieve the most relevant product for this image").
The learning objective is to train a discriminative representation function f(·) using contrastive learning Hadsell et al. ([2006](https://arxiv.org/html/2604.20135#bib.bib2)). This objective aims to maximize the similarity score between the representations of a query and its positive target, f(q) and f(t^+), while simultaneously minimizing its similarity to all other negative targets {t^-} in the batch. Given a Multimodal Large Language Model (MLLM) as our backbone, we obtain representations by extracting the hidden state of the final token from its last layer.
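The last-token pooling step can be sketched as below, assuming the final-layer hidden states and an attention mask are already available. The array shapes and the right-padding convention are our assumptions for illustration, not the paper's code.

```python
import numpy as np

def last_token_embedding(hidden_states, attention_mask):
    """Pool a sequence into one vector by taking the hidden state of the
    last non-padded token (hidden_states: [seq_len, dim], right padding)."""
    last = int(np.asarray(attention_mask).sum()) - 1   # index of final real token
    v = np.asarray(hidden_states, dtype=float)[last]
    return v / np.linalg.norm(v)                       # unit-normalize for cosine
```

Normalizing here makes the dot product of two pooled vectors equal to their cosine similarity, which is what the contrastive objective below consumes.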
Formally, for a mini-batch of N pairs {(q_1, t_1), ..., (q_N, t_N)}, the InfoNCE loss is defined as:

$$\mathcal{L}=\frac{1}{N}\sum_{i=1}^{N}-\log\frac{e^{s_{i,i}/\tau}}{e^{s_{i,i}/\tau}+\sum_{j\neq i}^{N}e^{s_{i,j}/\tau}},\qquad(7)$$

where s_{i,j} = cosine(f(q_i), f(t_j)) denotes the cosine similarity between the representations of query q_i and target t_j. The function f(·) represents the MLLM-based embedding process, and τ is the temperature hyperparameter. Following VLM2Vec Jiang et al. ([2025](https://arxiv.org/html/2604.20135#bib.bib1)), we set τ to 0.02.
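As a sanity check, Eq. (7) can be implemented directly on the embeddings. This is a short numpy sketch of the in-batch InfoNCE loss; the row-wise log-softmax and the max-shift for numerical stability are standard implementation details, not paper-specific choices.

```python
import numpy as np

def info_nce_loss(q_embs, t_embs, tau=0.02):
    """In-batch InfoNCE of Eq. (7): s_ij is the cosine similarity between
    query i and target j; diagonal entries are the positive pairs."""
    q = q_embs / np.linalg.norm(q_embs, axis=1, keepdims=True)
    t = t_embs / np.linalg.norm(t_embs, axis=1, keepdims=True)
    s = (q @ t.T) / tau                       # scaled similarity matrix s_ij / tau
    s = s - s.max(axis=1, keepdims=True)      # shift for numerical stability
    log_probs = s - np.log(np.exp(s).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

With τ = 0.02 the softmax is very sharp, so even a modest gap between the positive and the hardest in-batch negative drives the loss close to zero.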
### E.2 Promptable Multimodal Embeddings
Traditional representation models generate a single, static embedding for each data item, which limits their flexibility for context-dependent tasks where a single item might be relevant to diverse intents (e.g., matching a product's brand versus its visual style). To address this, the paradigm of promptable embeddings has emerged. This approach generates a context-aware representation by conditioning on a textual prompt, transforming the embedding function to f(item, prompt) and allowing it to dynamically highlight task-relevant attributes Li et al. ([2025](https://arxiv.org/html/2604.20135#bib.bib43)). The efficacy of this paradigm has been validated by models like VLM2Vec Jiang et al. ([2025](https://arxiv.org/html/2604.20135#bib.bib1)) and E5-V Jiang et al. ([2024](https://arxiv.org/html/2604.20135#bib.bib7)) for creating domain-specific embeddings, and by architectures like GiVE Li et al. ([2024](https://arxiv.org/html/2604.20135#bib.bib41)) and FLAIR Xiao et al. ([2025](https://arxiv.org/html/2604.20135#bib.bib42)) that enable finer-grained control via token-level text-image interactions. Building on this foundation, our work systematically studies promptable embeddings for retrieval targets and, crucially, develops strategies for their efficient, large-scale deployment.
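The f(item, prompt) formulation can be made concrete with a toy sketch. The template string and the character-frequency encoder below are purely illustrative stand-ins for a real MLLM encoder; only the conditioning pattern itself reflects the paradigm described above.

```python
import numpy as np

def toy_encode(text):
    """Illustrative stand-in for an MLLM text encoder: a unit-normalized
    character-frequency vector."""
    v = np.zeros(128)
    for ch in text:
        v[ord(ch) % 128] += 1.0
    return v / np.linalg.norm(v)

def promptable_embed(encode, item, prompt):
    """Promptable embedding f(item, prompt): the prompt conditions the
    encoder, so the same item yields different task-aware vectors."""
    return encode(f"Instruction: {prompt}\nItem: {item}")
```

Embedding the same product once under a brand-matching prompt and once under a style-matching prompt yields two distinct vectors, which is precisely the flexibility a single static embedding cannot offer.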
### E.3 Large Batch Training with GradCache
The efficacy of contrastive learning hinges on large batch sizes to ensure a diverse set of negative samples, yet the memory footprint of MLLMs makes this computationally prohibitive\.
To overcome this constraint, we employ GradCache Gao et al. ([2021](https://arxiv.org/html/2604.20135#bib.bib3)), a gradient caching technique that enables training with a very large effective batch size. The core idea is to decouple the backpropagation of the contrastive loss from that of the encoder. Instead of performing a backward pass for the entire large batch, GradCache operates in two steps:
1. Representation Gradient Computation and Caching: the large batch is divided into smaller mini-batches that fit into GPU memory. For each mini-batch, a forward pass is performed to compute embeddings, and these embeddings are used to calculate the representation-level gradients ∂L/∂f(x_i) for all samples x_i in the full batch. These small gradient tensors are then cached;
2. Sub-batch Gradient Accumulation: for each mini-batch, a second forward and backward pass is performed. The full-batch gradients for the model parameters θ are then accumulated by multiplying the cached representation gradients with the local model gradients ∂f(x_i)/∂θ and summing across all mini-batches.
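The two steps above can be demonstrated with a toy linear encoder, where the chain rule dL/dθ = Σ_i (∂L/∂f(x_i)) · (∂f(x_i)/∂θ) is available in closed form. This is a minimal numpy illustration of the caching idea, not the GradCache library itself; the squared-error surrogate loss and the linear encoder f(x) = Wx are our own simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_in, d_out, chunk = 8, 5, 3, 2
X = rng.normal(size=(N, d_in))          # "inputs" for the full large batch
Y = rng.normal(size=(N, d_out))         # targets for the surrogate loss
W = rng.normal(size=(d_out, d_in))      # encoder parameters

# Step 1: forward passes (chunk-wise in practice) to cache the
# representation-level gradients dL/df(x_i) for the surrogate loss
# L = (1/N) * sum_i ||f(x_i) - y_i||^2, with f(x) = W @ x.
F = X @ W.T                             # all embeddings
cached = (2.0 / N) * (F - Y)            # one small d_out vector per sample

# Step 2: re-visit each chunk and accumulate parameter gradients via the
# chain rule dL/dW = sum_i dL/df(x_i) outer x_i, summed across mini-batches.
grad_W = np.zeros_like(W)
for start in range(0, N, chunk):
    for i in range(start, min(start + chunk, N)):
        grad_W += np.outer(cached[i], X[i])

# Reference: the direct full-batch gradient of the same loss.
grad_ref = (2.0 / N) * (X @ W.T - Y).T @ X
```

Because only the small representation-level gradients are kept for the full batch, peak activation memory scales with the mini-batch size rather than the effective batch size, while the accumulated gradient matches the full-batch gradient exactly.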
## Similar Articles
FoodCHA: Multi-Modal LLM Agent for Fine-Grained Food Analysis
This paper introduces FoodCHA, a multi-modal LLM agent framework designed for fine-grained food analysis, addressing challenges in hierarchical consistency and attribute discrimination for dietary monitoring.
MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings
MMCORE introduces a unified multimodal image generation and editing framework that aligns VLM semantic embeddings with diffusion conditioning, achieving state-of-the-art fidelity without costly fusion or from-scratch training.
Ecom-RLVE: Adaptive Verifiable Environments for E-Commerce Conversational Agents
Huggingface introduces EcomRLVE-GYM, a framework providing eight verifiable environments for training reinforcement learning agents on complex e-commerce tasks. The tool features adaptive difficulty curricula and algorithmic rewards to improve task completion in shopping assistants, demonstrated by training a Qwen 3 8B model.
SIMMER: Cross-Modal Food Image–Recipe Retrieval via MLLM-Based Embedding
SIMMER proposes a novel MLLM-based embedding approach for cross-modal food image-recipe retrieval, replacing traditional dual-encoder architectures with a unified encoder and achieving state-of-the-art results on the Recipe1M dataset with significant improvements over prior methods.
Multi-Perspective Evidence Synthesis and Reasoning for Unsupervised Multimodal Entity Linking
MSR-MEL introduces an unsupervised framework using LLMs to synthesize and reason over multi-perspective evidence for multimodal entity linking, outperforming prior methods on standard benchmarks.