Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning

arXiv cs.AI 06/16/26, 04:00 AM Papers

multimodal-llm visual-search active-visual-reasoning agent deep-search multimodal-agent

Summary

Visual-Seeker proposes a visual-native multimodal deep search agent that actively reasons over fine-grained visual details and synthesizes multimodal evidence, achieving state-of-the-art performance on five challenging multimodal search benchmarks.

arXiv:2606.15231v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) have demonstrated impressive capabilities in many visual tasks, but they often struggle with factual grounding when confronted with complex, open-world scenarios. While recent multimodal deep search agents attempt to address this issue by utilizing external tools, the visual-native search paradigm remains underexplored. Existing methods primarily rely on simple images with explicit semantics and text-only evidence trajectories, limiting the agent's ability to perform multi-hop, cross-modal reasoning and search. To address these limitations, we propose Visual-Seeker, a visual-native multimodal deep search agent via active visual reasoning. Rather than treating vision as a static input, our agent actively attends to fine-grained visual details and dynamically harvests visual evidence throughout the search process. To unlock its visual-native potential, we design an active visual reasoning data pipeline and synthesize 5K high-quality multimodal trajectories for model training. Extensive experiments demonstrate state-of-the-art performance across five challenging multimodal search benchmarks, even surpassing several proprietary models, validating robust visual-native reasoning and search in real-world web environments. The code and data can be accessed at: https://github.com/ZhengboZhang/Visual-Seeker.

Cached at: 06/16/26, 11:44 AM

# Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning
Source: [https://arxiv.org/html/2606.15231](https://arxiv.org/html/2606.15231)

Changtao Miao (3,∗,†); Jinbo Su (4,∗); Zhaowen Zhou (3,∗,†); Chunxia Zhang (5,‡); Xukai Wang (2); Ruiqi Liu (2); Kaiyuan Zheng (3); Jiansheng Cai (3); Bo Zhang (3); Zhe Li (3); Shiming Xiang (1,2); Ying Yan (3,‡)

1 School of Artificial Intelligence, UCAS; 2 Institute of Automation, CAS; 3 Ant Digital Technologies, Ant Group; 4 RUC; 5 BIT

∗ Equal contribution. † Project leader. ‡ Corresponding to cxzhang@bit.edu.cn, fuying.yy@antgroup.com

###### Abstract

Multimodal large language models (MLLMs) have demonstrated impressive capabilities in many visual tasks, but they often struggle with factual grounding when confronted with complex, open-world scenarios. While recent multimodal deep search agents attempt to address this issue by utilizing external tools, the visual-native search paradigm remains underexplored. Existing methods primarily rely on simple images with explicit semantics and text-only evidence trajectories, limiting the agent's ability to perform multi-hop, cross-modal reasoning and search. To address these limitations, we propose Visual-Seeker, a visual-native multimodal deep search agent via active visual reasoning. Rather than treating vision as a static input, our agent actively attends to fine-grained visual details and dynamically harvests visual evidence throughout the search process. To unlock its visual-native potential, we design an active visual reasoning data pipeline and synthesize 5K high-quality multimodal trajectories for model training. Extensive experiments demonstrate state-of-the-art performance across five challenging multimodal search benchmarks, even surpassing several proprietary models, validating robust visual-native reasoning and search in real-world web environments. The code and data can be accessed at: [https://github.com/ZhengboZhang/Visual-Seeker](https://github.com/ZhengboZhang/Visual-Seeker).

## 1 Introduction

Rapid progress in large language models (LLMs) has spurred the development of autonomous search agents capable of multi-hop reasoning, tool use, and web navigation [search-r1; webdancer; browseragent]. However, these agents operate only on text, which makes it difficult for them to handle visual queries and the visual information found in web environments. To bridge this gap, recent works have extended deep search agents with multimodal large language models (MLLMs) so that they can handle image and text inputs [mmsearch-r1; sensenova; skywork]. This extension has driven the development of deep search agents for open-world visual question answering (VQA).

![Refer to caption](https://arxiv.org/html/2606.15231v1/x1.png)

Figure 1: (a) Real-world image queries typically present complex, entity-rich visual content. The model must leverage robust visual understanding to accurately identify target entities and iteratively refine the search process. (b) A multimodal search agent should actively aggregate multimodal evidence from web sources and synthesize these diverse cues through cross-modal reasoning to generate comprehensive answers.

However, the visual-native capabilities of multimodal deep search agents remain largely unexplored. Visual information is used only as an adjunct to the query input, which limits the model's ability to handle scenarios that require long-horizon visual reasoning. Figure [1](https://arxiv.org/html/2606.15231#S1.F1) illustrates two resulting problems. First, the images that users submit often come from the real world, with complex backgrounds and multiple entities, which challenges the visual perception capabilities of multimodal search agents. Existing methods, however, often construct their training data from images whose entities carry explicit semantic information obtainable through web search, which hinders the training of visual perception capabilities [webwatcher; mmds; mmsearch-r1]. Second, the real-world web environment contains rich visual information, so multimodal search agents should be able to collect visual evidence proactively. Most existing methods, however, only insert visual queries into the prompt and have no search trajectories that rely on visual evidence, which limits their coverage of multi-hop and multimodal scenarios [sensenova; skywork; webwatcher].

To address these limitations, we propose Visual-Seeker, a visual-native multimodal deep search agent that bridges the gap between passive visual perception and active cross-modal search in open-world environments. As illustrated in Figure [1](https://arxiv.org/html/2606.15231#S1.F1), in contrast to existing methods, our agent is designed to strengthen visual-native capability within the search trajectory through active visual reasoning. Specifically, our model can perceive fine-grained properties in complex real-world images containing multiple interconnected entities, dynamically perform multi-hop reasoning, and proactively collect and integrate visual evidence during the search process. To cultivate these visual-native capabilities, we design an Active Visual Reasoning data synthesis pipeline with three stages. First, we extract the names and visual descriptions of seed entities from real-world images accompanied by visual reasoning processes, thereby obtaining target entities in complex images. This drives the model to perform fine-grained visual perception and to search specific regions for semantic information about the target entities. Second, we use a dual-strategy random walk to expand the depth and breadth of the search trajectory. Finally, we use search engines to obtain information-rich images of the entities and merge in queries that contain visual evidence. Based on this data pipeline, we synthesize 5K high-quality multimodal search trajectories and use them to train Visual-Seeker. Extensive experiments across five challenging benchmarks demonstrate that our agent achieves state-of-the-art performance, outperforming existing open-source agents and several proprietary models.

The key contributions of this work can be summarized as follows:

- We propose a multimodal deep search agent, Visual-Seeker, which combines visual-native capabilities with search. It can perform visual understanding of images in complex multi-entity scenes and proactively collect visual evidence for cross-modal search.
- We design an Active Visual Reasoning data synthesis pipeline that enables models to develop fine-grained visual perception and the ability to actively collect visual evidence during deep search. The 5K high-quality trajectories generated by the pipeline are used for model training.
- Our agent achieves state-of-the-art performance on five multimodal search benchmarks and outperforms several proprietary models.

## 2 Related Works

### 2.1 Text-only Deep Search Agent

The pre-trained knowledge of large language models is limited by a training cutoff, and retrieval-augmented generation (RAG) [rag1; rag2; xue2022multimethod] requires a knowledge base to be built in advance. Text-only deep search agents aim to overcome these restrictions by leveraging external tools to search in real-world environments [search-r1; webdancer]. They transform complex information retrieval into iterative loops of reasoning and tool calls [browseragent; tongyidr; websailor]. This paradigm empowers large language models to generate search queries and browse web pages autonomously. However, such text-based search agents are restricted to textual queries and document retrieval, and therefore cannot interpret or leverage multimodal data from real-world scenarios.

### 2.2 Multimodal Deep Search Agent

The development of multimodal models has driven researchers to explore multimodal deep search agents. Early studies [mmsearch-r1; webwatcher] equip the agent with reverse image search tools to obtain the semantic information of the entire image from the visual input, and fuse textual QA with entity-visual queries to generate fine-tuning data, enabling the agent to retrieve images and perform multi-turn reasoning. Recent works [deepmmsearch; skywork; sensenova; deepeyes] have introduced an image cropping tool that leverages the model's visual grounding capabilities to retrieve target entities within images, mitigating the interference of background noise.

While the ability of agents to interact with tools and perform multi-turn reasoning is crucial, visual reasoning and the ability to proactively gather visual evidence are also indispensable in multi-turn search for complex problems [mmbc; visbrowse]. However, existing methods still have two limitations in constructing fine-tuning data for multimodal search agents: they lack visual queries over complex, multi-entity images that closely resemble the real world, and they do not place visual evidence on the necessary path of the search.

## 3 Method

### 3.1 Active Visual Reasoning Data Pipeline

To construct high-quality training data for multimodal deep search agents, we propose an active visual reasoning data synthesis pipeline for the visual-native search task. As illustrated in Figure [2](https://arxiv.org/html/2606.15231#S3.F2), our pipeline generates multi-hop multimodal deep search trajectories, starting from complex entity-centric queries and strategically injecting visual evidence to activate the visual-native reasoning capabilities of MLLMs.

![Refer to caption](https://arxiv.org/html/2606.15231v1/x2.png)

Figure 2: Active Visual Reasoning Data Pipeline. This pipeline synthesizes complex visual queries by extracting entity information from multi-entity images, expands the search depth through random walks on a knowledge graph, and ultimately inserts visual evidence into the search trajectory. (a) Multi-entity visual description extraction and entity name filtering and disambiguation. (b) Dual-strategy random walk expands search depth. (c) Injection of visual evidence to force visual search and reasoning.

#### 3.1.1 Fine-grained Seed Entity Selection

Accurately obtaining the target seed entity to be searched in multi-entity images is crucial for activating the model's fine-grained visual perception ability. LiveVQA [livevqa] is a large-scale dataset featuring time-sensitive visual questions and multi-entity images. It also provides a reasoning process for each VQA instance, enabling the extraction of multiple seed entities and their corresponding captions from complex real-world images.

##### Entity Recognition\.

Unlike conventional named entity recognition (NER), which operates solely on text, our approach leverages both visual and textual cues to identify entities that are visually grounded and relevant to the reasoning chain. Given a query image $I$, question $Q$, and reasoning process $R$ sampled from LiveVQA, we employ an MLLM with in-context learning to perform joint entity extraction. The MLLM is instructed to output a structured list of entities, where each entity $E_i$ is represented as:

$$E_i = (n_i, d_i, \mathrm{type}_i) \tag{1}$$

where $n_i$ is the exact name of the entity, $d_i$ is a concise textual description of the entity's state in the image, such as “The man wearing a pink shirt in the picture”, and $\mathrm{type}_i$ is the category of the entity, such as person, location, or organization.

##### Entity Filtering and Disambiguation\.

Not all extracted entities are suitable for grounding search queries\. We apply a three\-step filtering strategy to ensure entity quality\.

Step 1: Generic Mention Filtering. We filter out entities that lack specific identifiers, such as pronouns or generic noun phrases like “the man” and “a building”. Formally, an entity $E_i$ is retained only if its name $n_i$ is a proper noun or a uniquely identifiable common noun within the given context. This is implemented through a rule-based filter combined with MLLM verification.

Step 2: Complex Entity-Image Filtering. Complex multi-entity images are key to increasing the difficulty of the reasoning questions, so we filter out images whose entity semantics are obvious. Specifically, we construct the context from the query image $I$ and entity description $d_i$, and ask the MLLM to determine whether the semantic information of the entities in the image can be obtained easily. An example of such an easy case is a photograph of a single person with the entity description “The main character in the picture”.

Step 3: Entity Disambiguation. For polysemous entities, such as “Apple”, which refers to both a company and a fruit, we perform context-aware disambiguation. We construct a disambiguation prompt that provides the query image $I$, question $Q$, and surrounding entities as context, asking the MLLM to select the correct sense from candidate Wikipedia disambiguation pages.

After entity recognition, filtering, and disambiguation, we obtain query images containing semantically complex entities and extract 2K entities from them as seed entities for multi-hop VQA synthesis.
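
The entity tuple of Eq. (1) and the rule-based part of Step 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the word lists `GENERIC_WORDS` and `DETERMINERS` and the function names are hypothetical, and the MLLM verification that the paper combines with the rules is omitted.

```python
from dataclasses import dataclass

# Hypothetical word lists; the paper does not publish its filtering rules.
GENERIC_WORDS = {"man", "woman", "person", "people", "building", "he", "she", "it", "they"}
DETERMINERS = {"a", "an", "the", "this", "that"}


@dataclass(frozen=True)
class Entity:
    name: str         # n_i: exact name of the entity
    description: str  # d_i: the entity's state in the image
    type: str         # type_i: person, location, organization, ...


def is_generic_mention(name: str) -> bool:
    """Return True when the name carries no specific identifier."""
    tokens = [t for t in name.lower().split() if t not in DETERMINERS]
    return len(tokens) == 0 or all(t in GENERIC_WORDS for t in tokens)


def filter_entities(entities):
    """Keep only entities whose names pass the rule-based generic-mention check."""
    return [e for e in entities if not is_generic_mention(e.name)]
```

A fuller pipeline would pass the survivors of `filter_entities` to an MLLM for verification, image-complexity filtering, and disambiguation.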

#### 3.1.2 Multi-hop Trajectory Synthesis

Given the seed entities, we synthesize multi-hop reasoning questions by randomly walking on an offline Wikipedia knowledge graph. The core objective is to generate diverse, non-linear search trajectories that drive the search agent to call tools and collect evidence from multiple sources.

##### Entity Node Extension

We construct an offline knowledge graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ from Wikipedia and perform random walks starting from the seed entity node $V^{(0)}$, recursively following links from each entity node to simulate human browsing behavior. A naive depth-first random walk on $\mathcal{G}$ produces linear reasoning chains, which are insufficient for training robust search agents. We therefore introduce two strategies to extend the topology of the subgraph.

Strategy 1: Backtracking. To simulate the realistic behavior of search agents revisiting previous assumptions, we use a backtracking strategy. At step $t$, with probability $p_{\text{back}} \in (0,1)$, the walker jumps back to a previously visited entity $V^{(\tau)}$, where $\tau < t$, chosen from the walk history $\mathcal{H}_t = \{V^{(0)}, \ldots, V^{(t)}\}$:

$$V^{(t+1)} = \begin{cases} V^{(\tau)} \sim \mathcal{H}_t & \text{with probability } p_{\text{back}}, \\ V' \sim \mathcal{N}(V^{(t)}) & \text{otherwise}. \end{cases} \tag{2}$$

where $\mathcal{N}(V^{(t)})$ is the set of neighbor nodes of $V^{(t)}$. This strategy creates tree-shaped structures within the trajectory, forcing the agent to manage and compare multiple reasoning branches.
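
The backtracking rule in Eq. (2) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' code: the graph is a plain adjacency dictionary, sampling from the history and from the neighbors is uniform, and the function name and the dead-end handling are hypothetical.

```python
import random


def backtracking_walk(graph, seed, steps, p_back, rng=None):
    """Random walk that jumps back to an earlier node with probability p_back.

    graph: dict mapping each node to a list of neighbor nodes (assumed format).
    Returns the visit history and the list of forward edges that were followed.
    """
    if rng is None:
        rng = random.Random()
    history = [seed]  # walk history H_t
    edges = []        # forward edges; backtracking makes them form a tree-like shape
    current = seed
    for _ in range(steps):
        earlier = history[:-1]  # candidates V^(tau) with tau < t
        neighbors = graph.get(current, [])
        if earlier and rng.random() < p_back:
            current = rng.choice(earlier)  # jump back to a visited entity
        elif neighbors:
            nxt = rng.choice(neighbors)    # follow an outgoing link
            edges.append((current, nxt))
            current = nxt
        else:
            break  # dead end with no backtrack drawn: stop the walk (assumption)
        history.append(current)
    return history, edges
```

With `p_back` set to 0 the walk degenerates to the linear chain that the text describes as insufficient; larger values produce more branching from earlier nodes.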

Strategy 2: Cycle Constraint. Starting from the seed entity, the walker first traverses a shared prefix of length $L_{prefix}$ to reach a fork node. At the fork, two disjoint branches are spawned, each extending independently for $L_{branch}$ steps with mutual node exclusion to ensure path diversity. The branches then converge toward a common node through iterative expansion of their frontier sets, with a maximum of $T$ attempts to discover a valid convergence point. A BFS verification confirms the existence of two node-disjoint paths between the fork and the convergence node, ensuring a strict cycle topology.
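
One way to check for two node-disjoint paths with BFS is sketched below. The paper does not describe its exact verification procedure, so this is an assumption: the sketch splits every node into an "in" and an "out" copy with unit capacity and runs up to two BFS augmentations, a standard unit-capacity max-flow argument. The function name and the adjacency-dictionary graph format are hypothetical.

```python
from collections import defaultdict, deque


def has_two_disjoint_paths(graph, fork, conv):
    """Return True if two internally node-disjoint directed paths lead from fork to conv."""
    cap = defaultdict(int)  # residual capacities
    adj = defaultdict(set)  # adjacency of the residual graph

    def add_edge(a, b):
        cap[(a, b)] += 1
        adj[a].add(b)
        adj[b].add(a)  # reverse direction, used only for residual flow

    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    for v in nodes:
        add_edge((v, "in"), (v, "out"))  # each node may carry one path
    for u, vs in graph.items():
        for v in set(vs):
            if v != u:  # ignore self-loops
                add_edge((u, "out"), (v, "in"))

    source, sink = (fork, "out"), (conv, "in")
    flow = 0
    while flow < 2:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:  # BFS for an augmenting path
            a = queue.popleft()
            for b in adj[a]:
                if b not in parent and cap[(a, b)] > 0:
                    parent[b] = a
                    queue.append(b)
        if sink not in parent:
            break
        b = sink
        while parent[b] is not None:  # push one unit of flow along the path
            a = parent[b]
            cap[(a, b)] -= 1
            cap[(b, a)] += 1
            b = a
        flow += 1
    return flow >= 2
```

A diamond-shaped subgraph passes the check, while a subgraph whose two branches merge into one shared intermediate node before the convergence node fails it.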

##### QA Generation

We sample connected subgraphs $\mathcal{G}_s \subseteq \mathcal{G}$ from the topology obtained by entity node extension. Using the relationships between entities in the subgraph as reasoning context, we prompt a large language model to generate text questions that conform to the subgraph constraints.

#### 3.1.3 Visual Evidence Injection

While the knowledge-graph-based QA synthesis equips the agent with structured factual reasoning, it neglects a critical capability in multimodal deep search: actively acquiring and interpreting visual evidence from web pages. To address this gap, we introduce visual evidence into the reasoning process of the synthetic VQA, forcing the agent to call tools to search for images and perform pixel-level reasoning.

For each VQA instance, we locate the name of its answer entity and its attribute description on the wiki page. We then prompt an LLM to generate a search keyword and use a search engine to retrieve a set of candidate images. This retrieves images that contain richer visual information, activating the model's visual understanding capabilities.

For each candidate image, we use an MLLM to extract visual details that cannot be derived from textual descriptions alone. The MLLM is prompted to identify fine-grained attributes in the image, such as color patterns and spatial layouts. We then use the fuzzy search keywords as the question and the visual details as the answer to synthesize a two-hop QA pair, driving the model to invoke tools for text-to-image retrieval and visual perception. Finally, we merge the extended two-hop QA into the original VQA. Examples of our synthesized data are shown in Appendix [A.1](https://arxiv.org/html/2606.15231#A1.SS1).

### 3.2 Visual-Seeker

Based on the VQA dataset synthesized by our data pipeline, we train Visual-Seeker using a carefully crafted supervised fine-tuning (SFT) method, without relying on costly reinforcement learning. We employ a teacher model to generate multimodal search trajectories based on our agentic workflow and perform SFT as a cold start.

![Refer to caption](https://arxiv.org/html/2606.15231v1/x3.png)

Figure 3: Tool call distribution of training trajectories. Our synthesized queries require more `search_image` tool calls and a more balanced tool call pattern. VEI stands for Visual Evidence Injection.

#### 3.2.1 Workflow and Tools

We provide the agent with five external tools: (1) `text_search`, powered by SerperAPI (https://serper.dev), retrieves relevant webpage titles and URLs for a natural language query; (2) `reverse_image_search` retrieves images related to the input image and returns the corresponding webpage titles and URLs; (3) `search_image` retrieves images related to a textual input for active visual evidence collection; (4) `visit` uses JinaAPI (https://jina.ai) to capture webpage content and passes it to a summary model for summarization, for which we use Qwen3-30B-A3B-Instruct; (5) `image_crop` crops the image at the input coordinates for fine-grained visual reasoning.

The multimodal search agent operates in a ReAct-style loop. Given a text query $Q$ and optional images $I$, the context $C_i$ in the $i$-th interaction round is denoted as:

$$C_i = (Q, I, R_1, A_1, \mathcal{O}_1, \ldots, R_i, A_i, \mathcal{O}_i) \tag{3}$$

where $R$ is the model's reasoning content, $A$ represents the tool call parameters within `<tool_call>` blocks, and $\mathcal{O}$ represents the results returned by the tools. The loop terminates when the model outputs an answer or reaches the maximum number of tool interaction rounds, which we set to 15.
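
The loop behind Eq. (3) can be sketched as follows. This is a minimal sketch, not the released agent code: the `model` callable, the dictionary it returns, and the tool registry format are hypothetical stand-ins, and only the 15-round limit and the reasoning, action, observation ordering come from the text.

```python
MAX_ROUNDS = 15  # maximum number of tool interaction rounds, as stated above


def run_agent(model, tools, query, images=()):
    """Run a ReAct-style loop and return (answer, context).

    model: callable taking the context and returning a dict with "reasoning"
           plus either "answer" or "tool_call" (hypothetical interface).
    tools: dict mapping tool names to Python callables (hypothetical registry).
    """
    context = [("query", query), ("images", tuple(images))]  # Q and I
    for _ in range(MAX_ROUNDS):
        step = model(context)
        context.append(("reasoning", step["reasoning"]))     # R_i
        if "answer" in step:
            return step["answer"], context                   # model chose to stop
        call = step["tool_call"]
        observation = tools[call["name"]](**call["arguments"])
        context.append(("action", call))                     # A_i
        context.append(("observation", observation))         # O_i
    return None, context  # round budget exhausted without an answer
```

Keeping the context as an append-only list mirrors the growing tuple in Eq. (3), so each round sees every earlier reasoning step, tool call, and observation.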

#### 3.2.2 Supervised Fine-Tuning

We first perform SFT for a cold start, teaching the base model the essential capabilities required of a multimodal search agent. Following the data pipeline described in Section [3.1](https://arxiv.org/html/2606.15231#S3.SS1), we collect 3K VQA instances without visual evidence injection and 800 VQA instances with visual evidence injection. Based on our agentic workflow, we use Claude-4.6-Opus [claude] as the teacher model to generate multimodal deep search trajectories. In addition, we synthesize text-only QA instances and generate 500 trajectories from them. We also sample VQA instances from the FVQA [mmsearch-r1] training set and generate 700 multimodal trajectories. Finally, we mix all of the above into 5K trajectories (3K + 800 + 500 + 700) and train the model using cross-entropy loss. The goal of SFT is to guide the model to learn a pattern of multi-turn interaction with the tools, and to actively collect and organize visual and textual evidence during the search process.

The distribution of tool call counts in the trajectories from the various data sources is shown in Figure [3](https://arxiv.org/html/2606.15231#S3.F3). The high percentage of `text_search` calls in trajectories built from our data pipeline indicates that our queries require multi-hop search. After visual evidence injection is introduced, the proportion of `search_image` calls in the agent trajectories increases significantly. This indicates that our synthesized queries encourage the model to collect multimodal evidence from web pages for a deeper search process.

## 4 Experiments

### 4.1 Experimental Setup

##### Implementation details\.

During the SFT stage, we use Ms-Swift [ms-swift] as the training framework for full fine-tuning of the model. The model is trained for 3 epochs with a batch size of 8 and a learning rate of 2e-6. We use Qwen3-VL-8B-Instruct [qwen3] as the base model, and training is conducted on 8 NVIDIA A100 GPUs.

##### Evaluation benchmarks\.

We evaluate our model on five challenging multimodal agentic search benchmarks: MMSearch [mmsearch], MMSearch-Plus [mmsearchp], BrowseComp-VL [webwatcher], MM-BrowseComp [mmbc], and VisBrowse-Bench [visbrowse]. For MMSearch-Plus, following previous work [vdr; points], we use only single-image samples.

##### Baselines\.

We compare against three types of baselines: direct answer, agentic workflow, and multimodal deep search agents. The evaluated models include proprietary and open-source multimodal models: GPT-5 [gpt5], the Gemini-2.5 series [gemini], Claude-4-Sonnet [claude], Qwen3-VL-8B-Instruct [qwen3], and various multimodal deep search agents.

- Direct Answer: Models answer the question relying on internal parametric knowledge, without external tool access.
- Agentic Workflow: Models can use all the tools in our agent framework to collect visual and textual evidence.
- Multimodal Deep Search Agent: We compare the performance with existing multimodal search agents.

##### Evaluation metrics\.

We use accuracy (%) as the metric to evaluate the model's performance on the five benchmarks. Qwen3-235B-A22B-Instruct is employed as a judge model, using the LLM-as-Judge method to evaluate answer correctness against the ground truth. Details of the prompts are shown in Appendix [A.2](https://arxiv.org/html/2606.15231#A1.SS2).

Table 1: Performance comparison between our model and other methods across five challenging benchmarks. MMSearch+ stands for MMSearch-Plus, BC-VL for BrowseComp-VL, and MM-BC for MM-BrowseComp. The **bold numbers** represent the best accuracy in each benchmark. Δ represents the improvement of our model over the base model in the agentic workflow.

| Model | MMSearch | MMSearch+ | BC-VL | MM-BC | VisBrowse | Avg. |
|---|---|---|---|---|---|---|
| *Direct Answer* | | | | | | |
| GPT-5 | 33.3 | 19.1 | 47.2 | 10.3 | 26.0 | 27.2 |
| Gemini-2.5-Flash | 30.4 | 8.1 | 37.1 | 5.4 | 16.0 | 19.4 |
| Gemini-2.5-Pro | 39.8 | 14.5 | 43.1 | 10.3 | 19.5 | 25.4 |
| Claude-4-Sonnet | 18.7 | 4.0 | 29.3 | 5.3 | 8.3 | 13.1 |
| Qwen3-VL-8B-Instruct | 15.2 | 3.2 | 25.1 | 4.9 | 10.7 | 11.8 |
| *Agentic Workflow* | | | | | | |
| GPT-5 | 65.7 | **34.5** | **49.1** | 9.4 | 29.0 | 37.5 |
| Gemini-2.5-Flash | 67.0 | 23.3 | 43.1 | 10.4 | 20.7 | 32.9 |
| Gemini-2.5-Pro | 67.8 | 30.9 | 45.2 | 14.8 | 26.6 | 37.1 |
| Claude-4-Sonnet | 70.3 | 20.9 | 44.1 | 6.3 | 19.5 | 32.2 |
| Qwen3-VL-8B-Instruct | 53.8 | 10.9 | 28.4 | 6.7 | 15.4 | 23.0 |
| *Multimodal Deep Search Agent* | | | | | | |
| MMSearch-R1-7B [mmsearch-r1] | 53.8 | - | - | - | - | - |
| WebWatcher-7B [webwatcher] | 49.1 | - | 21.2 | - | - | - |
| WebWatcher-32B [webwatcher] | 55.3 | - | 27.0 | - | - | - |
| DeepEyesV2-7B [deepeyes] | 63.7 | - | - | - | - | - |
| Skywork-R1V4-30B-A3B [skywork] | 66.1 | - | 38.4 | - | - | - |
| SenseNova-MARS-8B [sensenova] | 67.8 | - | - | - | - | - |
| Vision-DeepResearch-8B [vdr] | 69.6 | 20.4 | 42.6 | - | - | - |
| VSearcher-8B [vsearcher] | 47.2 | - | 30.8 | 6.2 | - | - |
| MM-DeepResearch-8B [mmds] | 67.8 | - | 37.9 | - | - | - |
| MM-DeepResearch-32B [mmds] | 69.0 | - | 43.0 | - | - | - |
| Points-Seeker-8B [points] | 70.8 | 25.2 | 44.4 | - | - | - |
| OpenSearch-VL-8B [opensearch] | 64.5 | - | 37.6 | - | - | - |
| Visual-Seeker (Ours) | **72.2** | 27.3 | 47.6 | **16.1** | **34.7** | **39.6** |
| Δ Qwen3-VL-8B-Instruct (Agentic) | +18.4 | +16.4 | +19.2 | +9.4 | +19.3 | +16.6 |

### 4.2 Main Results

As shown in Table [1](https://arxiv.org/html/2606.15231#S4.T1), we evaluate proprietary MLLMs and multimodal search agents under three settings. Most models perform poorly in the direct answer setting, which reflects their limited pre-trained knowledge; for example, Claude-4-Sonnet achieves an average score of only 13.1 across the five benchmarks. After integration with our agent workflow, every model's average accuracy improves substantially, with Claude-4-Sonnet achieving a 145.8% relative increase (13.1 to 32.2). This demonstrates the robustness of our workflow and its applicability to various models.
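
The 145.8% figure is a relative gain over the direct-answer average, which can be reproduced from the Claude-4-Sonnet averages in Table 1. The helper name below is illustrative only.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# Claude-4-Sonnet average accuracy from Table 1: direct answer vs. agentic workflow.
claude_direct, claude_agentic = 13.1, 32.2
print(f"{relative_gain(claude_direct, claude_agentic):.1f}%")  # prints 145.8%
```

The absolute gain in this case is 19.1 accuracy points, so the two ways of reporting the improvement differ widely when the starting accuracy is low.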

Our agent achieves an average accuracy of 39.6% across the five benchmarks, outperforming all current multimodal deep search agents and remaining competitive with proprietary models. Compared to the Qwen3-VL-8B-Instruct (Agentic) baseline, our model improves on every benchmark by 9.4 to 19.3 points, and more than doubles the accuracy on MMSearch-Plus, MM-BrowseComp, and VisBrowse-Bench.

On MMSearch-Plus, which features complex multi-entity image queries, our model is strongly competitive, indicating a significant improvement in visual understanding capabilities. On MM-BrowseComp and VisBrowse-Bench, where visual evidence is required during the search process, our model outperforms even the proprietary models GPT-5 and Gemini-2.5-Pro.

### 4.3 Ablation Study and Analysis

##### Data Ablation\.

To validate the effectiveness of our data synthesis pipeline, we perform an ablation analysis on data from different sources and modalities. Specifically, we incrementally add four types of data to the training set and train the base model using SFT. As shown in Table [2](https://arxiv.org/html/2606.15231#S4.T2), training the model on open-source multimodal query and text query trajectories allows it to learn tool calls and reasoning patterns, resulting in a slight improvement in average performance. The multimodal trajectories synthesized by our data pipeline bring significant benefits to multi-hop search. On MMSearch-Plus, which requires fine-grained visual perception, our data yields a 17.2% relative improvement (20.9 to 24.5). After training with data infused with visual evidence, the model's ability to search for visual information improves, resulting in significant gains on MM-BrowseComp and VisBrowse.

Table 2: Ablation results of training data. Four types of data are added incrementally: FVQA, QA generated by our data pipeline, VQA without visual evidence injection (VEI), and VQA with VEI.

| Data | MMS+ | MM-BC | Vis | Avg. |
|---|---|---|---|---|
| Qwen3-VL-8B-Instruct | 10.9 | 6.7 | 15.4 | 11.0 |
| + FVQA + QA traj. | 20.9 | 6.3 | 10.7 | 12.6 |
| + w/o VEI VQA traj. | 24.5 | 11.1 | 20.1 | 18.6 |
| + w/ VEI VQA traj. | 27.3 | 16.1 | 34.7 | 26.0 |

##### Tool Ablation\.

Our approach relies on visual reasoning and visual evidence search to achieve visual-native search, with two core tools: `image_crop` and `search_image`. To verify their effectiveness, we remove each tool in turn and then both together. As shown in Table [3](https://arxiv.org/html/2606.15231#S4.T3), removing either tool reduces the model's performance on every benchmark evaluated. The largest decline after removing the `image_crop` tool occurs on the VisBrowse benchmark, indicating that image queries on this benchmark contain multiple complex entities and that it is difficult to obtain the semantics of the target entity by searching with the entire image. The largest decline after removing the `search_image` tool also occurs on VisBrowse, indicating that this benchmark requires visual evidence to be integrated into the search trajectory to arrive at the correct answer. With both tools removed, the average accuracy falls from 26.0 to 17.1, reflecting degraded active visual grounding and visual evidence collection.

Table 3: Ablation results of tools. ‘w/o IC’ represents removing the `image_crop` tool and ‘w/o SI’ represents removing the `search_image` tool. Δ represents the performance reduction compared to the model equipped with the full tool set.

| Tools | MMS+ | MM-BC | Vis | Avg. |
|---|---|---|---|---|
| Visual-Seeker | 27.3 | 16.1 | 34.7 | 26.0 |
| w/o IC | 23.7 | 12.5 | 25.1 | 20.4 |
| Δ | -3.6 | -3.6 | -9.6 | -5.6 |
| w/o SI | 22.7 | 11.7 | 20.1 | 18.2 |
| Δ | -4.6 | -4.4 | -14.6 | -7.8 |
| w/o IC & SI | 21.5 | 9.9 | 19.9 | 17.1 |
| Δ | -5.8 | -6.2 | -14.8 | -8.9 |

##### Analysis of Tool Usage\.

As shown in Figure [5](https://arxiv.org/html/2606.15231#S4.SS3.SSS0.Px3)(a), we compute the average number of rounds in which the model interacts with tools across the five benchmarks. For relatively simple benchmarks, such as MMSearch, our model averages only 4.3 interaction turns. For more challenging benchmarks, such as MM-BrowseComp, the average rises to 14.1 turns. As shown in Figure [5](https://arxiv.org/html/2606.15231#S4.SS3.SSS0.Px3)(b), to analyze the patterns of tool usage, we calculate the distribution of calls across the different tools. On all benchmarks, the model tends to invoke the `text_search` tool, because textual evidence dominates the search trajectory for each benchmark. Compared to other benchmarks that require only a single reverse image search to obtain the semantics of an image query, VisBrowse requires more `reverse_image_search` and `search_image` tool calls. This indicates that the benchmark relies on obtaining visual evidence from web pages. A case study of our search trajectory can be found in Appendix [A.3](https://arxiv.org/html/2606.15231#A1.SS3).

Figure 5: (a) Average number of turns of tool interactions required per sample across the five benchmarks. (b) Distribution (%) of different tool types across five benchmarks.

## 5 Conclusion

In this paper, we formalize the limitations of existing search agents: text-only systems suffer from visual blindness, while multimodal extensions treat vision as a passive input. To address these limitations, we propose Visual-Seeker, a visual-native multimodal deep search agent that unifies fine-grained visual entity perception with active visual evidence harvesting across multi-hop trajectories. We further design an active visual reasoning data synthesis pipeline that extracts complex entities from multi-entity real-world images and strategically injects visual evidence, yielding 5K high-quality training trajectories. Our agent learns active visual reasoning capability from these data, achieving state-of-the-art performance across five challenging benchmarks, particularly in scenarios demanding precise multi-entity grounding and cross-modal evidence integration.

## Appendix A Appendix

### A\.1 Data Example

Based on our data synthesis pipeline, we synthesized a 5K multi\-hop VQA dataset containing complex entity queries and visual evidence\. Figure [6](https://arxiv.org/html/2606.15231#A1.F6) shows a data example without visual evidence injection, and Figure [7](https://arxiv.org/html/2606.15231#A1.F7) shows a data example with visual evidence injection\.

![Refer to caption](https://arxiv.org/html/2606.15231v1/x4.png)

Figure 6: Data examples without visual evidence injection\.

![Refer to caption](https://arxiv.org/html/2606.15231v1/x5.png)

Figure 7: Data examples with visual evidence injection\.

### A\.2 Prompt

System Prompt

You are a Web Information Seeking Master\. Your task is to thoroughly seek the internet for information and provide accurate answers to visual questions\.

As you proceed, adhere to the following principles:

1\. Decompose the original visual question into sub\-questions and solve them step by step\. Summarize the knowledge obtained from the previous round of dialogue, then think about what the next sub\-question is\.

2\. Whether you can answer the question or not, you should describe the image in detail\. If the image includes multiple sub\-images, you should describe each one separately\.

3\. Before calling any tools, you must provide a brief explanation of why you are calling the tool and what you expect to achieve\.

4\. You should provide the final answer within 15 turns, regardless of whether all valid information has been collected\.

Input Prompt

You are an intelligent agent engaged in a conversation with a user\. The user poses a question and provides a corresponding image for context\. As an agent, you approach the problem with care and methodical precision, following a multi\-step process to arrive at a solution\. You utilize a variety of tools, ensuring that the information gathered from each one is cross\-validated before you reach a final answer\. Rather than relying on any single tool for accuracy, you employ multiple tools iteratively to prioritize the comprehensiveness and reliability of your responses\.

To be successful, it is very important to follow the following rules:

1\. The assistant starts with one or more cycles of \(thinking about which tool to use \-\> performing tool call \-\> waiting for tool response\), and ends with \(thinking about the answer \-\> answer of the question\)\.

2\. If additional visual information is needed during the search process, ’search\_image’ tool can be used to search for images\.

3\. You can use ’image\_crop’ tool to zoom in on a specific region of the image and search for it\.

4\. You can only make one tool call per round and wait for the tool’s response\.

5\. Your answer should be inside ’<answer\></answer\>’ tags, and the answer must be the most concise output\.

Input Question:

Input image\_url:
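The interaction protocol described by the two prompts above \(one tool call per round, a final answer inside answer tags, and a 15\-turn budget\) can be sketched as a simple loop\. This is a hedged illustration rather than the released implementation: the function run\_agent, the model\_step callable, the tools dictionary, the example URL, and the assumption that tool calls follow the layout name \(args\) inside tool\_call tags are all hypothetical, and argument strings are passed to tools unparsed\.

```python
import re

# Assumed tag layout: <tool_call>name (args)</tool_call> and <answer>...</answer>.
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*(\w+)\s*\((.*?)\)\s*</tool_call>", re.DOTALL
)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
MAX_TURNS = 15  # turn budget stated in the system prompt


def run_agent(model_step, tools, question, image_url, max_turns=MAX_TURNS):
    """Alternate between model output and tool responses until an answer appears.

    model_step(history) -> str and tools[name](raw_args) -> str are
    hypothetical callables supplied by the caller.
    """
    history = [("user", f"{question}\n{image_url}")]
    for _ in range(max_turns):
        reply = model_step(history)
        history.append(("assistant", reply))
        answer = ANSWER_RE.search(reply)
        if answer:
            return answer.group(1).strip(), history
        calls = TOOL_CALL_RE.findall(reply)
        if len(calls) != 1:
            # Enforce the "one tool call per round" rule.
            history.append(("tool", "Error: make exactly one tool call per round."))
            continue
        name, raw_args = calls[0]
        if name not in tools:
            history.append(("tool", f"Error: unknown tool {name}."))
            continue
        history.append(("tool", tools[name](raw_args)))
    return None, history  # budget exhausted without a final answer


# Toy run with a scripted stand-in for the model (hypothetical data).
replies = iter([
    'I need the title. <tool_call>text_search ("2018 documentary")</tool_call>',
    "The poster shows the fruit. <answer>Banana</answer>",
])
tools = {"text_search": lambda raw_args: f"results for {raw_args}"}
answer, history = run_agent(
    lambda h: next(replies), tools, "What fruit?", "https://example.com/query.png"
)
print(answer)  # Banana
```

Returning None when the budget runs out is one possible design choice; the system prompt instead asks the model itself to commit to an answer within 15 turns\.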

LLM\-as\-Judge Prompt

You are a rigorous question\-and\-answer quality assessment expert\. Given a question, your task is to compare the "ground truth" and the "model answer", determine whether they are semantically equivalent, and output "yes" or "no"\.

question:

ground truth:

model answer:
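A minimal sketch of how such a judge prompt might be filled in and its verdict turned into an accuracy score is given below\. The helper names, the verdict\-parsing rule, and the choice to count unparseable judge outputs as incorrect are assumptions made for illustration; the call to the judge model itself is omitted\.

```python
import re


def build_judge_prompt(instructions, question, ground_truth, model_answer):
    """Append the three fields to the judge instructions, one per line."""
    return (
        f"{instructions}\n"
        f"question: {question}\n"
        f"ground truth: {ground_truth}\n"
        f"model answer: {model_answer}"
    )


def parse_verdict(judge_output):
    """Map free-form judge output to True/False, or None if no verdict is found."""
    match = re.search(r"\b(yes|no)\b", judge_output.lower())
    return None if match is None else match.group(1) == "yes"


def accuracy(verdicts):
    """Fraction of samples judged correct; unparseable verdicts count as wrong."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(v is True for v in verdicts) / len(verdicts)


prompt = build_judge_prompt(
    "Output yes or no.", "What fruit?", "Banana", "a banana"
)
verdicts = [parse_verdict(o) for o in ["Yes.", "no", "unclear", "yes"]]
print(accuracy(verdicts))  # 0.5
```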

### A\.3 Case Study

The tasks in VisBrowse\-Bench include both fine\-grained entity extraction and visual evidence collection\. Therefore, we conduct a case study on this benchmark to analyze the visual\-native ability of our model\.

##### Question:

The person in the picture is wearing a necklace from a certain brand\. In 2018, a documentary about the founder of that brand was released\. What fruit is the protagonist eating in the documentary poster?

##### Ground Truth:

Banana

![Refer to caption](https://arxiv.org/html/2606.15231v1/figures/44.png)

Figure 8: Visual query

Trajectory in VisBrowse\-Bench

Step 1

A: I’ll analyze this step by step\. Let me first examine the image closely to identify the necklace brand\.

<tool\_call\>image\_crop \(\[0\.25, 0\.35, 0\.75, 0\.75\]\)</tool\_call\>

Tool: The url of cropped image: https://…

Step 2

A: I can see a pearl necklace with what appears to be a Vivienne Westwood Orb logo pendant\. Let me confirm this by searching the image\.

<tool\_call\>reverse\_image\_search \("http://…"\)</tool\_call\>

Tool: The title of webpage: Starship Pearl Beaded Statement Choker Necklace \- Silver \| Fashion Nova; The url of image: https://…; The url of webpage: https://…

…

Step 3

A: The necklace appears to be a Vivienne Westwood piece\. The orb/crown logo is clearly visible on the pearl choker\. Now let me search for the documentary about Vivienne Westwood’s founder in 2018\.

<tool\_call\>text\_search \("documentary 2018 Vivienne Westwood founder Alexander McQueen"\)</tool\_call\>

Tool: A Google search for ’documentary 2018 Vivienne Westwood founder Alexander McQueen’ found 5 results:

Web Results

1\. \[Watch an exclusive clip for the new Vivienne Westwood documentary\]\(https://www…\)

Date published: Apr 30, 2018

Visually, the film will be a mixture of archival footage, reconstructions and interviews with Westwood’s extensive network of collaborators\.

…

Step 4

A: I see two documentaries from 2018: "Westwood: Punk, Icon, Activist" about Vivienne Westwood, and "McQueen" about Alexander McQueen\. The question says "a necklace from a certain brand" and then asks about "a documentary about the founder of that brand\."

The necklace is Vivienne Westwood, and Vivienne Westwood is a brand\. But wait \- the question could also refer to Alexander McQueen since McQueen worked closely with Vivienne Westwood\. Let me re\-read: "a documentary about the founder of that brand\." Vivienne Westwood is the founder of Vivienne Westwood\. The documentary is "Westwood: Punk, Icon, Activist" \(2018\)\.

Let me search for the poster of this documentary to see what fruit the protagonist is eating\.

<tool\_call\>search\_image \("Westwood Punk Icon Activist documentary 2018 poster"\)</tool\_call\>

Tool: The title of webpage: Westwood: Punk, Icon, Activist \(2018\) \- IMDb; The url of image: https://m\.media…; The url of webpage: https://www\.imdb…

…

Step 5

A: I found the documentary poster for "Westwood: Punk, Icon, Activist" \(2018\)\. In the poster, Vivienne Westwood \(the protagonist\) is clearly shown eating a \*\*banana\*\*\. The Japanese poster even has text that reads "Vivienne eating banana\."

Let me verify this by looking at the poster image more closely\.

<tool\_call\>visit\(goal="Find the poster of the documentary and identify what fruit Vivienne Westwood is eating", url="https://www\.imdb\.com/…"\)</tool\_call\>

Tool: The useful information in https://www\.imdb\.com… for user goal Find the poster of the documentary and identify what fruit Vivienne Westwood is eating as follows:

Evidence in page: …

Summary: …

Step 6

A: I need to find more useful information about Westwood\.

<tool\_call\>reverse\_image\_search \("http://…"\)</tool\_call\>

Tool: The title of webpage: Amazon\.com: Westwood: Punk, Icon, Activist : Vivienne Westwood, Lorna Tucker: Movies & TV; The url of image: https://m\.media…; The url of webpage: https://www…

…

Step 7

A: The evidence is clear\. In the documentary poster for "Westwood: Punk, Icon, Activist" \(2018\), Vivienne Westwood is shown eating a \*\*banana\*\*\. The Japanese poster even includes text confirming "Vivienne eating banana\."

<answer\>Banana</answer\>
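The image\_crop call in Step 1 of the trajectory above passes four numbers between 0 and 1\. One plausible reading, which is an assumption not confirmed by the paper, is that they are normalized \[x1, y1, x2, y2\] corner coordinates\. Under that assumption, the conversion to a pixel box can be sketched as follows; the helper name and the example image size are hypothetical\.

```python
def normalized_box_to_pixels(box, width, height):
    """Convert an assumed [x1, y1, x2, y2] box in [0, 1] to integer pixel corners."""
    x1, y1, x2, y2 = box
    if not (0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0):
        raise ValueError("box must satisfy 0 <= x1 < x2 <= 1 and 0 <= y1 < y2 <= 1")
    return (
        round(x1 * width),
        round(y1 * height),
        round(x2 * width),
        round(y2 * height),
    )


# The box from Step 1, applied to a hypothetical 1000x800 image.
pixel_box = normalized_box_to_pixels([0.25, 0.35, 0.75, 0.75], 1000, 800)
print(pixel_box)  # (250, 280, 750, 600)
```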
