Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents

Hugging Face Daily Papers 05/11/26, 12:00 AM Papers

multimodal-agents deep-search data-evolution on-policy-training visual-reasoning qwen-vl

Summary

This paper introduces a visual-native agent harness and On-Policy Data Evolution (ODE) to improve multimodal deep search agents. The harness makes intermediate visual evidence reusable across tools, and ODE adds closed-loop data generation driven by rollouts of the policy being trained. Across 8 benchmarks, ODE raises the Qwen3-VL-8B agent's average score from 24.9% to 39.0%, surpassing Gemini 2.5 Pro (37.9%).

Multimodal deep search requires an agent to solve open-world problems by chaining search, tool use, and visual reasoning over evolving textual and visual context. Two bottlenecks limit current systems. First, existing tool-use harnesses treat images returned by search, browsing, or transformation as transient outputs, so intermediate visual evidence cannot be re-consumed by later tools. Second, training data is usually built by fixed curation recipes that cannot track the target agent's evolving capability. To address these challenges, we first introduce a visual-native agent harness centered on an image bank reference protocol, which registers every tool-returned image as an addressable reference and makes intermediate visual evidence reusable by later tools. On top of this harness, On-Policy Data Evolution (ODE) runs a closed-loop data generator that refines itself across rounds from rollouts of the policy being trained. This per-round refinement makes each round's data target what the current policy still needs to learn. The same framework supports both diverse supervised fine-tuning data and policy-aware reinforcement learning data curation, covering the full training lifecycle of the target agent. Across 8 multimodal deep search benchmarks, ODE improves the Qwen3-VL-8B agent from 24.9% to 39.0% on average, surpassing Gemini 2.5 Pro in the standard agent-workflow setting (37.9%). At 30B, ODE raises the average score from 30.6% to 41.5%. Further analyses validate the effectiveness of image-bank reuse, especially on complex tasks requiring iterative visual refinement, while rollout-feedback evolution yields more grounded SFT traces and better policy-matched RL tasks than static synthesis.
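To make the image bank idea concrete, here is a minimal Python sketch of what such a reference protocol might look like. It is not the paper's implementation: the `ImageBank` class, the `img_N` id scheme, the `run_tool` helper, and the `crop_first_half` tool are all hypothetical names chosen for illustration, and images are represented as plain bytes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ImageBank:
    """Hypothetical registry mapping reference ids to image payloads."""

    _images: Dict[str, bytes] = field(default_factory=dict)

    def register(self, image: bytes) -> str:
        # Assign an addressable id ("img_0", "img_1", ...) to a tool-returned image.
        ref = f"img_{len(self._images)}"
        self._images[ref] = image
        return ref

    def resolve(self, ref: str) -> bytes:
        # Later tools look up earlier visual evidence by its reference id.
        return self._images[ref]


def run_tool(bank: ImageBank, tool: Callable[[bytes], bytes], input_ref: str) -> str:
    """Run an image-to-image tool on a banked image and bank its output too."""
    output = tool(bank.resolve(input_ref))
    return bank.register(output)


def crop_first_half(image: bytes) -> bytes:
    # Stand-in for a real transformation tool (crop, zoom, rotate, ...).
    return image[: len(image) // 2]


bank = ImageBank()
search_ref = bank.register(b"RAWPIXELS")  # image returned by a search tool
crop_ref = run_tool(bank, crop_first_half, search_ref)
recrop_ref = run_tool(bank, crop_first_half, crop_ref)  # reuses an intermediate result
```

The point of the sketch is the last line: because the first crop was registered rather than discarded, a later tool call can address it directly instead of starting again from the original search result.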

Cached at: 05/13/26, 08:11 AM

Source: https://huggingface.co/papers/2605.10832

Abstract

A visual-native agent harness with an image bank reference protocol enables reusable intermediate visual evidence and closed-loop data generation, improving multimodal deep search performance across multiple benchmarks.

Multimodal deep search requires an agent to solve open-world problems by chaining search, tool use, and visual reasoning over evolving textual and visual context. Two bottlenecks limit current systems. First, existing tool-use harnesses treat images returned by search, browsing, or transformation as transient outputs, so intermediate visual evidence cannot be re-consumed by later tools. Second, training data is usually built by fixed curation recipes that cannot track the target agent’s evolving capability. To address these challenges, we first introduce a visual-native agent harness centered on an image bank reference protocol, which registers every tool-returned image as an addressable reference and makes intermediate visual evidence reusable by later tools. On top of this harness, On-Policy Data Evolution (ODE) runs a closed-loop data generator that refines itself across rounds from rollouts of the policy being trained. This per-round refinement makes each round’s data target what the current policy still needs to learn. The same framework supports both diverse supervised fine-tuning data and policy-aware reinforcement learning data curation, covering the full training lifecycle of the target agent. Across 8 multimodal deep search benchmarks, ODE improves the Qwen3-VL-8B agent from 24.9% to 39.0% on average, surpassing Gemini 2.5 Pro in the standard agent-workflow setting (37.9%). At 30B, ODE raises the average score from 30.6% to 41.5%. Further analyses validate the effectiveness of image-bank reuse, especially on complex tasks requiring iterative visual refinement, while rollout-feedback evolution yields more grounded SFT traces and better policy-matched RL tasks than static synthesis.
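The closed-loop idea can be illustrated with a toy Python sketch. The abstract does not say how the generator refines itself, so everything here is an assumption made for illustration: the policy is reduced to a per-skill success rate, the generator to a data mix weighted by failure rate, and `refine_generator`, `train_on`, and `evolve` are hypothetical names rather than the paper's method.

```python
from typing import Dict, List, Tuple

Policy = Dict[str, float]  # skill name -> success rate of the current policy
Mix = Dict[str, float]     # skill name -> share of the next round's data


def refine_generator(policy: Policy) -> Mix:
    """Weight each task kind by the current policy's failure rate on it."""
    # In a real system the failure rates would come from on-policy rollouts.
    failure = {skill: 1.0 - rate for skill, rate in policy.items()}
    total = sum(failure.values())
    if total == 0.0:
        # Nothing left to learn in this toy model: fall back to a uniform mix.
        return {skill: 1.0 / len(policy) for skill in policy}
    return {skill: f / total for skill, f in failure.items()}


def train_on(policy: Policy, mix: Mix, lr: float = 0.5) -> Policy:
    """Toy update: each skill improves in proportion to its share of the data."""
    return {s: p + lr * mix[s] * (1.0 - p) for s, p in policy.items()}


def evolve(policy: Policy, rounds: int) -> Tuple[Policy, List[Mix]]:
    """Alternate generator refinement and training for a number of rounds."""
    mixes: List[Mix] = []
    for _ in range(rounds):
        mix = refine_generator(policy)  # data targets what the policy still fails at
        policy = train_on(policy, mix)  # stands in for an SFT or RL step
        mixes.append(mix)
    return policy, mixes


final_policy, mixes = evolve({"search": 0.8, "crop": 0.2}, rounds=2)
```

In this toy run the weaker skill receives most of the first round's data, and its share shrinks in the second round as the policy improves. A static recipe would keep the same mix in every round, which is the contrast the abstract draws with fixed curation.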

Get this paper in your agent:

hf papers read 2605.10832

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning

CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution

Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration

OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents

Native Active Perception as Reasoning for Omni-Modal Understanding
