HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
Summary
HyperEyes is a parallel multimodal search agent that uses dual-grained reinforcement learning to optimize inference efficiency, achieving higher accuracy with significantly fewer tool-call rounds compared to existing agents.
Cached at: 05/11/26, 07:20 AM
Source: https://huggingface.co/papers/2605.07177
Abstract
HyperEyes is a parallel multimodal search agent that enables concurrent entity searches while optimizing inference efficiency through dual-grained reinforcement learning and a specialized benchmark for evaluating both accuracy and efficiency.
Existing multimodal search agents process target entities sequentially, issuing one tool call per entity and accumulating redundant interaction rounds whenever a query decomposes into independent sub-retrievals. We argue that effective multimodal agents should search wider rather than longer: dispatching multiple grounded queries concurrently within a round. To this end, we present HyperEyes, a parallel multimodal search agent that fuses visual grounding and retrieval into a single atomic action, enabling concurrent search across multiple entities while treating inference efficiency as a first-class training objective. HyperEyes is trained in two stages. For cold-start supervision, we develop a Parallel-Amenable Data Synthesis Pipeline covering visual multi-entity and textual multi-constraint queries, curating efficiency-oriented trajectories via Progressive Rejection Sampling. Building on this, our central contribution, a Dual-Grained Efficiency-Aware Reinforcement Learning framework, operates at two levels. At the macro level, we propose TRACE (Tool-use Reference-Adaptive Cost Efficiency), a trajectory-level reward whose reference is monotonically tightened during training to suppress superfluous tool calls without restricting genuine multi-hop search. At the micro level, we adapt On-Policy Distillation to inject dense token-level corrective signals from an external teacher on failed rollouts, mitigating the credit-assignment deficiency of sparse outcome rewards. Since existing benchmarks evaluate accuracy as the sole metric, omitting inference cost, we introduce IMEB, a human-curated benchmark of 300 instances that jointly evaluates search capability and efficiency. Across six benchmarks, HyperEyes-30B surpasses the strongest comparable open-source agent by 9.9% in accuracy with 5.3x fewer tool-call rounds on average.
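The "search wider rather than longer" idea can be illustrated with a minimal sketch that is not taken from the paper. It contrasts an agent that issues one tool call per round with one that dispatches all independent queries in a single round. `GroundedQuery`, `search`, `sequential_agent`, and `parallel_agent` are hypothetical names introduced only for this illustration, and the retrieval tool is a stub that sleeps instead of calling a real search backend.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class GroundedQuery:
    """Hypothetical action: an image region plus the entity to retrieve for it."""
    entity: str
    bbox: tuple[int, int, int, int]  # (x1, y1, x2, y2) region the query is grounded in


async def search(query: GroundedQuery) -> str:
    """Stand-in for a real retrieval tool; sleeps briefly to mimic I/O latency."""
    await asyncio.sleep(0.01)
    return f"result for {query.entity}"


async def sequential_agent(queries):
    """One tool call per round: the round count equals the number of entities."""
    results, rounds = [], 0
    for q in queries:
        results.append(await search(q))
        rounds += 1
    return results, rounds


async def parallel_agent(queries):
    """All independent queries are dispatched together, so one round suffices."""
    results = await asyncio.gather(*(search(q) for q in queries))
    return list(results), 1


# Illustrative inputs only; entity names and boxes are made up.
queries = [
    GroundedQuery("entity_a", (0, 0, 50, 50)),
    GroundedQuery("entity_b", (60, 0, 120, 50)),
    GroundedQuery("entity_c", (0, 60, 50, 120)),
]

seq_results, seq_rounds = asyncio.run(sequential_agent(queries))
par_results, par_rounds = asyncio.run(parallel_agent(queries))
```

Under these assumptions both agents return the same results in the same order, but the sequential agent needs one round per entity while the parallel agent needs a single round. The sketch only applies when the sub-retrievals are independent; genuine multi-hop queries, where one result feeds the next query, still require multiple rounds.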
Similar Articles
Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning
Visual-Seeker proposes a visual-native multimodal deep search agent that actively reasons over fine-grained visual details and synthesizes multimodal evidence, achieving state-of-the-art performance on five challenging multimodal search benchmarks.
MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory
MemEye is a visual-centric evaluation framework that assesses multimodal agent memory by measuring visual evidence granularity and retrieval complexity across 8 life-scenario tasks, revealing that current architectures struggle to preserve fine-grained visual details and reason about state changes over time.
SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating
SlimSearcher is a framework that improves efficiency in deep research agents by combining Pareto-efficient trajectory filtering and adaptive reward shaping, reducing tool-call rounds by 17-58% while maintaining accuracy on benchmarks like GAIA, BrowseComp, and XBenchDeepSearch.
Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents
This paper introduces On-Policy Data Evolution (ODE) and a visual-native agent harness to improve multimodal deep search agents. By enabling reusable visual evidence and closed-loop data generation, ODE significantly boosts the performance of Qwen3-VL agents across multiple benchmarks, surpassing Gemini 2.5 Pro.
SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents
This paper introduces SkillLens, a hierarchical framework for adaptive multi-granularity skill reuse in LLM agents, demonstrating improved accuracy and cost-efficiency on benchmark tasks.