HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

Hugging Face Daily Papers Papers

Summary

HyperEyes is a parallel multimodal search agent that uses dual-grained reinforcement learning to optimize inference efficiency, achieving higher accuracy with significantly fewer tool-call rounds compared to existing agents.

Existing multimodal search agents process target entities sequentially, issuing one tool call per entity and accumulating redundant interaction rounds whenever a query decomposes into independent sub-retrievals. We argue that effective multimodal agents should search wider rather than longer: dispatching multiple grounded queries concurrently within a round. To this end, we present HyperEyes, a parallel multimodal search agent that fuses visual grounding and retrieval into a single atomic action, enabling concurrent search across multiple entities while treating inference efficiency as a first-class training objective. HyperEyes is trained in two stages. For cold-start supervision, we develop a Parallel-Amenable Data Synthesis Pipeline covering visual multi-entity and textual multi-constraint queries, curating efficiency-oriented trajectories via Progressive Rejection Sampling. Building on this, our central contribution, a Dual-Grained Efficiency-Aware Reinforcement Learning framework, operates at two levels. At the macro level, we propose TRACE (Tool-use Reference-Adaptive Cost Efficiency), a trajectory-level reward whose reference is monotonically tightened during training to suppress superfluous tool calls without restricting genuine multi-hop search. At the micro level, we adapt On-Policy Distillation to inject dense token-level corrective signals from an external teacher on failed rollouts, mitigating the credit-assignment deficiency of sparse outcome rewards. Since existing benchmarks evaluate accuracy as the sole metric, omitting inference cost, we introduce IMEB, a human-curated benchmark of 300 instances that jointly evaluates search capability and efficiency. Across six benchmarks, HyperEyes-30B surpasses the strongest comparable open-source agent by 9.9% in accuracy with 5.3x fewer tool-call rounds on average.
Original Article
View Cached Full Text

Cached at: 05/11/26, 07:20 AM

Paper page - HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

Source: https://huggingface.co/papers/2605.07177

Abstract

HyperEyes is a parallel multimodal search agent that enables concurrent entity searches while optimizing inference efficiency through dual-grained reinforcement learning and a specialized benchmark for evaluating both accuracy and efficiency.

Existingmultimodal search agentsprocess target entities sequentially, issuing one tool call per entity and accumulating redundant interaction rounds whenever a query decomposes into independent sub-retrievals. We argue that effective multimodal agents should search wider rather than longer: dispatching multiplegrounded queriesconcurrently within a round. To this end, we present HyperEyes, aparallel multimodal searchagent that fusesvisual groundingandretrievalinto a single atomic action, enabling concurrent search across multiple entities while treatinginference efficiencyas a first-class training objective. HyperEyes is trained in two stages. For cold-start supervision, we develop a Parallel-Amenable Data Synthesis Pipeline covering visual multi-entity and textual multi-constraint queries, curating efficiency-oriented trajectories viaProgressive Rejection Sampling. Building on this, our central contribution, aDual-Grained Efficiency-Aware Reinforcement Learningframework, operates at two levels. At the macro level, we proposeTRACE(Tool-use Reference-Adaptive Cost Efficiency), atrajectory-level rewardwhose reference is monotonically tightened during training to suppress superfluous tool calls without restricting genuine multi-hop search. At the micro level, we adaptOn-Policy Distillationto inject dense token-level corrective signals from an external teacher on failed rollouts, mitigating thecredit-assignment deficiencyofsparse outcome rewards. Since existing benchmarks evaluate accuracy as the sole metric, omitting inference cost, we introduce IMEB, a human-curated benchmark of 300 instances that jointly evaluates search capability and efficiency. Across six benchmarks, HyperEyes-30B surpasses the strongest comparable open-source agent by 9.9% in accuracy with 5.3x fewertool-call roundson average.

View arXiv pageView PDFGitHub6Add to collection

Get this paper in your agent:

hf papers read 2605\.07177

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2605.07177 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2605.07177 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2605.07177 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Similar Articles

MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

Hugging Face Daily Papers

MemEye is a visual-centric evaluation framework that assesses multimodal agent memory by measuring visual evidence granularity and retrieval complexity across 8 life-scenario tasks, revealing that current architectures struggle to preserve fine-grained visual details and reason about state changes over time.

Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents

Hugging Face Daily Papers

This paper introduces On-Policy Data Evolution (ODE) and a visual-native agent harness to improve multimodal deep search agents. By enabling reusable visual evidence and closed-loop data generation, ODE significantly boosts the performance of Qwen3-VL agents across multiple benchmarks, surpassing Gemini 2.5 Pro.