HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

Hugging Face Daily Papers 05/08/26, 12:00 AM Papers

Summary

HyperEyes is a parallel multimodal search agent that uses dual-grained reinforcement learning to optimize inference efficiency, achieving higher accuracy with significantly fewer tool-call rounds compared to existing agents.

Existing multimodal search agents process target entities sequentially, issuing one tool call per entity and accumulating redundant interaction rounds whenever a query decomposes into independent sub-retrievals. We argue that effective multimodal agents should search wider rather than longer: dispatching multiple grounded queries concurrently within a round. To this end, we present HyperEyes, a parallel multimodal search agent that fuses visual grounding and retrieval into a single atomic action, enabling concurrent search across multiple entities while treating inference efficiency as a first-class training objective. HyperEyes is trained in two stages. For cold-start supervision, we develop a Parallel-Amenable Data Synthesis Pipeline covering visual multi-entity and textual multi-constraint queries, curating efficiency-oriented trajectories via Progressive Rejection Sampling. Building on this, our central contribution, a Dual-Grained Efficiency-Aware Reinforcement Learning framework, operates at two levels. At the macro level, we propose TRACE (Tool-use Reference-Adaptive Cost Efficiency), a trajectory-level reward whose reference is monotonically tightened during training to suppress superfluous tool calls without restricting genuine multi-hop search. At the micro level, we adapt On-Policy Distillation to inject dense token-level corrective signals from an external teacher on failed rollouts, mitigating the credit-assignment deficiency of sparse outcome rewards. Since existing benchmarks evaluate accuracy as the sole metric, omitting inference cost, we introduce IMEB, a human-curated benchmark of 300 instances that jointly evaluates search capability and efficiency. Across six benchmarks, HyperEyes-30B surpasses the strongest comparable open-source agent by 9.9% in accuracy with 5.3x fewer tool-call rounds on average.
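The "wider rather than longer" idea can be illustrated with a minimal sketch that counts interaction rounds for a query that decomposes into independent sub-retrievals. Everything here is an assumption made for illustration: `grounded_search` is a hypothetical stand-in for a fused grounding-and-retrieval tool, and neither its name nor its signature comes from the paper.

```python
import asyncio


async def grounded_search(entity: str) -> str:
    """Hypothetical tool: ground one entity and retrieve information about it."""
    await asyncio.sleep(0)  # placeholder for real tool latency
    return f"result for {entity}"


async def sequential_agent(entities):
    """One tool call per round, so rounds grow with the number of entities."""
    results, rounds = [], 0
    for entity in entities:
        results.append(await grounded_search(entity))
        rounds += 1
    return results, rounds


async def parallel_agent(entities):
    """All independent queries are dispatched concurrently within one round."""
    results = await asyncio.gather(*(grounded_search(e) for e in entities))
    return list(results), (1 if entities else 0)


entities = ["left building", "statue", "flag"]
seq_results, seq_rounds = asyncio.run(sequential_agent(entities))
par_results, par_rounds = asyncio.run(parallel_agent(entities))
print(seq_rounds, par_rounds)
```

Both variants return the same results in the same order, since `asyncio.gather` preserves the order of its inputs; only the number of rounds differs. Queries with real dependencies between hops would still need more than one round.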

Cached at: 05/11/26, 07:20 AM

Source: https://huggingface.co/papers/2605.07177

Abstract

HyperEyes is a parallel multimodal search agent that enables concurrent entity searches while optimizing inference efficiency through dual-grained reinforcement learning and a specialized benchmark for evaluating both accuracy and efficiency.

Existing multimodal search agents process target entities sequentially, issuing one tool call per entity and accumulating redundant interaction rounds whenever a query decomposes into independent sub-retrievals. We argue that effective multimodal agents should search wider rather than longer: dispatching multiple grounded queries concurrently within a round. To this end, we present HyperEyes, a parallel multimodal search agent that fuses visual grounding and retrieval into a single atomic action, enabling concurrent search across multiple entities while treating inference efficiency as a first-class training objective. HyperEyes is trained in two stages. For cold-start supervision, we develop a Parallel-Amenable Data Synthesis Pipeline covering visual multi-entity and textual multi-constraint queries, curating efficiency-oriented trajectories via Progressive Rejection Sampling. Building on this, our central contribution, a Dual-Grained Efficiency-Aware Reinforcement Learning framework, operates at two levels. At the macro level, we propose TRACE (Tool-use Reference-Adaptive Cost Efficiency), a trajectory-level reward whose reference is monotonically tightened during training to suppress superfluous tool calls without restricting genuine multi-hop search. At the micro level, we adapt On-Policy Distillation to inject dense token-level corrective signals from an external teacher on failed rollouts, mitigating the credit-assignment deficiency of sparse outcome rewards. Since existing benchmarks evaluate accuracy as the sole metric, omitting inference cost, we introduce IMEB, a human-curated benchmark of 300 instances that jointly evaluates search capability and efficiency. Across six benchmarks, HyperEyes-30B surpasses the strongest comparable open-source agent by 9.9% in accuracy with 5.3x fewer tool-call rounds on average.
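The abstract does not give the TRACE formula, so the following is only a rough sketch of what a trajectory-level reward with a monotonically tightened reference could look like. The update rule (running minimum over correct rollouts), the linear penalty, and every name below are assumptions for illustration, not the paper's definition.

```python
from dataclasses import dataclass, field


@dataclass
class ReferenceTracker:
    """Per-query reference round count that can only decrease (assumed rule)."""
    initial: int = 8
    refs: dict = field(default_factory=dict)

    def get(self, query_id: str) -> int:
        return self.refs.get(query_id, self.initial)

    def update(self, query_id: str, rounds: int, correct: bool) -> None:
        # Only correct rollouts may tighten the reference, so a query that
        # needs several hops is never pushed below what has actually worked.
        if correct:
            self.refs[query_id] = min(self.get(query_id), rounds)


def efficiency_reward(correct: bool, rounds: int, reference: int,
                      penalty: float = 0.1) -> float:
    """Illustrative reward: 1 if correct, minus a penalty per round beyond
    the reference, floored at 0. Incorrect trajectories receive 0."""
    if not correct:
        return 0.0
    excess = max(0, rounds - reference)
    return max(0.0, 1.0 - penalty * excess)


tracker = ReferenceTracker(initial=4)
early = efficiency_reward(True, 6, tracker.get("q1"))  # 2 rounds over reference
tracker.update("q1", 2, True)    # a correct 2-round rollout tightens the reference
tracker.update("q1", 5, False)   # failed rollouts leave it unchanged
late = efficiency_reward(True, 6, tracker.get("q1"))   # 4 rounds over reference
print(early, late)
```

In this sketch the same 6-round trajectory earns less once a shorter correct trajectory has been observed, while a trajectory at or below the reference keeps the full reward. That is one way to discourage superfluous calls without penalizing the rounds a query has been shown to need.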

Similar Articles

Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning

MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating

Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents

SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents
