From Web to Pixels: Bringing Agentic Search into Visual Perception
Summary
This paper introduces WebEye, a benchmark for object localization that requires resolving external knowledge, and Pixel-Searcher, an agentic workflow that links web search results to pixel-level visual annotations.
Cached at: 05/13/26, 04:11 AM
Source: https://huggingface.co/papers/2605.12497
Abstract
Researchers introduce WebEye, a benchmark for object localization requiring external knowledge resolution, and Pixel-Searcher, an agent-based approach that connects hidden target identities to visual annotations through search and reasoning.
Visual perception connects high-level semantic understanding to pixel-level perception, but most existing settings assume that the decisive evidence for identifying a target is already present in the image or in the model's frozen knowledge. We study a more practical yet harder open-world case where a visible object must first be resolved from external facts, recent events, long-tail entities, or multi-hop relations before it can be localized. We formalize this challenge as Perception Deep Research and introduce WebEye, an object-anchored benchmark with verifiable evidence, knowledge-intensive queries, precise box/mask annotations, and three task views: Search-based Grounding, Search-based Segmentation, and Search-based VQA. WebEye contains 120 images, 473 annotated object instances, 645 unique QA pairs, and 1,927 task samples. We further propose Pixel-Searcher, an agentic search-to-pixel workflow that resolves hidden target identities and binds them to boxes, masks, or grounded answers. Experiments show that Pixel-Searcher achieves the strongest open-source performance across all three task views, while failures mainly arise from evidence acquisition, identity resolution, and visual instance binding.
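The search-to-pixel loop described in the abstract can be pictured in a few lines of Python. This is a minimal sketch, not the authors' released code: `web_search`, `resolve_identity`, and `ground` are hypothetical stand-ins for the agent's tools, and the numbered comments mark the three failure modes the abstract names.

```python
# Minimal sketch of a search-to-pixel agent loop in the spirit of
# Pixel-Searcher. All three tool functions are hypothetical stand-ins,
# not the paper's actual API.
from dataclasses import dataclass


@dataclass
class Grounding:
    label: str       # resolved target identity, e.g. an entity name
    box: tuple       # (x1, y1, x2, y2) in pixel coordinates
    evidence: list   # URLs or snippets supporting the identity


def web_search(query: str) -> list[str]:
    """Hypothetical tool: return text snippets for a knowledge query."""
    raise NotImplementedError


def resolve_identity(question: str, snippets: list[str]) -> str:
    """Hypothetical tool: distill snippets into a concrete target name."""
    raise NotImplementedError


def ground(image, target: str) -> tuple:
    """Hypothetical tool: run a grounding model, return one bounding box."""
    raise NotImplementedError


def search_to_pixel(image, question: str, max_rounds: int = 3) -> Grounding:
    """Resolve a knowledge-dependent target, then bind it to pixels."""
    snippets: list[str] = []
    for _ in range(max_rounds):
        snippets += web_search(question)               # 1. evidence acquisition
        target = resolve_identity(question, snippets)  # 2. identity resolution
        if target:
            box = ground(image, target)                # 3. visual instance binding
            return Grounding(target, box, snippets)
    raise RuntimeError("could not resolve the target identity")
```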
Get this paper in your agent:
hf papers read 2605.12497
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
HyperEyes is a parallel multimodal search agent that uses dual-grained reinforcement learning to optimize inference efficiency, achieving higher accuracy with significantly fewer tool-call rounds compared to existing agents.
InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search
InterLV-Search is a new benchmark introduced in this paper to evaluate interleaved language-vision agentic search, highlighting limitations in current systems regarding visual evidence seeking and multimodal integration.
Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?
This paper introduces Pi-Serini, a BM25-based agentic search system demonstrating that lexical retrieval can suffice for deep search when agents refine their queries, achieving high accuracy at lower cost than default settings; a toy BM25 sketch appears after this list.
I gave AI agents eyes on my PC
The author introduces Pupil, an open-source tool that enables AI agents to visually inspect PC UIs and identify click targets without relying on screenshots.
Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction
The paper introduces Direct Corpus Interaction (DCI), a novel approach allowing AI agents to query raw text directly using standard terminal tools instead of traditional embedding-based retrieval. By bypassing fixed similarity interfaces and offline indexing, DCI significantly outperforms conventional sparse, dense, and reranking baselines across multiple IR and agentic search benchmarks; a grep-based sketch of the idea follows below.
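As promised above, a toy illustration of the lexical-retrieval-plus-query-refinement idea behind Pi-Serini. This is not the paper's system: it assumes the third-party rank_bm25 package, and the three-document corpus and hard-coded query reformulations are illustrative stand-ins for an agent's retrieval rounds.

```python
# Toy BM25 retrieval with agent-style query refinement, loosely in the
# spirit of Pi-Serini. Requires `pip install rank_bm25`; the corpus and
# the two hard-coded reformulations are illustrative assumptions.
from rank_bm25 import BM25Okapi

corpus = [
    "Pixel-Searcher binds web search results to bounding boxes.",
    "BM25 scores documents by term frequency and inverse document frequency.",
    "Dense retrievers embed queries and documents in a shared vector space.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

# A real agent would inspect results and rewrite the query each round;
# here two reformulations of the same information need are hard-coded.
for query in ["how does bm25 rank documents", "term frequency ranking"]:
    best = bm25.get_top_n(query.lower().split(), corpus, n=1)[0]
    print(f"{query!r} -> {best}")
```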
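Similarly, the direct-corpus-interaction idea can be caricatured in a few lines: instead of querying an embedding index, the agent shells out to grep over raw text files. The corpus/ path and the fixed grep flags are assumptions for illustration, not DCI's actual tooling.

```python
# Sketch of direct corpus interaction: an agent queries raw text with a
# standard terminal tool (grep) instead of an embedding-based retriever.
# The corpus/ path is a placeholder; requires a POSIX grep on PATH.
import subprocess


def grep_corpus(pattern: str, corpus_dir: str = "corpus/") -> list[str]:
    """Return matching lines ("file:lineno:text") for a regex pattern."""
    result = subprocess.run(
        ["grep", "-rin", pattern, corpus_dir],  # recursive, ignore case, line numbers
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()


# An agent can iterate: inspect matches, tighten the pattern, grep again.
for hit in grep_corpus(r"agentic search")[:5]:
    print(hit)
```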