From Web to Pixels: Bringing Agentic Search into Visual Perception

Hugging Face Daily Papers

Summary

This paper introduces WebEye, a benchmark for object localization requiring external knowledge resolution, and Pixel-Searcher, an agentic approach that connects search results to visual annotations.

Visual perception connects high-level semantic understanding to pixel-level perception, but most existing settings assume that the decisive evidence for identifying a target is already in the image or frozen model knowledge. We study a more practical yet harder open-world case where a visible object must first be resolved from external facts, recent events, long-tail entities, or multi-hop relations before it can be localized. We formalize this challenge as Perception Deep Research and introduce WebEye, an object-anchored benchmark with verifiable evidence, knowledge-intensive queries, precise box/mask annotations, and three task views: Search-based Grounding, Search-based Segmentation, and Search-based VQA. WebEye contains 120 images, 473 annotated object instances, 645 unique QA pairs, and 1,927 task samples. We further propose Pixel-Searcher, an agentic search-to-pixel workflow that resolves hidden target identities and binds them to boxes, masks, or grounded answers. Experiments show that Pixel-Searcher achieves the strongest open-source performance across all three task views, while failures mainly arise from evidence acquisition, identity resolution, and visual instance binding.
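To make the search-to-pixel idea concrete, here is a minimal Python sketch of such a loop. Every function below (web_search, resolve_identity, ground) is a hypothetical stub standing in for the agent's tools, not Pixel-Searcher's released code; the three steps are named after the failure modes the authors report.

    # Minimal sketch of a search-to-pixel loop; all tool functions are
    # hypothetical stubs, not the paper's actual implementation.

    def web_search(query: str) -> str:
        """Evidence acquisition: fetch a text snippet for the query (stub)."""
        return f"(snippet for: {query})"

    def resolve_identity(question: str, evidence: list[str]) -> str:
        """Identity resolution: collapse the knowledge-intensive question and
        retrieved evidence into an explicit referring expression.
        A real agent would call an LLM here; this stub echoes the evidence."""
        return evidence[-1]

    def ground(image_path: str, phrase: str) -> tuple[int, int, int, int]:
        """Visual instance binding: map the resolved phrase to a box with an
        open-vocabulary grounding model (stub returns a dummy box)."""
        return (0, 0, 100, 100)

    def search_to_pixel(image_path: str, question: str, max_steps: int = 3):
        evidence: list[str] = []
        for _ in range(max_steps):
            evidence.append(web_search(question))
            phrase = resolve_identity(question, evidence)
            if phrase:
                return ground(image_path, phrase)
        return None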

Source: https://huggingface.co/papers/2605.12497

Get this paper in your agent:

hf papers read 2605.12497

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

I gave AI agents eyes on my PC

Reddit r/AI_Agents

The author introduces Pupil, an open-source tool that enables AI agents to visually inspect PC UIs and identify click targets without relying on screenshots.

Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction

Hugging Face Daily Papers

The paper introduces Direct Corpus Interaction (DCI), a novel approach allowing AI agents to query raw text directly using standard terminal tools instead of traditional embedding-based retrieval. By bypassing fixed similarity interfaces and offline indexing, DCI significantly outperforms conventional sparse, dense, and reranking baselines across multiple IR and agentic search benchmarks.
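As a rough illustration of that idea (my own sketch, assuming a directory of plain-text files; the grep call is a stand-in, not DCI's actual tooling):

    import subprocess

    def terminal_retrieve(corpus_dir: str, term: str, max_hits: int = 5) -> list[str]:
        # Query the raw corpus with a standard terminal tool (grep) instead of
        # a fixed embedding-similarity interface; no offline index is needed.
        result = subprocess.run(
            ["grep", "-r", "-i", "-l", term, corpus_dir],
            capture_output=True, text=True,
        )
        return result.stdout.splitlines()[:max_hits]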