From Web to Pixels: Bringing Agentic Search into Visual Perception

Hugging Face Daily Papers

Summary

This paper introduces WebEye, a benchmark for object localization requiring external knowledge resolution, and Pixel-Searcher, an agentic approach that connects search results to visual annotations.

Visual perception connects high-level semantic understanding to pixel-level perception, but most existing settings assume that the decisive evidence for identifying a target is already in the image or frozen model knowledge. We study a more practical yet harder open-world case where a visible object must first be resolved from external facts, recent events, long-tail entities, or multi-hop relations before it can be localized. We formalize this challenge as Perception Deep Research and introduce WebEye, an object-anchored benchmark with verifiable evidence, knowledge-intensive queries, precise box/mask annotations, and three task views: Search-based Grounding, Search-based Segmentation, and Search-based VQA. WebEye contains 120 images, 473 annotated object instances, 645 unique QA pairs, and 1,927 task samples. We further propose Pixel-Searcher, an agentic search-to-pixel workflow that resolves hidden target identities and binds them to boxes, masks, or grounded answers. Experiments show that Pixel-Searcher achieves the strongest open-source performance across all three task views, while failures mainly arise from evidence acquisition, identity resolution, and visual instance binding.
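To make the search-to-pixel idea concrete, here is a minimal Python sketch of such a loop. Every function below (web_search, resolve_identity, ground) is a hypothetical stub standing in for the agent's tools, not Pixel-Searcher's released code; the three steps are named after the failure modes the authors report.

    # Minimal sketch of a search-to-pixel loop; all tool functions are
    # hypothetical stubs, not the paper's actual implementation.

    def web_search(query: str) -> str:
        """Evidence acquisition: fetch a text snippet for the query (stub)."""
        return f"(snippet for: {query})"

    def resolve_identity(question: str, evidence: list[str]) -> str:
        """Identity resolution: collapse the knowledge-intensive question and
        retrieved evidence into an explicit referring expression.
        A real agent would call an LLM here; this stub echoes the evidence."""
        return evidence[-1]

    def ground(image_path: str, phrase: str) -> tuple[int, int, int, int]:
        """Visual instance binding: map the resolved phrase to a box with an
        open-vocabulary grounding model (stub returns a dummy box)."""
        return (0, 0, 100, 100)

    def search_to_pixel(image_path: str, question: str, max_steps: int = 3):
        evidence: list[str] = []
        for _ in range(max_steps):
            evidence.append(web_search(question))
            phrase = resolve_identity(question, evidence)
            if phrase:
                return ground(image_path, phrase)
        return None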

Source: https://huggingface.co/papers/2605.12497

Get this paper in your agent:

hf papers read 2605.12497

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

I gave AI agents eyes on my PC

Reddit r/AI_Agents

The author introduces Pupil, an open-source tool that enables AI agents to visually inspect PC UIs and identify click targets without relying on screenshots.

Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction

Hugging Face Daily Papers

The paper introduces Direct Corpus Interaction (DCI), a novel approach allowing AI agents to query raw text directly using standard terminal tools instead of traditional embedding-based retrieval. By bypassing fixed similarity interfaces and offline indexing, DCI significantly outperforms conventional sparse, dense, and reranking baselines across multiple IR and agentic search benchmarks.
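As a rough illustration of that idea (my own sketch, assuming a directory of plain-text files; the grep call is a stand-in, not DCI's actual tooling):

    import subprocess

    def terminal_retrieve(corpus_dir: str, term: str, max_hits: int = 5) -> list[str]:
        # Query the raw corpus with a standard terminal tool (grep) instead of
        # a fixed embedding-similarity interface; no offline index is needed.
        result = subprocess.run(
            ["grep", "-r", "-i", "-l", term, corpus_dir],
            capture_output=True, text=True,
        )
        return result.stdout.splitlines()[:max_hits]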