WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization
Summary
WebShaper is a formalization-driven framework for synthesizing information-seeking datasets using set theory and Knowledge Projections, achieving state-of-the-art performance on GAIA and WebWalkerQA benchmarks among open-source agents.
Source: https://huggingface.co/papers/2507.15061
Published on Jul 20, 2025
Abstract
WebShaper, a formalization-driven framework, synthesizes information-seeking datasets using set theory and Knowledge Projections to control reasoning structure and achieve top performance among open-source agents on information-seeking benchmarks.
The advent of Large Language Model (LLM)-powered agents has revolutionized artificial intelligence by enabling solutions to complex, open-ended tasks through web-based information-seeking (IS) capabilities. However, the scarcity of high-quality training data has limited the development of IS agents. Existing approaches typically adopt an information-driven paradigm that first collects web data and then generates questions based on the retrieved content. This can lead to inconsistency between the information structure and the reasoning structure, and between the question and the answer. To mitigate this, we propose WebShaper, a formalization-driven IS data synthesis framework, to construct a dataset. WebShaper systematically formalizes IS tasks through set theory. Central to the formalization is the concept of Knowledge Projections (KP), which enables precise control over reasoning structure through compositions of KP operations. During synthesis, we begin by creating seed tasks and then apply a multi-step expansion process. At each step, an agentic Expander makes the current formal question more complex, using retrieval and validation tools based on our formalization. We train our model on the synthesized dataset. Experimental results demonstrate that WebShaper achieves state-of-the-art performance among open-source IS agents on the GAIA and WebWalkerQA benchmarks.
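The abstract does not spell out the formalization, so the following is only a minimal sketch of the idea, not the authors' implementation. It assumes a Knowledge Projection can be modeled as an operation that takes a relation and a set of target entities and returns every entity related to at least one target, and that one expansion step replaces a constant in the question with a sub-question resolving to the same entity. The function `project`, the relations `played_for`, `born_in`, and `founded_in`, and all facts are hypothetical and chosen for illustration.

```python
from typing import Set, Tuple

# A relation is modeled as a set of (subject, object) pairs.
Relation = Set[Tuple[str, str]]


def project(relation: Relation, targets: Set[str]) -> Set[str]:
    """Return every subject linked by `relation` to at least one target."""
    return {subj for (subj, obj) in relation if obj in targets}


# Toy, made-up facts (hypothetical; not taken from the WebShaper dataset).
played_for: Relation = {("alice", "club_a"), ("bob", "club_a"), ("carol", "club_b")}
born_in: Relation = {("alice", "1990"), ("bob", "1985"), ("carol", "1990")}
founded_in: Relation = {("club_a", "1900"), ("club_b", "1950")}

# Seed question: "Who played for club_a and was born in 1990?"
# Modeled as the intersection of two projections.
seed_answer = project(played_for, {"club_a"}) & project(born_in, {"1990"})

# One expansion step: replace the constant "club_a" with a sub-question
# ("the club founded in 1900") that resolves to the same entity.
hidden_club = project(founded_in, {"1900"})
expanded_answer = project(played_for, hidden_club) & project(born_in, {"1990"})

# Validation: the harder question must keep the same answer.
assert expanded_answer == seed_answer
print(sorted(seed_answer))
```

In this toy setup the expanded question needs one more lookup than the seed question, yet its answer is unchanged. This mirrors, in a simplified form, how composing projections can make a question harder while keeping its reasoning structure explicit.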
Models citing this paper: 1
Alibaba-NLP/WebShaper-32B (33B), updated Aug 28, 2025
Datasets citing this paper: 2
Alibaba-NLP/WebShaper, updated Jul 22, 2025
JingmingChen/PathRefiner, updated 25 days ago
Similar Articles
SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating
SlimSearcher is a framework that improves efficiency in deep research agents by combining Pareto-efficient trajectory filtering and adaptive reward shaping, reducing tool-call rounds by 17-58% while maintaining accuracy on benchmarks like GAIA, BrowseComp, and XBenchDeepSearch.
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
Apple Research introduces Weblica, a framework for creating scalable and reproducible training environments for visual web agents using HTTP caching and LLM-based synthesis.
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
WebWatcher is a multimodal agent for deep research that uses synthetic trajectories and reinforcement learning to achieve superior performance in complex visual and textual information retrieval tasks. The paper also introduces BrowseComp-VL, a new benchmark for evaluating multimodal agents.
MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
MM-WebAgent is a hierarchical agentic framework that generates coherent and visually consistent webpages by coordinating AIGC-based element generation through joint optimization of layout and multimodal content. The paper introduces a benchmark and multi-level evaluation protocol, demonstrating improvements over code-generation and agent-based baselines.
Struct-Searcher: Agentic Structural Thinking Advances Multimodal Deep Information Seeking
Struct-Searcher introduces a belief revision theory-based structural agentic workflow for multimodal deep information seeking, achieving significant accuracy improvements over existing vision-language models and deep research agents.