WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization

Papers with Code Trending 07/20/25, 05:53 PM Papers

data-synthesis information-seeking web-agents llm-agents knowledge-projections benchmark

Summary

WebShaper is a formalization-driven framework for synthesizing information-seeking datasets using set theory and Knowledge Projections, achieving state-of-the-art performance on GAIA and WebWalkerQA benchmarks among open-source agents.

The advent of Large Language Model (LLM)-powered agents has revolutionized artificial intelligence by enabling solutions to complex, open-ended tasks through web-based information-seeking (IS) capabilities. However, the scarcity of high-quality training data has limited the development of IS agents. Existing approaches typically adopt an information-driven paradigm that first collects web data and then generates questions based on the retrieved content. This can lead to inconsistencies between the information structure and the reasoning structure, and between the question and the answer. To mitigate this, we propose WebShaper, a formalization-driven IS data synthesis framework, to construct a dataset. WebShaper systematically formalizes IS tasks through set theory. Central to the formalization is the concept of Knowledge Projections (KP), which enables precise control over reasoning structure through compositions of KP operations. During synthesis, we begin by creating seed tasks and then apply a multi-step expansion process. At each step, an agentic Expander makes the current formal question more complex, using retrieval and validation tools based on our formalization. We train our model on the synthesized dataset. Experimental results demonstrate that WebShaper achieves state-of-the-art performance among open-sourced IS agents on the GAIA and WebWalkerQA benchmarks.
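To make the set-theoretic idea more concrete, here is a minimal Python sketch of how an information-seeking question could be written as a composition of projections over sets. The abstract does not define the KP operations, so everything here is an assumption made for illustration: the `project` function, the use of set intersection to combine constraints, the toy `KB`, and all of the facts in it are hypothetical and are not taken from the paper or its code.

```python
from typing import Dict, Set, Tuple

# Toy knowledge base: relation name -> set of (subject, object) pairs.
# Every fact here is a made-up placeholder, not data from the paper.
Relation = Set[Tuple[str, str]]

KB: Dict[str, Relation] = {
    "played_for": {("alice", "club_x"), ("bob", "club_x"), ("carol", "club_y")},
    "born_in": {("alice", "1990"), ("bob", "1992"), ("carol", "1990")},
    "founded_in": {("club_x", "1901"), ("club_y", "1950")},
}


def project(relation: str, values: Set[str]) -> Set[str]:
    """Assumed form of a knowledge projection: every subject that is
    linked by `relation` to at least one element of `values`."""
    return {subj for (subj, obj) in KB[relation] if obj in values}


def answer_seed() -> Set[str]:
    # Seed question: "Who played for club_x and was born in 1990?"
    # Two projections are combined with set intersection.
    return project("played_for", {"club_x"}) & project("born_in", {"1990"})


def answer_expanded() -> Set[str]:
    # One hypothetical expansion step: the constant "club_x" is replaced by
    # a sub-question ("the club founded in 1901"), which adds one more hop
    # while leaving the answer set unchanged.
    clubs = project("founded_in", {"1901"})
    return project("played_for", clubs) & project("born_in", {"1990"})


if __name__ == "__main__":
    print(answer_seed())
    print(answer_expanded())
```

In this sketch the expanded question has a deeper reasoning structure than the seed but resolves to the same answer set, which is one way to read the claim that composing operations gives control over the reasoning structure while keeping question and answer consistent.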

Original Article


Cached at: 06/01/26, 01:01 PM

Paper page - WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization

Source: https://huggingface.co/papers/2507.15061 Published on Jul 20, 2025

Abstract

WebShaper, a formalization-driven framework, synthesizes information-seeking datasets using set theory and Knowledge Projections to control reasoning structure, and achieves state-of-the-art performance among open-sourced IS agents on the GAIA and WebWalkerQA benchmarks.



Get this paper in your agent:

hf papers read 2507.15061

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

#### Alibaba-NLP/WebShaper-32B • 33B • Updated Aug 28, 2025 • 68 • 1

Datasets citing this paper: 2

#### Alibaba-NLP/WebShaper • Updated Jul 22, 2025 • 500 • 424 • 26

#### JingmingChen/PathRefiner • Updated 25 days ago • 2.94k • 69

Spaces citing this paper: 0


Collections including this paper: 10


Similar Articles

SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating

Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

WebSwarm: Recursive Multi-Agent Orchestration for Deep-and-Wide Web Search

WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

@hwchase17: https://x.com/hwchase17/status/2071963622298050997
