SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating
Summary
SlimSearcher is a framework that improves efficiency in deep research agents by combining Pareto-efficient trajectory filtering and adaptive reward shaping, reducing tool-call rounds by 17-58% while maintaining accuracy on benchmarks like GAIA, BrowseComp, and XBenchDeepSearch.
Source: https://huggingface.co/papers/2606.07074 Published on Jun 5
Submitted by dan (https://huggingface.co/prayerdan) on Jun 9
Abstract
SlimSearcher is a framework that improves efficiency in deep research agents by combining Pareto-efficient trajectory filtering and adaptive reward shaping to reduce computational costs while maintaining accuracy.
Deep research agents have demonstrated remarkable capabilities in complex information-seeking tasks, yet this power comes at a steep computational cost. Driven by accuracy-focused training paradigms, current models adopt brute-force strategies characterized by blind tool dependency and performative reasoning, generating long, redundant trajectories that are far from necessary for resolving these tasks, leading to wasteful tool calls and excessive token consumption. To overcome this efficiency trap, we propose SlimSearcher, a principled framework that pushes the Pareto frontier between accuracy and computational cost across both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). In the SFT stage, SlimSearcher employs Pareto-efficient filtration to distill trajectories that are both successful and economical, guiding the model toward inherently efficiency-aware search behaviors. During RL, we introduce Adaptive Reward Gating, a dynamic reward-shaping mechanism that evaluates relative tool and token efficiency within a sampled cohort. By cascading these adaptive efficiency metrics with a strict correctness gate, our approach effectively avoids the brevity bias associated with absolute penalties and mitigates reward hacking. Extensive experiments on long-horizon benchmarks, including GAIA, BrowseComp, and XBenchDeepSearch, demonstrate that SlimSearcher reduces average tool-call rounds by 17%-58% while maintaining or improving accuracy.
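The abstract does not give the filtering rule or the reward formula, so the following Python sketch is only one plausible reading of the two ideas, not the authors' implementation. It assumes each trajectory is summarized by a correctness flag, a tool-call count, and a token count. The names `Trajectory`, `pareto_filter`, `gated_rewards`, and the `bonus` weight are hypothetical. The filter keeps correct trajectories that no other correct trajectory beats on both costs. The reward gives zero to wrong answers and adjusts the reward of correct answers by their savings relative to the cohort mean, rather than by an absolute length penalty.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Trajectory:
    correct: bool    # did the final answer pass the correctness check?
    tool_calls: int  # number of tool-call rounds used
    tokens: int      # total tokens generated


def pareto_filter(trajs):
    """Keep correct trajectories not dominated on (tool_calls, tokens)."""
    ok = [t for t in trajs if t.correct]

    def dominated(t):
        # Another correct trajectory is no worse on both costs
        # and strictly better on at least one.
        return any(
            o.tool_calls <= t.tool_calls and o.tokens <= t.tokens
            and (o.tool_calls < t.tool_calls or o.tokens < t.tokens)
            for o in ok
        )

    return [t for t in ok if not dominated(t)]


def _relative_saving(value, baseline):
    """Fractional saving versus the cohort baseline, clipped to [-1, 1]."""
    if baseline <= 0:
        return 0.0
    return max(-1.0, min(1.0, (baseline - value) / baseline))


def gated_rewards(cohort, bonus=0.2):
    """Correctness-gated reward with a cohort-relative efficiency term.

    Assumed form (not from the paper): wrong answers get 0.0; correct
    answers get 1.0 plus or minus at most `bonus`, so with bonus < 1
    a correct trajectory always outranks an incorrect one.
    """
    ok = [t for t in cohort if t.correct]
    if not ok:
        return [0.0] * len(cohort)
    avg_calls = mean(t.tool_calls for t in ok)
    avg_tokens = mean(t.tokens for t in ok)

    rewards = []
    for t in cohort:
        if not t.correct:
            rewards.append(0.0)  # gate: no efficiency credit when wrong
            continue
        calls_gain = _relative_saving(t.tool_calls, avg_calls)
        token_gain = _relative_saving(t.tokens, avg_tokens)
        rewards.append(1.0 + bonus * 0.5 * (calls_gain + token_gain))
    return rewards


if __name__ == "__main__":
    cohort = [
        Trajectory(True, 4, 1000),
        Trajectory(True, 8, 3000),
        Trajectory(False, 1, 100),
        Trajectory(True, 3, 2000),
    ]
    print(pareto_filter(cohort))
    print(gated_rewards(cohort))
```

In this sketch the short but wrong trajectory earns nothing, which is how a correctness gate can block the reward-hacking route of answering quickly and incorrectly. Measuring savings against the cohort mean means a hard question whose correct rollouts all need many tool calls is not penalized for length alone.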
Similar Articles
WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization
WebShaper is a formalization-driven framework for synthesizing information-seeking datasets using set theory and Knowledge Projections, achieving state-of-the-art performance on GAIA and WebWalkerQA benchmarks among open-source agents.
SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search
SAAS introduces a reinforcement learning framework that enhances agent self-awareness to reduce unnecessary searches in LLM-based question answering systems, balancing accuracy and computational cost.
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
HyperEyes is a parallel multimodal search agent that uses dual-grained reinforcement learning to optimize inference efficiency, achieving higher accuracy with significantly fewer tool-call rounds compared to existing agents.
SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research
This paper introduces SearchSwarm, a model trained on synthesized delegation intelligence to improve long-horizon deep research tasks via task decomposition and subagent coordination, achieving state-of-the-art results on BrowseComp benchmarks.
ARBOR: Online Process Rewards via a Reusable Rubric Buffer for Search Agents
ARBOR introduces a reusable rubric buffer to provide online process rewards for LLM-based search agents, improving training efficiency when outcome-only rewards are insufficient. It outperforms GRPO and DAPO on multi-hop QA benchmarks, converting up to 42% of zero-gradient training groups into informative ones.