Tag
DailyReport is an open-ended benchmark for evaluating search agents on daily search tasks, featuring 150 tasks and 3,546 rubrics for interpretable, user-centric evaluation.
This paper introduces EvoBrowseComp, a dynamic benchmark of 400 English and 400 Chinese complex questions synthesized via live-web traversal. It evaluates search agents without test-set contamination and is designed to be robust against parametric memorization.
LoHoSearch is a new benchmark for evaluating long-horizon search agents, built from a knowledge graph of 7 million Wikipedia entities. It introduces questions with large search spaces and structural complexity to exceed human-authored difficulty ceilings, and shows that the best model achieves only 34.74% accuracy.
EvoBrowseComp is an evolving benchmark with 800 contamination-free questions for evaluating search agents, designed to prevent parametric memorization and maintain temporal freshness through a three-agent framework.
FORT-Searcher introduces a framework for synthesizing shortcut-resistant training data for deep search agents by identifying and mitigating four shortcut risks. The resulting agent, trained via supervised fine-tuning, achieves state-of-the-art performance among comparable open-source search agents.
Harness-1 is a 20B search agent trained with reinforcement learning using a stateful search harness, achieving strong results on retrieval benchmarks and outperforming other open search subagents.
ARBOR introduces a reusable rubric buffer to provide online process rewards for LLM-based search agents, improving training efficiency when outcome-only rewards are insufficient. It outperforms GRPO and DAPO on multi-hop QA benchmarks, converting up to 42% of zero-gradient training groups into informative ones.
Harness-1 introduces a state-externalizing harness that separates routine bookkeeping from policy decisions in search agents, enabling a 20B model to outperform larger frontier searchers across multiple benchmarks.
COMPASS is a cognitive MCTS-guided process alignment framework that enhances safety in LLM-powered search agents by synthesizing attack trajectories and isolating risky actions. It achieves a favorable safety-utility trade-off with less training data.
Harness-1 is a 20B open search agent trained with state-externalizing harnesses. It achieves strong retrieval performance and outperforms larger frontier models on several benchmarks.
GrepSeek trains LLM search agents to interact directly with a text corpus through shell commands such as grep. Its two-stage training pipeline, cold-start dataset construction followed by GRPO refinement, achieves strong F1 and Exact Match on open-domain QA benchmarks.
EVE-Agent introduces a framework for self-evolving search agents that ensure evidence verifiability by generating questions, answers, and evidence spans, and by training on the marginal accuracy gain from the evidence. This improves grounded correctness without human annotations.
QUEST is an open family of deep research agents trained with synthetic data and reinforcement learning. It achieves strong performance across diverse long-horizon search tasks and approaches frontier closed-source agents.
OpenSeeker fully open-sources training data and models for 30B-scale ReAct-based search agents, achieving state-of-the-art performance on multiple benchmarks including BrowseComp and Humanity's Last Exam. It is the first purely academic project to reach frontier search benchmark performance while releasing complete training data.
This paper introduces a method using knowledge-graph paths as intermediate supervision to improve self-evolving search agents. It addresses bottlenecks in Search Self-Play by grounding question construction in relational context and introducing a Waypoint Coverage Reward for graded partial credit.
OpenSearch-VL is an open-source framework that provides a recipe for training frontier multimodal search agents with reinforcement learning, featuring specialized data curation and a novel training algorithm.
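The ARBOR summary above refers to "zero-gradient training groups" under outcome-only rewards. As a rough illustration of that general idea (not ARBOR's implementation), the sketch below computes group-relative advantages in the style commonly used for GRPO: each rollout's reward is normalized by the group's mean and standard deviation. When every rollout in a group receives the same outcome reward, all advantages are zero and the group contributes no learning signal. The function names, the `eps` constant, and the process scores with their 0.5 weight are assumptions made up for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts (GRPO-style).

    eps is a hypothetical stabilizer to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_zero_gradient_group(rewards):
    """True when all rollouts share one reward, so every advantage is 0."""
    return len(set(rewards)) <= 1


# Outcome-only rewards: every rollout failed, so the group is uninformative.
all_fail = [0.0, 0.0, 0.0, 0.0]
zero_adv = group_relative_advantages(all_fail)

# Hypothetical per-rollout process scores (illustrative values only).
# Adding them to the outcome reward breaks the tie inside the group.
process_scores = [0.2, 0.8, 0.2, 0.8]
shaped = [o + 0.5 * p for o, p in zip(all_fail, process_scores)]
shaped_adv = group_relative_advantages(shaped)

print(is_zero_gradient_group(all_fail))  # the outcome-only group
print(is_zero_gradient_group(shaped))    # the same group with process scores
```

In this toy setup the outcome-only group yields identical advantages of zero, while the shaped group separates the rollouts with higher process scores from those with lower ones.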