Tag
This paper introduces EvoBrowseComp, a dynamic benchmark of 400 English and 400 Chinese complex questions that are synthesized via live-web traversal to evaluate search agents without test-set contamination, ensuring robustness against parametric memorization.
EvoBrowseComp is an evolving benchmark with 800 contamination-free questions for evaluating search agents, designed to prevent parametric memorization and maintain temporal freshness through a three-agent framework.