Tag
This paper introduces LongMemEval-V2, a benchmark for evaluating long-term memory systems in web agents, along with two memory methods: AgentRunbook-R and AgentRunbook-C.
Qwen released WebWorld, an open-source model series for web agents (8B/14B/32B) under Apache 2.0, which improves performance on the MiniWoB++ and WebArena benchmarks.
Apple Research introduces Weblica, a framework for creating scalable and reproducible training environments for visual web agents using HTTP caching and LLM-based synthesis.
This paper introduces Region4Web, a framework that improves web agent performance by organizing observation spaces into functional regions rather than individual elements. It demonstrates that this approach reduces observation length and increases task success rates on the WebArena benchmark.
This paper introduces WebStep, a benchmark and framework for process-level evaluation of web agents using semantic state tracking. It exposes performance differences and localizes errors that terminal success metrics alone do not show.
WebShaper is a formalization-driven framework for synthesizing information-seeking datasets using set theory and Knowledge Projections. Among open-source agents, it achieves state-of-the-art performance on the GAIA and WebWalkerQA benchmarks.