Tag
This paper introduces Ko-WideSearch, a Korean breadth-search benchmark for web agents that evaluates exhaustive set enumeration across 228 tables. Findings show that agents achieve high item recall but struggle with row completion, especially for open-ended cells.
This paper introduces SkillMigrator, an LLM web agent that learns reusable web skills as transferable interaction patterns (TIPs) and transfers them across websites by matching layout structure rather than domain-specific metadata, reducing LLM action counts by 8-10% on the WebArena and Mind2Web benchmarks.
The browser_use team achieved the #1 spot on the Odysseys benchmark, a challenging evaluation for long-horizon web agents, outperforming models like Opus 4.6 and GPT-5.4.
This paper investigates whether online skill and memory modules for web agents are worth their token cost under a fixed inference budget, finding that a budget-matched vanilla baseline often matches or outperforms augmented methods across three domains and models.
This paper introduces WebDecept, a framework for injecting deceptive interface patterns into web environments to evaluate the safety of autonomous web agents. Experiments show current agents are highly susceptible to such manipulations, highlighting safety challenges for real-world deployment.
The paper proposes Signal-Driven Observation (SDO), a method that helps web agents avoid context degradation: the agent reads only the task-relevant parts of the DOM and re-invokes observation only when specific signals trigger it, rather than reading the full page state at every action step.
AsyncWebRL is an asynchronous multi-step reinforcement learning system for vision-language web agents, achieving up to a 2.9x training speedup and a new state of the art on WebGym. It replaces per-trajectory normalization with a constant, which reduces inefficiency tied to trajectory length.
SlimSearcher is a framework that improves efficiency in deep research agents by combining Pareto-efficient trajectory filtering and adaptive reward shaping, reducing tool-call rounds by 17-58% while maintaining accuracy on benchmarks like GAIA, BrowseComp, and XBenchDeepSearch.
This paper proposes SGDR (State-Grounded Dynamic Retrieval), an online skill learning method for web agents that enables stepwise, state-aware skill reuse rather than static task-level retrieval. Experiments on WebArena show that SGDR achieves a 37.5% success rate with GPT-4.1, a ~10.6% relative gain over strong baselines.
This paper proposes SCALE, a framework for self-improving web agents that uses cognitive-aware exploration with three adversarial roles and a graph exploration strategy. It also introduces SCALE-20k, a large-scale dataset built from real websites, and reports significant improvements in MLLM-based web agents.
OpenWebRL presents an open framework for training visual web agents using online multi-turn reinforcement learning on real websites, achieving state-of-the-art performance with minimal initial supervision. Their 4B-parameter model outperforms prior open agents and competes with proprietary systems like OpenAI CUA and Gemini CUA.
This paper introduces GTA, a scalable framework for automatically generating long-horizon, multi-hop web agent tasks with executable trajectories, addressing the lack of process-level supervision in web agent benchmarks. The framework integrates crawling, retrieval-based seeding, and automated quality control to produce realistic tasks across multiple websites.
At Google I/O, Google demonstrated a new Chrome DevTools workflow for AI agents, along with APIs such as WebMCP and HTML-in-Canvas, which aim to make it easy for developers to expose web page functionality to AI agents while preserving semantics, accessibility, and security boundaries.
DRIVE proposes a dual-level skill modeling framework that separates reasoning knowledge from interaction knowledge for web agents under continual learning, achieving a 52.8% task success rate on WebArena, outperforming the skill-free baseline by 7.3 percentage points.
Weasel is a trajectory selection method for offline training of web agents that improves out-of-domain generalization by balancing importance and diversity. It achieves up to 12.5x training speedups and improved performance across several benchmarks.
Accio is a speculative execution framework that reduces cost and latency for web agents by leveraging offline site-structure profiling and online selection of fast paths, achieving a 1.9x reduction in per-task cost and 33.4% latency reduction while maintaining accuracy.
ShopGym is a framework that converts live e-commerce storefronts into self-contained sandbox shops for realistic, controllable, and reproducible benchmarking of web agents, with synthetic tasks across seven skill categories.
SimPersona learns discrete buyer personas from raw clickstreams using a VQ-VAE and maps them to persona tokens for LLM-based web agents, achieving high conversion-rate alignment across many live storefronts.
WebHarbor packages 15 real websites (Amazon, GitHub, BBC, etc.) as self-contained Flask+SQLite apps in a single Docker image with sub-second reset, designed for reproducible web agent evaluation and training. The project invites community contributions to expand to 100+ sites, with co-authorship opportunities.
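The SlimSearcher summary above mentions Pareto-efficient trajectory filtering. As a rough illustration of the general idea only, and not the paper's actual method, the sketch below keeps the trajectories that no other trajectory beats on both accuracy and tool-call count. The `Trajectory` type, its fields, and the sample values are hypothetical and chosen purely for illustration.

```python
from typing import List, NamedTuple


class Trajectory(NamedTuple):
    """Hypothetical record for one agent rollout (not from the paper)."""
    name: str
    accuracy: float   # higher is better
    tool_calls: int   # lower is better


def dominates(a: Trajectory, b: Trajectory) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.tool_calls <= b.tool_calls
    strictly_better = a.accuracy > b.accuracy or a.tool_calls < b.tool_calls
    return no_worse and strictly_better


def pareto_filter(trajectories: List[Trajectory]) -> List[Trajectory]:
    """Keep only trajectories that no other trajectory dominates."""
    return [
        t for t in trajectories
        if not any(dominates(other, t) for other in trajectories)
    ]


# Illustrative values only; these are not results from any paper.
candidates = [
    Trajectory("a", accuracy=1.0, tool_calls=12),
    Trajectory("b", accuracy=1.0, tool_calls=7),
    Trajectory("c", accuracy=0.0, tool_calls=3),
    Trajectory("d", accuracy=0.0, tool_calls=9),
]
kept = pareto_filter(candidates)
print([t.name for t in kept])
```

In this toy set, "a" is dropped because "b" reaches the same accuracy with fewer tool calls, and "d" is dropped because "c" matches its accuracy more cheaply. The pairwise check is quadratic in the number of trajectories, which is fine for a sketch but would need a sort-based pass for large pools.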