Where Did It Go Wrong? Process-Level Evaluation of Web Agents with Semantic State Tracking
Summary
This paper introduces WebStep, a benchmark and framework for process-level evaluation of web agents using semantic state tracking. It surfaces performance differences between agents and localizes their errors, neither of which terminal success metrics capture.
Cached at: 06/16/26, 11:32 AM
Source: https://huggingface.co/papers/2606.15673
Abstract
WebStep benchmark enables process-level analysis of web agents through semantic MDP tracking, revealing detailed performance differences and error localization that terminal success metrics miss.
Web agents act through long interaction sequences, yet existing benchmarks evaluate only terminal success, discarding all process information and offering little guidance on improvement. In this work, we conduct a process-level analysis of web agents. We introduce WebStep, a benchmark of 1,800 task instances with controlled difficulty and automatic semantic state tracking. Each website exposes a deterministic semantic MDP alongside the GUI: the agent operates on the interface, while the environment records high-level states and transitions in the background, enabling fine-grained analysis without manual annotation. Based on the semantic trajectory, we first show that process metrics reveal differences invisible to outcome evaluation: three agents whose success rates cluster within 31-33% diverge in exploration reach versus execution accuracy. Then, decomposing by skill characterizes the nature of these differences, exposing opposite per-skill rankings hidden within the same website: e.g., on Housing, OpenAI CUA outperforms Qwen3.5 by 23.7% on commit actions yet underperforms it by 15.6% on filtering, pinpointing a concrete skill to improve even within a domain. Bifurcation analysis further localizes the decisive error that loses the task and shows that this error is agent-specific rather than shared. Finally, these differences widen as tasks grow harder: success rate is similar on easy tasks but separates sharply as exploration becomes more demanding. Our process-level analysis opens a new avenue in web agent evaluation, providing fine-grained and actionable insight into where and how each agent should be improved.
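To make the idea of a recorded semantic trajectory and a "decisive error" concrete, here is a minimal Python sketch. It is not the WebStep implementation: the toy transition table, the state and action names, and the functions `can_reach`, `replay`, and `bifurcation_step` are all hypothetical. It also assumes one possible reading of bifurcation analysis, namely that the decisive error is the first action that moves the agent from a state where the goal is still reachable to a state where it is not.

```python
from collections import deque

# Hypothetical toy semantic MDP: (state, action) -> next state.
# State and action names are illustrative and not taken from the paper.
TRANSITIONS = {
    ("home", "search"): "results",
    ("results", "apply_filter"): "filtered",
    ("results", "open_wrong_listing"): "wrong_detail",
    ("filtered", "open_listing"): "detail",
    ("detail", "commit"): "booked",
    ("wrong_detail", "commit"): "booked_wrong",
}


def can_reach(goal, transitions):
    """Return the set of states from which `goal` is reachable (backward BFS)."""
    parents = {}
    for (src, _action), dst in transitions.items():
        parents.setdefault(dst, set()).add(src)
    seen, queue = {goal}, deque([goal])
    while queue:
        node = queue.popleft()
        for prev in parents.get(node, ()):
            if prev not in seen:
                seen.add(prev)
                queue.append(prev)
    return seen


def replay(start, actions, transitions):
    """Record the semantic states visited while replaying `actions`.

    Assumption: an action with no defined transition leaves the state unchanged.
    """
    states = [start]
    for action in actions:
        states.append(transitions.get((states[-1], action), states[-1]))
    return states


def bifurcation_step(states, goal, transitions):
    """Index of the first action after which the goal becomes unreachable.

    Returns None if the trajectory never leaves the goal-reachable region.
    """
    reachable = can_reach(goal, transitions)
    for i in range(len(states) - 1):
        if states[i] in reachable and states[i + 1] not in reachable:
            return i
    return None


failed = replay("home", ["search", "open_wrong_listing", "commit"], TRANSITIONS)
print(failed)                                         # semantic trajectory
print(bifurcation_step(failed, "booked", TRANSITIONS))  # 1
```

In this toy run the agent skips the filter and opens the wrong listing, so the sketch flags action index 1 as the point of no return even though the task only visibly fails at the final commit. Because the semantic MDP is deterministic, reachability can be computed once per task and reused across agents.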
Similar Articles
StepFinder: A Temporal Semantic Framework for Failure Attribution in Multi-Agent Systems
StepFinder is a lightweight framework that uses LLMs only in the feature construction phase to encode execution logs into temporal semantic sequences, then applies parameter-efficient temporal and attention modules for failure attribution in multi-agent systems. It reduces inference time by 79% compared to the fastest LLM-based method on the Who&When benchmark.
Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories
This paper introduces a claim-centric auditing framework for identifying error spans in deep-research agent trajectories, along with a new benchmark TELBench, improving process-level reliability assessment.
Benchmarking Web Agent Safety under E-commerce Deceptive Interfaces
This paper introduces WebDecept, a framework for injecting deceptive interface patterns into web environments to evaluate the safety of autonomous web agents. Experiments show current agents are highly susceptible to such manipulations, highlighting safety challenges for real-world deployment.
Evaluating agents is really hard
The article discusses the challenge of evaluating LLM-based agents that perform multi-step reasoning, noting that scoring only the final output is insufficient because agents may take wrong paths and recover by accident, and raises questions about how to evaluate the trajectory without manual review.
STAGE-Claw: Automated State-based Agent Benchmarking for Realistic Scenarios
This paper introduces STAGE-Claw, an automated framework for building and evaluating realistic personal-agent scenarios in state-based computing environments, enabling scalable, state-based evaluation of LLM-powered agents.