Where Did It Go Wrong? Process-Level Evaluation of Web Agents with Semantic State Tracking
April 8, 2026
Authors: Jiwan Chung, JiHyuk Byun, Vibhav Vineet, Seon Joo Kim
cs.AI
Abstract
Web agents act through long interaction sequences, yet existing benchmarks evaluate only terminal success, discarding all process information and offering little guidance on improvement. In this work, we conduct a process-level analysis of web agents. We introduce WebStep, a benchmark of 1,800 task instances with controlled difficulty and automatic semantic state tracking. Each website exposes a deterministic semantic MDP alongside the GUI: the agent operates on the interface, while the environment records high-level states and transitions in the background, enabling fine-grained analysis without manual annotation. Based on the semantic trajectory, we first show that process metrics reveal differences invisible to outcome evaluation: three agents whose success rates cluster within 31-33% diverge in exploration reach versus execution accuracy. Then, decomposing by skill characterizes the nature of these differences, exposing opposite per-skill rankings hidden within the same website: e.g., on Housing, OpenAI CUA outperforms Qwen3.5 by 23.7% on commit actions yet underperforms it by 15.6% on filtering, pinpointing a concrete skill to improve even within a domain. Bifurcation analysis further localizes the decisive error that loses the task and shows that this error is agent-specific rather than shared. Finally, these differences widen as tasks grow harder: success rate is similar on easy tasks but separates sharply as exploration becomes more demanding. Our process-level analysis opens a new avenue in web agent evaluation, providing fine-grained and actionable insight into where and how each agent should be improved.
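
To make the idea of semantic state tracking and bifurcation analysis concrete, the following is a minimal sketch and not the WebStep implementation. It assumes a toy deterministic semantic MDP written as a dictionary, and it assumes one possible definition of the decisive error: the first logged action that moves the agent from a state where the goal is still reachable to a state where it is not. The state names, action names, and the functions `replay`, `can_reach_goal`, and `bifurcation_step` are all hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical toy semantic MDP: state -> {action: next_state}.
# These states and actions are illustrative, not taken from WebStep.
TRANSITIONS = {
    "home": {"search": "results", "open_ad": "ad_page"},
    "results": {"filter": "filtered", "open_first": "wrong_listing"},
    "filtered": {"open_match": "listing"},
    "listing": {"submit": "done"},
    "ad_page": {"back": "home"},
    "wrong_listing": {"submit": "wrong_done"},
    "wrong_done": {},
    "done": {},
}
GOAL = "done"


def can_reach_goal(state, transitions, goal):
    """Breadth-first search: is the goal still reachable from this state?"""
    seen, queue = {state}, deque([state])
    while queue:
        current = queue.popleft()
        if current == goal:
            return True
        for nxt in transitions[current].values():
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def replay(actions, transitions, start="home"):
    """Replay logged semantic actions and return the visited state sequence."""
    states = [start]
    for action in actions:
        states.append(transitions[states[-1]][action])
    return states


def bifurcation_step(actions, transitions, goal, start="home"):
    """Return the index of the first action after which the goal becomes
    unreachable (an assumed definition of the decisive error), or None."""
    states = replay(actions, transitions, start)
    for i in range(len(actions)):
        before_ok = can_reach_goal(states[i], transitions, goal)
        after_ok = can_reach_goal(states[i + 1], transitions, goal)
        if before_ok and not after_ok:
            return i
    return None


# A failed run: the detour through the ad page is recoverable,
# but opening the first listing without filtering is not.
failed_run = ["open_ad", "back", "search", "open_first", "submit"]
successful_run = ["search", "filter", "open_match", "submit"]

failed_index = bifurcation_step(failed_run, TRANSITIONS, GOAL)
success_index = bifurcation_step(successful_run, TRANSITIONS, GOAL)
```

In this toy trace, `failed_index` points at `open_first` rather than at the earlier, recoverable `open_ad` detour, which is the kind of distinction a terminal success flag cannot express. A real environment would derive the semantic states from the website backend rather than from a hand-written table.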