Where Did It Go Wrong? Process-Level Evaluation of Web Agents Based on Semantic State Tracking

Abstract

Web agents act through long interaction sequences, yet existing benchmarks evaluate only terminal success, discarding all process information and offering little guidance for improvement. In this work, we conduct a process-level analysis of web agents. We introduce WebStep, a benchmark of 1,800 task instances with controlled difficulty and automatic semantic state tracking. Each website exposes a deterministic semantic MDP alongside its GUI: the agent operates on the interface while the environment records high-level states and transitions in the background, enabling fine-grained analysis without manual annotation. Using these semantic trajectories, we first show that process metrics reveal differences invisible to outcome evaluation: three agents whose success rates cluster within 31-33% diverge in exploration reach and execution accuracy. Decomposing by skill then characterizes the nature of these differences and exposes opposite per-skill rankings hidden within the same website. For example, on Housing, OpenAI CUA outperforms Qwen3.5 by 23.7% on commit actions yet underperforms it by 15.6% on filtering, which pinpoints a concrete skill to improve even within a single domain. Bifurcation analysis further localizes the decisive error that loses the task and shows that this error is agent-specific rather than shared across agents. Finally, these differences widen as tasks grow harder: success rates are similar on easy tasks but separate sharply as exploration becomes more demanding. Our process-level analysis opens a new avenue in web agent evaluation, providing fine-grained and actionable insight into where and how each agent should be improved.