Where Did It Go Wrong? Process-Level Evaluation of Web Agents with Semantic State Tracking
April 8, 2026
Authors: Jiwan Chung, JiHyuk Byun, Vibhav Vineet, Seon Joo Kim
cs.AI
Abstract
Web agents act through long interaction sequences, yet existing benchmarks evaluate only terminal success, discarding all process information and offering little guidance on improvement. In this work, we conduct a process-level analysis of web agents. We introduce WebStep, a benchmark of 1,800 task instances with controlled difficulty and automatic semantic state tracking. Each website exposes a deterministic semantic MDP alongside the GUI: the agent operates on the interface, while the environment records high-level states and transitions in the background, enabling fine-grained analysis without manual annotation. Based on the semantic trajectory, we first show that process metrics reveal differences invisible to outcome evaluation: three agents whose success rates cluster within 31-33% diverge in exploration reach versus execution accuracy. Then, decomposing by skill characterizes the nature of these differences, exposing opposite per-skill rankings hidden within the same website: e.g., on Housing, OpenAI CUA outperforms Qwen3.5 by 23.7% on commit actions yet underperforms it by 15.6% on filtering, pinpointing a concrete skill to improve even within a domain. Bifurcation analysis further localizes the decisive error that loses the task and shows that this error is agent-specific rather than shared. Finally, these differences widen as tasks grow harder: success rate is similar on easy tasks but separates sharply as exploration becomes more demanding. Our process-level analysis opens a new avenue in web agent evaluation, providing fine-grained and actionable insight into where and how each agent should be improved.
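To make the idea of a semantic trajectory concrete, the following minimal Python sketch replays a logged action sequence through a small deterministic MDP and locates the step after which the goal can no longer be reached. Everything in it is an assumption made for illustration: the state and action names, the toy transition table, and in particular the choice to treat the "decisive error" as the first transition that leaves the set of states from which the goal is reachable. The abstract does not state how WebStep defines its metrics or its bifurcation point, so this should not be read as the benchmark's implementation.

```python
from collections import deque

# Hypothetical toy semantic MDP (not from WebStep): state -> {action: next_state}.
TRANSITIONS = {
    "home": {"open_search": "results"},
    "results": {"apply_filter": "filtered", "open_first": "wrong_listing"},
    "filtered": {"open_first": "right_listing"},
    "wrong_listing": {"submit": "booked_wrong"},
    "right_listing": {"submit": "booked_right"},
    "booked_wrong": {},
    "booked_right": {},
}
GOAL = "booked_right"


def states_that_can_reach(goal, transitions):
    """Return every state from which the goal is still reachable."""
    # Build reverse edges, then run a breadth-first search backward from the goal.
    parents = {state: set() for state in transitions}
    for state, edges in transitions.items():
        for nxt in edges.values():
            parents[nxt].add(state)
    reachable, queue = {goal}, deque([goal])
    while queue:
        for prev in parents[queue.popleft()]:
            if prev not in reachable:
                reachable.add(prev)
                queue.append(prev)
    return reachable


def replay(start, actions, transitions):
    """Replay logged actions through the deterministic MDP; return visited states."""
    states = [start]
    for action in actions:
        states.append(transitions[states[-1]][action])
    return states


def first_unrecoverable_step(states, recoverable):
    """Index of the first action that leaves the recoverable set, or None."""
    for i in range(1, len(states)):
        if states[i - 1] in recoverable and states[i] not in recoverable:
            return i - 1
    return None


recoverable = states_that_can_reach(GOAL, TRANSITIONS)
actions = ["open_search", "open_first", "submit"]  # agent skipped the filter
states = replay("home", actions, TRANSITIONS)
step = first_unrecoverable_step(states, recoverable)

print("visited states:", states)
print("distinct states visited:", len(set(states)))
print("decisive action:", None if step is None else actions[step])
```

In this toy run the agent opens a listing without filtering first, so the sketch flags `open_first` as the decisive action even though the failure only becomes visible at `submit`. An outcome-only evaluation would record just the final failure.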