Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents

May 28, 2026
Authors: Tianpeng Bu, Xin Liu, Qihua Chen, Hao Jiang, Shurui Li, Hongtao Duan, Lu Jiang, Lulu Hu, Bin Yang, Minying Zhang
cs.AI

Abstract

While GUI agents have advanced rapidly, they often lack the robustness to recover from their own errors, which hinders real-world deployment. To bridge this gap at both the evaluation and data levels, we introduce GUI-RobustEval and propose Robustness-driven Trajectory Synthesis (RoTS). GUI-RobustEval contains 1,216 executable test cases that systematically measure error recovery across a broad and realistic spectrum of error modes. At the data level, RoTS is a scalable synthesis framework that generates 800k high-quality data samples via a tree-based pipeline that proactively discovers diverse error modes and synthesizes the corresponding recovery steps. Our two models, RoTS-7B and RoTS-32B, both fine-tuned on this dataset, show significant gains on GUI-RobustEval and on traditional GUI benchmarks. Notably, RoTS-32B achieves state-of-the-art performance on OSWorld, with a 47.4% success rate and a 33.8% All-Pass@4 score, suggesting that improved long-horizon error recovery contributes to both robustness and overall performance. Our code is available at https://github.com/AlibabaResearch/RoTS.
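
The abstract reports two numbers on OSWorld, a success rate and an All-Pass@4 score, without defining them. The sketch below shows one plausible reading, under the assumption that All-Pass@k counts a task as passed only when all k independent runs succeed, and that the success rate is the mean over individual runs. The function names, the task names, and the toy outcomes are hypothetical and are not taken from the RoTS codebase or from OSWorld.

```python
from typing import Dict, List


def success_rate(runs: Dict[str, List[bool]]) -> float:
    """Fraction of individual runs that succeeded, pooled over all tasks."""
    outcomes = [ok for task_runs in runs.values() for ok in task_runs]
    if not outcomes:
        raise ValueError("no runs recorded")
    return sum(outcomes) / len(outcomes)


def all_pass_at_k(runs: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks whose first k runs ALL succeed.

    Assumed definition: the abstract does not spell out All-Pass@k.
    """
    if not runs:
        raise ValueError("no tasks recorded")
    if any(len(task_runs) < k for task_runs in runs.values()):
        raise ValueError(f"every task needs at least {k} runs")
    passed = sum(all(task_runs[:k]) for task_runs in runs.values())
    return passed / len(runs)


# Hypothetical outcomes: four tasks, four independent runs each.
toy_runs = {
    "open_settings": [True, True, True, True],
    "rename_file": [True, False, True, True],
    "export_pdf": [False, False, False, False],
    "send_email": [True, True, True, True],
}

print(success_rate(toy_runs))      # 11 of 16 runs succeed
print(all_pass_at_k(toy_runs, 4))  # 2 of 4 tasks pass every run
```

Under this reading, All-Pass@k can never exceed the per-run success rate, because a single failed run disqualifies the whole task. That would be consistent with the reported gap between 47.4% and 33.8%, and it is why such a score is a natural proxy for run-to-run robustness.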