Correcting Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents

Abstract

GUI agents have advanced rapidly, but they often lack the robustness to recover from their own errors, which hinders real-world deployment. To bridge this gap at both the evaluation and data levels, we introduce GUI-RobustEval and propose Robustness-driven Trajectory Synthesis (RoTS). GUI-RobustEval contains 1,216 executable test cases that systematically measure error recovery across a broad and realistic spectrum of error modes. At the data level, RoTS is a scalable synthesis framework that generates 800k high-quality data samples via a tree-based pipeline that proactively discovers diverse error modes and synthesizes the corresponding recovery steps. Our two models, RoTS-7B and RoTS-32B, are fine-tuned on this dataset, and both show significant gains on GUI-RobustEval and on traditional GUI benchmarks. Notably, RoTS-32B achieves state-of-the-art performance on OSWorld, with a 47.4% success rate and a 33.8% All-Pass@4 score, suggesting that improved long-horizon error recovery contributes to both robustness and overall performance. Our code is available at https://github.com/AlibabaResearch/RoTS.
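The abstract reports two metrics without defining them. As a minimal sketch, assuming that success rate is the mean over all individual runs and that All-Pass@k counts a task as solved only when all k independent runs succeed, the two could be computed as below. Both definitions are assumptions rather than the paper's stated formulas, and the function names, task names, and outcomes are hypothetical illustration data, not results from the paper.

```python
from typing import Mapping, Sequence


def success_rate(runs: Mapping[str, Sequence[bool]]) -> float:
    """Mean success over every individual run of every task (assumed definition)."""
    outcomes = [ok for task_runs in runs.values() for ok in task_runs]
    if not outcomes:
        raise ValueError("no runs recorded")
    return sum(outcomes) / len(outcomes)


def all_pass_at_k(runs: Mapping[str, Sequence[bool]], k: int) -> float:
    """Fraction of tasks whose first k runs all succeed (assumed definition)."""
    if not runs:
        raise ValueError("no tasks recorded")
    solved = 0
    for task, task_runs in runs.items():
        if len(task_runs) < k:
            raise ValueError(f"task {task!r} has fewer than {k} runs")
        # A task only counts if every one of its k runs succeeded.
        if all(task_runs[:k]):
            solved += 1
    return solved / len(runs)


# Hypothetical outcomes for four tasks, four runs each.
example_runs = {
    "task_a": [True, True, True, True],
    "task_b": [True, False, True, True],
    "task_c": [False, False, False, False],
    "task_d": [True, True, True, True],
}

print(success_rate(example_runs))      # 11 successes out of 16 runs
print(all_pass_at_k(example_runs, 4))  # 2 of 4 tasks pass all four runs
```

Under these assumed definitions, All-Pass@k can never exceed the per-run success rate, because a single failed run disqualifies the whole task. This is consistent with the reported 33.8% All-Pass@4 being lower than the 47.4% success rate.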