Recovering from Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents

Abstract

GUI agents have advanced rapidly, but they often lack the robustness to recover from their own errors, which hinders real-world deployment. To bridge this gap at both the evaluation level and the data level, we introduce GUI-RobustEval and propose Robustness-driven Trajectory Synthesis (RoTS). GUI-RobustEval contains 1,216 executable test cases that systematically measure error recovery capabilities across a broad and realistic spectrum of error modes. At the data level, RoTS is a scalable synthesis framework that generates 800k high-quality data samples through a tree-based pipeline, which proactively discovers diverse error modes and synthesizes the corresponding recovery steps. Our two models, RoTS-7B and RoTS-32B, are fine-tuned on this dataset, and both show significant gains on GUI-RobustEval and on traditional GUI benchmarks. Notably, RoTS-32B achieves state-of-the-art performance on OSWorld, with a 47.4% success rate and a 33.8% All-Pass@4 score, suggesting that improved long-horizon error recovery ability contributes to both robustness and overall performance. Our code is available at https://github.com/AlibabaResearch/RoTS.
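The abstract reports both a success rate and an All-Pass@4 score without defining either. The sketch below shows one plausible reading, assuming that All-Pass@k counts a task as passed only when all k independent attempts succeed, and that the success rate is the mean over individual attempts. Both definitions, the function names, and the toy task results are assumptions made for illustration. They are not taken from the paper or its code release.

```python
from typing import Dict, List


def success_rate(results: Dict[str, List[bool]]) -> float:
    """Mean per-attempt success over every attempt of every task (assumed definition)."""
    attempts = [ok for runs in results.values() for ok in runs]
    if not attempts:
        raise ValueError("no attempts recorded")
    return sum(attempts) / len(attempts)


def all_pass_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks whose first k attempts all succeed (assumed definition)."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Only tasks with at least k recorded attempts can be scored.
    eligible = [runs[:k] for runs in results.values() if len(runs) >= k]
    if not eligible:
        raise ValueError("no task has at least k attempts")
    return sum(all(runs) for runs in eligible) / len(eligible)


# Hypothetical results: four tasks, four attempts each.
toy_results = {
    "open_settings": [True, True, True, True],
    "rename_file": [True, False, True, True],
    "export_pdf": [False, False, False, False],
    "send_email": [True, True, True, True],
}

print(success_rate(toy_results))      # 11 successes out of 16 attempts
print(all_pass_at_k(toy_results, 4))  # 2 of 4 tasks pass on every attempt
```

Under this reading, a single failed attempt is enough to fail the task, so All-Pass@k rewards consistency across repeated runs and is always at most the per-attempt success rate.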