Recovering from Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents

Abstract

While GUI agents have advanced rapidly, they often lack the robustness to recover from their own errors, which hinders real-world deployment. To bridge this gap at both the evaluation and data levels, we introduce GUI-RobustEval and propose Robustness-driven Trajectory Synthesis (RoTS). GUI-RobustEval contains 1,216 executable test cases that systematically measure error recovery capabilities across a broad and realistic spectrum of error modes. At the data level, RoTS is a scalable synthesis framework that uses a tree-based pipeline to proactively discover diverse error modes and synthesize the corresponding recovery steps, producing 800k high-quality data samples. Our two models, RoTS-7B and RoTS-32B, both fine-tuned on this dataset, demonstrate significant gains on GUI-RobustEval and on traditional GUI benchmarks. Notably, RoTS-32B achieves state-of-the-art performance on OSWorld, with a 47.4% success rate and a 33.8% All-Pass@4 score, suggesting that improved long-horizon error recovery contributes to both robustness and overall performance. Our code is available at https://github.com/AlibabaResearch/RoTS.