UI-Copilot: Advancing Long-Horizon GUI Automation via Tool-Integrated Policy Optimization

April 15, 2026
Authors: Zhengxi Lu, Fei Tang, Guangyi Liu, Kaitao Song, Xu Tan, Jin Ma, Wenqi Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
cs.AI

Abstract

MLLM-based GUI agents have demonstrated strong capabilities in complex user interface interaction tasks. However, long-horizon scenarios remain challenging because these agents are burdened with demands beyond their intrinsic capabilities and suffer from memory degradation, progress confusion, and math hallucination. To address these challenges, we present UI-Copilot, a collaborative framework in which the GUI agent focuses on task execution while a lightweight copilot provides on-demand assistance for memory retrieval and numerical computation. We introduce memory decoupling to separate persistent observations from transient execution context, and we train the policy agent to selectively invoke the copilot as Retriever or Calculator based on task demands. To enable effective tool invocation learning, we propose Tool-Integrated Policy Optimization (TIPO), which separately optimizes tool selection through single-turn prediction and task execution through on-policy multi-turn rollouts. Experimental results show that UI-Copilot-7B achieves state-of-the-art performance on the challenging MemGUI-Bench benchmark, outperforming strong 7B-scale GUI agents such as GUI-Owl-7B and UI-TARS-1.5-7B. Moreover, UI-Copilot-7B delivers a 17.1% absolute improvement on AndroidWorld over the base Qwen model, highlighting UI-Copilot's strong generalization to real-world GUI tasks.
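
The abstract gives no implementation details, so the following is only a toy sketch of the two ideas it names: memory decoupling and selective invocation of a copilot as Retriever or Calculator. Every name here (`DecoupledMemory`, `retriever`, `calculator`, `step`, the `window` size, and the keyword-matching lookup) is a hypothetical stand-in chosen for illustration, not the authors' design, and the training procedure (TIPO) is not modeled at all.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DecoupledMemory:
    """Hypothetical store separating persistent observations from transient context."""

    persistent: list[str] = field(default_factory=list)  # kept for the whole task
    transient: list[str] = field(default_factory=list)   # only the most recent steps
    window: int = 2  # assumed size of the transient execution context

    def record(self, observation: str, keep: bool) -> None:
        # Observations flagged as worth keeping go to the persistent store.
        if keep:
            self.persistent.append(observation)
        # Every observation enters the transient context, which is then truncated.
        self.transient = (self.transient + [observation])[-self.window:]


def retriever(memory: DecoupledMemory, query: str) -> list[str]:
    """Toy keyword lookup standing in for the copilot's Retriever mode."""
    words = query.lower().split()
    return [o for o in memory.persistent if any(w in o.lower() for w in words)]


def calculator(values: list[float]) -> float:
    """Toy exact sum standing in for the copilot's Calculator mode."""
    return sum(values)


def step(memory: DecoupledMemory, tool: str, argument):
    """Dispatch one (assumed) policy decision: retrieve, calculate, or act directly."""
    if tool == "retrieve":
        return retriever(memory, argument)
    if tool == "calculate":
        return calculator(argument)
    return argument  # no tool selected: the agent performs the GUI action itself


memory = DecoupledMemory()
memory.record("Price of milk: 3.50", keep=True)
memory.record("Tapped back button", keep=False)
memory.record("Price of bread: 2.25", keep=True)

recalled = step(memory, "retrieve", "milk")
total = step(memory, "calculate", [3.50, 2.25])
```

In this sketch the agent's working context (`transient`) stays short no matter how long the task runs, while facts needed later are recovered on demand from `persistent`, and arithmetic is delegated rather than produced by the model itself.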