UI-Copilot: Advancing Long-Horizon GUI Automation via Tool-Integrated Policy Optimization

April 15, 2026
Authors: Zhengxi Lu, Fei Tang, Guangyi Liu, Kaitao Song, Xu Tan, Jin Ma, Wenqi Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
cs.AI

Abstract

MLLM-based GUI agents have demonstrated strong capabilities in complex user interface interaction tasks. However, long-horizon scenarios remain challenging: these agents are burdened with tasks beyond their intrinsic capabilities and suffer from memory degradation, progress confusion, and math hallucination. To address these challenges, we present UI-Copilot, a collaborative framework in which the GUI agent focuses on task execution while a lightweight copilot provides on-demand assistance for memory retrieval and numerical computation. We introduce memory decoupling to separate persistent observations from transient execution context, and train the policy agent to selectively invoke the copilot as Retriever or Calculator based on task demands. To enable effective learning of tool invocation, we propose Tool-Integrated Policy Optimization (TIPO), which separately optimizes tool selection through single-turn prediction and task execution through on-policy multi-turn rollouts. Experimental results show that UI-Copilot-7B achieves state-of-the-art performance on the challenging MemGUI-Bench, outperforming strong 7B-scale GUI agents such as GUI-Owl-7B and UI-TARS-1.5-7B. Moreover, UI-Copilot-7B delivers a 17.1% absolute improvement on AndroidWorld over the base Qwen model, highlighting UI-Copilot's strong generalization to real-world GUI tasks.
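
The division of labor described in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the names `Memory`, `retriever`, `calculator`, and `copilot`, the keyword-overlap lookup, and the two-step transient window are all assumptions made for illustration. In the paper the copilot is a lightweight model and the policy agent learns when to call it; here the dispatch is hard-coded.

```python
import ast
import operator
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Hypothetical memory decoupling: two separate stores."""
    persistent: list = field(default_factory=list)  # every observation, kept for the whole task
    transient: list = field(default_factory=list)   # only the most recent steps

    def observe(self, note, window=2):
        # Assumption: the transient context is a fixed-size sliding window.
        self.persistent.append(note)
        self.transient = (self.transient + [note])[-window:]


# Arithmetic operators the toy calculator accepts.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expr):
    """Evaluate a plain arithmetic expression without using eval()."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)


def retriever(memory, query):
    """Return persistent notes sharing at least one word with the query."""
    words = set(query.lower().split())
    return [n for n in memory.persistent if words & set(n.lower().split())]


def copilot(tool, arg, memory):
    """Dispatch an on-demand tool call chosen by the (hypothetical) policy agent."""
    if tool == "retriever":
        return retriever(memory, arg)
    if tool == "calculator":
        return calculator(arg)
    raise ValueError(f"unknown tool: {tool}")
```

In this sketch, an agent that noted two prices several steps apart could call `copilot("retriever", "price", memory)` to recover both notes from the persistent store after they have left the transient window, then pass the figures to `copilot("calculator", ...)` rather than doing the arithmetic itself.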