

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

September 2, 2025
作者: Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, Wanjun Zhong, Yining Ye, Yujia Qin, Yuwen Xiong, Yuxin Song, Zhiyong Wu, Bo Li, Chen Dun, Chong Liu, Fuxing Leng, Hanbin Wang, Hao Yu, Haobin Chen, Hongyi Guo, Jing Su, Jingjia Huang, Kai Shen, Kaiyu Shi, Lin Yan, Peiyao Zhao, Pengfei Liu, Qinghao Ye, Renjie Zheng, Wayne Xin Zhao, Wen Heng, Wenhao Huang, Wenqian Wang, Xiaobo Qin, Yi Lin, Youbin Wu, Zehui Chen, Zihao Wang, Baoquan Zhong, Xinchun Zhang, Xujing Li, Yuanfan Li, Zhongkai Zhao, Chengquan Jiang, Faming Wu, Haotian Zhou, Jinlin Pang, Li Han, Qianli Ma, Siyao Liu, Songhua Cai, Wenqi Fu, Xin Liu, Zhi Zhang, Bo Zhou, Guoliang Li, Jiajun Shi, Jiale Yang, Jie Tang, Li Li, Taoran Lu, Woyu Lin, Xiaokang Tong, Xinyao Li, Yichi Zhang, Yu Miao, Zhengxuan Jiang, Zili Li, Ziyuan Zhao, Chenxin Li, Dehua Ma, Feng Lin, Ge Zhang, Haihua Yang, Hangyu Guo, Hongda Zhu, Jiaheng Liu, Junda Du, Kai Cai, Kuanye Li, Lichen Yuan, Meilan Han, Minchao Wang, Shuyue Guo, Tianhao Cheng, Xiaobo Ma, Xiaojun Xiao, Xiaolong Huang, Xinjie Chen, Yidi Du, Yilin Chen, Yiwen Wang, Zhaojian Li, Zhenzhu Yang, Zhiyuan Zeng, Chaolin Jin, Chen Li, Hao Chen, Haoli Chen, Jian Chen, Qinghao Zhao, Guang Shi
cs.AI

Abstract

The development of autonomous agents for graphical user interfaces (GUIs) presents major challenges in artificial intelligence. While recent advances in native agent models have shown promise by unifying perception, reasoning, action, and memory through end-to-end learning, open problems remain in data scalability, multi-turn reinforcement learning (RL), the limitations of GUI-only operation, and environment stability. In this technical report, we present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology: a data flywheel for scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI environment that integrates file systems and terminals, and a unified sandbox platform for large-scale rollouts. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5. On GUI benchmarks, it reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld, outperforming strong baselines such as Claude and OpenAI agents. In game environments, it attains a mean normalized score of 59.8 across a 15-game suite (roughly 60% of human-level performance) and remains competitive with frontier proprietary models (e.g., OpenAI o3) on LMGame-Bench. Additionally, the model can generalize to long-horizon information-seeking tasks and software engineering benchmarks, highlighting its robustness across diverse agent tasks. Detailed analyses of training dynamics further provide insights into achieving stability and efficiency in large-scale agent RL. These results underscore UI-TARS-2's potential to advance the state of GUI agents and exhibit strong generalization to real-world interactive scenarios.
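
To make the abstract's description concrete, the sketch below illustrates what a hybrid GUI-plus-terminal action space and a multi-turn rollout loop of the kind described could look like. This is a minimal illustration only: every name here (HybridAction, rollout_episode, the env/policy interfaces) is a hypothetical assumption, not UI-TARS-2's actual API, which the report does not expose at this level.

```python
# Hypothetical sketch of a hybrid GUI + terminal action space and a
# multi-turn rollout loop, as the abstract describes. All class and
# function names are illustrative assumptions, not UI-TARS-2's API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class HybridAction:
    """One agent step: either a GUI operation or a shell/file command."""
    kind: str                       # e.g. "click", "type", "scroll", "shell"
    target: Optional[tuple] = None  # screen coordinates for GUI actions
    text: Optional[str] = None      # typed text or a terminal command

def rollout_episode(policy, env, max_turns: int = 50):
    """Collect one multi-turn trajectory for RL training.

    `policy` maps an observation (screenshot plus terminal/file-system
    state) to a HybridAction; `env` is a sandboxed hybrid environment
    exposing reset()/step(). Both interfaces are assumed for illustration.
    """
    trajectory = []
    obs = env.reset()
    for _ in range(max_turns):
        action = policy(obs)                   # perception -> reasoning -> action
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        obs = next_obs
        if done:                               # task solved or sandbox terminated
            break
    return trajectory                          # consumed by the multi-turn RL update
```

In practice, a "unified sandbox platform for large-scale rollouts" would run many such episodes in parallel across isolated environment instances and feed the resulting trajectories into a policy update; the abstract does not specify the particular RL algorithm used.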