UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
September 2, 2025
Authors: Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, Wanjun Zhong, Yining Ye, Yujia Qin, Yuwen Xiong, Yuxin Song, Zhiyong Wu, Bo Li, Chen Dun, Chong Liu, Fuxing Leng, Hanbin Wang, Hao Yu, Haobin Chen, Hongyi Guo, Jing Su, Jingjia Huang, Kai Shen, Kaiyu Shi, Lin Yan, Peiyao Zhao, Pengfei Liu, Qinghao Ye, Renjie Zheng, Wayne Xin Zhao, Wen Heng, Wenhao Huang, Wenqian Wang, Xiaobo Qin, Yi Lin, Youbin Wu, Zehui Chen, Zihao Wang, Baoquan Zhong, Xinchun Zhang, Xujing Li, Yuanfan Li, Zhongkai Zhao, Chengquan Jiang, Faming Wu, Haotian Zhou, Jinlin Pang, Li Han, Qianli Ma, Siyao Liu, Songhua Cai, Wenqi Fu, Xin Liu, Zhi Zhang, Bo Zhou, Guoliang Li, Jiajun Shi, Jiale Yang, Jie Tang, Li Li, Taoran Lu, Woyu Lin, Xiaokang Tong, Xinyao Li, Yichi Zhang, Yu Miao, Zhengxuan Jiang, Zili Li, Ziyuan Zhao, Chenxin Li, Dehua Ma, Feng Lin, Ge Zhang, Haihua Yang, Hangyu Guo, Hongda Zhu, Jiaheng Liu, Junda Du, Kai Cai, Kuanye Li, Lichen Yuan, Meilan Han, Minchao Wang, Shuyue Guo, Tianhao Cheng, Xiaobo Ma, Xiaojun Xiao, Xiaolong Huang, Xinjie Chen, Yidi Du, Yilin Chen, Yiwen Wang, Zhaojian Li, Zhenzhu Yang, Zhiyuan Zeng, Chaolin Jin, Chen Li, Hao Chen, Haoli Chen, Jian Chen, Qinghao Zhao, Guang Shi
cs.AI
Abstract
The development of autonomous agents for graphical user interfaces (GUIs)
presents major challenges in artificial intelligence. While recent advances in
native agent models have shown promise by unifying perception, reasoning,
action, and memory through end-to-end learning, open problems remain in data
scalability, multi-turn reinforcement learning (RL), the limitations of
GUI-only operation, and environment stability. In this technical report, we
present UI-TARS-2, a native GUI-centered agent model that addresses these
challenges through a systematic training methodology: a data flywheel for
scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI
environment that integrates file systems and terminals, and a unified sandbox
platform for large-scale rollouts. Empirical evaluation demonstrates that
UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5.
On GUI benchmarks, it reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on
WindowsAgentArena, and 73.3 on AndroidWorld, outperforming strong baselines
such as Claude and OpenAI agents. In game environments, it attains a mean
normalized score of 59.8 across a 15-game suite (roughly 60% of human-level
performance) and remains competitive with frontier proprietary models (e.g.,
OpenAI o3) on LMGame-Bench. Additionally, the model can generalize to
long-horizon information-seeking tasks and software engineering benchmarks,
highlighting its robustness across diverse agent tasks. Detailed analyses of
training dynamics further provide insights into achieving stability and
efficiency in large-scale agent RL. These results underscore UI-TARS-2's
potential to advance the state of the art in GUI agents and demonstrate its
strong generalization to real-world interactive scenarios.
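
To make the described training setup more concrete, the following is a minimal sketch of what one multi-turn rollout in a hybrid GUI environment might look like: an agent alternates between GUI actions and terminal/file-system commands inside a sandbox, and the resulting trajectory is collected for RL updates. All interfaces here (`agent.act`, `env.step`, the `Step`/`Trajectory` containers) are illustrative assumptions, not the actual UI-TARS-2 implementation.

```python
# Minimal sketch of a multi-turn rollout loop for a GUI-centered agent
# operating in a hybrid GUI/terminal sandbox. The `agent` and `env`
# interfaces are assumptions for illustration, not the UI-TARS-2 API.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    observation: str      # e.g. a screenshot encoding or accessibility tree
    thought: str          # the agent's reasoning for this turn
    action: str           # a GUI action (click/type/scroll) or a shell command
    reward: float = 0.0


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)

    def total_reward(self) -> float:
        return sum(s.reward for s in self.steps)


def rollout(agent, env, max_turns: int = 50) -> Trajectory:
    """Collect one episode for multi-turn RL: observe, reason, act, repeat."""
    traj = Trajectory()
    obs = env.reset()
    for _ in range(max_turns):
        thought, action = agent.act(obs)       # policy proposes the next action
        obs, reward, done = env.step(action)   # sandbox executes it (GUI or terminal)
        traj.steps.append(Step(obs, thought, action, reward))
        if done:                               # task solved or environment terminated
            break
    return traj
```

In such a setup, many sandboxed environments would typically run these rollouts in parallel, with the collected trajectories feeding the multi-turn RL updates described in the report.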