UI-TARS: Pioneering Automated GUI Interaction with Native Agents

January 21, 2025
Authors: Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi
cs.AI

Abstract

This paper introduces UI-TARS, a native GUI agent model that perceives only screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance on 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflective thinking, and milestone recognition; (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.
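
The unified action space of innovation (2) and the screenshot-only perception loop are concrete enough to sketch. The following is a minimal Python illustration, assuming normalized screen coordinates and a hypothetical `model.predict` interface; it is not the released UI-TARS API, only a sketch of how a cross-platform action schema and a perceive-reason-act step might fit together.

```python
# Minimal sketch (not the authors' implementation) of a cross-platform
# unified action space, per innovation (2). All names are hypothetical.
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class ActionType(Enum):
    """Primitive human-like GUI actions shared across desktop, web, and mobile."""
    CLICK = "click"
    DOUBLE_CLICK = "double_click"
    DRAG = "drag"
    TYPE = "type"
    SCROLL = "scroll"
    HOTKEY = "hotkey"
    FINISHED = "finished"


@dataclass
class Action:
    """One grounded step: an action type plus screen-space arguments.
    Coordinates are assumed normalized to [0, 1] so the same schema applies
    at any platform resolution (an assumption of this sketch)."""
    type: ActionType
    point: Optional[Tuple[float, float]] = None      # target of a click/scroll
    end_point: Optional[Tuple[float, float]] = None  # drag destination
    text: Optional[str] = None                       # content for TYPE/HOTKEY


def agent_step(model, screenshot_png: bytes, history: List[Action]) -> Action:
    """One perceive-reason-act iteration: the model sees only the screenshot
    (plus its own action history) and emits the next unified action.
    `model.predict` is a hypothetical interface, not the released API."""
    return model.predict(observation=screenshot_png, history=history)
```

Standardizing on one such schema is what lets a single end-to-end model drive both OSWorld-style desktop tasks and AndroidWorld-style mobile tasks without per-platform wrappers.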
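
Innovation (4) likewise describes an algorithmic loop (collect, filter, reflect, retrain) that can be sketched. In the sketch below, `run_episode`, `keep`, `reflect`, and `agent.finetune` are hypothetical stand-ins; the paper's actual filtering and reflection criteria are not reproduced here.

```python
from typing import Callable, List

Trace = List[dict]  # hypothetical: an episode as serialized (screenshot, action) steps


def run_episode(agent, vm) -> Trace:
    """Hypothetical stand-in: roll the agent out on one virtual machine
    and return the resulting interaction trace."""
    raise NotImplementedError


def iterative_training_round(
    agent,
    virtual_machines: list,
    keep: Callable[[Trace], bool],
    reflect: Callable[[Trace], Trace],
) -> None:
    """One round of the loop in innovation (4): collect traces online across
    many VMs, filter out low-quality episodes, reflectively refine the
    survivors (e.g., annotate and correct mistakes), then fine-tune on them."""
    traces = [run_episode(agent, vm) for vm in virtual_machines]  # automatic collection
    refined = [reflect(t) for t in traces if keep(t)]             # filter + reflect
    agent.finetune(refined)  # hypothetical fine-tuning hook; learn from mistakes
```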

