UI-Venus Technical Report: Building High-performance UI Agents with RFT

August 14, 2025
Authors: Zhangxuan Gu, Zhengwen Zeng, Zhenyu Xu, Xingran Zhou, Shuheng Shen, Yunfei Liu, Beitong Zhou, Changhua Meng, Tianyu Xia, Weizhi Chen, Yue Wen, Jingya Dou, Fei Tang, Jinzhen Lin, Yulin Liu, Zhenlin Guo, Yichen Gong, Heng Jia, Changlong Gao, Yuan Guo, Yong Deng, Zhenyu Guo, Liang Chen, Weiqiang Wang
cs.AI

Abstract

We present UI-Venus, a native UI agent built on a multimodal large language model that takes only screenshots as input. UI-Venus achieves SOTA performance on both UI grounding and navigation tasks using only several hundred thousand high-quality training samples, through reinforcement fine-tuning (RFT) based on Qwen2.5-VL. Specifically, the 7B and 72B variants of UI-Venus obtain 94.1% / 50.8% and 95.3% / 61.9% on the standard grounding benchmarks Screenspot-V2 / Pro, surpassing previous SOTA baselines including the open-source GTA1 and the closed-source UI-TARS-1.5. To show UI-Venus's summarization and planning ability, we also evaluate it on AndroidWorld, an online UI navigation arena, where our 7B and 72B variants achieve success rates of 49.1% and 65.9%, again beating existing models. To achieve this, we introduce carefully designed reward functions for both UI grounding and navigation tasks, together with corresponding efficient data cleaning strategies. To further boost navigation performance, we propose Self-Evolving Trajectory History Alignment and Sparse Action Enhancement, which refine historical reasoning traces and balance the distribution of sparse but critical actions, leading to more coherent planning and better generalization in complex UI tasks. Our contributions include the release of SOTA open-source UI agents, comprehensive data cleaning protocols, and a novel self-evolving framework for improving navigation performance, which we hope will encourage further research and development in the community. Code is available at https://github.com/antgroup/UI-Venus.
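The abstract does not spell out how the grounding reward is defined. As a purely illustrative sketch, one simple choice is a binary reward that is 1 when the predicted click point falls inside the target element's bounding box and 0 otherwise, with accuracy taken as the mean of that reward over a set of samples. The function names, the `(x_min, y_min, x_max, y_max)` box format, and the binary scoring are assumptions made for this example, not details taken from the UI-Venus code.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # assumed format: (x_min, y_min, x_max, y_max)


def point_in_box_reward(point: Point, box: Box) -> float:
    """Hypothetical grounding reward: 1.0 if the click lands inside the box, else 0.0."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return 1.0 if (x_min <= x <= x_max and y_min <= y <= y_max) else 0.0


def grounding_accuracy(points: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Mean point-in-box reward over paired predictions and target boxes."""
    if len(points) != len(boxes):
        raise ValueError("points and boxes must have the same length")
    if not points:
        return 0.0
    total = sum(point_in_box_reward(p, b) for p, b in zip(points, boxes))
    return total / len(points)
```

A reward of this shape is easy to verify automatically, which is one reason rule-based rewards are a natural fit for reinforcement fine-tuning; the reward functions actually used for UI-Venus may differ.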