UI-Venus Technical Report: Building High-performance UI Agents with RFT

Abstract

We present UI-Venus, a native UI agent built on a multimodal large language model that takes only screenshots as input. UI-Venus achieves SOTA performance on both UI grounding and navigation tasks through reinforcement fine-tuning (RFT) of Qwen2.5-VL, using only several hundred thousand high-quality training samples. Specifically, the 7B and 72B variants of UI-Venus obtain 94.1% / 50.8% and 95.3% / 61.9%, respectively, on the standard grounding benchmarks Screenspot-V2 / Pro, surpassing previous SOTA baselines including the open-source GTA1 and the closed-source UI-TARS-1.5. To demonstrate UI-Venus's summarization and planning abilities, we also evaluate it on AndroidWorld, an online UI navigation arena, where the 7B and 72B variants achieve success rates of 49.1% and 65.9%, respectively, again outperforming existing models. To achieve this, we introduce carefully designed reward functions for both UI grounding and navigation tasks, together with corresponding efficient data cleaning strategies. To further boost navigation performance, we propose Self-Evolving Trajectory History Alignment and Sparse Action Enhancement, which refines historical reasoning traces and balances the distribution of sparse but critical actions, leading to more coherent planning and better generalization in complex UI tasks. Our contributions include the release of SOTA open-source UI agents, comprehensive data cleaning protocols, and a novel self-evolving framework for improving navigation performance, all intended to encourage further research and development in the community. Code is available at https://github.com/antgroup/UI-Venus.
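The abstract reports grounding scores and mentions reward functions for UI grounding without spelling them out. As a rough illustration only, the sketch below assumes the common convention for click-based grounding benchmarks: a prediction counts as correct when the predicted click point falls inside the target element's bounding box. The function names `point_in_box_reward` and `grounding_accuracy`, and the box layout `(x_min, y_min, x_max, y_max)`, are hypothetical and are not taken from the UI-Venus code; the actual reward design in the report may differ.

```python
from typing import Sequence, Tuple

# Assumed layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]
Point = Tuple[float, float]


def point_in_box_reward(pred: Point, box: Box) -> float:
    """Return 1.0 if the predicted click lies inside the box (edges included), else 0.0.

    Hypothetical binary reward; not the reward function used by UI-Venus.
    """
    x, y = pred
    x_min, y_min, x_max, y_max = box
    return 1.0 if (x_min <= x <= x_max and y_min <= y <= y_max) else 0.0


def grounding_accuracy(preds: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Fraction of predictions whose click point falls inside its target box."""
    if len(preds) != len(boxes):
        raise ValueError("preds and boxes must have the same length")
    if not preds:
        return 0.0
    hits = sum(point_in_box_reward(p, b) for p, b in zip(preds, boxes))
    return hits / len(preds)


if __name__ == "__main__":
    target = (0.0, 0.0, 10.0, 10.0)
    # One click inside the target, one outside.
    print(grounding_accuracy([(5.0, 5.0), (20.0, 20.0)], [target, target]))
```

A binary signal of this kind is easy to verify automatically, which is one reason it is a natural candidate for reinforcement fine-tuning; whether UI-Venus uses exactly this form is not stated in the abstract.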
