UI-Venus Technical Report: Building High-performance UI Agents with RFT

Abstract

We present UI-Venus, a native UI agent built on a multimodal large language model that takes only screenshots as input. UI-Venus achieves state-of-the-art (SOTA) performance on both UI grounding and navigation tasks using only several hundred thousand high-quality training samples, through reinforcement fine-tuning (RFT) based on Qwen2.5-VL. Specifically, the 7B and 72B variants of UI-Venus obtain 94.1% / 50.8% and 95.3% / 61.9%, respectively, on the standard grounding benchmarks ScreenSpot-V2 / ScreenSpot-Pro, surpassing the previous SOTA baselines, including the open-source GTA1 and the closed-source UI-TARS-1.5. To show UI-Venus's summarization and planning abilities, we also evaluate it on AndroidWorld, an online UI navigation arena, where our 7B and 72B variants achieve success rates of 49.1% and 65.9%, respectively, also beating existing models. To achieve this, we introduce carefully designed reward functions for both UI grounding and navigation tasks, along with corresponding efficient data cleaning strategies. To further boost navigation performance, we propose Self-Evolving Trajectory History Alignment & Sparse Action Enhancement, which refines historical reasoning traces and balances the distribution of sparse but critical actions, leading to more coherent planning and better generalization in complex UI tasks. Our contributions include the release of SOTA open-source UI agents, comprehensive data cleaning protocols, and a novel self-evolving framework for improving navigation performance, which together encourage further research and development in the community. Code is available at https://github.com/antgroup/UI-Venus.
