UI-Venus Technical Report: Building High-performance UI Agents with RFT

Abstract

We present UI-Venus, a native UI agent built on a multimodal large language model that takes only screenshots as input. UI-Venus achieves SOTA performance on both UI grounding and navigation tasks using only several hundred thousand high-quality training samples, through reinforcement fine-tuning (RFT) based on Qwen2.5-VL. Specifically, the 7B and 72B variants of UI-Venus obtain 94.1% / 50.8% and 95.3% / 61.9%, respectively, on the standard grounding benchmarks Screenspot-V2 / Pro, surpassing previous SOTA baselines including the open-source GTA1 and the closed-source UI-TARS-1.5. To show UI-Venus's summarization and planning ability, we also evaluate it on AndroidWorld, an online UI navigation arena, where our 7B and 72B variants achieve success rates of 49.1% and 65.9%, respectively, also beating existing models. To achieve this, we introduce carefully designed reward functions for both UI grounding and navigation tasks, together with corresponding efficient data cleaning strategies. To further boost navigation performance, we propose Self-Evolving Trajectory History Alignment and Sparse Action Enhancement, which refines historical reasoning traces and balances the distribution of sparse but critical actions, leading to more coherent planning and better generalization in complex UI tasks. Our contributions include the release of SOTA open-source UI agents, comprehensive data cleaning protocols, and a novel self-evolving framework for improving navigation performance, which we hope will encourage further research and development in the community. Code is available at https://github.com/antgroup/UI-Venus.
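The abstract mentions reward functions for UI grounding but does not give their form. As a purely illustrative sketch, and not the reward actually used by UI-Venus, one simple grounding reward checks whether a predicted click point falls inside the target element's bounding box. The function name `point_in_box_reward` and the `(x1, y1, x2, y2)` pixel box convention are assumptions made for this example.

```python
from typing import Tuple

Point = Tuple[float, float]
# Assumed convention: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2, in pixels.
Box = Tuple[float, float, float, float]


def point_in_box_reward(pred: Point, target: Box) -> float:
    """Hypothetical binary grounding reward.

    Returns 1.0 if the predicted click lies inside (or on the border of)
    the target bounding box, and 0.0 otherwise.
    """
    x, y = pred
    x1, y1, x2, y2 = target
    inside = x1 <= x <= x2 and y1 <= y <= y2
    return 1.0 if inside else 0.0


if __name__ == "__main__":
    button = (100.0, 200.0, 180.0, 240.0)
    print(point_in_box_reward((140.0, 220.0), button))  # click inside the box
    print(point_in_box_reward((90.0, 220.0), button))   # click left of the box
```

A binary signal like this is easy to verify automatically, which is one reason such rewards are a natural fit for reinforcement fine-tuning. A real system would likely combine it with format checks or other terms that the abstract does not detail.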
