VenusBench-Mobile: A Challenging and User-Centric Benchmark for Mobile GUI Agents with Capability Diagnostics

Abstract

Existing online benchmarks for mobile GUI agents remain largely app-centric and task-homogeneous, so they fail to reflect the diversity and instability of real-world mobile usage. To address this, we introduce VenusBench-Mobile, a challenging online benchmark for evaluating general-purpose mobile GUI agents under realistic, user-centric conditions. VenusBench-Mobile rests on two core evaluation pillars: it defines what to evaluate through user-intent-driven task design that reflects real mobile usage, and it defines how to evaluate through a capability-oriented annotation scheme that supports fine-grained analysis of agent behavior. Extensive evaluation of state-of-the-art mobile GUI agents reveals large performance gaps relative to prior benchmarks, indicating that VenusBench-Mobile poses substantially more challenging and realistic tasks and that current agents remain far from reliable real-world deployment. Diagnostic analysis further shows that failures are dominated by deficiencies in perception and memory, which coarse-grained evaluations largely obscure. Moreover, even the strongest agents exhibit near-zero success under environment variations, highlighting their brittleness in realistic settings. Based on these insights, we believe VenusBench-Mobile provides an important stepping stone toward robust real-world deployment of mobile GUI agents. Code and data are available at https://github.com/inclusionAI/UI-Venus/tree/VenusBench-Mobile.
