TreeCUA: Efficiently Scaling GUI Automation with Tree-Structured Verifiable Evolution

Abstract

Effectively scaling GUI automation is essential for computer-use agents (CUAs). However, existing work primarily focuses on scaling GUI grounding rather than GUI planning, which is more crucial and requires more sophisticated data collection. In practice, a CUA's exploration across apps, desktops, and web pages typically follows a tree structure, with earlier functional entry points explored more frequently. Organizing large-scale trajectories into tree structures can therefore reduce data cost and streamline data scaling for GUI planning. In this work, we propose TreeCUA, which efficiently scales GUI automation through tree-structured verifiable evolution. We introduce a multi-agent collaborative framework that explores the environment, verifies actions, summarizes trajectories, and evaluates quality to generate high-quality, scalable GUI trajectories. To improve efficiency, we devise a novel tree-based topology that stores and replays duplicate exploration nodes, and design an adaptive exploration algorithm that balances depth (i.e., trajectory difficulty) and breadth (i.e., trajectory diversity). Moreover, we develop world knowledge guidance and global memory backtracking to avoid low-quality generation. Finally, we build on the abundant tree node information to propose TreeCUA-DPO, which improves GUI planning capability by referring to the branch information of adjacent trajectories. Experimental results show that TreeCUA and TreeCUA-DPO offer significant improvements, and out-of-domain (OOD) studies further demonstrate strong generalization. All trajectory node information and code will be available at https://github.com/UITron-hub/TreeCUA.
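To make the tree-structure idea concrete, the following is a minimal sketch of how trajectories that share early steps could be stored in a prefix tree, so that a common entry point is kept once rather than once per trajectory. It is an illustration only, not the TreeCUA implementation: the names `Node`, `insert`, `count_nodes`, and `sibling_pairs`, as well as the action strings, are hypothetical. The `sibling_pairs` helper only enumerates branches that diverge from a shared prefix; how such branches would be ranked or turned into preference data is not specified in the abstract and is not modeled here.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Node:
    """Hypothetical tree node: the GUI state reached by taking `action`."""
    action: str
    children: dict = field(default_factory=dict)  # action string -> Node
    visits: int = 0  # how many inserted trajectories pass through this node


def insert(root: Node, actions: list[str]) -> int:
    """Add one trajectory to the tree; return the number of new nodes created."""
    created = 0
    node = root
    for action in actions:
        if action not in node.children:
            node.children[action] = Node(action)
            created += 1
        node = node.children[action]
        node.visits += 1
    return created


def count_nodes(node: Node) -> int:
    """Count all nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())


def sibling_pairs(node: Node, prefix: tuple = ()):
    """Yield (shared_prefix, action_a, action_b) for branches that diverge."""
    for a, b in combinations(node.children, 2):
        yield prefix, a, b
    for action, child in node.children.items():
        yield from sibling_pairs(child, prefix + (action,))


# Illustrative trajectories (made-up action names).
trajectories = [
    ["open_settings", "click_display", "set_brightness"],
    ["open_settings", "click_display", "toggle_night_mode"],
    ["open_settings", "click_network", "toggle_wifi"],
]

root = Node("<root>")
for trajectory in trajectories:
    insert(root, trajectory)

total_steps = sum(len(t) for t in trajectories)
print(f"{total_steps} steps stored as {count_nodes(root)} tree nodes")
for prefix, a, b in sibling_pairs(root):
    print(f"after {list(prefix)}: branches {a!r} and {b!r}")
```

In this toy example the three trajectories contain nine steps in total but occupy six tree nodes, because the shared `open_settings` and `click_display` prefixes are stored once. The `visits` counter reflects the observation that earlier entry points are traversed more often.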
