Breaking the Data Barrier -- Building GUI Agents Through Task Generalization

Abstract

Graphical User Interface (GUI) agents offer cross-platform solutions for automating complex digital tasks, with significant potential to transform productivity workflows. However, their performance is often constrained by the scarcity of high-quality trajectory data. To address this limitation, we propose training Vision Language Models (VLMs) on data-rich, reasoning-intensive tasks during a dedicated mid-training stage, and we examine how incorporating these tasks facilitates generalization to GUI planning scenarios. Specifically, we explore a range of tasks with readily available instruction-tuning data, including GUI perception, multimodal reasoning, and textual reasoning. Through extensive experiments across 11 mid-training tasks, we demonstrate that: (1) Task generalization proves highly effective, yielding substantial improvements across most settings. For instance, multimodal mathematical reasoning improves performance on AndroidWorld by an absolute 6.3%. Remarkably, text-only mathematical data significantly boosts GUI web agent performance, achieving a 5.6% improvement on WebArena and a 5.4% improvement on AndroidWorld, which underscores notable cross-modal generalization from text-based to visual domains; (2) Contrary to prior assumptions, GUI perception data, previously considered closely aligned with GUI agent tasks and widely used for training, has a comparatively limited impact on final performance; (3) Building on these insights, we identify the most effective mid-training tasks and curate optimized mixture datasets, resulting in absolute performance gains of 8.0% on WebArena and 12.2% on AndroidWorld. Our work provides valuable insights into cross-domain knowledge transfer for GUI agents and offers a practical approach to addressing data scarcity challenges in this emerging field. The code, data, and models will be available at https://github.com/hkust-nlp/GUIMid.
