GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior

Abstract

Multimodal Large Language Models (MLLMs) have shown great potential to revolutionize Graphical User Interface (GUI) automation. However, existing GUI models mostly rely on learning from nearly error-free offline trajectories and therefore lack reflection and error recovery capabilities. To bridge this gap, we propose GUI-Reflection, a novel framework that explicitly integrates self-reflection and error correction capabilities into end-to-end multimodal GUI models through dedicated training stages: GUI-specific pre-training, offline supervised fine-tuning (SFT), and online reflection tuning. GUI-Reflection enables self-reflection behavior to emerge through fully automated data generation and learning processes, without requiring any human annotation. Specifically, 1) we first propose scalable data pipelines that automatically construct reflection and error correction data from existing successful trajectories. While existing GUI models mainly focus on grounding and UI understanding abilities, we propose the GUI-Reflection Task Suite to learn and evaluate reflection-oriented abilities explicitly. 2) Furthermore, we build a diverse and efficient environment for online training and data collection of GUI models on mobile devices. 3) We also present an iterative online reflection tuning algorithm that leverages the proposed environment, enabling the model to continuously enhance its reflection and error correction abilities. Our framework equips GUI agents with self-reflection and correction capabilities, paving the way for more robust, adaptable, and intelligent GUI automation. All data, models, environments, and tools will be released publicly.
