GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior

Abstract

Multimodal Large Language Models (MLLMs) have shown great potential for revolutionizing Graphical User Interface (GUI) automation. However, existing GUI models mostly learn from nearly error-free offline trajectories and therefore lack reflection and error-recovery capabilities. To bridge this gap, we propose GUI-Reflection, a novel framework that explicitly integrates self-reflection and error correction into end-to-end multimodal GUI models through three dedicated training stages: GUI-specific pre-training, offline supervised fine-tuning (SFT), and online reflection tuning. GUI-Reflection enables self-reflection behavior to emerge through fully automated data generation and learning processes, without requiring any human annotation. Specifically, 1) we first propose scalable data pipelines that automatically construct reflection and error-correction data from existing successful trajectories. Whereas existing GUI models focus mainly on grounding and UI understanding, we propose the GUI-Reflection Task Suite to explicitly learn and evaluate reflection-oriented abilities. 2) Furthermore, we build a diverse and efficient environment for online training and data collection of GUI models on mobile devices. 3) We also present an iterative online reflection tuning algorithm that leverages the proposed environment, enabling the model to continuously improve its reflection and error-correction abilities. Our framework equips GUI agents with self-reflection and correction capabilities, paving the way for more robust, adaptable, and intelligent GUI automation. All data, models, environments, and tools will be released publicly.
