GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior

Abstract

Multimodal Large Language Models (MLLMs) have shown great potential in revolutionizing Graphical User Interface (GUI) automation. However, existing GUI models mostly rely on learning from nearly error-free offline trajectories and therefore lack reflection and error recovery capabilities. To bridge this gap, we propose GUI-Reflection, a novel framework that explicitly integrates self-reflection and error correction capabilities into end-to-end multimodal GUI models through dedicated training stages: GUI-specific pre-training, offline supervised fine-tuning (SFT), and online reflection tuning. GUI-Reflection enables the emergence of self-reflection behavior with fully automated data generation and learning processes, without requiring any human annotation. Specifically, 1) we first propose scalable data pipelines to automatically construct reflection and error correction data from existing successful trajectories. While existing GUI models mainly focus on grounding and UI understanding abilities, we propose the GUI-Reflection Task Suite to learn and evaluate reflection-oriented abilities explicitly. 2) Furthermore, we build a diverse and efficient environment for online training and data collection of GUI models on mobile devices. 3) We also present an iterative online reflection tuning algorithm that leverages the proposed environment, enabling the model to continuously enhance its reflection and error correction abilities. Our framework equips GUI agents with self-reflection and correction capabilities, paving the way for more robust, adaptable, and intelligent GUI automation. All data, models, environments, and tools will be released publicly.
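
The abstract does not spell out how reflection and error correction data are built from successful trajectories, so the following is only an illustrative sketch of one plausible approach, not the authors' pipeline. It assumes a trajectory is a list of (observation, action) steps, injects a wrong action at a randomly chosen step, and emits a training sample whose target is a reflection followed by the original correct action. The names `Step`, `make_correction_sample`, and `candidate_actions`, as well as the wording of the reflection text, are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Step:
    observation: str  # hypothetical stand-in for a screenshot or UI description
    action: str       # the action recorded in the successful trajectory


def make_correction_sample(
    trajectory: List[Step],
    candidate_actions: List[str],
    rng: Optional[random.Random] = None,
) -> Dict[str, object]:
    """Build one error-correction sample from a successful trajectory.

    Assumption: a "wrong" action is simply any candidate action that
    differs from the recorded correct action at the chosen step.
    """
    if not trajectory:
        raise ValueError("trajectory must contain at least one step")
    rng = rng or random.Random()

    # Pick the step at which the synthetic mistake is injected.
    t = rng.randrange(len(trajectory))
    correct = trajectory[t].action

    wrong_choices = [a for a in candidate_actions if a != correct]
    if not wrong_choices:
        raise ValueError("no candidate action differs from the correct one")
    wrong = rng.choice(wrong_choices)

    return {
        "history": [s.action for s in trajectory[:t]],
        "observation": trajectory[t].observation,
        "wrong_action": wrong,
        "correct_action": correct,
        # Target the model would be trained to produce after the mistake.
        "target": (
            f"The previous action '{wrong}' did not advance the task. "
            f"Corrected action: {correct}"
        ),
    }
```

Because the correct action is taken directly from the recorded trajectory, a sample like this needs no human labeling, which is in the spirit of the annotation-free data generation the abstract describes.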
