Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation

Abstract

In recent years, Multimodal Large Language Models (MLLMs) have been widely applied to multimodal reasoning tasks, including Graphical User Interface (GUI) automation. Unlike general offline multimodal tasks, GUI automation runs in an online interactive environment and requires step-by-step decisions based on the real-time state of that environment. The task tolerates few errors at any step, because mistakes can accumulate and derail the process, and can even cause irreversible outcomes such as deletions or payments. To address these issues, we introduce a pre-operative critic mechanism that gives feedback before an action is actually executed, by reasoning about the action's likely outcome and its correctness. Specifically, we propose a Suggestion-aware Gradient Relative Policy Optimization (S-GRPO) strategy to build our pre-operative critic model, GUI-Critic-R1, and incorporate a novel suggestion reward to make the model's feedback more reliable. Furthermore, we develop a reasoning-bootstrapping based data collection pipeline to create GUI-Critic-Train and GUI-Critic-Test, filling existing gaps in GUI critic data. Static experiments on GUI-Critic-Test across both the mobile and web domains show that GUI-Critic-R1 offers significant advantages in critic accuracy over current MLLMs. Dynamic evaluation on a GUI automation benchmark further highlights the effectiveness and superiority of our model, as evidenced by improved success rates and operational efficiency.
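The pre-operative critic mechanism described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Critique`, `run_step`, `propose`, `critic`, and `execute`, the retry limit, and the toy agent and critic are all hypothetical, chosen only to show an action being vetted, and corrected through a suggestion, before it is executed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Critique:
    correct: bool               # critic's verdict on the proposed action
    suggestion: Optional[str]   # corrective hint when the verdict is negative


def run_step(
    propose: Callable[[str, Optional[str]], str],
    critic: Callable[[str, str], Critique],
    execute: Callable[[str], None],
    state: str,
    max_retries: int = 2,       # assumed limit; the abstract gives none
) -> str:
    """Propose an action, let the critic vet it, and execute only afterwards."""
    hint: Optional[str] = None
    action = propose(state, hint)
    for _ in range(max_retries):
        verdict = critic(state, action)
        if verdict.correct:
            break
        # Feed the critic's suggestion back to the agent and re-propose.
        hint = verdict.suggestion
        action = propose(state, hint)
    # If every retry is rejected, the last proposal is executed unvetted.
    execute(action)
    return action


# Toy stand-ins (hypothetical): the agent follows any hint it receives,
# and the critic rejects the irreversible "tap Delete" action.
def toy_propose(state: str, hint: Optional[str]) -> str:
    return hint if hint else "tap Delete"


def toy_critic(state: str, action: str) -> Critique:
    if action == "tap Delete":
        return Critique(correct=False, suggestion="tap Cancel")
    return Critique(correct=True, suggestion=None)


executed = []
chosen = run_step(toy_propose, toy_critic, executed.append, "confirm dialog")
```

In this toy run the rejected deletion never reaches `execute`; only the corrected action does, which is the property the mechanism relies on when an error would be irreversible.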