Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation
June 5, 2025
Authors: Yuyang Wanyan, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Jiabo Ye, Yutong Kou, Ming Yan, Fei Huang, Xiaoshan Yang, Weiming Dong, Changsheng Xu
cs.AI
Abstract
In recent years, Multimodal Large Language Models (MLLMs) have been
extensively utilized for multimodal reasoning tasks, including Graphical User
Interface (GUI) automation. Unlike general offline multimodal tasks, GUI
automation is executed in online interactive environments, necessitating
step-by-step decision-making based on the real-time status of the environment. This
task has a lower tolerance for decision-making errors at each step, as
mistakes can accumulate, disrupt the process, and potentially lead to
irreversible outcomes such as deletions or payments. To address these issues, we
introduce a pre-operative critic mechanism that provides effective feedback
prior to the actual execution, by reasoning about the potential outcome and
correctness of actions. Specifically, we propose a Suggestion-aware Gradient
Relative Policy Optimization (S-GRPO) strategy to construct our pre-operative
critic model GUI-Critic-R1, incorporating a novel suggestion reward to enhance
the reliability of the model's feedback. Furthermore, we develop a
reasoning-bootstrapping-based data collection pipeline to create
GUI-Critic-Train and GUI-Critic-Test, filling existing gaps in GUI critic
data. Static experiments on GUI-Critic-Test across both mobile and web
domains reveal that our GUI-Critic-R1 offers significant advantages in critic
accuracy compared to current MLLMs. Dynamic evaluation on a GUI automation
benchmark further highlights the effectiveness and superiority of our model, as
evidenced by improved success rates and operational efficiency.
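
The pre-operative critic idea described above can be pictured as a gate between the agent's proposal and the environment: an action is executed only after the critic judges it correct, and otherwise the critic's suggestion is fed back to the agent. The following is a minimal sketch under stated assumptions, not the authors' implementation. The names `Critique`, `guarded_step`, `toy_propose`, and `toy_critic` are hypothetical, the retry limit is an arbitrary choice, and the toy critic is a hand-written rule standing in for a model such as GUI-Critic-R1.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Critique:
    correct: bool    # critic's verdict on the proposed action (hypothetical field)
    suggestion: str  # feedback returned to the agent when the verdict is negative


def guarded_step(
    propose: Callable[[str, Optional[str]], str],
    critic: Callable[[str, str], Critique],
    execute: Callable[[str], None],
    state: str,
    max_retries: int = 2,
) -> Optional[str]:
    """Propose an action, let the critic vet it, and execute only if it passes."""
    feedback: Optional[str] = None
    for _ in range(max_retries + 1):
        action = propose(state, feedback)
        critique = critic(state, action)
        if critique.correct:
            execute(action)  # the environment is touched only after approval
            return action
        feedback = critique.suggestion  # retry with the critic's suggestion
    return None  # every proposal was rejected, so nothing was executed


# Toy stand-ins for the agent and the critic (assumptions for illustration only).
def toy_propose(state: str, feedback: Optional[str]) -> str:
    return "tap:Delete" if feedback is None else "tap:Back"


def toy_critic(state: str, action: str) -> Critique:
    if action == "tap:Delete":
        return Critique(False, "Deletion is irreversible and not required; go back.")
    return Critique(True, "")


executed: List[str] = []
result = guarded_step(toy_propose, toy_critic, executed.append, "settings screen")
```

In this toy run the first proposal (`tap:Delete`) is rejected before it reaches the environment, and only the revised proposal is executed, which is the behavior the abstract motivates for irreversible operations.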