InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners
April 19, 2025
Authors: Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, Fei Wu
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have powered Graphical User Interface (GUI) agents, showing promise in automating tasks on computing devices. Recent works have begun exploring reasoning in GUI tasks with encouraging results. However, many current approaches rely on manually designed reasoning templates, which may result in reasoning that is not sufficiently robust and adaptive for complex GUI environments. Meanwhile, some existing agents continue to operate as Reactive Actors, relying primarily on implicit reasoning that may lack sufficient depth for GUI tasks demanding planning and error recovery. We argue that advancing these agents requires a shift from reactive acting towards acting based on deliberate reasoning. To facilitate this transformation, we introduce InfiGUI-R1, an MLLM-based GUI agent developed through our Actor2Reasoner framework, a reasoning-centric, two-stage training approach designed to progressively evolve agents from Reactive Actors to Deliberative Reasoners. The first stage, Reasoning Injection, focuses on establishing a basic reasoner. We employ Spatial Reasoning Distillation to transfer cross-modal spatial reasoning capabilities from teacher models to MLLMs through trajectories with explicit reasoning steps, enabling models to integrate GUI visual-spatial information with logical reasoning before action generation. The second stage, Deliberation Enhancement, refines the basic reasoner into a deliberative one using Reinforcement Learning. This stage introduces two approaches: Sub-goal Guidance, which rewards models for generating accurate intermediate sub-goals, and Error Recovery Scenario Construction, which creates failure-and-recovery training scenarios from identified prone-to-error steps. Experimental results show InfiGUI-R1 achieves strong performance in GUI grounding and trajectory tasks. Resources at https://github.com/Reallm-Labs/InfiGUI-R1.
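As a rough illustration of the Sub-goal Guidance idea described in the abstract, the sketch below scores a single step by combining a sub-goal accuracy term with an action-correctness term. The Step fields, the weights, and the use of difflib.SequenceMatcher as a similarity stand-in are assumptions made for illustration, not the paper's actual reward formulation.

```python
# Hypothetical sketch of a Sub-goal Guidance style reward.
# All field names, weights, and the similarity metric are illustrative assumptions.
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Step:
    predicted_subgoal: str   # intermediate sub-goal text emitted by the agent
    reference_subgoal: str   # sub-goal from a reference trajectory
    predicted_action: str    # serialized action, e.g. 'click(x=320, y=540)'
    reference_action: str    # ground-truth action for this step


def subgoal_guidance_reward(step: Step, w_subgoal: float = 0.3, w_action: float = 0.7) -> float:
    """Combine sub-goal accuracy with action correctness for one step.

    SequenceMatcher is only a stand-in for whatever matcher the paper uses;
    the weighting between the two terms is likewise a placeholder.
    """
    subgoal_score = SequenceMatcher(
        None, step.predicted_subgoal.lower(), step.reference_subgoal.lower()
    ).ratio()
    action_score = 1.0 if step.predicted_action == step.reference_action else 0.0
    return w_subgoal * subgoal_score + w_action * action_score


if __name__ == "__main__":
    step = Step(
        predicted_subgoal="Open the Wi-Fi settings page",
        reference_subgoal="Navigate to the Wi-Fi settings screen",
        predicted_action="click(x=320, y=540)",
        reference_action="click(x=320, y=540)",
    )
    print(f"reward = {subgoal_guidance_reward(step):.3f}")
```

Similarly, the next sketch shows one way Error Recovery Scenario Construction could turn prone-to-error steps into failure-and-recovery training samples. The data structures, the error-rate threshold, the injected wrong action, and the 'back()' corrective action are all hypothetical; the abstract only states that such scenarios are built from identified prone-to-error steps.

```python
# Hypothetical sketch of building failure-and-recovery scenarios from error-prone steps.
from dataclasses import dataclass
from typing import List


@dataclass
class TrajectoryStep:
    observation: str   # e.g. a screenshot reference or serialized UI state
    gold_action: str   # correct action at this step
    error_rate: float  # measured failure rate of the base model on this step


@dataclass
class RecoveryScenario:
    history: List[str]    # actions taken so far, ending with an injected mistake
    observation: str      # state reached after the erroneous action
    target_action: str    # corrective action the agent should produce


def build_recovery_scenarios(traj: List[TrajectoryStep],
                             error_threshold: float = 0.5) -> List[RecoveryScenario]:
    """For each step whose error rate exceeds the threshold, inject a wrong
    action into the history and ask the agent to recover (simplified here to
    a placeholder 'back()' action) before retrying the gold action."""
    scenarios: List[RecoveryScenario] = []
    for i, step in enumerate(traj):
        if step.error_rate < error_threshold:
            continue
        history = [s.gold_action for s in traj[:i]] + ["<wrong_action>"]
        scenarios.append(RecoveryScenario(
            history=history,
            observation=step.observation,
            target_action="back()",  # placeholder corrective action
        ))
    return scenarios
```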