InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners
April 19, 2025
Authors: Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, Fei Wu
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have powered Graphical User Interface (GUI) agents, showing promise in automating tasks on computing devices. Recent works have begun exploring reasoning in GUI tasks with encouraging results. However, many current approaches rely on manually designed reasoning templates, which may result in reasoning that is not sufficiently robust and adaptive for complex GUI environments. Meanwhile, some existing agents continue to operate as Reactive Actors, relying primarily on implicit reasoning that may lack sufficient depth for GUI tasks demanding planning and error recovery. We argue that advancing these agents requires a shift from reactive acting towards acting based on deliberate reasoning. To facilitate this transformation, we introduce InfiGUI-R1, an MLLM-based GUI agent developed through our Actor2Reasoner framework, a reasoning-centric, two-stage training approach designed to progressively evolve agents from Reactive Actors to Deliberative Reasoners. The first stage, Reasoning Injection, focuses on establishing a basic reasoner. We employ Spatial Reasoning Distillation to transfer cross-modal spatial reasoning capabilities from teacher models to MLLMs through trajectories with explicit reasoning steps, enabling models to integrate GUI visual-spatial information with logical reasoning before action generation. The second stage, Deliberation Enhancement, refines the basic reasoner into a deliberative one using Reinforcement Learning. This stage introduces two approaches: Sub-goal Guidance, which rewards models for generating accurate intermediate sub-goals, and Error Recovery Scenario Construction, which creates failure-and-recovery training scenarios from identified prone-to-error steps. Experimental results show InfiGUI-R1 achieves strong performance in GUI grounding and trajectory tasks. Resources at https://github.com/Reallm-Labs/InfiGUI-R1.
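As a rough illustration of the Sub-goal Guidance idea described in the abstract, the sketch below scores a single step by combining a sub-goal accuracy term with an action-correctness term. The Step fields, the weights, and the use of difflib.SequenceMatcher as a similarity stand-in are assumptions made for illustration, not the paper's actual reward formulation.

```python
# Hypothetical sketch of a Sub-goal Guidance style reward.
# All field names, weights, and the similarity metric are illustrative assumptions.
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Step:
    predicted_subgoal: str   # intermediate sub-goal text emitted by the agent
    reference_subgoal: str   # sub-goal from a reference trajectory
    predicted_action: str    # serialized action, e.g. 'click(x=320, y=540)'
    reference_action: str    # ground-truth action for this step


def subgoal_guidance_reward(step: Step, w_subgoal: float = 0.3, w_action: float = 0.7) -> float:
    """Combine sub-goal accuracy with action correctness for one step.

    SequenceMatcher is only a stand-in for whatever matcher the paper uses;
    the weighting between the two terms is likewise a placeholder.
    """
    subgoal_score = SequenceMatcher(
        None, step.predicted_subgoal.lower(), step.reference_subgoal.lower()
    ).ratio()
    action_score = 1.0 if step.predicted_action == step.reference_action else 0.0
    return w_subgoal * subgoal_score + w_action * action_score


if __name__ == "__main__":
    step = Step(
        predicted_subgoal="Open the Wi-Fi settings page",
        reference_subgoal="Navigate to the Wi-Fi settings screen",
        predicted_action="click(x=320, y=540)",
        reference_action="click(x=320, y=540)",
    )
    print(f"reward = {subgoal_guidance_reward(step):.3f}")
```

Similarly, the next sketch shows one way Error Recovery Scenario Construction could turn prone-to-error steps into failure-and-recovery training samples. The data structures, the error-rate threshold, the injected wrong action, and the 'back()' corrective action are all hypothetical; the abstract only states that such scenarios are built from identified prone-to-error steps.

```python
# Hypothetical sketch of building failure-and-recovery scenarios from error-prone steps.
from dataclasses import dataclass
from typing import List


@dataclass
class TrajectoryStep:
    observation: str   # e.g. a screenshot reference or serialized UI state
    gold_action: str   # correct action at this step
    error_rate: float  # measured failure rate of the base model on this step


@dataclass
class RecoveryScenario:
    history: List[str]    # actions taken so far, ending with an injected mistake
    observation: str      # state reached after the erroneous action
    target_action: str    # corrective action the agent should produce


def build_recovery_scenarios(traj: List[TrajectoryStep],
                             error_threshold: float = 0.5) -> List[RecoveryScenario]:
    """For each step whose error rate exceeds the threshold, inject a wrong
    action into the history and ask the agent to recover (simplified here to
    a placeholder 'back()' action) before retrying the gold action."""
    scenarios: List[RecoveryScenario] = []
    for i, step in enumerate(traj):
        if step.error_rate < error_threshold:
            continue
        history = [s.gold_action for s in traj[:i]] + ["<wrong_action>"]
        scenarios.append(RecoveryScenario(
            history=history,
            observation=step.observation,
            target_action="back()",  # placeholder corrective action
        ))
    return scenarios
```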