InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

Abstract

Multimodal Large Language Models (MLLMs) have powered Graphical User Interface (GUI) agents, showing promise in automating tasks on computing devices. Recent works have begun exploring reasoning in GUI tasks with encouraging results. However, many current approaches rely on manually designed reasoning templates, which may yield reasoning that is not sufficiently robust or adaptive for complex GUI environments. Meanwhile, some existing agents continue to operate as Reactive Actors, relying primarily on implicit reasoning that may lack sufficient depth for GUI tasks demanding planning and error recovery. We argue that advancing these agents requires a shift from reactive acting towards acting based on deliberate reasoning. To facilitate this transformation, we introduce InfiGUI-R1, an MLLM-based GUI agent developed through our Actor2Reasoner framework, a reasoning-centric, two-stage training approach designed to progressively evolve agents from Reactive Actors to Deliberative Reasoners. The first stage, Reasoning Injection, focuses on establishing a basic reasoner. We employ Spatial Reasoning Distillation to transfer cross-modal spatial reasoning capabilities from teacher models to MLLMs through trajectories with explicit reasoning steps, enabling models to integrate GUI visual-spatial information with logical reasoning before action generation. The second stage, Deliberation Enhancement, refines the basic reasoner into a deliberative one using Reinforcement Learning. This stage introduces two approaches: Sub-goal Guidance, which rewards models for generating accurate intermediate sub-goals, and Error Recovery Scenario Construction, which creates failure-and-recovery training scenarios from identified error-prone steps. Experimental results show that InfiGUI-R1 achieves strong performance in GUI grounding and trajectory tasks. Resources are available at https://github.com/Reallm-Labs/InfiGUI-R1.
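
To make the two Deliberation Enhancement ideas more concrete, here is a minimal Python sketch of how a sub-goal reward and a failure-and-recovery sample might look. It is an illustration under stated assumptions, not the authors' implementation: the `Step` schema, the token-overlap scoring, the 0.3 weight, and the function names `subgoal_reward`, `step_reward`, and `build_recovery_scenario` are all hypothetical, and the abstract does not specify how either component is actually computed.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    """One annotated step of a GUI trajectory (hypothetical schema)."""
    gold_subgoal: str  # reference sub-goal for this step
    gold_action: str   # reference action, e.g. "click(settings)"


def subgoal_reward(predicted: str, gold: str) -> float:
    """Toy sub-goal score: Jaccard overlap of lowercase tokens, in [0, 1]."""
    pred_tokens = set(predicted.lower().split())
    gold_tokens = set(gold.lower().split())
    if not pred_tokens or not gold_tokens:
        return 0.0
    return len(pred_tokens & gold_tokens) / len(pred_tokens | gold_tokens)


def step_reward(pred_subgoal: str, pred_action: str, step: Step,
                subgoal_weight: float = 0.3) -> float:
    """Blend exact-match action correctness with the sub-goal score.

    The default weight is an arbitrary illustrative value (assumption).
    """
    action_score = 1.0 if pred_action == step.gold_action else 0.0
    sub_score = subgoal_reward(pred_subgoal, step.gold_subgoal)
    return (1.0 - subgoal_weight) * action_score + subgoal_weight * sub_score


def build_recovery_scenario(trajectory: List[Step], error_index: int,
                            wrong_action: str) -> Dict[str, object]:
    """Create a failure-and-recovery sample at an error-prone step.

    Assumption: the history replays the gold actions before the error,
    followed by the injected wrong action; the training target is the
    gold action of the error-prone step, which the agent should recover to.
    """
    if not 0 <= error_index < len(trajectory):
        raise ValueError("error_index is outside the trajectory")
    history = [s.gold_action for s in trajectory[:error_index]]
    history.append(wrong_action)
    return {"history": history,
            "target_action": trajectory[error_index].gold_action}
```

In a reinforcement learning loop, a score like `step_reward` would be computed per sampled response, while samples from `build_recovery_scenario` would be mixed into the training data so the policy sees states that follow a mistake. A real system would likely replace the token-overlap score with a stronger sub-goal matcher.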
