InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

Abstract

Multimodal Large Language Models (MLLMs) have powered Graphical User Interface (GUI) agents, showing promise in automating tasks on computing devices. Recent work has begun to explore reasoning in GUI tasks, with encouraging results. However, many current approaches rely on manually designed reasoning templates, which may yield reasoning that is not sufficiently robust or adaptive for complex GUI environments. Meanwhile, some existing agents continue to operate as Reactive Actors, relying primarily on implicit reasoning that may lack the depth needed for GUI tasks that demand planning and error recovery. We argue that advancing these agents requires a shift from reactive acting to acting based on deliberate reasoning. To facilitate this transformation, we introduce InfiGUI-R1, an MLLM-based GUI agent developed through our Actor2Reasoner framework, a reasoning-centric, two-stage training approach designed to progressively evolve agents from Reactive Actors into Deliberative Reasoners. The first stage, Reasoning Injection, focuses on establishing a basic reasoner. We employ Spatial Reasoning Distillation to transfer cross-modal spatial reasoning capabilities from teacher models to MLLMs through trajectories with explicit reasoning steps, enabling models to integrate GUI visual-spatial information with logical reasoning before generating an action. The second stage, Deliberation Enhancement, uses Reinforcement Learning to refine the basic reasoner into a deliberative one. This stage introduces two approaches: Sub-goal Guidance, which rewards models for generating accurate intermediate sub-goals, and Error Recovery Scenario Construction, which creates failure-and-recovery training scenarios from steps identified as error-prone. Experimental results show that InfiGUI-R1 achieves strong performance in GUI grounding and trajectory tasks. Resources are available at https://github.com/Reallm-Labs/InfiGUI-R1.
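The two Deliberation Enhancement ideas can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how sub-goal accuracy is scored or how error-prone steps are selected. The token-overlap F1 metric, the `Step` fields, the `threshold` value, and the `go_back` recovery action are all assumptions made for illustration.

```python
from dataclasses import dataclass


def subgoal_reward(predicted: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference sub-goal.

    Assumed stand-in metric; the abstract does not specify the scoring rule.
    """
    pred = set(predicted.lower().split())
    ref = set(reference.lower().split())
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


@dataclass
class Step:
    correct_action: str  # ground-truth action for this step
    wrong_action: str    # a plausible mistake at this step (hypothetical field)
    error_rate: float    # share of sampled attempts that failed here (hypothetical field)


def build_recovery_scenarios(steps, threshold=0.5, recovery_action="go_back"):
    """Create failure-and-recovery examples from steps whose error rate is high.

    Each scenario places the wrong action at the end of the history, so the
    training target is a recovery action followed by the correct retry.
    """
    scenarios = []
    for i, step in enumerate(steps):
        if step.error_rate < threshold:
            continue  # not considered error-prone under the assumed threshold
        history = [s.correct_action for s in steps[:i]] + [step.wrong_action]
        scenarios.append(
            {"history": history, "target": recovery_action, "retry": step.correct_action}
        )
    return scenarios
```

In a reinforcement learning loop, a score like `subgoal_reward` would be combined with an action-correctness reward, and the constructed scenarios would be mixed into the training set so the model practices recognizing and undoing its own mistakes.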
