UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning

Abstract

GUI grounding, which maps natural-language instructions to actionable UI elements, is a core capability of GUI agents. Prior work largely treats instructions as a static proxy for user intent, overlooking the impact of instruction diversity and quality on grounding performance. Through a careful investigation of existing grounding datasets, we find a 23.3% flaw rate in their instructions and show that exploiting instruction diversity at inference time yields a relative performance improvement of up to 76%. In this paper, we introduce the Instruction-as-Reasoning paradigm, which treats instructions as dynamic analytical pathways that offer distinct perspectives and enables the model to select the most effective pathway during reasoning. To achieve this, we propose a two-stage training framework: supervised fine-tuning (SFT) on synthesized, diverse instructions to instill multi-perspective reasoning, followed by reinforcement learning (RL) to optimize pathway selection and composition. Our resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging grounding benchmarks and exhibit emergent reasoning, selectively composing and synthesizing novel instruction pathways at inference. In particular, UI-Ins-32B attains the best grounding accuracy, scoring 87.3% on UI-I2E-Bench, 57.0% on ScreenSpot-Pro, and 84.9% on MMBench-GUI L2. Furthermore, our model demonstrates strong agentic potential, achieving a 74.1% success rate on AndroidWorld using UI-Ins-7B as the executor. Our in-depth analysis reveals additional insights, such as how reasoning can be formulated to enhance rather than hinder grounding performance, and how our method mitigates policy collapse in the SFT+RL framework. All code and model checkpoints will be publicly released at https://github.com/alibaba/UI-Ins.
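
To make the two kinds of numbers in the abstract concrete, the following is a minimal sketch, not the authors' evaluation code. It assumes that grounding accuracy is scored as the fraction of predicted click points that fall inside the target element's bounding box, a common convention for grounding benchmarks that the abstract itself does not spell out. The function names and the sample values are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]              # (x, y) predicted click location
Box = Tuple[float, float, float, float]  # (left, top, right, bottom) of the target element


def point_in_box(point: Point, box: Box) -> bool:
    """Return True if the predicted point lies inside the box (edges included)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(preds: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Fraction of predictions that hit their target box (assumed metric)."""
    if len(preds) != len(boxes):
        raise ValueError("preds and boxes must have the same length")
    if not preds:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(preds, boxes))
    return hits / len(preds)


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`: (improved - baseline) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Hypothetical values for illustration only; they are not results from the paper.
sample_preds = [(5.0, 5.0), (50.0, 50.0)]
sample_boxes = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
sample_accuracy = grounding_accuracy(sample_preds, sample_boxes)
sample_gain = relative_improvement(0.25, 0.44)
```

Under this definition, a relative improvement is measured against the baseline score, so it is not the same as an absolute gain in percentage points.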
