UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning

October 23, 2025
Authors: Liangyu Chen, Hanzhang Zhou, Chenglin Cai, Jianan Zhang, Panrong Tong, Quyu Kong, Xu Zhang, Chen Liu, Yuqi Liu, Wenxuan Wang, Yue Wang, Qin Jin, Steven Hoi
cs.AI

Abstract

GUI grounding, which maps natural-language instructions to actionable UI elements, is a core capability of GUI agents. Prior work largely treats instructions as a static proxy for user intent, overlooking the impact of instruction diversity and quality on grounding performance. Through a careful investigation of existing grounding datasets, we find a 23.3% flaw rate in their instructions and show that inference-time exploitation of instruction diversity yields up to a 76% relative performance improvement. In this paper, we introduce the Instruction-as-Reasoning paradigm, which treats instructions as dynamic analytical pathways that offer distinct perspectives, enabling the model to select the most effective pathway during reasoning. To achieve this, we propose a two-stage training framework: supervised fine-tuning (SFT) on synthesized, diverse instructions to instill multi-perspective reasoning, followed by reinforcement learning (RL) to optimize pathway selection and composition. Our resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging grounding benchmarks and exhibit emergent reasoning, selectively composing and synthesizing novel instruction pathways at inference. In particular, UI-Ins-32B attains the best grounding accuracy, scoring 87.3% on UI-I2E-Bench, 57.0% on ScreenSpot-Pro, and 84.9% on MMBench-GUI L2. Furthermore, our model demonstrates strong agentic potential, achieving a 74.1% success rate on AndroidWorld with UI-Ins-7B as the executor. Our in-depth analysis reveals additional insights, such as how reasoning can be formulated to enhance rather than hinder grounding performance, and how our method mitigates policy collapse in the SFT+RL framework. All code and model checkpoints will be publicly released at https://github.com/alibaba/UI-Ins.
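
To make the Instruction-as-Reasoning paradigm concrete, the sketch below shows one way the inference-time behavior described in the abstract could be orchestrated around a generic vision-language model: the model is prompted to restate the target from several perspectives (analytical pathways) and commit to the most reliable one before predicting a click point. This is illustrative only; `query_vlm`, the prompt wording, the perspective list, and the `<point>x, y</point>` output format are assumptions made for this sketch, not the released UI-Ins interface.

```python
from dataclasses import dataclass

# Perspective names follow the paper's multi-perspective framing (assumed set).
PERSPECTIVES = ["appearance", "function", "spatial relation", "user intent"]

@dataclass
class GroundingResult:
    pathway: str      # perspective the model committed to
    reasoning: str    # free-form rationale emitted before the answer
    point: tuple      # predicted (x, y) click coordinate

def ground(screenshot_path: str, instruction: str, query_vlm) -> GroundingResult:
    """Ask the model to reason over several instruction pathways, then answer.

    `query_vlm` is a hypothetical stand-in for any VLM API taking an image
    and a text prompt and returning the model's raw text output.
    """
    prompt = (
        f"Instruction: {instruction}\n"
        "Before answering, re-describe the target element from each of these "
        f"perspectives: {', '.join(PERSPECTIVES)}. "
        "Select the pathway that identifies the element most unambiguously, "
        "then output the click point as <point>x, y</point>."
    )
    raw = query_vlm(image=screenshot_path, prompt=prompt)
    # Schematic parsing; a real parser should validate the tag structure.
    reasoning, _, answer = raw.partition("<point>")
    coords = answer.split("</point>")[0]
    x, y = (int(v.strip()) for v in coords.split(","))
    pathway = next((p for p in PERSPECTIVES if p in reasoning.lower()), "composite")
    return GroundingResult(pathway=pathway, reasoning=reasoning.strip(), point=(x, y))
```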
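
The RL stage optimizes pathway selection and composition against a verifiable grounding signal. A reward commonly used in SFT+RL pipelines for GUI grounding is binary point-in-box credit, optionally with a small bonus for well-formed output; the sketch below implements that generic form. The weights and the schema check are placeholders, not the paper's exact reward design.

```python
def grounding_reward(pred_point, gt_bbox, well_formed: bool,
                     format_weight: float = 0.1) -> float:
    """Binary hit reward plus a small bonus for parseable output.

    pred_point: (x, y) predicted click coordinate.
    gt_bbox: (x1, y1, x2, y2) ground-truth element box.
    well_formed: whether the rollout matched the expected output schema.
    """
    x, y = pred_point
    x1, y1, x2, y2 = gt_bbox
    hit = 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0
    return hit + (format_weight if well_formed else 0.0)

# A hit with no format bonus scores exactly 1.0; a miss scores 0.0.
assert grounding_reward((120, 48), (100, 30, 180, 60), well_formed=False) == 1.0
assert grounding_reward((10, 10), (100, 30, 180, 60), well_formed=False) == 0.0
```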