UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning
October 23, 2025
Authors: Liangyu Chen, Hanzhang Zhou, Chenglin Cai, Jianan Zhang, Panrong Tong, Quyu Kong, Xu Zhang, Chen Liu, Yuqi Liu, Wenxuan Wang, Yue Wang, Qin Jin, Steven Hoi
cs.AI
Abstract
GUI grounding, which maps natural-language instructions to actionable UI
elements, is a core capability of GUI agents. Prior work largely treats
instructions as a static proxy for user intent, overlooking the impact of
instruction diversity and quality on grounding performance. Through a careful
investigation of existing grounding datasets, we find a 23.3% flaw rate in
their instructions and show that inference-time exploitation of instruction
diversity yields a relative performance improvement of up to 76%. In
this paper, we introduce the Instruction-as-Reasoning paradigm, treating
instructions as dynamic analytical pathways that offer distinct perspectives
and enabling the model to select the most effective pathway during reasoning.
To achieve this, we propose a two-stage training framework: supervised
fine-tuning (SFT) on synthesized, diverse instructions to instill
multi-perspective reasoning, followed by reinforcement learning (RL) to
optimize pathway selection and composition. Our resulting models, UI-Ins-7B and
UI-Ins-32B, achieve state-of-the-art results on five challenging grounding
benchmarks and exhibit emergent reasoning, selectively composing and
synthesizing novel instruction pathways at inference. In particular, UI-Ins-32B
attains the best grounding accuracy, scoring 87.3% on UI-I2E-Bench, 57.0% on
ScreenSpot-Pro, and 84.9% on MMBench-GUI L2. Furthermore, our model
demonstrates strong agentic potential, achieving a 74.1% success rate on
AndroidWorld using UI-Ins-7B as the executor. Our in-depth analysis reveals
additional insights, such as how reasoning can be formulated to enhance rather
than hinder grounding performance, and how our method mitigates policy collapse
in the SFT+RL framework. All code and model checkpoints will be publicly
released at https://github.com/alibaba/UI-Ins.
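
The Instruction-as-Reasoning paradigm described above can be pictured as prompting the grounding model to analyse one instruction from several perspectives (pathways) and to commit to the most reliable one before emitting a click point. Below is a minimal Python sketch of that inference-time idea; every identifier here (`PERSPECTIVES`, `build_prompt`, `ground`, the `model.generate` call, and the `POINT: (x, y)` output format) is a hypothetical stand-in for illustration, not the released UI-Ins interface.

```python
# Minimal sketch of inference-time "Instruction-as-Reasoning": the model
# reasons over several perspectives on one instruction and commits to the
# most reliable pathway before answering. All names are illustrative
# assumptions, not the UI-Ins API.
import re
from typing import Tuple

PERSPECTIVES = (
    "visual appearance",
    "element function",
    "spatial relation",
    "user intent",
)

def build_prompt(instruction: str) -> str:
    views = "\n".join(f"- {p}" for p in PERSPECTIVES)
    return (
        f"Instruction: {instruction}\n"
        "Reason about the target element from each perspective below, "
        "select the most reliable pathway, then answer with 'POINT: (x, y)'.\n"
        f"{views}"
    )

def parse_point(response: str) -> Tuple[int, int]:
    # Extract the final coordinate the model committed to.
    match = re.search(r"POINT:\s*\((\d+),\s*(\d+)\)", response)
    if match is None:
        raise ValueError("model response contained no POINT")
    return int(match.group(1)), int(match.group(2))

def ground(model, screenshot, instruction: str) -> Tuple[int, int]:
    # `model.generate` stands in for any vision-language inference call
    # that takes an image plus a text prompt and returns text.
    response = model.generate(image=screenshot, prompt=build_prompt(instruction))
    return parse_point(response)
```

In the paper's actual models, this pathway selection is not hard-coded in a prompt: per the abstract, SFT on synthesized diverse instructions instills the multi-perspective reasoning, and RL then optimizes how pathways are selected and composed.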
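For the RL stage, the abstract only states that reinforcement learning optimizes pathway selection and composition. A common reward formulation for grounding tasks, sketched below under that assumption, scores a rollout by whether the predicted click lands inside the ground-truth element's bounding box; the exact reward used by UI-Ins may differ.

```python
# Hedged sketch of a point-in-bbox grounding reward, a common choice in
# GUI-grounding RL. This is an assumption for illustration, not the reward
# reported by the UI-Ins paper.
def grounding_reward(point: tuple[float, float],
                     bbox: tuple[float, float, float, float]) -> float:
    """Return 1.0 if the predicted click (x, y) lies inside the
    ground-truth element box (x0, y0, x1, y1), else 0.0."""
    x, y = point
    x0, y0, x1, y1 = bbox
    return 1.0 if (x0 <= x <= x1 and y0 <= y <= y1) else 0.0
```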