UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning
October 23, 2025
Authors: Liangyu Chen, Hanzhang Zhou, Chenglin Cai, Jianan Zhang, Panrong Tong, Quyu Kong, Xu Zhang, Chen Liu, Yuqi Liu, Wenxuan Wang, Yue Wang, Qin Jin, Steven Hoi
cs.AI
Abstract
GUI grounding, which maps natural-language instructions to actionable UI
elements, is a core capability of GUI agents. Prior work largely treats
instructions as a static proxy for user intent, overlooking the impact of
instruction diversity and quality on grounding performance. Through a careful
investigation of existing grounding datasets, we find a 23.3% flaw rate in
their instructions and show that inference-time exploitation of instruction
diversity yields a relative performance improvement of up to 76%. In
this paper, we introduce the Instruction-as-Reasoning paradigm, treating
instructions as dynamic analytical pathways that offer distinct perspectives
and enabling the model to select the most effective pathway during reasoning.
To achieve this, we propose a two-stage training framework: supervised
fine-tuning (SFT) on synthesized, diverse instructions to instill
multi-perspective reasoning, followed by reinforcement learning (RL) to
optimize pathway selection and composition. Our resulting models, UI-Ins-7B and
UI-Ins-32B, achieve state-of-the-art results on five challenging grounding
benchmarks and exhibit emergent reasoning, selectively composing and
synthesizing novel instruction pathways at inference. In particular, UI-Ins-32B
attains the best grounding accuracy, scoring 87.3% on UI-I2E-Bench, 57.0% on
ScreenSpot-Pro, and 84.9% on MMBench-GUI L2. Furthermore, our model
demonstrates strong agentic potential, achieving a 74.1% success rate on
AndroidWorld using UI-Ins-7B as the executor. Our in-depth analysis reveals
additional insights, such as how reasoning can be formulated to enhance rather
than hinder grounding performance, and how our method mitigates policy collapse
in the SFT+RL framework. All code and model checkpoints will be publicly
released at https://github.com/alibaba/UI-Ins.
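
The Instruction-as-Reasoning paradigm described above can be pictured as prompting the grounding model to analyse one instruction from several perspectives (pathways) and to commit to the most reliable one before emitting a click point. Below is a minimal Python sketch of that inference-time idea; every identifier here (`PERSPECTIVES`, `build_prompt`, `ground`, the `model.generate` call, and the `POINT: (x, y)` output format) is a hypothetical stand-in for illustration, not the released UI-Ins interface.

```python
# Minimal sketch of inference-time "Instruction-as-Reasoning": the model
# reasons over several perspectives on one instruction and commits to the
# most reliable pathway before answering. All names are illustrative
# assumptions, not the UI-Ins API.
import re
from typing import Tuple

PERSPECTIVES = (
    "visual appearance",
    "element function",
    "spatial relation",
    "user intent",
)

def build_prompt(instruction: str) -> str:
    views = "\n".join(f"- {p}" for p in PERSPECTIVES)
    return (
        f"Instruction: {instruction}\n"
        "Reason about the target element from each perspective below, "
        "select the most reliable pathway, then answer with 'POINT: (x, y)'.\n"
        f"{views}"
    )

def parse_point(response: str) -> Tuple[int, int]:
    # Extract the final coordinate the model committed to.
    match = re.search(r"POINT:\s*\((\d+),\s*(\d+)\)", response)
    if match is None:
        raise ValueError("model response contained no POINT")
    return int(match.group(1)), int(match.group(2))

def ground(model, screenshot, instruction: str) -> Tuple[int, int]:
    # `model.generate` stands in for any vision-language inference call
    # that takes an image plus a text prompt and returns text.
    response = model.generate(image=screenshot, prompt=build_prompt(instruction))
    return parse_point(response)
```

In the paper's actual models, this pathway selection is not hard-coded in a prompt: per the abstract, SFT on synthesized diverse instructions instills the multi-perspective reasoning, and RL then optimizes how pathways are selected and composed.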
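For the RL stage, the abstract only states that reinforcement learning optimizes pathway selection and composition. A common reward formulation for grounding tasks, sketched below under that assumption, scores a rollout by whether the predicted click lands inside the ground-truth element's bounding box; the exact reward used by UI-Ins may differ.

```python
# Hedged sketch of a point-in-bbox grounding reward, a common choice in
# GUI-grounding RL. This is an assumption for illustration, not the reward
# reported by the UI-Ins paper.
def grounding_reward(point: tuple[float, float],
                     bbox: tuple[float, float, float, float]) -> float:
    """Return 1.0 if the predicted click (x, y) lies inside the
    ground-truth element box (x0, y0, x1, y1), else 0.0."""
    x, y = point
    x0, y0, x1, y1 = bbox
    return 1.0 if (x0 <= x <= x1 and y0 <= y <= y1) else 0.0
```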