POINTS-GUI-G: GUI-Grounding Journey
February 6, 2026
Authors: Zhongyin Zhao, Yuan Liu, Yikun Liu, Haicheng Wang, Le Tian, Xiao Zhou, Yangxiu You, Zilin Yu, Yang Yu, Jie Zhou
cs.AI
Abstract
The rapid advancement of vision-language models has catalyzed the emergence of GUI agents, which hold immense potential for automating complex tasks, from online shopping to flight booking, thereby alleviating the burden of repetitive digital workflows. As a foundational capability, GUI grounding is typically established as a prerequisite for end-to-end task execution: it enables models to precisely locate interface elements, such as text and icons, to perform accurate operations like clicking and typing. Unlike prior works that fine-tune models already possessing strong spatial awareness (e.g., Qwen3-VL), we aim to master the full technical pipeline by starting from a base model with minimal grounding ability, such as POINTS-1.5. We introduce POINTS-GUI-G-8B, which achieves state-of-the-art performance with scores of 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision. Our model's success is driven by three key factors: (1) Refined Data Engineering, which unifies the formats of diverse open-source datasets and applies sophisticated strategies for augmentation, filtering, and difficulty grading; (2) Improved Training Strategies, including continually fine-tuning the vision encoder to enhance perceptual accuracy and maintaining resolution consistency between training and inference; and (3) Reinforcement Learning (RL) with Verifiable Rewards. While RL is traditionally used to bolster reasoning, we demonstrate that it significantly improves precision in the perception-intensive GUI grounding task. Furthermore, GUI grounding provides a natural advantage for RL, as rewards are easily verifiable and highly accurate.
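The abstract's point that GUI-grounding rewards are "easily verifiable" can be made concrete with a minimal sketch. The function below is an assumed illustration, not the paper's actual reward: it returns 1.0 when the predicted click point falls inside the ground-truth element's bounding box, and 0.0 otherwise, which is the kind of exact, automatically checkable signal that makes RL with verifiable rewards attractive for this task.

```python
def grounding_reward(pred_xy, gt_box):
    """Verifiable reward for GUI grounding (illustrative sketch).

    pred_xy: (x, y) predicted click coordinates, in pixels.
    gt_box:  (x1, y1, x2, y2) ground-truth element bounding box.
    Returns 1.0 if the click lands inside the box, else 0.0.
    """
    x, y = pred_xy
    x1, y1, x2, y2 = gt_box
    inside = (x1 <= x <= x2) and (y1 <= y <= y2)
    return 1.0 if inside else 0.0
```

Because the reward is a binary hit/miss against annotated boxes, it requires no learned reward model and cannot be gamed by plausible-sounding but wrong outputs, unlike free-form reasoning rewards.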