POINTS-GUI-G: GUI-Grounding Journey
February 6, 2026
Authors: Zhongyin Zhao, Yuan Liu, Yikun Liu, Haicheng Wang, Le Tian, Xiao Zhou, Yangxiu You, Zilin Yu, Yang Yu, Jie Zhou
cs.AI
Abstract
The rapid advancement of vision-language models has catalyzed the emergence of GUI agents, which hold immense potential for automating complex tasks, from online shopping to flight booking, thereby alleviating the burden of repetitive digital workflows. As a foundational capability, GUI grounding is typically established as a prerequisite for end-to-end task execution. It enables models to precisely locate interface elements, such as text and icons, to perform accurate operations like clicking and typing. Unlike prior works that fine-tune models already possessing strong spatial awareness (e.g., Qwen3-VL), we aim to master the full technical pipeline by starting from a base model with minimal grounding ability, such as POINTS-1.5. We introduce POINTS-GUI-G-8B, which achieves state-of-the-art performance with scores of 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision. Our model's success is driven by three key factors: (1) Refined Data Engineering, involving the unification of the formats of diverse open-source datasets alongside sophisticated strategies for augmentation, filtering, and difficulty grading; (2) Improved Training Strategies, including continuous fine-tuning of the vision encoder to enhance perceptual accuracy and maintaining resolution consistency between training and inference; and (3) Reinforcement Learning (RL) with Verifiable Rewards. While RL is traditionally used to bolster reasoning, we demonstrate that it significantly improves precision in the perception-intensive GUI grounding task. Furthermore, GUI grounding provides a natural advantage for RL, as rewards are easily verifiable and highly accurate.
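The abstract's closing claim, that GUI-grounding rewards are easily verifiable, can be illustrated with a minimal sketch. A common verifiable reward for grounding is a binary hit test: the model's predicted click point earns reward 1 if it falls inside the ground-truth bounding box of the target element, and 0 otherwise. The function and argument names below are illustrative assumptions, not taken from the paper.

```python
def grounding_reward(pred_x: float, pred_y: float,
                     gt_box: tuple[float, float, float, float]) -> float:
    """Binary verifiable reward for GUI grounding (hypothetical sketch).

    Returns 1.0 if the predicted click point (pred_x, pred_y) lies inside
    the ground-truth element box gt_box = (x1, y1, x2, y2), else 0.0.
    """
    x1, y1, x2, y2 = gt_box
    inside = (x1 <= pred_x <= x2) and (y1 <= pred_y <= y2)
    return 1.0 if inside else 0.0
```

Because the reward is a deterministic geometric check against the annotation, it avoids the noise of learned reward models, which is the "natural advantage" for RL that the abstract refers to.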