POINTS-GUI-G: GUI-Grounding Journey

Abstract

The rapid advancement of vision-language models has catalyzed the emergence of GUI agents, which hold immense potential for automating complex tasks, from online shopping to flight booking, thereby alleviating the burden of repetitive digital workflows. As a foundational capability, GUI grounding is typically established as a prerequisite for end-to-end task execution. It enables models to precisely locate interface elements, such as text and icons, so that they can perform accurate operations like clicking and typing. Unlike prior works that fine-tune models already possessing strong spatial awareness (e.g., Qwen3-VL), we aim to master the full technical pipeline by starting from a base model with minimal grounding ability, such as POINTS-1.5. We introduce POINTS-GUI-G-8B, which achieves state-of-the-art performance with scores of 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision. Our model's success is driven by three key factors: (1) Refined Data Engineering, involving the unification of diverse open-source dataset formats alongside sophisticated strategies for augmentation, filtering, and difficulty grading; (2) Improved Training Strategies, including continuous fine-tuning of the vision encoder to enhance perceptual accuracy and maintaining resolution consistency between training and inference; and (3) Reinforcement Learning (RL) with Verifiable Rewards. While RL is traditionally used to bolster reasoning, we demonstrate that it significantly improves precision in the perception-intensive GUI grounding task. Furthermore, GUI grounding provides a natural advantage for RL, as rewards are easily verifiable and highly accurate.
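
To illustrate why grounding rewards are easy to verify, the following is a minimal sketch of one possible reward: a binary check of whether a predicted click point falls inside the target element's bounding box. The abstract does not specify the actual reward used for POINTS-GUI-G, so the function names, the pixel-coordinate convention, and the binary scoring are all assumptions made purely for illustration.

```python
from typing import Sequence, Tuple

# Hypothetical conventions (not taken from the paper): coordinates are in
# pixels, and a box is given as (x_min, y_min, x_max, y_max).
Point = Tuple[float, float]
Box = Tuple[float, float, float, float]


def point_in_box_reward(pred: Point, target: Box) -> float:
    """Return 1.0 if the predicted click lies inside the target box, else 0.0."""
    x, y = pred
    x_min, y_min, x_max, y_max = target
    inside = x_min <= x <= x_max and y_min <= y <= y_max
    return 1.0 if inside else 0.0


def grounding_accuracy(preds: Sequence[Point], targets: Sequence[Box]) -> float:
    """Average the binary reward over a batch of predictions."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    total = sum(point_in_box_reward(p, t) for p, t in zip(preds, targets))
    return total / len(preds)
```

Under these assumptions the check needs only the annotated box for each element, with no learned judge or human rater, which is one way to read the claim that grounding rewards are easily verifiable.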
