POINTS-GUI-G: GUI-Grounding Journey

Abstract

The rapid advancement of vision-language models has catalyzed the emergence of GUI agents, which hold immense potential for automating complex tasks, from online shopping to flight booking, thereby alleviating the burden of repetitive digital workflows. As a foundational capability, GUI grounding is typically treated as a prerequisite for end-to-end task execution. It enables models to precisely locate interface elements, such as text and icons, so that they can perform accurate operations like clicking and typing. Unlike prior works that fine-tune models already possessing strong spatial awareness (e.g., Qwen3-VL), we aim to master the full technical pipeline by starting from a base model with minimal grounding ability, such as POINTS-1.5. We introduce POINTS-GUI-G-8B, which achieves state-of-the-art performance with scores of 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision. Our model's success is driven by three key factors: (1) Refined Data Engineering, involving the unification of diverse open-source dataset formats alongside sophisticated strategies for augmentation, filtering, and difficulty grading; (2) Improved Training Strategies, including continuous fine-tuning of the vision encoder to enhance perceptual accuracy and maintaining resolution consistency between training and inference; and (3) Reinforcement Learning (RL) with Verifiable Rewards. While RL is traditionally used to bolster reasoning, we demonstrate that it significantly improves precision in the perception-intensive GUI grounding task. Furthermore, GUI grounding provides a natural advantage for RL, as rewards are easily verifiable and highly accurate.
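The abstract does not spell out the reward function, so the following is only a minimal sketch of what a verifiable grounding reward could look like, assuming the model outputs a single click point and the ground truth is an axis-aligned bounding box. The names `point_in_box_reward` and `mean_reward` are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def point_in_box_reward(pred: Point, target: Box) -> float:
    """Return 1.0 if the predicted click lies inside the target box, else 0.0.

    Assumption: a binary point-in-box check; the paper may use a different rule.
    """
    x, y = pred
    x_min, y_min, x_max, y_max = target
    inside = x_min <= x <= x_max and y_min <= y <= y_max
    return 1.0 if inside else 0.0


def mean_reward(preds: Sequence[Point], targets: Sequence[Box]) -> float:
    """Average the binary reward over a batch of predictions."""
    rewards = [point_in_box_reward(p, t) for p, t in zip(preds, targets)]
    return sum(rewards) / len(rewards) if rewards else 0.0
```

Because the check is a pure geometric comparison against annotated boxes, it needs no learned reward model, which is one way to read the claim that rewards in GUI grounding are easy to verify.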
