GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction

Abstract

Graphical User Interface (GUI) element grounding, the task of precisely locating elements on screenshots from natural language instructions, is fundamental for agents that interact with GUIs. Deploying this capability directly on resource-constrained devices such as mobile phones is increasingly critical for GUI agents that require low latency. However, current visual grounding methods typically employ large vision-language models (VLMs) with more than 2.5B parameters, which makes them impractical for on-device execution under memory and computational constraints. To address this, this paper introduces GoClick, a lightweight GUI element grounding VLM with only 230M parameters that achieves visual grounding accuracy on par with significantly larger models. Simply downsizing existing decoder-only VLMs is a straightforward way to design a lightweight model, but our experiments reveal that this approach yields suboptimal results. Instead, we select an encoder-decoder architecture, which outperforms decoder-only alternatives at small parameter scales on GUI grounding tasks. Additionally, the limited capacity of small VLMs motivates us to develop a Progressive Data Refinement pipeline that uses task type filtering and data ratio adjustment to extract a high-quality 3.8M-sample core set from a 10.8M-sample raw dataset. Training GoClick on this core set brings notable gains in grounding accuracy. Our experiments show that GoClick excels on multiple GUI element grounding benchmarks while maintaining a small size and high inference speed. GoClick also improves GUI agent performance when integrated into a device-cloud collaboration framework, where it helps cloud-based task planners localize elements precisely and achieve higher success rates. We hope this work serves as a meaningful exploration for the GUI agent community.
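
As a rough illustration of the two operations the abstract names for data refinement (task type filtering and data ratio adjustment), the sketch below filters a toy dataset by task type and then samples per-task quotas from target ratios. It is a single-pass simplification under stated assumptions, not the paper's Progressive Data Refinement pipeline: the function name `refine_core_set`, the task labels, the ratios, and the sample schema are all hypothetical.

```python
import random
from collections import defaultdict


def refine_core_set(samples, keep_tasks, task_ratios, target_size, seed=0):
    """Hypothetical helper: filter by task type, then sample per-task quotas."""
    rng = random.Random(seed)

    # Step 1 (task type filtering): drop samples whose task is not kept.
    by_task = defaultdict(list)
    for sample in samples:
        if sample["task"] in keep_tasks:
            by_task[sample["task"]].append(sample)

    # Step 2 (data ratio adjustment): split the target size across the
    # kept tasks in proportion to the requested ratios.
    total_ratio = sum(task_ratios[task] for task in keep_tasks)
    core = []
    for task in sorted(keep_tasks):
        quota = round(target_size * task_ratios[task] / total_ratio)
        pool = by_task[task]
        # Never request more samples than the pool holds.
        core.extend(rng.sample(pool, min(quota, len(pool))))
    return core


# Toy data with made-up task labels (assumption, not from the paper).
raw = [
    {"id": i, "task": task}
    for i, task in enumerate(["grounding"] * 60 + ["captioning"] * 30 + ["qa"] * 10)
]
core = refine_core_set(
    raw,
    keep_tasks={"grounding", "captioning"},
    task_ratios={"grounding": 3, "captioning": 1},
    target_size=40,
)
```

Here the `qa` samples are filtered out entirely, and the remaining two tasks are sampled at a 3:1 ratio. A progressive variant would presumably repeat such steps while checking downstream grounding accuracy, but the abstract does not specify that procedure.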
