GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction
April 27, 2026
作者: Hongxin Li, Yuntao Chen, Zhaoxiang Zhang
cs.AI
Abstract
Graphical User Interface (GUI) element grounding (precisely locating elements on screenshots based on natural language instructions) is a fundamental capability for agents interacting with GUIs. Deploying this capability directly on resource-constrained devices such as mobile phones is increasingly critical for GUI agents that require low latency. However, this goal faces a significant challenge: current visual grounding methods typically employ large vision-language models (VLMs) with more than 2.5B parameters, making them impractical for on-device execution due to memory and computational constraints. To address this, this paper introduces GoClick, a lightweight GUI element grounding VLM with only 230M parameters that achieves excellent visual grounding accuracy, on par even with significantly larger models. Simply downsizing existing decoder-only VLMs is a straightforward way to design a lightweight model, but our experiments reveal that this approach yields suboptimal results. Instead, we adopt an encoder-decoder architecture, which outperforms decoder-only alternatives at small parameter scales for GUI grounding tasks. Additionally, the limited capacity of small VLMs motivated us to develop a Progressive Data Refinement pipeline that uses task type filtering and data ratio adjustment to extract a high-quality 3.8M-sample core set from a 10.8M-sample raw dataset. Training GoClick on this core set yields notable gains in grounding accuracy. Our experiments show that GoClick excels on multiple GUI element grounding benchmarks while maintaining a small size and high inference speed. GoClick also enhances GUI agent performance when integrated into a device-cloud collaboration framework, where it helps cloud-based task planners perform precise element localization and achieve higher task success rates. We hope our method serves as a meaningful exploration for the GUI agent community.
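The two-stage Progressive Data Refinement described above (task type filtering followed by data ratio adjustment) can be sketched as follows. This is a minimal illustration of the general idea only, not the paper's implementation: the task names, the per-task cap, and the `refine` helper are all hypothetical.

```python
# Hypothetical sketch of Progressive Data Refinement: (1) keep only selected
# task types, (2) rebalance data ratios by capping each task's sample count.
# Task names and the cap value are illustrative, not from the paper.
from collections import defaultdict
import random

KEEP_TASKS = {"element_grounding", "referring"}  # assumed useful task types

def refine(samples, cap_per_task, seed=0):
    """Filter samples by task type, then cap each task's count to
    adjust the mixture ratio of the resulting core set."""
    by_task = defaultdict(list)
    for s in samples:                       # stage 1: task type filtering
        if s["task"] in KEEP_TASKS:
            by_task[s["task"]].append(s)
    rng = random.Random(seed)
    core = []
    for task, group in by_task.items():     # stage 2: data ratio adjustment
        rng.shuffle(group)
        core.extend(group[:cap_per_task])
    return core

raw = (
    [{"task": "element_grounding", "id": i} for i in range(100)]
    + [{"task": "referring", "id": i} for i in range(30)]
    + [{"task": "ocr_noise", "id": i} for i in range(50)]  # dropped by stage 1
)
core = refine(raw, cap_per_task=40)
print(len(core))  # 40 grounding + 30 referring = 70
```

In this toy run, the dominant grounding task is down-capped while the smaller referring task is kept whole, shifting the task ratio of the core set relative to the raw data.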