GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction
April 27, 2026
Authors: Hongxin Li, Yuntao Chen, Zhaoxiang Zhang
cs.AI
Abstract
Graphical User Interface (GUI) element grounding, i.e., precisely locating elements on screenshots from natural language instructions, is a fundamental capability for agents that interact with GUIs. Deploying this capability directly on resource-constrained devices such as mobile phones is increasingly critical for GUI agents that require low latency. However, this goal faces a significant challenge: current visual grounding methods typically employ large vision-language models (VLMs) with more than 2.5B parameters, making them impractical for on-device execution due to memory and computational constraints. To address this, this paper introduces GoClick, a lightweight GUI element grounding VLM with only 230M parameters that achieves visual grounding accuracy on par with significantly larger models. Simply downsizing existing decoder-only VLMs is a straightforward way to design a lightweight model, but our experiments reveal that this approach yields suboptimal results. Instead, we adopt an encoder-decoder architecture, which outperforms decoder-only alternatives at small parameter scales on GUI grounding tasks. Additionally, the limited capacity of small VLMs motivates a Progressive Data Refinement pipeline that uses task-type filtering and data-ratio adjustment to extract a high-quality 3.8M-sample core set from a 10.8M-sample raw dataset. Training GoClick on this core set brings notable grounding accuracy gains. Our experiments show that GoClick excels on multiple GUI element grounding benchmarks while maintaining a small footprint and high inference speed. GoClick also enhances GUI agent performance when integrated into a device-cloud collaboration framework, where it helps cloud-based task planners perform precise element localization and achieve higher task success rates. We hope our method serves as a meaningful exploration for the GUI agent community.
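The two-stage Progressive Data Refinement described in the abstract (task-type filtering followed by data-ratio adjustment) can be sketched as below. This is a minimal illustrative sketch, not the paper's actual implementation: the function name `refine_dataset`, the `task_type` field, and the ratio/downsampling scheme are all assumptions.

```python
import random
from collections import defaultdict

def refine_dataset(samples, keep_task_types, target_ratios, seed=0):
    """Sketch of two-stage data refinement (names are illustrative):
    Stage 1: task-type filtering -- drop samples whose task type is
             not deemed useful for GUI grounding.
    Stage 2: data-ratio adjustment -- downsample each remaining task
             type so the final mixture approximates target_ratios.
    `samples` is a list of dicts, each with a 'task_type' key."""
    # Stage 1: keep only the selected task types.
    filtered = [s for s in samples if s["task_type"] in keep_task_types]

    # Stage 2: group by task type, then subsample each group to match
    # the desired share of the filtered pool.
    groups = defaultdict(list)
    for s in filtered:
        groups[s["task_type"]].append(s)

    total = len(filtered)
    rng = random.Random(seed)
    core = []
    for task, ratio in target_ratios.items():
        group = groups.get(task, [])
        n = min(len(group), int(ratio * total))
        core.extend(rng.sample(group, n))
    return core
```

In this sketch the 10.8M-to-3.8M reduction would correspond to choosing `keep_task_types` and `target_ratios` so that roughly a third of the raw pool survives; the paper's actual filtering criteria and mixture weights are not specified in the abstract.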