GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding

November 2, 2025
Authors: Shijie Zhou, Viet Dac Lai, Hao Tan, Jihyung Kil, Wanrong Zhu, Changyou Chen, Ruiyi Zhang
cs.AI

Abstract

Graphical user interface (GUI) grounding is a key function of computer-use agents: it maps natural-language instructions to actionable screen regions. Existing approaches based on Multimodal Large Language Models (MLLMs) typically formulate it as a text-based coordinate generation task, yet directly generating precise coordinates from visual inputs remains challenging and computationally intensive. An intuitive alternative is to first select the visual patches relevant to the instruction and then determine the precise click location within those patches. Based on the observation that general MLLMs carry some native grounding capability nested within their attention, we propose GUI-AIMA, an attention-based, coordinate-free supervised fine-tuning framework for efficient GUI grounding. GUI-AIMA aligns the intrinsic multimodal attention of MLLMs with patch-wise grounding signals, which are computed adaptively for diverse user instructions by multi-head aggregation over simplified query-visual attention matrices. Moreover, its coordinate-free design easily accommodates a plug-and-play zoom-in stage. GUI-AIMA-3B was trained with only 85k screenshots, demonstrating exceptional data efficiency and verifying that lightweight training can trigger the native grounding capability of MLLMs. It achieves state-of-the-art performance among 3B models, attaining an average accuracy of 58.6% on ScreenSpot-Pro and 62.2% on OSWorld-G. Project page: https://github.com/sjz5202/GUI-AIMA
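To make the attention read-out concrete, here is a minimal, hypothetical sketch (not the authors' released code) of how a patch-wise grounding signal could be extracted from a query-visual attention matrix via multi-head aggregation, and how the top-scoring patch could be mapped to a click point. The tensor shapes, the per-head normalization, the 28-pixel patch size, and all function names are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch: patch-wise grounding scores from one layer's attention.
import torch

def patch_grounding_signal(attn: torch.Tensor,
                           query_idx: int,
                           visual_idx: torch.Tensor) -> torch.Tensor:
    """attn: (num_heads, seq_len, seq_len) attention weights from one layer.
    visual_idx: indices of the visual patch tokens within the sequence.
    Returns a per-patch grounding score of shape (num_patches,)."""
    # Slice the simplified query->visual attention matrix: how strongly the
    # instruction (query) token attends to each visual patch token.
    qv = attn[:, query_idx, :][:, visual_idx]          # (num_heads, num_patches)
    # Multi-head aggregation: renormalize per head, then average across heads
    # so that no single head dominates the aggregated signal.
    qv = qv / (qv.sum(dim=-1, keepdim=True) + 1e-8)
    return qv.mean(dim=0)                              # (num_patches,)

def patch_to_click(scores: torch.Tensor, grid_w: int, patch_px: int = 28):
    """Map the highest-scoring patch to an (x, y) click point in pixels,
    assuming patches are laid out row-major on a grid_w-wide grid."""
    idx = int(scores.argmax())
    row, col = divmod(idx, grid_w)
    return (col + 0.5) * patch_px, (row + 0.5) * patch_px
```

In the paper's pipeline this signal is aligned with supervision during fine-tuning rather than used zero-shot; the sketch illustrates only the attention read-out and the coordinate-free patch-to-point step, not the training objective or the zoom-in stage.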