PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control
May 15, 2026
Authors: Jingxuan Wei, Xi Bai, Shan Liu, Caijun Jia, Zheng Sun, Xinglong Xu, Siyuan Li, Linzhuang Sun, Bihui Yu, Conghui He, Cheng Tan
cs.AI
Abstract
Large vision-language models have significantly advanced GUI agents, enabling executable interaction across web, mobile, and desktop interfaces. Yet these gains largely rely on a forgiving region-tolerant paradigm, where many nearby pixels inside the same component remain valid. Precise geometric construction breaks this assumption: actions must land on points in continuous canvas space rather than tolerant regions. Because geometric primitives carry ontological dependencies, a local coordinate error can induce cascading topological failures that distort downstream objects and invalidate the final construction. We identify this regime as precision-sensitive GUI tasks, requiring point-level accuracy, geometry-aware verification, and robustness to dependency-driven error propagation. To benchmark it, we introduce PAGE Bench, with 4,906 problems and over 224K process-supervised, pixel-level GUI actions. We further propose PAGER, a topology-aware agent that decomposes construction into dependency-structured planning and pixel-level execution. Pixel-grounded supervised tuning establishes an executable action grammar, while precision-aligned reinforcement learning mitigates rollout-induced exposure bias through state-conditioned geometric feedback. Experiments reveal a pronounced Semantic-Execution Gap: general multimodal models can exceed 88% action-type accuracy yet remain below 6% task success. PAGER closes this gap, delivering 4.1x higher task success than the strongest evaluated general baseline and raising step success rate from below 9% for GUI-specialized agents to over 62%, establishing a new state of the art for point-precise GUI control.
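The contrast between region-tolerant and point-precise actions, and the way a small coordinate error can cascade through dependent objects, can be illustrated with a minimal sketch. This is not PAGER's implementation or the PAGE Bench verifier: the function names, the 1-pixel tolerance, and the circumcenter example are all assumptions chosen for illustration.

```python
import math

def region_hit(click, box):
    """Region-tolerant check: any pixel inside the component's box is valid.

    `box` is (left, top, right, bottom); a hypothetical stand-in for a button.
    """
    x, y = click
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom

def point_hit(click, target, tol=1.0):
    """Point-precise check: the click must lie within `tol` pixels of the target.

    The 1-pixel default is an assumed tolerance, not a value from the paper.
    """
    return math.dist(click, target) <= tol

def circumcenter(a, b, c):
    """Circumcenter of triangle abc: a downstream object that depends on a, b, c."""
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax**2 + ay**2) * (by - cy)
          + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx)
          + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    return (ux, uy)

if __name__ == "__main__":
    # A click near the corner of a button still counts under region tolerance.
    print(region_hit((105, 128), (100, 100, 140, 130)))
    # A click 5 pixels from the intended canvas point fails the point check.
    print(point_hit((304, 203), (300, 200)))
    # The same kind of small error on one vertex shifts the dependent circumcenter.
    print(circumcenter((0, 0), (100, 0), (50, 5)))
    print(circumcenter((0, 0), (100, 0), (50, 8)))
```

In this toy case, moving the third vertex by 3 pixels, from (50, 5) to (50, 8), moves the circumcenter from (50.0, -247.5) to (50.0, -152.25). A local error that a region-tolerant check would never notice is amplified to roughly 95 pixels in the dependent object, which is the kind of dependency-driven propagation the abstract describes.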