PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

Abstract

Large vision-language models have significantly advanced GUI agents, enabling executable interaction across web, mobile, and desktop interfaces. Yet these gains largely rely on a forgiving region-tolerant paradigm, in which many nearby pixels inside the same component remain valid. Precise geometric construction breaks this assumption: actions must land on specific points in continuous canvas space rather than anywhere within a tolerant region. Because geometric primitives carry ontological dependencies, a local coordinate error can induce cascading topological failures that distort downstream objects and invalidate the final construction. We identify this regime as precision-sensitive GUI tasks, which require point-level accuracy, geometry-aware verification, and robustness to dependency-driven error propagation. To benchmark it, we introduce PAGE Bench, with 4,906 problems and over 224K process-supervised, pixel-level GUI actions. We further propose PAGER, a topology-aware agent that decomposes construction into dependency-structured planning and pixel-level execution. Pixel-grounded supervised tuning establishes an executable action grammar, while precision-aligned reinforcement learning mitigates rollout-induced exposure bias through state-conditioned geometric feedback. Experiments reveal a pronounced Semantic-Execution Gap: general multimodal models can exceed 88% action-type accuracy yet remain below 6% task success. PAGER closes this gap, delivering 4.1x higher task success than the strongest evaluated general baseline and raising the step success rate from below 9% for GUI-specialized agents to over 62%, establishing a new state of the art for point-precise GUI control.
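
The contrast between region-tolerant and point-precise actions, and the way one coordinate error propagates through dependent primitives, can be illustrated with a minimal sketch. This is not the PAGER implementation or the PAGE Bench checker; the function names, the one-pixel tolerance, and all coordinates are hypothetical values chosen for illustration.

```python
import math

def region_hit(click, box):
    """Region-tolerant check: any pixel inside the component's box is valid."""
    x, y = click
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom

def point_hit(click, target, tol=1.0):
    """Point-precise check: the click must lie within tol pixels of the target.
    The tolerance of 1.0 is an assumed value, not one taken from the paper."""
    return math.dist(click, target) <= tol

def midpoint(p, q):
    """Midpoint of segment pq; it depends on both endpoints."""
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)

# A button-style click: anywhere inside the (hypothetical) box succeeds.
button_box = (100, 100, 180, 140)
button_ok = region_hit((120, 110), button_box)

# A construction-style click: point B is placed 5 pixels off its target.
a = (0.0, 0.0)
b_target = (100.0, 0.0)
b_clicked = (104.0, 3.0)
b_ok = point_hit(b_clicked, b_target)

# Dependent objects inherit the error: the midpoint M of AB moves, and so does
# the radius of a circle centered at M that passes through A.
m_intended = midpoint(a, b_target)
m_actual = midpoint(a, b_clicked)
radius_intended = math.dist(m_intended, a)
radius_actual = math.dist(m_actual, a)

print("button click valid:", button_ok)
print("point B valid:", b_ok)
print("midpoint shift:", math.dist(m_intended, m_actual))
print("radius change:", radius_actual - radius_intended)
```

In this toy setup the button click passes while the construction click fails, and every object built on top of B is displaced even though no later action was itself mistaken.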