PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

Abstract

Large vision-language models have significantly advanced GUI agents, enabling executable interaction across web, mobile, and desktop interfaces. Yet these gains largely rely on a forgiving region-tolerant paradigm, where many nearby pixels inside the same component remain valid. Precise geometric construction breaks this assumption: actions must land on points in continuous canvas space rather than tolerant regions. Because geometric primitives carry ontological dependencies, a local coordinate error can induce cascading topological failures that distort downstream objects and invalidate the final construction. We identify this regime as precision-sensitive GUI tasks, requiring point-level accuracy, geometry-aware verification, and robustness to dependency-driven error propagation. To benchmark it, we introduce PAGE Bench, with 4,906 problems and over 224K process-supervised, pixel-level GUI actions. We further propose PAGER, a topology-aware agent that decomposes construction into dependency-structured planning and pixel-level execution. Pixel-grounded supervised tuning establishes an executable action grammar, while precision-aligned reinforcement learning mitigates rollout-induced exposure bias through state-conditioned geometric feedback. Experiments reveal a pronounced Semantic-Execution Gap: general multimodal models can exceed 88% action-type accuracy yet remain below 6% task success. PAGER closes this gap, delivering 4.1x higher task success than the strongest evaluated general baseline and raising step success rate from below 9% for GUI-specialized agents to over 62%, establishing a new state of the art for point-precise GUI control.
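The contrast between region-tolerant and point-precise actions, and the way a small coordinate error can propagate through dependent geometry, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from PAGE Bench or PAGER: the `Box` type, the `region_hit` and `point_hit` checks, the 2-pixel tolerance, and the circumcenter example standing in for a dependent construction.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a GUI component, in pixels (hypothetical)."""
    left: float
    top: float
    right: float
    bottom: float


def region_hit(click, box):
    """Region-tolerant check: any pixel inside the component counts as valid."""
    x, y = click
    return box.left <= x <= box.right and box.top <= y <= box.bottom


def point_hit(click, target, tol_px=2.0):
    """Point-precise check: the click must land within tol_px of the target.

    The 2-pixel default is an illustrative assumption, not a benchmark value.
    """
    return math.dist(click, target) <= tol_px


def circumcenter(a, b, c):
    """Center of the circle through three non-collinear points.

    Used here as a stand-in for an object that depends on earlier clicks.
    """
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    sa, sb, sc = ax**2 + ay**2, bx**2 + by**2, cx**2 + cy**2
    ux = (sa * (by - cy) + sb * (cy - ay) + sc * (ay - by)) / d
    uy = (sa * (cx - bx) + sb * (ax - cx) + sc * (bx - ax)) / d
    return (ux, uy)


if __name__ == "__main__":
    button = Box(0, 0, 100, 40)
    click = (65, 20)  # 15 px away from the button center
    print("region-tolerant:", region_hit(click, button))
    print("point-precise:  ", point_hit(click, (50, 20)))

    # A 3 px error on one vertex moves the dependent circumcenter much further.
    exact = circumcenter((0, 0), (100, 0), (50, 10))
    noisy = circumcenter((0, 0), (100, 0), (50, 7))
    print("circumcenter shift (px):", math.dist(exact, noisy))
```

In this toy setup the same click passes the region check and fails the point check, and a 3-pixel error on one vertex of a flat triangle shifts the derived center by tens of pixels. That amplification is the kind of dependency-driven error propagation the abstract describes, although the paper's actual verification procedure may differ.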