PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

Abstract

Large vision-language models have significantly advanced GUI agents, enabling executable interaction across web, mobile, and desktop interfaces. Yet these gains largely rely on a forgiving region-tolerant paradigm, where many nearby pixels inside the same component remain valid. Precise geometric construction breaks this assumption: actions must land on points in continuous canvas space rather than tolerant regions. Because geometric primitives carry ontological dependencies, a local coordinate error can induce cascading topological failures that distort downstream objects and invalidate the final construction. We identify this regime as precision-sensitive GUI tasks, requiring point-level accuracy, geometry-aware verification, and robustness to dependency-driven error propagation. To benchmark it, we introduce PAGE Bench, with 4,906 problems and over 224K process-supervised, pixel-level GUI actions. We further propose PAGER, a topology-aware agent that decomposes construction into dependency-structured planning and pixel-level execution. Pixel-grounded supervised tuning establishes executable action grammar, while precision-aligned reinforcement learning mitigates rollout-induced exposure bias through state-conditioned geometric feedback. Experiments reveal a pronounced Semantic-Execution Gap: general multimodal models can exceed 88% action type accuracy yet remain below 6% task success. PAGER closes this gap, delivering 4.1x higher task success than the strongest evaluated general baseline and raising step success rate from below 9% for GUI-specialized agents to over 62%, establishing a new state of the art for point-precise GUI control.
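
To make the contrast between region-tolerant and point-precise actions concrete, the following minimal sketch compares the two acceptance rules and shows how one misplaced point shifts an object constructed from it. It is an illustration only, not PAGE Bench's verifier or PAGER's implementation: the function names, the 1-pixel tolerance, and all coordinates are assumptions chosen for the example.

```python
import math

def region_hit(click, box):
    """Region-tolerant rule: any pixel inside the component's box is valid."""
    x, y = click
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom

def point_hit(click, target, tol=1.0):
    """Point-precise rule: the click must land within `tol` pixels of the target.

    The 1-pixel default is an assumed tolerance, not a value from the paper.
    """
    return math.dist(click, target) <= tol

def midpoint(p, q):
    """A dependent primitive: its position is derived from its two parents."""
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)

# Hypothetical button occupying an 80 x 30 pixel box.
button = (100, 100, 180, 130)
click = (152, 118)            # about 12 px away from the box centre (140, 115)

# Hypothetical canvas construction: point A, point B, and their midpoint.
target_a = (140.0, 115.0)     # where A should be placed
placed_a = (152.0, 118.0)     # where the agent actually clicked
point_b = (300.0, 115.0)

ideal_mid = midpoint(target_a, point_b)
actual_mid = midpoint(placed_a, point_b)
mid_error = math.dist(ideal_mid, actual_mid)

print("button click accepted:", region_hit(click, button))
print("canvas click accepted:", point_hit(placed_a, target_a))
print("error inherited by the midpoint (px):", round(mid_error, 3))
```

The same offset that is harmless on the button fails the point-precise rule on the canvas, and the midpoint inherits half of the parent's displacement without any further mistake by the agent. Longer dependency chains can carry such errors into every downstream object, which is the cascading failure the abstract describes.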