Test-Time Reinforcement Learning for GUI Grounding via Region Consistency

Abstract

Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to precise screen coordinates, is fundamental to autonomous GUI agents. Existing methods achieve strong performance through extensive supervised training or reinforcement learning with labeled rewards, but they remain constrained by the cost and availability of pixel-level annotations. We observe that when a model generates multiple predictions for the same GUI element, the spatial overlap among those predictions provides an implicit confidence signal that can guide more accurate localization. Building on this insight, we propose GUI-RC (Region Consistency), a test-time scaling method that constructs a spatial voting grid from multiple sampled predictions and identifies the consensus region where the predictions agree most. Without any training, GUI-RC improves accuracy by 2-3% across various architectures on ScreenSpot benchmarks. We further introduce GUI-RCPO (Region Consistency Policy Optimization), which turns these consistency patterns into rewards for test-time reinforcement learning. By computing how well each prediction aligns with the collective consensus, GUI-RCPO lets a model iteratively refine its outputs on unlabeled data during inference. Extensive experiments demonstrate the generality of our approach: GUI-RC raises the accuracy of Qwen2.5-VL-3B-Instruct on ScreenSpot-v2 from 80.11% to 83.57%, and GUI-RCPO further improves it to 85.14% through self-supervised optimization. Our approach reveals the untapped potential of test-time scaling and test-time reinforcement learning for GUI grounding, offering a promising path toward more robust and data-efficient GUI agents.
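
The voting and reward ideas described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the grid resolution, how the consensus region is extracted, or the exact reward formula, so everything below is an assumption made for illustration: the function names (`clip_box`, `build_vote_grid`, `consensus_box`, `consistency_reward`) are hypothetical, the grid has one cell per pixel, the consensus region is taken as the bounding box of all maximally voted cells, and the reward is the mean vote inside a predicted box divided by the peak vote.

```python
import numpy as np


def clip_box(box, width, height):
    """Clip an (x1, y1, x2, y2) pixel box to the screen bounds."""
    x1, y1, x2, y2 = box
    return max(0, x1), max(0, y1), min(width, x2), min(height, y2)


def build_vote_grid(boxes, width, height):
    """Add one vote to every pixel covered by each sampled box."""
    grid = np.zeros((height, width), dtype=np.int32)
    for box in boxes:
        x1, y1, x2, y2 = clip_box(box, width, height)
        if x2 > x1 and y2 > y1:  # skip empty or off-screen samples
            grid[y1:y2, x1:x2] += 1
    return grid


def consensus_box(grid):
    """Bounding box of the maximally voted cells, or None if no votes.

    Assumption: a bounding box over all peak cells stands in for the
    consensus region; the paper may extract it differently.
    """
    peak = grid.max()
    if peak == 0:
        return None
    ys, xs = np.nonzero(grid == peak)
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1


def consistency_reward(box, grid):
    """Mean vote inside `box` divided by the peak vote, a value in [0, 1].

    Assumption: this normalization is illustrative, not the paper's formula.
    """
    height, width = grid.shape
    x1, y1, x2, y2 = clip_box(box, width, height)
    peak = grid.max()
    if peak == 0 or x2 <= x1 or y2 <= y1:
        return 0.0
    return float(grid[y1:y2, x1:x2].mean() / peak)


# Four sampled predictions for one instruction: three agree, one is an outlier.
samples = [(10, 10, 50, 30), (12, 12, 52, 32), (11, 9, 49, 31), (70, 40, 90, 55)]
grid = build_vote_grid(samples, width=100, height=60)
region = consensus_box(grid)
rewards = [consistency_reward(b, grid) for b in samples]
```

In this toy example the three overlapping samples produce the consensus region `(12, 12, 49, 30)`, whose center could serve as the final click point, while the outlier receives a reward of 1/3 because it collects only its own vote. Under these assumptions no ground-truth label is needed at any step, which is the property that makes such a signal usable at test time.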
