
Test-Time Reinforcement Learning for GUI Grounding via Region Consistency

August 7, 2025
Authors: Yong Du, Yuchen Yan, Fei Tang, Zhengxi Lu, Chang Zong, Weiming Lu, Shengpei Jiang, Yongliang Shen
cs.AI

Abstract

Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to precise screen coordinates, is fundamental to autonomous GUI agents. While existing methods achieve strong performance through extensive supervised training or reinforcement learning with labeled rewards, they remain constrained by the cost and availability of pixel-level annotations. We observe that when models generate multiple predictions for the same GUI element, the spatial overlap patterns reveal implicit confidence signals that can guide more accurate localization. Leveraging this insight, we propose GUI-RC (Region Consistency), a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions where models show highest agreement. Without any training, GUI-RC improves accuracy by 2-3% across various architectures on ScreenSpot benchmarks. We further introduce GUI-RCPO (Region Consistency Policy Optimization), which transforms these consistency patterns into rewards for test-time reinforcement learning. By computing how well each prediction aligns with the collective consensus, GUI-RCPO enables models to iteratively refine their outputs on unlabeled data during inference. Extensive experiments demonstrate the generality of our approach: GUI-RC boosts Qwen2.5-VL-3B-Instruct from 80.11% to 83.57% on ScreenSpot-v2, while GUI-RCPO further improves it to 85.14% through self-supervised optimization. Our approach reveals the untapped potential of test-time scaling and test-time reinforcement learning for GUI grounding, offering a promising path toward more robust and data-efficient GUI agents.
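The spatial voting idea behind GUI-RC can be illustrated with a minimal sketch: each sampled bounding-box prediction casts a vote over the pixels it covers, and the cells receiving the most votes form the consensus region. The function name and the choice of returning the region's center are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def consensus_point(boxes, width, height):
    """Hypothetical region-consistency voting: accumulate sampled boxes
    on a grid and return the center of the highest-agreement region."""
    grid = np.zeros((height, width), dtype=np.int32)
    for x1, y1, x2, y2 in boxes:
        grid[y1:y2, x1:x2] += 1          # each prediction votes for its pixels
    peak = grid.max()
    ys, xs = np.nonzero(grid == peak)    # consensus region: max-vote cells
    return int(xs.mean()), int(ys.mean())

# Three noisy samples for the same GUI element on a 64x64 screen
boxes = [(10, 10, 30, 30), (12, 12, 32, 32), (11, 9, 29, 31)]
print(consensus_point(boxes, 64, 64))  # center of the triple-overlap region
```

A consistency reward for GUI-RCPO could then score each individual prediction by its overlap with this consensus region, turning agreement into a self-supervised training signal without any labels.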