Test-Time Reinforcement Learning for GUI Grounding via Region Consistency
August 7, 2025
Authors: Yong Du, Yuchen Yan, Fei Tang, Zhengxi Lu, Chang Zong, Weiming Lu, Shengpei Jiang, Yongliang Shen
cs.AI
Abstract
Graphical User Interface (GUI) grounding, the task of mapping natural
language instructions to precise screen coordinates, is fundamental to
autonomous GUI agents. While existing methods achieve strong performance
through extensive supervised training or reinforcement learning with labeled
rewards, they remain constrained by the cost and availability of pixel-level
annotations. We observe that when models generate multiple predictions for the
same GUI element, the spatial overlap patterns reveal implicit confidence
signals that can guide more accurate localization. Leveraging this insight, we
propose GUI-RC (Region Consistency), a test-time scaling method that constructs
spatial voting grids from multiple sampled predictions to identify consensus
regions where models show the highest agreement. Without any training, GUI-RC
improves accuracy by 2-3% across various architectures on ScreenSpot
benchmarks. We further introduce GUI-RCPO (Region Consistency Policy
Optimization), which transforms these consistency patterns into rewards for
test-time reinforcement learning. By computing how well each prediction aligns
with the collective consensus, GUI-RCPO enables models to iteratively refine
their outputs on unlabeled data during inference. Extensive experiments
demonstrate the generality of our approach: GUI-RC boosts
Qwen2.5-VL-3B-Instruct from 80.11% to 83.57% on ScreenSpot-v2, while GUI-RCPO
further improves it to 85.14% through self-supervised optimization. Our
approach reveals the untapped potential of test-time scaling and test-time
reinforcement learning for GUI grounding, offering a promising path toward more
robust and data-efficient GUI agents.
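The core region-consistency idea can be sketched in a few lines. The snippet below is an illustrative approximation only, not the authors' implementation: the grid resolution, tie-breaking, consensus extraction (centroid of max-vote cells), and the reward definition (fraction of a prediction's area inside the consensus region) are all assumptions in the spirit of GUI-RC and GUI-RCPO.

```python
import numpy as np

def vote_grid(boxes, width, height):
    """Accumulate sampled bounding-box predictions into a spatial voting grid."""
    grid = np.zeros((height, width), dtype=np.int32)
    for x1, y1, x2, y2 in boxes:
        grid[y1:y2, x1:x2] += 1  # each sampled prediction votes for the pixels it covers
    return grid

def consensus_point(grid):
    """Center of the consensus region: centroid of the cells with the most votes."""
    ys, xs = np.where(grid == grid.max())
    return int(xs.mean()), int(ys.mean())

def consistency_reward(box, grid):
    """Label-free reward: fraction of a prediction's area lying inside the
    consensus (max-vote) region -- an assumed proxy for the GUI-RCPO reward."""
    x1, y1, x2, y2 = box
    consensus = grid == grid.max()
    area = max((x2 - x1) * (y2 - y1), 1)
    return consensus[y1:y2, x1:x2].sum() / area

# Three noisy sampled predictions for the same GUI element:
boxes = [(10, 10, 30, 30), (12, 12, 32, 32), (11, 9, 31, 29)]
grid = vote_grid(boxes, 100, 100)
print(consensus_point(grid))        # click point from the highest-agreement cells
print(consistency_reward(boxes[0], grid))
```

In a test-time RL loop, rewards like `consistency_reward` would score each sampled prediction against the collective consensus, letting the policy be updated on unlabeled screenshots.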