Test-Time Reinforcement Learning for GUI Grounding via Region Consistency

Abstract

Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to precise screen coordinates, is fundamental to autonomous GUI agents. Existing methods achieve strong performance through extensive supervised training or reinforcement learning with labeled rewards, but they remain constrained by the cost and availability of pixel-level annotations. We observe that when a model generates multiple predictions for the same GUI element, the spatial overlap among those predictions provides an implicit confidence signal that can guide more accurate localization. Leveraging this insight, we propose GUI-RC (Region Consistency), a test-time scaling method that constructs a spatial voting grid from multiple sampled predictions and identifies the consensus region where the model's predictions agree most. Without any training, GUI-RC improves accuracy by 2-3% across various architectures on ScreenSpot benchmarks. We further introduce GUI-RCPO (Region Consistency Policy Optimization), which transforms these consistency patterns into rewards for test-time reinforcement learning. By computing how well each prediction aligns with the collective consensus, GUI-RCPO enables models to iteratively refine their outputs on unlabeled data during inference. Extensive experiments demonstrate the generality of our approach: GUI-RC boosts Qwen2.5-VL-3B-Instruct from 80.11% to 83.57% on ScreenSpot-v2, and GUI-RCPO further improves it to 85.14% through self-supervised optimization. Our approach reveals the untapped potential of test-time scaling and test-time reinforcement learning for GUI grounding, offering a promising path toward more robust and data-efficient GUI agents.
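The voting idea described in the abstract can be illustrated with a small sketch. The abstract does not specify the grid resolution, how the consensus region is extracted, or the exact reward formula, so everything below is an assumption made for illustration: boxes are integer pixel rectangles, the consensus region is taken as the bounding box of the highest-vote cells, and the reward is the mean vote density inside a prediction divided by the number of samples. The names build_vote_grid, consensus_box, and consistency_rewards are hypothetical and do not come from the paper.

```python
from typing import List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels, with x2 and y2 exclusive.
Box = Tuple[int, int, int, int]


def build_vote_grid(boxes: List[Box], width: int, height: int) -> List[List[int]]:
    """Give every pixel one vote for each sampled box that covers it."""
    grid = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        for y in range(max(y1, 0), min(y2, height)):
            row = grid[y]
            for x in range(max(x1, 0), min(x2, width)):
                row[x] += 1
    return grid


def consensus_box(grid: List[List[int]]) -> Box:
    """Assumed rule: bounding box of all cells that hold the maximum vote count."""
    top = max(max(row) for row in grid)
    xs = [x for row in grid for x, v in enumerate(row) if v == top]
    ys = [y for y, row in enumerate(grid) if top in row]
    return min(xs), min(ys), max(xs) + 1, max(ys) + 1


def consistency_rewards(boxes: List[Box], grid: List[List[int]]) -> List[float]:
    """Assumed reward: mean vote count inside each box, scaled to [0, 1]."""
    n = len(boxes)
    rewards = []
    for x1, y1, x2, y2 in boxes:
        cells = [v for row in grid[max(y1, 0):y2] for v in row[max(x1, 0):x2]]
        rewards.append(sum(cells) / (len(cells) * n) if cells else 0.0)
    return rewards


# Three overlapping samples and one outlier on a 300x300 screen.
samples = [(10, 10, 50, 50), (20, 20, 60, 60), (30, 30, 70, 70), (200, 200, 220, 220)]
grid = build_vote_grid(samples, width=300, height=300)
region = consensus_box(grid)
rewards = consistency_rewards(samples, grid)
```

In this toy case the consensus region is the area shared by the three overlapping samples, and the outlier box receives the lowest reward, which is the kind of label-free signal a policy-optimization step could consume. A real implementation would likely need to handle disconnected maximum-vote areas and point-style predictions, which this sketch ignores.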
