Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding

Abstract

Graphical user interface (GUI) grounding requires vision-language models (VLMs) to identify small target elements in high-resolution screenshots and predict precise screen coordinates. On-policy self-distillation (OPSD) is a promising post-training approach for this coordinate-sensitive task, since it provides dense token-level teacher signals beyond hard coordinate labels. However, naive OPSD is not well suited to GUI grounding: because OPSD evaluates the teacher on student-generated prefixes, the coordinate-token teacher signal can become unreliable once the prefix has already deviated from the target coordinate. To mitigate this, we propose quality-aware self-distillation for VLM-based GUI grounding, which improves coordinate-token teacher-signal quality through soft correctness-aware gating and teacher-probability scaling. The soft correctness-aware gate checks whether the teacher's current coordinate-token prediction can still be completed into the ground-truth box under the student-generated prefix. If not, the corresponding teacher signal is down-weighted. Teacher-probability scaling then uses the teacher's confidence as a lightweight factor to further calibrate the strength of the gated supervision. A key empirical finding is that neither component alone improves overall performance, whereas combining them yields consistent gains. This suggests that the two mechanisms play complementary roles: correctness-aware gating suppresses unreliable coordinate-token supervision, while teacher-probability scaling calibrates the strength of the remaining signals. Experiments across six GUI grounding benchmarks show that our method consistently improves the base model and outperforms strong baselines.
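
The gating and scaling steps described above can be illustrated with a minimal sketch. The abstract does not give the actual formulas, so everything here is an assumption made for illustration: each coordinate is emitted as a fixed-width, zero-padded digit string with one digit per token, the gate drops to a floor value `soft_floor` when the teacher's digit can no longer reach the ground-truth interval, and the per-token weight is the product of the gate and the teacher's probability. The function names and parameters are hypothetical.

```python
def can_reach_interval(prefix_digits: str, next_digit: str,
                       num_digits: int, lo: int, hi: int) -> bool:
    """Check whether prefix_digits + next_digit can still be completed
    into a num_digits-wide integer inside the closed interval [lo, hi].

    Assumption: coordinates are zero-padded, fixed-width digit strings.
    """
    partial = prefix_digits + next_digit
    remaining = num_digits - len(partial)
    if remaining < 0:
        return False
    # Smallest and largest values reachable by filling the remaining digits.
    smallest = int(partial) * 10 ** remaining
    largest = smallest + 10 ** remaining - 1
    # The reachable range must overlap the ground-truth interval.
    return smallest <= hi and largest >= lo


def token_weight(prefix_digits: str, teacher_digit: str, teacher_prob: float,
                 num_digits: int, lo: int, hi: int,
                 soft_floor: float = 0.1) -> float:
    """Hypothetical per-token distillation weight.

    gate  : 1.0 if the teacher's digit keeps the box reachable under the
            student-generated prefix, else soft_floor (down-weighted, not zeroed).
    scale : the teacher's probability for its predicted digit.
    """
    reachable = can_reach_interval(prefix_digits, teacher_digit,
                                   num_digits, lo, hi)
    gate = 1.0 if reachable else soft_floor
    return gate * teacher_prob


if __name__ == "__main__":
    # Ground-truth box spans x in [120, 180]; coordinates use 4 digits.
    # Student prefix "01" is still on track; teacher proposes "5" (015x).
    print(token_weight("01", "5", 0.8, 4, 120, 180))
    # Student prefix "02" has already left the box (02xx >= 200).
    print(token_weight("02", "5", 0.8, 4, 120, 180))
```

In this sketch the weight would multiply a per-token distillation loss on the coordinate tokens. A floor above zero is what makes the gate soft, in the sense that off-target teacher signals are attenuated and not discarded.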