
Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding

June 16, 2026
Authors: Jingyuan Huang, Zuming Huang, Yucheng Shi, Tianze Yang, Xiaoming Zhai, Wei Chu, Ninghao Liu
cs.AI

Abstract

Graphical user interface (GUI) grounding requires vision-language models (VLMs) to identify small target elements in high-resolution screenshots and predict precise screen coordinates. On-policy self-distillation (OPSD) is a promising post-training approach for this coordinate-sensitive task because it provides dense token-level teacher signals beyond hard coordinate labels. However, naive OPSD is not well suited to GUI grounding: because OPSD evaluates the teacher on student-generated prefixes, the coordinate-token teacher signals can become unreliable once the prefix has already deviated from the target coordinate. To mitigate this, we propose quality-aware self-distillation for VLM-based GUI grounding, which improves coordinate-token teacher-signal quality through soft correctness-aware gating and teacher-probability scaling. The soft correctness-aware gate checks whether the teacher's current coordinate-token prediction can still be completed into the ground-truth box under the student-generated prefix. If it cannot, the corresponding teacher signal is down-weighted. Teacher-probability scaling then uses the teacher's confidence as a lightweight factor to further calibrate the strength of the gated supervision. A key empirical finding is that neither component alone improves overall performance, whereas combining them does so consistently. This suggests that the two mechanisms play complementary roles: correctness-aware gating suppresses unreliable coordinate-token supervision, while teacher-probability scaling calibrates the strength of the remaining signals. Experiments across six GUI grounding benchmarks show that our method consistently improves the base model and outperforms strong baselines.
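
The abstract gives no formulas, so the following is only a minimal sketch of one possible reading of the two mechanisms, not the authors' implementation. It assumes, hypothetically, that each coordinate is emitted as a fixed-width, zero-padded string of digit tokens, that the gate takes a small constant value (`soft_gate`) when the teacher's digit can no longer reach the ground-truth range, and that the final per-token weight is the product of the gate and the teacher's probability. The names `can_complete`, `token_weight`, `weighted_distill_loss`, `soft_gate`, and `width` are all illustrative.

```python
from typing import List


def can_complete(prefix: str, next_digit: str, lo: int, hi: int, width: int = 4) -> bool:
    """Check whether prefix + next_digit can still be extended to a
    `width`-digit coordinate inside the ground-truth range [lo, hi].

    Assumption: coordinates are zero-padded digit strings of fixed width.
    """
    partial = prefix + next_digit
    remaining = width - len(partial)
    if remaining < 0:
        return False
    # All completions form a contiguous integer interval, so an overlap
    # test against [lo, hi] is sufficient.
    smallest = int(partial + "0" * remaining)
    largest = int(partial + "9" * remaining)
    return smallest <= hi and largest >= lo


def token_weight(completable: bool, teacher_prob: float, soft_gate: float = 0.1) -> float:
    """Soft gate (1.0 if completable, else soft_gate) times teacher confidence.

    Assumption: the product form and the value 0.1 are illustrative only.
    """
    gate = 1.0 if completable else soft_gate
    return gate * teacher_prob


def weighted_distill_loss(per_token_div: List[float], weights: List[float]) -> float:
    """Average of per-token teacher-student divergences, each scaled by its weight."""
    if not per_token_div:
        return 0.0
    return sum(w * d for w, d in zip(weights, per_token_div)) / len(per_token_div)


if __name__ == "__main__":
    # Student prefix "05" for an x-coordinate whose ground-truth box spans [500, 530].
    ok = can_complete("05", "1", lo=500, hi=530)   # 0510..0519 overlaps the box
    bad = can_complete("05", "6", lo=500, hi=530)  # 0560..0569 misses the box
    weights = [token_weight(ok, 0.8), token_weight(bad, 0.8)]
    print(weights, weighted_distill_loss([1.0, 2.0], weights))
```

In this sketch the gate alone decides whether a teacher signal is trusted, and the probability factor only rescales it, which mirrors the complementary roles described in the abstract; the actual functional forms used in the paper may differ.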