Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding

Abstract

Graphical user interface (GUI) grounding requires vision-language models (VLMs) to identify small target elements in high-resolution screenshots and predict precise screen coordinates. On-policy self-distillation (OPSD) is a promising post-training approach for this coordinate-sensitive task, since it provides dense token-level teacher signals beyond hard coordinate labels. However, naive OPSD is not well suited to GUI grounding: because OPSD evaluates the teacher on student-generated prefixes, the quality of the coordinate-token teacher signal can degrade once the prefix has deviated from the target coordinate, making the supervision unreliable. To mitigate this, we propose quality-aware self-distillation for VLM-based GUI grounding, which improves coordinate-token teacher-signal quality through soft correctness-aware gating and teacher-probability scaling. The soft correctness-aware gate checks whether the teacher's current coordinate-token prediction can still be completed into the ground-truth box under the student-generated prefix. If not, the corresponding teacher signal is down-weighted. Teacher-probability scaling then uses the teacher's confidence as a lightweight factor to further calibrate the strength of the gated supervision. A key empirical finding is that neither component alone improves overall performance, whereas combining them yields consistent gains. This suggests that the two mechanisms play complementary roles: correctness-aware gating suppresses unreliable coordinate-token supervision, while teacher-probability scaling calibrates the strength of the remaining signals. Experiments across six GUI grounding benchmarks show that our method consistently improves the base model and outperforms strong baselines.
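
The abstract describes the two mechanisms only in words, so the following is a minimal illustrative sketch rather than the paper's actual formulation. It makes several assumptions that are not stated in the text: coordinates are emitted as fixed-width decimal digit tokens, the gate takes the value 1 when the teacher's top digit can still reach the ground-truth interval and a hypothetical `soft_floor` otherwise, the per-token weight is the product of the gate and the teacher's top probability, and the distillation term is a forward KL divergence. All function and parameter names are invented for illustration.

```python
import math


def can_complete(prefix_digits, next_digit, lo, hi, width=4):
    """Return True if prefix + next_digit can be extended to a
    `width`-digit integer that falls inside [lo, hi] (assumed encoding)."""
    digits = prefix_digits + next_digit
    remaining = width - len(digits)
    if remaining < 0:
        return False
    smallest = int(digits) * 10 ** remaining      # pad with zeros
    largest = smallest + 10 ** remaining - 1      # pad with nines
    return smallest <= hi and largest >= lo


def token_weight(teacher_probs, prefix_digits, lo, hi, soft_floor=0.1, width=4):
    """Assumed weight: soft gate (1.0 or soft_floor) times the teacher's
    probability for its top digit under the student-generated prefix."""
    top_digit = max(teacher_probs, key=teacher_probs.get)
    feasible = can_complete(prefix_digits, top_digit, lo, hi, width)
    gate = 1.0 if feasible else soft_floor
    return gate * teacher_probs[top_digit]


def kl_divergence(p, q, eps=1e-12):
    """Forward KL(p || q) over a shared token vocabulary given as dicts."""
    return sum(
        pv * math.log((pv + eps) / (q.get(tok, 0.0) + eps))
        for tok, pv in p.items()
        if pv > 0
    )


def weighted_distill_loss(steps, lo, hi, soft_floor=0.1, width=4):
    """Average weighted KL over coordinate-token steps.

    `steps` is a list of (prefix_digits, teacher_probs, student_probs)
    for one coordinate whose ground-truth interval is [lo, hi].
    """
    total = 0.0
    for prefix, t_probs, s_probs in steps:
        w = token_weight(t_probs, prefix, lo, hi, soft_floor, width)
        total += w * kl_divergence(t_probs, s_probs)
    return total / max(len(steps), 1)
```

Under these assumptions, a step whose student prefix has already left the ground-truth interval (for example prefix `"09"` when the box spans 730 to 760) still contributes to the loss, but only at `soft_floor` times the teacher's confidence, which mirrors the "down-weighted rather than discarded" behaviour the abstract attributes to the soft gate.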