Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding

Abstract

Graphical user interface (GUI) grounding requires vision-language models (VLMs) to identify small target elements in high-resolution screenshots and predict precise screen coordinates. On-policy self-distillation (OPSD) is a promising post-training approach for this coordinate-sensitive task because it provides dense token-level teacher signals beyond hard coordinate labels. However, naive OPSD is not well suited to GUI grounding: because OPSD evaluates the teacher on student-generated prefixes, the coordinate-token teacher signal can become unreliable once the prefix has already deviated from the target coordinate. To mitigate this, we propose quality-aware self-distillation for VLM-based GUI grounding, which improves coordinate-token teacher-signal quality through soft correctness-aware gating and teacher-probability scaling. The soft correctness-aware gate checks whether the teacher's current coordinate-token prediction can still be completed into the ground-truth box under the student-generated prefix. If not, the corresponding teacher signal is down-weighted. Teacher-probability scaling then uses the teacher's confidence as a lightweight factor to further calibrate the strength of the gated supervision. A key empirical finding is that neither component alone improves overall performance, whereas combining the two improves it consistently. This suggests that the two mechanisms play complementary roles: correctness-aware gating suppresses unreliable coordinate-token supervision, while teacher-probability scaling calibrates the strength of the remaining signals. Experiments across six GUI grounding benchmarks show that our method consistently improves the base model and outperforms strong baselines.
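
The two weighting steps described above can be illustrated with a minimal sketch. It is not the method's actual implementation: the abstract does not specify the coordinate tokenization, the gate's functional form, or how the weight enters the loss. The sketch assumes that each coordinate is emitted as a fixed-width, zero-padded string of digit tokens, that the gate is a hard feasibility check softened by a hypothetical floor value `soft_floor`, and that the final per-token weight is the product of the gate and the teacher's probability for its predicted token. The names `can_complete` and `token_weight` are illustrative only.

```python
def can_complete(prefix: str, n_digits: int, lo: int, hi: int) -> bool:
    """Return True if some n_digits-wide number starting with `prefix`
    falls inside the ground-truth interval [lo, hi] for this coordinate.

    Assumption: coordinates are zero-padded digit strings of fixed width.
    """
    remaining = n_digits - len(prefix)
    if remaining < 0:
        return False
    base = (int(prefix) if prefix else 0) * 10 ** remaining
    smallest = base                          # prefix followed by all zeros
    largest = base + 10 ** remaining - 1     # prefix followed by all nines
    return smallest <= hi and largest >= lo


def token_weight(
    student_prefix: str,
    teacher_token: str,
    teacher_prob: float,
    n_digits: int,
    lo: int,
    hi: int,
    soft_floor: float = 0.1,  # hypothetical value, not taken from the paper
) -> float:
    """Weight for one coordinate token's distillation term.

    Gate: 1.0 if the student prefix extended by the teacher's predicted
    digit can still reach the ground-truth interval, else `soft_floor`.
    Scaling: multiply the gate by the teacher's probability for that digit.
    """
    feasible = can_complete(student_prefix + teacher_token, n_digits, lo, hi)
    gate = 1.0 if feasible else soft_floor
    return gate * teacher_prob


# Ground-truth x-range is [312, 340], written with 4 digits.
on_track = token_weight("03", "2", 0.8, n_digits=4, lo=312, hi=340)
off_track = token_weight("05", "1", 0.8, n_digits=4, lo=312, hi=340)
```

In this toy case the prefix `03` followed by the teacher's digit `2` can still land inside the box, so the teacher signal keeps its full confidence-scaled weight. The prefix `05` has already left the box, so the same teacher confidence is down-weighted by the floor. In a training loop, such a weight would presumably multiply the per-token distillation loss on coordinate tokens.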