VISTA: View-Consistent Self-Verified Training for GUI Grounding

Abstract

When Group Relative Policy Optimization (GRPO) is applied to GUI grounding, rollouts are sampled from a single screenshot view, and a group often consists entirely of failures on hard instances or entirely of successes on easy ones, yielding no useful relative advantage. We propose VISTA (View-Consistent Self-Verified Training), a GRPO-based training framework that constructs each comparison group from multiple target-preserving views of the same GUI instance. Each view is generated by a crop that keeps the target element visible and remaps its bounding box exactly, so model rollouts are compared across semantically equivalent but geometrically different inputs. To stabilize short coordinate generation without turning reinforcement learning into unconditional imitation, VISTA further adds a self-verified cross-view anchor: an oracle answer optimized with an advantage-weighted loss, excluded from the group baseline, and activated only when the model has produced a maximum-reward rollout. Across five GUI grounding benchmarks and multiple Qwen backbones, VISTA consistently improves grounding accuracy. On ScreenSpot-Pro, it raises Qwen3-VL 4B/8B/30B-A3B from 55.5/52.7/53.7 to 63.4/65.8/67.0. Robustness analyses further show higher worst-view accuracy and lower prediction flip rates.
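The mechanics described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `min_scale` parameter, the binary point-in-box reward, and the mean/standard-deviation normalization of advantages are all assumptions chosen for illustration. The sketch shows (1) a crop that keeps the target visible and remaps its box exactly, (2) why a group with identical rewards yields zero relative advantage, and (3) a gate that enables the anchor only when some rollout reached the maximum reward.

```python
import random
from statistics import mean, pstdev

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def target_preserving_crop(width, height, target, rng, min_scale=0.5):
    """Sample a crop window that fully contains `target` (hypothetical helper).

    Returns (crop_box, remapped_target), both as (x1, y1, x2, y2).
    Assumes `target` lies inside the screenshot.
    """
    x1, y1, x2, y2 = target
    # The crop must be at least as large as the target box.
    cw = rng.uniform(max(min_scale * width, x2 - x1), width)
    ch = rng.uniform(max(min_scale * height, y2 - y1), height)
    # Offsets are bounded so the crop stays on screen and covers the target.
    left = rng.uniform(max(0.0, x2 - cw), min(x1, width - cw))
    top = rng.uniform(max(0.0, y2 - ch), min(y1, height - ch))
    crop = (left, top, left + cw, top + ch)
    # Exact remap: a crop is a pure translation of the coordinate origin.
    remapped = (x1 - left, y1 - top, x2 - left, y2 - top)
    return crop, remapped


def click_reward(point, box):
    """Assumed binary reward: 1.0 if the predicted point falls inside the box."""
    px, py = point
    x1, y1, x2, y2 = box
    return 1.0 if x1 <= px <= x2 and y1 <= py <= y2 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group (assumed mean/std normalization)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def anchor_is_active(rewards, max_reward=1.0):
    """Self-verification gate: enable the anchor only if a rollout hit max reward."""
    return any(r >= max_reward for r in rewards)


if __name__ == "__main__":
    rng = random.Random(0)
    target = (100.0, 200.0, 180.0, 240.0)
    for _ in range(3):
        crop, box = target_preserving_crop(1920, 1080, target, rng)
        print("crop:", crop, "remapped target:", box)

    # All-fail group: every advantage is zero, so there is no learning signal.
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
    # Mixed group, as multi-view grouping aims to produce more often.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because each crop is a translation only, the remapped box keeps its width and height; a view pipeline that also resizes the crop would additionally need to scale the coordinates.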