VISTA: View-Consistent Self-Verified Training for GUI Grounding

Abstract

When Group Relative Policy Optimization (GRPO) is applied to GUI grounding, rollouts are sampled from a single screenshot view. Groups then often collapse to all failures on difficult instances or all successes on easy ones, yielding no useful relative advantage. We propose VISTA (View-Consistent Self-Verified Training), a GRPO-based training framework that constructs each comparison group from multiple target-preserving views of the same GUI instance. Each view is generated by a crop that keeps the target element visible and remaps its bounding box exactly, so model rollouts are compared across semantically equivalent but geometrically different inputs. To stabilize short coordinate generation without turning reinforcement learning into unconditional imitation, VISTA further adds a self-verified cross-view anchor: an oracle answer optimized with an advantage-weighted loss, excluded from the group baseline, and activated only when the model has produced a maximum-reward rollout. Across five GUI-grounding benchmarks and multiple Qwen backbones, VISTA consistently improves grounding accuracy. On ScreenSpot-Pro, it raises the accuracy of Qwen3-VL 4B/8B/30B-A3B from 55.5/52.7/53.7 to 63.4/65.8/67.0, respectively. Robustness analyses further show higher worst-view accuracy and lower prediction flip rates.
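The three mechanisms named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the crop sampler, the reward, or the loss, so everything here is an assumption made for illustration. The function names, the `min_scale` parameter, the binary point-in-box reward, the mean/standard-deviation normalization (the usual GRPO form), and the reading of "maximum-reward rollout" as "a rollout that reaches the highest attainable reward" are all hypothetical.

```python
import random
from statistics import mean, pstdev


def sample_target_preserving_crop(img_w, img_h, box, rng, min_scale=0.5):
    """Sample a crop that fully contains `box`; return (crop, remapped_box).

    Boxes and crops are integer (x0, y0, x1, y1) pixel tuples in the
    original screenshot. `min_scale` is an assumed lower bound on crop size.
    """
    x0, y0, x1, y1 = box
    # Crop size: at least as large as the target, at most the full image.
    cw = rng.randint(max(x1 - x0, int(img_w * min_scale)), img_w)
    ch = rng.randint(max(y1 - y0, int(img_h * min_scale)), img_h)
    # Crop origin: constrained so the whole target stays inside the window.
    cx = rng.randint(max(0, x1 - cw), min(x0, img_w - cw))
    cy = rng.randint(max(0, y1 - ch), min(y0, img_h - ch))
    crop = (cx, cy, cx + cw, cy + ch)
    # Exact remap: a pure translation into the crop's coordinate frame.
    remapped = (x0 - cx, y0 - cy, x1 - cx, y1 - cy)
    return crop, remapped


def point_in_box_reward(point, box):
    """Assumed binary reward: 1.0 if the predicted click lies in the box."""
    px, py = point
    x0, y0, x1, y1 = box
    return 1.0 if (x0 <= px <= x1 and y0 <= py <= y1) else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style normalization of rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def anchor_is_active(rewards, max_reward=1.0):
    """Assumed gate: enable the oracle anchor only if some rollout in the
    group already reached the maximum attainable reward."""
    return any(r >= max_reward for r in rewards)


if __name__ == "__main__":
    rng = random.Random(0)
    target = (100, 200, 180, 240)
    crop, remapped = sample_target_preserving_crop(1920, 1080, target, rng)
    print("crop:", crop, "remapped box:", remapped)

    # A single-view group that fails everywhere carries no learning signal.
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
    # A group with mixed outcomes produces non-zero advantages.
    mixed = [1.0, 0.0, 0.0, 1.0]
    print(group_relative_advantages(mixed), anchor_is_active(mixed))
```

The all-failure group yields zero advantages for every rollout, which is the degenerate case the abstract describes. Building the group from several crops of the same instance is intended to make mixed outcomes, and therefore non-zero advantages, more likely.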