VISTA: View-Consistent Self-Verified Training for GUI Grounding

Abstract

When Group Relative Policy Optimization (GRPO) is applied to GUI grounding, rollouts are sampled from a single screenshot view. Groups then often collapse to all failures on difficult instances or all successes on easy ones, so the rewards within a group are identical and yield no useful relative advantage. We propose VISTA (View-Consistent Self-Verified Training), a GRPO-based training framework that constructs each comparison group from multiple target-preserving views of the same GUI instance. Each view is generated by a crop that keeps the target element visible and remaps its bounding box exactly, so model rollouts are compared across semantically equivalent but geometrically different inputs. To stabilize short coordinate generation without turning reinforcement learning into unconditional imitation, VISTA further adds a self-verified cross-view anchor: an oracle answer optimized with an advantage-weighted loss, excluded from the group baseline, and activated only when the model has itself produced a maximum-reward rollout. Across five GUI-grounding benchmarks and multiple Qwen backbones, VISTA consistently improves grounding accuracy. On ScreenSpot-Pro, it raises the accuracy of Qwen3-VL 4B/8B/30B-A3B from 55.5/52.7/53.7 to 63.4/65.8/67.0. Robustness analyses further show higher worst-view accuracy and lower prediction flip rates.
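
The two mechanisms described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`crop_view`, `hit_reward`, `group_advantages`, `anchor_advantage`), the uniform crop sampler, the `margin` parameter, the binary hit/miss reward, and the standard-deviation normalization are all assumptions made for illustration; the abstract specifies only that views are target-preserving crops with exactly remapped boxes, and that the oracle anchor is excluded from the group baseline and gated on the model having produced a maximum-reward rollout.

```python
import random
from statistics import mean, pstdev


def crop_view(box, img_w, img_h, rng, margin=8):
    """Sample a crop window that fully contains `box`.

    Boxes and crops are (x0, y0, x1, y1) in pixels. Returns the crop
    window and the target box re-expressed in crop coordinates.
    The uniform sampler and `margin` are illustrative assumptions.
    """
    x0, y0, x1, y1 = box
    left = rng.randint(0, max(0, x0 - margin))
    top = rng.randint(0, max(0, y0 - margin))
    right = rng.randint(min(img_w, x1 + margin), img_w)
    bottom = rng.randint(min(img_h, y1 + margin), img_h)
    # Exact remapping: a crop is a pure translation of the origin.
    remapped = (x0 - left, y0 - top, x1 - left, y1 - top)
    return (left, top, right, bottom), remapped


def hit_reward(point, box):
    """Assumed binary reward: 1.0 if the predicted point lies in the box."""
    x, y = point
    x0, y0, x1, y1 = box
    return 1.0 if x0 <= x <= x1 and y0 <= y <= y1 else 0.0


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: rewards standardized within the group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def anchor_advantage(rewards, max_reward=1.0, eps=1e-6):
    """Advantage weight for the oracle anchor (hypothetical formulation).

    The baseline uses model rollouts only, so the anchor never shifts it.
    The gate stays closed (None) unless some rollout reached max_reward.
    """
    if max(rewards) < max_reward:
        return None
    mu, sigma = mean(rewards), pstdev(rewards)
    return (max_reward - mu) / (sigma + eps)


if __name__ == "__main__":
    rng = random.Random(0)
    target = (100, 200, 140, 220)
    crop, local_box = crop_view(target, 1920, 1080, rng)
    print("crop window:", crop, "remapped box:", local_box)

    # A single-view group where every rollout misses gives no signal.
    print(group_advantages([0.0, 0.0, 0.0, 0.0]))
    # A mixed group (e.g. pooled across views) gives non-zero advantages.
    mixed = [1.0, 0.0, 0.0, 0.0]
    print(group_advantages(mixed), anchor_advantage(mixed))
```

In this sketch an all-failure group produces zero advantages and leaves the anchor inactive, whereas a group containing at least one successful rollout both produces a learning signal and opens the gate for the anchor term.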