One Forward Pass Beats Two: InnerZoom for Accurate and Efficient GUI Grounding

Abstract

MLLM-based GUI grounding methods commonly formulate target localization as autoregressive coordinate generation, enabling models to leverage the strong instruction-following and semantic understanding capabilities of MLLMs. However, this formulation requires the model to retain region-level target evidence while decoding coordinate tokens with the spatial precision demanded by GUI clicking. Our diagnostic analysis reveals that target-region awareness emerges in intermediate decoder layers but is neither retained nor translated into the final coordinate prediction. Existing ZoomIn-style methods address this issue through an external crop-and-rerun pass, which improves localization but increases end-to-end latency and computational cost. To retain the accuracy benefits of two-pass zooming without this extra cost, we propose InnerZoom, a single-forward framework for cross-layer evidence bridging. InnerZoom transforms target-related cues from the original forward pass into a compact cross-layer evidence state, then preserves, refines, and reinjects this state throughout later decoding layers to guide coordinate prediction. Extensive experimental results suggest that InnerZoom-4B achieves state-of-the-art performance on all six GUI grounding benchmarks, including 64.7 on OSWorld-G, 40.2 on UI-Vision, 73.1 on OSWorld-GR, and 87.6 on MMBench-GUI, surpassing the previous best results on these four benchmarks by 4.1, 3.2, 2.9, and 2.3 points, respectively. Under a controlled 4B setting, InnerZoom improves the same SFT+RL baseline by 5.3 points on average and outperforms two-pass ZoomIn by 1.3 points on average, while reducing end-to-end latency by up to 31.8% and TFLOPs by about 29%. Code and models will be made publicly available.
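
The abstract gives no implementation details, so the following is only a toy sketch of what "cross-layer evidence bridging" could look like in a single forward pass. It is not the authors' method. All function names (`pool_evidence`, `refine`, `reinject`, `forward_with_bridge`), the attention-pooling step, the fixed blending weight `keep`, the additive `gate`, and the `tanh` stand-in for a decoder layer are assumptions made purely for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def pool_evidence(hidden, query):
    """Hypothetical: attention-pool token states into one compact evidence vector."""
    d = len(query)
    scores = [sum(h[i] * query[i] for i in range(d)) / math.sqrt(d) for h in hidden]
    weights = softmax(scores)
    evidence = [sum(w * h[i] for w, h in zip(weights, hidden)) for i in range(d)]
    return evidence, weights

def refine(evidence, hidden, query, keep=0.5):
    """Hypothetical: blend the carried evidence with evidence pooled at this layer."""
    fresh, _ = pool_evidence(hidden, query)
    return [keep * e + (1.0 - keep) * f for e, f in zip(evidence, fresh)]

def reinject(hidden, evidence, gate=0.1):
    """Hypothetical: add the gated evidence vector to every token state."""
    return [[x + gate * e for x, e in zip(h, evidence)] for h in hidden]

def toy_layer(hidden):
    # Stand-in for a real decoder layer (assumption): elementwise tanh.
    return [[math.tanh(x) for x in h] for h in hidden]

def forward_with_bridge(hidden, query, num_layers=4, bridge_from=2):
    """One pass over the layers; evidence is formed once, then carried forward."""
    evidence = None
    for layer in range(num_layers):
        hidden = toy_layer(hidden)
        if layer == bridge_from:
            # Form the evidence state at an intermediate layer.
            evidence, _ = pool_evidence(hidden, query)
        elif layer > bridge_from:
            # Preserve and refine the state, then reinject it into later layers.
            evidence = refine(evidence, hidden, query)
            hidden = reinject(hidden, evidence)
    return hidden, evidence

# Three toy "token" states; the second one aligns best with the query.
tokens = [[0.0, 0.0], [2.0, 2.0], [0.1, -0.1]]
instruction_query = [1.0, 1.0]
final_hidden, final_evidence = forward_with_bridge(tokens, instruction_query)
```

The point of the sketch is the control flow: the input is processed once, and the evidence state travels alongside the hidden states rather than being recovered by a second crop-and-rerun pass.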