Phi-Ground Tech Report: Advancing Perception in GUI Grounding

Abstract

With the development of multimodal reasoning models, Computer Use Agents (CUAs), akin to Jarvis from "Iron Man", are becoming a reality. GUI grounding is a core component that lets CUAs execute actual actions and, much like mechanical control in robotics, it directly decides whether the system succeeds or fails. It determines actions such as clicking and typing, as well as related parameters such as click coordinates. Current end-to-end grounding models still achieve less than 65% accuracy on challenging benchmarks such as ScreenSpot-pro and UI-Vision, indicating that they are far from ready for deployment. In this work, we conduct an empirical study on the training of grounding models, examining details from data collection to model training. Ultimately, we developed the Phi-Ground model family, which achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, our model still achieves SOTA results, with scores of 43.2 on ScreenSpot-pro and 27.2 on UI-Vision. We believe that the various details discussed in this paper, along with our successes and failures, not only clarify the construction of grounding models but also benefit other perception tasks. Project homepage: https://zhangmiaosen2000.github.io/Phi-Ground/
