Phi-Ground Tech Report: Advancing Perception in GUI Grounding
July 31, 2025
Authors: Miaosen Zhang, Ziqiang Xu, Jialiang Zhu, Qi Dai, Kai Qiu, Yifan Yang, Chong Luo, Tianyi Chen, Justin Wagle, Tim Franklin, Baining Guo
cs.AI
Abstract
With the development of multimodal reasoning models, Computer Use Agents
(CUAs), akin to Jarvis from "Iron Man", are becoming a reality. GUI
grounding is a core component for CUAs to execute actual actions, similar to
mechanical control in robotics, and it directly leads to the success or failure
of the system. It determines actions such as clicking and typing, as well as
related parameters like the coordinates for clicks. Current end-to-end
grounding models still achieve less than 65% accuracy on challenging
benchmarks like ScreenSpot-pro and UI-Vision, indicating they are far from
being ready for deployment. In this work, we conduct an empirical study on the training of
grounding models, examining details from data collection to model training.
Ultimately, we developed the Phi-Ground model family, which achieves
state-of-the-art performance across all five grounding benchmarks for models
under 10B parameters in agent settings. In the end-to-end model setting, our
model still achieves SOTA results with scores of 43.2 on
ScreenSpot-pro and 27.2 on UI-Vision. We believe that the
various details discussed in this paper, along with our successes and failures,
not only clarify the construction of grounding models but also benefit other
perception tasks. Project homepage:
https://zhangmiaosen2000.github.io/Phi-Ground/
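As context for the accuracy numbers quoted above: GUI-grounding benchmarks such as ScreenSpot commonly score a prediction as correct when the model's predicted click point falls inside the ground-truth bounding box of the target UI element. The sketch below illustrates this convention with a hypothetical action schema; the names and fields are our own illustration, not the Phi-Ground API.

```python
# Minimal sketch of a grounded UI action and the point-in-box scoring rule
# used by benchmarks like ScreenSpot. Schema and names are illustrative only.
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class GroundedAction:
    """A hypothetical grounded action: an action type plus its parameters."""
    action: str                 # e.g. "click" or "type"
    x: float                    # click coordinate, normalized to [0, 1]
    y: float
    text: Optional[str] = None  # payload for "type" actions


def is_correct(pred: GroundedAction,
               bbox: Tuple[float, float, float, float]) -> bool:
    """Score a click: correct if the predicted point lies inside the
    ground-truth element box, given as (left, top, right, bottom)."""
    left, top, right, bottom = bbox
    return left <= pred.x <= right and top <= pred.y <= bottom


# Example: a click landing inside a button's box is scored as correct.
pred = GroundedAction(action="click", x=0.42, y=0.31)
print(is_correct(pred, (0.40, 0.28, 0.48, 0.34)))  # True
```

Under this rule, "65% accuracy" means roughly one in three predicted clicks misses the target element entirely, which is why the abstract argues such models are not yet deployable.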