Phi-Ground Tech Report: Advancing Perception in GUI Grounding

Abstract

With the development of multimodal reasoning models, Computer Use Agents (CUAs), akin to Jarvis from "Iron Man", are becoming a reality. GUI grounding is a core component that lets CUAs execute actual actions, similar to mechanical control in robotics, and it directly determines whether the system succeeds or fails. It determines actions such as clicking and typing, as well as their parameters, such as the coordinates of a click. Current end-to-end grounding models still achieve less than 65% accuracy on challenging benchmarks such as ScreenSpot-pro and UI-Vision, indicating that they are far from ready for deployment. In this work, we conduct an empirical study on the training of grounding models, examining details from data collection to model training. Ultimately, we developed the Phi-Ground model family, which achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, our model still achieves SOTA results, with scores of 43.2 on ScreenSpot-pro and 27.2 on UI-Vision. We believe that the various details discussed in this paper, along with our successes and failures, not only clarify the construction of grounding models but also benefit other perception tasks. Project homepage: https://zhangmiaosen2000.github.io/Phi-Ground/
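The abstract reports accuracy on grounding benchmarks without spelling out the metric. As a rough illustration only, the sketch below assumes the common convention that a predicted click counts as correct when the point falls inside the target element's bounding box. The names `Box` and `click_accuracy` are hypothetical and are not taken from the Phi-Ground code.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned target element in pixel coordinates (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # Points on the border are counted as inside (an assumption).
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def click_accuracy(
    preds: Sequence[Tuple[float, float]],
    targets: Sequence[Box],
) -> float:
    """Fraction of predicted click points that land inside their target box."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(preds, targets))
    return hits / len(preds)


if __name__ == "__main__":
    example_preds = [(10, 10), (50, 50), (0, 0)]
    example_targets = [Box(0, 0, 20, 20), Box(0, 0, 20, 20), Box(5, 5, 10, 10)]
    print(f"accuracy = {click_accuracy(example_preds, example_targets):.3f}")
```

Under this assumed metric, a score of 43.2 would mean that roughly 43 out of every 100 predicted clicks land inside the intended element.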
