Phi-Ground Tech Report: Advancing Perception in GUI Grounding

Abstract

With the development of multimodal reasoning models, Computer Use Agents (CUAs), akin to Jarvis from "Iron Man", are becoming a reality. GUI grounding is a core component that lets CUAs execute actual actions, similar to mechanical control in robotics, and it directly determines the success or failure of the system. It specifies actions such as clicking and typing, as well as related parameters such as click coordinates. Current end-to-end grounding models still achieve less than 65% accuracy on challenging benchmarks such as ScreenSpot-pro and UI-Vision, indicating that they are far from ready for deployment. In this work, we conduct an empirical study of grounding-model training, examining details from data collection to model training. Ultimately, we developed the Phi-Ground model family, which achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, our model still achieves SOTA results, with scores of 43.2 on ScreenSpot-pro and 27.2 on UI-Vision. We believe that the various details discussed in this paper, along with our successes and failures, not only clarify the construction of grounding models but also benefit other perception tasks. Project homepage: https://zhangmiaosen2000.github.io/Phi-Ground/

