OmniParser for Pure Vision Based GUI Agent

Abstract

The recent success of large vision language models shows great potential for driving agent systems that operate on user interfaces. However, we argue that the power of multimodal models like GPT-4V as general agents across multiple operating systems and different applications is largely underestimated, due to the lack of a robust screen parsing technique capable of: 1) reliably identifying interactable icons within the user interface, and 2) understanding the semantics of the various elements in a screenshot and accurately associating the intended action with the corresponding region on the screen. To fill these gaps, we introduce OmniParser, a comprehensive method for parsing user interface screenshots into structured elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface. We first curated an interactable icon detection dataset using popular webpages, along with an icon description dataset. These datasets were used to fine-tune specialized models: a detection model to parse interactable regions on the screen and a caption model to extract the functional semantics of the detected elements. OmniParser significantly improves GPT-4V's performance on the ScreenSpot benchmark. On the Mind2Web and AITW benchmarks, OmniParser with screenshot-only input outperforms the GPT-4V baselines that require additional information beyond the screenshot.
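To make "structured elements" concrete, the following is a minimal sketch of how the outputs of a detection model (bounding boxes) and a caption model (functional descriptions) might be combined, listed as text for a multimodal model, and mapped back to a screen location. It is not OmniParser's actual data format or API: the names `ParsedElement`, `build_elements`, `to_prompt`, and `click_point`, the pixel-coordinate box layout, and the example boxes and descriptions are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedElement:
    """One interactable region of a screenshot (hypothetical schema)."""
    element_id: int
    bbox: tuple[int, int, int, int]  # assumed (x_min, y_min, x_max, y_max) in pixels
    description: str                 # functional description from a caption model


def build_elements(boxes, descriptions):
    """Pair each detected box with its description and assign a numeric ID."""
    if len(boxes) != len(descriptions):
        raise ValueError("each box needs exactly one description")
    return [
        ParsedElement(i, tuple(box), desc)
        for i, (box, desc) in enumerate(zip(boxes, descriptions))
    ]


def to_prompt(elements):
    """Render the elements as a text list a model can refer to by ID."""
    return "\n".join(
        f"[{e.element_id}] {e.description} @ {e.bbox}" for e in elements
    )


def click_point(elements, element_id):
    """Ground a chosen element ID to the center of its bounding box."""
    by_id = {e.element_id: e for e in elements}
    x_min, y_min, x_max, y_max = by_id[element_id].bbox
    return ((x_min + x_max) // 2, (y_min + y_max) // 2)


# Made-up detector and captioner outputs, standing in for real model calls.
boxes = [(10, 20, 50, 60), (100, 200, 140, 220)]
descriptions = ["settings gear icon", "search button"]

elements = build_elements(boxes, descriptions)
prompt_text = to_prompt(elements)
target = click_point(elements, 1)
```

In this sketch the model only has to name an element ID. The pixel coordinates come from the detector's box, which is one plausible way to ground an action in a screen region without asking the model to predict coordinates itself.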
