OmniParser for Pure Vision Based GUI Agent

Abstract

The recent success of large vision-language models shows great potential for driving agent systems that operate on user interfaces. However, we argue that the power of multimodal models like GPT-4V as general agents across multiple operating systems and applications is largely underestimated, due to the lack of a robust screen parsing technique capable of: 1) reliably identifying interactable icons within the user interface, and 2) understanding the semantics of various elements in a screenshot and accurately associating the intended action with the corresponding region on the screen. To fill these gaps, we introduce OmniParser, a comprehensive method for parsing user interface screenshots into structured elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface. We first curated an interactable icon detection dataset using popular webpages and an icon description dataset. These datasets were used to fine-tune specialized models: a detection model to parse interactable regions on the screen and a caption model to extract the functional semantics of the detected elements. OmniParser significantly improves GPT-4V's performance on the ScreenSpot benchmark. On the Mind2Web and AITW benchmarks, OmniParser with screenshot-only input outperforms the GPT-4V baselines that require additional information beyond the screenshot.
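To make "parsing a screenshot into structured elements" concrete, here is a minimal Python sketch of how detection boxes and caption text might be merged into a numbered element list for a multimodal model. The abstract does not specify OmniParser's actual output format, so the names `ParsedElement`, `build_elements`, `elements_to_prompt`, and `click_point` are hypothetical, and the boxes and descriptions are assumed to come from the detection and caption models respectively.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
BBox = Tuple[float, float, float, float]


@dataclass
class ParsedElement:
    """One interactable region plus its functional description (hypothetical schema)."""
    element_id: int
    bbox: BBox
    description: str


def build_elements(boxes: List[BBox], descriptions: List[str]) -> List[ParsedElement]:
    """Pair each detected box with the caption produced for it."""
    if len(boxes) != len(descriptions):
        raise ValueError("each detected box needs exactly one description")
    return [
        ParsedElement(i, box, text)
        for i, (box, text) in enumerate(zip(boxes, descriptions))
    ]


def elements_to_prompt(elements: List[ParsedElement]) -> str:
    """Render the elements as numbered lines a model can refer to by ID."""
    return "\n".join(f"[{e.element_id}] {e.description}" for e in elements)


def click_point(elements: List[ParsedElement], element_id: int) -> Tuple[float, float]:
    """Ground a chosen element ID back to the center of its screen region."""
    for e in elements:
        if e.element_id == element_id:
            x0, y0, x1, y1 = e.bbox
            return ((x0 + x1) / 2, (y0 + y1) / 2)
    raise KeyError(element_id)
```

Under these assumptions, the model only has to name an element ID, and the ID is mapped back to screen coordinates afterwards, which is one way an action could be grounded in the corresponding region of the interface.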
