CogAgent: A Visual Language Model for GUI Agents

Abstract

People spend an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such as ChatGPT can assist with tasks like writing emails, but they struggle to understand and interact with GUIs, which limits their potential to increase automation levels. In this paper, we introduce CogAgent, an 18-billion-parameter visual language model (VLM) specializing in GUI understanding and navigation. By utilizing both low-resolution and high-resolution image encoders, CogAgent supports input at a resolution of 1120×1120, enabling it to recognize tiny page elements and text. As a generalist visual language model, CogAgent achieves the state of the art on five text-rich and four general VQA benchmarks, including VQAv2, OK-VQA, Text-VQA, ST-VQA, ChartQA, infoVQA, DocVQA, MM-Vet, and POPE. Using only screenshots as input, CogAgent outperforms LLM-based methods that consume extracted HTML text on both PC and Android GUI navigation tasks (Mind2Web and AITW, respectively), advancing the state of the art. The model and code are available at https://github.com/THUDM/CogVLM.
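To give a rough sense of why supporting 1120×1120 input is non-trivial, the sketch below counts how many image patches a ViT-style encoder would have to process at that resolution compared with a smaller input. Only the 1120 figure comes from the abstract. The patch size of 14 pixels and the 224-pixel low-resolution input are assumptions chosen for illustration, and `num_patches` is a hypothetical helper, not part of the released code.

```python
def num_patches(side_px: int, patch_px: int) -> int:
    """Count the non-overlapping square patches that tile a square image."""
    if side_px % patch_px != 0:
        raise ValueError("image side must be divisible by the patch size")
    per_side = side_px // patch_px
    return per_side * per_side


PATCH_PX = 14    # assumption: a common ViT patch size, not stated in the abstract
LOW_RES = 224    # assumption: illustrative low-resolution input side
HIGH_RES = 1120  # input side stated in the abstract

low_tokens = num_patches(LOW_RES, PATCH_PX)
high_tokens = num_patches(HIGH_RES, PATCH_PX)
ratio = high_tokens // low_tokens

print(f"low-res patches:  {low_tokens}")
print(f"high-res patches: {high_tokens}")
print(f"ratio:            {ratio}x")
```

Under these assumptions the patch count grows with the square of the side length, which suggests one reason a design might pair a low-resolution encoder with a separate high-resolution one rather than feeding every high-resolution patch through a single large encoder.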
