CogAgent: A Visual Language Model for GUI Agents

December 14, 2023
作者: Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, Jie Tang
cs.AI

Abstract

People are spending an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such as ChatGPT can assist people in tasks like writing emails, but struggle to understand and interact with GUIs, thus limiting their potential to increase automation levels. In this paper, we introduce CogAgent, an 18-billion-parameter visual language model (VLM) specializing in GUI understanding and navigation. By utilizing both low-resolution and high-resolution image encoders, CogAgent supports input at a resolution of 1120×1120, enabling it to recognize tiny page elements and text. As a generalist visual language model, CogAgent achieves the state of the art on five text-rich and four general VQA benchmarks, including VQAv2, OK-VQA, Text-VQA, ST-VQA, ChartQA, infoVQA, DocVQA, MM-Vet, and POPE. Using only screenshots as input, CogAgent outperforms LLM-based methods that consume extracted HTML text on both PC and Android GUI navigation tasks (Mind2Web and AITW), advancing the state of the art. The model and code are available at https://github.com/THUDM/CogVLM.
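The abstract's key architectural idea, pairing a cheap low-resolution encoder (global page layout) with a high-resolution encoder (tiny GUI elements and text at 1120×1120), can be illustrated with a minimal sketch. All module names, dimensions, and the cross-attention wiring below are illustrative assumptions, not CogAgent's released implementation; see the repository above for the actual code.

```python
# Minimal sketch of a dual-resolution image encoding scheme, loosely
# modeled on the abstract's description. Names, dims, and the fusion
# wiring are illustrative assumptions, NOT the released CogAgent code
# (see https://github.com/THUDM/CogVLM).
import torch
import torch.nn as nn


class DualResolutionEncoder(nn.Module):
    def __init__(self, dim: int = 256, n_heads: int = 8):
        super().__init__()
        # Low-resolution branch: coarse global view of the screenshot.
        self.low_res_encoder = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=16, stride=16),  # 224x224 -> 14x14 patches
            nn.Flatten(2),                                 # (B, dim, 196)
        )
        # High-resolution branch: fine detail for small text and widgets.
        self.high_res_encoder = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=56, stride=56),  # 1120x1120 -> 20x20 patches
            nn.Flatten(2),                                 # (B, dim, 400)
        )
        # Fusion: low-res tokens query the high-res tokens via
        # cross-attention, injecting detail without inflating the
        # token sequence handed to the language decoder.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, img_low: torch.Tensor, img_high: torch.Tensor) -> torch.Tensor:
        low = self.low_res_encoder(img_low).transpose(1, 2)     # (B, 196, dim)
        high = self.high_res_encoder(img_high).transpose(1, 2)  # (B, 400, dim)
        fused, _ = self.cross_attn(query=low, key=high, value=high)
        return low + fused  # residual fusion of both resolutions


# Usage: one screenshot, resized to both input resolutions.
enc = DualResolutionEncoder()
tokens = enc(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 1120, 1120))
print(tokens.shape)  # torch.Size([1, 196, 256])
```

The design intuition this sketch captures: keeping the fused sequence at the low-resolution token count bounds the downstream decoder's attention cost, while the cross-attention path still exposes the high-resolution features needed to read tiny page elements.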