CogAgent: A Visual Language Model for GUI Agents
December 14, 2023
Authors: Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, Jie Tang
cs.AI
Abstract
People are spending an enormous amount of time on digital devices through
graphical user interfaces (GUIs), e.g., computer or smartphone screens. Large
language models (LLMs) such as ChatGPT can assist people in tasks like writing
emails, but struggle to understand and interact with GUIs, thus limiting their
potential to increase automation levels. In this paper, we introduce CogAgent,
an 18-billion-parameter visual language model (VLM) specializing in GUI
understanding and navigation. By utilizing both low-resolution and
high-resolution image encoders, CogAgent supports input at a resolution of
1120*1120, enabling it to recognize tiny page elements and text. As a
generalist visual language model, CogAgent achieves the state of the art on
five text-rich and four general VQA benchmarks, including VQAv2, OK-VQA,
Text-VQA, ST-VQA, ChartQA, infoVQA, DocVQA, MM-Vet, and POPE. CogAgent, using
only screenshots as input, outperforms LLM-based methods that consume extracted
HTML text on both PC and Android GUI navigation tasks -- Mind2Web and AITW,
advancing the state of the art. The model and code are available at
https://github.com/THUDM/CogVLM.
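
The dual-resolution design mentioned in the abstract (a low-resolution encoder for global screen layout plus a high-resolution branch handling the 1120*1120 screenshot for small page elements and text) can be illustrated with a rough sketch. The following is a minimal PyTorch illustration, not the released implementation: the module names, feature dimensions, patch sizes, and the concatenation-based fusion are all assumptions made for clarity, whereas the actual CogAgent model integrates high-resolution features through a dedicated cross-attention module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualResolutionEncoder(nn.Module):
    """Toy sketch of a two-branch image encoder: a low-resolution branch for
    global screen layout and a lightweight high-resolution branch over the
    1120x1120 screenshot for tiny page elements and text.
    All dimensions are illustrative, not the paper's actual configuration."""

    def __init__(self, low_dim=1024, high_dim=256, out_dim=1024):
        super().__init__()
        # Low-resolution branch: stands in for a large pretrained ViT (224x224 input).
        self.low_encoder = nn.Sequential(
            nn.Conv2d(3, low_dim, kernel_size=14, stride=14),  # 224 -> 16x16 patches
            nn.Flatten(2),
        )
        # High-resolution branch: a smaller encoder applied to the full screenshot.
        self.high_encoder = nn.Sequential(
            nn.Conv2d(3, high_dim, kernel_size=14, stride=14),  # 1120 -> 80x80 patches
            nn.Flatten(2),
        )
        self.low_proj = nn.Linear(low_dim, out_dim)
        self.high_proj = nn.Linear(high_dim, out_dim)

    def forward(self, screenshot):
        # screenshot: (B, 3, 1120, 1120) GUI capture
        low_in = F.interpolate(screenshot, size=(224, 224), mode="bilinear")
        low_feats = self.low_proj(self.low_encoder(low_in).transpose(1, 2))        # (B, 256, out_dim)
        high_feats = self.high_proj(self.high_encoder(screenshot).transpose(1, 2))  # (B, 6400, out_dim)
        # Placeholder fusion: concatenate the two token sequences; the real model
        # instead injects high-resolution features via cross-attention layers.
        return torch.cat([low_feats, high_feats], dim=1)


if __name__ == "__main__":
    model = DualResolutionEncoder()
    tokens = model(torch.randn(1, 3, 1120, 1120))
    print(tokens.shape)  # torch.Size([1, 6656, 1024])
```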