CogAgent: A Visual Language Model for GUI Agents

Abstract

People spend an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such as ChatGPT can assist with tasks like writing emails, but they struggle to understand and interact with GUIs, which limits their potential to increase automation levels. In this paper, we introduce CogAgent, an 18-billion-parameter visual language model (VLM) specializing in GUI understanding and navigation. By utilizing both low-resolution and high-resolution image encoders, CogAgent supports input at a resolution of 1120×1120, enabling it to recognize tiny page elements and text. As a generalist visual language model, CogAgent achieves state-of-the-art results on five text-rich and four general VQA benchmarks: VQAv2, OK-VQA, Text-VQA, ST-VQA, ChartQA, infoVQA, DocVQA, MM-Vet, and POPE. Using only screenshots as input, CogAgent outperforms LLM-based methods that consume extracted HTML text on both PC and Android GUI navigation tasks (Mind2Web and AITW, respectively), advancing the state of the art. The model and code are available at https://github.com/THUDM/CogVLM.
