CogVLM: Visual Expert for Pretrained Language Models

Abstract

We introduce CogVLM, a powerful open-source visual language foundation model. Unlike the popular shallow alignment method, which maps image features into the input space of the language model, CogVLM bridges the gap between the frozen pretrained language model and the image encoder with a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables deep fusion of vision-language features without sacrificing any performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC, and ranks second on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B. Code and checkpoints are available at https://github.com/THUDM/CogVLM.
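The abstract describes a trainable visual expert that sits alongside frozen language-model weights in the attention and FFN layers. The sketch below illustrates one way such a module could work, under the assumption that each token is routed by modality: image tokens pass through the trainable expert projection and text tokens pass through the frozen projection. The names `Linear`, `RoutedProjection`, and `is_image`, and the toy 2-dimensional weights, are hypothetical and chosen for illustration; this is not the CogVLM implementation.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (rows x cols) by vector x (length cols)."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


@dataclass
class Linear:
    """A bias-free linear projection with a flag marking it frozen or trainable."""
    weight: Matrix
    trainable: bool

    def __call__(self, x: Vector) -> Vector:
        return matvec(self.weight, x)


@dataclass
class RoutedProjection:
    """Routes each token to one of two same-shaped projections by modality.

    Assumption (not stated in the abstract): routing is per token, based on
    whether the token came from the image encoder or from text.
    """
    text_proj: Linear   # stands in for the frozen language-model weights
    image_proj: Linear  # stands in for the trainable visual expert

    def __call__(self, tokens: List[Vector], is_image: List[bool]) -> List[Vector]:
        if len(tokens) != len(is_image):
            raise ValueError("tokens and is_image must have the same length")
        return [
            (self.image_proj if img else self.text_proj)(tok)
            for tok, img in zip(tokens, is_image)
        ]


# Toy weights, chosen only to make the routing visible:
# identity for text tokens, scaling by 2 for image tokens.
layer = RoutedProjection(
    text_proj=Linear(weight=[[1.0, 0.0], [0.0, 1.0]], trainable=False),
    image_proj=Linear(weight=[[2.0, 0.0], [0.0, 2.0]], trainable=True),
)

tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
is_image = [True, True, False]  # two image tokens followed by one text token
outputs = layer(tokens, is_image)
```

In this sketch a text-only input never touches the expert weights, so its output matches that of the frozen projection alone. That is one way to read the abstract's claim that NLP performance is preserved.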
