CogVLM: Visual Expert for Pretrained Language Models
November 6, 2023
Authors: Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, Jie Tang
cs.AI
Abstract
We introduce CogVLM, a powerful open-source visual language foundation model.
Unlike the popular shallow-alignment approach, which maps image features
into the input space of the language model, CogVLM bridges the gap between
the frozen pretrained language model and the image encoder with a trainable
visual expert module in the attention and FFN layers. As a result, CogVLM
enables deep fusion of vision and language features without sacrificing any
performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance
on 10 classic cross-modal benchmarks, including NoCaps, Flickr30k
captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz
VQA, and TDIUC, and ranks second on VQAv2, OKVQA, TextVQA, COCO captioning,
etc., surpassing or matching PaLI-X 55B. Code and checkpoints are available
at https://github.com/THUDM/CogVLM.
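
To make the visual-expert idea concrete, below is a minimal PyTorch sketch, not the authors' released implementation: image tokens are routed through their own trainable QKV and output projections inside an otherwise frozen attention layer, while text tokens keep using the frozen language-model weights. The class name, shapes, and the boolean `image_mask` routing are illustrative assumptions; see https://github.com/THUDM/CogVLM for the actual code.

```python
import torch
import torch.nn as nn


class VisualExpertAttention(nn.Module):
    """Attention layer with a trainable visual expert alongside frozen LM weights (illustrative)."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        # Projections standing in for the pretrained language model (kept frozen).
        self.qkv_text = nn.Linear(hidden_size, 3 * hidden_size)
        self.out_text = nn.Linear(hidden_size, hidden_size)
        for p in list(self.qkv_text.parameters()) + list(self.out_text.parameters()):
            p.requires_grad = False
        # Trainable visual-expert projections, applied only to image tokens.
        self.qkv_image = nn.Linear(hidden_size, 3 * hidden_size)
        self.out_image = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq, hidden); image_mask: (batch, seq) bool, True for image tokens.
        b, s, h = hidden_states.shape
        mask = image_mask.unsqueeze(-1)
        # Per-token routing: image tokens use the visual expert, text tokens the frozen LM weights.
        qkv = torch.where(mask, self.qkv_image(hidden_states), self.qkv_text(hidden_states))
        q, k, v = qkv.chunk(3, dim=-1)

        def heads(x: torch.Tensor) -> torch.Tensor:
            return x.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)

        # Ordinary multi-head attention over the merged image+text sequence.
        scores = heads(q) @ heads(k).transpose(-2, -1) / self.head_dim ** 0.5
        ctx = (torch.softmax(scores, dim=-1) @ heads(v)).transpose(1, 2).reshape(b, s, h)
        return torch.where(mask, self.out_image(ctx), self.out_text(ctx))


if __name__ == "__main__":
    layer = VisualExpertAttention(hidden_size=64, num_heads=4)
    x = torch.randn(2, 10, 64)
    image_mask = torch.zeros(2, 10, dtype=torch.bool)
    image_mask[:, :4] = True           # pretend the first 4 tokens are image tokens
    print(layer(x, image_mask).shape)  # torch.Size([2, 10, 64])
```

Because only the image-specific projections receive gradients, this kind of layer adds vision capacity without altering the frozen language-model weights, which is how the abstract's claim of no NLP performance loss can hold.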