

CogVLM: Visual Expert for Pretrained Language Models

November 6, 2023
Authors: Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, Jie Tang
cs.AI

Abstract

We introduce CogVLM, a powerful open-source visual language foundation model. Unlike the popular shallow alignment method, which maps image features into the input space of the language model, CogVLM bridges the gap between the frozen pretrained language model and the image encoder with a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables deep fusion of vision-language features without sacrificing any performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC, and ranks 2nd on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B. Code and checkpoints are available at https://github.com/THUDM/CogVLM.
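The visual expert idea described above lends itself to a short sketch. The following is a minimal, illustrative PyTorch rendering of one attention layer with a visual expert, where image tokens are routed through trainable QKV projections while text tokens keep the frozen language-model projections; the module names, masking scheme, and layer layout are assumptions for illustration, not the repository's actual implementation.

```python
# Minimal sketch (assumed names, not the official CogVLM code): image tokens
# use trainable "visual expert" QKV weights, text tokens use frozen LM weights.
import torch
import torch.nn as nn


class VisualExpertAttention(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        # Frozen QKV projection inherited from the pretrained language model.
        self.qkv_text = nn.Linear(hidden_size, 3 * hidden_size)
        self.qkv_text.requires_grad_(False)
        # Trainable visual-expert QKV projection for image tokens.
        self.qkv_image = nn.Linear(hidden_size, 3 * hidden_size)

    def forward(self, hidden_states: torch.Tensor, image_mask: torch.Tensor):
        # image_mask: (batch, seq_len) boolean, True where the token is visual.
        qkv = torch.where(
            image_mask.unsqueeze(-1),
            self.qkv_image(hidden_states),
            self.qkv_text(hidden_states),
        )
        q, k, v = qkv.chunk(3, dim=-1)
        b, s, _ = q.shape
        # Standard multi-head attention over the mixed text/image sequence.
        q = q.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2).reshape(b, s, -1)
```

An analogous per-token routing would apply in the FFN layers; the key design point is that only the expert weights are trained, so the frozen language model's behavior on pure NLP inputs is unchanged.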