CogVLM: Visual Expert for Pretrained Language Models

Abstract

We introduce CogVLM, a powerful open-source visual language foundation model. Unlike the popular shallow alignment method, which maps image features into the input space of the language model, CogVLM bridges the gap between the frozen pretrained language model and the image encoder with a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables deep fusion of vision and language features without sacrificing any performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC, and ranks second on VQAv2, OKVQA, TextVQA, COCO captioning, and others, surpassing or matching PaLI-X 55B. Code and checkpoints are available at https://github.com/THUDM/CogVLM.
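To make the idea of a visual expert in the attention and FFN layers more concrete, the following is a minimal sketch and not the released implementation. It assumes that "visual expert" means image tokens are projected with their own trainable weights while text tokens keep the frozen language-model weights, and that all tokens then attend to each other in one shared sequence. The function names, the single attention head, the absence of masking, and the ReLU activation are illustrative assumptions made for brevity.

```python
import math


def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def route(tokens, is_image, w_text, w_image):
    """Project each token with the weights that match its modality.

    Assumption: w_text stands for the frozen language-model weights and
    w_image for the trainable visual-expert weights.
    """
    return [
        matvec(w_image if img else w_text, tok)
        for tok, img in zip(tokens, is_image)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expert_attention(tokens, is_image, text_w, image_w):
    """Single-head attention with modality-specific Q/K/V projections.

    text_w and image_w are dicts with keys "q", "k", "v" (hypothetical
    layout). No attention mask is applied in this sketch.
    """
    q = route(tokens, is_image, text_w["q"], image_w["q"])
    k = route(tokens, is_image, text_w["k"], image_w["k"])
    v = route(tokens, is_image, text_w["v"], image_w["v"])
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q:
        # Image and text tokens share one sequence, so every query
        # scores against keys from both modalities.
        scores = [sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]
        weights = softmax(scores)
        out.append([
            sum(w * vj[d] for w, vj in zip(weights, v))
            for d in range(len(v[0]))
        ])
    return out


def expert_ffn(tokens, is_image, text_w, image_w):
    """Feed-forward layer with the same routing; keys "up" and "down".

    ReLU is a stand-in activation chosen only to keep the sketch short.
    """
    hidden = route(tokens, is_image, text_w["up"], image_w["up"])
    hidden = [[max(0.0, h) for h in row] for row in hidden]
    return route(hidden, is_image, text_w["down"], image_w["down"])
```

Under these assumptions, training would update only the image-side weights. Text tokens always pass through the original frozen weights, which is one way to read the claim that NLP performance is not sacrificed.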
