CogVLM2: Visual Language Models for Image and Video Understanding
August 29, 2024
Authors: Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, Lei Zhao, Zhuoyi Yang, Xiaotao Gu, Xiaohan Zhang, Guanyu Feng, Da Yin, Zihan Wang, Ji Qi, Xixuan Song, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Yuxiao Dong, Jie Tang
cs.AI
Abstract
Beginning with VisualGLM and CogVLM, we have been continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architectures, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding, including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both the pre-training and post-training stages, supporting input resolutions up to 1344 × 1344 pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, the CogVLM2 family has achieved state-of-the-art results on benchmarks such as MMBench, MM-Vet, TextVQA, MVBench and VCGBench. All models are open-sourced at https://github.com/THUDM/CogVLM2 and https://github.com/THUDM/GLM-4, contributing to the advancement of the field.
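To illustrate the "multi-frame input with timestamps" idea mentioned for CogVLM2-Video, the sketch below shows one plausible way to uniformly sample frames from a video and pair each frame with its timestamp. This is a minimal illustration, not the authors' implementation: the frame count, helper name, and printed format are assumptions, and the exact prompt template used by CogVLM2-Video is not specified here.

```python
# Minimal sketch (assumption, not the authors' code): uniformly sample a fixed
# number of frames from a video and pair each frame with its timestamp in
# seconds, as one way to build timestamp-annotated multi-frame input.
import cv2


def sample_frames_with_timestamps(video_path: str, num_frames: int = 8):
    """Return a list of (frame, timestamp_seconds) pairs sampled uniformly."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole clip.
    indices = [int(i * (total - 1) / max(num_frames - 1, 1)) for i in range(num_frames)]
    samples = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            samples.append((frame, idx / fps))
    cap.release()
    return samples


if __name__ == "__main__":
    # Hypothetical usage: each sampled frame would be interleaved with a
    # timestamp marker when constructing the model's multi-frame prompt.
    for frame, t in sample_frames_with_timestamps("demo.mp4", num_frames=8):
        print(f"frame at {t:.2f}s, shape={frame.shape}")
```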