

CogVLM2: Visual Language Models for Image and Video Understanding

August 29, 2024
Authors: Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, Lei Zhao, Zhuoyi Yang, Xiaotao Gu, Xiaohan Zhang, Guanyu Feng, Da Yin, Zihan Wang, Ji Qi, Xixuan Song, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Yuxiao Dong, Jie Tang
cs.AI

Abstract

Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architectures, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding, including CogVLM2, CogVLM2-Video, and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both the pre-training and post-training stages, supporting input resolutions up to 1344×1344 pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, the CogVLM2 family has achieved state-of-the-art results on benchmarks such as MMBench, MM-Vet, TextVQA, MVBench, and VCGBench. All models are open-sourced at https://github.com/THUDM/CogVLM2 and https://github.com/THUDM/GLM-4, contributing to the advancement of the field.
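The abstract notes that CogVLM2-Video consumes multi-frame input paired with timestamps. As a rough illustration of what such an input might look like (not the authors' actual preprocessing pipeline), the sketch below uniformly samples frames from a clip with OpenCV and records each frame's timestamp in seconds; the function name, default frame count, and file path are assumptions made for this example.

```python
# Minimal sketch: uniformly sample frames from a video and pair each frame
# with its timestamp, roughly matching the "multi-frame input with timestamps"
# idea described in the abstract. Illustrative only, not CogVLM2-Video code.
import cv2  # pip install opencv-python


def sample_timestamped_frames(video_path: str, num_frames: int = 24):
    """Return a list of (timestamp_seconds, frame_bgr) pairs sampled uniformly."""
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise IOError(f"Cannot open video: {video_path}")

    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0           # fall back if FPS metadata is missing
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // num_frames, 1)                # uniformly spaced frame indices
    indices = list(range(0, total, step))[:num_frames]

    samples = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)         # seek to the target frame
        ok, frame = cap.read()
        if ok:
            samples.append((idx / fps, frame))        # timestamp in seconds, frame as BGR array
    cap.release()
    return samples


if __name__ == "__main__":
    # Hypothetical local file; replace with a real path before running.
    for ts, frame in sample_timestamped_frames("example.mp4", num_frames=8):
        print(f"t={ts:.2f}s  frame shape={frame.shape}")
```

In practice, each sampled frame would be passed through the model's vision encoder while its timestamp is exposed to the language side, which is what enables temporal grounding; the released repositories linked above contain the authors' actual inference code.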