CogVLM2: Visual Language Models for Image and Video Understanding

August 29, 2024
Authors: Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, Lei Zhao, Zhuoyi Yang, Xiaotao Gu, Xiaohan Zhang, Guanyu Feng, Da Yin, Zihan Wang, Ji Qi, Xixuan Song, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Yuxiao Dong, Jie Tang
cs.AI

Abstract

Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding, including CogVLM2, CogVLM2-Video, and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both the pre-training and post-training stages, supporting input resolutions up to 1344 × 1344 pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, the CogVLM2 family has achieved state-of-the-art results on benchmarks such as MMBench, MM-Vet, TextVQA, MVBench, and VCGBench. All models are open-sourced at https://github.com/THUDM/CogVLM2 and https://github.com/THUDM/GLM-4, contributing to the advancement of the field.
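
As a rough sketch of how one might try an open-sourced CogVLM2 image checkpoint with Hugging Face Transformers: the model ID `THUDM/cogvlm2-llama3-chat-19B` and the `build_conversation_input_ids` helper used below are taken from the project's published demo code rather than from this abstract, so the exact names and interface should be verified against https://github.com/THUDM/CogVLM2.

```python
# Minimal usage sketch (not the authors' reference code): load a CogVLM2 image
# checkpoint via Transformers' trust_remote_code path and answer one query
# about an image. Checkpoint ID and the build_conversation_input_ids helper
# mirror the repository's demo and are assumptions to verify there.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"  # assumed Hugging Face model ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda").eval()

image = Image.open("example.jpg").convert("RGB")
query = "Describe this image."

# The repo's custom modeling code packs text + image into model inputs; the
# call and tensor layout below follow its demo script and may differ slightly.
features = model.build_conversation_input_ids(
    tokenizer, query=query, history=[], images=[image]
)
inputs = {
    "input_ids": features["input_ids"].unsqueeze(0).to("cuda"),
    "token_type_ids": features["token_type_ids"].unsqueeze(0).to("cuda"),
    "attention_mask": features["attention_mask"].unsqueeze(0).to("cuda"),
    "images": [[features["images"][0].to("cuda", dtype=torch.bfloat16)]],
}

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
    # Strip the prompt tokens before decoding the model's answer.
    answer = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
print(answer)
```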