

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

September 18, 2024
作者: Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin
cs.AI

Abstract

We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude 3.5 Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL.
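To make the dynamic-resolution idea concrete, the sketch below counts how many visual tokens an image of a given size would produce under a ViT-style patchification followed by a small token merge. The specific values (a 14-pixel patch and a 2x2 token merge) are assumptions chosen for illustration, not a statement of the model's actual configuration; the point is simply that token count scales with input resolution instead of being fixed.

```python
# Hypothetical sketch: visual token count as a function of image resolution.
# patch_size=14 and a 2x2 merge are illustrative assumptions, not the
# confirmed Qwen2-VL configuration.

def num_visual_tokens(height: int, width: int,
                      patch_size: int = 14, merge: int = 2) -> int:
    """Return the number of visual tokens for an image of the given size,
    assuming (as a simplification) the image is resized so both sides are
    divisible by patch_size * merge."""
    unit = patch_size * merge
    # Snap each side to the nearest multiple of `unit`, with a floor of one unit.
    h = max(unit, round(height / unit) * unit)
    w = max(unit, round(width / unit) * unit)
    # Patch grid before merging, then collapse merge x merge patches per token.
    grid_h, grid_w = h // patch_size, w // patch_size
    return (grid_h // merge) * (grid_w // merge)

print(num_visual_tokens(224, 224))    # small image -> few tokens (64 here)
print(num_visual_tokens(448, 1344))   # larger image -> more tokens (768 here)
```

Under this scheme a 224x224 image yields an 8x8 token grid while a 448x1344 image yields 16x48, so wide or high-resolution inputs simply consume more of the sequence rather than being squashed to a preset size.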
