NVLM: Open Frontier-Class Multimodal LLMs
September 17, 2024
Authors: Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
cs.AI
Abstract
We introduce NVLM 1.0, a family of frontier-class multimodal large language
models (LLMs) that achieve state-of-the-art results on vision-language tasks,
rivaling the leading proprietary models (e.g., GPT-4o) and open-access models
(e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved
text-only performance over its LLM backbone after multimodal training. In terms
of model design, we perform a comprehensive comparison between decoder-only
multimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g.,
Flamingo). Based on the strengths and weaknesses of both approaches, we propose
a novel architecture that enhances both training efficiency and multimodal
reasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for
tile-based dynamic high-resolution images, which significantly boosts
performance on multimodal reasoning and OCR-related tasks. Regarding training
data, we meticulously curate and provide detailed information on our multimodal
pretraining and supervised fine-tuning datasets. Our findings indicate that
dataset quality and task diversity are more important than scale, even during
the pretraining phase, across all architectures. Notably, we develop
production-grade multimodality for the NVLM-1.0 models, enabling them to excel
in vision-language tasks while maintaining and even improving text-only
performance compared to their LLM backbones. To achieve this, we craft and
integrate a high-quality text-only dataset into multimodal training, alongside
a substantial amount of multimodal math and reasoning data, leading to enhanced
math and coding capabilities across modalities. To advance research in the
field, we are releasing the model weights and will open-source the code for the
community: https://nvlm-project.github.io/.
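To make the architectural comparison in the abstract concrete, the following minimal PyTorch sketch contrasts the two integration styles it names: a decoder-only model (e.g., LLaVA) projects image features into the LLM's embedding space and concatenates them with the text tokens, while a cross-attention model (e.g., Flamingo) keeps image features outside the token sequence and reads them through gated cross-attention. All module names, sizes, and the gating scheme here are illustrative assumptions, not NVLM's released code.

```python
# Minimal sketch of the two vision-language integration styles compared in the abstract.
# Sizes, layer counts, and module names are illustrative, not NVLM's implementation.
import torch
import torch.nn as nn

D = 256  # hidden size (illustrative)

class DecoderOnlyFusion(nn.Module):
    """LLaVA-style: project image features, concatenate with text tokens,
    and let ordinary self-attention see the unified sequence."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D, D)  # vision features -> LLM embedding space
        self.block = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)

    def forward(self, text_emb, image_feats):
        vis = self.proj(image_feats)             # (B, N_img, D)
        seq = torch.cat([vis, text_emb], dim=1)  # longer unified token sequence
        return self.block(seq)

class CrossAttentionFusion(nn.Module):
    """Flamingo-style: keep image features outside the sequence and read them
    through tanh-gated cross-attention, so sequence length stays short."""
    def __init__(self):
        super().__init__()
        self.xattn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # gated residual, starts closed
        self.block = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)

    def forward(self, text_emb, image_feats):
        attended, _ = self.xattn(text_emb, image_feats, image_feats)
        text_emb = text_emb + torch.tanh(self.gate) * attended
        return self.block(text_emb)

if __name__ == "__main__":
    text = torch.randn(2, 32, D)
    image = torch.randn(2, 256, D)
    print(DecoderOnlyFusion()(text, image).shape)    # torch.Size([2, 288, 256])
    print(CrossAttentionFusion()(text, image).shape) # torch.Size([2, 32, 256])
```

The trade-off the abstract alludes to is visible in the output shapes: the decoder-only path pays for joint reasoning with a much longer sequence, while the cross-attention path keeps the sequence short at the cost of extra fusion layers.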
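The 1-D tile-tagging design can be sketched in a similar spirit: a dynamic high-resolution image is split into a global thumbnail plus regular tiles, and a short text tag is placed before each tile's flattened image tokens so the decoder can tell the 1-D blocks apart. The tag strings, token count, and placeholder token below are illustrative assumptions rather than NVLM's exact format.

```python
# A minimal sketch of 1-D tile tagging for tile-based dynamic high-resolution input.
# Tag names, TOKENS_PER_TILE, and IMG_TOKEN are illustrative placeholders.
from typing import List

IMG_TOKEN = "<image_patch>"  # later replaced by projected vision-encoder features
TOKENS_PER_TILE = 64         # flattened image tokens contributed by each tile

def build_tile_tagged_sequence(num_tiles: int, question: str) -> List[str]:
    """Interleave a text tag with each tile's image-token placeholders, then append the text."""
    seq: List[str] = []
    # Global thumbnail first, then the high-resolution tiles in row-major order.
    tags = ["<tile_global_thumbnail>"] + [f"<tile_{i}>" for i in range(1, num_tiles + 1)]
    for tag in tags:
        seq.append(tag)
        seq.extend([IMG_TOKEN] * TOKENS_PER_TILE)
    seq.append(question)
    return seq

if __name__ == "__main__":
    seq = build_tile_tagged_sequence(num_tiles=4, question="What does the chart show?")
    print(seq[:3], "... total length:", len(seq))
```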