

VILA: On Pre-training for Visual Language Models

December 12, 2023
Authors: Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, Song Han
cs.AI

Abstract

Visual language models (VLMs) have progressed rapidly with the recent success of large language models. There have been growing efforts on visual instruction tuning to extend the LLM with visual inputs, but an in-depth study of the visual language pre-training process, where the model learns to perform joint modeling on both modalities, is still lacking. In this work, we examine the design options for VLM pre-training by augmenting an LLM toward a VLM through step-by-step controllable comparisons. We introduce three main findings: (1) freezing the LLM during pre-training can achieve decent zero-shot performance but lacks in-context learning capability, which requires unfreezing the LLM; (2) interleaved pre-training data is beneficial, whereas image-text pairs alone are not optimal; (3) re-blending text-only instruction data into image-text data during instruction fine-tuning not only remedies the degradation on text-only tasks but also boosts VLM task accuracy. With an enhanced pre-training recipe, we build VILA, a Visual Language model family that consistently outperforms state-of-the-art models such as LLaVA-1.5 across the main benchmarks, without bells and whistles. Multi-modal pre-training also helps unveil appealing properties of VILA, including multi-image reasoning, enhanced in-context learning, and better world knowledge.
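
The three findings describe a training recipe rather than an API, so the following is only a minimal, hypothetical PyTorch sketch of what they imply in code: keep the LLM unfrozen during pre-training (finding 1), feed interleaved vision and text tokens through the same backbone (finding 2), and mix text-only instruction batches back in during fine-tuning (finding 3). Every name here (`ToyVLM`, `next_batch`, `TEXT_ONLY_RATIO`) is an illustrative stand-in, not from the paper or the VILA codebase.

```python
# Minimal, hypothetical sketch (not the authors' code) of the recipe choices
# described in the abstract, written against plain PyTorch.
import random
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Stand-in VLM: a visual encoder + projector feeding a small 'LLM' backbone."""
    def __init__(self, d_model=64, vocab=1000):
        super().__init__()
        self.visual_encoder = nn.Linear(32, d_model)   # stand-in for a ViT
        self.projector = nn.Linear(d_model, d_model)   # maps vision features into the LLM token space
        self.llm = nn.TransformerEncoder(              # stand-in for the LLM backbone
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, image_feats, text_embeds):
        vis = self.projector(self.visual_encoder(image_feats))
        tokens = torch.cat([vis, text_embeds], dim=1)  # vision and text tokens share one sequence (finding 2)
        return self.lm_head(self.llm(tokens))

model = ToyVLM()

# Finding (1): keep the LLM unfrozen during pre-training; per the abstract,
# freezing it preserves zero-shot accuracy but hurts in-context learning.
for p in model.llm.parameters():
    p.requires_grad = True

# Finding (3): during instruction fine-tuning, re-blend text-only instruction
# batches with image-text batches. The ratio below is an assumption for
# illustration, not a number from the paper.
TEXT_ONLY_RATIO = 0.25

def next_batch(image_text_batches, text_only_batches):
    """Sample a text-only batch some fraction of the time, else an image-text batch."""
    if random.random() < TEXT_ONLY_RATIO:
        return next(text_only_batches)
    return next(image_text_batches)
```

A real implementation would of course swap the toy modules for a pretrained LLM and vision encoder and use interleaved image-text corpora for pre-training; the sketch only mirrors the structure of the recipe stated in the abstract.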