

VILA: On Pre-training for Visual Language Models

December 12, 2023
Authors: Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, Song Han
cs.AI

Abstract

Visual language models (VLMs) have progressed rapidly with the recent success of large language models. There have been growing efforts on visual instruction tuning to extend the LLM with visual inputs, but an in-depth study of the visual language pre-training process, where the model learns to perform joint modeling on both modalities, is still lacking. In this work, we examine the design options for VLM pre-training by augmenting an LLM towards a VLM through step-by-step controllable comparisons. We introduce three main findings: (1) freezing the LLM during pre-training can achieve decent zero-shot performance but lacks in-context learning capability, which requires unfreezing the LLM; (2) interleaved pre-training data is beneficial, whereas image-text pairs alone are not optimal; (3) re-blending text-only instruction data with image-text data during instruction fine-tuning not only remedies the degradation on text-only tasks but also boosts VLM task accuracy. With this enhanced pre-training recipe, we build VILA, a visual language model family that consistently outperforms state-of-the-art models such as LLaVA-1.5 across major benchmarks without bells and whistles. Multi-modal pre-training also helps unveil appealing properties of VILA, including multi-image reasoning, enhanced in-context learning, and better world knowledge.
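
The three findings above describe a training recipe rather than a new architecture. Below is a minimal, hypothetical PyTorch sketch of two of those choices: unfreezing the LLM during pre-training and re-blending text-only instruction data into the image-text fine-tuning mixture. Class names, dimensions, and the blending ratio are illustrative placeholders, not VILA's actual implementation.

```python
import random
import torch.nn as nn


class ToyVLM(nn.Module):
    """A toy VLM: a (possibly frozen) LLM plus a small visual projector."""

    def __init__(self, llm: nn.Module, vision_dim: int = 64, llm_dim: int = 256):
        super().__init__()
        self.llm = llm
        # Projector that maps visual features into the LLM embedding space.
        self.projector = nn.Linear(vision_dim, llm_dim)

    def set_llm_trainable(self, trainable: bool) -> None:
        # Finding (1): a frozen LLM yields decent zero-shot accuracy but weak
        # in-context learning, so the pre-training stage unfreezes it.
        for p in self.llm.parameters():
            p.requires_grad = trainable


def blend_sft_data(image_text_samples, text_only_samples, text_ratio=0.2):
    """Finding (3): mix text-only instruction data back into image-text SFT data.

    The 20% ratio here is an illustrative placeholder, not the paper's value.
    """
    n_text = min(len(text_only_samples), int(len(image_text_samples) * text_ratio))
    mixed = list(image_text_samples) + random.sample(list(text_only_samples), n_text)
    random.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    # Stand-in LLM; a real setup would load a pretrained decoder-only model.
    llm = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256))
    model = ToyVLM(llm)
    model.set_llm_trainable(True)  # unfreeze the LLM for joint pre-training
    sft_data = blend_sft_data([{"image": i} for i in range(100)],
                              [{"text": i} for i in range(1000)])
    print(f"{len(sft_data)} blended SFT samples")
```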