Should VLMs be Pre-trained with Image Data?
March 10, 2025
Authors: Sedrick Keh, Jean Mercat, Samir Yitzhak Gadre, Kushal Arora, Igor Vasiljevic, Benjamin Burchfiel, Shuran Song, Russ Tedrake, Thomas Kollar, Ludwig Schmidt, Achal Dave
cs.AI
Abstract
Pre-trained LLMs that are further trained with image data perform well on
vision-language tasks. While adding images during a second training phase
effectively unlocks this capability, it is unclear how much of a gain or loss
this two-step pipeline gives over VLMs which integrate images earlier into the
training process. To investigate this, we train models spanning various
datasets, scales, image-text ratios, and amounts of pre-training done before
introducing vision tokens. We then fine-tune these models and evaluate their
downstream performance on a suite of vision-language and text-only tasks. We
find that pre-training with a mixture of image and text data allows models to
perform better on vision-language tasks while maintaining strong performance on
text-only evaluations. Averaged over 6 diverse tasks, we find that for a 1B
model, introducing visual tokens 80% of the way through pre-training results in
a 2% average improvement over introducing visual tokens to a fully pre-trained
model.
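To make the training schedule concrete, below is a minimal sketch of how a data sampler might switch from text-only pre-training to an image-text mixture partway through training. The function name `sample_batch`, the `switch_frac` and `image_ratio` parameters, and the uniform sampling are illustrative assumptions, not the authors' implementation.

```python
import random

def sample_batch(step, total_steps, text_batches, image_text_batches,
                 switch_frac=0.8, image_ratio=0.5):
    """Hypothetical sampler: text-only until `switch_frac` of pre-training
    is complete, then a fixed image-text / text mixture for the remainder."""
    if step < switch_frac * total_steps:
        # Early pre-training (e.g. the first 80% of steps): text-only batches.
        return random.choice(text_batches)
    # Late pre-training: introduce vision tokens by mixing in image-text data.
    if random.random() < image_ratio:
        return random.choice(image_text_batches)
    return random.choice(text_batches)

# Example: a 10,000-step toy run that switches to the mixture at step 8,000.
# batch = sample_batch(step=8_500, total_steps=10_000,
#                      text_batches=text_data, image_text_batches=vl_data)
```

The `image_ratio` mixing weight stands in for the image-text ratios the paper sweeps over; in practice the switch point and mixture weights are the quantities being varied in the experiments.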