
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate

October 9, 2024
作者: Qidong Huang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Dahua Lin, Weiming Zhang, Nenghai Yu
cs.AI

Abstract

We present the Modality Integration Rate (MIR), an effective, robust, and generalizable metric for indicating the multi-modal pre-training quality of Large Vision-Language Models (LVLMs). Large-scale pre-training plays a critical role in building capable LVLMs, yet evaluating its training quality without the costly supervised fine-tuning stage remains under-explored. Loss, perplexity, and in-context evaluation results are commonly used pre-training metrics for Large Language Models (LLMs), but we observe that they are less indicative when aligning a well-trained LLM with a new modality. Due to the lack of proper metrics, research on LVLMs at the critical pre-training stage, including training data selection, efficient module design, and so on, is greatly hindered. In this paper, we propose evaluating pre-training quality from the perspective of inter-modal distribution distance and present MIR, the Modality Integration Rate, which 1) effectively represents pre-training quality and correlates positively with benchmark performance after supervised fine-tuning, 2) is robust to different training and evaluation data, and 3) generalizes across training configurations and architecture choices. We conduct a series of pre-training experiments to explore the effectiveness of MIR and observe satisfactory results: MIR is indicative of training data selection, training strategy scheduling, and model architecture design choices that lead to better pre-training results. We hope MIR can be a helpful metric for building capable LVLMs and inspire follow-up research on modality alignment in different areas. Our code is at: https://github.com/shikiw/Modality-Integration-Rate.
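
The abstract does not spell out the exact formula, but to make the "inter-modal distribution distance" idea concrete, below is a minimal Python sketch of one way such a distance could be measured: fit a Gaussian to the vision-token and text-token hidden states at each layer, compute a Fréchet-style distance between the two fits, and average the per-layer distances. The function names (`frechet_distance`, `modality_gap_score`), the text-scale normalization, and the layer averaging are illustrative assumptions for this sketch, not the official MIR definition; see the repository linked above for the authors' implementation.

```python
import numpy as np
from scipy.linalg import sqrtm


def frechet_distance(x, y):
    """Fréchet distance between Gaussian fits of two feature sets.

    x, y: arrays of shape (num_tokens, hidden_dim).
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    diff = mu_x - mu_y
    covmean = sqrtm(cov_x @ cov_y)
    if np.iscomplexobj(covmean):  # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    return float(diff @ diff + np.trace(cov_x + cov_y - 2.0 * covmean))


def modality_gap_score(vision_states, text_states):
    """Aggregate per-layer vision/text distribution distances into one score.

    vision_states, text_states: lists with one entry per transformer layer,
    each an array of shape (num_tokens, hidden_dim) of hidden states.
    Features are rescaled by the mean text-token norm before measuring the
    distance, then per-layer distances are averaged (both choices are
    illustrative, not necessarily those of the official MIR).
    """
    per_layer = []
    for v, t in zip(vision_states, text_states):
        scale = np.linalg.norm(t, axis=-1).mean() + 1e-6
        per_layer.append(frechet_distance(v / scale, t / scale))
    return float(np.mean(per_layer))
```

Under this sketch, a lower score would indicate that vision-token and text-token representations occupy closer distributions inside the language model, which is the kind of signal the paper uses to gauge pre-training quality before any supervised fine-tuning.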
