Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate

Abstract

We present the Modality Integration Rate (MIR), an effective, robust, and generalizable metric that indicates the multi-modal pre-training quality of Large Vision-Language Models (LVLMs). Large-scale pre-training plays a critical role in building capable LVLMs, yet evaluating its quality without the costly supervised fine-tuning stage remains under-explored. Loss, perplexity, and in-context evaluation results are commonly used pre-training metrics for Large Language Models (LLMs), but we observe that these metrics are less indicative when aligning a well-trained LLM with a new modality. The lack of proper metrics greatly hinders research on the critical pre-training stage of LVLMs, including the choice of training data and the design of efficient modules. In this paper, we propose evaluating pre-training quality from the perspective of inter-modal distribution distance and present MIR, which is 1) effective: it represents the pre-training quality and shows a positive relation with benchmark performance after supervised fine-tuning; 2) robust to different training and evaluation data; and 3) generalizable across training configurations and architecture choices. We conduct a series of pre-training experiments to explore the effectiveness of MIR and observe that it gives useful guidance on training data selection, training strategy scheduling, and model architecture design for obtaining better pre-training results. We hope MIR can serve as a helpful metric for building capable LVLMs and inspire subsequent research on modality alignment in different areas. Our code is available at https://github.com/shikiw/Modality-Integration-Rate.
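The abstract does not spell out how the inter-modal distribution distance is computed. As a minimal sketch of the general idea, and not the authors' MIR definition, one could fit a Gaussian to the vision-token features and another to the text-token features taken from the same layer, then compare the two fits with an FID-style Fréchet distance. The function name `frechet_distance`, the synthetic features, and the choice of distance are all assumptions made for illustration.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """FID-style (squared) Fréchet distance between Gaussian fits of two feature sets.

    Hypothetical helper: each input has shape (num_tokens, hidden_dim), hidden_dim > 1.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs (clipping guards against round-off).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Synthetic stand-ins for per-layer token features (assumed, not real model outputs).
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(2000, 8))
vision_close = rng.normal(size=(2000, 8))          # same distribution as text
vision_far = rng.normal(loc=3.0, size=(2000, 8))   # shifted away from text

d_close = frechet_distance(vision_close, text_feats)
d_far = frechet_distance(vision_far, text_feats)
print(f"close: {d_close:.3f}, far: {d_far:.3f}")
```

Under this sketch, a smaller distance means the vision features sit closer to the text features in the language model's representation space, which is the intuition behind judging pre-training quality by inter-modal distance.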
