
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate

October 9, 2024
作者: Qidong Huang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Dahua Lin, Weiming Zhang, Nenghai Yu
cs.AI

Abstract

We present the Modality Integration Rate (MIR), an effective, robust, and generalized metric for indicating the multi-modal pre-training quality of Large Vision Language Models (LVLMs). Large-scale pre-training plays a critical role in building capable LVLMs, yet evaluating its training quality without the costly supervised fine-tuning stage remains under-explored. Loss, perplexity, and in-context evaluation results are commonly used pre-training metrics for Large Language Models (LLMs), but we observe that these metrics are less indicative when aligning a well-trained LLM with a new modality. Due to the lack of proper metrics, research on LVLMs at the critical pre-training stage is greatly hindered, including training data choice, efficient module design, etc. In this paper, we propose evaluating the pre-training quality from the inter-modal distribution distance perspective and present MIR, the Modality Integration Rate, which 1) effectively represents the pre-training quality and shows a positive correlation with benchmark performance after supervised fine-tuning, 2) is robust to different training/evaluation data, and 3) generalizes across training configurations and architecture choices. We conduct a series of pre-training experiments to explore the effectiveness of MIR and observe satisfactory results indicating that MIR is informative for training data selection, training strategy scheduling, and model architecture design toward better pre-training results. We hope MIR can serve as a helpful metric for building capable LVLMs and inspire follow-up research on modality alignment in different areas. Our code is at: https://github.com/shikiw/Modality-Integration-Rate.
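To make the "inter-modal distribution distance" idea concrete, below is a minimal, illustrative sketch (not the authors' official implementation; see the linked repository for the exact MIR formulation). It assumes the score is built from a per-layer distribution distance, here a Fréchet distance between Gaussian fits of vision-token and text-token hidden states taken from the same LVLM forward pass; the function names `frechet_distance` and `modality_distance_score` are hypothetical.

```python
# Hedged sketch of a layer-wise cross-modal distribution distance.
# NOT the official MIR code; an assumption-based illustration only.
import torch


def frechet_distance(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Fréchet distance between Gaussian fits of two token-feature sets.

    x: [num_vision_tokens, hidden_dim], y: [num_text_tokens, hidden_dim]
    """
    mu_x, mu_y = x.mean(dim=0), y.mean(dim=0)
    cov_x = torch.cov(x.T)
    cov_y = torch.cov(y.T)
    mean_term = (mu_x - mu_y).pow(2).sum()
    # Tr((cov_x @ cov_y)^{1/2}) estimated from the eigenvalues of the product;
    # clamping negatives handles numerical noise (rough, illustrative estimate).
    eigvals = torch.linalg.eigvals(cov_x @ cov_y).real.clamp(min=0)
    trace_term = cov_x.trace() + cov_y.trace() - 2 * eigvals.sqrt().sum()
    return mean_term + trace_term


def modality_distance_score(vision_hidden, text_hidden):
    """Sum per-layer distances; a lower score suggests tighter cross-modal alignment.

    vision_hidden / text_hidden: lists of [num_tokens, hidden_dim] tensors,
    one per transformer layer, extracted from the same forward pass.
    """
    return sum(frechet_distance(v, t) for v, t in zip(vision_hidden, text_hidden))
```

Such a score can be computed directly on pre-training checkpoints, which is the appeal of MIR: it gives a signal about alignment quality without running the supervised fine-tuning stage.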
